Source code for halotools.utils.group_member_generator

""" Module containing the `group_member_generator`,
the primary engine of the group aggregation calculations.
"""

import numpy as np
from .array_utils import array_is_monotonic

__all__ = ('group_member_generator', )



def group_member_generator(data, grouping_key, requested_columns):
    """
    Generator used to loop over grouped data and yield
    requested properties of members of a group.
    When running a for loop over `group_member_generator`,
    you will be repeatedly sent arrays storing
    properties of data entries sharing a common ``grouping_key``.
    This enables you to perform whatever intra-group calculation
    you wish on each group in turn.
    The generator also sends you the indices of the input ``data``
    corresponding to the yielded group members, allowing you to
    create new columns for your data table storing the results
    of your intra-group calculations.

    Before calling `group_member_generator`, the input ``data``
    must be sorted by the ``grouping_key`` so that
    ``data[grouping_key]`` is monotonic.

    Common applications of `group_member_generator` include
    subhalo analysis (e.g., calculating host halo mass) and
    galaxy group analysis (e.g., calculating total stellar mass
    or group-centric position). The Examples section below shows basic usage.
    There are also three tutorials demonstrating common applications in more detail:

        1. :ref:`galaxy_catalog_analysis_tutorial1`

        2. :ref:`halo_catalog_analysis_tutorial1`

        3. :ref:`example_merger_tree_analysis`

    Parameters
    ------------
    data : Structured Numpy `~numpy.ndarray` or Astropy `~astropy.table.Table`
        Table of data to be grouped, sorted by ``grouping_key``.

    grouping_key : string
        Name of the column that defines how the input ``data`` are grouped,
        e.g., ``group_id`` or ``halo_hostid``.
        The input ``data`` must be sorted such that
        the array stored in ``data[grouping_key]`` is monotonic.

    requested_columns : list of strings
        List of column names that will be yielded by the generator.
        As you loop over the generator, for every string entry in
        ``requested_columns`` there will be an array that is yielded.
        It is permissible for ``requested_columns`` to be an empty list,
        in which case the ``group_data_list`` yielded at each iteration
        will also be an empty list.

    Yields
    --------
    first_idx, last_idx : int
        Indices bounding the rows of the input ``data`` yielded at each
        iteration, such that the group members are ``data[first_idx:last_idx]``
        (``last_idx`` is exclusive).

    group_data_list : list
        List of arrays storing the requested group member properties.
        There will be one element of ``group_data_list`` for every
        element of the input ``requested_columns``. Each element is a
        Numpy `~numpy.ndarray` with a length equal to the number of
        members of the group.


    Examples
    ----------
    First let's retrieve a Halotools-formatted halo catalog storing
    some randomly generated data.

    >>> from halotools.sim_manager import FakeSim
    >>> halocat = FakeSim()
    >>> halos = halocat.halo_table

    As described in :ref:`rockstar_subhalo_nomenclature`,
    the ``halo_hostid`` is a natural grouping key for a halo table.
    Let's use this key to calculate the host halo mass of all halos in
    the data table.

    First we build the generator:

    >>> halos.sort(['halo_hostid', 'halo_upid'])
    >>> grouping_key = 'halo_hostid'
    >>> requested_columns = ['halo_mvir']
    >>> group_gen = group_member_generator(halos, grouping_key, requested_columns)

    Then we loop over it:

    >>> result = np.zeros(len(halos))
    >>> for first, last, member_props in group_gen:
    ...     masses = member_props[0]
    ...     host_mass = masses[0]
    ...     result[first:last] = host_mass
    >>> halos['halo_mvir_host_halo'] = result

    Inside the scope of the loop, the first two yielded integers
    allow us to access the appropriate slice of the array being calculated.
    The ``member_props`` list only stores a single element, the
    *masses* array storing the value of ``halo_mvir``
    of each member of the host + subhalo system.
    Because we have sorted the halos by *both* ``halo_hostid`` and
    ``halo_upid``, within each ``halo_hostid`` grouping
    the host halo appears first: its ``halo_upid`` is -1, which is smaller
    than the ``halo_upid`` of any subhalo. Thus by selecting the
    first element of the *masses* array, we select the virial mass
    of the host halo.

    We can also use the `group_member_generator` to compute more complicated quantities.
    For example, let's calculate the mean mass-weighted spin of all halo members.
    Note that our halo table is already sorted, so we save CPU time by not re-sorting it.

    >>> grouping_key = 'halo_hostid'
    >>> requested_columns = ['halo_mvir', 'halo_spin']
    >>> group_gen = group_member_generator(halos, grouping_key, requested_columns)

    >>> result = np.zeros(len(halos))
    >>> for first, last, member_props in group_gen:
    ...     masses = member_props[0]
    ...     spins = member_props[1]
    ...     mass_weighted_avg_spin = np.sum(masses*spins)/np.sum(masses)
    ...     result[first:last] = mass_weighted_avg_spin
    >>> halos['halo_mass_weighted_avg_spin'] = result

    """

    try:
        available_columns = data.dtype.names
    except AttributeError:
        msg = ("The input ``data`` must be an Astropy Table or Numpy Structured Array")
        raise TypeError(msg)

    # Explicit check rather than ``assert``, which is stripped under ``python -O``
    if grouping_key not in available_columns:
        msg = ("Input ``grouping_key`` must be a column name of the input ``data``")
        raise KeyError(msg)

    try:
        missing_columns = [colname for colname in requested_columns
            if colname not in available_columns]
    except TypeError:
        msg = ("\nThe input ``requested_columns`` must be an iterable sequence\n")
        raise TypeError(msg)
    if missing_columns:
        if isinstance(requested_columns, str):
            msg = ("\n Your input ``requested_columns`` should be a \n"
                "list of strings, not a single string\n")
        else:
            msg = ("\nEach element of the input ``requested_columns`` must be \n"
                "an existing column name of the input ``data``.\n")
        raise KeyError(msg)

    group_id_array = np.copy(data[grouping_key])
    if array_is_monotonic(group_id_array, strict=False) == 0:
        msg = ("Your input ``data`` must be sorted so that ``data[grouping_key]`` is monotonic")
        raise ValueError(msg)

    result = np.unique(group_id_array, return_index=True, return_counts=True)
    group_ids_data, idx_groups_data, group_richness_data = result

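    # For a structured array, ``data[key]`` is already an ndarray (its ``.data`` is a raw buffer).
    requested_array_list = [data[key] if isinstance(data, np.ndarray) else data[key].data
        for key in requested_columns]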
    for igroup, host_halo_id in enumerate(group_ids_data):
        first_igroup_idx = idx_groups_data[igroup]
        last_igroup_idx = first_igroup_idx + group_richness_data[igroup]
        group_data_list = [arg[first_igroup_idx:last_igroup_idx] for arg in requested_array_list]
        yield first_igroup_idx, last_igroup_idx, group_data_list
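
The contract described in the docstring (sorted input, one ``(first, last, members)`` triple per group) can be illustrated with a minimal, dependency-free sketch. This is not the Halotools implementation, which locates groups with ``np.unique``. The helper ``iter_groups`` and the toy ``host_ids`` and ``masses`` lists are hypothetical names chosen for illustration, and plain Python lists stand in for table columns.

```python
from itertools import groupby


def iter_groups(keys, columns):
    """Yield (first, last, members) for each run of equal values in ``keys``.

    Hypothetical stand-in for the generator above: ``keys`` plays the role of
    ``data[grouping_key]`` and ``columns`` the role of the requested columns.
    """
    first = 0
    for _, run in groupby(keys):
        # Count the members of this run to find the exclusive end index.
        last = first + sum(1 for _ in run)
        yield first, last, [col[first:last] for col in columns]
        first = last


# Toy data, already sorted by host id; the host comes first in each group.
host_ids = [3, 3, 3, 7, 7, 9]
masses = [10.0, 2.0, 1.0, 5.0, 4.0, 8.0]

host_mass = [0.0] * len(host_ids)
for first, last, (group_masses,) in iter_groups(host_ids, [masses]):
    # Broadcast the first member's mass to every row of the group.
    host_mass[first:last] = [group_masses[0]] * (last - first)

print(host_mass)  # [10.0, 10.0, 10.0, 5.0, 5.0, 8.0]
```

Because ``itertools.groupby`` only merges consecutive equal keys, the sketch also shows why the input must be sorted by the grouping key before iterating: unsorted keys would split one group into several runs.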