Source code for halotools.utils.group_member_generator
""" Module containing the `group_member_generator`,
the primary engine of the group aggregation calculations.
"""
import numpy as np
from .array_utils import array_is_monotonic
__all__ = ('group_member_generator', )
def group_member_generator(data, grouping_key, requested_columns):
    """
    Generator used to loop over grouped data and yield
    requested properties of members of a group.

    When running a for loop over `group_member_generator`,
    you will be repeatedly sent arrays storing
    properties of data entries sharing a common ``grouping_key``.
    This enables you to perform whatever intra-group calculation
    you wish on each iteration, with one iteration per group.
    The generator also sends you the indices of the input ``data``
    corresponding to the yielded group members, allowing you to
    create new columns for your data table storing the results
    of your intra-group calculations.

    Before calling `group_member_generator`, the input ``data``
    must be sorted by the ``grouping_key`` so that
    ``data[grouping_key]`` is monotonic.

    Common applications of `group_member_generator` include
    subhalo analysis (e.g., calculating host halo mass) and
    galaxy group analysis (e.g., calculating total stellar mass
    or group-centric position). The Examples section below shows basic usage.
    There are also three tutorials demonstrating common applications in more detail:

    1. :ref:`galaxy_catalog_analysis_tutorial1`

    2. :ref:`halo_catalog_analysis_tutorial1`

    3. :ref:`example_merger_tree_analysis`

    Parameters
    ------------
    data : Structured Numpy `~numpy.ndarray` or Astropy `~astropy.table.Table`
        Table of data whose rows will be grouped.
        Must be sorted as described under ``grouping_key``.

    grouping_key : string
        Name of the column that defines how the input ``data`` are grouped,
        e.g., ``group_id`` or ``halo_hostid``.
        The input ``data`` must be sorted such that
        the array stored in ``data[grouping_key]`` is monotonic.

    requested_columns : list of strings
        List of column names that will be yielded by the generator.
        As you loop over the generator, one array is yielded for every
        string entry in ``requested_columns``.
        It is permissible for ``requested_columns`` to be an empty list,
        in which case the ``group_data_list`` yielded at each iteration
        will also be an empty list.

    Returns
    ---------
    first_idx, last_idx : int
        Bounds of the rows of the input ``data`` yielded at each iteration.
        The group members occupy the slice ``data[first_idx:last_idx]``,
        so ``last_idx`` is exclusive.

    group_data_list : list
        List of arrays storing the requested group member properties.
        There will be one element of ``group_data_list`` for every
        element of the input ``requested_columns``. Each element is a
        Numpy `~numpy.ndarray` with a length equal to the number of
        members of the group.

    Examples
    ----------
    First let's retrieve a Halotools-formatted halo catalog storing
    some randomly generated data.

    >>> from halotools.sim_manager import FakeSim
    >>> halocat = FakeSim()
    >>> halos = halocat.halo_table

    As described in :ref:`rockstar_subhalo_nomenclature`,
    the ``halo_hostid`` is a natural grouping key for a halo table.
    Let's use this key to calculate the host halo mass of all halos in
    the data table.

    First we build the generator:

    >>> halos.sort(['halo_hostid', 'halo_upid'])
    >>> grouping_key = 'halo_hostid'
    >>> requested_columns = ['halo_mvir']
    >>> group_gen = group_member_generator(halos, grouping_key, requested_columns)

    Then we loop over it:

    >>> result = np.zeros(len(halos))
    >>> for first, last, member_props in group_gen:
    ...     masses = member_props[0]
    ...     host_mass = masses[0]
    ...     result[first:last] = host_mass
    >>> halos['halo_mvir_host_halo'] = result

    Inside the scope of the loop, the first two yielded integers
    allow us to access the appropriate slice of the array being calculated.
    The ``member_props`` list stores only a single element, the
    *masses* array holding the value of ``halo_mvir``
    for each member of the host + subhalo system.

    Because we have sorted the halos by *both* ``halo_hostid`` and
    ``halo_upid``, the host halo appears first within each
    ``halo_hostid`` grouping: a host halo has ``halo_upid`` equal to -1,
    which is smaller than any value of ``halo_upid`` stored by a subhalo.
    Thus by selecting the first element of the *masses* array,
    we select the virial mass of the host halo.

    We can also use the `group_member_generator` to compute more complicated quantities.
    For example, let's calculate the mass-weighted mean spin of the members
    of each host + subhalo system.
    Note that our halo table is already sorted, so we save CPU time by not re-sorting it.

    >>> grouping_key = 'halo_hostid'
    >>> requested_columns = ['halo_mvir', 'halo_spin']
    >>> group_gen = group_member_generator(halos, grouping_key, requested_columns)

    >>> result = np.zeros(len(halos))
    >>> for first, last, member_props in group_gen:
    ...     masses = member_props[0]
    ...     spins = member_props[1]
    ...     mass_weighted_avg_spin = np.sum(masses*spins)/np.sum(masses)
    ...     result[first:last] = mass_weighted_avg_spin
    >>> halos['halo_mass_weighted_avg_spin'] = result

    """
    # Both supported input types expose their column names via ``dtype.names``.
    try:
        available_columns = data.dtype.names
    except AttributeError:
        msg = ("The input ``data`` must be an Astropy Table or Numpy Structured Array")
        raise TypeError(msg)

    try:
        assert grouping_key in available_columns
    except AssertionError:
        msg = ("Input ``grouping_key`` must be a column name of the input ``data``")
        raise KeyError(msg)

    try:
        _ = iter(requested_columns)
        for colname in requested_columns:
            assert colname in available_columns
    except TypeError:
        msg = ("\nThe input ``requested_columns`` must be an iterable sequence\n")
        raise TypeError(msg)
    except AssertionError:
        if isinstance(requested_columns, str):
            msg = ("\n Your input ``requested_columns`` should be a \n"
                "list of strings, not a single string\n")
        else:
            msg = ("\nEach element of the input ``requested_columns`` must be \n"
                "an existing column name of the input ``data``.\n")
        raise KeyError(msg)

    group_id_array = np.copy(data[grouping_key])

    # Members of a group must be contiguous, so require a monotonic key.
    try:
        assert array_is_monotonic(group_id_array, strict=False) != 0
    except AssertionError:
        msg = ("Your input ``data`` must be sorted so that the ``data[grouping_key]`` is monotonic")
        raise ValueError(msg)

    # For each distinct group ID: index of its first row and its number of members.
    result = np.unique(group_id_array, return_index=True, return_counts=True)
    group_ids_data, idx_groups_data, group_richness_data = result

    requested_array_list = [data[key].data for key in requested_columns]

    for igroup, host_halo_id in enumerate(group_ids_data):
        first_igroup_idx = idx_groups_data[igroup]
        last_igroup_idx = first_igroup_idx + group_richness_data[igroup]
        group_data_list = [arg[first_igroup_idx:last_igroup_idx] for arg in requested_array_list]
        yield first_igroup_idx, last_igroup_idx, group_data_list
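
The slicing logic at the end of the function can be tried out without a halo catalog. The block below is a minimal sketch, not part of Halotools: the helper name ``iter_groups`` and the toy columns ``group_id`` and ``mass`` are illustrative assumptions. It mirrors only the ``np.unique``-based loop and skips all of the input validation, so it assumes the table is already sorted by the grouping key.

```python
import numpy as np


def iter_groups(data, grouping_key, requested_columns):
    """Hypothetical stand-in that mirrors the np.unique-based loop above.

    Assumes ``data[grouping_key]`` is already sorted; no validation is done.
    """
    _, first_idx, counts = np.unique(
        data[grouping_key], return_index=True, return_counts=True)
    columns = [data[name] for name in requested_columns]
    for first, count in zip(first_idx, counts):
        last = first + count  # exclusive upper bound of the group's rows
        yield first, last, [col[first:last] for col in columns]


# Toy structured array: three groups with 2, 1 and 3 members,
# already sorted by group_id.
toy = np.zeros(6, dtype=[("group_id", "i8"), ("mass", "f8")])
toy["group_id"] = [10, 10, 20, 30, 30, 30]
toy["mass"] = [5.0, 1.0, 7.0, 4.0, 2.0, 3.0]

# Broadcast the summed mass of each group back onto its members.
total_mass = np.zeros(len(toy))
for first, last, (masses,) in iter_groups(toy, "group_id", ["mass"]):
    total_mass[first:last] = masses.sum()

print(total_mass)
```

In this toy table the members of the three groups receive summed masses of 6, 7 and 9 respectively, written through the half-open slice ``first:last`` in the same way as the docstring examples.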