Tutorial on the algorithm for HOD-based mock-making

This section of the documentation provides detailed notes for how the HodMockFactory populates halo catalogs with synthetic galaxy populations. The HodMockFactory uses composite models built with the HodModelFactory, which is documented in the Source code notes on HodModelFactory.

The bookkeeping of every step of the mock-generation is described in detail below. The design of the algorithm was built around the following considerations:

  1. The HodMockFactory should be able to correctly control mock-generation for an interchangeable set of (possibly inter-dependent) mappings from a halo table to a galaxy table.

  2. It should be easy for users to include just a single additional mapping without having to worry about any of the tedious bookkeeping of how memory is allocated for other galaxy properties or galaxy types.

These two goals together drive the HodMockFactory to be written with a high-level of abstraction. So the price paid for the modeling flexibility and the user-friendliness of designing new features is that the HodMockFactory gluing these behaviors together is complex. In this tutorial we describe every step in this abstract process, including a detailed look at the source code.


We will start in Basic syntax for making HOD-style mocks with a high-level overview of the functionality of the HodMockFactory class. We provide detailed notes on the source code of the mock factory in Algorithm for populating HOD-style mocks.

Basic syntax for making HOD-style mocks

The most common way to interact with instances of the HodMockFactory is as an attribute of the composite model you are using to generate the mock. For example, the code snippet below shows how the populate_mock method creates a mock object to the composite model, which in this case will be a model based on Zheng et al. (2007):

from halotools.empirical_models import PrebuiltHodModelFactory
zheng07_model = PrebuiltHodModelFactory('zheng07')

from halotools.sim_manager import CachedHaloCatalog
default_halocat = CachedHaloCatalog()


The final line of code above creates the zheng07_model.mock attribute, an instance of HodMockFactory.

The HodMockFactory is responsible for one task: using a Halotools composite model to populate a simulation with mock galaxies. When the HodModelFactory.populate_mock method first creates a model.mock instance, the instantiation of HodMockFactory triggers the pre-processing phase of mock population. Briefly, this phase does as many tasks in advance of actual mock population as possible to improve the efficiency of MCMCs (see below for details).

By default, instantiating the mock factory also triggers the HodMockFactory.populate method to be called. This is the method that actually creates the galaxy population. By calling the HodMockFactory.populate method, a new galaxy_table attribute is created and bound to the model.mock instance. The galaxy_table attribute stores an Astropy Table object with one row per mock galaxy and one column for every property assigned by the chosen composite model.

Algorithm for populating HOD-style mocks

Basics of numpy.repeat: the core function in HOD mock-making

Before going into the details of the HodMockFactory.populate method, in this section we will first cover a basic introduction to the numpy.repeat function, which is the single-most important Numpy function used by Halotools to make mock catalogs.

First let’s demonstrate basic usage:

>>> import numpy as np
>>> num_halos = 5
>>> halo_mass = np.logspace(11, 15, num_halos)
>>> halo_occupations = np.array([2, 0, 1, 0, 3])
>>> galaxy_host_halo_mass = np.repeat(halo_mass, halo_occupations)

The numpy.repeat function takes as input two arrays of equal length, in this case halo_mass and halo_occupations. The second array is interpreted as the number of times the corresponding entry in the first array should be repeated. A visualization of how this function behaves is shown in the diagram below.


This behavior is exactly what is needed to create a mock galaxy catalog with an HOD-style model. The numpy.repeat function is blazingly fast and ideally suited to the task at hand. The core function of any HOD model is to specify how many galaxies reside in a given halo. The task of assigning galaxy occupations to halos is controlled by the mc_occupation function of whatever sub-class of OccupationComponent you select as your model. The HodMockFactory does not need to know how occupation statistics are modeled - the only thing the factory needs to do is call the mc_occupation function to fill the occupations array in the diagram. The HodMockFactory simply calls numpy.repeat, passing (in this example) the halo_mvir column of the halo catalog as the first argument. This will create a length-Ngals array storing the host halo mass of every mock galaxy, which will then be included in the galaxy_table. If you want to include additional halo properties as columns in your galaxy_table, the HodMockFactory only needs to pass in additional halo catalog columns to numpy.repeat. In this way, the factory does not need to know anything about how the occupations array comes into existence. The factory only needs to repeatedly call numpy.repeat with the appropriate inputs that are determined by the model.

Pre-processing phase

The preliminary tasks of HOD mock-making are carried out by the preprocess_halo_catalog method. This first thing this function does is to throw out subhalos from the halo catalog, and to apply a halo mass completeness cut. You can control the completeness cut with the Num_ptcl_requirement keyword argument passed to the HodMockFactory constructor (Num_ptcl_requirement is also an optional keyword argument that may be passed to the HodModelFactory.populate_mock method, which in turn passes this argument on to the HodMockFactory constructor.)

New columns are then added to the halo_table according to any entries in the new_haloprop_func_dict; any such columns will automatically be included in the galaxy_table. See The new_haloprop_func_dict mechanism for further details.

Memory allocation phase

After pre-processing the halo catalog, memory must be allocated to store the galaxy_table. This is controlled by the allocate_memory method, the first few lines of which appear below.

def allocate_memory(self):

    self.galaxy_table = Table() # ``self`` refers to the ``model.mock`` object
    self._remaining_methods_to_call = copy(self.model._mock_generation_calling_sequence)

After initializing the galaxy_table, the allocate_memory method creates the _remaining_methods_to_call attribute; this will be used to keep a running list of the names of the composite model methods that the HodMockFactory should call; the factory calls the methods in this list one by one, deleting the corresponding name after the call; when the list is exhausted, mock-making is complete.

For every gal_type in a composite model, e.g., centrals or satellites, there is a corresponding mc_occupation_gal_type method that is responsible for determining how many galaxies of that type belong in each halo, e.g., mc_occupation_centrals or mc_occupation_satellites. As described in the previous section and elaborated upon in detail below, these mc_occupation methods work together with numpy.repeat to build the galaxy_table.

In HOD-style models, there is a natural division between the component model methods that get called by the HodMockFactory:

  1. methods that are called prior to the mc_occupation methods

  2. the mc_occupation methods themselves

  3. methods that are called after the mc_occupation methods.

For the first set of methods, memory has not yet been allocated for the galaxy_table as the length of the table is determined by the result of the call to the mc_occupation methods, which has stochastic behavior. For the third set of methods, the length of the galaxy_table has already been determined, and so these methods merely fill existing (empty) columns of the galaxy_table.

Galaxy properties assigned prior to the mc_occupation methods

As described above, the functions whose names appear in the _remaining_methods_to_call list are called in sequence by the HodMockFactory. For all functions that appear in this list prior to the first appearance of the mc_occupation_ functions, the data structure passed to the functions via the table keyword is the (post-processed) mock.halo_table. Thus whatever values these functions compute, the result of the functions will be stored in mock.halo_table. The reason that the functions in this phase are not directly passed mock.galaxy_table is because we do not know how many elements this table will have until calling the mc_occupation_ methods. Moreover, functions in this phase are by definition modeling some property of the galaxy population that does not even need to know which galaxies the property pertains to. Thus model functions that are called prior to the mc_occupation_ methods can only depend on the underlying halo, and so these functions should have sufficient knowledge to map their results on an input halo catalog.

All the same, any function in the _mock_generation_calling_sequence should have the results of its computations applied to the galaxy_table. As we will see in the next section, this transfer of information can be accomplished with numpy.repeat. Before seeing how that’s done, let’s look at the source code that generates the behavior we just described.

galprops_assigned_to_halo_table = []
for func_name in self.model._mock_generation_calling_sequence:
    if 'mc_occupation' in func_name:
        # exit when we encounter a ``mc_occupation_`` function
        func = getattr(self.model, func_name)
        func(table = self.halo_table)
        galprops_assigned_to_halo_table_by_func = func._galprop_dtypes_to_allocate.names


After executing the above for loop, all functions in the _mock_generation_calling_sequence that appear prior to the mc_occupation_ methods will have been executed after being passed the halo_table as the data structure bound to the table keyword argument. Additionally, the galprops_assigned_to_halo_table_by_func list works with The galprop_dtypes_to_allocate mechanism mechanism to keep track of which data columns we will pass on from the halo_table to the galaxy_table after allocating memory for the galaxy_table with the mc_occupation_ methods. After calling the mc_occupation_ methods, we will use numpy.repeat to transfer all columns appearing in additional_haloprops from the halo_table to the galaxy_table.

Calling the mc_occupation_ methods

The HodMockFactory marches through the mock-generation functions listed in _mock_generation_calling_sequence until the first time a function with name mc_occupation_ appears, triggering a new phase of the mock generation in which memory is allocated to the galaxy_table.

The basic way this next phase works is as follows. The halo_table is passed to each of the mc_occupation_ functions; these functions are called in the sequence in which their corresponding gal_type appears in self.gal_types. For example, if self.gal_types = ['centrals', 'satellites'], then the calling sequence of this phase will be mc_occupation_centrals followed by mc_occupation_satellites. For each gal_type in the composite model, the mc_occupation_gal_type function call will calculate a length-Nhalos array of integers that determines how many galaxies of that type belong in each halo. This integer array is stored in a private python dictionary self._occupation; there are num_gal_types keys in self._occupation, one key for the array of occupations associated with each gal_type.

Discussion of the mc_occupation_ phase continues below the following source code that controls the behavior we are describing.

self._occupation = {}
self._total_abundance = {}
self._gal_type_indices = {}

first_galaxy_index = 0
for gal_type in self.gal_types:
    occupation_func_name = 'mc_occupation_'+gal_type
    occupation_func = getattr(self.model, occupation_func_name)
    # Call the component model to get a Monte Carlo
    # realization of the abundance of gal_type galaxies
    self._occupation[gal_type] = occupation_func(table=self.halo_table)

    # Now use the above result to set up the indexing scheme
    self._total_abundance[gal_type] = (
    last_galaxy_index = first_galaxy_index + self._total_abundance[gal_type]
    # Build a bookkeeping device to keep track of
    # which array elements pertain to which gal_type.
    self._gal_type_indices[gal_type] = slice(
        first_galaxy_index, last_galaxy_index)
    first_galaxy_index = last_galaxy_index
    # Remove the mc_occupation function from the list of methods to call

self.Ngals = np.sum(list(self._total_abundance.values()))

After calling the occupation_func for a given gal_type, we know exactly how many galaxies of that type will appear in our mock. The total abundance of galaxies of a given type is stored in the self._total_abundance dictionary. The keys and integers stored in this dictionary determine the memory layout of the galaxy_table. Suppose we have three galaxy types, centrals, satellites and splashbacks. The first num_centrals = self._total_abundance['centrals'] rows of the galaxy_table store the information about our population of centrals; the next num_satellites rows are dedicated to the satellites; the final num_splashbacks rows stores the properties of the splashback galaxies.

Once these bookkeeping dictionaries have been built, allocating memory for the galaxy_table is straightforward. We just need to initialize all the columns to which values will be assigned by the functions that still appear in self._remaining_methods_to_call. The initialization is just a numpy.zeros array of the appropriate data type. These arrays will be filled during the third and final stage of mock-making.

# Allocate memory for all additional halo properties
for halocatkey in self.additional_haloprops:
    self.galaxy_table[halocatkey] = np.zeros(self.Ngals,
        dtype = self.halo_table[halocatkey].dtype)

dt = self.model._galprop_dtypes_to_allocate
for key in dt.names:
    self.galaxy_table[key] = np.zeros(self.Ngals, dtype = dt[key].type)

Final stage: galaxy properties assigned after the mc_occupation methods

After the galaxy_table has been initialized, the allocate_memory method is complete and control is returned to the populate method. The first thing the populate method does is transfer all the necessary halo properties from the halo_table to the galaxy_table. Recall the galaxy_table memory layout described in the previous section when viewing the following code block:

for gal_type in self.gal_types:

    # Retrieve the indices of our pre-allocated arrays
    # that store the info pertaining to gal_type galaxies
    gal_type_slice = self._gal_type_indices[gal_type]

    # Fill the relevant slice of the ``gal_type`` column with the appropriate string
    self.galaxy_table['gal_type'][gal_type_slice] = (
        np.repeat(gal_type, self._total_abundance[gal_type],axis=0))

    # Store all other relevant host halo properties into their
    # appropriate pre-allocated array
    for halocatkey in self.additional_haloprops:
        self.galaxy_table[halocatkey][gal_type_slice] = np.repeat(
            self.halo_table[halocatkey], self._occupation[gal_type], axis=0)

At this point, the galaxy_table stores all the halo properties needed by the composite model, and so when we proceed to call the remaining mock-generation functions, we can safely pass these functions the appropriate slice of the galaxy_table and rest assured that the functions will have all the information they need to assign whatever property they are responsible for.

The functions responsible for modeling the intra-halo distribution of galaxies (e.g., an NFWProfile for satellites and a TrivialProfile for centrals) model host-centric positions, so we initialize the positions of each galaxy to be at its halo center.

self.galaxy_table['x'] = self.galaxy_table['halo_x']
self.galaxy_table['y'] = self.galaxy_table['halo_y']
self.galaxy_table['z'] = self.galaxy_table['halo_z']
self.galaxy_table['vx'] = self.galaxy_table['halo_vx']
self.galaxy_table['vy'] = self.galaxy_table['halo_vy']
self.galaxy_table['vz'] = self.galaxy_table['halo_vz']

Finally, the loop below is where all the action happens with galaxy property assignment:

for method in self._remaining_methods_to_call:
    func = getattr(self.model, method)
    gal_type_slice = self._gal_type_indices[func.gal_type]
    func(table = self.galaxy_table[gal_type_slice])

We perform a loop over all functions whose names remain in self._remaining_methods_to_call. All functions appearing in this list have a gal_type attribute bound to them that was created by the HodModelFactory during the assembly of the composite model (see the Inheriting behaviors from the component models section of the Source code notes on HodModelFactory for further details about how the HodModelFactory binds the appropriate gal_type information to each mock-generation function). This gal_type determines the slice of the galaxy_table that will be passed to the function via the table keyword argument. This way, each component model does not need to manage any bookkeeping about which slice of an input table it assigns its properties to: the HodMockFactory is responsible for passing the slice of the galaxy_table that pertains to the gal_type of the component model.

The generation of all properties predicted by the model is now complete. The HodMockFactory.populate function terminates after enforcing the periodic boundary conditions of the simulation. This enforcement is controlled by the HodMockFactory because component models of halo profiles strictly assign halo-centric positions. Thus corrections need to be applied to cases where galaxies are assigned to regions of the halo that spilled over the edges of the simulation box.

Summary: which data table is passed to the table keyword argument in HOD mock-making?

As described in detail in Tutorial on building an HOD-style model, all methods in the _mock_generation_calling_sequence of any HOD-style composite model must accept a table keyword argument. As shown above, the actual data structure passed via this argument depends on the phase of the mock-making in which the function is called. We conclude this tutorial by quickly summarizing which data structure gets passed to which function.

  1. Mock-generation functions whose names appear prior to the first appearance of a mc_occupation_ function get passed the entire post-processed model.mock.halo_table to their table keyword argument.

  2. mc_occupation_ functions also receive the entire model.mock.halo_table via their table keyword argument.

  3. Mock-generation functions whose names appear after the first mc_occupation_ function get passed a slice of the model.mock.galaxy_table; the slice they are passed is the section of the table that pertains to the gal_type the function is associated with.