Tutorial on the algorithm for HOD-based mock-making¶
This section of the documentation provides detailed notes
for how the HodMockFactory
populates halo catalogs with synthetic galaxy populations.
The HodMockFactory
uses composite models built with the HodModelFactory
, which
is documented in the Source code notes on HodModelFactory.
The bookkeeping of every step of the mock-generation is described in detail below. The design of the algorithm was built around the following considerations:
The
HodMockFactory
should be able to correctly control mock-generation for an interchangeable set of (possibly inter-dependent) mappings from a halo table to a galaxy table.It should be easy for users to include just a single additional mapping without having to worry about any of the tedious bookkeeping of how memory is allocated for other galaxy properties or galaxy types.
These two goals together drive the HodMockFactory
to be written with a high-level of abstraction.
So the price paid for the modeling flexibility and the user-friendliness of designing
new features is that the HodMockFactory
gluing these behaviors together is complex.
In this tutorial we describe every step in this abstract process,
including a detailed look at the source code.
Outline¶
We will start in Basic syntax for making HOD-style mocks with a high-level overview of the functionality
of the HodMockFactory
class. We provide detailed
notes on the source code of the mock factory in Algorithm for populating HOD-style mocks.
Basic syntax for making HOD-style mocks¶
The most common way to interact with
instances of the HodMockFactory
is as an attribute of the composite model you
are using to generate the mock. For example, the code snippet below shows how
the populate_mock
method creates a mock
object to
the composite model, which in this case will be a model based on Zheng et al. (2007):
from halotools.empirical_models import PrebuiltHodModelFactory
zheng07_model = PrebuiltHodModelFactory('zheng07')
from halotools.sim_manager import CachedHaloCatalog
default_halocat = CachedHaloCatalog()
zheng07_model.populate_mock(default_halocat)
The final line of code above creates the zheng07_model.mock
attribute,
an instance of HodMockFactory
.
The HodMockFactory
is responsible for one task: using a Halotools composite model
to populate a simulation with mock galaxies.
When the HodModelFactory.populate_mock
method first creates a model.mock
instance,
the instantiation of HodMockFactory
triggers the pre-processing phase of mock population.
Briefly, this phase does as many tasks in advance of actual mock population as possible
to improve the efficiency of MCMCs (see below for details).
By default, instantiating the mock factory also triggers
the HodMockFactory.populate
method to be called. This is the method that actually creates
the galaxy population. By calling the HodMockFactory.populate
method,
a new galaxy_table
attribute is created and bound to the model.mock
instance.
The galaxy_table
attribute stores an Astropy Table
object with one row
per mock galaxy and one column for every property assigned by the chosen composite model.
Algorithm for populating HOD-style mocks¶
Basics of numpy.repeat
: the core function in HOD mock-making¶
Before going into the details of the HodMockFactory.populate
method,
in this section we will first cover a basic introduction to the numpy.repeat
function,
which is the single-most important Numpy function used by Halotools to make mock catalogs.
First let’s demonstrate basic usage:
>>> import numpy as np
>>> num_halos = 5
>>> halo_mass = np.logspace(11, 15, num_halos)
>>> halo_occupations = np.array([2, 0, 1, 0, 3])
>>> galaxy_host_halo_mass = np.repeat(halo_mass, halo_occupations)
The numpy.repeat
function takes as input two arrays of equal length,
in this case halo_mass
and halo_occupations
. The second array is interpreted
as the number of times the corresponding entry in the first array should be repeated.
A visualization of how this function behaves is shown in the diagram below.
This behavior is exactly what is needed to create a mock galaxy catalog with an HOD-style model.
The numpy.repeat
function is blazingly fast and ideally suited to the task at hand.
The core function of any HOD model is to specify how many galaxies reside in a given halo.
The task of assigning galaxy occupations to halos is controlled by
the mc_occupation
function
of whatever sub-class of OccupationComponent
you select as your model.
The HodMockFactory
does not need to know how occupation statistics
are modeled - the only thing the factory needs to do is call
the mc_occupation
function to fill the
occupations
array in the diagram.
The HodMockFactory
simply calls numpy.repeat
,
passing (in this example) the halo_mvir
column of the halo catalog as the first argument.
This will create a length-Ngals array storing the host halo mass of every mock galaxy,
which will then be included in the galaxy_table
.
If you want to include additional halo properties as columns in your galaxy_table
,
the HodMockFactory
only needs to pass in additional
halo catalog columns to numpy.repeat
. In this way, the factory does not need to know anything
about how the occupations
array comes into existence. The factory
only needs to repeatedly call numpy.repeat
with the appropriate inputs that are
determined by the model.
Pre-processing phase¶
The preliminary tasks of HOD mock-making are carried out by the
preprocess_halo_catalog
method. This first thing this function does
is to throw out subhalos from the halo catalog, and to apply a halo mass completeness cut.
You can control the completeness cut with the Num_ptcl_requirement
keyword argument passed
to the HodMockFactory
constructor (Num_ptcl_requirement
is also an optional keyword argument
that may be passed to the HodModelFactory.populate_mock
method, which in turn passes this argument
on to the HodMockFactory
constructor.)
New columns are then added to the halo_table
according to any entries in the new_haloprop_func_dict
; any such columns will automatically be included
in the galaxy_table
. See The new_haloprop_func_dict mechanism for further details.
Memory allocation phase¶
After pre-processing the halo catalog, memory must be allocated to store the galaxy_table
.
This is controlled by the allocate_memory
method, the first few lines of which
appear below.
def allocate_memory(self):
self.galaxy_table = Table() # ``self`` refers to the ``model.mock`` object
self._remaining_methods_to_call = copy(self.model._mock_generation_calling_sequence)
After initializing the galaxy_table
,
the allocate_memory
method creates the _remaining_methods_to_call
attribute; this will be used to keep a running list of the names of the
composite model methods that the HodMockFactory
should call;
the factory calls the methods in this list one by one, deleting the corresponding name
after the call; when the list is exhausted, mock-making is complete.
For every gal_type
in a composite model, e.g., centrals
or satellites
,
there is a corresponding mc_occupation_gal_type
method that is responsible for
determining how many galaxies of that type belong in each halo, e.g.,
mc_occupation_centrals
or mc_occupation_satellites
.
As described in the previous section and elaborated upon in detail below,
these mc_occupation
methods work together with
numpy.repeat
to build the galaxy_table
.
In HOD-style models, there is a natural division between the component model methods that
get called by the HodMockFactory
:
methods that are called prior to the
mc_occupation
methodsthe
mc_occupation
methods themselvesmethods that are called after the
mc_occupation
methods.
For the first set of methods, memory has not yet been allocated for the galaxy_table
as the length of the table is determined by the result of the call to the mc_occupation
methods, which has stochastic behavior.
For the third set of methods, the length of the galaxy_table
has already been
determined, and so these methods merely fill existing (empty) columns of the galaxy_table
.
Galaxy properties assigned prior to the mc_occupation
methods¶
As described above, the functions whose names appear in the
_remaining_methods_to_call
list are called in sequence by the HodMockFactory
.
For all functions that appear in this list prior to the first appearance of the
mc_occupation_
functions, the data structure passed to the functions
via the table keyword is the (post-processed) mock.halo_table
.
Thus whatever values these functions compute, the result of the functions
will be stored in mock.halo_table
. The reason that the functions in this phase
are not directly passed mock.galaxy_table
is because we do not know how many
elements this table will have until calling the mc_occupation_
methods.
Moreover, functions in this phase are by definition modeling some
property of the galaxy population that does not even need to know
which galaxies the property pertains to. Thus model functions that are called
prior to the mc_occupation_
methods can only depend on
the underlying halo, and so these functions should have sufficient knowledge to map
their results on an input halo catalog.
All the same, any function in the _mock_generation_calling_sequence
should have
the results of its computations applied to the galaxy_table
. As we will see
in the next section, this transfer of information can be accomplished with
numpy.repeat
. Before seeing how that’s done, let’s look at the source code that generates
the behavior we just described.
galprops_assigned_to_halo_table = []
for func_name in self.model._mock_generation_calling_sequence:
if 'mc_occupation' in func_name:
# exit when we encounter a ``mc_occupation_`` function
break
else:
func = getattr(self.model, func_name)
func(table = self.halo_table)
galprops_assigned_to_halo_table_by_func = func._galprop_dtypes_to_allocate.names
galprops_assigned_to_halo_table.extend(galprops_assigned_to_halo_table_by_func)
self._remaining_methods_to_call.remove(func_name)
self.additional_haloprops.extend(galprops_assigned_to_halo_table)
After executing the above for loop, all functions in the _mock_generation_calling_sequence
that appear prior to the mc_occupation_
methods will have been executed after being
passed the halo_table
as the data structure bound to the table
keyword argument.
Additionally, the galprops_assigned_to_halo_table_by_func
list works with
The galprop_dtypes_to_allocate mechanism mechanism to keep track of
which data columns we will pass on from the halo_table
to the galaxy_table
after allocating memory for the galaxy_table
with the mc_occupation_
methods.
After calling the mc_occupation_
methods, we will use numpy.repeat
to transfer all columns
appearing in additional_haloprops
from the halo_table
to the galaxy_table
.
Calling the mc_occupation_
methods¶
The HodMockFactory
marches through the mock-generation functions listed
in _mock_generation_calling_sequence
until the first time a function with name
mc_occupation_
appears, triggering a new phase of the mock generation in which
memory is allocated to the galaxy_table
.
The basic way this next phase works is as follows. The halo_table
is passed
to each of the mc_occupation_
functions; these functions are called in the sequence
in which their corresponding gal_type
appears in self.gal_types
.
For example, if self.gal_types = ['centrals', 'satellites']
, then the
calling sequence of this phase will be mc_occupation_centrals
followed by
mc_occupation_satellites
.
For each gal_type
in the composite model,
the mc_occupation_gal_type
function call will calculate a length-Nhalos array
of integers that determines how many galaxies of that type belong in each halo.
This integer array is stored in a private python dictionary self._occupation
;
there are num_gal_types keys in self._occupation
,
one key for the array of occupations associated with each gal_type
.
Discussion of the mc_occupation_
phase continues below the following source code
that controls the behavior we are describing.
self._occupation = {}
self._total_abundance = {}
self._gal_type_indices = {}
first_galaxy_index = 0
for gal_type in self.gal_types:
occupation_func_name = 'mc_occupation_'+gal_type
occupation_func = getattr(self.model, occupation_func_name)
# Call the component model to get a Monte Carlo
# realization of the abundance of gal_type galaxies
self._occupation[gal_type] = occupation_func(table=self.halo_table)
# Now use the above result to set up the indexing scheme
self._total_abundance[gal_type] = (
self._occupation[gal_type].sum()
)
last_galaxy_index = first_galaxy_index + self._total_abundance[gal_type]
# Build a bookkeeping device to keep track of
# which array elements pertain to which gal_type.
self._gal_type_indices[gal_type] = slice(
first_galaxy_index, last_galaxy_index)
first_galaxy_index = last_galaxy_index
# Remove the mc_occupation function from the list of methods to call
self._remaining_methods_to_call.remove(occupation_func_name)
self.additional_haloprops.append('halo_num_'+gal_type)
self.Ngals = np.sum(list(self._total_abundance.values()))
After calling the occupation_func
for a given gal_type
, we know exactly how many
galaxies of that type will appear in our mock. The total abundance of galaxies of a given
type is stored in the self._total_abundance
dictionary. The keys and integers stored
in this dictionary determine the memory layout of the galaxy_table
. Suppose we have
three galaxy types, centrals
, satellites
and splashbacks
. The first
num_centrals = self._total_abundance['centrals']
rows of the galaxy_table
store the information about our population of centrals; the next num_satellites
rows
are dedicated to the satellites; the final num_splashbacks
rows stores the properties
of the splashback galaxies.
Once these bookkeeping dictionaries have been built, allocating memory for the galaxy_table
is straightforward. We just need to initialize all the columns to which values will be
assigned by the functions that still appear in self._remaining_methods_to_call
.
The initialization is just a numpy.zeros
array of the appropriate data type.
These arrays will be filled during the third and final stage of mock-making.
# Allocate memory for all additional halo properties
for halocatkey in self.additional_haloprops:
self.galaxy_table[halocatkey] = np.zeros(self.Ngals,
dtype = self.halo_table[halocatkey].dtype)
dt = self.model._galprop_dtypes_to_allocate
for key in dt.names:
self.galaxy_table[key] = np.zeros(self.Ngals, dtype = dt[key].type)
Final stage: galaxy properties assigned after the mc_occupation
methods¶
After the galaxy_table
has been initialized, the allocate_memory
method
is complete and control is returned to the populate
method. The first
thing the populate
method does is transfer all the necessary halo properties
from the halo_table
to the galaxy_table
. Recall the galaxy_table
memory layout
described in the previous section when viewing the following code block:
for gal_type in self.gal_types:
# Retrieve the indices of our pre-allocated arrays
# that store the info pertaining to gal_type galaxies
gal_type_slice = self._gal_type_indices[gal_type]
# Fill the relevant slice of the ``gal_type`` column with the appropriate string
self.galaxy_table['gal_type'][gal_type_slice] = (
np.repeat(gal_type, self._total_abundance[gal_type],axis=0))
# Store all other relevant host halo properties into their
# appropriate pre-allocated array
for halocatkey in self.additional_haloprops:
self.galaxy_table[halocatkey][gal_type_slice] = np.repeat(
self.halo_table[halocatkey], self._occupation[gal_type], axis=0)
At this point, the galaxy_table
stores all the halo properties needed by
the composite model, and so when we proceed to call the remaining mock-generation
functions, we can safely pass these functions the appropriate slice of the
galaxy_table
and rest assured that the functions will have all the information
they need to assign whatever property they are responsible for.
The functions responsible for modeling the intra-halo distribution of galaxies
(e.g., an NFWProfile
for satellites and a TrivialProfile
for centrals)
model host-centric positions, so we initialize the positions of each galaxy
to be at its halo center.
self.galaxy_table['x'] = self.galaxy_table['halo_x']
self.galaxy_table['y'] = self.galaxy_table['halo_y']
self.galaxy_table['z'] = self.galaxy_table['halo_z']
self.galaxy_table['vx'] = self.galaxy_table['halo_vx']
self.galaxy_table['vy'] = self.galaxy_table['halo_vy']
self.galaxy_table['vz'] = self.galaxy_table['halo_vz']
Finally, the loop below is where all the action happens with galaxy property assignment:
for method in self._remaining_methods_to_call:
func = getattr(self.model, method)
gal_type_slice = self._gal_type_indices[func.gal_type]
func(table = self.galaxy_table[gal_type_slice])
We perform a loop over all functions whose names remain in
self._remaining_methods_to_call
. All functions appearing in this list
have a gal_type
attribute bound to them that was created by the
HodModelFactory
during the assembly of the composite model
(see the Inheriting behaviors from the component models section of the
Source code notes on HodModelFactory for further details about how the HodModelFactory
binds the appropriate gal_type
information to each mock-generation function).
This gal_type
determines the slice of the galaxy_table
that will be passed to the
function via the table
keyword argument. This way, each component model
does not need to manage any bookkeeping about which slice of an input table
it assigns its properties to: the HodMockFactory
is responsible for passing
the slice of the galaxy_table
that pertains to the gal_type
of the component model.
The generation of all properties predicted by the model is now complete.
The HodMockFactory.populate
function terminates after enforcing the
periodic boundary conditions of the simulation. This enforcement is controlled
by the HodMockFactory
because component models of halo profiles strictly
assign halo-centric positions. Thus corrections need to be applied to
cases where galaxies are assigned to regions of the halo
that spilled over the edges of the simulation box.
Summary: which data table is passed to the table
keyword argument in HOD mock-making?¶
As described in detail in Tutorial on building an HOD-style model, all methods in the
_mock_generation_calling_sequence
of any HOD-style composite model
must accept a table
keyword argument. As shown above, the actual data structure
passed via this argument depends on the phase of the mock-making in which
the function is called. We conclude this tutorial by quickly summarizing
which data structure gets passed to which function.
Mock-generation functions whose names appear prior to the first appearance of a
mc_occupation_
function get passed the entire post-processedmodel.mock.halo_table
to theirtable
keyword argument.mc_occupation_
functions also receive the entiremodel.mock.halo_table
via theirtable
keyword argument.Mock-generation functions whose names appear after the first
mc_occupation_
function get passed a slice of themodel.mock.galaxy_table
; the slice they are passed is the section of the table that pertains to thegal_type
the function is associated with.