:orphan: .. currentmodule:: halotools.empirical_models .. _hod_mock_factory_source_notes: ******************************************************************** Tutorial on the algorithm for HOD-based mock-making ******************************************************************** This section of the documentation provides detailed notes for how the `HodMockFactory` populates halo catalogs with synthetic galaxy populations. The `HodMockFactory` uses composite models built with the `HodModelFactory`, which is documented in the :ref:`hod_model_factory_source_notes`. The bookkeeping of every step of the mock-generation is described in detail below. The design of the algorithm was built around the following considerations: 1. The `HodMockFactory` should be able to correctly control mock-generation for an interchangeable set of (possibly inter-dependent) mappings from a halo table to a galaxy table. 2. It should be easy for users to include just a single additional mapping without having to worry about any of the tedious bookkeeping of how memory is allocated for other galaxy properties or galaxy types. These two goals together drive the `HodMockFactory` to be written with a high-level of abstraction. So the price paid for the modeling flexibility and the user-friendliness of designing new features is that the `HodMockFactory` gluing these behaviors together is complex. In this tutorial we describe every step in this abstract process, including a detailed look at the source code. Outline ======== We will start in :ref:`basic_syntax_hod_mocks` with a high-level overview of the functionality of the `HodMockFactory` class. We provide detailed notes on the source code of the mock factory in :ref:`hod_mock_algorithm`. .. _basic_syntax_hod_mocks: Basic syntax for making HOD-style mocks =============================================== The most common way to interact with instances of the `HodMockFactory` is as an attribute of the composite model you are using to generate the mock. For example, the code snippet below shows how the `~HodModelFactory.populate_mock` method creates a ``mock`` object to the composite model, which in this case will be a model based on Zheng et al. (2007): .. code-block:: python from halotools.empirical_models import PrebuiltHodModelFactory zheng07_model = PrebuiltHodModelFactory('zheng07') from halotools.sim_manager import CachedHaloCatalog default_halocat = CachedHaloCatalog() zheng07_model.populate_mock(default_halocat) The final line of code above creates the ``zheng07_model.mock`` attribute, an instance of `HodMockFactory`. The `HodMockFactory` is responsible for one task: using a Halotools composite model to populate a simulation with mock galaxies. When the `HodModelFactory.populate_mock` method first creates a ``model.mock`` instance, the instantiation of `HodMockFactory` triggers the pre-processing phase of mock population. Briefly, this phase does as many tasks in advance of actual mock population as possible to improve the efficiency of MCMCs (see below for details). By default, instantiating the mock factory also triggers the `HodMockFactory.populate` method to be called. This is the method that actually creates the galaxy population. By calling the `HodMockFactory.populate` method, a new ``galaxy_table`` attribute is created and bound to the ``model.mock`` instance. The ``galaxy_table`` attribute stores an Astropy `~astropy.table.Table` object with one row per mock galaxy and one column for every property assigned by the chosen composite model. .. _hod_mock_algorithm: Algorithm for populating HOD-style mocks ================================================ .. _intro_to_np_repeat: Basics of `numpy.repeat`: the core function in HOD mock-making ---------------------------------------------------------------- Before going into the details of the `HodMockFactory.populate` method, in this section we will first cover a basic introduction to the `numpy.repeat` function, which is the single-most important Numpy function used by Halotools to make mock catalogs. First let's demonstrate basic usage: >>> import numpy as np >>> num_halos = 5 >>> halo_mass = np.logspace(11, 15, num_halos) >>> halo_occupations = np.array([2, 0, 1, 0, 3]) >>> galaxy_host_halo_mass = np.repeat(halo_mass, halo_occupations) The `numpy.repeat` function takes as input two arrays of equal length, in this case ``halo_mass`` and ``halo_occupations``. The second array is interpreted as the number of times the corresponding entry in the first array should be repeated. A visualization of how this function behaves is shown in the diagram below. .. image:: np_repeat_tutorial.png This behavior is exactly what is needed to create a mock galaxy catalog with an HOD-style model. The `numpy.repeat` function is blazingly fast and ideally suited to the task at hand. The core function of any HOD model is to specify how many galaxies reside in a given halo. The task of assigning galaxy occupations to halos is controlled by the `~halotools.empirical_models.OccupationComponent.mc_occupation` function of whatever sub-class of `~halotools.empirical_models.OccupationComponent` you select as your model. The `~halotools.empirical_models.HodMockFactory` does not need to know how occupation statistics are modeled - the only thing the factory needs to do is call the `~halotools.empirical_models.OccupationComponent.mc_occupation` function to fill the ``occupations`` array in the diagram. The `~halotools.empirical_models.HodMockFactory` simply calls `numpy.repeat`, passing (in this example) the ``halo_mvir`` column of the halo catalog as the first argument. This will create a length-*Ngals* array storing the host halo mass of every mock galaxy, which will then be included in the ``galaxy_table``. If you want to include additional halo properties as columns in your ``galaxy_table``, the `~halotools.empirical_models.HodMockFactory` only needs to pass in additional halo catalog columns to `numpy.repeat`. In this way, the factory does not need to know *anything* about how the ``occupations`` array comes into existence. The factory only needs to repeatedly call `numpy.repeat` with the appropriate inputs that are determined by the model. Pre-processing phase ---------------------- The preliminary tasks of HOD mock-making are carried out by the `~HodMockFactory.preprocess_halo_catalog` method. This first thing this function does is to throw out subhalos from the halo catalog, and to apply a halo mass completeness cut. You can control the completeness cut with the ``Num_ptcl_requirement`` keyword argument passed to the `HodMockFactory` constructor (``Num_ptcl_requirement`` is also an optional keyword argument that may be passed to the `HodModelFactory.populate_mock` method, which in turn passes this argument on to the `HodMockFactory` constructor.) New columns are then added to the ``halo_table`` according to any entries in the ``new_haloprop_func_dict``; any such columns will automatically be included in the ``galaxy_table``. See :ref:`new_haloprop_func_dict_mechanism` for further details. Memory allocation phase --------------------------- After pre-processing the halo catalog, memory must be allocated to store the ``galaxy_table``. This is controlled by the `~HodMockFactory.allocate_memory` method, the first few lines of which appear below. .. code-block:: python def allocate_memory(self): self.galaxy_table = Table() # ``self`` refers to the ``model.mock`` object self._remaining_methods_to_call = copy(self.model._mock_generation_calling_sequence) After initializing the ``galaxy_table``, the `~HodMockFactory.allocate_memory` method creates the ``_remaining_methods_to_call`` attribute; this will be used to keep a running list of the names of the composite model methods that the `HodMockFactory` should call; the factory calls the methods in this list one by one, deleting the corresponding name after the call; when the list is exhausted, mock-making is complete. For every ``gal_type`` in a composite model, e.g., ``centrals`` or ``satellites``, there is a corresponding ``mc_occupation_gal_type`` method that is responsible for determining how many galaxies of that type belong in each halo, e.g., ``mc_occupation_centrals`` or ``mc_occupation_satellites``. As described in the previous section and elaborated upon in detail below, these ``mc_occupation`` methods work together with `numpy.repeat` to build the ``galaxy_table``. In HOD-style models, there is a natural division between the component model methods that get called by the `HodMockFactory`: 1. methods that are called *prior* to the ``mc_occupation`` methods 2. the ``mc_occupation`` methods themselves 3. methods that are called *after* the ``mc_occupation`` methods. For the first set of methods, memory has not yet been allocated for the ``galaxy_table`` as the length of the table is determined by the result of the call to the ``mc_occupation`` methods, which has stochastic behavior. For the third set of methods, the length of the ``galaxy_table`` has already been determined, and so these methods merely fill existing (empty) columns of the ``galaxy_table``. .. _galprops_assigned_before_mc_occupation: Galaxy properties assigned prior to the ``mc_occupation`` methods ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ As described above, the functions whose names appear in the ``_remaining_methods_to_call`` list are called in sequence by the `HodMockFactory`. For all functions that appear in this list *prior* to the first appearance of the ``mc_occupation_`` functions, the data structure passed to the functions via the *table* keyword is the (post-processed) ``mock.halo_table``. Thus whatever values these functions compute, the result of the functions will be stored in ``mock.halo_table``. The reason that the functions in this phase are not directly passed ``mock.galaxy_table`` is because we do not know how many elements this table will have until calling the ``mc_occupation_`` methods. Moreover, functions in this phase are by definition modeling some property of the galaxy population that does not even need to know which galaxies the property pertains to. Thus model functions that are called prior to the ``mc_occupation_`` methods can only depend on the underlying halo, and so these functions should have sufficient knowledge to map their results on an input halo catalog. All the same, any function in the ``_mock_generation_calling_sequence`` should have the results of its computations applied to the ``galaxy_table``. As we will see in the next section, this transfer of information can be accomplished with `numpy.repeat`. Before seeing how that's done, let's look at the source code that generates the behavior we just described. .. code-block:: python galprops_assigned_to_halo_table = [] for func_name in self.model._mock_generation_calling_sequence: if 'mc_occupation' in func_name: # exit when we encounter a ``mc_occupation_`` function break else: func = getattr(self.model, func_name) func(table = self.halo_table) galprops_assigned_to_halo_table_by_func = func._galprop_dtypes_to_allocate.names galprops_assigned_to_halo_table.extend(galprops_assigned_to_halo_table_by_func) self._remaining_methods_to_call.remove(func_name) self.additional_haloprops.extend(galprops_assigned_to_halo_table) After executing the above for loop, all functions in the ``_mock_generation_calling_sequence`` that appear prior to the ``mc_occupation_`` methods will have been executed after being passed the ``halo_table`` as the data structure bound to the ``table`` keyword argument. Additionally, the ``galprops_assigned_to_halo_table_by_func`` list works with :ref:`galprop_dtypes_to_allocate_mechanism` mechanism to keep track of which data columns we will pass on from the ``halo_table`` to the ``galaxy_table`` after allocating memory for the ``galaxy_table`` with the ``mc_occupation_`` methods. After calling the ``mc_occupation_`` methods, we will use `numpy.repeat` to transfer all columns appearing in ``additional_haloprops`` from the ``halo_table`` to the ``galaxy_table``. Calling the ``mc_occupation_`` methods ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The `HodMockFactory` marches through the mock-generation functions listed in ``_mock_generation_calling_sequence`` until the first time a function with name ``mc_occupation_`` appears, triggering a new phase of the mock generation in which memory is allocated to the ``galaxy_table``. The basic way this next phase works is as follows. The ``halo_table`` is passed to each of the ``mc_occupation_`` functions; these functions are called in the sequence in which their corresponding ``gal_type`` appears in ``self.gal_types``. For example, if ``self.gal_types = ['centrals', 'satellites']``, then the calling sequence of this phase will be ``mc_occupation_centrals`` followed by ``mc_occupation_satellites``. For each ``gal_type`` in the composite model, the ``mc_occupation_gal_type`` function call will calculate a length-*Nhalos* array of integers that determines how many galaxies of that type belong in each halo. This integer array is stored in a private python dictionary ``self._occupation``; there are *num_gal_types* keys in ``self._occupation``, one key for the array of occupations associated with each ``gal_type``. Discussion of the ``mc_occupation_`` phase continues below the following source code that controls the behavior we are describing. .. code-block:: python self._occupation = {} self._total_abundance = {} self._gal_type_indices = {} first_galaxy_index = 0 for gal_type in self.gal_types: occupation_func_name = 'mc_occupation_'+gal_type occupation_func = getattr(self.model, occupation_func_name) # Call the component model to get a Monte Carlo # realization of the abundance of gal_type galaxies self._occupation[gal_type] = occupation_func(table=self.halo_table) # Now use the above result to set up the indexing scheme self._total_abundance[gal_type] = ( self._occupation[gal_type].sum() ) last_galaxy_index = first_galaxy_index + self._total_abundance[gal_type] # Build a bookkeeping device to keep track of # which array elements pertain to which gal_type. self._gal_type_indices[gal_type] = slice( first_galaxy_index, last_galaxy_index) first_galaxy_index = last_galaxy_index # Remove the mc_occupation function from the list of methods to call self._remaining_methods_to_call.remove(occupation_func_name) self.additional_haloprops.append('halo_num_'+gal_type) self.Ngals = np.sum(list(self._total_abundance.values())) After calling the ``occupation_func`` for a given ``gal_type``, we know exactly how many galaxies of that type will appear in our mock. The total abundance of galaxies of a given type is stored in the ``self._total_abundance`` dictionary. The keys and integers stored in this dictionary determine the memory layout of the ``galaxy_table``. Suppose we have three galaxy types, ``centrals``, ``satellites`` and ``splashbacks``. The first ``num_centrals = self._total_abundance['centrals']`` rows of the ``galaxy_table`` store the information about our population of centrals; the next ``num_satellites`` rows are dedicated to the satellites; the final ``num_splashbacks`` rows stores the properties of the splashback galaxies. Once these bookkeeping dictionaries have been built, allocating memory for the ``galaxy_table`` is straightforward. We just need to initialize all the columns to which values will be assigned by the functions that still appear in ``self._remaining_methods_to_call``. The initialization is just a `numpy.zeros` array of the appropriate data type. These arrays will be filled during the third and final stage of mock-making. .. code-block:: python # Allocate memory for all additional halo properties for halocatkey in self.additional_haloprops: self.galaxy_table[halocatkey] = np.zeros(self.Ngals, dtype = self.halo_table[halocatkey].dtype) dt = self.model._galprop_dtypes_to_allocate for key in dt.names: self.galaxy_table[key] = np.zeros(self.Ngals, dtype = dt[key].type) .. _mock_generation_functions_after_mc_occupation: Final stage: galaxy properties assigned after the ``mc_occupation`` methods ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ After the ``galaxy_table`` has been initialized, the ``allocate_memory`` method is complete and control is returned to the ``populate`` method. The first thing the ``populate`` method does is transfer all the necessary halo properties from the ``halo_table`` to the ``galaxy_table``. Recall the ``galaxy_table`` memory layout described in the previous section when viewing the following code block: .. code-block:: python for gal_type in self.gal_types: # Retrieve the indices of our pre-allocated arrays # that store the info pertaining to gal_type galaxies gal_type_slice = self._gal_type_indices[gal_type] # Fill the relevant slice of the ``gal_type`` column with the appropriate string self.galaxy_table['gal_type'][gal_type_slice] = ( np.repeat(gal_type, self._total_abundance[gal_type],axis=0)) # Store all other relevant host halo properties into their # appropriate pre-allocated array for halocatkey in self.additional_haloprops: self.galaxy_table[halocatkey][gal_type_slice] = np.repeat( self.halo_table[halocatkey], self._occupation[gal_type], axis=0) At this point, the ``galaxy_table`` stores all the halo properties needed by the composite model, and so when we proceed to call the remaining mock-generation functions, we can safely pass these functions the appropriate slice of the ``galaxy_table`` and rest assured that the functions will have all the information they need to assign whatever property they are responsible for. The functions responsible for modeling the intra-halo distribution of galaxies (e.g., an `NFWProfile` for satellites and a `TrivialProfile` for centrals) model *host-centric* positions, so we initialize the positions of each galaxy to be at its halo center. .. code-block:: python self.galaxy_table['x'] = self.galaxy_table['halo_x'] self.galaxy_table['y'] = self.galaxy_table['halo_y'] self.galaxy_table['z'] = self.galaxy_table['halo_z'] self.galaxy_table['vx'] = self.galaxy_table['halo_vx'] self.galaxy_table['vy'] = self.galaxy_table['halo_vy'] self.galaxy_table['vz'] = self.galaxy_table['halo_vz'] Finally, the loop below is where all the action happens with galaxy property assignment: .. code-block:: python for method in self._remaining_methods_to_call: func = getattr(self.model, method) gal_type_slice = self._gal_type_indices[func.gal_type] func(table = self.galaxy_table[gal_type_slice]) We perform a loop over all functions whose names remain in ``self._remaining_methods_to_call``. All functions appearing in this list have a ``gal_type`` attribute bound to them that was created by the `HodModelFactory` during the assembly of the composite model (see the :ref:`hod_model_factory_inheriting_behaviors` section of the :ref:`hod_model_factory_source_notes` for further details about how the `HodModelFactory` binds the appropriate ``gal_type`` information to each mock-generation function). This ``gal_type`` determines the slice of the ``galaxy_table`` that will be passed to the function via the ``table`` keyword argument. This way, each component model does not need to manage any bookkeeping about which slice of an input table it assigns its properties to: the `HodMockFactory` is responsible for passing the slice of the ``galaxy_table`` that pertains to the ``gal_type`` of the component model. The generation of all properties predicted by the model is now complete. The `HodMockFactory.populate` function terminates after enforcing the periodic boundary conditions of the simulation. This enforcement is controlled by the `HodMockFactory` because component models of halo profiles strictly assign halo-centric positions. Thus corrections need to be applied to cases where galaxies are assigned to regions of the halo that spilled over the edges of the simulation box. .. _which_data_table_hod_mock_making: Summary: which data table is passed to the ``table`` keyword argument in HOD mock-making? =========================================================================================== As described in detail in :ref:`hod_modeling_tutorial0`, all methods in the ``_mock_generation_calling_sequence`` of any HOD-style composite model must accept a ``table`` keyword argument. As shown above, the actual data structure passed via this argument depends on the phase of the mock-making in which the function is called. We conclude this tutorial by quickly summarizing which data structure gets passed to which function. 1. Mock-generation functions whose names appear *prior* to the first appearance of a ``mc_occupation_`` function get passed the *entire* post-processed ``model.mock.halo_table`` to their ``table`` keyword argument. 2. ``mc_occupation_`` functions also receive the entire ``model.mock.halo_table`` via their ``table`` keyword argument. 3. Mock-generation functions whose names appear *after* the first ``mc_occupation_`` function get passed a *slice* of the ``model.mock.galaxy_table``; the slice they are passed is the section of the table that pertains to the ``gal_type`` the function is associated with.