TabularAsciiReader

class halotools.sim_manager.TabularAsciiReader(input_fname, columns_to_keep_dict, header_char='#', row_cut_min_dict={}, row_cut_max_dict={}, row_cut_eq_dict={}, row_cut_neq_dict={}, num_lines_header=None)[source]

Bases: object

Class providing a memory-efficient algorithm for reading a very large ascii file that stores tabular data of a data type that is known in advance.

When reading ASCII data with read_ascii, user-defined cuts on columns are applied on-the-fly using a python generator to yield only those columns whose indices appear in the input columns_to_keep_dict.

As the file is read, the data is generated in chunks, and a customizable mask is applied to each newly generated chunk. The only aggregated data from each chunk are those rows passing all requested cuts, so that the TabularAsciiReader only requires you to have enough RAM to store the cut catalog, not the entire ASCII file.

The primary method of the class is read_ascii. The output of this method is a structured Numpy array, which can then be stored in your preferred binary format using the built-in Numpy methods, h5py, etc. If you wish to store the catalog in the Halotools cache, you should instead use the RockstarHlistReader class.
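
For instance, persisting the returned structured array with NumPy's built-in binary format might look like the following sketch (the array here is mocked rather than produced by read_ascii, and the dtype is a hypothetical example):

```python
import os
import tempfile
import numpy as np

# Mock of a structured array as returned by read_ascii; in practice
# the dtype is determined by your columns_to_keep_dict.
arr = np.array([(1.0e10, 5), (2.0e10, 7)],
               dtype=[('mass', 'f4'), ('id', 'i8')])

# Persist in NumPy's binary .npy format (h5py would work similarly).
fname = os.path.join(tempfile.mkdtemp(), 'catalog.npy')
np.save(fname, arr)

# Reload later without re-parsing the original ASCII file.
arr2 = np.load(fname)
```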

The algorithm assumes that data of known, unchanging type is arranged in a consecutive sequence of lines within the ascii file, that the data stream begins with the first line that does not begin with the header_char, and that the first subsequent appearance of an empty line marks the end of the data stream.

Parameters:

input_fname : string

Absolute path to the file storing the ASCII data.

columns_to_keep_dict : dict

Dictionary used to define which columns of the tabular ASCII data will be kept.

Each key of the dictionary will be the name of the column in the returned data table. The value bound to each key is a two-element tuple. The first tuple entry is an integer providing the index of the column to be kept, starting from 0. The second tuple entry is a string defining the Numpy dtype of the data in that column, e.g., ‘f4’ for a float, ‘f8’ for a double, or ‘i8’ for a long.

Thus an example columns_to_keep_dict could be {‘x’: (1, ‘f4’), ‘y’: (0, ‘i8’), ‘z’: (9, ‘f4’)}. In this case, the structured array returned by the read_ascii method would have three keys: x storing a float for the data in the second column of the ASCII file, y storing a long integer for the data in the first column of the ASCII file, and z storing a float for the data in the tenth column of the ASCII file.
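
The correspondence between the dict and the output dtype can be sketched directly with NumPy; this mirrors the behavior for illustration, using the hypothetical dict from the example above:

```python
import numpy as np

# Hypothetical columns_to_keep_dict from the example above.
columns_to_keep_dict = {'x': (1, 'f4'), 'y': (0, 'i8'), 'z': (9, 'f4')}

# The (name, dtype) pairs determine the structured dtype of the
# returned array; the integer entries select which ASCII columns
# are read.
dt = np.dtype([(name, typ) for name, (idx, typ) in columns_to_keep_dict.items()])
```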

header_char : str, optional

Any line at the beginning of the ascii file beginning with this string is interpreted as part of the header. Default is ‘#’. Can alternatively use the num_lines_header optional argument.

num_lines_header : int, optional

Number of lines in the header. Default is None, in which case header length will be determined by header_char argument.

row_cut_min_dict : dict, optional

Dictionary used to place a lower-bound cut on the rows of the tabular ASCII data.

Each key of the dictionary must also be a key of the input columns_to_keep_dict; for purposes of good bookkeeping, you are not permitted to place a cut on a column that you do not keep. The value bound to each key serves as the lower bound on the data stored in that column. A row with a smaller value than this lower bound for the corresponding column will not appear in the returned data table.

For example, if row_cut_min_dict = {‘mass’: 1e10}, then all rows of the returned data table will have a mass greater than 1e10.

row_cut_max_dict : dict, optional

Dictionary used to place an upper-bound cut on the rows of the tabular ASCII data.

Each key of the dictionary must also be a key of the input columns_to_keep_dict; for purposes of good bookkeeping, you are not permitted to place a cut on a column that you do not keep. The value bound to each key serves as the upper bound on the data stored in that column. A row with a larger value than this upper bound for the corresponding column will not appear in the returned data table.

For example, if row_cut_max_dict = {‘mass’: 1e15}, then all rows of the returned data table will have a mass less than 1e15.

row_cut_eq_dict : dict, optional

Dictionary used to place an equality cut on the rows of the tabular ASCII data.

Each key of the dictionary must also be a key of the input columns_to_keep_dict; for purposes of good bookkeeping, you are not permitted to place a cut on a column that you do not keep. The value bound to each key serves as the required value for the data stored in that column. Only rows with a value equal to this required value for the corresponding column will appear in the returned data table.

For example, if row_cut_eq_dict = {‘upid’: -1}, then all rows of the returned data table will have an upid of -1.

row_cut_neq_dict : dict, optional

Dictionary used to place an inequality cut on the rows of the tabular ASCII data.

Each key of the dictionary must also be a key of the input columns_to_keep_dict; for purposes of good bookkeeping, you are not permitted to place a cut on a column that you do not keep. The value bound to each key serves as a forbidden value for the data stored in that column. Rows with a value equal to this forbidden value for the corresponding column will not appear in the returned data table.

For example, if row_cut_neq_dict = {‘upid’: -1}, then no rows of the returned data table will have an upid of -1.

Examples

Suppose you are only interested in reading the tenth and fifth columns of data of your ascii file, and that these columns store a float variable you want to call mass, and a long integer variable you want to call id, respectively. If you want a Numpy structured array storing all rows of these two columns:

>>> cols = {'mass': (9, 'f4'), 'id': (4, 'i8')}
>>> reader = TabularAsciiReader(fname, cols) 
>>> arr = reader.read_ascii() 

If you are only interested in rows where mass exceeds 1e10:

>>> row_cut_min_dict = {'mass': 1e10}
>>> reader = TabularAsciiReader(fname, cols, row_cut_min_dict = row_cut_min_dict) 
>>> arr = reader.read_ascii() 

Finally, suppose the fortieth column stores an integer called resolved, and in addition to the above mass cut, you do not wish to store any rows for which the resolved column value equals zero. As described above, you are not permitted to make a row-cut on a column that you do not keep, so in addition to defining the new row cut, you must also include the resolved column in your columns_to_keep_dict:

>>> cols = {'mass': (9, 'f4'), 'id': (4, 'i8'), 'resolved': (39, 'i4')}
>>> row_cut_neq_dict = {'resolved': 0}
>>> reader = TabularAsciiReader(fname, cols, row_cut_neq_dict = row_cut_neq_dict, row_cut_min_dict = row_cut_min_dict) 
>>> arr = reader.read_ascii() 
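
Because the result is an ordinary structured array, the kept columns are accessed by the names chosen in columns_to_keep_dict, and further cuts can be made after the fact. A brief sketch, with a mocked array standing in for an actual read:

```python
import numpy as np

# Mocked output standing in for reader.read_ascii().
arr = np.array([(2.3e10, 101, 1), (5.0e12, 102, 1)],
               dtype=[('mass', 'f4'), ('id', 'i8'), ('resolved', 'i4')])

# Columns are accessed by name; additional cuts can be applied later.
massive = arr[arr['mass'] > 1e11]
```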

Methods Summary

apply_row_cut(array_chunk) Method applies a boolean mask to the input array based on the row-cuts determined by the dictionaries passed to the constructor.
data_chunk_generator(chunk_size, f) Python generator uses f.readline() to march through an input open file object to yield a chunk of data with length equal to the input chunk_size.
data_len() Number of rows of data in the input ASCII file.
header_len() Number of rows in the header of the ASCII file.
read_ascii([chunk_memory_size]) Method reads the input ascii and returns a structured Numpy array of the data that passes the row- and column-cuts.

Methods Documentation

apply_row_cut(array_chunk)[source]

Method applies a boolean mask to the input array based on the row-cuts determined by the dictionaries passed to the constructor.

Parameters:array_chunk : Numpy array
Returns:cut_array : Numpy array
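
The masking logic can be sketched as follows. This is an illustrative reimplementation, not the actual halotools source: each cut dictionary contributes one boolean condition, and only rows passing every condition survive.

```python
import numpy as np

def apply_row_cut_sketch(chunk, row_cut_min=None, row_cut_max=None,
                         row_cut_eq=None, row_cut_neq=None):
    # Start with an all-True mask and AND in one condition per cut.
    mask = np.ones(len(chunk), dtype=bool)
    for key, val in (row_cut_min or {}).items():
        mask &= chunk[key] >= val
    for key, val in (row_cut_max or {}).items():
        mask &= chunk[key] <= val
    for key, val in (row_cut_eq or {}).items():
        mask &= chunk[key] == val
    for key, val in (row_cut_neq or {}).items():
        mask &= chunk[key] != val
    return chunk[mask]

chunk = np.array([(1e9, -1), (1e11, -1), (1e11, 3)],
                 dtype=[('mass', 'f8'), ('upid', 'i8')])
cut = apply_row_cut_sketch(chunk, row_cut_min={'mass': 1e10},
                           row_cut_eq={'upid': -1})
```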
data_chunk_generator(chunk_size, f)[source]

Python generator uses f.readline() to march through an input open file object to yield a chunk of data with length equal to the input chunk_size. The generator only yields columns that were included in the columns_to_keep_dict passed to the constructor.

Parameters:

chunk_size : int

Number of rows of data in the chunk being generated

f : File

Open file object being read

Returns:

chunk : tuple

Tuple of data from the ascii. Only data from column_indices_to_keep are yielded.
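
A minimal sketch of the chunking pattern, using hypothetical column indices (1, 0); the layout of the tuples the real generator yields may differ in detail:

```python
import io

def data_chunk_generator_sketch(chunk_size, f, column_indices):
    # Accumulate rows until the chunk is full, keeping only the
    # requested column indices from each line.
    cur = []
    for line in f:
        fields = line.split()
        cur.append(tuple(fields[i] for i in column_indices))
        if len(cur) == chunk_size:
            yield tuple(cur)
            cur = []
    if cur:
        yield tuple(cur)  # final, possibly short, chunk

f = io.StringIO("0 10\n1 11\n2 12\n")
chunks = list(data_chunk_generator_sketch(2, f, (1, 0)))
```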

data_len()[source]

Number of rows of data in the input ASCII file.

Returns:

Nrows_data : int

Total number of rows of data.

Notes

The returned value is computed as the number of lines between the returned value of header_len and the next appearance of an empty line.

The data_len method is the particular section of code where the following assumptions are made:

  1. The data begins with the first appearance of a non-empty line that does not begin with the character defined by self.header_char.
  2. The data ends with the next appearance of an empty line.
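The two assumptions above can be sketched as a simple line count; this is an illustration of the convention, not the actual halotools source:

```python
import io

def data_len_sketch(f, header_char='#'):
    # Count lines from the first non-header, non-empty line up to
    # the next empty line.
    n = 0
    for line in f:
        stripped = line.strip()
        if n == 0 and (stripped.startswith(header_char) or stripped == ''):
            continue  # still inside the header region
        if stripped == '':
            break     # an empty line ends the data stream
        n += 1
    return n

f = io.StringIO("# col names\n1 2 3\n4 5 6\n\n# trailing comments\n")
nrows = data_len_sketch(f)
```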
header_len()[source]

Number of rows in the header of the ASCII file.

Returns:Nheader : int

Notes

The header is assumed to be those characters at the beginning of the file that begin with self.header_char.

All empty lines that appear in header will be included in the count.
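
The counting convention described above can be sketched as follows; this is an illustration, not the actual halotools source:

```python
import io

def header_len_sketch(f, header_char='#'):
    # Count the leading lines that begin with header_char, including
    # any empty lines interleaved among them.
    n = 0
    for line in f:
        if line.startswith(header_char) or line.strip() == '':
            n += 1
        else:
            break
    return n

f = io.StringIO("# column names\n\n# units\n1 2 3\n")
nheader = header_len_sketch(f)
```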

read_ascii(chunk_memory_size=500)[source]

Method reads the input ascii and returns a structured Numpy array of the data that passes the row- and column-cuts.

Parameters:

chunk_memory_size : int, optional

Approximate amount of memory, in megabytes, that will be processed in chunks at a time. This value must be smaller than the amount of RAM on your machine; choosing larger values typically improves performance. Default is 500 MB.

Returns:

full_array : array_like

Structured Numpy array storing the rows and columns that pass the input cuts. The columns of this array are those selected by the columns_to_keep_dict argument passed to the constructor.
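
How a megabyte budget translates into a chunk length in rows can be sketched from the itemsize of the output dtype; the conversion inside read_ascii may differ in detail, and the dtype below is a hypothetical example:

```python
import numpy as np

# Hypothetical output dtype: one 'f4' float plus one 'i8' integer.
dt = np.dtype([('mass', 'f4'), ('id', 'i8')])

chunk_memory_size = 500                           # megabytes, the default
bytes_per_row = dt.itemsize                       # 4 + 8 = 12 bytes
rows_per_chunk = chunk_memory_size * 1024**2 // bytes_per_row
```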