TabularAsciiReader
- class halotools.sim_manager.TabularAsciiReader(input_fname, columns_to_keep_dict, header_char='#', row_cut_min_dict={}, row_cut_max_dict={}, row_cut_eq_dict={}, row_cut_neq_dict={}, num_lines_header=None)
Bases: object
Class providing a memory-efficient algorithm for reading a very large ASCII file that stores tabular data whose data types are known in advance.
When reading ASCII data with read_ascii, user-defined cuts are applied on the fly: a Python generator yields only those columns whose indices appear in the input columns_to_keep_dict.
As the file is read, the data is generated in chunks, and a customizable mask is applied to each newly generated chunk. The only data aggregated from each chunk are those rows passing all requested cuts, so the TabularAsciiReader only requires enough RAM to store the cut catalog, not the entire ASCII file.
The primary method of the class is read_ascii. The output of this method is a structured Numpy array, which can then be stored in your preferred binary format using the built-in Numpy methods, h5py, etc. If you wish to store the catalog in the Halotools cache, you should instead use the RockstarHlistReader class.
The algorithm assumes that data of known, unchanging type is arranged in a consecutive sequence of lines within the ASCII file, that the data stream begins with the first line that does not begin with header_char, and that the first subsequent appearance of an empty line demarcates the end of the data stream.
- Parameters:
- input_fname : string
Absolute path to the file storing the ASCII data.
- columns_to_keep_dict : dict
Dictionary used to define which columns of the tabular ASCII data will be kept.
Each key of the dictionary will be the name of the column in the returned data table. The value bound to each key is a two-element tuple. The first tuple entry is an integer providing the index of the column to be kept, starting from 0. The second tuple entry is a string defining the Numpy dtype of the data in that column, e.g., 'f4' for a float, 'f8' for a double, or 'i8' for a long.
Thus an example columns_to_keep_dict could be {'x': (1, 'f4'), 'y': (0, 'i8'), 'z': (9, 'f4')}. In this case, the structured array returned by the read_ascii method would have three keys: x storing a float for the data in the second column of the ASCII file, y storing a long integer for the data in the first column of the ASCII file, and z storing a float for the data in the tenth column of the ASCII file.
- header_char : str, optional
String that marks a header line: lines at the beginning of the ASCII file that start with this string are interpreted as header. Default is '#'. You can alternatively use the optional num_lines_header argument.
- num_lines_header : int, optional
Number of lines in the header. Default is None, in which case the header length will be determined by the header_char argument.
- row_cut_min_dict : dict, optional
Dictionary used to place a lower-bound cut on the rows of the tabular ASCII data.
Each key of the dictionary must also be a key of the input columns_to_keep_dict; for purposes of good bookkeeping, you are not permitted to place a cut on a column that you do not keep. The value bound to each key serves as the lower bound on the data stored in that column. A row with a smaller value than this lower bound for the corresponding column will not appear in the returned data table.
For example, if row_cut_min_dict = {'mass': 1e10}, then all rows of the returned data table will have a mass of at least 1e10.
- row_cut_max_dict : dict, optional
Dictionary used to place an upper-bound cut on the rows of the tabular ASCII data.
Each key of the dictionary must also be a key of the input columns_to_keep_dict; for purposes of good bookkeeping, you are not permitted to place a cut on a column that you do not keep. The value bound to each key serves as the upper bound on the data stored in that column. A row with a larger value than this upper bound for the corresponding column will not appear in the returned data table.
For example, if row_cut_max_dict = {'mass': 1e15}, then all rows of the returned data table will have a mass of at most 1e15.
- row_cut_eq_dict : dict, optional
Dictionary used to place an equality cut on the rows of the tabular ASCII data.
Each key of the dictionary must also be a key of the input columns_to_keep_dict; for purposes of good bookkeeping, you are not permitted to place a cut on a column that you do not keep. The value bound to each key serves as the required value for the data stored in that column. Only rows with a value equal to this required value for the corresponding column will appear in the returned data table.
For example, if row_cut_eq_dict = {'upid': -1}, then all rows of the returned data table will have a upid of -1.
- row_cut_neq_dict : dict, optional
Dictionary used to place an inequality cut on the rows of the tabular ASCII data.
Each key of the dictionary must also be a key of the input columns_to_keep_dict; for purposes of good bookkeeping, you are not permitted to place a cut on a column that you do not keep. The value bound to each key serves as a forbidden value for the data stored in that column. Rows with a value equal to this forbidden value for the corresponding column will not appear in the returned data table.
For example, if row_cut_neq_dict = {'upid': -1}, then no rows of the returned data table will have a upid of -1.
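The conventions described for columns_to_keep_dict and the four row-cut dictionaries can be illustrated with a minimal sketch. The helper names build_dtype and check_cut_keys are hypothetical and are not part of Halotools, and ordering the fields by column index is an assumption made only for this illustration.

```python
import numpy as np


def build_dtype(columns_to_keep_dict):
    """Return a structured dtype whose fields are ordered by column index.

    Hypothetical helper: the ordering is an assumption of this sketch.
    """
    ordered = sorted(columns_to_keep_dict.items(), key=lambda item: item[1][0])
    return np.dtype([(name, spec[1]) for name, spec in ordered])


def check_cut_keys(columns_to_keep_dict, **cut_dicts):
    """Raise KeyError if any cut refers to a column that is not kept."""
    for cut_name, cut_dict in cut_dicts.items():
        for key in cut_dict:
            if key not in columns_to_keep_dict:
                raise KeyError(
                    "%s refers to '%s', which is not a key of "
                    "columns_to_keep_dict" % (cut_name, key))


cols = {'x': (1, 'f4'), 'y': (0, 'i8'), 'z': (9, 'f4')}
dt = build_dtype(cols)

# A cut on a kept column passes the check silently.
check_cut_keys(cols, row_cut_min_dict={'x': 0.0})
```

A cut on a column that is absent from cols, such as {'mass': 1e10}, would raise KeyError in this sketch, mirroring the bookkeeping rule stated above.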
Examples
Suppose you are only interested in reading the tenth and fifth columns of data of your ascii file, and that these columns store a float variable you want to call mass, and a long integer variable you want to call id, respectively. If you want a Numpy structured array storing all rows of these two columns:
>>> from halotools.sim_manager import TabularAsciiReader
>>> cols = {'mass': (9, 'f4'), 'id': (4, 'i8')}
>>> reader = TabularAsciiReader(fname, cols)
>>> arr = reader.read_ascii()
If you are only interested in rows where mass exceeds 1e10:
>>> row_cut_min_dict = {'mass': 1e10}
>>> reader = TabularAsciiReader(fname, cols, row_cut_min_dict=row_cut_min_dict)
>>> arr = reader.read_ascii()
Finally, suppose the fortieth column stores an integer called resolved, and in addition to the above mass cut, you do not wish to store any rows for which the resolved column value equals zero. As described above, you are not permitted to make a row-cut on a column that you do not keep, so in addition to defining the new row cut, you must also include the resolved column in your columns_to_keep_dict:
>>> cols = {'mass': (9, 'f4'), 'id': (4, 'i8'), 'resolved': (39, 'i4')}
>>> row_cut_neq_dict = {'resolved': 0}
>>> reader = TabularAsciiReader(fname, cols, row_cut_neq_dict=row_cut_neq_dict, row_cut_min_dict=row_cut_min_dict)
>>> arr = reader.read_ascii()
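Once read_ascii has returned, the structured array can be written to a binary format with the built-in Numpy methods. The sketch below uses numpy.save and numpy.load on a small stand-in array; the array values and the file name cut_catalog.npy are made up for illustration, and no Halotools call is involved.

```python
import os
import tempfile

import numpy as np

# Stand-in for the array returned by read_ascii(); the values are invented.
arr = np.array([(2.5e12, 101), (7.1e10, 102)],
               dtype=[('mass', 'f4'), ('id', 'i8')])

with tempfile.TemporaryDirectory() as tmpdir:
    out_fname = os.path.join(tmpdir, 'cut_catalog.npy')  # hypothetical name
    np.save(out_fname, arr)        # writes the dtype and data to a .npy file
    restored = np.load(out_fname)  # column names and dtypes are preserved
```

The .npy format keeps the field names and dtypes, so restored['mass'] and restored['id'] can be accessed exactly as on the original array.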
Methods Summary
- apply_row_cut(array_chunk): Method applies a boolean mask to the input array based on the row-cuts determined by the dictionaries passed to the constructor.
- data_chunk_generator(chunk_size, f): Python generator uses f.readline() to march through an input open file object to yield a chunk of data with length equal to the input chunk_size.
- data_len(): Number of rows of data in the input ASCII file.
- header_len(): Number of rows in the header of the ASCII file.
- read_ascii([chunk_memory_size]): Method reads the input ASCII file and returns a structured Numpy array of the data that passes the row- and column-cuts.
Methods Documentation
- apply_row_cut(array_chunk)
Method applies a boolean mask to the input array based on the row-cuts determined by the dictionaries passed to the constructor.
- Parameters:
- array_chunk : Numpy array
- Returns:
- cut_array : Numpy array
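The masking step can be sketched as follows. This is a simplified stand-in rather than the Halotools implementation: the function name apply_row_cut_sketch and its keyword arguments are hypothetical, and the bounds are treated as inclusive, following the parameter descriptions (only rows strictly below the lower bound or strictly above the upper bound are dropped).

```python
import numpy as np


def apply_row_cut_sketch(array_chunk, cut_min=None, cut_max=None,
                         cut_eq=None, cut_neq=None):
    """Return the rows of a structured array that pass every cut."""
    mask = np.ones(len(array_chunk), dtype=bool)
    for key, lower in (cut_min or {}).items():
        mask &= array_chunk[key] >= lower       # drop rows below the bound
    for key, upper in (cut_max or {}).items():
        mask &= array_chunk[key] <= upper       # drop rows above the bound
    for key, required in (cut_eq or {}).items():
        mask &= array_chunk[key] == required    # keep only the required value
    for key, forbidden in (cut_neq or {}).items():
        mask &= array_chunk[key] != forbidden   # drop the forbidden value
    return array_chunk[mask]


# Invented example data: four rows with a mass and a upid column.
chunk = np.array([(1e9, -1), (5e11, -1), (3e12, 42), (2e15, -1)],
                 dtype=[('mass', 'f8'), ('upid', 'i8')])
cut = apply_row_cut_sketch(chunk, cut_min={'mass': 1e10},
                           cut_max={'mass': 1e15}, cut_eq={'upid': -1})
```

Here only the second row survives: the first fails the lower mass bound, the third fails the upid equality cut, and the fourth fails the upper mass bound.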
- data_chunk_generator(chunk_size, f)
Python generator uses f.readline() to march through an input open file object to yield a chunk of data with length equal to the input chunk_size. The generator only yields columns that were included in the columns_to_keep_dict passed to the constructor.
- Parameters:
- chunk_size : int
Number of rows of data in the chunk being generated.
- f : File
Open file object being read.
- Returns:
- chunk : tuple
Tuple of data from the ASCII file. Only data from column_indices_to_keep are yielded.
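A generator of this kind might look like the sketch below. It is an illustration under stated assumptions, not the Halotools source: the name row_generator_sketch is hypothetical, the column indices are passed in explicitly instead of being read from the instance, columns are assumed to be whitespace-delimited, and the generator stops early at an empty line or at the end of the file.

```python
import io


def row_generator_sketch(chunk_size, f, column_indices_to_keep):
    """Yield up to chunk_size rows from f, keeping only the requested columns.

    Each yielded row is a tuple of strings; conversion to numeric types
    is left to the caller.
    """
    for _ in range(chunk_size):
        line = f.readline()
        if not line.strip():  # empty line or end of file ends the data
            return
        fields = line.split()
        yield tuple(fields[i] for i in column_indices_to_keep)


# In-memory stand-in for an open ASCII file with one header line.
f = io.StringIO("# id mass x\n1 1e12 0.5\n2 3e10 1.5\n3 8e13 2.5\n")
f.readline()  # skip the header line

first = list(row_generator_sketch(2, f, [0, 1]))
second = list(row_generator_sketch(2, f, [0, 1]))
```

Because the generator reads with f.readline(), the second call resumes where the first stopped, so only one chunk of rows needs to be held in memory at a time.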
- data_len()
Number of rows of data in the input ASCII file.
- Returns:
- Nrows_data : int
Total number of rows of data.
Notes
The returned value is computed as the number of lines between the end of the header, as given by header_len, and the next appearance of an empty line.
The data_len method is the particular section of code where the following assumptions are made:
- The data begins with the first appearance of a non-empty line that does not begin with the character defined by self.header_char.
- The data ends with the next appearance of an empty line.
- header_len()
Number of rows in the header of the ASCII file.
- Parameters:
- fname : string
- Returns:
- Nheader : int
Notes
The header is assumed to be those lines at the beginning of the file that begin with self.header_char.
All empty lines that appear in the header will be included in the count.
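The counting rules in the Notes for data_len and header_len can be sketched as below. The function names are hypothetical, and for brevity the sketch operates on a list of lines held in memory, whereas a reader for very large files would stream through the file instead.

```python
def header_len_sketch(lines, header_char='#'):
    """Count leading lines that are empty or begin with header_char."""
    n_header = 0
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith(header_char):
            break  # first non-empty, non-header line: the data begins here
        n_header += 1
    return n_header


def data_len_sketch(lines, header_char='#'):
    """Count lines after the header up to the next empty line."""
    n_header = header_len_sketch(lines, header_char)
    n_data = 0
    for line in lines[n_header:]:
        if not line.strip():
            break  # an empty line marks the end of the data stream
        n_data += 1
    return n_data


# Invented example: the empty line inside the header is counted as header,
# and the row after the second empty line is not counted as data.
lines = ["# id mass", "", "# units: Msun",
         "1 1e12", "2 3e10", "3 8e13", "", "9 9e9"]
n_header = header_len_sketch(lines)
n_data = data_len_sketch(lines)
```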
- read_ascii(chunk_memory_size=500)
Method reads the input ASCII file and returns a structured Numpy array of the data that passes the row- and column-cuts.
- Parameters:
- chunk_memory_size : int, optional
Approximate amount of memory, in megabytes, to be processed in each chunk. This value must be smaller than the amount of RAM on your machine; choosing larger values typically improves performance. Default is 500 MB.
- Returns:
- full_array : array_like
Structured Numpy array storing the rows and columns that pass the input cuts. The columns of this array are those selected by the columns_to_keep_dict argument passed to the constructor.
See also