A new high-performance, memory-efficient file parser engine for pandas

TL;DR I’ve finally gotten around to building the high-performance parser engine that pandas deserves. It hasn’t been released yet (it’s in a branch on GitHub), but it will be after I give it a month or so for any remaining buglets to shake out.

A project I’ve put off for a long time is building a high-performance, memory-efficient file parser for pandas. The existing code, up through and including the imminent pandas 0.9.0 release, has always been makeshift; the development focus has been on parser features over the more tedious (but actually much more straightforward) issue of creating a fast C table tokenizer. It’s been on the pandas roadmap for a long time:

http://github.com/pydata/pandas/issues/821

pandas.read_csv from pandas 0.5.0 onward is actually very fast (faster than R and much faster than numpy.loadtxt) but it uses a lot of memory. I wrote about some of the implementation issues about a year ago here. The key problem with the existing code is this: all of the existing parsing solutions, in pandas as well as NumPy, first read the file data into pure Python data structures: a list of tuples or a list of lists. If you have a very large file, a list of 1 million or 10 million Python tuples has an extraordinary memory footprint: significantly greater than the size of the file on disk (5x or more is common, which is far too much). Some people have pointed out the large memory usage without correctly explaining why, but this is the one and only reason: too many intermediate Python data structures.

Building a good parser engine isn’t exactly rocket science; we’re talking about optimizing the implementation of dirt-simple O(n) algorithms here. The task is divided into several key pieces (a sketch of how these pieces surface as read_csv options follows the list):

  • File tokenization: read bytes from the file, identify where fields begin and end and which column each belongs to. Python’s csv module is an example of a tokenizer. Things like quoting conventions need to be taken into account. Doing this well in C is about picking the right data structures and making the code lean and mean. To be clear: if you design the tokenizer data structure wrong, you’ve lost before you’ve begun.
  • NA value filtering: detect NA (missing) value sentinels and convert them to the appropriate NA representation. Examples of NA sentinels are NA, #N/A, or other bespoke sentinels like -999. Practically speaking, this means keeping a hash set of strings considered NA and checking whether each parsed token is in the set (and you can have different NA sets for each column, too!). If the number of sentinel values is small, you could use an array of C strings instead of a hash set.
  • Tolerating “bad” rows: can aberrant rows be gracefully skipped, with your consent? Is the error message informative?
  • Type inference / conversion: Converting the tokens in the file to the right C types (string, date, floating point, integer, boolean).
  • Skipping rows: ignore certain rows in the file or at the end of the file.
  • Date parsing / value conversion: Convert one or more columns into timestamps. In some cases concatenate date/time information spread across multiple columns.
  • Handling of “index” columns: Handle row names appropriately, yielding a DataFrame with the expected row index.
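
Most of these pieces surface directly as options to read_csv. Here’s a minimal sketch of how they map onto the API, assuming a hypothetical file with hypothetical column names:

    import pandas as pd

    # A sketch only: 'data.csv' and the column names are made up
    df = pd.read_csv(
        'data.csv',
        sep=',',                     # tokenization: the field delimiter
        na_values=['#N/A', '-999'],  # NA sentinel filtering
        skiprows=[1],                # skip specific physical rows
        parse_dates=['date'],        # date parsing / value conversion
        index_col='id',              # "index" column handling
    )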

None of this is that hard; it’s made much more time-consuming by the proliferation of fine-grained options (and the resulting “parameter hell”). Anyway, I finally mustered the energy to hack it out over a few intense days in late August and September. I’m hoping to ship it in a quick pandas 0.10 release (“version point-ten”) toward the end of October if possible. It would be nice to push this code upstream into NumPy to improve loadtxt and genfromtxt’s performance as well.

    Benchmarks against R, NumPy, Continuum’s IOPro

    Outside of parser features (i.e. “can the tool read my file correctly”), there are two performance areas of interest:

  • CPU Speed: how long does it take to parse the file?
  • Memory utilization: what’s the maximum amount of RAM used while the file is being parsed (including the final returned table)? There’s really nothing worse than your computer starting to swap when you try to parse a large file.
I’ll compare the new pandas parser engine against a group of several tools that you can use to do the same job, including R’s parser functions:

  • R’s venerable read.csv and read.table functions
  • numpy.loadtxt: this is a pure Python parser, to be clear.
  • New pandas engine, via pandas.read_csv and read_table
  • A new commercial library, IOPro, from my good friends at Continuum Analytics.
To do the performance analysis, I’ll look at 5 representative data sets:

  • A 100,000 x 50 CSV matrix of randomly generated 0s and 1s (a sketch for regenerating the synthetic matrices appears after this list). It looks like this:
  • A 1,000,000 x 10 CSV matrix of randomly generated normally distributed data. Looks like this:
  • The Federal Election Commission (FEC) data set as a CSV file. One of my favorite example data sets for talking about pandas. Here’s what it looks like when parsed with pandas.read_csv:
  • Wikipedia page count data used for benchmarks in this blog post. It’s delimited by single spaces and has no column header:
  • A large numerical astronomy data set used for benchmarks in this blog post. Looks like this:
Here’s a link to an archive of all the datasets (warning: about 500 megabytes): Table datasets
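
If you want to regenerate the two synthetic matrices, something along these lines will do it (a sketch; the file names are my own):

    import numpy as np
    import pandas as pd

    # 100,000 x 50 matrix of random 0s and 1s (file names are made up)
    zeros_ones = np.random.randint(0, 2, size=(100000, 50))
    pd.DataFrame(zeros_ones).to_csv('zero-matrix.csv', index=False)

    # 1,000,000 x 10 matrix of standard normal draws
    normals = np.random.randn(1000000, 10)
    pd.DataFrame(normals).to_csv('double-matrix.csv', index=False)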

    I don’t have time to compare features (which vary greatly across the tools).

    Oh, and my rig:

  • Core i7 950 @ 3.07 GHz
  • 24 GB of RAM (so we won’t get close to swapping)
  • OCZ Vertex 3 SATA 3 SSD

Because I have an SSD, I would expect the benchmarks for spinning rust to differ roughly by a constant amount based on the read time for slurping the bytes off the disk. In my case, the disk reads aren’t a major factor. In corporate environments with NFS servers under heavy load, you would expect similar reads to take a bit longer.

    CPU Performance benchmarks

It turns out numpy.loadtxt is really slow, so I’m excluding it from the benchmarks. On the smallest and simplest file here, it’s more than 10 times slower than the new pandas parser:

    Here are the results for everybody else (see code at end of post):

    Here are the raw numbers in seconds:

    In [30]: results
    Out[30]:
                       iopro    pandas       R
    astro          17.646228  6.955254  37.030
    double-matrix   3.377430  1.279502   6.920
    fec             3.685799  2.306570  18.121
    wikipedia      11.752624  4.369659  42.250
    zero-matrix     0.673885  0.268830   0.616
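
A best-of-three timing loop along these lines would reproduce the pandas numbers (a simplified sketch, not the actual benchmark code, which appears at the end of the post):

    import time
    import pandas as pd

    def time_read(path, n=3):
        # Best-of-n wall-clock time for parsing a single file
        best = float('inf')
        for _ in range(n):
            start = time.time()
            pd.read_csv(path)
            best = min(best, time.time() - start)
        return best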

IOPro vs. new pandas parser: a closer look

But hey, wait a second. If you are intimately familiar with IOPro and pandas, you will already be saying that I’m not making an apples-to-apples comparison. True. Why not?

  • IOPro does not check for and substitute common NA sentinel values (I believe you can give it a list of values to check for; the documentation was a bit hard to work out in this regard)
  • IOPro returns NumPy arrays with a structured dtype. The pandas DataFrame has a slightly different internal format, and strings are boxed as Python objects rather than stored in NumPy string-dtype arrays
To level the playing field, I’ll disable the NA filtering logic in pandas (passing na_filter=False) and instruct the parser to return a structured array instead of a DataFrame (as_recarray=True); I’ll show the exact call below. Second, let’s look only at the numerical datasets (excluding wikipedia and fec for now) to remove the impact of handling string data types. Here is the resulting graph (with relative timings):

It looks like the savings from not passing all the tokens through the NA filter are balanced by the cost of transferring the column arrays into the structured array (which is a raw array of bytes interpreted as a table by NumPy). This could very likely be made faster (more cache-efficient) than it currently is with some effort.
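
Concretely, the leveled call looks something like this (the file name is hypothetical; na_filter and as_recarray are the options described above):

    import pandas as pd

    # NA filtering disabled, structured-array output requested
    arr = pd.read_csv('double-matrix.csv', na_filter=False,
                      as_recarray=True)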

    Memory usage benchmarks

    Profiling peak memory usage is a tedious process. The canonical tool for the job is Massif from the Valgrind suite. I’m not yet done obsessing over memory allocation and data management inside the parser system, but here’s what the numbers look like compared with R and IOPro. I’m using the following valgrind commands (plus ms_print) to get this output (if this is not correct, please someone tell me):
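
The canonical Massif invocation looks like this (bench_parser.py is a stand-in for whichever benchmark script is being profiled):

    # profile heap usage under Massif, then summarize with ms_print
    valgrind --tool=massif python bench_parser.py
    ms_print massif.out.<pid>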

    I’ll use the largest file in this post, the astro numerical dataset.

First, IOPro advertises a very low memory footprint. It does not, however, avoid having 2 copies of the data set in memory (I don’t either; it’s actually very difficult, and costly, to avoid this). Here is the final output of ms_print showing peak memory usage at the very end, when the structured array is created and returned:

Let’s look at R. Peak memory allocation comes in slightly under IOPro, at 903 million bytes vs. 912 million:

In the new pandas parser, I’ll look at 2 things: memory allocation by the parser engine before creation of the final DataFrame (which causes data doubling, as with IOPro) and the user-facing read_csv. First, the profile of using read_csv (which also creates a simple integer Index for the DataFrame) uses 1014 million bytes, about 10% more than either of the above:

Considering only the parser engine (which returns a dict of arrays, i.e. no data doubling) uses only 570 million bytes:
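
For reference, driving the engine directly and skipping DataFrame construction looks roughly like this (a sketch: the TextReader name and signature shown here are assumptions, so details may differ):

    from pandas.parser import TextReader

    # Hypothetical sketch: the low-level engine hands back a dict of
    # column number -> ndarray, with no DataFrame and thus no doubling
    reader = TextReader('astro.csv', header=0)
    columns = reader.read()  # {0: ndarray, 1: ndarray, ...}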

Memory usage with non-numerical data depends on a lot of issues surrounding the handling of string data. Let’s consider the FEC data set, where pandas does pretty well out of the box, using only 415 million bytes at peak (I realized why it was so high while writing this article… will reduce soon):

    IOPro out of the box uses 3 times more. This would obviously be completely undesirable:

    What about R? It may not be fast but it uses the least memory again:

You might be wondering why IOPro uses so much memory. The problem is fixed-width string types:

Oof. A dtype of S38 or S76 means that the field uses 38 or 76 bytes for every entry, respectively. This is not good, so let’s set a bunch of these fields to use Python objects, like pandas does:

    import iopro

    # Parse the FEC file with IOPro, storing the listed columns as Python
    # objects rather than fixed-width NumPy strings
    adap = iopro.text_adapter('P00000001-ALL.csv')
    adap.set_field_types({2: object, 3: object, 4: object,
                          7: object, 8: object, 13: object})
    arr = adap[:]  # read the whole file into a structured array

Here’s the Massif peak usage, which is reasonably in line with pandas:

    Conclusions

I’m very happy to see this project through to completion, finally. Python users have been suffering for years from parsers that 1) have few features, 2) are slow, and 3) use a lot of memory. In pandas I focused first on features, then on speed, and now on both speed and memory. I’m very pleased with how it turned out. I’m excited to see the code hopefully pushed upstream into NumPy when I can get some help with the integration and plumbing (and parameter hell).

It will be a month or so before this code appears in a new release of pandas (we are about to release version 0.9.0), as I want to let folks on the bleeding edge find any bugs before releasing it to the masses.

    Future work and extensions

Several things could (and should) be added to the parser with comparatively little effort:

  • Integrate a regular expression engine to tokenize lines whose delimiters are multi-character strings or regular expressions.
  • Code up the fixed-width-field version of the tokenizer
  • Add on-the-fly decompression of GZIP’d files (a workaround sketch follows this list)
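
On the GZIP point: until on-the-fly decompression lands, a compressed file can already be fed through Python’s gzip module as a workaround (a sketch; the file name is made up):

    import gzip
    import pandas as pd

    # Decompress on the Python side and hand the file object to read_csv
    with gzip.open('double-matrix.csv.gz', 'rb') as f:
        df = pd.read_csv(f)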

Code used for performance and memory benchmarks

R code (I just copy-pasted the output I got for each command). Version 2.14.0