
A new high performance, memory-efficient file parser engine for pandas

TL;DR I’ve finally gotten around to building the high performance parser engine that pandas deserves. It hasn’t been released yet (it’s in a branch on GitHub), but it will be after I give it a month or so for any remaining buglets to shake out.

A project I’ve put off for a long time is building a high performance, memory efficient file parser for pandas. The existing code up through and including the imminent pandas 0.9.0 release has always been makeshift; the development focus has been on parser features over the more tedious (but actually much more straightforward) issue of creating a fast C table tokenizer. It’s been on the pandas roadmap for a long time:

http://github.com/pydata/pandas/issues/821

pandas.read_csv from pandas 0.5.0 onward is actually very fast (faster than R and much faster than numpy.loadtxt), but it uses a lot of memory. I wrote about some of the implementation issues about a year ago here. The key problem with the existing code is this: all of the existing parsing solutions in pandas as well as NumPy first read the file data into pure Python data structures, a list of tuples or a list of lists. If you have a very large file, a list of 1 million or 10 million Python tuples has an extraordinary memory footprint, significantly greater than the size of the file on disk (the footprint can be 5x the file size or more, which is far too much). Some people have pointed out the large memory usage without correctly explaining why, but this is the one and only reason: too many intermediate Python data structures.
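
To make the overhead of those intermediate Python structures concrete, here is a minimal sketch (my own illustration, not code from pandas) comparing a list of Python tuples against a NumPy array holding the same 1,000,000 x 10 values; the list-of-tuples representation comes out several times larger than the 80 MB the contiguous array needs:

    import sys
    import numpy as np

    nrows, ncols = 1000000, 10

    # parse-to-Python-objects style: one tuple of float objects per row
    rows = [tuple(float(j) for j in range(ncols)) for i in range(nrows)]

    # rough footprint: the outer list, each tuple, and each float object
    approx_bytes = (sys.getsizeof(rows)
                    + sum(sys.getsizeof(t) for t in rows)
                    + nrows * ncols * sys.getsizeof(0.0))
    print("list of tuples: ~%d MB" % (approx_bytes // 10 ** 6))

    # parse-straight-to-array style: one contiguous float64 block
    arr = np.zeros((nrows, ncols), dtype=np.float64)
    print("float64 array:  ~%d MB" % (arr.nbytes // 10 ** 6))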

Building a good parser engine isn’t exactly rocket science; we’re talking optimizing the implementation of dirt simple O(n) algorithms here. The task is divided into several key pieces:

  • File tokenization: read bytes from the file, identify where fields begin and end and which column each belongs to. Python’s csv module is an example of a tokenizer. Things like quoting conventions need to be taken into account. Doing this well in C is about picking the right data structures and making the code lean and mean. To be clear: if you design the tokenizer data structure wrong, you’ve lost before you’ve begun.
  • NA value filtering: detect NA (missing) value sentinels and convert them to the appropriate NA representation. Examples of NA sentinels are NA, #N/A, or other bespoke values like -999. Practically speaking, this means keeping a hash set of strings considered NA and checking whether each parsed token is in the set (and you can have different NA sets for each column, too!). If the number of sentinel values is small, you could use an array of C strings instead of a hash set. (A minimal sketch of this step appears below.)
  • Tolerating “bad” rows: Can aberrant rows be gracefully ignored with your consent? Is the error message informative?
  • Type inference / conversion: Converting the tokens in the file to the right C types (string, date, floating point, integer, boolean).
  • Skipping rows: Ignore certain rows in the file or at the end of the file.
  • Date parsing / value conversion: Convert one or more columns into timestamps. In some cases concatenate date/time information spread across multiple columns.
  • Handling of “index” columns: Handle row names appropriately, yielding a DataFrame with the expected row index.

    None of this is that hard; it’s made much more time consuming due to the proliferation of fine-grained options (and resulting “parameter hell”). Anyway, I finally mustered the energy to hack it out over a few intense days in late August and September. I’m hoping to ship it in a quick pandas 0.10 release (“version point-ten”) toward the end of October if possible. It would be nice to push this code upstream into NumPy to improve loadtxt and genfromtxt’s performance as well.
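
To illustrate the NA-filtering step from the list above, here is a minimal pure-Python sketch (the real engine does this in C with a string hash set; the function name and default sentinel set below are my own, not pandas API):

    import numpy as np

    # illustrative default sentinel set; each column may add its own values
    DEFAULT_NA_VALUES = {"", "NA", "N/A", "#N/A", "NULL"}

    def convert_float_column(tokens, extra_na=()):
        """Convert one column's string tokens to float64, mapping NA sentinels to NaN."""
        na_values = DEFAULT_NA_VALUES | set(extra_na)
        out = np.empty(len(tokens), dtype=np.float64)
        for i, tok in enumerate(tokens):
            out[i] = np.nan if tok in na_values else float(tok)
        return out

    # a column that uses -999 as its own bespoke missing-value marker
    convert_float_column(["1.5", "NA", "-999", "2.25"], extra_na={"-999"})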

    Benchmarks against R, NumPy, Continuum’s IOPro

    Outside of parser features (i.e. “can the tool read my file correctly”), there are two performance areas of interest:

  • CPU Speed: how long does it take to parse the file?
  • Memory utilization: what’s the maximum amount of RAM used while the file is being parsed (including the final returned table)? There’s really nothing worse than your computer starting to swap when you try to parse a large file.
    I’ll compare the new pandas parser engine against several other tools that you can use to do the same job, including R’s parser functions:

  • R’s venerable read.csv and read.table functions
  • numpy.loadtxt: this is a pure Python parser, to be clear.
  • New pandas engine, via pandas.read_csv and read_table
  • A new commercial library, IOPro, from my good friends at Continuum Analytics.
    To do the performance analysis, I’ll look at 5 representative data sets:

  • A 100,000 x 50 CSV matrix of randomly generated 0’s and 1’s. It looks like this:
    In [3]: !head -n 5 parser_examples/zeros.csv
    0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49
    1,1,0,1,1,0,0,1,1,0,1,0,1,1,0,0,1,0,0,0,0,1,0,1,1,1,1,1,0,1,0,1,1,1,1,1,0,1,1,0,1,0,0,0,1,1,0,0,0,0
    1,0,0,1,1,1,0,0,1,1,1,0,1,1,0,0,1,0,1,1,0,1,1,1,1,1,1,0,1,1,1,0,0,0,0,1,0,1,0,1,0,0,0,0,1,1,0,1,1,1
    1,1,0,0,0,0,0,0,0,1,0,0,0,1,1,0,1,1,0,0,0,0,0,0,1,0,0,0,0,1,1,1,0,1,1,1,0,1,0,0,1,1,1,1,1,1,1,1,0,0
    0,1,0,0,1,1,0,0,0,0,0,0,0,0,1,0,1,1,0,1,0,1,0,1,0,0,1,1,0,1,1,0,0,0,0,0,1,0,1,1,1,0,0,0,1,0,1,0,0,1

  • A 1,000,000 x 10 CSV matrix of randomly generated normally distributed data. Looks like this:
    In [4]: !head -n 5 parser_examples/matrix.csv
    0,1,2,3,4,5,6,7,8,9
    0.609633439034,0.249525535926,0.180502465241,0.940871913454,-0.35702932376,1.12983701927,0.77045731318,-0.16976884026,-0.685520348835,0.216936429382
    0.76523368046,1.08405034644,1.2099841819,-0.858404123158,1.47061247583,-1.15728386054,-0.375685123416,-0.00475949800828,0.522530689417,0.485226447392
    -0.958266896007,-0.0583065555495,-0.17369448475,0.465274502954,0.92612769921,0.362029345941,-2.27118704972,0.944967722699,1.34525304565,1.60130304607
    -0.518406503139,-1.19158517434,0.064195872451,-2.244687656,0.947562908985,0.775078137538,0.160741686264,-0.706110036551,-0.780137216247,1.02794242373

  • The Federal Election Commission (FEC) data set as a CSV file. One of my favorite example data sets for talking about pandas. Here’s what it looks like when parsed with pandas.read_csv (the exact call is sketched just after this list):
    In [2]: df
    Out[2]:
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1001731 entries, 0 to 1001730
    Data columns:
    cmte_id 1001731 non-null values
    cand_id 1001731 non-null values
    cand_nm 1001731 non-null values
    contbr_nm 1001731 non-null values
    contbr_city 1001716 non-null values
    contbr_st 1001727 non-null values
    contbr_zip 1001620 non-null values
    contbr_employer 994314 non-null values
    contbr_occupation 994433 non-null values
    contb_receipt_amt 1001731 non-null values
    contb_receipt_dt 1001731 non-null values
    receipt_desc 14166 non-null values
    memo_cd 92482 non-null values
    memo_text 97770 non-null values
    form_tp 1001731 non-null values
    file_num 1001731 non-null values
    dtypes: float64(1), int64(1), object(14)

  • Wikipedia page count data used for benchmarks in this blog post. It’s delimited by single spaces and has no column header (the corresponding read_table call is sketched just after this list):
    In [8]: df
    Out[8]:
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 6078103 entries, 0 to 6078102
    Data columns:
    X.1 6077987 non-null values
    X.2 6078090 non-null values
    X.3 6078103 non-null values
    X.4 6078103 non-null values
    dtypes: int64(2), object(2)
    In [9]: df.head()
    Out[9]:
    X.1 X.2 X.3 X.4
    0 aa.b Special%3aStatistics 1 18127
    1 aa.b Special:WhatLinksHere/User:Sir_Lestaty_de_Lion... 1 5325
    2 aa.b User:EVula 1 21080
    3 aa.b User:EVula/header 1 17332
    4 aa.b User:Manecke 1 21041

  • A large numerical astronomy data set used for benchmarks in this blog post. Looks like this:
    In [19]: df
    Out[19]:
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 6949386 entries, 0 to 6949385
    Data columns:
    objectid (long) 6949386 non-null values
    right ascension (float) 6949386 non-null values
    declination (float) 6949386 non-null values
    ultraviolet (double) 6949386 non-null values
    green (double) 6949386 non-null values
    red (double) 6949386 non-null values
    infrared (double) 6949386 non-null values
    z (double) 6949386 non-null values
    dtypes: float64(7), int64(1)
    In [20]: df[:2].T
    Out[20]:
    0 1
    objectid (long) 7.588828e+17 7.588828e+17
    right ascension (float) 2.634087e+02 2.634271e+02
    declination (float) 6.278961e+00 6.310742e+00
    ultraviolet (double) 2.459675e+01 2.330080e+01
    green (double) 2.347177e+01 2.275493e+01
    red (double) 2.169188e+01 2.188667e+01
    infrared (double) 2.118722e+01 2.066283e+01
    z (double) 2.043528e+01 2.135766e+01

    Here’s a link to an archive of all the datasets (warning: about 500 megabytes): Table datasets
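
    For the two text-heavy files, the read calls are worth spelling out. Here is a sketch (the file names follow the benchmark code at the end of the post):

    import pandas as pd

    # FEC contributions: a plain CSV with a header row
    fec = pd.read_csv('parser_examples/P00000001-ALL.csv')

    # Wikipedia page counts: single-space delimited, no header row; with
    # header=None pandas assigns default names (the X.1 ... X.4 seen above)
    wiki = pd.read_table('parser_examples/pagecounts-20110331-220000',
                         sep=' ', header=None)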

    I don’t have time to compare features (which vary greatly across the tools).

    Oh, and my rig:

  • Core i7 950 @ 3.07 GHz
  • 24 GB of RAM (so we won’t get close to swapping)
  • OCZ Vertex 3 Sata 3 SSD
  • (Because I have an SSD, I would expect the benchmarks for spinning rust to differ roughly by a constant amount based on the read time for slurping the bytes off the disk. In my case, the disk reads aren’t a major factor. In corporate environments with NFS servers under heavy load, you would expect similar reads to take a bit longer.)

    CPU Performance benchmarks

    So numpy.loadtxt is really slow, and I’m excluding it from the benchmarks. On the smallest and simplest file in these benchmarks, it’s more than 10 times slower than the new pandas parser:

    In [27]: timeit read_csv('zeros.csv')
    1 loops, best of 3: 415 ms per loop
    In [29]: %time arr = np.loadtxt('zeros.csv', delimiter=',', dtype=np.int_, skiprows=1)
    CPU times: user 4.88 s, sys: 0.04 s, total: 4.92 s
    Wall time: 4.92 s

    Here are the results for everybody else (see code at end of post):

    Here are the raw numbers in seconds:

    In [30]: results
    Out[30]:
                       iopro    pandas       R
    astro          17.646228  6.955254  37.030
    double-matrix   3.377430  1.279502   6.920
    fec             3.685799  2.306570  18.121
    wikipedia      11.752624  4.369659  42.250
    zero-matrix     0.673885  0.268830   0.616
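
    For what it’s worth, relative timings (pandas == 1.0) can be recovered from this table by normalizing each row by the pandas column; a one-line sketch:

    # divide each dataset's timings by the pandas timing for that dataset
    relative = results.div(results['pandas'], axis=0)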

    IOPro vs. new pandas parser: a closer look

    But hey, wait a second. If you are intimately familiar with IOPro and pandas you will already be saying that I am not making an apples to apples comparison. True. Why not?

  • IOPro does not check for and substitute common NA sentinel values (I believe you can give it a list of values to check for; the documentation was a bit hard to work out in this regard)
  • IOPro returns NumPy arrays with structured dtype. Pandas DataFrame has a slightly different internal format, and strings are boxed as Python objects rather than stored in NumPy string dtype arrays
    To level the playing field, I’ll disable the NA filtering logic in pandas (passing na_filter=False) and instruct the parser to return a structured array instead of a DataFrame (as_recarray=True). I’ll also look only at the numerical datasets (excluding wikipedia and fec for now) to remove the impact of handling string data types. Here is the resulting graph (with relative timings):
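
    Concretely, the leveled-playing-field pandas call looks something like this (na_filter and as_recarray are the two options named above; the file name is the astro dataset from the benchmark code):

    from pandas import read_csv

    # skip NA sentinel checking and return a structured array rather than a
    # DataFrame, to mirror what IOPro produces
    arr = read_csv('parser_examples/sdss6949386.csv',
                   na_filter=False, as_recarray=True)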

    It looks like the savings of not passing all the tokens through the NA filter is balanced by the cost of transferring the column arrays into the structured array (which is a raw array of bytes interpreted as a table by NumPy). This could very likely be made faster (more cache-efficient) than it currently is with some effort.

    Memory usage benchmarks

    Profiling peak memory usage is a tedious process. The canonical tool for the job is Massif from the Valgrind suite. I’m not yet done obsessing over memory allocation and data management inside the parser system, but here’s what the numbers look like compared with R and IOPro. I’m using the following valgrind commands (plus ms_print) to get this output (if this is not correct, please someone tell me):

    valgrind --tool=massif --depth=1 python -c command
    ms_print massif_output_file

    I’ll use the largest file in this post, the astro numerical dataset.
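
    For the pandas runs, the payload handed to python -c in the valgrind invocation above would be something along these lines (the exact strings aren’t reproduced here, so treat this as a reconstruction):

    # hypothetical payload for `python -c` under valgrind/massif
    from pandas import read_csv
    df = read_csv('parser_examples/sdss6949386.csv')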

    First, IOPro advertises a very low memory footprint. It does not, however, avoid having 2 copies of the data set in memory (I don’t either; it’s actually very difficult, and costly, to avoid this). Here is the final output of ms_print showing peak memory usage at the very end, when the structured array is created and returned:

    --------------------------------------------------------------------------------
    n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
    --------------------------------------------------------------------------------
    65 91,209,383,985 467,332,576 467,252,659 79,917 0
    66 92,193,310,604 467,332,576 467,252,659 79,917 0
    67 93,154,564,618 912,093,904 908,774,712 3,319,192 0
    99.64% (908,774,712B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
    ->48.76% (444,761,193B) 0x6E1F5B4: PyArray_NewFromDescr (in /home/wesm/epd/lib/python2.7/site-packages/numpy/core/multiarray.so)
    |
    ->48.76% (444,760,704B) 0x6E0BEAD: PyArray_Resize (in /home/wesm/epd/lib/python2.7/site-packages/numpy/core/multiarray.so)
    |
    ->01.15% (10,485,760B) 0x65391E1: open_text_adapter (text_adapter.c:58)
    |
    ->00.96% (8,767,055B) in 1+ places, all below ms_print's threshold (01.00%)

    Let’s look at R. Peak memory allocation comes in slightly under IOPro at 903MM bytes vs. 912MM:

    --------------------------------------------------------------------------------
    n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
    --------------------------------------------------------------------------------
    71 227,687,337,998 635,817,352 635,766,423 50,929 0
    72 227,723,703,780 681,469,608 681,418,671 50,937 0
    73 227,758,451,071 737,064,752 737,013,799 50,953 0
    74 227,803,513,492 792,659,896 792,608,927 50,969 0
    75 227,848,799,505 848,255,040 848,204,055 50,985 0
    76 227,893,861,838 903,850,184 903,799,183 51,001 0
    77 227,937,535,005 903,850,184 903,799,183 51,001 0
    99.99% (903,799,183B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
    ->98.52% (890,458,496B) 0x4F45020: Rf_allocVector (memory.c:2388)
    |
    ->01.48% (13,340,687B) in 108 places, all below massif's threshold (01.00%)

    In the new pandas parser, I’ll look at 2 things: memory allocation by the parser engine before creation of the final DataFrame (which causes data-doubling as with IOPro), and the user-facing read_csv. First, profiling read_csv (which also creates a simple integer Index for the DataFrame) shows a peak of 1014MM bytes, about 10% more than either of the above:

    --------------------------------------------------------------------------------
    n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
    --------------------------------------------------------------------------------
    70 56,100,696,245 1,014,234,384 965,776,492 48,457,892 0
    95.22% (965,776,492B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
    ->93.19% (945,117,309B) 0x6B9E5B4: PyArray_NewFromDescr (in /home/wesm/epd/lib/python2.7/site-packages/numpy/core/multiarray.so)
    |
    ->02.04% (20,659,183B) in 145 places, all below massif's threshold (01.00%)

    The parser engine considered alone (it returns a dict of arrays, i.e. no data doubling) uses only 570MM bytes:

    --------------------------------------------------------------------------------
    n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
    --------------------------------------------------------------------------------
    56 39,618,471,594 569,619,528 521,140,509 48,479,019 0
    91.49% (521,140,509B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
    ->87.84% (500,356,421B) 0x6B9E5B4: PyArray_NewFromDescr (in /home/wesm/epd/lib/python2.7/site-packages/numpy/core/multiarray.so)
    |
    ->01.95% (11,084,760B) in 128 places, all below massif's threshold (01.00%)
    |
    ->01.70% (9,699,328B) 0x4EADC3B: PyObject_Malloc (obmalloc.c:580)

    Memory usage with non-numerical data depends on a lot of issues surrounding the handling of string data. Let’s consider the FEC data set, where pandas does pretty well out of the box, using only 415MM bytes at peak (I realized why it was so high while writing this article… I’ll reduce it soon):

    --------------------------------------------------------------------------------
    n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
    --------------------------------------------------------------------------------
    88 12,473,186,057 415,804,184 348,381,546 67,422,638 0
    83.79% (348,381,546B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
    ->63.60% (264,457,869B) 0x6B9E5B4: PyArray_NewFromDescr (in /home/wesm/epd/lib/python2.7/site-packages/numpy/core/multiarray.so)
    |
    ->14.88% (61,865,984B) 0x4EADC3B: PyObject_Malloc (obmalloc.c:580)
    |
    ->02.34% (9,722,205B) in 149 places, all below massif's threshold (01.00%)
    |
    ->01.91% (7,959,808B) 0x4E9799C: fill_free_list (intobject.c:52)
    |
    ->01.05% (4,375,680B) 0x4EA5147: dictresize (dictobject.c:632)

    IOPro out of the box uses 3 times more. This would obviously be completely undesirable:

    --------------------------------------------------------------------------------
    n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
    --------------------------------------------------------------------------------
    76 15,039,940,724 1,232,035,520 828,653,666 403,381,854 0
    67.26% (828,653,666B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
    ->32.85% (404,699,813B) 0x6E1F5B4: PyArray_NewFromDescr (in /home/wesm/epd/lib/python2.7/site-packages/numpy/core/multiarray.so)
    |
    ->32.85% (404,699,324B) 0x6E0BEAD: PyArray_Resize (in /home/wesm/epd/lib/python2.7/site-packages/numpy/core/multiarray.so)
    |
    ->01.56% (19,254,529B) in 114 places, all below massif's threshold (01.00%)

    What about R? It may not be fast but it uses the least memory again:

    --------------------------------------------------------------------------------
    n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
    --------------------------------------------------------------------------------
    75 51,948,622,306 259,260,296 258,997,057 263,239 0
    99.90% (258,997,057B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
    ->74.63% (193,486,656B) 0x4F45020: Rf_allocVector (memory.c:2388)
    |
    ->23.57% (61,112,304B) 0x4F445C0: GetNewPage (memory.c:786)
    |
    ->01.14% (2,951,289B) 0x4FBDD9B: do_lazyLoadDBfetch (serialize.c:2335)
    |
    ->00.56% (1,446,808B) in 1+ places, all below ms_print's threshold (01.00%)

    You might be wondering why IOPro uses so much memory. The problem is fixed-width string types:

    In [2]: adap = iopro.text_adapter('P00000001-ALL.csv')
    In [3]: arr = adap[:]
    In [4]: arr
    Out[4]:
    array([ ('C00410118', 'P20002978', 'Bachmann, Michelle', 'HARVEY, WILLIAM', 'MOBILE', 'AL', '366010290', 'RETIRED', 'RETIRED', 250.0, '20-JUN-11', '', '', '', 'SA17A', 736166L),
    ('C00410118', 'P20002978', 'Bachmann, Michelle', 'HARVEY, WILLIAM', 'MOBILE', 'AL', '366010290', 'RETIRED', 'RETIRED', 50.0, '23-JUN-11', '', '', '', 'SA17A', 736166L),
    ('C00410118', 'P20002978', 'Bachmann, Michelle', 'SMITH, LANIER', 'LANETT', 'AL', '368633403', 'INFORMATION REQUESTED', 'INFORMATION REQUESTED', 250.0, '05-JUL-11', '', '', '', 'SA17A', 749073L),
    ...,
    ('C00500587', 'P20003281', 'Perry, Rick', 'GRANE, BRYAN F. MR.', 'INFO REQUESTED', 'XX', '99999', 'INFORMATION REQUESTED PER BEST EFFORTS', 'INFORMATION REQUESTED PER BEST EFFORTS', 500.0, '29-SEP-11', '', '', '', 'SA17A', 751678L),
    ('C00500587', 'P20003281', 'Perry, Rick', 'TOLBERT, DARYL MR.', 'INFO REQUESTED', 'XX', '99999', 'T.A.C.C.', 'LONGWALL MAINTENANCE FOREMAN', 500.0, '30-SEP-11', '', '', '', 'SA17A', 751678L),
    ('C00500587', 'P20003281', 'Perry, Rick', 'ANDERSON, MARILEE MRS.', 'INFO REQUESTED', 'XX', '99999', 'INFORMATION REQUESTED PER BEST EFFORTS', 'INFORMATION REQUESTED PER BEST EFFORTS', 2500.0, '31-AUG-11', '', '', '', 'SA17A', 751678L)],
    dtype=[('cmte_id', '|S9'), ('cand_id', '|S9'), ('cand_nm', '|S30'), ('contbr_nm', '|S57'), ('contbr_city', '|S29'), ('contbr_st', '|S2'), ('contbr_zip', '|S9'), ('contbr_employer', '|S38'), ('contbr_occupation', '|S38'), ('contb_receipt_amt', '<f8'), ('contb_receipt_dt', '|S9'), ('receipt_desc', '|S76'), ('memo_cd', '|S1'), ('memo_text', '|S76'), ('form_tp', '|S5'), ('file_num', '<u8')])

    Oof. A dtype of S38 or S76 means the field uses 38 or 76 bytes for every entry, no matter how short the actual string is. This is not good, so let’s set a bunch of these fields to use Python objects like pandas does:

    import iopro

    # map the wide string fields (names, cities, employers, memo text) to
    # Python objects instead of fixed-width strings
    adap = iopro.text_adapter('P00000001-ALL.csv')
    adap.set_field_types({2: object, 3: object, 4: object,
                          7: object, 8: object, 13: object})
    arr = adap[:]
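
    The difference is easy to see with a quick NumPy sketch (my own illustration): a fixed-width string column reserves its full width for every row, while an object column stores only an 8-byte pointer per row (plus the string objects themselves, which can be shared when values repeat):

    import numpy as np

    n = 1001731                        # number of rows in the FEC file
    fixed = np.empty(n, dtype='S76')   # e.g. receipt_desc / memo_text above
    boxed = np.empty(n, dtype=object)

    print(fixed.nbytes)   # 76 bytes per row: ~76 MB for one mostly-empty column
    print(boxed.nbytes)   # 8 bytes per row: ~8 MB of pointers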

    Here’s the Massif peak usage, which is reasonably in line with pandas:

    --------------------------------------------------------------------------------
    n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
    --------------------------------------------------------------------------------
    39 17,492,316,089 447,952,304 399,945,549 48,006,755 0
    89.28% (399,945,549B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
    ->64.31% (288,096,256B) 0x4EADC3B: PyObject_Malloc (obmalloc.c:580)
    |
    ->10.73% (48,083,577B) 0x6E1F5B4: PyArray_NewFromDescr (in /home/wesm/epd/lib/python2.7/site-packages/numpy/core/multiarray.so)
    |
    ->10.73% (48,083,088B) 0x6E0BEAD: PyArray_Resize (in /home/wesm/epd/lib/python2.7/site-packages/numpy/core/multiarray.so)
    |
    ->02.34% (10,485,760B) 0x65391E1: open_text_adapter (text_adapter.c:58)
    |
    ->01.16% (5,196,868B) in 113 places, all below massif's threshold (01.00%)

    Conclusions

    I’m very happy to have finally seen this project through to completion. Python users have been suffering for years from parsers that 1) have few features, 2) are slow, and 3) use a lot of memory. In pandas I focused first on features, then on speed, and now on both speed and memory. I’m very pleased with how it turned out. I’m excited to see the code hopefully pushed upstream into NumPy when I can get some help with the integration and plumbing (and parameter hell).

    It will be a month or so before this code appears in a new release of pandas (we are about to release version 0.9.0) as I want to let folks on the bleeding edge find any bugs before releasing it to the masses.

    Future work and extensions

    Several things could (should) be added to the parser with comparatively little effort:

  • Integrate a regular expression engine to tokenize lines with multi-character delimiters or regular expressions.
  • Code up the fixed-width-field version of the tokenizer
  • Add on-the-fly decompression of GZIP’d files (a stopgap sketch follows below)
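
    For the GZIP item, a stopgap that already works is to hand the parser a file object from Python’s gzip module; a sketch with a hypothetical file name:

    import gzip
    from pandas import read_csv

    # read_csv accepts file-like objects, so a gzipped CSV can be streamed
    # through Python's gzip module until the tokenizer can decompress natively
    df = read_csv(gzip.open('parser_examples/zeros.csv.gz'))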

    Code used for performance and memory benchmarks

    R code (I just copy-pasted the output I got from each command). Version 2.14.0.

    system.time(df <- read.csv('parser_examples/zeros.csv', colClasses=rep("integer", 50)))
    user system elapsed
    0.616 0.004 0.623
    system.time(df <- read.csv('parser_examples/matrix.csv', colClasses=rep("numeric", 10)))
    user system elapsed
    6.920 0.136 7.071
    system.time(df <- read.csv('parser_examples/sdss6949386.csv', colClasses=rep("numeric", 8)))
    user system elapsed
    37.030 0.804 37.866
    system.time(df <- read.table('parser_examples/pagecounts-20110331-220000', sep=" ",
               header=F,
    colClasses=c("character", "character", "integer", "numeric")))
    user system elapsed
    42.250 0.356 42.651
    system.time(df <- read.csv('parser_examples/P00000001-ALL.csv'))
    user system elapsed
    18.121 0.212 18.350
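
    Only the R commands are pasted above; here is a minimal sketch of how the corresponding pandas timings could be reproduced (my reconstruction, not the original benchmark script):

    import time
    from pandas import read_csv

    cases = {
        'zero-matrix': ('parser_examples/zeros.csv', {}),
        'double-matrix': ('parser_examples/matrix.csv', {}),
        'fec': ('parser_examples/P00000001-ALL.csv', {}),
        'wikipedia': ('parser_examples/pagecounts-20110331-220000',
                      {'sep': ' ', 'header': None}),
        'astro': ('parser_examples/sdss6949386.csv', {}),
    }

    for name in sorted(cases):
        path, kwds = cases[name]
        timings = []
        for _ in range(3):              # best of 3, like %timeit
            start = time.time()
            read_csv(path, **kwds)
            timings.append(time.time() - start)
        print('%-15s %.6f' % (name, min(timings)))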