Update on upcoming pandas v0.10, new file parser, other performance wins

We’re hard at work as usual getting the next major pandas release out. I hope you’re as excited as I am! An interesting problem came up recently on the ever-popular FEC Disclosure database used in my book and in many pandas demos. The powers that be decided it would be cool if they put commas at the end of each line; fooling most CSV readers into thinking there are empty fields at the end of each line:

1
2
3
4
5
6
7
8
9
In [4]: path
Out[4]: '/home/wesm/Downloads/P00000001-ALL.csv'
In [5]: !head $path -n 5
cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp
C00410118,"P20002978","Bachmann, Michele","HARVEY, WILLIAM","MOBILE","AL","366010290","RETIRED","RETIRED",250,20-JUN-11,"","","","SA17A","736166","A1FDABC23D2D545A1B83","P2012",
C00410118,"P20002978","Bachmann, Michele","HARVEY, WILLIAM","MOBILE","AL","366010290","RETIRED","RETIRED",50,23-JUN-11,"","","","SA17A","736166","A899B9B0E223743EFA63","P2012",
C00410118,"P20002978","Bachmann, Michele","SMITH, LANIER","LANETT","AL","368633403","INFORMATION REQUESTED","INFORMATION REQUESTED",250,05-JUL-11,"","","","SA17A","749073","A4B144E3CB3294ABC9D6","P2012",
C00410118,"P20002978","Bachmann, Michele","BLEVINS, DARONDA","PIGGOTT","AR","724548253","NONE","RETIRED",250,01-AUG-11,"","","","SA17A","749073","A8C7C6F55B0914C5A880","P2012",

pandas’s file parsers by default will treat the first column as the DataFrame’s row names if the data have 1 too many columns, which is very useful in a lot of cases. Not so much here. So I made it so you can indicate index_col=False which results on the last column being dropped as desired. The FEC data file is now about 900MB and takes only 20 seconds to load on my spinning-rust box:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
In [1]: path = '/home/wesm/Downloads/P00000001-ALL.csv'
In [2]: %time fec = read_csv(path, index_col=False)
CPU times: user 19.29 s, sys: 1.04 s, total: 20.33 s
Wall time: 20.45 s
In [3]: fec
Out[3]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5330834 entries, 0 to 5330833
Data columns:
cmte_id 5330834 non-null values
cand_id 5330834 non-null values
cand_nm 5330834 non-null values
contbr_nm 5330834 non-null values
contbr_city 5330773 non-null values
contbr_st 5330779 non-null values
contbr_zip 5330472 non-null values
contbr_employer 5295347 non-null values
contbr_occupation 5296749 non-null values
contb_receipt_amt 5330834 non-null values
contb_receipt_dt 5330834 non-null values
receipt_desc 50420 non-null values
memo_cd 1165554 non-null values
memo_text 1182573 non-null values
form_tp 5330834 non-null values
file_num 5330834 non-null values
tran_id 5330834 non-null values
election_tp 5330135 non-null values
dtypes: float64(1), int64(1), object(16)
In [4]: fec.head()
Out[4]:
cmte_id cand_id cand_nm contbr_nm contbr_city \
0 C00410118 P20002978 Bachmann, Michele HARVEY, WILLIAM MOBILE
1 C00410118 P20002978 Bachmann, Michele HARVEY, WILLIAM MOBILE
2 C00410118 P20002978 Bachmann, Michele SMITH, LANIER LANETT
3 C00410118 P20002978 Bachmann, Michele BLEVINS, DARONDA PIGGOTT
4 C00410118 P20002978 Bachmann, Michele WARDENBURG, HAROLD HOT SPRINGS NATION
contbr_st contbr_zip contbr_employer contbr_occupation \
0 AL 3.660103e+08 RETIRED RETIRED
1 AL 3.660103e+08 RETIRED RETIRED
2 AL 3.686334e+08 INFORMATION REQUESTED INFORMATION REQUESTED
3 AR 7.245483e+08 NONE RETIRED
4 AR 7.190165e+08 NONE RETIRED
contb_receipt_amt contb_receipt_dt receipt_desc memo_cd memo_text form_tp \
0 250 20-JUN-11 NaN NaN NaN SA17A
1 50 23-JUN-11 NaN NaN NaN SA17A
2 250 05-JUL-11 NaN NaN NaN SA17A
3 250 01-AUG-11 NaN NaN NaN SA17A
4 300 20-JUN-11 NaN NaN NaN SA17A
file_num tran_id election_tp
0 736166 A1FDABC23D2D545A1B83 P2012
1 736166 A899B9B0E223743EFA63 P2012
2 749073 A4B144E3CB3294ABC9D6 P2012
3 749073 A8C7C6F55B0914C5A880 P2012
4 736166 A070D6D560BD84AA98AC P2012

For reference, it’s more difficult to load this file in R (2.15.2) (both because of its size and malformedness– hopefully an R guru can tell me how to deal with this trailing delimiter crap). Setting row.names=NULL causes incorrect column labelling but at least gives us a parsing + type inference performance number (about 10x slower, faster if you specify all 18 column data types):

1
2
3
> system.time(df <- read.csv(path, row.names=NULL))
user system elapsed
220.250 2.100 222.662

If you know much about this data set, you know most of these columns are not interesting to analyze. New in pandas v0.10 you can specify a subset of columns right in read_csv which results in both much faster parsing time and lower memory usage (since we're throwing away the data from the other columns after tokenizing the file):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
In [6]: fec_cols = ['cand_nm', 'contbr_st', 'contbr_employer', 'contb_receipt_amt',
...: 'contbr_occupation', 'contb_receipt_amt', 'contb_receipt_dt']
In [7]: %time fec = read_csv(path, index_col=False, usecols=fec_cols)
CPU times: user 8.93 s, sys: 0.36 s, total: 9.29 s
Wall time: 9.31 s
In [8]: fec
feOut[8]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5330834 entries, 0 to 5330833
Data columns:
cand_nm 5330834 non-null values
contbr_st 5330779 non-null values
contbr_employer 5295347 non-null values
contbr_occupation 5296749 non-null values
contb_receipt_amt 5330834 non-null values
contb_receipt_dt 5330834 non-null values
dtypes: float64(1), object(5)

Outside of file reading, a huge amount of work has been done elsewhere on pandas (aided by Chang She, Yoval P, Jeff Reback, and others). Performance has improved in many critical operations outside of parsing too (check out the groupby numbers!). Here's the output of a recent vbench run showing the latest dev version versus version 0.9.0 (numbers less than 1 indicate that the current pandas version is faster on average by that ratio):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
master v0.9.0 ratio
name
unstack_sparse_keyspace 1.3068 140.7709 0.0093
groupby_frame_apply_overhead 18.5924 320.9360 0.0579
read_csv_comment2 25.2323 361.2320 0.0699
groupbym_frame_apply 73.6522 495.8570 0.1485
read_csv_thou_vb 35.3686 191.7560 0.1844
frame_iteritems_cached 0.0700 0.3736 0.1873
concat_small_frames 12.4304 54.4093 0.2285
join_dataframe_integer_2key 5.8004 18.3924 0.3154
append_frame_single_homogenous 0.3345 0.8811 0.3796
read_csv_vb 18.5832 46.5717 0.3990
read_csv_standard 12.2059 28.3020 0.4313
panel_from_dict_all_different_indexes 73.6467 149.4339 0.4928
frame_constructor_ndarray 0.0478 0.0916 0.5219
panel_from_dict_two_different_indexes 50.8969 88.4219 0.5756
groupby_last 3.6658 6.2642 0.5852
groupby_first 3.7455 6.2646 0.5979
frame_get_numeric_data 0.0677 0.1076 0.6295
frame_iteritems 2.5812 3.5731 0.7224
groupby_simple_compress_timing 37.6556 51.8415 0.7264
append_frame_single_mixed 1.2934 1.7739 0.7291
frame_to_csv 285.7749 387.6002 0.7373
join_dataframe_integer_key 1.7632 2.2799 0.7734
groupby_multi_size 30.8166 39.5899 0.7784
groupby_frame_singlekey_integer 2.3004 2.8959 0.7944
write_csv_standard 329.9391 411.1040 0.8026
groupby_multi_cython 16.6989 19.3612 0.8625
sparse_series_to_frame 158.4511 181.7181 0.8720
groupby_multi_series_op 15.4856 17.7564 0.8721
stat_ops_level_frame_sum_multiple 7.6714 8.7897 0.8728
stat_ops_level_series_sum 2.1563 2.4705 0.8728
stat_ops_level_frame_sum 3.0208 3.3616 0.8986
merge_2intkey_nosort 18.5616 20.6283 0.8998
dataframe_reindex_columns 0.3021 0.3343 0.9037
stats_rank_average 28.2949 30.8713 0.9165
reshape_stack_simple 2.6250 2.8527 0.9202
frame_reindex_both_axes 0.3739 0.4037 0.9262
melt_dataframe 1.4668 1.5812 0.9276
groupby_frame_median 7.1793 7.7170 0.9303
groupby_indices 6.8554 7.3550 0.9321
stat_ops_level_series_sum_multiple 7.2527 7.7637 0.9342
groupby_series_simple_cython 5.0585 5.3895 0.9386
merge_2intkey_sort 44.3559 47.1701 0.9403
timeseries_timestamp_tzinfo_cons 0.0159 0.0169 0.9435
join_dataframe_index_multi 20.7460 21.9524 0.9450
frame_reindex_both_axes_ix 0.4559 0.4810 0.9479
panel_from_dict_equiv_indexes 26.2508 27.6105 0.9508
index_datetime_intersection 15.1940 15.9497 0.9526
panel_from_dict_same_index 26.1363 27.3955 0.9540
timeseries_large_lookup_value 0.0234 0.0245 0.9566
frame_sort_index_by_columns 41.4619 43.3163 0.9572
frame_boolean_row_select 0.2905 0.3034 0.9574
indexing_dataframe_boolean_rows 0.2364 0.2468 0.9581
join_dataframe_index_single_key_bigger 15.0122 15.6593 0.9587
index_datetime_union 15.2101 15.8353 0.9605
reindex_frame_level_align 0.9618 0.9992 0.9625
dataframe_reindex_daterange 0.4052 0.4206 0.9634
lib_fast_zip 10.3956 10.6966 0.9719
read_table_multiple_date_baseline 1175.4630 1208.3721 0.9728
groupby_frame_cython_many_columns 3.6782 3.7668 0.9765
groupby_multi_python 54.6882 55.9723 0.9771
series_align_int64_index 35.9280 36.6402 0.9806
groupby_multi_different_numpy_functions 13.9577 14.2328 0.9807
read_table_multiple_date 2599.5600 2648.0811 0.9817
reindex_frame_level_reindex 0.9353 0.9506 0.9839
timeseries_asof_single 0.0565 0.0574 0.9840
concat_series_axis1 70.7297 71.8611 0.9843
replace_fillna 4.9988 5.0717 0.9856
reindex_multiindex 1.2701 1.2874 0.9866
groupby_pivot_table 18.8712 19.1218 0.9869
groupby_multi_different_functions 13.9056 14.0842 0.9873
indexing_dataframe_boolean_rows_object 0.4870 0.4932 0.9875
stats_rank2d_axis1_average 14.5004 14.6610 0.9890
frame_ctor_nested_dict 90.2146 91.1775 0.9894
frame_ctor_nested_dict_int64 131.1228 132.3512 0.9907
frame_reindex_axis1 2.8508 2.8740 0.9919
timeseries_asof_nan 9.9486 10.0186 0.9930
timeseries_asof 10.3370 10.4094 0.9930
stats_rank_average_int 22.5349 22.6918 0.9931
sort_level_zero 4.3147 4.3434 0.9934
frame_reindex_axis0 1.3561 1.3643 0.9940
dti_reset_index 0.6203 0.6237 0.9945
frame_fillna_many_columns_pad 15.3549 15.4276 0.9953
frame_drop_dup_inplace 2.8577 2.8690 0.9961
stats_rank2d_axis0_average 24.5411 24.6010 0.9976
timeseries_timestamp_downsample_mean 4.4788 4.4895 0.9976
match_strings 0.3809 0.3815 0.9984
timeseries_sort_index 22.2473 22.2516 0.9998
index_int64_intersection 25.5705 25.5643 1.0002
index_int64_union 82.9492 82.8681 1.0010
frame_insert_500_columns 99.3640 99.2359 1.0013
timeseries_add_irregular 21.6117 21.5676 1.0020
replace_replacena 5.0788 5.0680 1.0021
series_ctor_from_dict 3.6953 3.6821 1.0036
series_align_left_monotonic 13.1076 13.0529 1.0042
timeseries_period_downsample_mean 6.2799 6.2450 1.0056
timeseries_1min_5min_mean 0.6190 0.6135 1.0090
stat_ops_series_std 0.2522 0.2494 1.0113
frame_drop_dup_na_inplace 2.6487 2.6187 1.0115
reindex_daterange_backfill 0.1866 0.1842 1.0128
frame_drop_duplicates 18.7704 18.5208 1.0135
series_align_irregular_string 81.0545 79.7984 1.0157
frame_drop_duplicates_na 18.7666 18.4687 1.0161
timeseries_infer_freq 10.5702 10.4007 1.0163
lib_fast_zip_fillna 13.9561 13.6818 1.0200
reindex_daterange_pad 0.1876 0.1839 1.0202
timeseries_1min_5min_ohlc 0.6971 0.6815 1.0229
reindex_fillna_backfill 0.1426 0.1388 1.0269
sort_level_one 4.4267 4.2921 1.0314
datetimeindex_add_offset 0.2387 0.2307 1.0345
frame_fancy_lookup 2.5191 2.4317 1.0359
frame_fillna_inplace 16.9570 16.2857 1.0412
timeseries_slice_minutely 0.0626 0.0600 1.0433
reindex_fillna_pad 0.1459 0.1396 1.0453
frame_ctor_list_of_dict 98.7562 93.8414 1.0524
join_dataframe_index_single_key_bigger 6.6729 6.2370 1.0699
frame_fancy_lookup_all 26.3783 24.6056 1.0720
series_constructor_ndarray 0.0124 0.0115 1.0765
series_value_counts_int64 2.7498 2.5306 1.0866
reshape_unstack_simple 3.1889 2.8616 1.1144
indexing_panel_subset 0.5772 0.5063 1.1400
groupby_apply_dict_return 45.6793 39.6327 1.1526
timeseries_to_datetime_iso8601 4.1278 3.5348 1.1678
join_dataframe_index_single_key_small 6.8140 5.7951 1.1758
frame_to_string_floats 52.5621 35.4093 1.4844
reshape_pivot_time_series 181.4950 103.2100 1.7585
sparse_frame_constructor 5.9832 3.3596 1.7809
Columns: test_name | target_duration [ms] | baseline_duration [ms] | ratio
- a Ratio of 1.30 means the target commit is 30% slower then the baseline.
Target [648d581] : ENH: index_col=False disables first column -> index. fixes FEC parsing problem. close #2442
Baseline [b5956fd] : RLS: Version 0.9.0 final

  • Matthew Dowle

    Re “hopefully an R guru can tell me how to deal with this trailing delimiter crap”. I’m not a guru but what’s wrong with reading in the final blank column, then removing that column once it’s in? Yes, as you say, it is tedious to specify column types in advance in R. So you can read the first few rows to get the types, than pass that back in to colClasses. To say that R “deeply struggles” is going a little too far. Your file reader is wonderful, yes. It is better, more convenient, than R’s. But do you really have to bash R so harshly?

    [Reply]

    Wes McKinney Reply:

    Sorry, I softened the language from “deeply struggles” to “makes it more difficult”. Have you tried loading this file in R? The issue is that the header column does NOT have a trailing delimiter so R wants to use the first column as the row names. If you pass row.names=NULL it adds a row.names first column name and mislabels the other columns.

    If someone from the R community gets motivated, the file tokenizer I built (the interesting part) is liberally licensed C code and could be easily adapted to R. I’m not much for the language wars and think that everyone should have fast, easy to use data processing tools in all languages. You’re one of the only people I know who’s been making major strides on this front recently (really great work with data.table).

    [Reply]

    Matthew Dowle Reply:

    Great, thanks. I see what you mean about the header row now. I haven’t tried it, but how about skip=1,header=FALSE to read the data (only) ignoring the header row. Then read the header row separately (nrows=1) and assign the row names afterwards. A faff, though. Interesting case.

    Funny you should say that. I’ve been working on a faster and friendlier file reader for R in spare time. There’s still quite a bit to do, but an early draft of its help file with some examples and timings is below. One of the things it’ll read directly is dual-delimited files, where each cell of a list column is itself a vector, such as in genomics (a second separator, within some columns). I’ll certainly take a look at your tokenizer, thanks. Also your use of mmap looks neat.

    https://r-forge.r-project.org/scm/viewvc.php/pkg/man/fread.Rd?view=markup&root=datatable

    [Reply]

  • Anonymous

    Excellent work. It’s great to see vbench demonstrate just how far you guys are coming along.

    I just tried to use 0.9.1′s read_csv on my machine for this FEC dataset, it used 8Gb of swap! (I killed it after it gave no result after an hour.)

    [Reply]

  • http://twitter.com/omitevski Oliver Mitevski

    I would prefer that pandas beat the R read.csv in all respects. Hopefully this can be added to improve the robustness of read_csv.

    http://stackoverflow.com/questions/13824840/feature-request-for-pandas-read-csv

    [Reply]

    Wes McKinney Reply:

    Just replied. You have to specify the escape character to make that case work. R has a default option that quotes are coded as `”` which makes it work out of the box in your case. Maybe that should be the default for pandas, too.

    [Reply]

  • Greg

    Hi, Hi! Sorry to bug you about this. I am new to python and pandas. I started coding in Python 3 and later discovered all the distributions like python(x,y), and the fact that they are all using 2.7. Moving to 2.7 seems odd now. I’ve been able to install all recomended packages/soft like Ipython etc on win7-64 using python 3.2, except for PyTables. By any chance are you aware if they are planning to release a version for Python 3? I searched their site for way to contact PyTables developers, but failed…

    [Reply]

    Wes McKinney Reply:

    I’d suggest you stick to 2.7 for another year or so if you need PyTables. I haven’t looked to see how much work would be involved in making the C code there 2 and 3 compatible.

    [Reply]

  • http://www.facebook.com/Painn.O Artyom Kharitonov

    it’s really good, thanks

    [Reply]

  • Ian GM

    First and foremost, pandas is the coolest thing I’ve found in a long time. Many thanks.

    Is it expected that read_csv is much faster than to_csv ?

    I’m seeing read_csv : to_csv : (writing to hdf5) at about 1 : 50 : 2.5

    [Reply]

    Wes McKinney Reply:

    Yes, to_csv is much slower and does not have any optimized code. I would be interested to see this changed but do not have the time to implement it myself for some time.

    [Reply]

  • Feng

    I tried this on my Mac with 8GB memory, but I got this:

    Python(12618,0xacb01a28) malloc: *** mmap(size=24051712) failed (error code=12)
    *** error: can’t allocate region
    *** set a breakpoint in malloc_error_break to debug

    what is possible reason for this? thanks for your help

    [Reply]

    Wes McKinney Reply:

    Can you post this on GitHub please? https://github.com/pydata/pandas/issues?state=open

    [Reply]

  • Anonymous

    It’s great to see read_csv get such a big improvement. I’m curious as to why or the short version of why read_csv is so much faster than numpy.loadxt()?

    For example, I’m comparing numbers on a relatively small space-separated file (about 208K) with 26 columns and 1000 rows. Each item in the file is a random floating point number that I generated with numpy.rand.

    >>> %timeit f = open(‘test.out’, ‘r’);f.readline();np.loadtxt(f, unpack=True)
    10 loops, best of 3: 26.9 ms per loop

    >>> %timeit pd.read_csv(‘test.out’, delim_whitespace=True)
    100 loops, best of 3: 5.99 ms per loop

    [Reply]

    Anonymous Reply:

    Guess the answer is pretty much already on your blog! For others to reference:

    - Fast new file parser added: http://wesmckinney.com/blog/?m=201210

    - Speeding up file parsing with Cython: http://wesmckinney.com/blog/?p=278

    [Reply]

    Wes McKinney Reply:

    Short answer: file tokenization and type inference is being handled at the lowest level possible in C/Cython. If you look at the impl of numpy.loadtxt you’ll see a lot of Python.

    [Reply]

    Anonymous Reply:

    Oh ok, that makes a lot of sense. I did look into read_csv but not loadtxt. Thanks for the feedback!

    [Reply]