Update on upcoming pandas v0.10, new file parser, other performance wins

We’re hard at work as usual getting the next major pandas release out. I hope you’re as excited as I am! An interesting problem came up recently on the ever-popular FEC Disclosure database used in my book and in many pandas demos. The powers that be decided it would be cool if they put commas at the end of each line; fooling most CSV readers into thinking there are empty fields at the end of each line:

1
2
3
4
5
6
7
8
9
In [4]: path
Out[4]: '/home/wesm/Downloads/P00000001-ALL.csv'
In [5]: !head $path -n 5
cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp
C00410118,"P20002978","Bachmann, Michele","HARVEY, WILLIAM","MOBILE","AL","366010290","RETIRED","RETIRED",250,20-JUN-11,"","","","SA17A","736166","A1FDABC23D2D545A1B83","P2012",
C00410118,"P20002978","Bachmann, Michele","HARVEY, WILLIAM","MOBILE","AL","366010290","RETIRED","RETIRED",50,23-JUN-11,"","","","SA17A","736166","A899B9B0E223743EFA63","P2012",
C00410118,"P20002978","Bachmann, Michele","SMITH, LANIER","LANETT","AL","368633403","INFORMATION REQUESTED","INFORMATION REQUESTED",250,05-JUL-11,"","","","SA17A","749073","A4B144E3CB3294ABC9D6","P2012",
C00410118,"P20002978","Bachmann, Michele","BLEVINS, DARONDA","PIGGOTT","AR","724548253","NONE","RETIRED",250,01-AUG-11,"","","","SA17A","749073","A8C7C6F55B0914C5A880","P2012",

pandas’s file parsers by default will treat the first column as the DataFrame’s row names if the data have 1 too many columns, which is very useful in a lot of cases. Not so much here. So I made it so you can indicate index_col=False which results on the last column being dropped as desired. The FEC data file is now about 900MB and takes only 20 seconds to load on my spinning-rust box:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
In [1]: path = '/home/wesm/Downloads/P00000001-ALL.csv'
In [2]: %time fec = read_csv(path, index_col=False)
CPU times: user 19.29 s, sys: 1.04 s, total: 20.33 s
Wall time: 20.45 s
In [3]: fec
Out[3]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5330834 entries, 0 to 5330833
Data columns:
cmte_id 5330834 non-null values
cand_id 5330834 non-null values
cand_nm 5330834 non-null values
contbr_nm 5330834 non-null values
contbr_city 5330773 non-null values
contbr_st 5330779 non-null values
contbr_zip 5330472 non-null values
contbr_employer 5295347 non-null values
contbr_occupation 5296749 non-null values
contb_receipt_amt 5330834 non-null values
contb_receipt_dt 5330834 non-null values
receipt_desc 50420 non-null values
memo_cd 1165554 non-null values
memo_text 1182573 non-null values
form_tp 5330834 non-null values
file_num 5330834 non-null values
tran_id 5330834 non-null values
election_tp 5330135 non-null values
dtypes: float64(1), int64(1), object(16)
In [4]: fec.head()
Out[4]:
cmte_id cand_id cand_nm contbr_nm contbr_city \
0 C00410118 P20002978 Bachmann, Michele HARVEY, WILLIAM MOBILE
1 C00410118 P20002978 Bachmann, Michele HARVEY, WILLIAM MOBILE
2 C00410118 P20002978 Bachmann, Michele SMITH, LANIER LANETT
3 C00410118 P20002978 Bachmann, Michele BLEVINS, DARONDA PIGGOTT
4 C00410118 P20002978 Bachmann, Michele WARDENBURG, HAROLD HOT SPRINGS NATION
contbr_st contbr_zip contbr_employer contbr_occupation \
0 AL 3.660103e+08 RETIRED RETIRED
1 AL 3.660103e+08 RETIRED RETIRED
2 AL 3.686334e+08 INFORMATION REQUESTED INFORMATION REQUESTED
3 AR 7.245483e+08 NONE RETIRED
4 AR 7.190165e+08 NONE RETIRED
contb_receipt_amt contb_receipt_dt receipt_desc memo_cd memo_text form_tp \
0 250 20-JUN-11 NaN NaN NaN SA17A
1 50 23-JUN-11 NaN NaN NaN SA17A
2 250 05-JUL-11 NaN NaN NaN SA17A
3 250 01-AUG-11 NaN NaN NaN SA17A
4 300 20-JUN-11 NaN NaN NaN SA17A
file_num tran_id election_tp
0 736166 A1FDABC23D2D545A1B83 P2012
1 736166 A899B9B0E223743EFA63 P2012
2 749073 A4B144E3CB3294ABC9D6 P2012
3 749073 A8C7C6F55B0914C5A880 P2012
4 736166 A070D6D560BD84AA98AC P2012

For reference, it’s more difficult to load this file in R (2.15.2) (both because of its size and malformedness– hopefully an R guru can tell me how to deal with this trailing delimiter crap). Setting row.names=NULL causes incorrect column labelling but at least gives us a parsing + type inference performance number (about 10x slower, faster if you specify all 18 column data types):

1
2
3
> system.time(df <- read.csv(path, row.names=NULL))
user system elapsed
220.250 2.100 222.662

If you know much about this data set, you know most of these columns are not interesting to analyze. New in pandas v0.10 you can specify a subset of columns right in read_csv which results in both much faster parsing time and lower memory usage (since we're throwing away the data from the other columns after tokenizing the file):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
In [6]: fec_cols = ['cand_nm', 'contbr_st', 'contbr_employer', 'contb_receipt_amt',
...: 'contbr_occupation', 'contb_receipt_amt', 'contb_receipt_dt']
In [7]: %time fec = read_csv(path, index_col=False, usecols=fec_cols)
CPU times: user 8.93 s, sys: 0.36 s, total: 9.29 s
Wall time: 9.31 s
In [8]: fec
feOut[8]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5330834 entries, 0 to 5330833
Data columns:
cand_nm 5330834 non-null values
contbr_st 5330779 non-null values
contbr_employer 5295347 non-null values
contbr_occupation 5296749 non-null values
contb_receipt_amt 5330834 non-null values
contb_receipt_dt 5330834 non-null values
dtypes: float64(1), object(5)

Outside of file reading, a huge amount of work has been done elsewhere on pandas (aided by Chang She, Yoval P, Jeff Reback, and others). Performance has improved in many critical operations outside of parsing too (check out the groupby numbers!). Here's the output of a recent vbench run showing the latest dev version versus version 0.9.0 (numbers less than 1 indicate that the current pandas version is faster on average by that ratio):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
master v0.9.0 ratio
name
unstack_sparse_keyspace 1.3068 140.7709 0.0093
groupby_frame_apply_overhead 18.5924 320.9360 0.0579
read_csv_comment2 25.2323 361.2320 0.0699
groupbym_frame_apply 73.6522 495.8570 0.1485
read_csv_thou_vb 35.3686 191.7560 0.1844
frame_iteritems_cached 0.0700 0.3736 0.1873
concat_small_frames 12.4304 54.4093 0.2285
join_dataframe_integer_2key 5.8004 18.3924 0.3154
append_frame_single_homogenous 0.3345 0.8811 0.3796
read_csv_vb 18.5832 46.5717 0.3990
read_csv_standard 12.2059 28.3020 0.4313
panel_from_dict_all_different_indexes 73.6467 149.4339 0.4928
frame_constructor_ndarray 0.0478 0.0916 0.5219
panel_from_dict_two_different_indexes 50.8969 88.4219 0.5756
groupby_last 3.6658 6.2642 0.5852
groupby_first 3.7455 6.2646 0.5979
frame_get_numeric_data 0.0677 0.1076 0.6295
frame_iteritems 2.5812 3.5731 0.7224
groupby_simple_compress_timing 37.6556 51.8415 0.7264
append_frame_single_mixed 1.2934 1.7739 0.7291
frame_to_csv 285.7749 387.6002 0.7373
join_dataframe_integer_key 1.7632 2.2799 0.7734
groupby_multi_size 30.8166 39.5899 0.7784
groupby_frame_singlekey_integer 2.3004 2.8959 0.7944
write_csv_standard 329.9391 411.1040 0.8026
groupby_multi_cython 16.6989 19.3612 0.8625
sparse_series_to_frame 158.4511 181.7181 0.8720
groupby_multi_series_op 15.4856 17.7564 0.8721
stat_ops_level_frame_sum_multiple 7.6714 8.7897 0.8728
stat_ops_level_series_sum 2.1563 2.4705 0.8728
stat_ops_level_frame_sum 3.0208 3.3616 0.8986
merge_2intkey_nosort 18.5616 20.6283 0.8998
dataframe_reindex_columns 0.3021 0.3343 0.9037
stats_rank_average 28.2949 30.8713 0.9165
reshape_stack_simple 2.6250 2.8527 0.9202
frame_reindex_both_axes 0.3739 0.4037 0.9262
melt_dataframe 1.4668 1.5812 0.9276
groupby_frame_median 7.1793 7.7170 0.9303
groupby_indices 6.8554 7.3550 0.9321
stat_ops_level_series_sum_multiple 7.2527 7.7637 0.9342
groupby_series_simple_cython 5.0585 5.3895 0.9386
merge_2intkey_sort 44.3559 47.1701 0.9403
timeseries_timestamp_tzinfo_cons 0.0159 0.0169 0.9435
join_dataframe_index_multi 20.7460 21.9524 0.9450
frame_reindex_both_axes_ix 0.4559 0.4810 0.9479
panel_from_dict_equiv_indexes 26.2508 27.6105 0.9508
index_datetime_intersection 15.1940 15.9497 0.9526
panel_from_dict_same_index 26.1363 27.3955 0.9540
timeseries_large_lookup_value 0.0234 0.0245 0.9566
frame_sort_index_by_columns 41.4619 43.3163 0.9572
frame_boolean_row_select 0.2905 0.3034 0.9574
indexing_dataframe_boolean_rows 0.2364 0.2468 0.9581
join_dataframe_index_single_key_bigger 15.0122 15.6593 0.9587
index_datetime_union 15.2101 15.8353 0.9605
reindex_frame_level_align 0.9618 0.9992 0.9625
dataframe_reindex_daterange 0.4052 0.4206 0.9634
lib_fast_zip 10.3956 10.6966 0.9719
read_table_multiple_date_baseline 1175.4630 1208.3721 0.9728
groupby_frame_cython_many_columns 3.6782 3.7668 0.9765
groupby_multi_python 54.6882 55.9723 0.9771
series_align_int64_index 35.9280 36.6402 0.9806
groupby_multi_different_numpy_functions 13.9577 14.2328 0.9807
read_table_multiple_date 2599.5600 2648.0811 0.9817
reindex_frame_level_reindex 0.9353 0.9506 0.9839
timeseries_asof_single 0.0565 0.0574 0.9840
concat_series_axis1 70.7297 71.8611 0.9843
replace_fillna 4.9988 5.0717 0.9856
reindex_multiindex 1.2701 1.2874 0.9866
groupby_pivot_table 18.8712 19.1218 0.9869
groupby_multi_different_functions 13.9056 14.0842 0.9873
indexing_dataframe_boolean_rows_object 0.4870 0.4932 0.9875
stats_rank2d_axis1_average 14.5004 14.6610 0.9890
frame_ctor_nested_dict 90.2146 91.1775 0.9894
frame_ctor_nested_dict_int64 131.1228 132.3512 0.9907
frame_reindex_axis1 2.8508 2.8740 0.9919
timeseries_asof_nan 9.9486 10.0186 0.9930
timeseries_asof 10.3370 10.4094 0.9930
stats_rank_average_int 22.5349 22.6918 0.9931
sort_level_zero 4.3147 4.3434 0.9934
frame_reindex_axis0 1.3561 1.3643 0.9940
dti_reset_index 0.6203 0.6237 0.9945
frame_fillna_many_columns_pad 15.3549 15.4276 0.9953
frame_drop_dup_inplace 2.8577 2.8690 0.9961
stats_rank2d_axis0_average 24.5411 24.6010 0.9976
timeseries_timestamp_downsample_mean 4.4788 4.4895 0.9976
match_strings 0.3809 0.3815 0.9984
timeseries_sort_index 22.2473 22.2516 0.9998
index_int64_intersection 25.5705 25.5643 1.0002
index_int64_union 82.9492 82.8681 1.0010
frame_insert_500_columns 99.3640 99.2359 1.0013
timeseries_add_irregular 21.6117 21.5676 1.0020
replace_replacena 5.0788 5.0680 1.0021
series_ctor_from_dict 3.6953 3.6821 1.0036
series_align_left_monotonic 13.1076 13.0529 1.0042
timeseries_period_downsample_mean 6.2799 6.2450 1.0056
timeseries_1min_5min_mean 0.6190 0.6135 1.0090
stat_ops_series_std 0.2522 0.2494 1.0113
frame_drop_dup_na_inplace 2.6487 2.6187 1.0115
reindex_daterange_backfill 0.1866 0.1842 1.0128
frame_drop_duplicates 18.7704 18.5208 1.0135
series_align_irregular_string 81.0545 79.7984 1.0157
frame_drop_duplicates_na 18.7666 18.4687 1.0161
timeseries_infer_freq 10.5702 10.4007 1.0163
lib_fast_zip_fillna 13.9561 13.6818 1.0200
reindex_daterange_pad 0.1876 0.1839 1.0202
timeseries_1min_5min_ohlc 0.6971 0.6815 1.0229
reindex_fillna_backfill 0.1426 0.1388 1.0269
sort_level_one 4.4267 4.2921 1.0314
datetimeindex_add_offset 0.2387 0.2307 1.0345
frame_fancy_lookup 2.5191 2.4317 1.0359
frame_fillna_inplace 16.9570 16.2857 1.0412
timeseries_slice_minutely 0.0626 0.0600 1.0433
reindex_fillna_pad 0.1459 0.1396 1.0453
frame_ctor_list_of_dict 98.7562 93.8414 1.0524
join_dataframe_index_single_key_bigger 6.6729 6.2370 1.0699
frame_fancy_lookup_all 26.3783 24.6056 1.0720
series_constructor_ndarray 0.0124 0.0115 1.0765
series_value_counts_int64 2.7498 2.5306 1.0866
reshape_unstack_simple 3.1889 2.8616 1.1144
indexing_panel_subset 0.5772 0.5063 1.1400
groupby_apply_dict_return 45.6793 39.6327 1.1526
timeseries_to_datetime_iso8601 4.1278 3.5348 1.1678
join_dataframe_index_single_key_small 6.8140 5.7951 1.1758
frame_to_string_floats 52.5621 35.4093 1.4844
reshape_pivot_time_series 181.4950 103.2100 1.7585
sparse_frame_constructor 5.9832 3.3596 1.7809
Columns: test_name | target_duration [ms] | baseline_duration [ms] | ratio
- a Ratio of 1.30 means the target commit is 30% slower then the baseline.
Target [648d581] : ENH: index_col=False disables first column -> index. fixes FEC parsing problem. close #2442
Baseline [b5956fd] : RLS: Version 0.9.0 final