We’re hard at work as usual getting the next major pandas release out. I hope you’re as excited as I am! An interesting problem came up recently with the ever-popular FEC Disclosure database used in my book and in many pandas demos. The powers that be decided it would be cool to put a comma at the end of each line, fooling most CSV readers into thinking there are empty fields at the end of each line:
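To see why this trips up parsers, here’s a minimal sketch with hypothetical toy data (not the actual FEC columns): a three-name header, but every data line carries a trailing comma, so it tokenizes into four fields.

```python
import csv
import io

# Hypothetical sample in the spirit of the FEC file: three header
# names, but each data row ends with a trailing comma.
data = "a,b,c\n1,2,3,\n4,5,6,\n"

rows = list(csv.reader(io.StringIO(data)))
print(rows[0])  # ['a', 'b', 'c'] -- three header names
print(rows[1])  # ['1', '2', '3', ''] -- four fields: the trailing comma adds an empty one
```

With one more field in each data row than there are header names, a reader has to decide what that extra column means.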

pandas’s file parsers will by default treat the first column as the DataFrame’s row names if the data have one too many columns, which is very useful in a lot of cases. Not so much here. So I made it so you can pass index_col=False, which results in the last column being dropped as desired. The FEC data file is now about 900MB and takes only 20 seconds to load on my spinning-rust box:
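A minimal sketch of the two behaviors, again with hypothetical toy data rather than the real FEC file:

```python
import io
import pandas as pd

# Hypothetical data reproducing the trailing-comma problem: three
# column names, four fields per data row.
data = "a,b,c\n1,2,3,\n4,5,6,\n"

# Default: the extra field makes pandas promote the first column to
# the row index, which is wrong here.
print(pd.read_csv(io.StringIO(data)))

# index_col=False keeps the default integer index and drops the
# spurious trailing column instead.
df = pd.read_csv(io.StringIO(data), index_col=False)
print(df)
```

With index_col=False the values land under the correct column labels and the DataFrame gets an ordinary 0, 1, 2, … index.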

For reference, it’s more difficult to load this file in R (2.15.2), both because of its size and its malformedness — hopefully an R guru can tell me how to deal with this trailing delimiter crap. Setting row.names=NULL causes incorrect column labelling but at least gives us a parsing + type inference performance number (about 10x slower; faster if you specify all 18 column data types):

If you know much about this data set, you know most of these columns are not interesting to analyze. New in pandas v0.10, you can specify a subset of columns right in read_csv, which results in both much faster parsing and lower memory usage (since we throw away the data from the other columns after tokenizing the file):
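Here’s a minimal sketch of the usecols option on a hypothetical miniature of the file, borrowing two of the FEC dataset’s real column names:

```python
import io
import pandas as pd

# Hypothetical miniature of the FEC file; usecols keeps only the
# columns we ask for and discards the rest after tokenizing.
data = (
    "cand_nm,contbr_st,contb_receipt_amt\n"
    "Obama,CA,100\n"
    "Romney,TX,200\n"
)

df = pd.read_csv(io.StringIO(data), usecols=["cand_nm", "contb_receipt_amt"])
print(df)
```

The resulting DataFrame contains only the two requested columns; the untouched columns never take up DataFrame memory.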

Outside of file parsing, a huge amount of work has been done elsewhere on pandas (aided by Chang She, Yoval P, Jeff Reback, and others). Performance has improved in many critical operations too (check out the groupby numbers!). Here’s the output of a recent vbench run showing the latest dev version versus version 0.9.0 (numbers less than 1 indicate that the current pandas version is faster on average by that ratio):