Category Archives: Uncategorized

Strata NYC 2013 and PyData 2013 Talks

I was excited to be able to talk at two recent data-centric conferences in New York. The talks touch on related subjects, though the PyData talk is a lot more technical, covering low-level pandas architecture and the engineering work I’ve been doing this year at DataPad.

Before anyone yells at me: I’m going to revisit the PostgreSQL benchmarks from my PyData talk at some point, since performance would be much better with the data stored in fully normalized form (a single fact table containing only primitive data types, plus auxiliary tables for all of the categorical / string data; laborious, but that’s how a DBA would set it up, rather than a single table with VARCHAR columns).
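To make the normalization idea concrete in pandas terms, here is a minimal sketch (my own illustration, not the actual benchmark setup) that factorizes the string columns of a denormalized table into integer codes plus auxiliary lookup tables; the column names are borrowed from the FEC example used elsewhere on this blog:

import pandas as pd

# A tiny stand-in for the denormalized, VARCHAR-heavy table.
df = pd.DataFrame({
    'cand_nm': ['Bachmann, Michele', 'Bachmann, Michele', 'Perry, Rick'],
    'contbr_st': ['AL', 'AL', 'XX'],
    'contb_receipt_amt': [250.0, 50.0, 500.0],
})

fact = df.copy()
lookups = {}
for col in ['cand_nm', 'contbr_st']:
    codes, categories = pd.factorize(df[col])
    fact[col] = codes                                 # integer "foreign key" column
    lookups[col] = pd.Series(categories, name=col)    # auxiliary lookup table

print(fact.dtypes)         # the fact table now holds only primitive types
print(lookups['cand_nm'])  # string values live in the lookup tables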

Practical Medium Data Analytics with Python (10 Things I Hate About pandas, PyData NYC 2013) from wesm

Building Better Analytics Workflows (Strata-Hadoop World 2013) from wesm

PyCon Singapore 2013

I was graciously invited to give the keynote presentation at this year’s PyCon Singapore. Luckily, I love to hack on long plane rides. See the slides from the talk below. During the talk I showed some analytics on Python posts on Stack Overflow; here is the IPython notebook, and the raw data is right here.

I also gave a half-day pandas tutorial; here is the IPython notebook. You will need the data to do it yourself; here’s a download link.

PyCon Singapore 2013 Keynote from wesm

I’m moving to San Francisco. And hiring

PyCon and PyData 2013 were a blast this last week. Several people noted that my GitHub activity on pandas hasn’t quite been the same lately and wondered if I was getting a little burned out. I’m happy to say quite the opposite; I’ll still be involved with pandas development (though not the 80 hours/week of the last 2 years), but I’m starting an ambitious new data project that I’m looking forward to sharing later this year. This endeavor is also taking me from New York to San Francisco. I’m sad to be leaving the (vibrant and growing) New York data community, but also looking forward to spending more time with my data hacking friends in the Bay Area.

While I can’t share too many details about the new startup, anyone who knows me knows my passion for innovation in data tooling and for making people and organizations more productive in their data analysis. pandas was Act 1! So I’m building a world-class team of engineers, designers, and forward thinkers to join me in this effort. Are you one of them? If so, I look forward to hearing from you (my e-mail address can easily be located on GitHub):

Front-end and Data Visualization Engineer

You will be building a richly featured web application that will stretch the capabilities of modern web browsers. You will build core UI components and work with the UX designer and backend team to make everything work seamlessly.

  • Extensive Javascript and CSS experience.
  • Experience with one or more SVG or Canvas-based visualization toolkits such as D3.js, or a keen interest in learning. Maybe you’ve spent a lot of time on http://bl.ocks.org. Extra points if you have opinions about the Grammar of Graphics (or its implementations, like ggplot2).
  • Know the ins and outs of websockets and AJAX communications with various backend data services.
  • Understand data binding and have used MV* frameworks enough to be dangerous.
  • Prior data analysis experience (even at the Excel level) very useful.

Data Engineer

You are a pragmatic, performance-motivated cruncher of bytes who knows what it means to ship code on tight deadlines. You have high standards but are willing to make tradeoffs to get stuff done. You and I will spend a lot of time at the whiteboard talking about the nuts and bolts of data processing. Some of these things may describe you:

  • Experience building performance and latency-sensitive, data-driven analytical applications.
  • Knowledge of standard data structures and algorithms for data processing, their implementation details, and their performance tradeoffs (hash tables, vectors, binary trees, sorting algorithms, etc.). Maybe you have already enjoyed reading my blog.
  • Knowledge of binary data formats, serialization schemes, compression, and other IO performance considerations. Familiarity with a variety of database technologies.
  • Python and C/C++ experience preferred. Extra points if you have programmed in an APL dialect (J or K/Q/Kona) or solved 100 or more problems on Project Euler.
  • You are a firm believer in unit testing and continuous integration.
  • Experience with building distributed data systems and analytics tools such as Spark, Crunch, or Pig a plus. Maybe you loved the Google Dremel white paper.
  • Experience with code generation (e.g. LLVM) or compiler technology a big plus.

UX Designer

You will be responsible for web design and crafting a compelling user experience. You will work intimately with the front-end / data visualization team to make a consistent look and feel throughout the product. We would prefer a designer who can also do her/his own JS/CSS implementation work if necessary. Any experience with data analysis tools, from Excel to Spotfire to Matlab (with opinions about what makes each of them easy–or terrible–to use), would be strongly preferred.

Full Stack Web Engineer

You will play a critical role in building a scalable, reliable web application, and generally keeping the trains running on time. You should be a jack of many trades with an interest in learning many more. Here are some desirable qualities:

  • Extensive experience using Python, Javascript (Node.js), etc. to build scalable, data-intensive web applications, along with strong opinions about the right technology to use.
  • Comfort with managing continuous (or highly frequent) deployments.
  • Experience with using and managing SQL (e.g. Postgres) and NoSQL (e.g. MongoDB) databases.
  • Experience building applications on EC2 or other cloud computing services.

Product Lead

You will run product at the company, working to gain a deep understanding of customer use cases and working with the engineering team to drive product-market fit. Prior experience in the data analytics or business intelligence space would be very helpful.

Whirlwind tour of pandas in 10 minutes

10-minute tour of pandas from Wes McKinney on Vimeo.

Update on upcoming pandas v0.10, new file parser, other performance wins

We’re hard at work as usual getting the next major pandas release out. I hope you’re as excited as I am! An interesting problem came up recently with the ever-popular FEC Disclosure database used in my book and in many pandas demos. The powers that be decided it would be cool to put a comma at the end of each line, fooling most CSV readers into thinking every row has an extra empty field:

In [4]: path
Out[4]: '/home/wesm/Downloads/P00000001-ALL.csv'
In [5]: !head $path -n 5
cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp
C00410118,"P20002978","Bachmann, Michele","HARVEY, WILLIAM","MOBILE","AL","366010290","RETIRED","RETIRED",250,20-JUN-11,"","","","SA17A","736166","A1FDABC23D2D545A1B83","P2012",
C00410118,"P20002978","Bachmann, Michele","HARVEY, WILLIAM","MOBILE","AL","366010290","RETIRED","RETIRED",50,23-JUN-11,"","","","SA17A","736166","A899B9B0E223743EFA63","P2012",
C00410118,"P20002978","Bachmann, Michele","SMITH, LANIER","LANETT","AL","368633403","INFORMATION REQUESTED","INFORMATION REQUESTED",250,05-JUL-11,"","","","SA17A","749073","A4B144E3CB3294ABC9D6","P2012",
C00410118,"P20002978","Bachmann, Michele","BLEVINS, DARONDA","PIGGOTT","AR","724548253","NONE","RETIRED",250,01-AUG-11,"","","","SA17A","749073","A8C7C6F55B0914C5A880","P2012",

pandas’s file parsers will, by default, treat the first column as the DataFrame’s row names if the data have one column too many, which is very useful in a lot of cases. Not so much here. So I made it possible to pass index_col=False, which results in the last column being dropped as desired. The FEC data file is now about 900MB and takes only 20 seconds to load on my spinning-rust box:

In [1]: path = '/home/wesm/Downloads/P00000001-ALL.csv'
In [2]: %time fec = read_csv(path, index_col=False)
CPU times: user 19.29 s, sys: 1.04 s, total: 20.33 s
Wall time: 20.45 s
In [3]: fec
Out[3]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5330834 entries, 0 to 5330833
Data columns:
cmte_id 5330834 non-null values
cand_id 5330834 non-null values
cand_nm 5330834 non-null values
contbr_nm 5330834 non-null values
contbr_city 5330773 non-null values
contbr_st 5330779 non-null values
contbr_zip 5330472 non-null values
contbr_employer 5295347 non-null values
contbr_occupation 5296749 non-null values
contb_receipt_amt 5330834 non-null values
contb_receipt_dt 5330834 non-null values
receipt_desc 50420 non-null values
memo_cd 1165554 non-null values
memo_text 1182573 non-null values
form_tp 5330834 non-null values
file_num 5330834 non-null values
tran_id 5330834 non-null values
election_tp 5330135 non-null values
dtypes: float64(1), int64(1), object(16)
In [4]: fec.head()
Out[4]:
cmte_id cand_id cand_nm contbr_nm contbr_city \
0 C00410118 P20002978 Bachmann, Michele HARVEY, WILLIAM MOBILE
1 C00410118 P20002978 Bachmann, Michele HARVEY, WILLIAM MOBILE
2 C00410118 P20002978 Bachmann, Michele SMITH, LANIER LANETT
3 C00410118 P20002978 Bachmann, Michele BLEVINS, DARONDA PIGGOTT
4 C00410118 P20002978 Bachmann, Michele WARDENBURG, HAROLD HOT SPRINGS NATION
contbr_st contbr_zip contbr_employer contbr_occupation \
0 AL 3.660103e+08 RETIRED RETIRED
1 AL 3.660103e+08 RETIRED RETIRED
2 AL 3.686334e+08 INFORMATION REQUESTED INFORMATION REQUESTED
3 AR 7.245483e+08 NONE RETIRED
4 AR 7.190165e+08 NONE RETIRED
contb_receipt_amt contb_receipt_dt receipt_desc memo_cd memo_text form_tp \
0 250 20-JUN-11 NaN NaN NaN SA17A
1 50 23-JUN-11 NaN NaN NaN SA17A
2 250 05-JUL-11 NaN NaN NaN SA17A
3 250 01-AUG-11 NaN NaN NaN SA17A
4 300 20-JUN-11 NaN NaN NaN SA17A
file_num tran_id election_tp
0 736166 A1FDABC23D2D545A1B83 P2012
1 736166 A899B9B0E223743EFA63 P2012
2 749073 A4B144E3CB3294ABC9D6 P2012
3 749073 A8C7C6F55B0914C5A880 P2012
4 736166 A070D6D560BD84AA98AC P2012

For reference, it’s more difficult to load this file in R (2.15.2), both because of its size and its malformedness (hopefully an R guru can tell me how to deal with this trailing delimiter crap). Setting row.names=NULL causes incorrect column labelling but at least gives us a parsing + type inference performance number (about 10x slower than pandas; faster if you specify all 18 column data types):

> system.time(df <- read.csv(path, row.names=NULL))
user system elapsed
220.250 2.100 222.662

If you know much about this data set, you know most of these columns are not interesting to analyze. New in pandas v0.10, you can specify a subset of columns right in read_csv, which results in both much faster parsing and lower memory usage (since we’re throwing away the data from the other columns after tokenizing the file):

In [6]: fec_cols = ['cand_nm', 'contbr_st', 'contbr_employer', 'contb_receipt_amt',
...: 'contbr_occupation', 'contb_receipt_amt', 'contb_receipt_dt']
In [7]: %time fec = read_csv(path, index_col=False, usecols=fec_cols)
CPU times: user 8.93 s, sys: 0.36 s, total: 9.29 s
Wall time: 9.31 s
In [8]: fec
Out[8]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5330834 entries, 0 to 5330833
Data columns:
cand_nm 5330834 non-null values
contbr_st 5330779 non-null values
contbr_employer 5295347 non-null values
contbr_occupation 5296749 non-null values
contb_receipt_amt 5330834 non-null values
contb_receipt_dt 5330834 non-null values
dtypes: float64(1), object(5)

Outside of file reading, a huge amount of work has been done elsewhere in pandas (aided by Chang She, Yoval P, Jeff Reback, and others). Performance has improved in many critical operations beyond parsing too (check out the groupby numbers!). Here's the output of a recent vbench run showing the latest dev version versus version 0.9.0 (numbers less than 1 indicate that the current pandas version is faster on average by that ratio):

master v0.9.0 ratio
name
unstack_sparse_keyspace 1.3068 140.7709 0.0093
groupby_frame_apply_overhead 18.5924 320.9360 0.0579
read_csv_comment2 25.2323 361.2320 0.0699
groupbym_frame_apply 73.6522 495.8570 0.1485
read_csv_thou_vb 35.3686 191.7560 0.1844
frame_iteritems_cached 0.0700 0.3736 0.1873
concat_small_frames 12.4304 54.4093 0.2285
join_dataframe_integer_2key 5.8004 18.3924 0.3154
append_frame_single_homogenous 0.3345 0.8811 0.3796
read_csv_vb 18.5832 46.5717 0.3990
read_csv_standard 12.2059 28.3020 0.4313
panel_from_dict_all_different_indexes 73.6467 149.4339 0.4928
frame_constructor_ndarray 0.0478 0.0916 0.5219
panel_from_dict_two_different_indexes 50.8969 88.4219 0.5756
groupby_last 3.6658 6.2642 0.5852
groupby_first 3.7455 6.2646 0.5979
frame_get_numeric_data 0.0677 0.1076 0.6295
frame_iteritems 2.5812 3.5731 0.7224
groupby_simple_compress_timing 37.6556 51.8415 0.7264
append_frame_single_mixed 1.2934 1.7739 0.7291
frame_to_csv 285.7749 387.6002 0.7373
join_dataframe_integer_key 1.7632 2.2799 0.7734
groupby_multi_size 30.8166 39.5899 0.7784
groupby_frame_singlekey_integer 2.3004 2.8959 0.7944
write_csv_standard 329.9391 411.1040 0.8026
groupby_multi_cython 16.6989 19.3612 0.8625
sparse_series_to_frame 158.4511 181.7181 0.8720
groupby_multi_series_op 15.4856 17.7564 0.8721
stat_ops_level_frame_sum_multiple 7.6714 8.7897 0.8728
stat_ops_level_series_sum 2.1563 2.4705 0.8728
stat_ops_level_frame_sum 3.0208 3.3616 0.8986
merge_2intkey_nosort 18.5616 20.6283 0.8998
dataframe_reindex_columns 0.3021 0.3343 0.9037
stats_rank_average 28.2949 30.8713 0.9165
reshape_stack_simple 2.6250 2.8527 0.9202
frame_reindex_both_axes 0.3739 0.4037 0.9262
melt_dataframe 1.4668 1.5812 0.9276
groupby_frame_median 7.1793 7.7170 0.9303
groupby_indices 6.8554 7.3550 0.9321
stat_ops_level_series_sum_multiple 7.2527 7.7637 0.9342
groupby_series_simple_cython 5.0585 5.3895 0.9386
merge_2intkey_sort 44.3559 47.1701 0.9403
timeseries_timestamp_tzinfo_cons 0.0159 0.0169 0.9435
join_dataframe_index_multi 20.7460 21.9524 0.9450
frame_reindex_both_axes_ix 0.4559 0.4810 0.9479
panel_from_dict_equiv_indexes 26.2508 27.6105 0.9508
index_datetime_intersection 15.1940 15.9497 0.9526
panel_from_dict_same_index 26.1363 27.3955 0.9540
timeseries_large_lookup_value 0.0234 0.0245 0.9566
frame_sort_index_by_columns 41.4619 43.3163 0.9572
frame_boolean_row_select 0.2905 0.3034 0.9574
indexing_dataframe_boolean_rows 0.2364 0.2468 0.9581
join_dataframe_index_single_key_bigger 15.0122 15.6593 0.9587
index_datetime_union 15.2101 15.8353 0.9605
reindex_frame_level_align 0.9618 0.9992 0.9625
dataframe_reindex_daterange 0.4052 0.4206 0.9634
lib_fast_zip 10.3956 10.6966 0.9719
read_table_multiple_date_baseline 1175.4630 1208.3721 0.9728
groupby_frame_cython_many_columns 3.6782 3.7668 0.9765
groupby_multi_python 54.6882 55.9723 0.9771
series_align_int64_index 35.9280 36.6402 0.9806
groupby_multi_different_numpy_functions 13.9577 14.2328 0.9807
read_table_multiple_date 2599.5600 2648.0811 0.9817
reindex_frame_level_reindex 0.9353 0.9506 0.9839
timeseries_asof_single 0.0565 0.0574 0.9840
concat_series_axis1 70.7297 71.8611 0.9843
replace_fillna 4.9988 5.0717 0.9856
reindex_multiindex 1.2701 1.2874 0.9866
groupby_pivot_table 18.8712 19.1218 0.9869
groupby_multi_different_functions 13.9056 14.0842 0.9873
indexing_dataframe_boolean_rows_object 0.4870 0.4932 0.9875
stats_rank2d_axis1_average 14.5004 14.6610 0.9890
frame_ctor_nested_dict 90.2146 91.1775 0.9894
frame_ctor_nested_dict_int64 131.1228 132.3512 0.9907
frame_reindex_axis1 2.8508 2.8740 0.9919
timeseries_asof_nan 9.9486 10.0186 0.9930
timeseries_asof 10.3370 10.4094 0.9930
stats_rank_average_int 22.5349 22.6918 0.9931
sort_level_zero 4.3147 4.3434 0.9934
frame_reindex_axis0 1.3561 1.3643 0.9940
dti_reset_index 0.6203 0.6237 0.9945
frame_fillna_many_columns_pad 15.3549 15.4276 0.9953
frame_drop_dup_inplace 2.8577 2.8690 0.9961
stats_rank2d_axis0_average 24.5411 24.6010 0.9976
timeseries_timestamp_downsample_mean 4.4788 4.4895 0.9976
match_strings 0.3809 0.3815 0.9984
timeseries_sort_index 22.2473 22.2516 0.9998
index_int64_intersection 25.5705 25.5643 1.0002
index_int64_union 82.9492 82.8681 1.0010
frame_insert_500_columns 99.3640 99.2359 1.0013
timeseries_add_irregular 21.6117 21.5676 1.0020
replace_replacena 5.0788 5.0680 1.0021
series_ctor_from_dict 3.6953 3.6821 1.0036
series_align_left_monotonic 13.1076 13.0529 1.0042
timeseries_period_downsample_mean 6.2799 6.2450 1.0056
timeseries_1min_5min_mean 0.6190 0.6135 1.0090
stat_ops_series_std 0.2522 0.2494 1.0113
frame_drop_dup_na_inplace 2.6487 2.6187 1.0115
reindex_daterange_backfill 0.1866 0.1842 1.0128
frame_drop_duplicates 18.7704 18.5208 1.0135
series_align_irregular_string 81.0545 79.7984 1.0157
frame_drop_duplicates_na 18.7666 18.4687 1.0161
timeseries_infer_freq 10.5702 10.4007 1.0163
lib_fast_zip_fillna 13.9561 13.6818 1.0200
reindex_daterange_pad 0.1876 0.1839 1.0202
timeseries_1min_5min_ohlc 0.6971 0.6815 1.0229
reindex_fillna_backfill 0.1426 0.1388 1.0269
sort_level_one 4.4267 4.2921 1.0314
datetimeindex_add_offset 0.2387 0.2307 1.0345
frame_fancy_lookup 2.5191 2.4317 1.0359
frame_fillna_inplace 16.9570 16.2857 1.0412
timeseries_slice_minutely 0.0626 0.0600 1.0433
reindex_fillna_pad 0.1459 0.1396 1.0453
frame_ctor_list_of_dict 98.7562 93.8414 1.0524
join_dataframe_index_single_key_bigger 6.6729 6.2370 1.0699
frame_fancy_lookup_all 26.3783 24.6056 1.0720
series_constructor_ndarray 0.0124 0.0115 1.0765
series_value_counts_int64 2.7498 2.5306 1.0866
reshape_unstack_simple 3.1889 2.8616 1.1144
indexing_panel_subset 0.5772 0.5063 1.1400
groupby_apply_dict_return 45.6793 39.6327 1.1526
timeseries_to_datetime_iso8601 4.1278 3.5348 1.1678
join_dataframe_index_single_key_small 6.8140 5.7951 1.1758
frame_to_string_floats 52.5621 35.4093 1.4844
reshape_pivot_time_series 181.4950 103.2100 1.7585
sparse_frame_constructor 5.9832 3.3596 1.7809
Columns: test_name | target_duration [ms] | baseline_duration [ms] | ratio
- A ratio of 1.30 means the target commit is 30% slower than the baseline.
Target [648d581] : ENH: index_col=False disables first column -> index. fixes FEC parsing problem. close #2442
Baseline [b5956fd] : RLS: Version 0.9.0 final

A new high performance, memory-efficient file parser engine for pandas

TL;DR: I’ve finally gotten around to building the high performance parser engine that pandas deserves. It hasn’t been released yet (it’s in a branch on GitHub), but it will be after I give it a month or so for any remaining buglets to shake out.

A project I’ve put off for a long time is building a high performance, memory efficient file parser for pandas. The existing code up through and including the imminent pandas 0.9.0 release has always been makeshift; the development focus has been on parser features over the more tedious (but actually much more straightforward) issue of creating a fast C table tokenizer. It’s been on the pandas roadmap for a long time:

http://github.com/pydata/pandas/issues/821

pandas.read_csv from pandas 0.5.0 onward is actually very fast (faster than R and much faster than numpy.loadtxt), but it uses a lot of memory. I wrote about some of the implementation issues about a year ago here. The key problem with the existing code is this: all of the existing parsing solutions in pandas as well as NumPy first read the file data into pure Python data structures, a list of tuples or a list of lists. If you have a very large file, a list of 1 million or 10 million Python tuples has an extraordinary memory footprint, significantly greater than the size of the file on disk (the footprint can be 5x or more, far too much). Some people have pointed out the large memory usage without correctly explaining why, but this is the one and only reason: too many intermediate Python data structures.
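To make the overhead concrete, here is a small illustrative sketch (mine, not code from any of the parsers) comparing the approximate footprint of a list of Python tuples against the equivalent float64 NumPy array:

import sys
import numpy as np

nrows, ncols = 100000, 10

# The intermediate representation: a list of Python tuples of float objects.
rows = [tuple(float(i + j) for j in range(ncols)) for i in range(nrows)]

# Rough footprint: the list's pointer array, each tuple object,
# and each float object the tuples reference.
list_bytes = sys.getsizeof(rows)
tuple_bytes = sum(sys.getsizeof(t) for t in rows)
float_bytes = sum(sys.getsizeof(x) for t in rows for x in t)
print('list of tuples: ~%.1f MB' % ((list_bytes + tuple_bytes + float_bytes) / 1e6))

# The final representation: a contiguous float64 array, 8 bytes per value.
arr = np.arange(nrows * ncols, dtype=np.float64).reshape(nrows, ncols)
print('float64 ndarray: ~%.1f MB' % (arr.nbytes / 1e6))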

Building a good parser engine isn’t exactly rocket science; we’re talking optimizing the implementation of dirt simple O(n) algorithms here. The task is divided into several key pieces:

  • File tokenization: read bytes from the file, identify where fields begin and end and which column each belongs to. Python’s csv module is an example of a tokenizer. Things like quoting conventions need to be taken into account. Doing this well in C is about picking the right data structures and making the code lean and mean. To be clear: if you design the tokenizer data structure wrong, you’ve lost before you’ve begun.
  • NA value filtering: detect NA (missing) value sentinels and convert them to the appropriate NA representation. Examples of NA sentinels are NA, #N/A, or other bespoke sentinels like -999. Practically speaking, this means keeping a hash set of strings considered NA and checking whether each parsed token is in the set (you can have different NA sets for each column, too; see the sketch after this list). If the number of sentinel values is small, you could use an array of C strings instead of a hash set.
  • Tolerating “bad” rows: Can aberrant rows be gracefully ignored with your consent? Is the error message informative?
  • Type inference / conversion: Converting the tokens in the file to the right C types (string, date, floating point, integer, boolean).
  • Skipping rows: Ignore certain rows in the file or at the end of the file.
  • Date parsing / value conversion: Convert one or more columns into timestamps. In some cases concatenate date/time information spread across multiple columns.
  • Handling of “index” columns: Handle row names appropriately, yielding a DataFrame with the expected row index.
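
As an illustration of the NA filtering step, here is a simplified Python sketch (not the C implementation) of checking each token against a set of sentinel strings, with optional per-column sentinels:

import numpy as np

DEFAULT_NA = {'', 'NA', 'N/A', '#N/A', 'NULL', 'nan'}

def filter_na(tokens, extra_na=()):
    # Convert tokens matching an NA sentinel to NaN; parse the rest as floats.
    sentinels = DEFAULT_NA | set(extra_na)
    out = np.empty(len(tokens), dtype=np.float64)
    for i, tok in enumerate(tokens):
        out[i] = np.nan if tok in sentinels else float(tok)
    return out

# A column-specific sentinel (-999) in addition to the defaults.
print(filter_na(['1.5', 'NA', '-999', '2.25'], extra_na=['-999']))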

    None of this is that hard; it’s made much more time consuming due to the proliferation of fine-grained options (and resulting “parameter hell”). Anyway, I finally mustered the energy to hack it out over a few intense days in late August and September. I’m hoping to ship it in a quick pandas 0.10 release (“version point-ten”) toward the end of October if possible. It would be nice to push this code upstream into NumPy to improve loadtxt and genfromtxt’s performance as well.

    Benchmarks against R, NumPy, Continuum’s IOPro

    Outside of parser features (i.e. “can the tool read my file correctly”), there are two performance areas of interest:

  • CPU Speed: how long does it take to parse the file?
  • Memory utilization: what’s the maximum amount of RAM used while the file is being parsed (including the final returned table)? There’s really nothing worse than your computer starting to swap when you try to parse a large file.

    I’ll compare the new pandas parser engine against a group of several tools that you can use to do the same job, including R’s parser functions:

  • R’s venerable read.csv and read.table functions
  • numpy.loadtxt: this is a pure Python parser, to be clear.
  • New pandas engine, via pandas.read_csv and read_table
  • A new commercial library, IOPro, from my good friends at Continuum Analytics.

    To do the performance analysis, I’ll look at 5 representative data sets:

  • A 100,000 x 50 CSV matrix of randomly generated 0’s and 1’s (a sketch for regenerating the two synthetic matrices appears after this list). It looks like this:
    In [3]: !head -n 5 parser_examples/zeros.csv
    0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49
    1,1,0,1,1,0,0,1,1,0,1,0,1,1,0,0,1,0,0,0,0,1,0,1,1,1,1,1,0,1,0,1,1,1,1,1,0,1,1,0,1,0,0,0,1,1,0,0,0,0
    1,0,0,1,1,1,0,0,1,1,1,0,1,1,0,0,1,0,1,1,0,1,1,1,1,1,1,0,1,1,1,0,0,0,0,1,0,1,0,1,0,0,0,0,1,1,0,1,1,1
    1,1,0,0,0,0,0,0,0,1,0,0,0,1,1,0,1,1,0,0,0,0,0,0,1,0,0,0,0,1,1,1,0,1,1,1,0,1,0,0,1,1,1,1,1,1,1,1,0,0
    0,1,0,0,1,1,0,0,0,0,0,0,0,0,1,0,1,1,0,1,0,1,0,1,0,0,1,1,0,1,1,0,0,0,0,0,1,0,1,1,1,0,0,0,1,0,1,0,0,1

  • A 1,000,000 x 10 CSV matrix of randomly generated normally distributed data. Looks like this:
    In [4]: !head -n 5 parser_examples/matrix.csv
    0,1,2,3,4,5,6,7,8,9
    0.609633439034,0.249525535926,0.180502465241,0.940871913454,-0.35702932376,1.12983701927,0.77045731318,-0.16976884026,-0.685520348835,0.216936429382
    0.76523368046,1.08405034644,1.2099841819,-0.858404123158,1.47061247583,-1.15728386054,-0.375685123416,-0.00475949800828,0.522530689417,0.485226447392
    -0.958266896007,-0.0583065555495,-0.17369448475,0.465274502954,0.92612769921,0.362029345941,-2.27118704972,0.944967722699,1.34525304565,1.60130304607
    -0.518406503139,-1.19158517434,0.064195872451,-2.244687656,0.947562908985,0.775078137538,0.160741686264,-0.706110036551,-0.780137216247,1.02794242373

  • The Federal Election Commission (FEC) data set as a CSV file. One of my favorite example data sets for talking about pandas. Here’s what it looks like when parsed with pandas.read_csv:
    In [2]: df
    Out[2]:
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 1001731 entries, 0 to 1001730
    Data columns:
    cmte_id 1001731 non-null values
    cand_id 1001731 non-null values
    cand_nm 1001731 non-null values
    contbr_nm 1001731 non-null values
    contbr_city 1001716 non-null values
    contbr_st 1001727 non-null values
    contbr_zip 1001620 non-null values
    contbr_employer 994314 non-null values
    contbr_occupation 994433 non-null values
    contb_receipt_amt 1001731 non-null values
    contb_receipt_dt 1001731 non-null values
    receipt_desc 14166 non-null values
    memo_cd 92482 non-null values
    memo_text 97770 non-null values
    form_tp 1001731 non-null values
    file_num 1001731 non-null values
    dtypes: float64(1), int64(1), object(14)

  • Wikipedia page count data used for benchmarks in this blog post. It’s delimited by single spaces and has no column header:
    In [8]: df
    Out[8]:
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 6078103 entries, 0 to 6078102
    Data columns:
    X.1 6077987 non-null values
    X.2 6078090 non-null values
    X.3 6078103 non-null values
    X.4 6078103 non-null values
    dtypes: int64(2), object(2)
    In [9]: df.head()
    Out[9]:
    X.1 X.2 X.3 X.4
    0 aa.b Special%3aStatistics 1 18127
    1 aa.b Special:WhatLinksHere/User:Sir_Lestaty_de_Lion... 1 5325
    2 aa.b User:EVula 1 21080
    3 aa.b User:EVula/header 1 17332
    4 aa.b User:Manecke 1 21041

  • A large numerical astronomy data set used for benchmarks in this blog post. Looks like this:
    In [19]: df
    Out[19]:
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 6949386 entries, 0 to 6949385
    Data columns:
    objectid (long) 6949386 non-null values
    right ascension (float) 6949386 non-null values
    declination (float) 6949386 non-null values
    ultraviolet (double) 6949386 non-null values
    green (double) 6949386 non-null values
    red (double) 6949386 non-null values
    infrared (double) 6949386 non-null values
    z (double) 6949386 non-null values
    dtypes: float64(7), int64(1)
    In [20]: df[:2].T
    Out[20]:
    0 1
    objectid (long) 7.588828e+17 7.588828e+17
    right ascension (float) 2.634087e+02 2.634271e+02
    declination (float) 6.278961e+00 6.310742e+00
    ultraviolet (double) 2.459675e+01 2.330080e+01
    green (double) 2.347177e+01 2.275493e+01
    red (double) 2.169188e+01 2.188667e+01
    infrared (double) 2.118722e+01 2.066283e+01
    z (double) 2.043528e+01 2.135766e+01

    Here’s a link to an archive of all the datasets (warning: about 500 megabytes): Table datasets
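    The two synthetic matrices are easy to regenerate if you don’t want the download; here is a rough sketch (the exact values will differ from the files in the archive):

    import numpy as np
    import pandas as pd

    rng = np.random.RandomState(42)

    # 100,000 x 50 matrix of random 0s and 1s (like zeros.csv).
    zeros = pd.DataFrame(rng.randint(0, 2, size=(100000, 50)))
    zeros.to_csv('zeros.csv', index=False)

    # 1,000,000 x 10 matrix of standard normal draws (like matrix.csv).
    matrix = pd.DataFrame(rng.standard_normal((1000000, 10)))
    matrix.to_csv('matrix.csv', index=False)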

    I don’t have time to compare features (which vary greatly across the tools).

    Oh, and my rig:

  • Core i7 950 @ 3.07 GHz
  • 24 GB of ram (so we won’t get close to swapping)
  • OCZ Vertex 3 Sata 3 SSD
  • (Because I have an SSD, I would expect benchmarks on spinning rust to differ by roughly a constant amount based on the time needed to slurp the bytes off the disk. In my case, the disk reads aren’t a major factor. In corporate environments with NFS servers under heavy load, you would expect similar reads to take a bit longer.)

    CPU Performance benchmarks

    So numpy.loadtxt is really slow, and I’m excluding it from the benchmarks. On the smallest and simplest file in these benchmarks, it’s more than 10 times slower than the new pandas parser:

    In [27]: timeit read_csv('zeros.csv')
    1 loops, best of 3: 415 ms per loop
    In [29]: %time arr = np.loadtxt('zeros.csv', delimiter=',', dtype=np.int_, skiprows=1)
    CPU times: user 4.88 s, sys: 0.04 s, total: 4.92 s
    Wall time: 4.92 s

    Here are the results for everybody else (see code at end of post):

    Here are the raw numbers in seconds:

    In [30]: results
    Out[30]:
                       iopro    pandas       R
    astro          17.646228  6.955254  37.030
    double-matrix   3.377430  1.279502   6.920
    fec             3.685799  2.306570  18.121
    wikipedia      11.752624  4.369659  42.250
    zero-matrix     0.673885  0.268830   0.616

    IOPro vs. new pandas parser: look closer

    But hey, wait a second. If you are intimately familiar with IOPro and pandas you will already be saying that I am not making an apples to apples comparison. True. Why not?

  • IOPro does not check for and substitute common NA sentinel values (I believe you can give it a list of values to check for; the documentation was a bit hard to work out in this regard).
  • IOPro returns NumPy arrays with structured dtype. A pandas DataFrame has a slightly different internal format, and strings are boxed as Python objects rather than stored in NumPy string dtype arrays.

    To level the playing field, I’ll disable the NA filtering logic in pandas (passing na_filter=False) and instruct the parser to return a structured array instead of a DataFrame (as_recarray=True). Secondly, let’s look only at the numerical datasets (excluding wikipedia and fec for now) to remove the impact of string datatype handling.
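    For concreteness, here is roughly what the leveled pandas call looks like (a sketch; as_recarray was an option in the pandas versions discussed here and has since been removed, with df.to_records(index=False) as the modern equivalent):

    import pandas as pd

    # Default path: NA filtering on, returns a DataFrame.
    df = pd.read_csv('parser_examples/matrix.csv')

    # Leveled path for the IOPro comparison: skip the NA sentinel check and
    # return a NumPy structured array instead of a DataFrame.
    arr = pd.read_csv('parser_examples/matrix.csv', na_filter=False, as_recarray=True)

    Here is the resulting graph (with relative timings):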

    It looks like the savings of not passing all the tokens through the NA filter is balanced by the cost of transferring the column arrays into the structured array (which is a raw array of bytes interpreted as a table by NumPy). This could very likely be made faster (more cache-efficient) than it currently is with some effort.

    Memory usage benchmarks

    Profiling peak memory usage is a tedious process. The canonical tool for the job is Massif from the Valgrind suite. I’m not yet done obsessing over memory allocation and data management inside the parser system, but here’s what the numbers look like compared with R and IOPro. I’m using the following valgrind commands (plus ms_print) to get this output (if this is not correct, please someone tell me):

    valgrind --tool=massif --depth=1 python -c command
    ms_print massif_output_file

    I’ll use the largest file in this post, the astro numerical dataset.

    First, IOPro advertises a very low memory footprint. It does not, however, avoid having 2 copies of the data set in memory (I don’t either; it’s actually very difficult, and costly, to avoid this). Here is the final output of ms_print showing peak memory usage at the very end, when the structured array is created and returned:

    --------------------------------------------------------------------------------
    n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
    --------------------------------------------------------------------------------
    65 91,209,383,985 467,332,576 467,252,659 79,917 0
    66 92,193,310,604 467,332,576 467,252,659 79,917 0
    67 93,154,564,618 912,093,904 908,774,712 3,319,192 0
    99.64% (908,774,712B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
    ->48.76% (444,761,193B) 0x6E1F5B4: PyArray_NewFromDescr (in /home/wesm/epd/lib/python2.7/site-packages/numpy/core/multiarray.so)
    |
    ->48.76% (444,760,704B) 0x6E0BEAD: PyArray_Resize (in /home/wesm/epd/lib/python2.7/site-packages/numpy/core/multiarray.so)
    |
    ->01.15% (10,485,760B) 0x65391E1: open_text_adapter (text_adapter.c:58)
    |
    ->00.96% (8,767,055B) in 1+ places, all below ms_print's threshold (01.00%)

    Let’s look at R. Peak memory allocation comes in slightly under IOPro at 903MM bytes vs. 912MM:

    --------------------------------------------------------------------------------
    n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
    --------------------------------------------------------------------------------
    71 227,687,337,998 635,817,352 635,766,423 50,929 0
    72 227,723,703,780 681,469,608 681,418,671 50,937 0
    73 227,758,451,071 737,064,752 737,013,799 50,953 0
    74 227,803,513,492 792,659,896 792,608,927 50,969 0
    75 227,848,799,505 848,255,040 848,204,055 50,985 0
    76 227,893,861,838 903,850,184 903,799,183 51,001 0
    77 227,937,535,005 903,850,184 903,799,183 51,001 0
    99.99% (903,799,183B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
    ->98.52% (890,458,496B) 0x4F45020: Rf_allocVector (memory.c:2388)
    |
    ->01.48% (13,340,687B) in 108 places, all below massif's threshold (01.00

    In the new pandas parser, I’ll look at 2 things: memory allocation by the parser engine before creation of the final DataFrame (which causes data-doubling as with IOPro) and the user-facing read_csv. First, the profile of using read_csv (which also creates a simple integer Index for the DataFrame) uses 1014MM bytes, about 10% more than either of the above:

    --------------------------------------------------------------------------------
    n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
    --------------------------------------------------------------------------------
    70 56,100,696,245 1,014,234,384 965,776,492 48,457,892 0
    95.22% (965,776,492B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
    ->93.19% (945,117,309B) 0x6B9E5B4: PyArray_NewFromDescr (in /home/wesm/epd/lib/python2.7/site-packages/numpy/core/multiarray.so)
    |
    ->02.04% (20,659,183B) in 145 places, all below massif's threshold (01.00%)

    Considering only the parser engine (which returns a dict of arrays, i.e. no data doubling), peak usage is only 570MM bytes:

    --------------------------------------------------------------------------------
    n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
    --------------------------------------------------------------------------------
    56 39,618,471,594 569,619,528 521,140,509 48,479,019 0
    91.49% (521,140,509B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
    ->87.84% (500,356,421B) 0x6B9E5B4: PyArray_NewFromDescr (in /home/wesm/epd/lib/python2.7/site-packages/numpy/core/multiarray.so)
    |
    ->01.95% (11,084,760B) in 128 places, all below massif's threshold (01.00%)
    |
    ->01.70% (9,699,328B) 0x4EADC3B: PyObject_Malloc (obmalloc.c:580)

    Memory usage with non-numerical data depends on a lot of issues surrounding the handling of string data. Let’s consider the FEC data set, where pandas does pretty well out of the box, using only 415MM bytes at peak (I realized why it was so high while writing this article; I will reduce it soon):

    --------------------------------------------------------------------------------
    n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
    --------------------------------------------------------------------------------
    88 12,473,186,057 415,804,184 348,381,546 67,422,638 0
    83.79% (348,381,546B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
    ->63.60% (264,457,869B) 0x6B9E5B4: PyArray_NewFromDescr (in /home/wesm/epd/lib/python2.7/site-packages/numpy/core/multiarray.so)
    |
    ->14.88% (61,865,984B) 0x4EADC3B: PyObject_Malloc (obmalloc.c:580)
    |
    ->02.34% (9,722,205B) in 149 places, all below massif's threshold (01.00%)
    |
    ->01.91% (7,959,808B) 0x4E9799C: fill_free_list (intobject.c:52)
    |
    ->01.05% (4,375,680B) 0x4EA5147: dictresize (dictobject.c:632)

    IOPro out of the box uses 3 times more. This would obviously be completely undesirable:

    --------------------------------------------------------------------------------
    n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
    --------------------------------------------------------------------------------
    76 15,039,940,724 1,232,035,520 828,653,666 403,381,854 0
    67.26% (828,653,666B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
    ->32.85% (404,699,813B) 0x6E1F5B4: PyArray_NewFromDescr (in /home/wesm/epd/lib/python2.7/site-packages/numpy/core/multiarray.so)
    |
    ->32.85% (404,699,324B) 0x6E0BEAD: PyArray_Resize (in /home/wesm/epd/lib/python2.7/site-packages/numpy/core/multiarray.so)
    |
    ->01.56% (19,254,529B) in 114 places, all below massif's threshold (01.00%)

    What about R? It may not be fast but it uses the least memory again:

    --------------------------------------------------------------------------------
    n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
    --------------------------------------------------------------------------------
    75 51,948,622,306 259,260,296 258,997,057 263,239 0
    99.90% (258,997,057B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
    ->74.63% (193,486,656B) 0x4F45020: Rf_allocVector (memory.c:2388)
    |
    ->23.57% (61,112,304B) 0x4F445C0: GetNewPage (memory.c:786)
    |
    ->01.14% (2,951,289B) 0x4FBDD9B: do_lazyLoadDBfetch (serialize.c:2335)
    |
    ->00.56% (1,446,808B) in 1+ places, all below ms_print's threshold (01.00%)

    You might be wondering why IOPro uses so much memory. The problem is fixed-width string types:

    In [2]: adap = iopro.text_adapter('P00000001-ALL.csv')
    In [3]: arr = adap[:]
    In [4]: arr
    Out[4]:
    array([ ('C00410118', 'P20002978', 'Bachmann, Michelle', 'HARVEY, WILLIAM', 'MOBILE', 'AL', '366010290', 'RETIRED', 'RETIRED', 250.0, '20-JUN-11', '', '', '', 'SA17A', 736166L),
    ('C00410118', 'P20002978', 'Bachmann, Michelle', 'HARVEY, WILLIAM', 'MOBILE', 'AL', '366010290', 'RETIRED', 'RETIRED', 50.0, '23-JUN-11', '', '', '', 'SA17A', 736166L),
    ('C00410118', 'P20002978', 'Bachmann, Michelle', 'SMITH, LANIER', 'LANETT', 'AL', '368633403', 'INFORMATION REQUESTED', 'INFORMATION REQUESTED', 250.0, '05-JUL-11', '', '', '', 'SA17A', 749073L),
    ...,
    ('C00500587', 'P20003281', 'Perry, Rick', 'GRANE, BRYAN F. MR.', 'INFO REQUESTED', 'XX', '99999', 'INFORMATION REQUESTED PER BEST EFFORTS', 'INFORMATION REQUESTED PER BEST EFFORTS', 500.0, '29-SEP-11', '', '', '', 'SA17A', 751678L),
    ('C00500587', 'P20003281', 'Perry, Rick', 'TOLBERT, DARYL MR.', 'INFO REQUESTED', 'XX', '99999', 'T.A.C.C.', 'LONGWALL MAINTENANCE FOREMAN', 500.0, '30-SEP-11', '', '', '', 'SA17A', 751678L),
    ('C00500587', 'P20003281', 'Perry, Rick', 'ANDERSON, MARILEE MRS.', 'INFO REQUESTED', 'XX', '99999', 'INFORMATION REQUESTED PER BEST EFFORTS', 'INFORMATION REQUESTED PER BEST EFFORTS', 2500.0, '31-AUG-11', '', '', '', 'SA17A', 751678L)],
    dtype=[('cmte_id', '|S9'), ('cand_id', '|S9'), ('cand_nm', '|S30'), ('contbr_nm', '|S57'), ('contbr_city', '|S29'), ('contbr_st', '|S2'), ('contbr_zip', '|S9'), ('contbr_employer', '|S38'), ('contbr_occupation', '|S38'), ('contb_receipt_amt', '<f8'), ('contb_receipt_dt', '|S9'), ('receipt_desc', '|S76'), ('memo_cd', '|S1'), ('memo_text', '|S76'), ('form_tp', '|S5'), ('file_num', '<u8')])

    Oof. A dtype of S38 or S76 means that every entry in that field uses 38 or 76 bytes. This is not good, so let’s set a bunch of these fields to use Python objects, as pandas does:

    adap = iopro.text_adapter('P00000001-ALL.csv')
    adap.set_field_types({2: object, 3: object, 4: object,
                          7: object, 8: object, 13: object})
    arr = adap[:]

    Here’s the Massif peak usage, which is reasonably in line with pandas:

    --------------------------------------------------------------------------------
    n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
    --------------------------------------------------------------------------------
    39 17,492,316,089 447,952,304 399,945,549 48,006,755 0
    89.28% (399,945,549B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
    ->64.31% (288,096,256B) 0x4EADC3B: PyObject_Malloc (obmalloc.c:580)
    |
    ->10.73% (48,083,577B) 0x6E1F5B4: PyArray_NewFromDescr (in /home/wesm/epd/lib/python2.7/site-packages/numpy/core/multiarray.so)
    |
    ->10.73% (48,083,088B) 0x6E0BEAD: PyArray_Resize (in /home/wesm/epd/lib/python2.7/site-packages/numpy/core/multiarray.so)
    |
    ->02.34% (10,485,760B) 0x65391E1: open_text_adapter (text_adapter.c:58)
    |
    ->01.16% (5,196,868B) in 113 places, all below massif's threshold (01.00%)

    Conclusions

    I’m very happy to finally see this project to completion. Python users have been suffering for years from parsers that 1) have few features, 2) are slow, and 3) use a lot of memory. In pandas I focused first on features, then on speed, and now on both speed and memory. I’m very pleased with how it turned out. I’m excited to see the code hopefully pushed upstream into NumPy when I can get some help with the integration and plumbing (and parameter hell).

    It will be a month or so before this code appears in a new release of pandas (we are about to release version 0.9.0) as I want to let folks on the bleeding edge find any bugs before releasing it to the masses.

    Future work and extensions

    Several things could (and should) be added to the parser with comparatively little effort:

  • Integrate a regular expression engine to tokenize lines with multi-character delimiters or regular expressions.
  • Code up the fixed-width-field version of the tokenizer
  • Add on-the-fly decompression of GZIP’d files

    Code used for performance and memory benchmarks

    R code (I just copy-pasted the output I got from each command). Version 2.14.0:

    system.time(df <- read.csv('parser_examples/zeros.csv', colClasses=rep("integer", 50)))
    user system elapsed
    0.616 0.004 0.623
    system.time(df <- read.csv('parser_examples/matrix.csv', colClasses=rep("numeric", 10)))
    user system elapsed
    6.920 0.136 7.071
    system.time(df <- read.csv('parser_examples/sdss6949386.csv', colClasses=rep("numeric", 8)))
    user system elapsed
    37.030 0.804 37.866
    system.time(df <- read.table('parser_examples/pagecounts-20110331-220000', sep=" ",
               header=F,
    colClasses=c("character", "character", "integer", "numeric")))
    user system elapsed
    42.250 0.356 42.651
    system.time(df <- read.csv('parser_examples/P00000001-ALL.csv'))
    user system elapsed
    18.121 0.212 18.350

    Requirements for EuroSciPy 2012 pandas tutorial

    Much belatedly, here are the requirements for the EuroSciPy tutorial on pandas in Brussels tomorrow. They are the same as for the SciPy 2012 (Austin) tutorial:

  • NumPy 1.6.1 or higher
  • matplotlib 1.0 or higher
  • IPython 0.13 or higher, HTML notebook dependencies
  • pandas 0.8.1 (or better, GitHub master revision) and dependencies (dateutil, pytz)

    One of the easiest ways to get started from scratch is to install EPDFree and then install pandas 0.8.1 or higher. If you’re using 0.8.1, we may run into a few minor bugs that I have fixed since the last release.

    Finally, GitHub listens! The importance of attention to detail in UX

    I like using GitHub, as do apparently a lot of other people. When doing pull requests, I’ve been annoyed for a long time by the amount of clicking necessary to get the git URL to the contributor’s repository (which must be added as a remote, etc.):

  • Click on new contributor
  • Locate pandas fork on their list of repositories
  • Copy git link from their fork, git remote add and I’m in business

    But GitHub is about using git, so why should I have to go fishing for the git link? Earlier, the Pull Request UI had a help box that contained the git URL (which I used), but then they took it away! So fishing it was. This might seem very minor, but after adding many dozens of remotes like that, the annoyance had accumulated. It seemed obvious to me that the branch names should just be links:

    Each time GitHub tweaked the UI, I complained on Twitter, like so:


    “Fffffff @ changes the pull request UI again but the branch names still are not links to the repository?” (@wesmckinn, Wes McKinney)

    Finally, someone noticed:


    “@ Great idea. Links added as of a few minutes ago.” (@cameronmcefee, Cameron McEfee)

    Small victories, I guess. Enjoy.

    Latest Table of Contents for Python for Data Analysis

    Making some progress on Python for Data Analysis

    Intro to Python for Financial Data Analysis at General Assembly

    I taught a class this past Monday, June 18, at General Assembly. Here are the (very brief) slides and a link to the IPython notebooks. You’ll need at least pandas 0.8.0b2, though unfortunately I identified a few bugs during the class that have since been fixed. Look out for the final release of pandas 0.8.0 any day now.