<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Wes McKinney&#039;s Blog</title>
	<atom:link href="http://wesmckinney.com/blog/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://wesmckinney.com/blog</link>
	<description>A blog about scientific Python, data analysis, statistics, algorithms, and hacking of all kinds</description>
	<lastBuildDate>Sat, 15 Jun 2013 05:00:52 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2</generator>
		<item>
		<title>PyCon Singapore 2013</title>
		<link>http://wesmckinney.com/blog/?p=687</link>
		<comments>http://wesmckinney.com/blog/?p=687#comments</comments>
		<pubDate>Sat, 15 Jun 2013 05:00:52 +0000</pubDate>
		<dc:creator>Wes McKinney</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://wesmckinney.com/blog/?p=687</guid>
		<description><![CDATA[I was graciously invited to give the keynote presentation at this year&#8217;s PyCon Singapore. Luckily, I love to hack on long plane rides. See the slides from the talk below. I showed some analytics on Python posts on Stackoverflow during &#8230;<p class="read-more"><a href="http://wesmckinney.com/blog/?p=687">Read more &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<p>I was graciously invited to give the keynote presentation at this year&#8217;s PyCon Singapore. Luckily, I love to hack on long plane rides. See the slides from the talk below. I showed some analytics on Python posts on Stackoverflow during the talk, <a href="http://nbviewer.ipython.org/urls/gist.github.com/wesm/5786965/raw/ed219255583af7dea499577a134a9ff8625a6f25/gistfile1.txt" title="stackoverflow analytics" target="_blank">here is the IPython notebook</a>. The raw data is <a href="https://www.dropbox.com/sh/3n4gyym1xjzdjfw/1QUflfXAsS" title="Python stackoverflow data" target="_blank">right here</a>.</p>
<p>I also gave a half day pandas tutorial, <a href="http://nbviewer.ipython.org/urls/gist.github.com/wesm/5773719/raw/1399562c0a02b9edc3d13c71a70387a31d87260b/tutorial.ipynb" title="tutorial notebook" target="_blank">here is the IPython notebook</a>. You will need the data to do it yourself, here&#8217;s a <a href="https://www.dropbox.com/s/zux0ed09j4fp5fy/tutorial_files.zip" title="PyconSGTutorials" target="_blank">download link</a>.</p>
<p><iframe src="http://www.slideshare.net/slideshow/embed_code/23010960" width="427" height="356" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC;border-width:1px 1px 0;margin-bottom:5px" allowfullscreen webkitallowfullscreen mozallowfullscreen> </iframe>
<div style="margin-bottom:5px"> <strong> <a href="http://www.slideshare.net/wesm/pycon-sg-slides" title="PyCon Singapore 2013 Keynote" target="_blank">PyCon Singapore 2013 Keynote</a> </strong> from <strong><a href="http://www.slideshare.net/wesm" target="_blank">wesm</a></strong> </div>
]]></content:encoded>
			<wfw:commentRss>http://wesmckinney.com/blog/?feed=rss2&#038;p=687</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>I&#8217;m moving to San Francisco. And hiring</title>
		<link>http://wesmckinney.com/blog/?p=651</link>
		<comments>http://wesmckinney.com/blog/?p=651#comments</comments>
		<pubDate>Fri, 22 Mar 2013 20:43:08 +0000</pubDate>
		<dc:creator>Wes McKinney</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://wesmckinney.com/blog/?p=651</guid>
		<description><![CDATA[PyCon and PyData 2013 were a blast this last week. Several people noted that my GitHub activity on pandas hasn&#8217;t quite been the same lately and wondered if I was getting a little burned out. I&#8217;m happy to say quite &#8230;<p class="read-more"><a href="http://wesmckinney.com/blog/?p=651">Read more &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<p>PyCon and PyData 2013 were a blast this last week. Several people noted that my GitHub activity on pandas hasn&#8217;t quite been the same lately and wondered if I was getting a little burned out. I&#8217;m happy to say quite the opposite; I&#8217;ll still be involved with pandas development (though not the 80 hours/week of the last 2 years), but I&#8217;m starting an ambitious new data project that I&#8217;m looking forward to sharing later this year. This endeavor is also taking me from New York to San Francisco. I&#8217;m sad to be leaving the (vibrant and growing) New York data community, but also looking forward to spending more time with my data hacking friends in the Bay Area.</p>
<p>While I can&#8217;t share too many details about the new startup, anyone who knows me knows my passion for innovation in data tooling and making people and organizations more productive in their data analysis. <a href="http://pandas.pydata.org" target="_blank">pandas</a> was Act 1! So, I&#8217;m building a world class team of engineers, designers, and forward thinkers to join me in this effort. Are you one of them? If so I look forward to hearing from you (my e-mail address can be easily located on GitHub):</p>
<h3>Front-end and Data Visualization Engineer</h3>
<p>You will be building a richly featured web application that will stretch the capabilities of modern web browsers. You will build core UI components and work with the UX designer and backend team to make everything work seamlessly. </p>
<ul>
<li>Extensive Javascript and CSS experience.</li>
<li>Experience with one or more SVG or Canvas-based visualization toolkits, i.e. D3.js, or a keen interest in learning. Maybe you&#8217;ve spent a lot of time on http://bl.ocks.org. Extra points if you have opinions about the Grammar of Graphics (or its implementations, like ggplot2).</li>
<li>Know the ins and outs of websockets and AJAX communications with various backend data services.</li>
<li>Understand data binding and have used MV* frameworks enough to be dangerous.</li>
<li>Prior data analysis experience (even at the Excel level) very useful.</li>
</ul>
<h3>Data Engineer</h3>
<p>You are a pragmatic, performance-motivated cruncher of bytes who knows what it means to ship code on tight deadlines. You have high standards but are willing to make tradeoffs to get stuff done. You and I will spend a lot of time at the whiteboard talking about the nuts and bolts of data processing. Some of these things may describe you:</p>
<ul>
<li>Experience building performance and latency-sensitive, data-driven analytical applications.</li>
<li>Knowledge of standard data structures and algorithms for data processing, their implementation details, and performance tradeoffs (hash tables, vectors, binary trees, sorting algorithms, etc.). Maybe you already have enjoyed reading my blog.</li>
<li>Knowledge of binary data formats, serialization schemes, compression, and other IO performance considerations. Familiar with a variety of database technology.</li>
<li>Python and C/C++ experience preferred. Extra points if you have programmed in an APL dialect (J or K/Q/Kona) or solved 100 or more problems on Project Euler.</li>
<li>You are a firm believer in unit testing and continuous integration.</li>
<li>Experience with building distributed data systems and analytics tools such as Spark, Crunch, or Pig a plus. Maybe you loved the Google Dremel white paper.</li>
<li>Experience with code generation (e.g. LLVM) or compiler technology a big plus.</li>
</ul>
<h3>UX Designer</h3>
<p>You will be responsible for web design and crafting a compelling user experience. You will work intimately with the front-end / data visualization team to make a consistent look and feel throughout the product. We would prefer a designer who can also do her/his own JS/CSS implementation work if necessary. Any experience with data analysis tools, from Excel to Spotfire to Matlab (with opinions about what makes each of them easy&#8211;or terrible&#8211;to use), would be strongly preferred.</p>
<h3>Full Stack Web Engineer</h3>
<p>You will play a critical role in building a scalable, reliable web application, and generally keeping the trains running on time. You should be a jack of many trades with an interest in learning many more. Here are some desirable qualities:</p>
<ul>
<li>Extensive experience using Python, Javascript (Node.js), etc. to build scalable, data-intensive web applications, and with strong opinions about the right technology to use.</li>
<li>Comfort with managing continuous (or highly frequent) deployments.</li>
<li>Experience with using and managing SQL (e.g. Postgres) and NoSQL (e.g. MongoDB) databases.</li>
<li>Experience building applications EC2 or other cloud computing services.</li>
</ul>
<h3>Product Lead</h3>
<p>You will run product at the company, working to gain a deep understanding of customer use cases and working with the engineering team to drive product-market fit. Prior experience in the data analytics or business intelligence space would be very helpful.</p>
]]></content:encoded>
			<wfw:commentRss>http://wesmckinney.com/blog/?feed=rss2&#038;p=651</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>Whirlwind tour of pandas in 10 minutes</title>
		<link>http://wesmckinney.com/blog/?p=647</link>
		<comments>http://wesmckinney.com/blog/?p=647#comments</comments>
		<pubDate>Mon, 11 Feb 2013 22:17:51 +0000</pubDate>
		<dc:creator>Wes McKinney</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://wesmckinney.com/blog/?p=647</guid>
		<description><![CDATA[10-minute tour of pandas from Wes McKinney on Vimeo.]]></description>
			<content:encoded><![CDATA[<p><iframe src="http://player.vimeo.com/video/59324550" width="500" height="309" frameborder="0" webkitAllowFullScreen mozallowfullscreen allowFullScreen></iframe>
<p><a href="http://vimeo.com/59324550">10-minute tour of pandas</a> from <a href="http://vimeo.com/user10077863">Wes McKinney</a> on <a href="http://vimeo.com">Vimeo</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://wesmckinney.com/blog/?feed=rss2&#038;p=647</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Update on upcoming pandas v0.10, new file parser, other performance wins</title>
		<link>http://wesmckinney.com/blog/?p=635</link>
		<comments>http://wesmckinney.com/blog/?p=635#comments</comments>
		<pubDate>Mon, 10 Dec 2012 20:58:33 +0000</pubDate>
		<dc:creator>Wes McKinney</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://wesmckinney.com/blog/?p=635</guid>
		<description><![CDATA[We&#8217;re hard at work as usual getting the next major pandas release out. I hope you&#8217;re as excited as I am! An interesting problem came up recently on the ever-popular FEC Disclosure database used in my book and in many &#8230;<p class="read-more"><a href="http://wesmckinney.com/blog/?p=635">Read more &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<p>We&#8217;re hard at work as usual getting the next major pandas release out. I hope you&#8217;re as excited as I am! An interesting problem came up recently on the ever-popular <a href="http://www.fec.gov/disclosurep/PDownload.do" title="FEC data" target="_blank">FEC Disclosure database</a> used in my book and in many pandas demos. The powers that be decided it would be cool if they put commas at the end of each line; fooling most CSV readers into thinking there are empty fields at the end of each line:</p>
<p><pre><code>In [4]: path
Out[4]: '/home/wesm/Downloads/P00000001-ALL.csv'

In [5]: !head $path -n 5
cmte_id,cand_id,cand_nm,contbr_nm,contbr_city,contbr_st,contbr_zip,contbr_employer,contbr_occupation,contb_receipt_amt,contb_receipt_dt,receipt_desc,memo_cd,memo_text,form_tp,file_num,tran_id,election_tp
C00410118,&quot;P20002978&quot;,&quot;Bachmann, Michele&quot;,&quot;HARVEY, WILLIAM&quot;,&quot;MOBILE&quot;,&quot;AL&quot;,&quot;366010290&quot;,&quot;RETIRED&quot;,&quot;RETIRED&quot;,250,20-JUN-11,&quot;&quot;,&quot;&quot;,&quot;&quot;,&quot;SA17A&quot;,&quot;736166&quot;,&quot;A1FDABC23D2D545A1B83&quot;,&quot;P2012&quot;,
C00410118,&quot;P20002978&quot;,&quot;Bachmann, Michele&quot;,&quot;HARVEY, WILLIAM&quot;,&quot;MOBILE&quot;,&quot;AL&quot;,&quot;366010290&quot;,&quot;RETIRED&quot;,&quot;RETIRED&quot;,50,23-JUN-11,&quot;&quot;,&quot;&quot;,&quot;&quot;,&quot;SA17A&quot;,&quot;736166&quot;,&quot;A899B9B0E223743EFA63&quot;,&quot;P2012&quot;,
C00410118,&quot;P20002978&quot;,&quot;Bachmann, Michele&quot;,&quot;SMITH, LANIER&quot;,&quot;LANETT&quot;,&quot;AL&quot;,&quot;368633403&quot;,&quot;INFORMATION REQUESTED&quot;,&quot;INFORMATION REQUESTED&quot;,250,05-JUL-11,&quot;&quot;,&quot;&quot;,&quot;&quot;,&quot;SA17A&quot;,&quot;749073&quot;,&quot;A4B144E3CB3294ABC9D6&quot;,&quot;P2012&quot;,
C00410118,&quot;P20002978&quot;,&quot;Bachmann, Michele&quot;,&quot;BLEVINS, DARONDA&quot;,&quot;PIGGOTT&quot;,&quot;AR&quot;,&quot;724548253&quot;,&quot;NONE&quot;,&quot;RETIRED&quot;,250,01-AUG-11,&quot;&quot;,&quot;&quot;,&quot;&quot;,&quot;SA17A&quot;,&quot;749073&quot;,&quot;A8C7C6F55B0914C5A880&quot;,&quot;P2012&quot;,</code></pre></p>
<p>pandas&#8217;s file parsers by default will treat the first column as the DataFrame&#8217;s row names if the data have 1 too many columns, which is very useful in a lot of cases. Not so much here. So I made it so you can indicate <kbd>index_col=False</kbd> which results on the last column being dropped as desired. The FEC data file is now about 900MB and takes only 20 seconds to load on my spinning-rust box:</p>
<p><pre><code>In [1]: path = '/home/wesm/Downloads/P00000001-ALL.csv'

In [2]: %time fec = read_csv(path, index_col=False)
CPU times: user 19.29 s, sys: 1.04 s, total: 20.33 s
Wall time: 20.45 s

In [3]: fec
Out[3]: 
&lt;class 'pandas.core.frame.DataFrame'&gt;
Int64Index: 5330834 entries, 0 to 5330833
Data columns:
cmte_id              5330834  non-null values
cand_id              5330834  non-null values
cand_nm              5330834  non-null values
contbr_nm            5330834  non-null values
contbr_city          5330773  non-null values
contbr_st            5330779  non-null values
contbr_zip           5330472  non-null values
contbr_employer      5295347  non-null values
contbr_occupation    5296749  non-null values
contb_receipt_amt    5330834  non-null values
contb_receipt_dt     5330834  non-null values
receipt_desc         50420  non-null values
memo_cd              1165554  non-null values
memo_text            1182573  non-null values
form_tp              5330834  non-null values
file_num             5330834  non-null values
tran_id              5330834  non-null values
election_tp          5330135  non-null values
dtypes: float64(1), int64(1), object(16)

In [4]: fec.head()
Out[4]: 
     cmte_id    cand_id            cand_nm           contbr_nm         contbr_city  \
0  C00410118  P20002978  Bachmann, Michele     HARVEY, WILLIAM              MOBILE   
1  C00410118  P20002978  Bachmann, Michele     HARVEY, WILLIAM              MOBILE   
2  C00410118  P20002978  Bachmann, Michele       SMITH, LANIER              LANETT   
3  C00410118  P20002978  Bachmann, Michele    BLEVINS, DARONDA             PIGGOTT   
4  C00410118  P20002978  Bachmann, Michele  WARDENBURG, HAROLD  HOT SPRINGS NATION   

  contbr_st    contbr_zip        contbr_employer      contbr_occupation  \
0        AL  3.660103e+08                RETIRED                RETIRED   
1        AL  3.660103e+08                RETIRED                RETIRED   
2        AL  3.686334e+08  INFORMATION REQUESTED  INFORMATION REQUESTED   
3        AR  7.245483e+08                   NONE                RETIRED   
4        AR  7.190165e+08                   NONE                RETIRED   

   contb_receipt_amt contb_receipt_dt receipt_desc memo_cd memo_text form_tp  \
0                250        20-JUN-11          NaN     NaN       NaN   SA17A   
1                 50        23-JUN-11          NaN     NaN       NaN   SA17A   
2                250        05-JUL-11          NaN     NaN       NaN   SA17A   
3                250        01-AUG-11          NaN     NaN       NaN   SA17A   
4                300        20-JUN-11          NaN     NaN       NaN   SA17A   

   file_num               tran_id election_tp  
0    736166  A1FDABC23D2D545A1B83       P2012  
1    736166  A899B9B0E223743EFA63       P2012  
2    749073  A4B144E3CB3294ABC9D6       P2012  
3    749073  A8C7C6F55B0914C5A880       P2012  
4    736166  A070D6D560BD84AA98AC       P2012</code></pre></p>
<p>For reference, it&#8217;s more difficult to load this file in R (2.15.2) (both because of its size and malformedness&#8211; hopefully an R guru can tell me how to deal with this trailing delimiter crap). Setting <kbd>row.names=NULL</kbd> causes incorrect column labelling but at least gives us a parsing + type inference performance number (about 10x slower, faster if you specify all 18 column data types):</p>
<p><pre><code>&gt; system.time(df &lt;- read.csv(path, row.names=NULL))
   user  system elapsed 
220.250   2.100 222.662</code></pre></p>
<p>If you know much about this data set, you know most of these columns are not interesting to analyze. New in pandas v0.10 you can specify a subset of columns right in <kbd>read_csv</kbd> which results in both <i>much</i> faster parsing time and lower memory usage (since we're throwing away the data from the other columns after tokenizing the file):</p>
<p><pre><code>In [6]: fec_cols = ['cand_nm', 'contbr_st', 'contbr_employer', 'contb_receipt_amt', 
   ...:             'contbr_occupation', 'contb_receipt_amt', 'contb_receipt_dt']

In [7]: %time fec = read_csv(path, index_col=False, usecols=fec_cols) 
CPU times: user 8.93 s, sys: 0.36 s, total: 9.29 s
Wall time: 9.31 s

In [8]: fec
feOut[8]: 
&lt;class 'pandas.core.frame.DataFrame'&gt;
Int64Index: 5330834 entries, 0 to 5330833
Data columns:
cand_nm              5330834  non-null values
contbr_st            5330779  non-null values
contbr_employer      5295347  non-null values
contbr_occupation    5296749  non-null values
contb_receipt_amt    5330834  non-null values
contb_receipt_dt     5330834  non-null values
dtypes: float64(1), object(5)</code></pre></p>
<p>Outside of file reading, a huge amount of work has been done elsewhere on pandas (aided by Chang She, Yoval P, Jeff Reback, and others). Performance has improved in many critical operations outside of parsing too (check out the groupby numbers!). Here's the output of a recent vbench run showing the latest dev version versus version 0.9.0 (numbers less than 1 indicate that the current pandas version is faster on average by that ratio):</p>
<p><pre><code>master     v0.9.0      ratio
name                                                                    
unstack_sparse_keyspace                     1.3068   140.7709     0.0093
groupby_frame_apply_overhead               18.5924   320.9360     0.0579
read_csv_comment2                          25.2323   361.2320     0.0699
groupbym_frame_apply                       73.6522   495.8570     0.1485
read_csv_thou_vb                           35.3686   191.7560     0.1844
frame_iteritems_cached                      0.0700     0.3736     0.1873
concat_small_frames                        12.4304    54.4093     0.2285
join_dataframe_integer_2key                 5.8004    18.3924     0.3154
append_frame_single_homogenous              0.3345     0.8811     0.3796
read_csv_vb                                18.5832    46.5717     0.3990
read_csv_standard                          12.2059    28.3020     0.4313
panel_from_dict_all_different_indexes      73.6467   149.4339     0.4928
frame_constructor_ndarray                   0.0478     0.0916     0.5219
panel_from_dict_two_different_indexes      50.8969    88.4219     0.5756
groupby_last                                3.6658     6.2642     0.5852
groupby_first                               3.7455     6.2646     0.5979
frame_get_numeric_data                      0.0677     0.1076     0.6295
frame_iteritems                             2.5812     3.5731     0.7224
groupby_simple_compress_timing             37.6556    51.8415     0.7264
append_frame_single_mixed                   1.2934     1.7739     0.7291
frame_to_csv                              285.7749   387.6002     0.7373
join_dataframe_integer_key                  1.7632     2.2799     0.7734
groupby_multi_size                         30.8166    39.5899     0.7784
groupby_frame_singlekey_integer             2.3004     2.8959     0.7944
write_csv_standard                        329.9391   411.1040     0.8026
groupby_multi_cython                       16.6989    19.3612     0.8625
sparse_series_to_frame                    158.4511   181.7181     0.8720
groupby_multi_series_op                    15.4856    17.7564     0.8721
stat_ops_level_frame_sum_multiple           7.6714     8.7897     0.8728
stat_ops_level_series_sum                   2.1563     2.4705     0.8728
stat_ops_level_frame_sum                    3.0208     3.3616     0.8986
merge_2intkey_nosort                       18.5616    20.6283     0.8998
dataframe_reindex_columns                   0.3021     0.3343     0.9037
stats_rank_average                         28.2949    30.8713     0.9165
reshape_stack_simple                        2.6250     2.8527     0.9202
frame_reindex_both_axes                     0.3739     0.4037     0.9262
melt_dataframe                              1.4668     1.5812     0.9276
groupby_frame_median                        7.1793     7.7170     0.9303
groupby_indices                             6.8554     7.3550     0.9321
stat_ops_level_series_sum_multiple          7.2527     7.7637     0.9342
groupby_series_simple_cython                5.0585     5.3895     0.9386
merge_2intkey_sort                         44.3559    47.1701     0.9403
timeseries_timestamp_tzinfo_cons            0.0159     0.0169     0.9435
join_dataframe_index_multi                 20.7460    21.9524     0.9450
frame_reindex_both_axes_ix                  0.4559     0.4810     0.9479
panel_from_dict_equiv_indexes              26.2508    27.6105     0.9508
index_datetime_intersection                15.1940    15.9497     0.9526
panel_from_dict_same_index                 26.1363    27.3955     0.9540
timeseries_large_lookup_value               0.0234     0.0245     0.9566
frame_sort_index_by_columns                41.4619    43.3163     0.9572
frame_boolean_row_select                    0.2905     0.3034     0.9574
indexing_dataframe_boolean_rows             0.2364     0.2468     0.9581
join_dataframe_index_single_key_bigger     15.0122    15.6593     0.9587
index_datetime_union                       15.2101    15.8353     0.9605
reindex_frame_level_align                   0.9618     0.9992     0.9625
dataframe_reindex_daterange                 0.4052     0.4206     0.9634
lib_fast_zip                               10.3956    10.6966     0.9719
read_table_multiple_date_baseline        1175.4630  1208.3721     0.9728
groupby_frame_cython_many_columns           3.6782     3.7668     0.9765
groupby_multi_python                       54.6882    55.9723     0.9771
series_align_int64_index                   35.9280    36.6402     0.9806
groupby_multi_different_numpy_functions    13.9577    14.2328     0.9807
read_table_multiple_date                 2599.5600  2648.0811     0.9817
reindex_frame_level_reindex                 0.9353     0.9506     0.9839
timeseries_asof_single                      0.0565     0.0574     0.9840
concat_series_axis1                        70.7297    71.8611     0.9843
replace_fillna                              4.9988     5.0717     0.9856
reindex_multiindex                          1.2701     1.2874     0.9866
groupby_pivot_table                        18.8712    19.1218     0.9869
groupby_multi_different_functions          13.9056    14.0842     0.9873
indexing_dataframe_boolean_rows_object      0.4870     0.4932     0.9875
stats_rank2d_axis1_average                 14.5004    14.6610     0.9890
frame_ctor_nested_dict                     90.2146    91.1775     0.9894
frame_ctor_nested_dict_int64              131.1228   132.3512     0.9907
frame_reindex_axis1                         2.8508     2.8740     0.9919
timeseries_asof_nan                         9.9486    10.0186     0.9930
timeseries_asof                            10.3370    10.4094     0.9930
stats_rank_average_int                     22.5349    22.6918     0.9931
sort_level_zero                             4.3147     4.3434     0.9934
frame_reindex_axis0                         1.3561     1.3643     0.9940
dti_reset_index                             0.6203     0.6237     0.9945
frame_fillna_many_columns_pad              15.3549    15.4276     0.9953
frame_drop_dup_inplace                      2.8577     2.8690     0.9961
stats_rank2d_axis0_average                 24.5411    24.6010     0.9976
timeseries_timestamp_downsample_mean        4.4788     4.4895     0.9976
match_strings                               0.3809     0.3815     0.9984
timeseries_sort_index                      22.2473    22.2516     0.9998
index_int64_intersection                   25.5705    25.5643     1.0002
index_int64_union                          82.9492    82.8681     1.0010
frame_insert_500_columns                   99.3640    99.2359     1.0013
timeseries_add_irregular                   21.6117    21.5676     1.0020
replace_replacena                           5.0788     5.0680     1.0021
series_ctor_from_dict                       3.6953     3.6821     1.0036
series_align_left_monotonic                13.1076    13.0529     1.0042
timeseries_period_downsample_mean           6.2799     6.2450     1.0056
timeseries_1min_5min_mean                   0.6190     0.6135     1.0090
stat_ops_series_std                         0.2522     0.2494     1.0113
frame_drop_dup_na_inplace                   2.6487     2.6187     1.0115
reindex_daterange_backfill                  0.1866     0.1842     1.0128
frame_drop_duplicates                      18.7704    18.5208     1.0135
series_align_irregular_string              81.0545    79.7984     1.0157
frame_drop_duplicates_na                   18.7666    18.4687     1.0161
timeseries_infer_freq                      10.5702    10.4007     1.0163
lib_fast_zip_fillna                        13.9561    13.6818     1.0200
reindex_daterange_pad                       0.1876     0.1839     1.0202
timeseries_1min_5min_ohlc                   0.6971     0.6815     1.0229
reindex_fillna_backfill                     0.1426     0.1388     1.0269
sort_level_one                              4.4267     4.2921     1.0314
datetimeindex_add_offset                    0.2387     0.2307     1.0345
frame_fancy_lookup                          2.5191     2.4317     1.0359
frame_fillna_inplace                       16.9570    16.2857     1.0412
timeseries_slice_minutely                   0.0626     0.0600     1.0433
reindex_fillna_pad                          0.1459     0.1396     1.0453
frame_ctor_list_of_dict                    98.7562    93.8414     1.0524
join_dataframe_index_single_key_bigger      6.6729     6.2370     1.0699
frame_fancy_lookup_all                     26.3783    24.6056     1.0720
series_constructor_ndarray                  0.0124     0.0115     1.0765
series_value_counts_int64                   2.7498     2.5306     1.0866
reshape_unstack_simple                      3.1889     2.8616     1.1144
indexing_panel_subset                       0.5772     0.5063     1.1400
groupby_apply_dict_return                  45.6793    39.6327     1.1526
timeseries_to_datetime_iso8601              4.1278     3.5348     1.1678
join_dataframe_index_single_key_small       6.8140     5.7951     1.1758
frame_to_string_floats                     52.5621    35.4093     1.4844
reshape_pivot_time_series                 181.4950   103.2100     1.7585
sparse_frame_constructor                    5.9832     3.3596     1.7809

Columns: test_name | target_duration [ms] | baseline_duration [ms] | ratio

- a Ratio of 1.30 means the target commit is 30% slower then the baseline.

Target [648d581] : ENH: index_col=False disables first column -&gt; index. fixes FEC parsing problem. close #2442
Baseline [b5956fd] : RLS: Version 0.9.0 final</code></pre></p>
]]></content:encoded>
			<wfw:commentRss>http://wesmckinney.com/blog/?feed=rss2&#038;p=635</wfw:commentRss>
		<slash:comments>17</slash:comments>
		</item>
		<item>
		<title>A new high performance, memory-efficient file parser engine for pandas</title>
		<link>http://wesmckinney.com/blog/?p=543</link>
		<comments>http://wesmckinney.com/blog/?p=543#comments</comments>
		<pubDate>Thu, 04 Oct 2012 20:31:18 +0000</pubDate>
		<dc:creator>Wes McKinney</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://wesmckinney.com/blog/?p=543</guid>
		<description><![CDATA[TL;DR I&#8217;ve finally gotten around to building the high performance parser engine that pandas deserves. It hasn&#8217;t been released yet (it&#8217;s in a branch on GitHub) but will after I give it a month or so for any remaining buglets &#8230;<p class="read-more"><a href="http://wesmckinney.com/blog/?p=543">Read more &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<p><b>TL;DR</b> I&#8217;ve finally gotten around to building the high performance parser engine that pandas deserves. It hasn&#8217;t been released yet (it&#8217;s in a branch <a href="http://github.com/pydata/pandas" title="GitHub" target="_blank">on GitHub</a>) but will after I give it a month or so for any remaining buglets to shake out:</p>
<p><a href="http://wesmckinney.com/blog/wp-content/uploads/2012/10/parser_benchmarks.png"><img src="http://wesmckinney.com/blog/wp-content/uploads/2012/10/parser_benchmarks.png" alt="" title="parser_benchmarks" width="908" height="395" class="aligncenter size-full wp-image-583" /></a></p>
<p>A project I&#8217;ve put off for a long time is building a high performance, memory efficient file parser for pandas. The existing code up through and including the imminent pandas 0.9.0 release has always been makeshift; the development focus has been on parser features over the more tedious (but actually much more straightforward) issue of creating a fast C table tokenizer. It&#8217;s been on the pandas roadmap for a long time:</p>
<p><a href="http://github.com/pydata/pandas/issues/821">http://github.com/pydata/pandas/issues/821</a></p>
<p><kbd>pandas.read_csv</kbd> from pandas 0.5.0 onward is actually very fast&#8211; faster than R and much faster than <kbd>numpy.loadtxt</kbd>&#8211; but it uses a lot of memory. I wrote about some of the implementation issues <a href="http://wesmckinney.com/blog/?p=278">about a year ago here</a>. The key problem with the existing code is this: all of the existing parsing solutions in pandas as well as NumPy <b>first read the file data into pure Python data structures</b>: a list of tuples or a list of lists. If you have a very large file, a list of 1 million or 10 million Python tuples has an extraordinary memory footprint&#8211; <i>significantly</i> greater than the size of the file on disk (can be 5x or more footprint, far too much). Some people have <a href="http://continuum.io/blog/faster-and-smaller">pointed out the large memory usage</a> without correctly explaining why, but this is the one and only reason: too many intermediate Python data structures.</p>
<p>Building a good parser engine isn&#8217;t exactly rocket science; we&#8217;re talking optimizing the implementation of dirt simple O(n) algorithms here. The task is divided into several key pieces:</p>
<li><b>File tokenization</b>: read bytes from the file, identify where fields begin and end and which column each belongs to. Python&#8217;s <kbd>csv</kbd> module is an example of a tokenizer. Things like quoting conventions need to be taken into account. Doing this well in C is about picking the right data structures and making the code lean and mean. To be clear: if you design the tokenizer data structure wrong, you&#8217;ve lost before you&#8217;ve begun.</li>
<li><b>NA value filtering</b>: detect NA (missing) value sentinels and convert to the appropriate NA representation. Examples of NA sentinels are <kbd>NA, #N/A</kbd> or other bespoke sentinels like <kbd>-999</kbd>. Practically speaking this means keeping a hash set of strings considered NA and check whether each parsed token is in the set (and you can have different NA sets for each column, too!). If the number of sentinel values is small, you could use an array of C strings instead of a hash set.</li>
<li><b>Tolerating &#8220;bad&#8221; rows</b>: Can aberrant rows be gracefully ignored with your consent? Is the error message informative?</li>
<li><b>Type inference / conversion</b>: Converting the tokens in the file to the right C types (string, date, floating point, integer, boolean).</li>
<li><b>Skipping rows</b>: Ignore certain rows in file or at end of file.</li>
<li><b>Date parsing / value conversion</b>: Convert one or more columns into timestamps. In some cases concatenate date/time information spread across multiple columns.</li>
<li><b>Handling of &#8220;index&#8221; columns</b>: Handle row names appropriately, yielding a DataFrame with the expected row index.</li>
<p> </br></p>
<p>None of this is that hard; it&#8217;s made much more time consuming due to the proliferation of fine-grained options (and resulting &#8220;parameter hell&#8221;). Anyway, I finally mustered the energy to hack it out over a few intense days in late August and September. I&#8217;m hoping to ship it in a quick pandas 0.10 release (&#8220;version point-ten&#8221;) toward the end of October if possible. It would be nice to push this code upstream into NumPy to improve loadtxt and genfromtxt&#8217;s performance as well.</p>
<h1>Benchmarks against R, NumPy, Continuum&#8217;s IOPro</h1>
<p>Outside of parser features (i.e. &#8220;can the tool read my file correctly&#8221;), there are two performance areas of interest:</p>
<li>CPU Speed: how long does it take to parse the file?</li>
<li>Memory utilization: what&#8217;s the maximum amount of RAM used while the file is being parsed (including the final returned table)? There&#8217;s really nothing worse than your computer starting to swap when you try to parse a large file</li>
<p>
<p>I&#8217;ll compare the new pandas parser engine in a group of several tools that you can use to do the same job, including R&#8217;s parser functions:</p>
<li>R&#8217;s venerable <kbd>read.csv</kbd> and <kbd>read.table</kbd> functions</li>
<li><kbd>numpy.loadtxt</kbd>: this is a pure Python parser, to be clear.</li>
<li>New pandas engine, via <kbd>pandas.read_csv</kbd> and <kbd>read_table</kbd> </li>
<li>A new commercial library, <a href="https://store.continuum.io/cshop/iopro" title="IOPro" target="_blank">IOPro</a>, from my good friends at <a href="http://continuum.io/" title="Continuum Analytics" target="_blank">Continuum Analytics</a>.</li>
<p>
<p>To do the performance analysis, I&#8217;ll look at 5 representative data sets:</p>
<li>A 100,000 x 50 CSV matrix of randomly generated 0&#8242;s and 1&#8242;s. It looks like this:</li>
<p><pre><code>In [3]: !head -n 5 parser_examples/zeros.csv
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49
1,1,0,1,1,0,0,1,1,0,1,0,1,1,0,0,1,0,0,0,0,1,0,1,1,1,1,1,0,1,0,1,1,1,1,1,0,1,1,0,1,0,0,0,1,1,0,0,0,0
1,0,0,1,1,1,0,0,1,1,1,0,1,1,0,0,1,0,1,1,0,1,1,1,1,1,1,0,1,1,1,0,0,0,0,1,0,1,0,1,0,0,0,0,1,1,0,1,1,1
1,1,0,0,0,0,0,0,0,1,0,0,0,1,1,0,1,1,0,0,0,0,0,0,1,0,0,0,0,1,1,1,0,1,1,1,0,1,0,0,1,1,1,1,1,1,1,1,0,0
0,1,0,0,1,1,0,0,0,0,0,0,0,0,1,0,1,1,0,1,0,1,0,1,0,0,1,1,0,1,1,0,0,0,0,0,1,0,1,1,1,0,0,0,1,0,1,0,0,1</code></pre></p>
<li>A 1,000,000 x 10 CSV matrix of randomly generated normally distributed data. Looks like this:</li>
<p><pre><code>In [4]: !head -n 5 parser_examples/matrix.csv
0,1,2,3,4,5,6,7,8,9
0.609633439034,0.249525535926,0.180502465241,0.940871913454,-0.35702932376,1.12983701927,0.77045731318,-0.16976884026,-0.685520348835,0.216936429382
0.76523368046,1.08405034644,1.2099841819,-0.858404123158,1.47061247583,-1.15728386054,-0.375685123416,-0.00475949800828,0.522530689417,0.485226447392
-0.958266896007,-0.0583065555495,-0.17369448475,0.465274502954,0.92612769921,0.362029345941,-2.27118704972,0.944967722699,1.34525304565,1.60130304607
-0.518406503139,-1.19158517434,0.064195872451,-2.244687656,0.947562908985,0.775078137538,0.160741686264,-0.706110036551,-0.780137216247,1.02794242373</code></pre></p>
<li>The Federal election committee (FEC) data set as a CSV file. One of my favorite example data sets for talking about pandas. Here&#8217;s what it looks like when parsed with <kbd>pandas.read_csv</kbd></li>
<p><pre><code>In [2]: df
Out[2]: 
&lt;class 'pandas.core.frame.DataFrame'&gt;
Int64Index: 1001731 entries, 0 to 1001730
Data columns:
cmte_id              1001731  non-null values
cand_id              1001731  non-null values
cand_nm              1001731  non-null values
contbr_nm            1001731  non-null values
contbr_city          1001716  non-null values
contbr_st            1001727  non-null values
contbr_zip           1001620  non-null values
contbr_employer      994314  non-null values
contbr_occupation    994433  non-null values
contb_receipt_amt    1001731  non-null values
contb_receipt_dt     1001731  non-null values
receipt_desc         14166  non-null values
memo_cd              92482  non-null values
memo_text            97770  non-null values
form_tp              1001731  non-null values
file_num             1001731  non-null values
dtypes: float64(1), int64(1), object(14)</code></pre></p>
<li>Wikipedia page count data used for benchmarks in <a href="http://continuum.io/blog/faster-and-smaller">this blog post</a>. It&#8217;s delimited by single spaces and has no column header:</li>
<p><pre><code>In [8]: df
Out[8]: 
&lt;class 'pandas.core.frame.DataFrame'&gt;
Int64Index: 6078103 entries, 0 to 6078102
Data columns:
X.1    6077987  non-null values
X.2    6078090  non-null values
X.3    6078103  non-null values
X.4    6078103  non-null values
dtypes: int64(2), object(2)

In [9]: df.head()
Out[9]: 
    X.1                                                X.2  X.3    X.4
0  aa.b                               Special%3aStatistics    1  18127
1  aa.b  Special:WhatLinksHere/User:Sir_Lestaty_de_Lion...    1   5325
2  aa.b                                         User:EVula    1  21080
3  aa.b                                  User:EVula/header    1  17332
4  aa.b                                       User:Manecke    1  21041</code></pre></p>
<li>A large numerical astronomy data set used for benchmarks in <a href="http://continuum.io/blog/numerical-data-sets-and-iopro">this blog post</a>. Looks like this:</li>
<p><pre><code>In [19]: df
 Out[19]: 
&lt;class 'pandas.core.frame.DataFrame'&gt;
Int64Index: 6949386 entries, 0 to 6949385
Data columns:
objectid (long)            6949386  non-null values
right ascension (float)    6949386  non-null values
declination (float)        6949386  non-null values
ultraviolet (double)       6949386  non-null values
green (double)             6949386  non-null values
red (double)               6949386  non-null values
infrared (double)          6949386  non-null values
z (double)                 6949386  non-null values
dtypes: float64(7), int64(1)

In [20]: df[:2].T
Out[20]: 
                                    0             1
objectid (long)          7.588828e+17  7.588828e+17
right ascension (float)  2.634087e+02  2.634271e+02
declination (float)      6.278961e+00  6.310742e+00
ultraviolet (double)     2.459675e+01  2.330080e+01
green (double)           2.347177e+01  2.275493e+01
red (double)             2.169188e+01  2.188667e+01
infrared (double)        2.118722e+01  2.066283e+01
z (double)               2.043528e+01  2.135766e+01</code></pre></p>
<p>Here&#8217;s a link to an archive of all the datasets (warning: about 500 megabytes): <a href="http://dl.dropbox.com/u/11102422/datasets.tar.gz">Table datasets</a></p>
<p>
<p>I don&#8217;t have time to compare features (which vary greatly across the tools).</p>
<p>
<p>Oh, and my rig:</p>
<li>Core i7 950 @ 3.07 GHz</li>
<li>24 GB of ram (so we won&#8217;t get close to swapping)</li>
<li>OCZ Vertex 3 Sata 3 SSD</li>
<p>
<p>(Because I have an SSD I would expect the benchmarks for spinning rust to differ roughly by a constant amount based on read times for slurping the bytes of the disk. In my case, the disk reads aren&#8217;t a major factor. In corporate environments with NFS servers under heavy load, you would expect similar reads to take a bit longer.)</p>
<h2>CPU Performance benchmarks</h2>
<p>So <kbd>numpy.loadtxt</kbd> is really slow, and I&#8217;m excluding it from the benchmarks. On the smallest and simplest file in these benchmarks, it&#8217;s more than 10 times slower than the new pandas parser:</p>
<p><pre><code>In [27]: timeit read_csv('zeros.csv')
1 loops, best of 3: 415 ms per loop

In [29]: %time arr = np.loadtxt('zeros.csv', delimiter=',', dtype=np.int_, skiprows=1)
CPU times: user 4.88 s, sys: 0.04 s, total: 4.92 s
Wall time: 4.92 s</code></pre></p>
<p>Here are the results for everybody else (see code at end of post):</p>
<p><a href="http://wesmckinney.com/blog/wp-content/uploads/2012/10/parser_benchmarks.png"><img src="http://wesmckinney.com/blog/wp-content/uploads/2012/10/parser_benchmarks.png" alt="" title="parser_benchmarks" width="908" height="395" class="aligncenter size-full wp-image-583" /></a></p>
<p>Here are the raw numbers in seconds:</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">In [30]: results<br />
Out[30]: <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;iopro &nbsp; &nbsp;pandas &nbsp; &nbsp; &nbsp; R<br />
astro &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;17.646228 &nbsp;6.955254 &nbsp;37.030<br />
double-matrix &nbsp; 3.377430 &nbsp;1.279502 &nbsp; 6.920<br />
fec &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 3.685799 &nbsp;2.306570 &nbsp;18.121<br />
wikipedia &nbsp; &nbsp; &nbsp;11.752624 &nbsp;4.369659 &nbsp;42.250<br />
zero-matrix &nbsp; &nbsp; 0.673885 &nbsp;0.268830 &nbsp; 0.616</div></div>
<h3>IOPro vs. new pandas parser: look closer</h3>
<p>But hey, wait a second. If you are intimately familiar with IOPro and pandas you will already be saying that I am not making an apples to apples comparison. True. Why not?</p>
<li>IOPro does not check for and substitute common NA sentinel values (I believe you can give it a list of values to check for&#8211; the documentation was a bit hard to work out in this regard)</li>
<li>IOPro returns NumPy arrays with structured dtype. Pandas DataFrame has a slightly different internal format, and strings are boxed as Python objects rather than stored in NumPy string dtype arrays</li>
<p>To level the playing field, I&#8217;ll disable the NA filtering logic (passing <kbd>na_filter=False</kbd>) in pandas, instruct the parser to return a structured array instead of a DataFrame (<kbd>as_recarray=True</kbd>). Secondly, let&#8217;s only look at the numerical datasets (exclude wikipedia and fec, for now) to exclude the impact of handling of string datatypes. Here is the resulting graph (with relative timings):</p>
<p><a href="http://wesmckinney.com/blog/wp-content/uploads/2012/10/parser_benchmarks2.png"><img src="http://wesmckinney.com/blog/wp-content/uploads/2012/10/parser_benchmarks2-1024x388.png" alt="" title="parser_benchmarks2" width="620" height="234" class="aligncenter size-large wp-image-587" /></a></p>
<p>It looks like the savings of not passing all the tokens through the NA filter is balanced by the cost of transferring the column arrays into the structured array (which is a raw array of bytes interpreted as a table by NumPy). This could very likely be made faster (more cache-efficient) than it currently is with some effort. </p>
<h2>Memory usage benchmarks</h2>
<p><a href="http://wesmckinney.com/blog/wp-content/uploads/2012/10/haters-gonna-hate-cool-dog.jpg"><img src="http://wesmckinney.com/blog/wp-content/uploads/2012/10/haters-gonna-hate-cool-dog-300x220.jpg" alt="" title="haters-gonna-hate-cool-dog" width="300" height="220" class="aligncenter size-medium wp-image-576" /></a></p>
<p>Profiling peak memory usage is a tedious process. The canonical tool for the job is <a href="http://valgrind.org/docs/manual/ms-manual.html" title="Massif" target="_blank">Massif</a> from the Valgrind suite. I&#8217;m not yet done obsessing over memory allocation and data management inside the parser system, but here&#8217;s what the numbers look like compared with R and IOPro. I&#8217;m using the following valgrind commands (plus ms_print) to get this output (if this is not correct, please someone tell me):</p>
<p><pre><code>valgrind --tool=massif --depth=1 python -c command
ms_print massif_output_file</code></pre></p>
<p>I&#8217;ll use the largest file in this post, the astro numerical dataset.</p>
<p>First, IOPro advertises very low memory footprint. It does not, however, avoid having 2 copies of the data set in memory (I don&#8217;t either. It&#8217;s actually very difficult&#8211;and costly&#8211;to avoid this). Here is the final output of ms_print showing peak memory usage at the very end when the structured array is created and returned:</p>
<p><pre><code>--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 65 91,209,383,985      467,332,576      467,252,659        79,917            0
 66 92,193,310,604      467,332,576      467,252,659        79,917            0
 67 93,154,564,618      912,093,904      908,774,712     3,319,192            0
99.64% (908,774,712B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
-&gt;48.76% (444,761,193B) 0x6E1F5B4: PyArray_NewFromDescr (in /home/wesm/epd/lib/python2.7/site-packages/numpy/core/multiarray.so)
| 
-&gt;48.76% (444,760,704B) 0x6E0BEAD: PyArray_Resize (in /home/wesm/epd/lib/python2.7/site-packages/numpy/core/multiarray.so)
| 
-&gt;01.15% (10,485,760B) 0x65391E1: open_text_adapter (text_adapter.c:58)
| 
-&gt;00.96% (8,767,055B) in 1+ places, all below ms_print's threshold (01.00%)</code></pre></p>
<p>Let&#8217;s look at R. Peak memory allocation comes in slightly under IOPro at 903MM bytes vs. 912MM:</p>
<p><pre><code>--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 71 227,687,337,998      635,817,352      635,766,423        50,929            0
 72 227,723,703,780      681,469,608      681,418,671        50,937            0
 73 227,758,451,071      737,064,752      737,013,799        50,953            0
 74 227,803,513,492      792,659,896      792,608,927        50,969            0
 75 227,848,799,505      848,255,040      848,204,055        50,985            0
 76 227,893,861,838      903,850,184      903,799,183        51,001            0
 77 227,937,535,005      903,850,184      903,799,183        51,001            0
99.99% (903,799,183B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
-&gt;98.52% (890,458,496B) 0x4F45020: Rf_allocVector (memory.c:2388)
| 
-&gt;01.48% (13,340,687B) in 108 places, all below massif's threshold (01.00</code></pre></p>
<p>In the new pandas parser, I&#8217;ll look at 2 things: memory allocation by the parser engine before creation of the final DataFrame (which causes data-doubling as with IOPro) and the user-facing read_csv. First, the profile of using read_csv (which also creates a simple integer Index for the DataFrame) uses 1014MM bytes, about 10% more than either of the above:</p>
<p><pre><code>--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 70 56,100,696,245    1,014,234,384      965,776,492    48,457,892            0
95.22% (965,776,492B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
-&gt;93.19% (945,117,309B) 0x6B9E5B4: PyArray_NewFromDescr (in /home/wesm/epd/lib/python2.7/site-packages/numpy/core/multiarray.so)
| 
-&gt;02.04% (20,659,183B) in 145 places, all below massif's threshold (01.00%)</code></pre></p>
<p>Considering only the parser engine (which returns a dict of arrays, i.e. no data doubling) uses only 570MM bytes:</p>
<p><pre><code>--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 56 39,618,471,594      569,619,528      521,140,509    48,479,019            0
91.49% (521,140,509B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
-&gt;87.84% (500,356,421B) 0x6B9E5B4: PyArray_NewFromDescr (in /home/wesm/epd/lib/python2.7/site-packages/numpy/core/multiarray.so)
| 
-&gt;01.95% (11,084,760B) in 128 places, all below massif's threshold (01.00%)
| 
-&gt;01.70% (9,699,328B) 0x4EADC3B: PyObject_Malloc (obmalloc.c:580)</code></pre></p>
<p>Memory usage with non-numerical data depends on a lot of issues surrounding the handling of string data. Let&#8217;s consider the FEC data set, where pandas does pretty well out of the box, using only 415MM bytes at peak (I realized why it was so high while writing this article&#8230;will reduce soon):</p>
<p><pre><code>--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 88 12,473,186,057      415,804,184      348,381,546    67,422,638            0
83.79% (348,381,546B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
-&gt;63.60% (264,457,869B) 0x6B9E5B4: PyArray_NewFromDescr (in /home/wesm/epd/lib/python2.7/site-packages/numpy/core/multiarray.so)
| 
-&gt;14.88% (61,865,984B) 0x4EADC3B: PyObject_Malloc (obmalloc.c:580)
| 
-&gt;02.34% (9,722,205B) in 149 places, all below massif's threshold (01.00%)
| 
-&gt;01.91% (7,959,808B) 0x4E9799C: fill_free_list (intobject.c:52)
| 
-&gt;01.05% (4,375,680B) 0x4EA5147: dictresize (dictobject.c:632)</code></pre></p>
<p>IOPro <i>out of the box</i> uses 3 times more. This would obviously be completely undesirable:</p>
<p><pre><code>--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 76 15,039,940,724    1,232,035,520      828,653,666   403,381,854            0
67.26% (828,653,666B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
-&gt;32.85% (404,699,813B) 0x6E1F5B4: PyArray_NewFromDescr (in /home/wesm/epd/lib/python2.7/site-packages/numpy/core/multiarray.so)
| 
-&gt;32.85% (404,699,324B) 0x6E0BEAD: PyArray_Resize (in /home/wesm/epd/lib/python2.7/site-packages/numpy/core/multiarray.so)
| 
-&gt;01.56% (19,254,529B) in 114 places, all below massif's threshold (01.00%)</code></pre></p>
<p>What about R? It may not be fast but it uses the least memory again:</p>
<p><pre><code>--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 75 51,948,622,306      259,260,296      258,997,057       263,239            0
99.90% (258,997,057B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
-&gt;74.63% (193,486,656B) 0x4F45020: Rf_allocVector (memory.c:2388)
| 
-&gt;23.57% (61,112,304B) 0x4F445C0: GetNewPage (memory.c:786)
| 
-&gt;01.14% (2,951,289B) 0x4FBDD9B: do_lazyLoadDBfetch (serialize.c:2335)
| 
-&gt;00.56% (1,446,808B) in 1+ places, all below ms_print's threshold (01.00%)</code></pre></p>
<p>You might be wondering why IOPro uses so much memory? The problem is fixed-width string types:</p>
<p><pre><code>In [2]: adap = iopro.text_adapter('P00000001-ALL.csv')

In [3]: arr = adap[:]

In [4]: arr
Out[4]: 
array([ ('C00410118', 'P20002978', 'Bachmann, Michelle', 'HARVEY, WILLIAM', 'MOBILE', 'AL', '366010290', 'RETIRED', 'RETIRED', 250.0, '20-JUN-11', '', '', '', 'SA17A', 736166L),
       ('C00410118', 'P20002978', 'Bachmann, Michelle', 'HARVEY, WILLIAM', 'MOBILE', 'AL', '366010290', 'RETIRED', 'RETIRED', 50.0, '23-JUN-11', '', '', '', 'SA17A', 736166L),
       ('C00410118', 'P20002978', 'Bachmann, Michelle', 'SMITH, LANIER', 'LANETT', 'AL', '368633403', 'INFORMATION REQUESTED', 'INFORMATION REQUESTED', 250.0, '05-JUL-11', '', '', '', 'SA17A', 749073L),
       ...,
       ('C00500587', 'P20003281', 'Perry, Rick', 'GRANE, BRYAN F. MR.', 'INFO REQUESTED', 'XX', '99999', 'INFORMATION REQUESTED PER BEST EFFORTS', 'INFORMATION REQUESTED PER BEST EFFORTS', 500.0, '29-SEP-11', '', '', '', 'SA17A', 751678L),
       ('C00500587', 'P20003281', 'Perry, Rick', 'TOLBERT, DARYL MR.', 'INFO REQUESTED', 'XX', '99999', 'T.A.C.C.', 'LONGWALL MAINTENANCE FOREMAN', 500.0, '30-SEP-11', '', '', '', 'SA17A', 751678L),
       ('C00500587', 'P20003281', 'Perry, Rick', 'ANDERSON, MARILEE MRS.', 'INFO REQUESTED', 'XX', '99999', 'INFORMATION REQUESTED PER BEST EFFORTS', 'INFORMATION REQUESTED PER BEST EFFORTS', 2500.0, '31-AUG-11', '', '', '', 'SA17A', 751678L)], 
      dtype=[('cmte_id', '|S9'), ('cand_id', '|S9'), ('cand_nm', '|S30'), ('contbr_nm', '|S57'), ('contbr_city', '|S29'), ('contbr_st', '|S2'), ('contbr_zip', '|S9'), ('contbr_employer', '|S38'), ('contbr_occupation', '|S38'), ('contb_receipt_amt', '&lt;f8'), ('contb_receipt_dt', '|S9'), ('receipt_desc', '|S76'), ('memo_cd', '|S1'), ('memo_text', '|S76'), ('form_tp', '|S5'), ('file_num', '&lt;u8')])</code></pre></p>
<p>Oof. Dtypes of <kbd>S38</kbd> or <kbd>S76</kbd> means that field uses 76 bytes for <i>every entry</i>. This is not good, so let&#8217;s set a bunch of these fields to use Python objects like pandas:</p>
<div class="codecolorer-container text mac-classic" style="overflow:auto;white-space:nowrap;border:1px solid #9F9F9F;width:435px;"><div class="text codecolorer" style="padding:5px;font:normal 12px/1.4em Monaco, Lucida Console, monospace;white-space:nowrap">adap = iopro.text_adapter('P00000001-ALL.csv')<br />
adap.set_field_types({2: object, 3: object, 4: object, <br />
&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 7: object, 8: object, 13: object})<br />
arr = adap[:]</div></div>
<p>Here&#8217;s the Massif peak usage which is reasonably inline with pandas:</p>
<p><pre><code>--------------------------------------------------------------------------------
  n        time(i)         total(B)   useful-heap(B) extra-heap(B)    stacks(B)
--------------------------------------------------------------------------------
 39 17,492,316,089      447,952,304      399,945,549    48,006,755            0
89.28% (399,945,549B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
-&gt;64.31% (288,096,256B) 0x4EADC3B: PyObject_Malloc (obmalloc.c:580)
| 
-&gt;10.73% (48,083,577B) 0x6E1F5B4: PyArray_NewFromDescr (in /home/wesm/epd/lib/python2.7/site-packages/numpy/core/multiarray.so)
| 
-&gt;10.73% (48,083,088B) 0x6E0BEAD: PyArray_Resize (in /home/wesm/epd/lib/python2.7/site-packages/numpy/core/multiarray.so)
| 
-&gt;02.34% (10,485,760B) 0x65391E1: open_text_adapter (text_adapter.c:58)
| 
-&gt;01.16% (5,196,868B) in 113 places, all below massif's threshold (01.00%)</code></pre></p>
<h1>Conclusions</h1>
<p>I&#8217;m very happy to see this project to completion, finally. Python users have been suffering for years from parsers that have 1) few features, 2) are slow, and 3) use a lot of memory. In pandas I focused first on features, then on speed, and now on both speed and memory. I&#8217;m very pleased with how it turned out. I&#8217;m excited to see the code hopefully pushed upstream into NumPy when I can get some help with the integration and plumbing (and <b>parameter hell</b>).</p>
<p>It will be a month or so before this code appears in a new release of pandas (we are about to release version 0.9.0) as I want to let folks on the bleeding edge find any bugs before releasing it to the masses. </p>
<h1>Future work and extensions</h1>
<p>Several things could (<i>should</i>) be added to the parser without too much effort comparatively:</p>
<li>Integrate a regular expression engine to tokenize lines with multi-character delimiters or regular expressions.</li>
<li>Code up the fixed-width-field version of the tokenizer</li>
<li>Add on-the-fly decompression of GZIP&#8217;d files</li>
<h2>Code used for performance and memory benchmarks</h2>
<p><script src="https://gist.github.com/3831420.js?file=gistfile1.py"></script></p>
<p>R code (just copy-pasted the output I got of each command). Version 2.14.0</p>
<p><pre><code>system.time(df &lt;- read.csv('parser_examples/zeros.csv', colClasses=rep(&quot;integer&quot;, 50)))
   user  system elapsed
  0.616   0.004   0.623

system.time(df &lt;- read.csv('parser_examples/matrix.csv', colClasses=rep(&quot;numeric&quot;, 10)))
   user  system elapsed
  6.920   0.136   7.071

system.time(df &lt;- read.csv('parser_examples/sdss6949386.csv', colClasses=rep(&quot;numeric&quot;, 8)))
   user  system elapsed
 37.030   0.804  37.866

system.time(df &lt;- read.table('parser_examples/pagecounts-20110331-220000', sep=&quot; &quot;,
			                 header=F,
                             colClasses=c(&quot;character&quot;, &quot;character&quot;, &quot;integer&quot;, &quot;numeric&quot;)))
   user  system elapsed
 42.250   0.356  42.651

system.time(df &lt;- read.csv('parser_examples/P00000001-ALL.csv'))
   user  system elapsed
 18.121   0.212  18.350</code></pre></p>
]]></content:encoded>
			<wfw:commentRss>http://wesmckinney.com/blog/?feed=rss2&#038;p=543</wfw:commentRss>
		<slash:comments>13</slash:comments>
		</item>
		<item>
		<title>Requirements for EuroSciPy 2012 pandas tutorial</title>
		<link>http://wesmckinney.com/blog/?p=540</link>
		<comments>http://wesmckinney.com/blog/?p=540#comments</comments>
		<pubDate>Thu, 23 Aug 2012 15:20:24 +0000</pubDate>
		<dc:creator>Wes McKinney</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://wesmckinney.com/blog/?p=540</guid>
		<description><![CDATA[Much belatedly, here are the requirements for the EuroSciPy tutorial on pandas in Brussels tomorrow. They are the same as for the SciPy 2012 (Austin) tutorial: NumPy 1.6.1 or higher matplotlib 1.0 or higher IPython 0.13 or higher, HTML notebook &#8230;<p class="read-more"><a href="http://wesmckinney.com/blog/?p=540">Read more &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<p>Much belatedly, here are the requirements for the EuroSciPy tutorial on pandas in Brussels tomorrow. They are the same as for the SciPy 2012 (Austin) tutorial:</p>
<li>NumPy 1.6.1 or higher</li>
<li>matplotlib 1.0 or higher</li>
<li>IPython 0.13 or higher, HTML notebook dependencies</li>
<li>pandas 0.8.1 (or better, <a href="http://github.com/pydata/pandas" title="pandas github" target="_blank">GitHub master revision</a>) and dependencies (dateutil, pytz)</li>
<p>One of the easiest ways to get started from scratch is with <a href="http://www.enthought.com" title="EPDFree" target="_blank">EPDFree</a> and then installing pandas 0.8.1 or higher. If you&#8217;re using 0.8.1 we may run into a few minor bugs that I have fixed since the last release.</p>
]]></content:encoded>
			<wfw:commentRss>http://wesmckinney.com/blog/?feed=rss2&#038;p=540</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Finally, GitHub listens! The importance of attention to detail in UX</title>
		<link>http://wesmckinney.com/blog/?p=531</link>
		<comments>http://wesmckinney.com/blog/?p=531#comments</comments>
		<pubDate>Sun, 15 Jul 2012 14:36:49 +0000</pubDate>
		<dc:creator>Wes McKinney</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://wesmckinney.com/blog/?p=531</guid>
		<description><![CDATA[I like using GitHub, as do apparently a lot of other people. When doing pull requests, I&#8217;ve been annoyed for a long time by the amount of clicking necessary to get the git URL to the contributor&#8217;s repository (which must &#8230;<p class="read-more"><a href="http://wesmckinney.com/blog/?p=531">Read more &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<p>I like using <a href="http://github.com/" title="GitHub" target="_blank">GitHub</a>, as do apparently a lot of other people. When doing pull requests, I&#8217;ve been annoyed for a long time by the amount of clicking necessary to get the git URL to the contributor&#8217;s repository (which must be added as a remote, etc.):</p>
<li>Click on new contributor</li>
<li>Locate pandas fork on their list of repositories</li>
<li>Copy git link from their fork, <kbd>git remote add</kbd> and I&#8217;m in business</li>
<p>But, GitHub is about using git, so why should I have to go fishing to get the git link? Earlier in the Pull Request UI, there was a help box that contained the git url (which I used), but then they took it away! So fishing it was. This might seem very minor, but after adding many dozens of remotes like that the annoyance had accumulated. It seemed obvious to me that the branch names should just be links:</p>
<p><a href="http://wesmckinney.com/blog/wp-content/uploads/2012/07/cropped_pr_ui.png"><img src="http://wesmckinney.com/blog/wp-content/uploads/2012/07/cropped_pr_ui.png" alt="" title="cropped_pr_ui" width="640" height="249" class="aligncenter size-full wp-image-535" /></a></p>
<p>Each time GitHub tweaked the UI, I complained on Twitter, like so:</p>
<p><!-- tweet id : 223590057318158336 --><br />
<style type='text/css'>#bbpBox_223590057318158336 a { text-decoration:none; color:#0084B4; }#bbpBox_223590057318158336 a:hover { text-decoration:underline; }</style>
<div id='bbpBox_223590057318158336' class='bbpBox' style='padding:20px; margin:5px 0; background-color:#C0DEED; background-image:url(http://a0.twimg.com/images/themes/theme1/bg.png); background-repeat:no-repeat'>
<div style='background:#fff; padding:10px; margin:0; min-height:48px; color:#333333; -moz-border-radius:5px; -webkit-border-radius:5px;'><span style='width:100%; font-size:18px; line-height:22px;'>Fffffff @<a href="http://twitter.com/intent/user?screen_name=github" class="twitter-action">github</a> changes the pull request UI again but the branch names still are not links to the repository?</span>
<div class='bbp-actions' style='font-size:12px; width:100%; padding:5px 0; margin:0 0 10px 0; border-bottom:1px solid #e6e6e6;'><img align='middle' src='http://wesmckinney.com/blog/wp-content/plugins/twitter-blackbird-pie//images/bird.png' /><a title='tweeted on July 13, 2012 1:29 am' href='http://twitter.com/#!/wesmckinn/status/223590057318158336' target='_blank'>July 13, 2012 1:29 am</a> via web<a href='https://twitter.com/intent/tweet?in_reply_to=223590057318158336' class='bbp-action bbp-reply-action' title='Reply'><span><em style='margin-left: 1em;'></em><strong>Reply</strong></span></a><a href='https://twitter.com/intent/retweet?tweet_id=223590057318158336' class='bbp-action bbp-retweet-action' title='Retweet'><span><em style='margin-left: 1em;'></em><strong>Retweet</strong></span></a><a href='https://twitter.com/intent/favorite?tweet_id=223590057318158336' class='bbp-action bbp-favorite-action' title='Favorite'><span><em style='margin-left: 1em;'></em><strong>Favorite</strong></span></a></div>
<div style='float:left; padding:0; margin:0'><a href='http://twitter.com/intent/user?screen_name=wesmckinn'><img style='width:48px; height:48px; padding-right:7px; border:none; background:none; margin:0' src='http://a0.twimg.com/profile_images/1836874815/upload_normal.png' /></a></div>
<div style='float:left; padding:0; margin:0'><a style='font-weight:bold' href='http://twitter.com/intent/user?screen_name=wesmckinn'>@wesmckinn</a>
<div style='margin:0; padding-top:2px'>Wes McKinney</div>
</div>
<div style='clear:both'></div>
</div>
</div>
<p><!-- end of tweet --></p>
<p>Finally, someone noticed:</p>
<p><!-- tweet id : 223817061258903552 --><br />
<style type='text/css'>#bbpBox_223817061258903552 a { text-decoration:none; color:#3D4E77; }#bbpBox_223817061258903552 a:hover { text-decoration:underline; }</style>
<div id='bbpBox_223817061258903552' class='bbpBox' style='padding:20px; margin:5px 0; background-color:#222736; background-image:url(http://a0.twimg.com/profile_background_images/34979092/twitter_background.jpg); background-repeat:no-repeat'>
<div style='background:#fff; padding:10px; margin:0; min-height:48px; color:#312F2D; -moz-border-radius:5px; -webkit-border-radius:5px;'><span style='width:100%; font-size:18px; line-height:22px;'>@<a href="http://twitter.com/intent/user?screen_name=wesmckinn" class="twitter-action">wesmckinn</a> Great idea. Links added as of a few minutes ago.</span>
<div class='bbp-actions' style='font-size:12px; width:100%; padding:5px 0; margin:0 0 10px 0; border-bottom:1px solid #e6e6e6;'><img align='middle' src='http://wesmckinney.com/blog/wp-content/plugins/twitter-blackbird-pie//images/bird.png' /><a title='tweeted on July 13, 2012 4:32 pm' href='http://twitter.com/#!/cameronmcefee/status/223817061258903552' target='_blank'>July 13, 2012 4:32 pm</a> via <a href="http://tapbots.com" rel="nofollow" target="blank">Tweetbot for Mac</a><a href='https://twitter.com/intent/tweet?in_reply_to=223817061258903552' class='bbp-action bbp-reply-action' title='Reply'><span><em style='margin-left: 1em;'></em><strong>Reply</strong></span></a><a href='https://twitter.com/intent/retweet?tweet_id=223817061258903552' class='bbp-action bbp-retweet-action' title='Retweet'><span><em style='margin-left: 1em;'></em><strong>Retweet</strong></span></a><a href='https://twitter.com/intent/favorite?tweet_id=223817061258903552' class='bbp-action bbp-favorite-action' title='Favorite'><span><em style='margin-left: 1em;'></em><strong>Favorite</strong></span></a></div>
<div style='float:left; padding:0; margin:0'><a href='http://twitter.com/intent/user?screen_name=cameronmcefee'><img style='width:48px; height:48px; padding-right:7px; border:none; background:none; margin:0' src='http://a0.twimg.com/profile_images/2132764956/mustache_normal.jpg' /></a></div>
<div style='float:left; padding:0; margin:0'><a style='font-weight:bold' href='http://twitter.com/intent/user?screen_name=cameronmcefee'>@cameronmcefee</a>
<div style='margin:0; padding-top:2px'>Cameron McEfee</div>
</div>
<div style='clear:both'></div>
</div>
</div>
<p><!-- end of tweet --></p>
<p>Small victories, I guess. Enjoy.</p>
]]></content:encoded>
			<wfw:commentRss>http://wesmckinney.com/blog/?feed=rss2&#038;p=531</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Latest Table of Contents for Python for Data Analysis</title>
		<link>http://wesmckinney.com/blog/?p=523</link>
		<comments>http://wesmckinney.com/blog/?p=523#comments</comments>
		<pubDate>Mon, 25 Jun 2012 16:15:51 +0000</pubDate>
		<dc:creator>Wes McKinney</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://wesmckinney.com/blog/?p=523</guid>
		<description><![CDATA[Making some progress on Python for Data Analysis]]></description>
			<content:encoded><![CDATA[<p>Making some progress on <a href="http://shop.oreilly.com/product/0636920023784.do" title="O'Reilly PyData book" target="_blank">Python for Data Analysis</a></p>
<p><object id="doc_713253885717957" name="doc_713253885717957" height="600" width="620" type="application/x-shockwave-flash" data="http://d1.scribdassets.com/ScribdViewer.swf" style="outline:none;"><param name="movie" value="http://d1.scribdassets.com/ScribdViewer.swf"><param name="wmode" value="opaque"><param name="bgcolor" value="#ffffff"><param name="allowFullScreen" value="true"><param name="allowScriptAccess" value="always"><param name="FlashVars" value="document_id=98197582&#038;access_key=key-y3bkn6gw2q0k1gio061&#038;page=1&#038;viewMode=list"><embed id="doc_713253885717957" name="doc_713253885717957" src="http://d1.scribdassets.com/ScribdViewer.swf?document_id=98197582&#038;access_key=key-y3bkn6gw2q0k1gio061&#038;page=1&#038;viewMode=list" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" height="600" width="620" wmode="opaque" bgcolor="#ffffff"></embed></object></p>
]]></content:encoded>
			<wfw:commentRss>http://wesmckinney.com/blog/?feed=rss2&#038;p=523</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Intro to Python for Financial Data Analysis at General Assembly</title>
		<link>http://wesmckinney.com/blog/?p=519</link>
		<comments>http://wesmckinney.com/blog/?p=519#comments</comments>
		<pubDate>Fri, 22 Jun 2012 18:13:35 +0000</pubDate>
		<dc:creator>Wes McKinney</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://wesmckinney.com/blog/?p=519</guid>
		<description><![CDATA[I taught a class this past Monday, June 18, at General Assembly. Here are the (very brief) slides and a link to the IPython notebooks. You&#8217;ll need at least pandas 0.8.0b2, though unfortunately I identified a few bugs during the &#8230;<p class="read-more"><a href="http://wesmckinney.com/blog/?p=519">Read more &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<p>I taught a class this past Monday, June 18, at <a href="http://generalassemb.ly/education/" title="General Assembly" target="_blank">General Assembly</a>. Here are the (very brief) slides and a <a href="http://wesmckinney.com/files/20120618/notebooks.zip" title="6/18 Notebooks" target="_blank">link to the IPython notebooks</a>. You&#8217;ll need at least <a href="http://pandas.pydata.org/getpandas.html" title="pandas download" target="_blank">pandas 0.8.0b2</a>, though unfortunately I identified a few bugs during the class that have since been fixed. Look out for the final release of pandas 0.8.0 any day now. </p>
<div style="width:425px" id="__ss_13422169"> <strong style="display:block;margin:12px 0 4px"><a href="http://www.slideshare.net/wesm/intro-to-python-for-financial-data-analysis" title="Intro to Python for Financial Data Analysis" target="_blank">Intro to Python for Financial Data Analysis</a></strong> <iframe src="http://www.slideshare.net/slideshow/embed_code/13422169" width="425" height="355" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC;border-width:1px 1px 0" allowfullscreen></iframe>
<div style="padding:5px 0 12px"> View more <a href="http://www.slideshare.net/" target="_blank">presentations</a> from <a href="http://www.slideshare.net/wesm" target="_blank">wesm</a> </div>
</p></div>
]]></content:encoded>
			<wfw:commentRss>http://wesmckinney.com/blog/?feed=rss2&#038;p=519</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Easy, high performance time zone handling in pandas 0.8.0</title>
		<link>http://wesmckinney.com/blog/?p=506</link>
		<comments>http://wesmckinney.com/blog/?p=506#comments</comments>
		<pubDate>Tue, 05 Jun 2012 01:01:16 +0000</pubDate>
		<dc:creator>Wes McKinney</dc:creator>
				<category><![CDATA[pandas]]></category>

		<guid isPermaLink="false">http://wesmckinney.com/blog/?p=506</guid>
		<description><![CDATA[Making time zone handling palatable is surprisingly difficult to get right. The generally agreed-upon &#8220;best practice&#8221; for storing timestamps is to use UTC. Otherwise, you have to worry about daylight savings time ambiguities or non-existent times. The misery of time &#8230;<p class="read-more"><a href="http://wesmckinney.com/blog/?p=506">Read more &#187;</a></p>]]></description>
			<content:encoded><![CDATA[<p>Making time zone handling palatable is surprisingly difficult to get right. The generally agreed-upon &#8220;best practice&#8221; for storing timestamps is to use <a href="http://en.wikipedia.org/wiki/Coordinated_Universal_Time" title="UTC" target="_blank">UTC</a>. Otherwise, you have to worry about daylight savings time ambiguities or non-existent times. The misery of time zone handling is well documented, and <a href="http://lucumr.pocoo.org/2011/7/15/eppur-si-muove/" title="Ronacher time zone handling" target="_blank">summarized nicely last year by Armin Ronacher</a>. When you work in UTC, most of your troubles go away; converting a single timestamp or array of timestamps between time zones becomes in essence a &#8220;free&#8221; operation since the time zone is simply metadata for the box containing the invariant UTC timestamp. </p>
<p>But it&#8217;s not all fun and games. The Python datetime API in this area is generally considered to be severely lacking. It&#8217;s so bad that <a href="https://github.com/nvie/times" title="nvie/times" target="_blank">77-line modules with half a dozen convenience functions</a> can get 245 watchers on GitHub. I often write that much code before I finish my first cup of coffee in the morning <img src='http://wesmckinney.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  But, for most applications you can suffer through the API and use <a href="http://pytz.sourceforge.net/" title="pytz" target="_blank">pytz</a>, which is an adequate solution in most cases. pytz notably ships the <a href="http://en.wikipedia.org/wiki/Tz_database" title="Olson database" target="_blank">Olson timezone database</a> which is the key piece of information for powering time zone conversions.</p>
<p>But what about pandas? Among other things, pandas is <b>really</b> good for time series data, including very large time series data in the millions of observations. I don&#8217;t want to make pandas users suffer because of Python&#8217;s datetime API, so I&#8217;m happy to provide a better one (a bit more on this later). The biggest issue is: as with many non-scientific Python libraries, pytz and other tools have a fatal illness known as TMP, a.k.a. <b>Too Much (pure) Python</b>. Let me explain:</p>
<p><pre><code>In [1]: dt = datetime(2009, 4, 15)

In [2]: import pytz

In [3]: tz = pytz.timezone('US/Eastern')

In [4]: tz.localize(dt)
Out[4]: datetime.datetime(2009, 4, 15, 0, 0, tzinfo=&lt;DstTzInfo 'US/Eastern' EDT-1 day, 20:00:00 DST&gt;)

In [5]: timeit tz.localize(dt)
10000 loops, best of 3: 33.4 us per loop</code></pre></p>
<p>So, localizing a single <kbd>datetime.datetime</kbd> value takes 33 microseconds, or ~33 seconds per million timestamps. Localize serves a couple of important, but annoying functions: checking for ambiguities (&#8220;fall back&#8221;) and non-existent times (&#8220;spring forward&#8221;) at DST transition times. </p>
<p><pre><code>In [7]: tz.localize(datetime(2012, 3, 11, 2, 30), is_dst=None)
---------------------------------------------------------------------------
NonExistentTimeError                      Traceback (most recent call last)
/home/wesm/code/pandas/&lt;ipython-input-7-e6d82ff6f746&gt; in &lt;module&gt;()
----&gt; 1 tz.localize(datetime(2012, 3, 11, 2, 30), is_dst=None)

/home/wesm/code/repos/matplotlib/lib/pytz/tzinfo.pyc in localize(self, dt, is_dst)
    320             # If we refuse to guess, raise an exception.
    321             if is_dst is None:
--&gt; 322                 raise NonExistentTimeError(dt)
    323 
    324             # If we are forcing the pre-DST side of the DST transition, we

NonExistentTimeError: 2012-03-11 02:30:00</code></pre></p>
<p>Now, one major problem that I found while examining pytz code is how many temporary datetime.datetime objects are created during a single call to <kbd>tz.localize</kbd>. How many do you think? </p>
<h2>15. 15 temporary <kbd>datetime.datetime</kbd> objects</h2>
<p>Don&#8217;t believe me? Look for yourself. Just following what is going on inside the function is enough to make your head hurt. The code is vastly complicated by the fact that tz-aware datetimes are not comparable with tz-naive datetimes. </p>
<p>Obviously, there must be a better and faster way. Some might argue that I should improve pytz, but the problem is that the implementation of time zone logic is dependent on the representation of the timestamps. Over the last few months I have stopped using <kbd>datetime.datetime</kbd> in pandas in favor of 64-bit integer timestamps via NumPy&#8217;s <kbd>datetime64</kbd> data type. Storing large arrays of datetime.datetime values is disastrously inefficient in terms of memory and performance in all time series operations. Obviously I can&#8217;t force this design decision on most Python programmers who are not engaged in high-performance data analysis work.</p>
<h1>Making time zone handling fast and easy</h1>
<p>So, here are my requirements for pandas&#8217;s time zone capabilities:</p>
<li>All operations must be vectorized and be as fast as possible on large arrays of irregular, not necessarily ordered 64-bit timestamps</li>
<li>API must be as simple and non-crappy as possible without sacrificing functionality.</li>
<p>pandas 0.8.0 has a new <kbd>Timestamp</kbd> data type which is a subclass of datetime.datetime providing nanosecond resolution support and, in my opinion, a strictly superior interface for working with dates and time:</p>
<p><pre><code>In [3]: stamp = Timestamp('3/11/2012 04:00')

In [4]: stamp
Out[4]: &lt;Timestamp: 2012-03-11 04:00:00&gt;

In [5]: stamp.value  # Naive timestamp
Out[5]: 1331438400000000000</code></pre></p>
<p>Timestamps can be created as local or converted to local using <kbd>tz_localize</kbd>. Conversions from one time zone to another use <kbd>tz_convert</kbd>:</p>
<p><pre><code>In [6]: stamp.tz_localize('US/Eastern')
Out[6]: &lt;Timestamp: 2012-03-11 04:00:00-0400 EDT, tz=US/Eastern&gt;

In [7]: eastern_stamp = Timestamp('3/11/2012 04:00', tz='US/Eastern')

In [8]: eastern_stamp
Out[8]: &lt;Timestamp: 2012-03-11 04:00:00-0400 EDT, tz=US/Eastern&gt;

In [9]: eastern_stamp.value
Out[9]: 1331452800000000000

In [10]: eastern_stamp.tz_convert('utc')
Out[10]: &lt;Timestamp: 2012-03-11 08:00:00+0000 UTC, tz=UTC&gt;</code></pre></p>
<p>Wonder what time it is right now in London (it&#8217;s 8:50 PM in New York as I type this)? </p>
<p><pre><code>In [11]: Timestamp('now', tz='Europe/London')
Out[11]: &lt;Timestamp: 2012-07-08 17:27:13+0100 BST, tz=Europe/London&gt;</code></pre></p>
<p>So that&#8217;s nice. Compared with datetime.datetime, Timestamp doesn&#8217;t get in your way as much. Timestamps are equal if and only if their UTC timestamps are equal:</p>
<p><pre><code>In [12]: eastern_stamp == eastern_stamp.tz_convert('utc')
Out[12]: True</code></pre></p>
<p>This makes sense, because they refer to the same moment in time. Also, adding timedeltas will do the right thing around DST transitions:</p>
<p><pre><code>In [13]: stamp = Timestamp('3/11/2012 01:00', tz='US/Eastern')

In [14]: stamp
Out[14]: &lt;Timestamp: 2012-03-11 01:00:00-0500 EST, tz=US/Eastern&gt;

In [17]: stamp + timedelta(hours=3)
Out[17]: &lt;Timestamp: 2012-03-11 05:00:00-0400 EDT, tz=US/Eastern&gt;</code></pre></p>
<p>OK, great. Scalar operations. I could have done all this with pytz. I&#8217;m really interested in vector operations on large time series. </p>
<p><pre><code>In [18]: rng = date_range('3/11/2012 03:00', '4/1/2012', freq='S', tz='US/Eastern')

In [19]: rng
Out[19]: 
&lt;class 'pandas.tseries.index.DatetimeIndex'&gt;
[2012-03-11 03:00:00, ..., 2012-04-01 00:00:00]
Length: 1803601, Freq: S, Timezone: US/Eastern</code></pre></p>
<p>Localizing all of 1.8 million timestamps (without taking advantage of the fact that this range is regular and lacks any DST transitions, which you <b>cannot</b> assume in the general, irregular case) would have taken about 1 full minute if we were working with pytz and datetime.datetime objects. Here it takes about 390 ms using a vectorized Cython routine of my devising:</p>
<p><pre><code>In [20]: timeit rng = date_range('3/11/2012 03:00', '4/1/2012', freq='S', tz='US/Eastern')
1 loops, best of 3: 415 ms per loop</code></pre></p>
<p>What&#8217;s nice about working in UTC is that time zone conversions are now nearly free and do not copy any data (the DatetimeIndex is immutable):</p>
<p><pre><code>In [22]: rng.tz_convert('Europe/Moscow')
Out[22]: 
&lt;class 'pandas.tseries.index.DatetimeIndex'&gt;
[2012-03-11 11:00:00, ..., 2012-04-01 08:00:00]
Length: 1803601, Freq: S, Timezone: Europe/Moscow</code></pre></p>
<p>Scalar values are converted to Timestamp objects with the right hour, minute, second:</p>
<p><pre><code>In [23]: rng = date_range('3/6/2012', periods=10, tz='US/Eastern')

In [24]: ts = Series(np.random.randn(len(rng)), rng)

In [25]: ts
Out[25]: 
2012-03-06 00:00:00-05:00    0.807059
2012-03-07 00:00:00-05:00    0.938366
2012-03-08 00:00:00-05:00   -1.262472
2012-03-09 00:00:00-05:00    1.942384
2012-03-10 00:00:00-05:00   -1.346362
2012-03-11 00:00:00-05:00   -2.570099
2012-03-12 00:00:00-04:00   -0.606283
2012-03-13 00:00:00-04:00    0.150267
2012-03-14 00:00:00-04:00    0.044596
2012-03-15 00:00:00-04:00    1.274109
Freq: D

In [26]: ts.index[8]
Out[26]: &lt;Timestamp: 2012-03-14 00:00:00-0400 EDT, tz=US/Eastern&gt;

In [27]: ts.index[3]
Out[27]: &lt;Timestamp: 2012-03-09 00:00:00-0500 EST, tz=US/Eastern&gt;

In [28]: ts.tz_convert('utc')
Out[28]: 
2012-03-06 05:00:00+00:00    0.807059
2012-03-07 05:00:00+00:00    0.938366
2012-03-08 05:00:00+00:00   -1.262472
2012-03-09 05:00:00+00:00    1.942384
2012-03-10 05:00:00+00:00   -1.346362
2012-03-11 05:00:00+00:00   -2.570099
2012-03-12 04:00:00+00:00   -0.606283
2012-03-13 04:00:00+00:00    0.150267
2012-03-14 04:00:00+00:00    0.044596
2012-03-15 04:00:00+00:00    1.274109
Freq: D</code></pre></p>
<p>Anyway, this is just a flavor of some of the things you can do in the almost-released version of pandas. Lots more easy-to-use and high-performance data analysis tooling to come.</p>
]]></content:encoded>
			<wfw:commentRss>http://wesmckinney.com/blog/?feed=rss2&#038;p=506</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
