Monthly Archives: June 2012

Latest Table of Contents for Python for Data Analysis

Making some progress on Python for Data Analysis

Intro to Python for Financial Data Analysis at General Assembly

I taught a class this past Monday, June 18, at General Assembly. Here are the (very brief) slides and a link to the IPython notebooks. You’ll need at least pandas 0.8.0b2; unfortunately, I identified a few bugs during the class, though they have since been fixed. Look out for the final release of pandas 0.8.0 any day now.

Easy, high performance time zone handling in pandas 0.8.0

Making time zone handling palatable is surprisingly difficult to get right. The generally agreed-upon “best practice” for storing timestamps is to use UTC. Otherwise, you have to worry about daylight saving time ambiguities or non-existent times. The misery of time zone handling is well documented, and was summarized nicely last year by Armin Ronacher. When you work in UTC, most of your troubles go away; converting a single timestamp or array of timestamps between time zones becomes in essence a “free” operation, since the time zone is simply metadata for the box containing the invariant UTC timestamp.

But it’s not all fun and games. The Python datetime API in this area is generally considered to be severely lacking. It’s so bad that 77-line modules with half a dozen convenience functions can get 245 watchers on GitHub. I often write that much code before I finish my first cup of coffee in the morning :) But for most applications you can suffer through the API and use pytz, which is an adequate solution. pytz notably ships the Olson timezone database, which is the key piece of data powering time zone conversions.

But what about pandas? Among other things, pandas is really good for time series data, including very large time series with millions of observations. I don’t want to make pandas users suffer because of Python’s datetime API, so I’m happy to provide a better one (a bit more on this later). The biggest issue: as with many non-scientific Python libraries, pytz and other tools have a fatal illness known as TMP, a.k.a. Too Much (pure) Python. Let me explain:
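Here’s a rough sketch of the kind of measurement I mean (it assumes pytz is installed; the exact figure depends on your machine, but it lands in the tens of microseconds per call):

    # Benchmark sketch: localizing one naive datetime with pytz.
    # Absolute numbers vary by machine; the point is that the work is
    # all pure Python.
    from datetime import datetime
    from timeit import timeit

    import pytz

    tz = pytz.timezone('US/Eastern')
    dt = datetime(2012, 6, 18, 9, 30)

    n = 100000
    per_call = timeit(lambda: tz.localize(dt), number=n) / n
    print('tz.localize: %.1f us per call' % (per_call * 1e6))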

So, localizing a single datetime.datetime value takes 33 microseconds, or ~33 seconds per million timestamps. The localize method serves a couple of important but annoying functions: checking for ambiguous times (“fall back”) and non-existent times (“spring forward”) around DST transitions.

Now, one major problem that I found while examining pytz code is how many temporary datetime.datetime objects are created during a single call to tz.localize. How many do you think?

15. 15 temporary datetime.datetime objects

Don’t believe me? Look for yourself. Just following what is going on inside the function is enough to make your head hurt. The code is vastly complicated by the fact that tz-aware datetimes are not comparable with tz-naive datetimes.

Obviously, there must be a better and faster way. Some might argue that I should improve pytz, but the problem is that the implementation of time zone logic depends on the representation of the timestamps. Over the last few months I have stopped using datetime.datetime in pandas in favor of 64-bit integer timestamps via NumPy’s datetime64 data type. Storing large arrays of datetime.datetime values is disastrously inefficient in terms of memory and performance in all time series operations. Of course, I can’t force this design decision on most Python programmers who are not engaged in high-performance data analysis work.
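As a rough illustration (a sketch, with approximate numbers that depend on your Python build), compare the footprint of a million boxed datetime.datetime values against the same data stored as datetime64:

    import sys
    from datetime import datetime, timedelta

    import numpy as np

    n = 1000000
    base = datetime(2012, 1, 1)
    pydates = [base + timedelta(seconds=i) for i in range(n)]

    # Object dtype: an 8-byte pointer per slot plus a boxed
    # datetime.datetime (tens of bytes) for every element.
    obj_arr = np.array(pydates, dtype=object)
    py_bytes = obj_arr.nbytes + sum(sys.getsizeof(d) for d in pydates)

    # datetime64[ns]: a flat block of 8-byte integers.
    dt64_arr = np.array(pydates, dtype='datetime64[ns]')

    print('object dtype: ~%d MB' % (py_bytes // 10**6))
    print('datetime64:    %d MB' % (dt64_arr.nbytes // 10**6))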

Making time zone handling fast and easy

So, here are my requirements for pandas’s time zone capabilities:

  • All operations must be vectorized and be as fast as possible on large arrays of irregular, not necessarily ordered 64-bit timestamps
  • API must be as simple and non-crappy as possible without sacrificing functionality.
pandas 0.8.0 has a new Timestamp data type, a subclass of datetime.datetime that provides nanosecond resolution and, in my opinion, a strictly superior interface for working with dates and times:
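For example (a sketch; the exact repr formatting differs across pandas versions):

    from pandas import Timestamp

    stamp = Timestamp('2012-03-11 04:00')
    print(stamp)        # 2012-03-11 04:00:00
    print(stamp.value)  # nanoseconds since the epoch, as a 64-bit integer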

Timestamps can be created as local or converted to local using tz_localize. Conversions from one time zone to another use tz_convert:
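Something like this (the dates and zone names here are my own illustrative choices):

    from pandas import Timestamp

    stamp = Timestamp('2012-03-11 04:00')

    # Attach a time zone to a naive Timestamp...
    stamp_est = stamp.tz_localize('US/Eastern')
    print(stamp_est)  # 2012-03-11 04:00:00-04:00

    # ...and convert it to another zone; the underlying UTC value
    # does not change.
    print(stamp_est.tz_convert('Europe/London'))  # 2012-03-11 08:00:00+00:00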

Wonder what time it is right now in London (it’s 8:50 PM in New York as I type this)?
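A sketch (the output obviously depends on when you run it):

    from pandas import Timestamp

    # Current wall-clock time in another time zone.
    print(Timestamp.now(tz='Europe/London'))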

So that’s nice. Compared with datetime.datetime, Timestamp doesn’t get in your way as much. Timestamps are equal if and only if their UTC timestamps are equal:
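For instance (illustrative values; the wall-clock times differ but the instant is the same):

    from pandas import Timestamp

    ny = Timestamp('2012-06-19 20:50', tz='US/Eastern')
    ldn = ny.tz_convert('Europe/London')

    # Different local representations, same instant in UTC.
    print(ny == ldn)              # True
    print(ny.value == ldn.value)  # True: identical UTC nanoseconds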

This makes sense, because they refer to the same moment in time. Also, adding timedeltas will do the right thing around DST transitions:
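A sketch using the night the US “springs forward” (2:00 AM jumps to 3:00 AM):

    from datetime import timedelta

    from pandas import Timestamp

    # 1:30 AM EST on the morning of the US spring-forward transition.
    before = Timestamp('2012-03-11 01:30', tz='US/Eastern')

    # Adding an hour moves forward in absolute (UTC) time, so the local
    # clock lands on 3:30 AM EDT rather than the non-existent 2:30.
    print(before + timedelta(hours=1))  # 2012-03-11 03:30:00-04:00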

OK, great. Scalar operations. I could have done all this with pytz. I’m really interested in vector operations on large time series.
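To have something concrete to work with, here’s a DatetimeIndex of 1.8 million second-frequency timestamps (the particular start date and frequency are just illustrative choices):

    from pandas import date_range

    # 1.8 million timestamps at one-second frequency; this particular
    # range contains no DST transitions.
    rng = date_range('2012-03-12', periods=1800000, freq='s')
    print(len(rng))  # 1800000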

Localizing all 1.8 million timestamps (without taking advantage of the fact that this range is regular and lacks any DST transitions, which you cannot assume in the general, irregular case) would have taken about a full minute if we were working with pytz and datetime.datetime objects. Here it takes about 390 ms using a vectorized Cython routine of my devising:
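Roughly like this (a sketch; the absolute timing depends on hardware and pandas version):

    from timeit import timeit

    from pandas import date_range

    rng = date_range('2012-03-12', periods=1800000, freq='s')

    # Localize the entire index in one vectorized call.
    elapsed = timeit(lambda: rng.tz_localize('US/Eastern'), number=1)
    print('tz_localize: %.0f ms' % (elapsed * 1000))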

What’s nice about working in UTC is that time zone conversions are now nearly free and do not copy any data (the DatetimeIndex is immutable):
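A sketch of what that looks like (again, the dates are illustrative):

    from pandas import date_range

    rng_utc = date_range('2012-03-12', periods=1800000, freq='s', tz='UTC')

    # tz_convert only swaps the time zone metadata; the underlying block
    # of int64 UTC timestamps is reused, not copied.
    rng_eastern = rng_utc.tz_convert('US/Eastern')
    print(rng_eastern[0])                            # 2012-03-11 20:00:00-04:00
    print((rng_utc.asi8 == rng_eastern.asi8).all())  # True: same UTC data underneath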

Scalar values are converted to Timestamp objects with the right hour, minute, second:
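For example (a sketch; the index here is a shortened version of the one above):

    from pandas import date_range

    rng_eastern = date_range('2012-03-12', periods=10, freq='s',
                             tz='UTC').tz_convert('US/Eastern')

    # Indexing a tz-aware DatetimeIndex returns Timestamp objects already
    # expressed in local wall-clock time for that zone.
    stamp = rng_eastern[5]
    print(stamp)         # 2012-03-11 20:00:05-04:00
    print(stamp.tzinfo)  # US/Eastern tzinfo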

Anyway, this is just a flavor of some of the things you can do in the almost-released version of pandas. Lots more easy-to-use and high-performance data analysis tooling to come.