Easy, high-performance time zone handling in pandas 0.8.0

Making time zone handling palatable is surprisingly difficult to get right. The generally agreed-upon “best practice” for storing timestamps is to use UTC. Otherwise, you have to worry about daylight saving time ambiguities or non-existent times. The misery of time zone handling is well documented, and was summarized nicely last year by Armin Ronacher. When you work in UTC, most of your troubles go away: converting a single timestamp or an array of timestamps between time zones becomes essentially a “free” operation, since the time zone is simply metadata for the box containing the invariant UTC timestamp.
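
To make the “free” claim concrete, here is a minimal pytz sketch (the zone names are just examples): once an instant is stored as UTC, showing it in another time zone is only a matter of attaching different display metadata via astimezone.

from datetime import datetime
import pytz

# One invariant moment in time, stored as UTC
utc_moment = datetime(2012, 3, 11, 8, 0, tzinfo=pytz.utc)

# "Converting" does not change the instant, only how it is rendered
print(utc_moment.astimezone(pytz.timezone('US/Eastern')))     # 2012-03-11 04:00:00-04:00
print(utc_moment.astimezone(pytz.timezone('Europe/London')))  # 2012-03-11 08:00:00+00:00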

But it’s not all fun and games. The Python datetime API in this area is generally considered to be severely lacking. It’s so bad that 77-line modules with half a dozen convenience functions can get 245 watchers on GitHub. I often write that much code before I finish my first cup of coffee in the morning :) But for most applications you can suffer through the API and use pytz, which is an adequate solution in most cases. Notably, pytz ships the Olson time zone database, which is the key piece of data powering time zone conversions.

But what about pandas? Among other things, pandas is really good for time series data, including very large time series in the millions of observations. I don’t want to make pandas users suffer because of Python’s datetime API, so I’m happy to provide a better one (a bit more on this later). The biggest issue is that, as with many non-scientific Python libraries, pytz and other tools have a fatal illness known as TMP, a.k.a. Too Much (pure) Python. Let me explain:

In [1]: dt = datetime(2009, 4, 15)
In [2]: import pytz
In [3]: tz = pytz.timezone('US/Eastern')
In [4]: tz.localize(dt)
Out[4]: datetime.datetime(2009, 4, 15, 0, 0, tzinfo=<DstTzInfo 'US/Eastern' EDT-1 day, 20:00:00 DST>)
In [5]: timeit tz.localize(dt)
10000 loops, best of 3: 33.4 us per loop

So, localizing a single datetime.datetime value takes 33 microseconds, or about 33 seconds per million timestamps. The localize method serves a couple of important but annoying functions: checking for ambiguous times (“fall back”) and non-existent times (“spring forward”) at DST transitions.

In [7]: tz.localize(datetime(2012, 3, 11, 2, 30), is_dst=None)
---------------------------------------------------------------------------
NonExistentTimeError Traceback (most recent call last)
/home/wesm/code/pandas/<ipython-input-7-e6d82ff6f746> in <module>()
----> 1 tz.localize(datetime(2012, 3, 11, 2, 30), is_dst=None)
/home/wesm/code/repos/matplotlib/lib/pytz/tzinfo.pyc in localize(self, dt, is_dst)
320 # If we refuse to guess, raise an exception.
321 if is_dst is None:
--> 322 raise NonExistentTimeError(dt)
323
324 # If we are forcing the pre-DST side of the DST transition, we
NonExistentTimeError: 2012-03-11 02:30:00
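
The “fall back” side is the mirror image: when a wall-clock time occurred twice, pytz refuses to guess unless you tell it which side of the transition you mean. A minimal sketch (using the 2012 US DST end date as the example):

from datetime import datetime
import pytz

tz = pytz.timezone('US/Eastern')

# 1:30 AM on 2012-11-04 happened twice (clocks fell back at 2:00 AM),
# so with is_dst=None pytz raises AmbiguousTimeError instead of guessing.
try:
    tz.localize(datetime(2012, 11, 4, 1, 30), is_dst=None)
except pytz.exceptions.AmbiguousTimeError as exc:
    print('ambiguous: %s' % exc)

# Passing is_dst picks a side explicitly
print(tz.localize(datetime(2012, 11, 4, 1, 30), is_dst=True))   # ... -04:00 (EDT)
print(tz.localize(datetime(2012, 11, 4, 1, 30), is_dst=False))  # ... -05:00 (EST)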

Now, one major problem that I found while examining pytz code is how many temporary datetime.datetime objects are created during a single call to tz.localize. How many do you think?

15. 15 temporary datetime.datetime objects

Don’t believe me? Look for yourself. Just following what is going on inside the function is enough to make your head hurt. The code is vastly complicated by the fact that tz-aware datetimes are not comparable with tz-naive datetimes.

Obviously, there must be a better and faster way. Some might argue that I should improve pytz, but the problem is that the implementation of the time zone logic depends on the representation of the timestamps. Over the last few months I have stopped using datetime.datetime in pandas in favor of 64-bit integer timestamps via NumPy’s datetime64 data type. Storing large arrays of datetime.datetime values is disastrously inefficient in terms of memory and performance in all time series operations. Of course, I can’t force this design decision on most Python programmers who are not engaged in high-performance data analysis work.
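
As a rough illustration of the storage gap (the per-object size is a CPython implementation detail and varies by version), compare a million timestamps held as Python objects with the same data packed as datetime64[ns]:

from datetime import datetime, timedelta
import numpy as np

n = 1000000
stamps = [datetime(2012, 1, 1) + timedelta(seconds=i) for i in range(n)]

# As an object array: 8 bytes per pointer in the array itself, plus roughly
# 50 bytes of heap storage for every individual datetime.datetime instance.
obj_arr = np.array(stamps, dtype=object)

# As datetime64[ns]: exactly 8 bytes per value (int64 nanoseconds since the
# epoch), nothing else, and vectorized operations run over the raw data in C.
i8_arr = np.array(stamps, dtype='datetime64[ns]')

print(obj_arr.nbytes)  # 8000000 -- pointers only; the objects live elsewhere
print(i8_arr.nbytes)   # 8000000 -- and this is the *entire* storage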

Making time zone handling fast and easy

So, here are my requirements for pandas’s time zone capabilities:

  • All operations must be vectorized and as fast as possible on large arrays of irregular, not necessarily ordered 64-bit timestamps
  • The API must be as simple and non-crappy as possible without sacrificing functionality

pandas 0.8.0 has a new Timestamp data type, a subclass of datetime.datetime that provides nanosecond resolution and, in my opinion, a strictly superior interface for working with dates and times:

    In [3]: stamp = Timestamp('3/11/2012 04:00')
    In [4]: stamp
    Out[4]: <Timestamp: 2012-03-11 04:00:00>
    In [5]: stamp.value # Naive timestamp
    Out[5]: 1331438400000000000
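
    The value attribute is simply the number of nanoseconds since the Unix epoch. A quick sanity check in plain Python (not part of the pandas API):

    from datetime import datetime

    # 1331438400000000000 ns since the epoch, truncated to whole seconds,
    # is the naive wall-clock time shown above
    print(datetime.utcfromtimestamp(1331438400000000000 // 10**9))
    # 2012-03-11 04:00:00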

    Timestamps can be created as local or converted to local using tz_localize. Conversions from one time zone to another use tz_convert:

    In [6]: stamp.tz_localize('US/Eastern')
    Out[6]: <Timestamp: 2012-03-11 04:00:00-0400 EDT, tz=US/Eastern>
    In [7]: eastern_stamp = Timestamp('3/11/2012 04:00', tz='US/Eastern')
    In [8]: eastern_stamp
    Out[8]: <Timestamp: 2012-03-11 04:00:00-0400 EDT, tz=US/Eastern>
    In [9]: eastern_stamp.value
    Out[9]: 1331452800000000000
    In [10]: eastern_stamp.tz_convert('utc')
    Out[10]: <Timestamp: 2012-03-11 08:00:00+0000 UTC, tz=UTC>
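
    Note that the two integer values differ by exactly the UTC offset, because the internal value is always UTC nanoseconds. A quick check (plain arithmetic, not pandas):

    # eastern_stamp.value minus the naive stamp.value, in hours:
    # 14,400 seconds = 4 hours, the US/Eastern (EDT) offset on 2012-03-11
    print((1331452800000000000 - 1331438400000000000) / 10.0**9 / 3600)  # 4.0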

    Wonder what time it is right now in London (it’s 8:50 PM in New York as I type this)?

    In [11]: Timestamp('now', tz='Europe/London')
    Out[11]: <Timestamp: 2012-07-08 17:27:13+0100 BST, tz=Europe/London>

    So that’s nice. Compared with datetime.datetime, Timestamp doesn’t get in your way as much. Timestamps are equal if and only if their UTC timestamps are equal:

    In [12]: eastern_stamp == eastern_stamp.tz_convert('utc')
    Out[12]: True

    This makes sense, because they refer to the same moment in time. Also, adding timedeltas will do the right thing around DST transitions:

    In [13]: stamp = Timestamp('3/11/2012 01:00', tz='US/Eastern')
    In [14]: stamp
    Out[14]: <Timestamp: 2012-03-11 01:00:00-0500 EST, tz=US/Eastern>
    In [17]: stamp + timedelta(hours=3)
    Out[17]: <Timestamp: 2012-03-11 05:00:00-0400 EDT, tz=US/Eastern>
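
    For contrast, plain datetime arithmetic with pytz keeps the stale offset until you remember to call normalize. A sketch of the same computation using pytz alone:

    from datetime import datetime, timedelta
    import pytz

    tz = pytz.timezone('US/Eastern')
    stamp = tz.localize(datetime(2012, 3, 11, 1, 0))   # 01:00 EST

    shifted = stamp + timedelta(hours=3)
    print(shifted)                # 2012-03-11 04:00:00-05:00 -- stale EST offset
    print(tz.normalize(shifted))  # 2012-03-11 05:00:00-04:00 -- the correct EDT answer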

    OK, great. Scalar operations. I could have done all this with pytz. I’m really interested in vector operations on large time series.

    In [18]: rng = date_range('3/11/2012 03:00', '4/1/2012', freq='S', tz='US/Eastern')
    In [19]: rng
    Out[19]:
    <class 'pandas.tseries.index.DatetimeIndex'>
    [2012-03-11 03:00:00, ..., 2012-04-01 00:00:00]
    Length: 1803601, Freq: S, Timezone: US/Eastern

    Localizing all 1.8 million timestamps (without exploiting the fact that this range is regular and contains no DST transitions, something you cannot assume in the general, irregular case) would have taken about a full minute with pytz and datetime.datetime objects. Here it takes about 390 ms using a vectorized Cython routine of my devising:

    In [20]: timeit rng = date_range('3/11/2012 03:00', '4/1/2012', freq='S', tz='US/Eastern')
    1 loops, best of 3: 415 ms per loop
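
    For comparison, here is a sketch of how you might measure the pure-Python baseline (not the exact benchmark behind the numbers above; timings vary by machine): localize a 10,000-element slice with pytz and extrapolate to 1.8 million stamps.

    from datetime import datetime, timedelta
    import timeit
    import pytz

    tz = pytz.timezone('US/Eastern')
    start = datetime(2012, 3, 11, 3, 0)

    # 10,000 consecutive seconds from the same range
    chunk = [start + timedelta(seconds=i) for i in range(10000)]

    best = min(timeit.repeat(lambda: [tz.localize(x) for x in chunk],
                             number=1, repeat=3))
    print('extrapolated to 1.8 million stamps: %.1f s' % (best * 180))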

    What’s nice about working in UTC is that time zone conversions are now nearly free and do not copy any data (the DatetimeIndex is immutable):

    In [22]: rng.tz_convert('Europe/Moscow')
    Out[22]:
    <class 'pandas.tseries.index.DatetimeIndex'>
    [2012-03-11 11:00:00, ..., 2012-04-01 08:00:00]
    Length: 1803601, Freq: S, Timezone: Europe/Moscow
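
    One way to convince yourself that nothing was copied or rewritten: the underlying datetime64 data, which is stored in UTC, compares equal before and after the conversion. A sketch, assuming .values exposes that underlying UTC data (as it does in recent pandas versions):

    import numpy as np
    import pandas as pd

    rng = pd.date_range('3/11/2012 03:00', '4/1/2012', freq='S', tz='US/Eastern')
    moscow = rng.tz_convert('Europe/Moscow')

    # Same invariant UTC nanoseconds underneath; only the time zone metadata
    # (and therefore the rendered wall-clock times) differs
    print(np.array_equal(rng.values, moscow.values))  # True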

    Scalar values are converted to Timestamp objects with the correct hour, minute, and second:

    In [23]: rng = date_range('3/6/2012', periods=10, tz='US/Eastern')
    In [24]: ts = Series(np.random.randn(len(rng)), rng)
    In [25]: ts
    Out[25]:
    2012-03-06 00:00:00-05:00 0.807059
    2012-03-07 00:00:00-05:00 0.938366
    2012-03-08 00:00:00-05:00 -1.262472
    2012-03-09 00:00:00-05:00 1.942384
    2012-03-10 00:00:00-05:00 -1.346362
    2012-03-11 00:00:00-05:00 -2.570099
    2012-03-12 00:00:00-04:00 -0.606283
    2012-03-13 00:00:00-04:00 0.150267
    2012-03-14 00:00:00-04:00 0.044596
    2012-03-15 00:00:00-04:00 1.274109
    Freq: D
    In [26]: ts.index[8]
    Out[26]: <Timestamp: 2012-03-14 00:00:00-0400 EDT, tz=US/Eastern>
    In [27]: ts.index[3]
    Out[27]: <Timestamp: 2012-03-09 00:00:00-0500 EST, tz=US/Eastern>
    In [28]: ts.tz_convert('utc')
    Out[28]:
    2012-03-06 05:00:00+00:00 0.807059
    2012-03-07 05:00:00+00:00 0.938366
    2012-03-08 05:00:00+00:00 -1.262472
    2012-03-09 05:00:00+00:00 1.942384
    2012-03-10 05:00:00+00:00 -1.346362
    2012-03-11 05:00:00+00:00 -2.570099
    2012-03-12 04:00:00+00:00 -0.606283
    2012-03-13 04:00:00+00:00 0.150267
    2012-03-14 04:00:00+00:00 0.044596
    2012-03-15 04:00:00+00:00 1.274109
    Freq: D

    Anyway, this is just a flavor of some of the things you can do in the almost-released version of pandas. Lots more easy-to-use and high-performance data analysis tooling to come.