Making time zone handling palatable is surprisingly difficult to get right. The generally agreed-upon "best practice" for storing timestamps is to use UTC. Otherwise, you have to worry about daylight savings time ambiguities or non-existent times. The misery of time zone handling is well documented, and summarized nicely last year by Armin Ronacher. When you work in UTC, most of your troubles go away; converting a single timestamp or array of timestamps between time zones becomes in essence a "free" operation since the time zone is simply metadata for the box containing the invariant UTC timestamp.
But it's not all fun and games. The Python datetime API in this area is generally considered to be severely lacking. But, for most applications you can suffer through the API and use pytz, which is an adequate solution in most cases. pytz notably ships the Olson timezone database which is the key piece of information for powering time zone conversions.
But what about pandas? Among other things, pandas is really good for time series data, including very large time series data in the millions of observations. I don't want to make pandas users suffer because of Python's datetime API, so I'm happy to provide a better one (a bit more on this later). The biggest issue is: as with many non-scientific Python libraries, pytz and other tools have a fatal illness known as TMP, a.k.a. Too Much (pure) Python. Let me explain:
In : dt = datetime(2009, 4, 15) In : import pytz In : tz = pytz.timezone('US/Eastern') In : tz.localize(dt) Out: datetime.datetime(2009, 4, 15, 0, 0, tzinfo=<DstTzInfo 'US/Eastern' EDT-1 day, 20:00:00 DST>) In : timeit tz.localize(dt) 10000 loops, best of 3: 33.4 us per loop
So, localizing a single
datetime.datetime value takes 33 microseconds, or ~33 seconds per million timestamps. Localize serves a couple of important, but annoying functions: checking for ambiguities ("fall back") and non-existent times ("spring forward") at DST transition times.
In : tz.localize(datetime(2012, 3, 11, 2, 30), is_dst=None) --------------------------------------------------------------------------- NonExistentTimeError Traceback (most recent call last) /home/wesm/code/pandas/<ipython-input-7-e6d82ff6f746> in <module>() ----> 1 tz.localize(datetime(2012, 3, 11, 2, 30), is_dst=None) /home/wesm/code/repos/matplotlib/lib/pytz/tzinfo.pyc in localize(self, dt, is_dst) 320 # If we refuse to guess, raise an exception. 321 if is_dst is None: --> 322 raise NonExistentTimeError(dt) 323 324 # If we are forcing the pre-DST side of the DST transition, we NonExistentTimeError: 2012-03-11 02:30:00
Now, one major problem that I found while examining pytz code is how many temporary datetime.datetime objects are created during a single call to
tz.localize. How many do you think?
15. 15 temporary
Don't believe me? Look for yourself. Just following what is going on inside the function is enough to make your head hurt. The code is vastly complicated by the fact that tz-aware datetimes are not comparable with tz-naive datetimes.
Obviously, there must be a better and faster way. Some might argue that I should improve pytz, but the problem is that the implementation of time zone logic is dependent on the representation of the timestamps. Over the last few months I have stopped using
datetime.datetime in pandas in favor of 64-bit integer timestamps via NumPy's
datetime64 data type. Storing large arrays of datetime.datetime values is disastrously inefficient in terms of memory and performance in all time series operations. Obviously I can't force this design decision on most Python programmers who are not engaged in high-performance data analysis work.
Making time zone handling fast and easy
So, here are my requirements for pandas's time zone capabilities:
- All operations must be vectorized and be as fast as possible on large arrays of irregular, not necessarily ordered 64-bit timestamps
- API must be as simple and non-crappy as possible without sacrificing functionality.
pandas 0.8.0 has a new
Timestampdata type which is a subclass of datetime.datetime providing nanosecond resolution support and, in my opinion, a strictly superior interface for working with dates and time:
In : stamp = Timestamp('3/11/2012 04:00') In : stamp Out: <Timestamp: 2012-03-11 04:00:00> In : stamp.value # Naive timestamp Out: 1331438400000000000
Timestamps can be created as local or converted to local using
tz_localize. Conversions from one time zone to another use
In : stamp.tz_localize('US/Eastern') Out: <Timestamp: 2012-03-11 04:00:00-0400 EDT, tz=US/Eastern> In : eastern_stamp = Timestamp('3/11/2012 04:00', tz='US/Eastern') In : eastern_stamp Out: <Timestamp: 2012-03-11 04:00:00-0400 EDT, tz=US/Eastern> In : eastern_stamp.value Out: 1331452800000000000 In : eastern_stamp.tz_convert('utc') Out: <Timestamp: 2012-03-11 08:00:00+0000 UTC, tz=UTC>
Wonder what time it is right now in London (it's 8:50 PM in New York as I type this)?
In : Timestamp('now', tz='Europe/London') Out: <Timestamp: 2012-07-08 17:27:13+0100 BST, tz=Europe/London>
So that's nice. Compared with datetime.datetime, Timestamp doesn't get in your way as much. Timestamps are equal if and only if their UTC timestamps are equal:
In : eastern_stamp == eastern_stamp.tz_convert('utc') Out: True
This makes sense, because they refer to the same moment in time. Also, adding timedeltas will do the right thing around DST transitions:
In : stamp = Timestamp('3/11/2012 01:00', tz='US/Eastern') In : stamp Out: <Timestamp: 2012-03-11 01:00:00-0500 EST, tz=US/Eastern> In : stamp + timedelta(hours=3) Out: <Timestamp: 2012-03-11 05:00:00-0400 EDT, tz=US/Eastern>
OK, great. Scalar operations. I could have done all this with pytz. I'm really interested in vector operations on large time series.
In : rng = date_range('3/11/2012 03:00', '4/1/2012', freq='S', tz='US/Eastern') In : rng Out: <class 'pandas.tseries.index.DatetimeIndex'> [2012-03-11 03:00:00, ..., 2012-04-01 00:00:00] Length: 1803601, Freq: S, Timezone: US/Eastern
Localizing all of 1.8 million timestamps (without taking advantage of the fact that this range is regular and lacks any DST transitions, which you cannot assume in the general, irregular case) would have taken about 1 full minute if we were working with pytz and datetime.datetime objects. Here it takes about 390 ms using a vectorized Cython routine of my devising:
In : timeit rng = date_range('3/11/2012 03:00', '4/1/2012', freq='S', tz='US/Eastern') 1 loops, best of 3: 415 ms per loop
What's nice about working in UTC is that time zone conversions are now nearly free and do not copy any data (the DatetimeIndex is immutable):
In : rng.tz_convert('Europe/Moscow') Out: <class 'pandas.tseries.index.DatetimeIndex'> [2012-03-11 11:00:00, ..., 2012-04-01 08:00:00] Length: 1803601, Freq: S, Timezone: Europe/Moscow
Scalar values are converted to Timestamp objects with the right hour, minute, second:
In : rng = date_range('3/6/2012', periods=10, tz='US/Eastern') In : ts = Series(np.random.randn(len(rng)), rng) In : ts Out: 2012-03-06 00:00:00-05:00 0.807059 2012-03-07 00:00:00-05:00 0.938366 2012-03-08 00:00:00-05:00 -1.262472 2012-03-09 00:00:00-05:00 1.942384 2012-03-10 00:00:00-05:00 -1.346362 2012-03-11 00:00:00-05:00 -2.570099 2012-03-12 00:00:00-04:00 -0.606283 2012-03-13 00:00:00-04:00 0.150267 2012-03-14 00:00:00-04:00 0.044596 2012-03-15 00:00:00-04:00 1.274109 Freq: D In : ts.index Out: <Timestamp: 2012-03-14 00:00:00-0400 EDT, tz=US/Eastern> In : ts.index Out: <Timestamp: 2012-03-09 00:00:00-0500 EST, tz=US/Eastern> In : ts.tz_convert('utc') Out: 2012-03-06 05:00:00+00:00 0.807059 2012-03-07 05:00:00+00:00 0.938366 2012-03-08 05:00:00+00:00 -1.262472 2012-03-09 05:00:00+00:00 1.942384 2012-03-10 05:00:00+00:00 -1.346362 2012-03-11 05:00:00+00:00 -2.570099 2012-03-12 04:00:00+00:00 -0.606283 2012-03-13 04:00:00+00:00 0.150267 2012-03-14 04:00:00+00:00 0.044596 2012-03-15 04:00:00+00:00 1.274109 Freq: D
Anyway, this is just a flavor of some of the things you can do in the almost-released version of pandas. Lots more easy-to-use and high-performance data analysis tooling to come.