I taught a class this past Monday, June 18, at General Assembly. Here are the (very brief) slides and a link to the IPython notebooks. You’ll need at least pandas 0.8.0b2, though unfortunately I identified a few bugs during the class that have since been fixed. Look out for the final release of pandas 0.8.0 any day now.
Intro to Python for Financial Data Analysis at General Assembly
Easy, high performance time zone handling in pandas 0.8.0
Making time zone handling palatable is surprisingly difficult to get right. The generally agreed-upon “best practice” for storing timestamps is to use UTC. Otherwise, you have to worry about daylight savings time ambiguities or non-existent times. The misery of time zone handling is well documented, and summarized nicely last year by Armin Ronacher. When you work in UTC, most of your troubles go away; converting a single timestamp or array of timestamps between time zones becomes in essence a “free” operation since the time zone is simply metadata for the box containing the invariant UTC timestamp.
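To illustrate the "UTC timestamp plus metadata" idea with just the standard library (a minimal sketch using fixed-offset zones as stand-ins for real tz rules; this is not pandas code):

```python
from datetime import datetime, timedelta, timezone

# Two fixed-offset "zones" (stand-ins for real, DST-aware tz rules)
utc = timezone.utc
eastern = timezone(timedelta(hours=-5))  # EST, ignoring DST for illustration

# One instant, stored in UTC
stamp = datetime(2012, 3, 11, 8, 0, tzinfo=utc)

# Converting just swaps the tzinfo metadata; the instant is unchanged
local = stamp.astimezone(eastern)
assert local == stamp    # equal because the underlying UTC instant is equal
assert local.hour == 3   # only the wall-clock presentation differs
```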
But it’s not all fun and games. The Python datetime API in this area is generally considered to be severely lacking. It’s so bad that a 77-line module with half a dozen convenience functions can get 245 watchers on GitHub. I often write that much code before I finish my first cup of coffee in the morning. But for most applications you can suffer through the API and use pytz, which is an adequate solution in most cases. pytz notably ships the Olson time zone database, which is the key piece of information powering time zone conversions.
But what about pandas? Among other things, pandas is really good for time series data, including very large time series data in the millions of observations. I don’t want to make pandas users suffer because of Python’s datetime API, so I’m happy to provide a better one (a bit more on this later). The biggest issue is: as with many non-scientific Python libraries, pytz and other tools have a fatal illness known as TMP, a.k.a. Too Much (pure) Python. Let me explain:
In [1]: dt = datetime(2009, 4, 15)

In [2]: import pytz

In [3]: tz = pytz.timezone('US/Eastern')

In [4]: tz.localize(dt)
Out[4]: datetime.datetime(2009, 4, 15, 0, 0, tzinfo=<DstTzInfo 'US/Eastern' EDT-1 day, 20:00:00 DST>)

In [5]: timeit tz.localize(dt)
10000 loops, best of 3: 33.4 us per loop
So, localizing a single datetime.datetime value takes 33 microseconds, or about 33 seconds per million timestamps. localize serves a couple of important but annoying functions: checking for ambiguous times (“fall back”) and non-existent times (“spring forward”) at DST transitions.
In [7]: tz.localize(datetime(2012, 3, 11, 2, 30), is_dst=None)
---------------------------------------------------------------------------
NonExistentTimeError                      Traceback (most recent call last)
/home/wesm/code/pandas/<ipython-input-7-e6d82ff6f746> in <module>()
----> 1 tz.localize(datetime(2012, 3, 11, 2, 30), is_dst=None)

/home/wesm/code/repos/matplotlib/lib/pytz/tzinfo.pyc in localize(self, dt, is_dst)
    320         # If we refuse to guess, raise an exception.
    321         if is_dst is None:
--> 322             raise NonExistentTimeError(dt)
    323
    324         # If we are forcing the pre-DST side of the DST transition, we

NonExistentTimeError: 2012-03-11 02:30:00
Now, one major problem that I found while examining pytz code is how many temporary datetime.datetime objects are created during a single call to tz.localize. How many do you think?
15. 15 temporary datetime.datetime objects
Don’t believe me? Look for yourself. Just following what is going on inside the function is enough to make your head hurt. The code is vastly complicated by the fact that tz-aware datetimes are not comparable with tz-naive datetimes.
Obviously, there must be a better and faster way. Some might argue that I should improve pytz, but the problem is that the implementation of time zone logic is dependent on the representation of the timestamps. Over the last few months I have stopped using datetime.datetime in pandas in favor of 64-bit integer timestamps via NumPy’s datetime64 data type. Storing large arrays of datetime.datetime values is disastrously inefficient in terms of memory and performance in all time series operations. Obviously I can’t force this design decision on most Python programmers who are not engaged in high-performance data analysis work.
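For illustration (a sketch with plain NumPy, not pandas internals): a datetime64[ns] array is just a contiguous block of 64-bit integers counting nanoseconds since the Unix epoch, which is why it is so much cheaper than an array of datetime.datetime objects:

```python
import numpy as np

# A datetime64[ns] array is backed by int64 nanosecond counts,
# not by boxed Python objects
stamps = np.array(['2012-03-11T00:00:00', '2012-03-11T01:00:00'],
                  dtype='datetime64[ns]')

nanos = stamps.view('i8')   # zero-copy reinterpretation as int64
assert nanos.dtype == np.int64
assert int(nanos[1] - nanos[0]) == 3600000000000  # one hour, in nanoseconds

# Round-trip: the integers and the timestamps are the same bytes
assert (nanos.view('datetime64[ns]') == stamps).all()
```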
Making time zone handling fast and easy
So, here are my requirements for pandas’s time zone capabilities:
pandas 0.8.0 has a new Timestamp data type which is a subclass of datetime.datetime providing nanosecond resolution support and, in my opinion, a strictly superior interface for working with dates and time:
In [3]: stamp = Timestamp('3/11/2012 04:00')

In [4]: stamp
Out[4]: <Timestamp: 2012-03-11 04:00:00>

In [5]: stamp.value  # Naive timestamp
Out[5]: 1331438400000000000
Timestamps can be created as local or converted to local using tz_localize. Conversions from one time zone to another use tz_convert:
In [6]: stamp.tz_localize('US/Eastern')
Out[6]: <Timestamp: 2012-03-11 04:00:00-0400 EDT, tz=US/Eastern>

In [7]: eastern_stamp = Timestamp('3/11/2012 04:00', tz='US/Eastern')

In [8]: eastern_stamp
Out[8]: <Timestamp: 2012-03-11 04:00:00-0400 EDT, tz=US/Eastern>

In [9]: eastern_stamp.value
Out[9]: 1331452800000000000

In [10]: eastern_stamp.tz_convert('utc')
Out[10]: <Timestamp: 2012-03-11 08:00:00+0000 UTC, tz=UTC>
Wonder what time it is right now in London (it’s 8:50 PM in New York as I type this)?
In [11]: Timestamp('now', tz='Europe/London')
Out[11]: <Timestamp: 2012-07-08 17:27:13+0100 BST, tz=Europe/London>
So that’s nice. Compared with datetime.datetime, Timestamp doesn’t get in your way as much. Timestamps are equal if and only if their UTC timestamps are equal:
In [12]: eastern_stamp == eastern_stamp.tz_convert('utc')
Out[12]: True
This makes sense, because they refer to the same moment in time. Also, adding timedeltas will do the right thing around DST transitions:
In [13]: stamp = Timestamp('3/11/2012 01:00', tz='US/Eastern')

In [14]: stamp
Out[14]: <Timestamp: 2012-03-11 01:00:00-0500 EST, tz=US/Eastern>

In [17]: stamp + timedelta(hours=3)
Out[17]: <Timestamp: 2012-03-11 05:00:00-0400 EDT, tz=US/Eastern>
OK, great. Scalar operations. I could have done all this with pytz. I’m really interested in vector operations on large time series.
In [18]: rng = date_range('3/11/2012 03:00', '4/1/2012', freq='S', tz='US/Eastern')

In [19]: rng
Out[19]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-03-11 03:00:00, ..., 2012-04-01 00:00:00]
Length: 1803601, Freq: S, Timezone: US/Eastern
Localizing all of 1.8 million timestamps (without taking advantage of the fact that this range is regular and lacks any DST transitions, which you cannot assume in the general, irregular case) would have taken about 1 full minute if we were working with pytz and datetime.datetime objects. Here it takes about 390 ms using a vectorized Cython routine of my devising:
In [20]: timeit rng = date_range('3/11/2012 03:00', '4/1/2012', freq='S', tz='US/Eastern')
1 loops, best of 3: 415 ms per loop
What’s nice about working in UTC is that time zone conversions are now nearly free and do not copy any data (the DatetimeIndex is immutable):
In [22]: rng.tz_convert('Europe/Moscow')
Out[22]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2012-03-11 11:00:00, ..., 2012-04-01 08:00:00]
Length: 1803601, Freq: S, Timezone: Europe/Moscow
Scalar values are converted to Timestamp objects with the right hour, minute, second:
In [23]: rng = date_range('3/6/2012', periods=10, tz='US/Eastern')

In [24]: ts = Series(np.random.randn(len(rng)), rng)

In [25]: ts
Out[25]:
2012-03-06 00:00:00-05:00    0.807059
2012-03-07 00:00:00-05:00    0.938366
2012-03-08 00:00:00-05:00   -1.262472
2012-03-09 00:00:00-05:00    1.942384
2012-03-10 00:00:00-05:00   -1.346362
2012-03-11 00:00:00-05:00   -2.570099
2012-03-12 00:00:00-04:00   -0.606283
2012-03-13 00:00:00-04:00    0.150267
2012-03-14 00:00:00-04:00    0.044596
2012-03-15 00:00:00-04:00    1.274109
Freq: D

In [26]: ts.index[8]
Out[26]: <Timestamp: 2012-03-14 00:00:00-0400 EDT, tz=US/Eastern>

In [27]: ts.index[3]
Out[27]: <Timestamp: 2012-03-09 00:00:00-0500 EST, tz=US/Eastern>

In [28]: ts.tz_convert('utc')
Out[28]:
2012-03-06 05:00:00+00:00    0.807059
2012-03-07 05:00:00+00:00    0.938366
2012-03-08 05:00:00+00:00   -1.262472
2012-03-09 05:00:00+00:00    1.942384
2012-03-10 05:00:00+00:00   -1.346362
2012-03-11 05:00:00+00:00   -2.570099
2012-03-12 04:00:00+00:00   -0.606283
2012-03-13 04:00:00+00:00    0.150267
2012-03-14 04:00:00+00:00    0.044596
2012-03-15 04:00:00+00:00    1.274109
Freq: D
Anyway, this is just a flavor of some of the things you can do in the almost-released version of pandas. Lots more easy-to-use and high-performance data analysis tooling to come.
Mastering high performance data algorithms I: Group By
I’m on my way back from R/Finance 2012. The organizers did a nice job with the conference, and it was great to meet everyone there.
As part of pandas development, I have had to develop a suite of high performance data algorithms and implementation strategies which are the heart and soul of the library. I often get asked why pandas’s performance is much better than that of R and other data manipulation tools. The reasons are twofold: careful implementation (in Cython and C, minimizing computational friction) and carefully thought-out algorithms.
Here are some of the more important tools and ideas that I make use of on a day-to-day basis:
Let me give you a prime example from a commit yesterday of me applying these ideas to great effect. Suppose I had a time series (or DataFrame containing time series) that I want to group by year, month, and day:
In [6]: rng = date_range('1/1/2000', periods=20, freq='4h')

In [7]: ts = Series(np.random.randn(len(rng)), index=rng)

In [8]: ts
Out[8]:
2000-01-01 00:00:00   -0.891761
2000-01-01 04:00:00    0.204853
2000-01-01 08:00:00    0.690581
2000-01-01 12:00:00    0.454010
2000-01-01 16:00:00   -0.123102
2000-01-01 20:00:00    0.300111
2000-01-02 00:00:00   -1.052215
2000-01-02 04:00:00    0.094484
2000-01-02 08:00:00    0.318417
2000-01-02 12:00:00    0.779984
2000-01-02 16:00:00   -1.514042
2000-01-02 20:00:00    2.550011
2000-01-03 00:00:00    0.983423
2000-01-03 04:00:00   -0.710861
2000-01-03 08:00:00   -1.350554
2000-01-03 12:00:00   -0.464388
2000-01-03 16:00:00    0.817372
2000-01-03 20:00:00    1.057514
2000-01-04 00:00:00    0.743033
2000-01-04 04:00:00    0.925849
Freq: 4H

In [9]: by = lambda x: lambda y: getattr(y, x)

In [10]: ts.groupby([by('year'), by('month'), by('day')]).mean()
Out[10]:
2000  1  1    0.105782
         2    0.196106
         3    0.055418
         4    0.834441
This is a small dataset, but imagine you have millions of observations and thousands or even millions of groups. How does that look algorithmically? I guarantee if you take a naive approach, you will crash and burn when the data increases in size. I know, because I did just that (take a look at the vbenchmarks). Laying down the infrastructure for doing a better job is not simple. Here are the steps for efficiently aggregating data like this:
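The general recipe (factorize each key into integer labels, combine the label arrays into a single integer group index, then aggregate in one pass) can be sketched in plain NumPy. This is an illustrative sketch, not the pandas internals; the factorize helper is an invented name:

```python
import numpy as np

def factorize(values):
    # Map each key value to an integer label 0..K-1, plus the unique keys
    uniques, labels = np.unique(values, return_inverse=True)
    return labels, uniques

# Two grouping keys
key1 = np.array(['a', 'a', 'b', 'b', 'a'])
key2 = np.array([1, 2, 1, 1, 1])

labels1, uniq1 = factorize(key1)
labels2, uniq2 = factorize(key2)

# Combine the label arrays into one "group index": a single integer
# per row identifying its (key1, key2) group, like mixed-radix digits
group_index = labels1 * len(uniq2) + labels2

# Aggregate (here: sum) in one pass over the data with bincount
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sums = np.bincount(group_index, weights=data,
                   minlength=len(uniq1) * len(uniq2))

# Group ('a', 1) is rows 0 and 4 -> 1.0 + 5.0 == 6.0
assert sums[0] == 6.0
```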
I worked on speeding up the latter part of this last bullet point yesterday. The resulting code looked like this:
def _get_indices_dict(label_list, keys):
    # Accepts factorized labels and unique key values
    shape = [len(x) for x in keys]

    # Compute group index
    group_index = get_group_index(label_list, shape)

    sorter, _ = lib.groupsort_indexer(com._ensure_int64(group_index),
                                      np.prod(shape))

    # Reorder key labels and group index
    sorted_labels = [lab.take(sorter) for lab in label_list]
    group_index = group_index.take(sorter)

    # Compute dict of {group tuple -> location NumPy array for group}
    index = np.arange(len(group_index)).take(sorter)
    return lib.indices_fast(index, group_index, keys, sorted_labels)
The details of lib.indices_fast aren’t that interesting; it chops up np.arange(len(group_index)).take(sorter), the sorted indirect indices, to produce the index dictionary. Running %lprun to get a line profiling on a large-ish data set:
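The chopping itself can be approximated in pure NumPy: argsort the group index, then split the sorted row positions wherever the group id changes. A slower, illustrative sketch (indices_dict is an invented name, not the Cython routine):

```python
import numpy as np

def indices_dict(group_index):
    # Sort row positions by group, then slice out each group's run
    sorter = group_index.argsort(kind='mergesort')  # stable sort
    sorted_gi = group_index.take(sorter)

    # Boundaries where the group id changes
    bounds = np.flatnonzero(np.diff(sorted_gi)) + 1
    chunks = np.split(sorter, bounds)
    return {sorted_gi[start]: chunk
            for start, chunk in zip(np.r_[0, bounds], chunks)}

gi = np.array([2, 0, 2, 1, 0])
result = indices_dict(gi)
assert list(result[0]) == [1, 4]   # rows where group id == 0
assert list(result[2]) == [0, 2]   # rows where group id == 2
```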
In [11]: rng = date_range('1/1/2000', '12/31/2005', freq='H')

In [12]: year, month, day = rng.year, rng.month, rng.day

In [13]: ts = Series(np.random.randn(len(rng)), index=rng)

In [14]: lprun -f gp._get_indices_dict for i in xrange(100): ts.groupby([year, month, day]).indices

Timer unit: 1e-06 s

File: pandas/core/groupby.py
Function: _get_indices_dict at line 1975
Total time: 0.628506 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
  1975                                           def _get_indices_dict(label_list, keys):
  1976       400          695      1.7      0.1      shape = [len(x) for x in keys]
  1977       100       114388   1143.9     18.2      group_index = get_group_index(label_list, shape)
  1978
  1979       100          320      3.2      0.1      sorter, _ = lib.groupsort_indexer(com._ensure_int64(group_index),
  1980       100        62007    620.1      9.9                                        np.prod(shape))
  1981
  1982       400        53530    133.8      8.5      sorted_labels = [lab.take(sorter) for lab in label_list]
  1983       100        19516    195.2      3.1      group_index = group_index.take(sorter)
  1984       100        20189    201.9      3.2      index = np.arange(len(group_index)).take(sorter)
  1985
  1986       100       357861   3578.6     56.9      return lib.indices_fast(index, group_index, keys, sorted_labels)
You might say, well, this seems like a lot of work; maybe we should just zip the keys (forming an array of Python tuples) and use a dumb algorithm? By being careful in this way and working with indirect integer index arrays, the result ends up something like an order of magnitude or more faster.
Anyway, in conclusion, it’s these kinds of algorithms and ideas that make pandas perhaps the best-performing open-source data analysis toolkit for in-memory data (I’m going to get to out-of-core data processing and “big data” eventually, just hang tight). It goes beyond language features and data structure internals, though those naturally also have a major impact; a lot of the things I do are easy to express in Python but would be very awkward or impossible to do in, say, R. Maybe I should write a whole article on this. Carefully thought-out algorithms are a major piece of the puzzle.
An O(n log n) NA-friendly time series “as of” using array operations
In time series data, it’s fairly common to need to compute the last known value “as of” a particular date. However, missing data is the norm, so it’s a touch more complicated than doing a simple binary search. Here is an implementation using array operations that takes these things into account:
def asof_locs(stamps, where, mask):
    """
    Parameters
    ----------
    stamps : array of timestamps
    where : array of timestamps
        Values to determine the "as of" for
    mask : array of booleans where data is not NA

    Returns
    -------
    locs : array of ints
        Locations of each "as of" value, -1 for NA
    """
    locs = stamps[mask].searchsorted(where, side='right')
    locs = np.where(locs > 0, locs - 1, 0)
    result = np.arange(len(stamps))[mask].take(locs)

    first = mask.argmax()
    result[(locs == 0) & (where < stamps[first])] = -1

    return result
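Here is a toy run of this logic, using small integers in place of timestamps (an illustrative sketch; note the np.where fill value must be 0, not -1, so that the final step can flag queries earlier than the first valid stamp):

```python
import numpy as np

def asof_locs(stamps, where, mask):
    # Same logic as above, with integer "stamps" for brevity
    locs = stamps[mask].searchsorted(where, side='right')
    locs = np.where(locs > 0, locs - 1, 0)
    result = np.arange(len(stamps))[mask].take(locs)
    first = mask.argmax()
    result[(locs == 0) & (where < stamps[first])] = -1
    return result

stamps = np.array([10, 20, 30, 40, 50])
mask = np.array([True, False, True, True, True])  # stamps[1] is NA

# As of t=25, the last non-NA observation is at position 0 (20 is NA);
# as of t=35 it is position 2; as of t=5 there is no prior value -> -1
assert list(asof_locs(stamps, np.array([25, 35, 5]), mask)) == [0, 2, -1]
```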
Some algorithmic notes. First, let’s run this through line_profiler on a large time series to see where time is being taken up:
In [9]: rng = date_range('1/1/2000', '6/1/2000', freq='10s')

In [10]: s = Series(np.random.randn(len(rng)), index=rng)

In [11]: rng
Out[11]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2000-01-01 00:00:00, ..., 2000-06-01 00:00:00]
Length: 1313281, Freq: 10S, Timezone: None

In [12]: s = s.resample('s')

In [13]: len(s)
Out[13]: 13132801

In [24]: mask = notnull(s)

In [25]: lprun -f asof_locs asof_locs(s.index, s.index[5::5], mask)

Timer unit: 1e-06 s

File: foo.py
Function: asof_locs at line 3
Total time: 0.793543 s

Line #  % Time  Line Contents
==============================================================
     3          def asof_locs(stamps, where, mask):
     4              """
     5              Parameters
     6              ----------
     7              stamps : array of timestamps
     8              where : array of timestamps
     9                  Values to determine the "as of" for
    10              mask : array of booleans where data is not NA
    11
    12              Returns
    13              -------
    14              locs : array of ints
    15                  Locations of each "as of" value, -1 for NA
    16              """
    17    63.3      locs = stamps[mask].searchsorted(where, side='right')
    18     8.3      locs = np.where(locs > 0, locs - 1, 0)
    19
    20    23.3      result = np.arange(len(stamps))[mask].take(locs)
    21
    22     1.6      first = mask.argmax()
    23     3.6      result[(locs == 0) & (where < stamps[first])] = -1
    24
    25     0.0      return result
The main trickiness here is this step:
result = np.arange(len(stamps))[mask].take(locs)
Since the indices returned by searchsorted are relative to the filtered values, you need to remap them to what would have been the indices in the original array, and this does exactly that. Lastly, you might be interested in breaking up stamps[mask].searchsorted(where, side='right') to see how much time is spent in stamps[mask] versus searchsorted:
In [28]: timeit stamps[mask]
10 loops, best of 3: 127 ms per loop

In [29]: filt = stamps[mask]

In [31]: timeit filt.searchsorted(where)
1 loops, best of 3: 345 ms per loop
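To make the index-remapping step concrete, here it is in miniature with illustrative toy data (integers standing in for timestamps):

```python
import numpy as np

stamps = np.array([10, 20, 30, 40])
mask = np.array([True, False, True, True])

# searchsorted on the filtered array gives a position among the
# *non-NA* values only...
filt = stamps[mask]                      # [10, 30, 40]
loc_in_filtered = filt.searchsorted(30)  # position 1 in the filtered array

# ...so map it back to a position in the original array:
original_positions = np.arange(len(stamps))[mask]   # [0, 2, 3]
assert original_positions[loc_in_filtered] == 2     # 30 lives at index 2
```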
So, this could definitely be sped up by a factor of 2 or more by doing the whole operation in C or Cython, but it’s nice to be able to express it concisely, and still run fast, with array operations.
Why I’m not on the Julia bandwagon (yet)
EDIT (2013-07-11): The following article may no longer be accurate, since more than a year has passed. I recommend running your own benchmarks on array expressions with the latest Julia builds and drawing your own conclusions.
I apologize in advance for writing what might be the first neutral to negative article about Julia that you’ve read. For those of you who don’t know, Julia is a new dynamic language with an LLVM-based JIT compiler designed for technical computing. Its intent is to try to bridge the gap between high level languages and low level ones in terms of giving you the power, flexibility, and general-purpose usefulness of a Lisp or Python-like language but with low-level performance. It has MATLAB-like syntax but has a lot of the features of Lisp (macros, metaprogramming) and Python that we’ve all come to love. Before anything else, I’d like to say that I’m very impressed with the project, the team, and overall their technology choices.
Many people are very excited about Julia, and one of the main reasons is because of a number of very impressive benchmarks. This is great and I’m also impressed with the benchmarks, but they are in my opinion fairly misleading and don’t represent the reality of a lot of technical computing.
I compiled Julia in advance of Stefan Karpinski’s (one of the creators) talk in New York this week. I ran a few benchmarks and then proceeded to raise hell on the Julia mailing list.
TL;DR: if you’re an R, MATLAB, or scientific Python user, it’s probably not time to start programming in Julia if the highest performance code is your goal for minimum effort. It may be in a couple of years when the JIT-compiler improves. However, Julia may soon be a reasonable free computational alternative to MATLAB. I say this even in a perfect world where Julia had all the libraries that Python and R have; as far as a sole language for building large scientific production applications in industry right now, I think Python + Cython are pretty much the only game in town.
Why am I complaining about the benchmarks? Well, one of the first things I did was look at the performance of array operations. In scientific and technical computing, we work a lot with arrays, right? In fact, conventional wisdom is “don’t write loops”. Many people have told me that I should be unrolling array operations into loops, and my experience is that this is the way to get the best performance in Julia. Keep in mind that I am a performance-obsessed tool maker, and in libraries like pandas I am in control of the vast majority of the areas of people’s applications that would otherwise be bottlenecks.
Here’s a simple benchmark: compute the inner product of two arrays, in other words, the sum of the element-wise product. Julia code (I disabled the garbage collector to remove its effect from the equation):
x = randn(10000000)
y = randn(10000000)

function inner(x, y)
    result = 0.0
    for i=1:length(x)
        result += x[i] * y[i]
    end
    result
end

gc_disable()

tic()
for i=1:50
    result = sum(x .* y);
end
timing = toc();
println("Array operations took ", (timing / 50) * 1000, " ms")

tic()
timing = @time for i=1:50 result = inner(x, y); end
timing = toc();
println("inner took ", (timing / 50) * 1000, " ms")
And the result:
11:28 ~/code/repos/julia (master)$ ./julia bench.jl
Array operations took 104.71534252166748 ms
inner took 36.00101947784424 ms
Lastly, you might recognize that an array inner product can be expressed as a matrix operation and carried out by BLAS in Fortran:
x = randn(1, 10000000)
y = randn(10000000, 1)

tic()
timing = @time for i=1:50 result = x * y; end
timing = toc();
println("BLAS took ", (timing / 50) * 1000, " ms")

=> BLAS took 50.72967529296875 ms
Most of you know me as a Python hacker, so I’ll run the same array benchmark with NumPy:
In [9]: x = randn(10000000)

In [10]: y = randn(10000000)

In [11]: timeit (x * y).sum()
10 loops, best of 3: 57.8 ms per loop
OK, so NumPy is faster than Julia for array operations, temporary array and all. But how much faster can you go by eliminating temporaries like Julia’s inner function? Here is Cython code:
def inner(ndarray[float64_t] x, ndarray[float64_t] y):
    cdef Py_ssize_t i, n = len(x)
    cdef float64_t result = 0
    for i in range(n):
        result += x[i] * y[i]
    return result
And the timing:
In [1]: x = randn(10000000)

In [2]: y = randn(10000000)

In [3]: from pandas._sandbox import inner

In [4]: inner(x, y)
Out[4]: -3117.5608515640943

In [5]: timeit inner(x, y)
100 loops, best of 3: 14.5 ms per loop
So Julia is about 2.5 times slower than Cython in this trivial benchmark (straight C wouldn’t be much faster). You can do even better by using the Intel MKL through NumPy:
In [7]: timeit np.dot(x, y)
100 loops, best of 3: 9.43 ms per loop
So anyway, Julia is not a magic solution. It _is_ pretty good though, especially if you never want to have to compile code. Having to unroll array expressions into loops is still not good, and I don’t think it’s a habit that should be encouraged. And the JIT will get better over time. The Julia devs are, I believe, spending their time more fruitfully on language design, libraries and features, and “big science” topics like distributed computing and parallel processing, which I think will be very impactful. But if you’re looking for a high performance alternative to R for implementing scientific algorithms with low effort, the sweet spot (at least in my opinion) between performance and fast development time is Cython. But if a perfect marriage of Python, Julia, and Cython is made, perhaps we can all have our cake and eat it too.
The need for an embedded array expression compiler for NumPy
Yesterday I found myself adding some additional Cython methods for doing fast grouped aggregations in pandas. To my disappointment, I found myself duplicating a lot of code and not having much alternative beyond cooking up some kind of ad hoc code generation framework. So, here’s the problem: an array of data (possibly with NAs) and an array of labels, with the number of distinct labels known (ranging from 0 to N - 1). Aggregate the data by label using some function in {sum, min, max, product, mean, variance, …}. Here is a Cython function to add all the values in a group:
def group_add(ndarray[float64_t] out, ndarray[int32_t] counts,
              ndarray[float64_t] values, ndarray[int32_t] labels):
    cdef:
        Py_ssize_t i, N, lab
        float64_t val, count
        ndarray[float64_t] sumx, nobs

    nobs = np.zeros_like(out)
    sumx = np.zeros_like(out)

    N = len(values)

    for i in range(N):
        lab = labels[i]
        if lab < 0:
            continue

        counts[lab] += 1
        val = values[i]

        # not NA
        if val == val:
            nobs[lab] += 1
            sumx[lab] += val

    for i in range(len(counts)):
        if nobs[i] == 0:
            out[i] = nan
        else:
            out[i] = sumx[i]
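For reference, the same grouped sum can be prototyped in pure NumPy with np.bincount. This is much slower than the Cython version on large data, but it is a handy way to check results (group_add_numpy is an invented name, not a pandas function):

```python
import numpy as np

def group_add_numpy(values, labels, ngroups):
    # Mirror the Cython logic: skip labels < 0 and NA values (val != val)
    valid = (labels >= 0) & ~np.isnan(values)
    lab, val = labels[valid], values[valid]

    sumx = np.bincount(lab, weights=val, minlength=ngroups)
    nobs = np.bincount(lab, minlength=ngroups)

    # Groups with no valid observations get NaN, as in group_add
    return np.where(nobs > 0, sumx, np.nan)

values = np.array([1.0, np.nan, 3.0, 4.0])
labels = np.array([0, 0, 1, -1])
out = group_add_numpy(values, labels, 3)
assert out[0] == 1.0 and out[1] == 3.0 and np.isnan(out[2])
```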
Now suppose we want to take the product of the values:
def group_prod(ndarray[float64_t] out, ndarray[int32_t] counts,
               ndarray[float64_t] values, ndarray[int32_t] labels):
    cdef:
        Py_ssize_t i, N, lab
        float64_t val, count
        ndarray[float64_t] prodx, nobs

    nobs = np.zeros_like(out)
    prodx = np.ones_like(out)

    N = len(values)

    for i in range(N):
        lab = labels[i]
        if lab < 0:
            continue

        counts[lab] += 1
        val = values[i]

        # not nan
        if val == val:
            nobs[lab] += 1
            prodx[lab] *= val

    for i in range(len(counts)):
        if nobs[i] == 0:
            out[i] = nan
        else:
            out[i] = prodx[i]
However, the amount of code duplication here is kind of ridiculous; the only changes are a different reduction function and different starting values. Ideally I would like to write code that looks like this:
def group_reduce(how):
    def reducer(data, labels, out):
        filtered = notna(data)
        grouped = groupby(filtered, labels)
        return reduce(how, grouped, out=out)
    return reducer

group_add = group_reduce(add)
group_prod = group_reduce(multiply)
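The semantics of this sketch can actually be prototyped today in pure NumPy, assuming `how` is a binary NumPy ufunc and NaN is the only NA marker, by sorting on the labels and using ufunc.reduceat. This is only an illustration of the desired API, not generated C:

```python
import numpy as np

def group_reduce(how):
    # 'how' is a binary NumPy ufunc (np.add, np.multiply, ...)
    def reducer(data, labels, ngroups):
        # Drop NA values, then sort the remaining rows by group label
        valid = ~np.isnan(data)
        data, labels = data[valid], labels[valid]
        order = labels.argsort(kind='mergesort')
        data, labels = data[order], labels[order]

        # Reduce each contiguous run of equal labels in one call
        out = np.full(ngroups, np.nan)   # empty groups stay NaN
        uniq, starts = np.unique(labels, return_index=True)
        out[uniq] = how.reduceat(data, starts)
        return out
    return reducer

group_add = group_reduce(np.add)
group_prod = group_reduce(np.multiply)

data = np.array([1.0, 2.0, np.nan, 3.0])
labels = np.array([0, 0, 1, 1])
assert list(group_add(data, labels, 2)) == [3.0, 3.0]
assert list(group_prod(data, labels, 2)) == [2.0, 3.0]
```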
IMHO there is plenty of information here to generate C or LLVM code that is at least as fast as the above code. However, this isn’t completely extensible; suppose we wanted to compute the group variance. In that case what you actually need is:
nobs = reduce(count, grouped)
X = reduce(add, grouped)
XX = reduce(add, grouped * grouped)
variance = (nobs * XX - X**2) / (nobs * (nobs - 1))
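This sufficient-statistics formula can be checked numerically: computing nobs, X, and XX per group with np.bincount and plugging them in agrees with np.var(ddof=1) on each group (an illustrative sketch, not pandas internals):

```python
import numpy as np

data = np.array([1.0, 2.0, 4.0, 8.0, 9.0])
labels = np.array([0, 0, 0, 1, 1])
ngroups = 2

# Grouped sufficient statistics: counts, sums, and sums of squares
nobs = np.bincount(labels, minlength=ngroups)
X = np.bincount(labels, weights=data, minlength=ngroups)
XX = np.bincount(labels, weights=data * data, minlength=ngroups)

# Sample variance from the sufficient statistics, as in the sketch above
variance = (nobs * XX - X ** 2) / (nobs * (nobs - 1))

for g in range(ngroups):
    assert np.isclose(variance[g], np.var(data[labels == g], ddof=1))
```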
The only thing you care about here is the variance result. Ideally, you want the generated code to:
Under this framework, something like mean becomes even more trivial:
grouped = groupby(notna(data), labels)
nobs = reduce(count, grouped)
X = reduce(add, grouped)
mean = X / nobs
Now, I can live with creating a C/Cython code generation framework, enabling me to write code like the above and get code that’s at least as good as something hand-written. But it would be best, of course, if the translation / machine code generation were dynamic, removing the compile step. Is there an existing toolchain that can help with the machine code generation? Perhaps Numba is the place to do this with LLVM.
Perhaps the eventual result of something like this is an embedded JIT-ing APL-like language with Python syntax running inside CPython. You would want to make Numexpr (which is an awesome project!) effectively obsolete while you’re at it. In the meantime, I’ll keep thinking about the best way forward.
vbench Lightning Talk Slides from PyCon 2012
I prepared this lightning talk for PyCon on vbench, my performance monitoring tool, but didn’t have a chance to give it. If I get energetic I might make a screencast of it, but here are the slides:
Even easier frequency tables in pandas 0.7.0
I put in a little work on a new crosstab function in the main pandas namespace. It’s basically a convenient shortcut to calling pivot_table, making it easy to compute cross-tabulations for a set of factors using pandas DataFrames or even vanilla NumPy arrays!
Here’s an example:
In [1]: from pandas import *

In [2]: df = DataFrame({'a' : np.random.randint(0, 2, size=20),
   ...:                 'b' : np.random.randint(0, 5, size=20),
   ...:                 'c' : np.random.randint(0, 3, size=20)})

In [3]: df
Out[3]:
    a  b  c
0   0  1  2
1   1  4  1
2   1  4  2
3   1  2  2
4   0  0  0
5   0  0  2
6   1  2  2
7   1  2  0
8   1  1  1
9   1  2  0
10  1  4  1
11  1  1  2
12  0  1  2
13  1  1  0
14  1  4  1
15  1  4  0
16  0  4  1
17  0  3  1
18  1  0  1
19  0  4  1

In [4]: crosstab(df['b'], [df['a'], df['c']])
Out[4]:
a  0        1
c  0  1  2  0  1  2
b
0  1  0  1  0  1  0
1  0  0  2  1  1  2
2  0  0  0  2  0  2
3  0  1  0  0  0  0
4  0  2  0  1  3  1
This makes it very easy to produce an easy-on-the-eyes frequency table. crosstab can also take NumPy arrays. Suppose we had 1 million draws from a normal distribution, and we wish to produce a histogram-like table showing the number of draws whose absolute values fall into the bins defined by [0, 1, 2, 3]. Also, let’s divide things up by sign. Using crosstab this becomes dead simple.
In [19]: arr = np.random.randn(1000000)

In [20]: buckets = [0, 1, 2, 3]

In [22]: crosstab(np.digitize(np.abs(arr), buckets), np.sign(arr),
   ....:          rownames=['bucket'], colnames=['sign'])
Out[22]:
sign      -1.0     1.0
bucket
1       341678  340498
2       136104  135999
3        21424   21607
4         1334    1356
Of course since the result is a DataFrame you can customize the indexes:
In [28]: table = crosstab(np.digitize(np.abs(arr), buckets),
   ....:                  np.sign(arr), rownames=['bucket'],
   ....:                  colnames=['sign'])

In [29]: table.index = Index(['[0, 1)', '[1, 2)', '[2, 3)',
   ....:                      '[3, inf)'], name="bin")

In [30]: table.columns = Index(['negative', 'positive'], name="sign")

In [31]: table
Out[31]:
sign      negative  positive
bin
[0, 1)      341678    340498
[1, 2)      136104    135999
[2, 3)       21424     21607
[3, inf)      1334      1356
Contingency tables and cross-tabulations in pandas
Someone recently asked me about creating cross-tabulations and contingency tables using pandas. I wrote a bit about this in October after implementing the pivot_table function for DataFrame. So I thought I would give a few more examples and show R code vs. the equivalent pandas code which will be helpful for those already used to the R way of doing things.
When calling help(xtabs) (or help(table)) in R, you get a number of examples for using table, xtabs, and ftable:
> head(esoph)
  agegp     alcgp    tobgp ncases ncontrols
1 25-34 0-39g/day 0-9g/day      0        40
2 25-34 0-39g/day    10-19      0        10
3 25-34 0-39g/day    20-29      0         6
4 25-34 0-39g/day      30+      0         5
5 25-34     40-79 0-9g/day      0        27
6 25-34     40-79    10-19      0         7

> ftable(xtabs(cbind(ncases, ncontrols) ~ agegp + tobgp, data = esoph))
               ncases ncontrols
agegp tobgp
25-34 0-9g/day      0        70
      10-19         1        19
      20-29         0        11
      30+           0        16
35-44 0-9g/day      2       109
      10-19         4        46
      20-29         3        27
      30+           0        17
45-54 0-9g/day     14       104
      10-19        13        57
      20-29         8        33
      30+          11        19
55-64 0-9g/day     25       117
      10-19        23        65
      20-29        12        38
      30+          16        22
65-74 0-9g/day     31        99
      10-19        12        38
      20-29        10        20
      30+           2         4
75+   0-9g/day      6        26
      10-19         5        11
      20-29         0         3
      30+           2         4

> ftable(xtabs(cbind(ncases, ncontrols) ~ agegp, data = esoph))
      ncases ncontrols
agegp
25-34      1       116
35-44      9       199
45-54     46       213
55-64     76       242
65-74     55       161
75+       13        44
Here are the same examples using pandas:
In [19]: table = esoph.pivot_table(['ncases', 'ncontrols'],
   ....:                           rows=['agegp', 'tobgp'], aggfunc=np.sum)

In [20]: table
Out[20]:
                ncases  ncontrols
agegp tobgp
25-34 0-9g/day       0         70
      10-19          1         19
      20-29          0         11
      30+            0         16
35-44 0-9g/day       2        109
      10-19          4         46
      20-29          3         27
      30+            0         17
45-54 0-9g/day      14        104
      10-19         13         57
      20-29          8         33
      30+           11         19
55-64 0-9g/day      25        117
      10-19         23         65
      20-29         12         38
      30+           16         22
65-74 0-9g/day      31         99
      10-19         12         38
      20-29         10         20
      30+            2          4
75+   0-9g/day       6         26
      10-19          5         11
      20-29          0          3
      30+            2          4

In [22]: table2 = esoph.pivot_table(['ncases', 'ncontrols'],
   ....:                            rows='agegp', aggfunc=np.sum)

In [23]: table2
Out[23]:
       ncases  ncontrols
agegp
25-34       1        116
35-44       9        199
45-54      46        213
55-64      76        242
65-74      55        161
75+        13         44
One of the really great things about pandas is that the object produced is still a DataFrame (the R object is of a special class)! So we could do anything with it that we do with normal DataFrame objects:
In [26]: table['ncases'].unstack('agegp')
Out[26]:
agegp     25-34  35-44  45-54  55-64  65-74  75+
tobgp
0-9g/day      0      2     14     25     31    6
10-19         1      4     13     23     12    5
20-29         0      3      8     12     10    0
30+           0      0     11     16      2    2
Here is another fun example:
In [31]: import pandas.rpy.common as com

In [32]: wp = com.load_data('warpbreaks')

In [33]: wp['replicate'] = np.tile(range(1, 10), 6)

In [34]: wp.pivot_table('breaks', rows=['wool', 'tension'],
   ....:                cols='replicate', aggfunc=np.sum)
Out[34]:
replicate      1   2   3   4   5   6   7   8   9
wool tension
A    H        36  21  24  18  10  43  28  15  26
     L        26  30  54  25  70  52  51  26  67
     M        18  21  29  17  12  18  35  30  36
B    H        20  21  24  17  13  15  15  16  28
     L        27  14  29  19  29  31  41  20  44
     M        42  26  19  16  39  28  21  39  29
Here is the equivalent R code:
> warpbreaks$replicate <- rep(1:9, len = 54)

> ftable(xtabs(breaks ~ wool + tension + replicate, data = warpbreaks))
             replicate  1  2  3  4  5  6  7  8  9
wool tension
A    L                 26 30 54 25 70 52 51 26 67
     M                 18 21 29 17 12 18 35 30 36
     H                 36 21 24 18 10 43 28 15 26
B    L                 27 14 29 19 29 31 41 20 44
     M                 42 26 19 16 39 28 21 39 29
     H                 20 21 24 17 13 15 15 16 28
As soon as we add a formula framework to statsmodels I could see incorporating that into this kind of function in Python.
NYCPython 1/10/2012: A look inside pandas design and development
I had the privilege of speaking last night at the NYCPython meetup group. I’ve given tons of “use pandas!” talks so I thought I would take a slightly different angle and talk about some of the design and implementation work that I’ve done for getting good performance in critical data manipulations. I’ll turn some of this material into some blog articles in the near future.
Wes McKinney: pandas design and development from Adam Klein on Vimeo.
Here’s some more video footage shot by my awesome friend Emily Paup with her HDSLR.
I did a little interactive demo (using the ever-amazing IPython HTML Notebook) on Ashley Williams’s Food Nutrient JSON Database:
If you want to run the code in the IPython notebook, you’ll have to download the food database file above.
The audience and I learned from this demo that if you’re after Tryptophan, “Sea lion, Steller, meat with fat (Alaska Native)” is where to get it (in the highest density).