#### September 2011

> generic pandas data alignment is about 10-15x faster than the #rstats zoo package in initial tests. interesting #python
>
> -- Wes McKinney (@wesmckinn)

I tweeted that yesterday and figured it would be prudent to justify that with some code and real benchmarks. I’m really proud of pandas’s performance after investing years of development building a tool that is both easy-to-use and fast. So here we go.

### The test case

The basic set-up is: you have two labeled vectors of different lengths and you add them together. The algorithm matches the labels and adds together the corresponding values. Simple, right?
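As a tiny illustration of what alignment means here (using current pandas syntax; the 2011 API was similar):

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
y = pd.Series([10.0, 20.0], index=['b', 'd'])

# labels are matched before adding; the result is indexed by the
# union of the two label sets, with NaN wherever a label is missing
result = x + y
```

Here `result` has index `['a', 'b', 'c', 'd']`, with 12.0 at `'b'` and NaN everywhere else.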

### R/zoo benchmarks

Here’s the R code:

library(zoo)

indices <- rep(NA, 100000)
for (i in 1:100000)
  indices[i] <- paste(sample(letters, 10), collapse="")

timings <- numeric()

x <- zoo(rnorm(100000), indices)
y <- zoo(rnorm(90000), indices[sample(1:100000, 90000)])

for (i in 1:10) {
  gc()
  timings[i] <- system.time(x + y)[3]
}

In this benchmark, I get a timing of:

> mean(timings)
[1] 1.1518

So, 1.15 seconds per iteration. There are a couple things to note here:

• The zoo package pre-sorts the objects by the index/label. As you will see below, this makes a big performance difference, since a faster algorithm can be used on ordered data.
• zoo returns an object whose index is the intersection of the indexes. I disagree with this design choice as I feel that it is discarding information. pandas returns the union (the “outer join”, if you will) by default.

### Python benchmark

Here’s the code doing basically the same thing, except using objects that are not pre-sorted by label:

from pandas import *
from pandas.util.testing import rands
import numpy as np

n = 100000
indices = Index([rands(10) for _ in xrange(n)])

def sample(values, k):
    from random import shuffle
    sampler = np.arange(len(values))
    shuffle(sampler)
    return values.take(sampler[:k])

subsample_size = 90000

x = Series(np.random.randn(n), indices)
y = Series(np.random.randn(subsample_size),
           index=sample(indices, subsample_size))

And the timing:

In [11]: timeit x + y
10 loops, best of 3: 110 ms per loop

Now, if I first sort the objects by index, a more specialized algorithm will be used:

In [12]: xs = x.sort_index()

In [13]: ys = y.sort_index()

In [14]: timeit xs + ys
10 loops, best of 3: 44.1 ms per loop
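The specialized path exploits the ordering: instead of hashing labels, the union of two sorted label sets can be computed in a single linear two-pointer sweep. A pure-Python sketch of that idea (my own illustration, not the actual pandas internals):

```python
def ordered_union(left, right):
    # Two-pointer merge of two sorted label lists - the fast path
    # that a sorted index enables, O(n + m) with no hashing.
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i]); i += 1; j += 1
        elif left[i] < right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    out.extend(left[i:])   # whichever side has labels left over
    out.extend(right[j:])
    return out
```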

Note that pandas is also the fastest (that I know of) among Python libraries. Here’s the above example using the la (labeled array) package:

In [12]: import la

In [13]: lx = la.larry(x.values, [list(x.index)])

In [14]: ly = la.larry(y.values, [list(y.index)])

In [15]: timeit la.add(lx, ly, join="outer")
1 loops, best of 3: 214 ms per loop

In [16]: timeit la.add(lx, ly, join="inner")
10 loops, best of 3: 176 ms per loop

### The verdict

So in an apples-to-apples comparison, pandas is 26x faster than zoo in this benchmark. Even in the completely unordered case (which is not apples-to-apples), it’s 10x faster. I actually have a few tricks up my sleeve (as soon as I can find the time to implement them) to make the above operations even faster =)

R is much more magical than Python. What do I mean by this? In R, things like this are a part of everyday life:

> a <- rnorm(10)
> b <- rnorm(10)
> cbind(a, b)
               a          b
 [1,]  0.8729978  0.5170078
 [2,] -0.6885048 -0.4430447
 [3,]  0.4017740  1.8985843
 [4,]  2.1088905 -1.4121763
 [5,]  0.9375273  0.4703302
 [6,]  0.5558276 -0.5825152
 [7,] -2.1606252  0.7379874
 [8,] -0.7651046 -0.4534345
 [9,] -4.2604901  0.9561077
[10,]  0.3940632 -0.8331285

If you’re a seasoned Python programmer, you might have the same visceral negative reaction to this that I do. Seriously, just where in the hell did those variable names come from? So when I say magic here, I’m talking about computing on the language’s abstract syntax tree. R takes a different design philosophy from Python in this regard. As any Python programmer knows: explicit is better than implicit. I happen to agree. There is also a semantic difference between R and Python in that assignment in R typically copies data, whereas variables in Python are simply references (labels) for a particular object. So you could make the argument that the names a and b above are more strongly linked to the underlying data.
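The reference-semantics point is easy to demonstrate (plain NumPy, nothing pandas-specific):

```python
import numpy as np

a = np.array([1.0, 2.0])
b = a            # b is just another name for the same array object
b[0] = 99.0      # mutation is visible through both names

c = a.copy()     # an explicit copy, closer to R's assignment semantics
c[1] = -1.0      # does not touch a
```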

While building pandas over the last several years, I have occasionally grappled with issues like the above. Maybe I should just break from the Python ethos and embrace magic? I mean, how hard would it be to get the above behavior in Python? Python gives you stack frames and the ast module, after all. So I went down the rabbit hole and wrote a little code snippet to do it.
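The snippet itself was embedded and isn’t reproduced here; below is a minimal sketch of the frame-inspection idea (my own reconstruction, not the original code), which handles only bare variable names rather than full expressions like np.log(b):

```python
import inspect
import pandas as pd

def merge(*args):
    # Peek at the caller's local variables to recover the names the
    # arguments were passed under - "magic" in the R sense.
    frame = inspect.currentframe().f_back
    try:
        names = []
        for i, arg in enumerate(args):
            # find a caller-local name bound to this exact object
            name = next((k for k, v in frame.f_locals.items() if v is arg),
                        'arg%d' % i)
            names.append(name)
    finally:
        del frame  # avoid keeping a reference cycle through the frame
    return pd.DataFrame(dict(zip(names, args)))
```

Handling expressions like np.log(b), as in the output below, requires parsing the caller’s source line and walking the AST instead.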

While this is woefully unpythonic, it’s also kind of cool:

In [27]: merge(a, b)
Out[27]:
a         b
2000-01-03 -1.35      0.8398
2000-01-04  0.999    -1.617
2000-01-05  0.2537    1.433
2000-01-06  0.6273   -0.3959
2000-01-07  0.7963   -0.789
2000-01-10  0.004295 -1.446

This can even parse and format more complicated expressions (harder than it looks, because you have to walk the whole AST):

In [30]: merge(a, np.log(b))
Out[30]:
a        np.log(b)
2000-01-03  0.6243   0.7953
2000-01-04  0.3593  -1.199
2000-01-05  2.805   -1.059
2000-01-06  0.6369  -0.9067
2000-01-07 -0.2734   NaN
2000-01-10 -1.023    0.3326

Now, I am *not* suggesting we do this any time soon. I’m going to prefer the explicit approach (cf. the Zen of Python) any day of the week:

In [32]: DataFrame({'a' : a, 'log(b)' : np.log(b)})
Out[32]:
a        log(b)
2000-01-03  0.6243   0.7953
2000-01-04  0.3593  -1.199
2000-01-05  2.805   -1.059
2000-01-06  0.6369  -0.9067
2000-01-07 -0.2734   NaN
2000-01-10 -1.023    0.3326

In anticipation of integrating NumPy’s shiny new datetime64 dtype into pandas, I set about writing some faster alignment and merging functions for ordered time series data indexed by datetime64 timestamps. Many people have pointed me to the widely used R xts package as a baseline for highly optimized joining functions.

Anyway, long story short, with a little NumPy- and Cython-fu I think I’ve matched or beaten xts for almost all of its supported join types by up to 40% (left/outer/inner) using the merge.xts function.

In a blog article earlier today I wrote about some of the performance problems I had to address to do this. The rest of the joining code is pretty straightforward Cython code. Though it’ll probably be a few weeks before this new code gets incorporated into DataFrame.join. You’ll just have to take my word for it that I’m doing an apples-to-apples comparison (or read the source yourself) =)

### Python benchmarks

Here are the Python timings in milliseconds for joining two time series data sets. The column labels are the lengths (in scientific notation, from 100 through 1,000,000). The two sets of timings are for joining two univariate time series and for joining two collections of 5 time series.

EDIT (9/24/2011): after corresponding with the xts author, Jeff Ryan, I reran the benchmarks with the code modified to ensure that garbage collection time isn’t being included in the runtime. The results after the change to the benchmark are less disparate than before. I also tweaked the Cython algos to determine the outer/inner join time index and re-ran the benchmarks. In the 1e6 outer join case the new algo trimmed 8 ms off, 4-5ms in the inner join case. Whenever I develop a strong desire to hack up a pile of spaghetti-like Cython code (combining the index union/intersection routines with the take / row-copying code) I can probably shave off another few millis…

Python Benchmark on 9/25

Joining two univariate time series
          1e2      1e3      1e4     1e5    1e6
outer  0.0605   0.0637   0.1966  1.898  26.26
left   0.0187   0.02282  0.1157  1.023  13.89
inner  0.04526  0.05052  0.1523  1.382  22.25

Joining two 5-variate time series
          1e2      1e3      1e4     1e5    1e6
outer  0.07944  0.0638   0.3178  6.498  67.46
left   0.0255   0.03512  0.2467  4.711  51.88
inner  0.06176  0.05262  0.2283  5.267  56.46

EDIT 9/28: I put in some work integrating the new merging routines throughout DataFrame and friends in pandas and added a new Int64Index class to facilitate fast joining of time series data. Here are the updated benchmarks, which now have pandas a bit slower than xts for outer/inner joins in the univariate case but still significantly faster in the multivariate case:

Python Benchmark on 9/28 post integration

Joining two univariate time series
          1e2     1e3     1e4     1e5    1e6
outer  0.4501  0.314   0.5362  3.162  30.78
left   0.2742  0.2879  0.408   2.025  19.84
inner  0.2715  0.2863  0.4306  2.504  26.64

Joining two 5-variate time series
          1e2     1e3     1e4     1e5    1e6
outer  0.4507  0.3375  0.6158  7.028  71.34
left   0.2959  0.3184  0.4927  5.068  55.2
inner  0.2767  0.305   0.5368  5.782  59.65

As you can see in the 1 million row case there is an additional 4-5 ms of overhead across the board which largely has to do with handling types other than floating point. With some effort I could eliminate this overhead but I’m going to leave it for now.

And the source code for the benchmark:

from pandas import *
import gc
import time
import numpy as np

def bench_python(pct_overlap=0.20, K=1):
    ns = [2, 3, 4, 5, 6]
    iterations = 50
    kinds = ['outer', 'left', 'inner']

    all_results = {}
    for logn in ns:
        n = 10**logn
        a = np.arange(n, dtype=np.int64)
        b = np.arange(n * pct_overlap, n * pct_overlap + n, dtype=np.int64)
        a_frame = DataFrame(np.random.randn(n, K), index=a, columns=range(K))
        b_frame = DataFrame(np.random.randn(n, K),
                            index=b, columns=range(K, 2 * K))
        all_results[logn] = result = {}
        for kind in kinds:
            gc.disable(); _s = time.clock()
            # do the join
            for _ in range(iterations):
                a_frame.join(b_frame, how=kind)
            elapsed = time.clock() - _s; gc.enable()
            result[kind] = (elapsed / iterations) * 1000
    return DataFrame(all_results, index=kinds)

### R/xts benchmarks

And the R benchmark using xts. The results for the smaller datasets are unreliable due to the low precision of system.time.

R Benchmark

Joining two univariate time series
        1e2  1e3  1e4  1e5   1e6
outer  0.30 0.26 0.48 3.12 28.58
left   0.22 0.24 0.36 2.78 24.18
inner  0.20 0.24 0.40 2.42 21.06

Joining two 5-variate time series
        1e2  1e3  1e4   1e5   1e6
outer  0.26 0.46 1.30 11.56 97.02
left   0.34 0.28 1.06 10.04 85.72
inner  0.30 0.28 0.94  8.02 67.22

The Python code for the benchmark is all found here.

Here is the R code for the benchmark (GitHub link):

library(xts)

iterations <- 50

ns <- c(100, 1000, 10000, 100000, 1000000)
kinds <- c("outer", "left", "inner")

result <- matrix(0, nrow=3, ncol=length(ns))
pct.overlap <- 0.2
k <- 1

for (ni in 1:length(ns)) {
  n <- ns[ni]
  rng1 <- 1:n
  offset <- as.integer(n * pct.overlap)
  rng2 <- rng1 + offset
  x <- xts(matrix(rnorm(n * k), nrow=n, ncol=k),
           as.POSIXct(Sys.Date()) + rng1)
  y <- xts(matrix(rnorm(n * k), nrow=n, ncol=k),
           as.POSIXct(Sys.Date()) + rng2)
  timing <- numeric()
  for (i in 1:3) {
    kind <- kinds[i]
    for (j in 1:iterations) {
      gc()  # just to be sure
      timing[j] <- system.time(merge(x, y, join=kind))[3]
    }
    result[i, ni] <- mean(timing) * 1000
  }
}

rownames(result) <- kinds
colnames(result) <- log10(ns)

Many scientific Python users are surprised when I tell them that ndarray.take is faster than __getitem__-based (a.k.a. “fancy”) indexing.

import numpy as np
import random

arr = np.random.randn(10000, 5)
indexer = np.arange(10000)
random.shuffle(indexer)

In [26]: timeit arr[indexer]
1000 loops, best of 3: 1.25 ms per loop

In [27]: timeit arr.take(indexer, axis=0)
10000 loops, best of 3: 127 us per loop

It’s actually kind of unbelievable when you think about it. What’s going on here that makes take almost 10x faster? I really should take a closer look at the internals of what __getitem__ does, because this has always struck me as pretty bad. Maybe I shouldn’t be complaining? I mean, R 2.13’s indexing falls somewhere in the middle:

mat <- matrix(rnorm(50000), nrow=10000, ncol=5)
set.seed(12345)
indexer <- sample(1:10000)
> system.time(for (i in 1:1000) mat[indexer,])
user  system elapsed
0.460   0.197   0.656

So 656 microseconds per iteration. (In an earlier version of this post I used rpy2 to do the benchmark and got 1.05 ms, but there was apparently some overhead from rpy2)

Another peculiarity that I noticed with take is that performance gets worse when you use the out argument, which tells the function to use an array you pass in to write out the result:

out = np.empty_like(arr)

In [50]: timeit np.take(arr, indexer, axis=0, out=out)
10000 loops, best of 3: 200 us per loop

EDIT: I’ve been informed that using mode='clip' or mode='wrap' makes this run as fast as without the out argument.
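The mode argument declares up front how out-of-range indices should be treated, which lets take skip per-element bounds handling on the fast path. A quick check that the clipped variant still produces the same result (timings omitted, as they are machine-dependent):

```python
import numpy as np

arr = np.random.randn(10000, 5)
indexer = np.random.permutation(10000)
out = np.empty_like(arr)

# mode='clip' tells take the indices are already in range, so it can
# avoid bounds handling while writing into the preallocated `out`
np.take(arr, indexer, axis=0, out=out, mode='clip')
```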

Weird! I was dissatisfied by this, so I got curious how fast a hand-coded little Cython function can do this:

@cython.wraparound(False)
@cython.boundscheck(False)
def take_axis0(ndarray[float64_t, ndim=2] values,
               ndarray[int32_t] indexer,
               out=None):
    cdef:
        Py_ssize_t i, j, k, n, idx
        ndarray[float64_t, ndim=2] outbuf

    if out is None:
        outbuf = np.empty_like(values)
    else:
        outbuf = out

    n = len(indexer)
    k = values.shape[1]
    for i from 0 <= i < n:
        idx = indexer[i]

        if idx == -1:
            for j from 0 <= j < k:
                outbuf[i, j] = NaN
        else:
            for j from 0 <= j < k:
                outbuf[i, j] = values[idx, j]

Don’t worry about the -1 handling; that’s a specialization I’m using inside pandas. Curiously, this function is a lot faster than take with the out argument, but faster than the regular take by only a handful of microseconds.

In [53]: timeit lib.take_axis0(arr, indexer)
10000 loops, best of 3: 115 us per loop

In [54]: timeit lib.take_axis0(arr, indexer, out)
10000 loops, best of 3: 109 us per loop

Very interesting.
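The -1-as-missing convention can also be expressed in plain NumPy, at some cost in speed; a sketch (my own illustration, not the pandas internals):

```python
import numpy as np

def take_axis0_numpy(values, indexer):
    # mimic the Cython routine: rows indexed by -1 come out as NaN
    out = values.take(indexer, axis=0)  # -1 wraps to the last row here...
    out[indexer == -1] = np.nan         # ...then gets overwritten with NaN
    return out
```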

### TL;DR

• Use take, not []-based indexing, to get the best performance
• Cython is just as fast for my specific application and a lot faster if you’re passing an out array (which I will be for the application that I needed this for)
• R’s matrix indexing performance is better than NumPy’s fancy indexing, but about 5-6x slower than ndarray.take. This can probably be improved.

### Link to recording of talk

I had the privilege of speaking yesterday at the first meeting of the former NYC R Meetup, now dubbed the NYC Open Statistical Programming Meetup. I was really excited to share some of the new stuff I’ve built lately in pandas, though I didn’t have as much time to demo features as I would have liked. I also ran out of time to talk about statsmodels, but I’ll post some materials here for those who are interested.

One thing I showed off which a lot of people have been asking me about is the totally awesome new IPython HTML notebook. I don’t think I’ll ever do a Python demo in anything else, it’s just that good for teaching and doing interactive demos. Brian Granger, Fernando Perez, and crew have developed a notebook file format (.ipynb) for saving and sharing these interactive notebooks. I’ll post links to my IPython notebooks for the pandas demo and the statsmodels demo I didn’t have time for at the end, along with instructions for running them.

Here are my slides from the talk:

Now, here’s the exciting part: the IPython notebooks. You’ll need to install the latest development version of IPython from GitHub. Make sure you have the Python ZeroMQ bindings and the tornado web server installed. This ought to work (using setuptools):

easy_install pyzmq
easy_install tornado

Once you’ve done all that, fire up a terminal and run:

ipython notebook --pylab=inline

You should see the server fire up and give you a web address like 127.0.0.1:8888. Now, you need my demo workbooks. Here is a zip file with everything you need:

Expand this zip file and launch the IPython notebook in the same directory where the .ipynb files live. You should be able to run all the code in the pandas demo workbook yourself if you have pandas 0.4 (click “Run All”), but the statsmodels notebook uses some very new code from the pandas-integration branch on GitHub and some formula code that is not part of the statsmodels repo. So you can see all the code and output, but if you want to know more, I encourage you to get involved in the project!

For all my Python zealotry, I never claimed it was perfect. In fact, there are a number of places you can shoot yourself in the foot if you aren’t careful. But I was kind of bummed about this one:

import numpy as np

class MainScreen(object):

    def __init__(self, cats):
        self.cats = cats
        self._signal = None

    def turn_on(self):
        if self._signal is None:
            signal = Signal(self)
            self._signal = signal
        return self._signal

class Signal(object):

    def __init__(self, main_screen):
        # well we want to know which screen we came from!
        self.main_screen = main_screen

    def we_get(self):
        return True

for i in xrange(50):
    cats = np.empty((10000, 500), dtype=float)
    cats.fill(0)
    main_screen = MainScreen(cats)
    signal = main_screen.turn_on()

Try it.

EDIT: I’ve been informed by an astute reader that the GC is actually capable of dealing with the above case. Yang Zhang says “The collector actually works with cycles, including this case. (Actually, most objects are managed via reference-counting; the gc is specifically for cyclic references.) The problem is that it’s triggered by # allocations, not # bytes. In this case, each cats array is large enough that you need several GB of allocations before the threshold is triggered. Try adding gc.collect() or gc.get_count() in the loop (and inspect gc.get_threshold()).”
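Yang’s point is easy to check: collection is triggered by allocation counts, not bytes allocated, so a handful of huge cyclic objects can sit around indefinitely. A small sketch (object sizes chosen arbitrarily for illustration):

```python
import gc

class Cycle(object):
    def __init__(self):
        self.payload = bytearray(10 * 1024 * 1024)  # ~10 MB per object
        self.me = self  # a reference cycle: refcounting alone can't reclaim it

gc.collect()  # start from a clean slate (also resets the allocation counters)

for _ in range(3):
    Cycle()  # each instance becomes cyclic garbage immediately

# only a few allocations have happened, far below the generation-0
# threshold (700 by default), so the ~30 MB of garbage is still live;
# an explicit collect reclaims it right away
n_freed = gc.collect()
```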

All is not lost however. This pattern can be implemented using weak references:

import weakref

class MainScreen(object):

    def __init__(self, cats):
        self.cats = cats
        self._signal = None

    def turn_on(self):
        if self._signal is None:
            signal = Signal(self)
            self._signal = weakref.ref(signal)
        return self._signal()
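The key property of a weak reference, under CPython’s deterministic refcounting, is that it does not keep its target alive (a minimal illustration, separate from the classes above):

```python
import weakref

class Thing(object):
    pass

t = Thing()
r = weakref.ref(t)
assert r() is t      # target alive: calling the ref returns it

del t                # last strong reference gone; CPython frees it immediately
assert r() is None   # the weak reference has gone dead
```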