I tweeted that yesterday and figured it would be prudent to back it up with some code and real benchmarks. I’m really proud of pandas’s performance after investing years of development in building a tool that is both easy to use and fast. So here we go.
The test case
The basic set-up is: you have two labeled vectors of different lengths and you add them together. The algorithm matches the labels and adds together the corresponding values. Simple, right?
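To make the operation concrete, here is a minimal pure-Python sketch of label-aligned addition, with plain dicts standing in for labeled vectors. The function name `add_aligned` is hypothetical, and giving NaN to labels that appear on only one side (an outer join on the labels) is an assumption for illustration. This is not code from pandas or zoo.

```python
import math


def add_aligned(x, y):
    """Add two labeled vectors (dicts mapping label -> value) by matching labels.

    Assumption: labels found in only one input get NaN (outer join on labels).
    """
    result = {}
    for label in x.keys() | y.keys():
        if label in x and label in y:
            # Label present on both sides: add the corresponding values.
            result[label] = x[label] + y[label]
        else:
            # Label present on one side only: no partner value to add.
            result[label] = math.nan
    return result


x = {"a": 1.0, "b": 2.0, "c": 3.0}
y = {"b": 10.0, "c": 20.0}
total = add_aligned(x, y)
```

Here `total["b"]` and `total["c"]` hold the sums of the matched values, while `total["a"]` is NaN because `y` has no label `"a"`.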
R/zoo benchmarks
Here’s the R code:
library(zoo)

# 100,000 random 10-letter labels
indices = rep(NA, 100000)
for (i in 1:100000)
  indices[i] <- paste(sample(letters, 10), collapse="")

timings <- numeric()

x <- zoo(rnorm(100000), indices)
y <- zoo(rnorm(90000), indices[sample(1:100000, 90000)])

# Time x + y ten times, collecting garbage before each run
for (i in 1:10) {
  gc()
  timings[i] = system.time(x + y)[3]
}
In this benchmark, I get a timing of:
[1] 1.1518
So, 1.15 seconds per iteration. One thing to note here: zoo stores its observations ordered by index, so x and y are already sorted by label when they are added.
Python benchmark
Here’s the code doing basically the same thing, except using objects that are not pre-sorted by label:
import numpy as np
from pandas import Index, Series
from pandas.util.testing import rands  # rands(k): random k-character string

n = 100000
indices = Index([rands(10) for _ in range(n)])

def sample(values, k):
    # Take k entries of `values` without replacement, in random order
    from random import shuffle
    sampler = np.arange(len(values))
    shuffle(sampler)
    return values.take(sampler[:k])

subsample_size = 90000

x = Series(np.random.randn(n), indices)
y = Series(np.random.randn(subsample_size),
           index=sample(indices, subsample_size))
And the timing of x + y:
10 loops, best of 3: 110 ms per loop
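The "N loops, best of 3" line is the format printed by IPython's timeit magic. Outside IPython, a comparable measurement can be sketched with the standard-library `timeit` module. The helper name `best_of` and the toy workload are assumptions for illustration. In the real benchmark the timed callable would be something like `lambda: x + y`.

```python
import timeit


def best_of(func, loops=10, repeat=3):
    """Return the best per-call time in seconds over `repeat` runs of `loops` calls."""
    totals = timeit.repeat(func, number=loops, repeat=repeat)
    # Each entry in `totals` is the time for `loops` calls; keep the fastest run.
    return min(totals) / loops


def workload():
    # Stand-in for the operation being timed; any zero-argument callable works.
    return sum(i * i for i in range(10000))


seconds = best_of(workload)
print("%d loops, best of %d: %.3g ms per loop" % (10, 3, seconds * 1e3))
```

Taking the minimum rather than the mean reduces the influence of unrelated system activity on the reported figure.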
Now, if I first sort the objects by index, a more specialized algorithm will be used:
In [12]: xs = x.sort_index()
In [13]: ys = y.sort_index()
In [14]: timeit xs + ys
10 loops, best of 3: 44.1 ms per loop
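The post does not show which specialized algorithm pandas uses for sorted indexes. As a rough illustration of why sorted labels help, the sketch below adds two label-sorted vectors in a single two-pointer merge pass, with no hashing or lookups. The function name `add_sorted` and the list-of-pairs representation are assumptions for illustration, and labels are assumed to be unique within each input.

```python
import math


def add_sorted(x, y):
    """Outer-join add of two lists of (label, value) pairs, each sorted by unique label."""
    i = j = 0
    out = []
    while i < len(x) and j < len(y):
        (lx, vx), (ly, vy) = x[i], y[j]
        if lx == ly:
            # Labels match: add the values and advance both sides.
            out.append((lx, vx + vy))
            i += 1
            j += 1
        elif lx < ly:
            # Label only in x (so far): no partner, advance x.
            out.append((lx, math.nan))
            i += 1
        else:
            # Label only in y (so far): no partner, advance y.
            out.append((ly, math.nan))
            j += 1
    # Whatever remains on either side has no partner.
    out.extend((label, math.nan) for label, _ in x[i:])
    out.extend((label, math.nan) for label, _ in y[j:])
    return out


xs = [("a", 1.0), ("b", 2.0), ("d", 4.0)]
ys = [("b", 10.0), ("c", 5.0), ("d", 40.0), ("e", 7.0)]
merged = add_sorted(xs, ys)
```

Each input is visited once, so the work is linear in the combined length. Unsorted inputs need either a sort or a hash-based lookup first.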
Note that pandas is also the fastest (that I know of) among Python libraries. Here’s the above example using the labeled array package, la:
In [13]: lx = la.larry(x.values, [list(x.index)])
In [14]: ly = la.larry(y.values, [list(y.index)])
In [15]: timeit la.add(lx, ly, join="outer")
1 loops, best of 3: 214 ms per loop
In [16]: timeit la.add(lx, ly, join="inner")
10 loops, best of 3: 176 ms per loop
The verdict
So in an apples-to-apples comparison, with both sides sorted by label, pandas is 26x faster than zoo in this benchmark (44.1 ms versus 1.15 s). Even in the completely unordered case (which is not apples-to-apples), it’s 10x faster (110 ms versus 1.15 s). I actually have a few tricks up my sleeve (as soon as I can find the time to implement them) to make the above operations even faster still =)
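The speedup figures follow directly from the timings reported above. A quick check of the arithmetic (the variable names are just for illustration):

```python
# Timings reported in the post, converted to seconds.
zoo_s = 1.1518             # R/zoo, per iteration
pandas_unsorted_s = 0.110  # pandas, labels not sorted
pandas_sorted_s = 0.0441   # pandas, both objects sorted by index

speedup_sorted = zoo_s / pandas_sorted_s
speedup_unsorted = zoo_s / pandas_unsorted_s

print("sorted:   %.1fx" % speedup_sorted)
print("unsorted: %.1fx" % speedup_unsorted)
```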


Wes McKinney Reply:
September 30th, 2011 at 4:04 am
I think the larger problem here is that the data alignment algorithms in zoo are not very good. Changing the index type from string to long int or double won’t make a bad algorithm good (though faster, yes).