WM in 2015: Woefully out of date, but I preserve this post for posterity.
generic pandas data alignment is about 10-15x faster than the #rstats zoo package in initial tests. interesting #python
— Wes McKinney (@wesmckinn) September 29, 2011
I tweeted that yesterday and figured it would be prudent to justify it with some code and real benchmarks. I'm really proud of pandas's performance after investing years of development in building a tool that is both easy to use and fast. So here we go.
The test case
The basic set-up is: you have two labeled vectors of different lengths and you add them together. The algorithm matches the labels and adds together the corresponding values. Simple, right?
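As a rough illustration of what "matching the labels" means, here is a pure-Python sketch (illustrative only, not how pandas implements it): match labels between the two inputs, add where both sides have a value, and mark labels present on only one side as missing.

```python
# Illustrative sketch of label-aligned addition. Inputs are dicts
# mapping label -> value; the result covers the union of the labels,
# with None standing in for "missing" (pandas would use NaN).
def aligned_add(a, b):
    labels = sorted(set(a) | set(b))
    return {k: (a[k] + b[k]) if k in a and k in b else None
            for k in labels}

x = {"a": 1.0, "b": 2.0, "c": 3.0}
y = {"b": 10.0, "c": 20.0, "d": 30.0}
print(aligned_add(x, y))
# {'a': None, 'b': 12.0, 'c': 23.0, 'd': None}
```

The real work, of course, is doing this matching fast over large arrays rather than one label at a time.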
R/zoo benchmarks
Here's the R code:
library(zoo)

indices <- rep(NA, 100000)
for (i in 1:100000)
  indices[i] <- paste(sample(letters, 10), collapse="")

timings <- numeric()

x <- zoo(rnorm(100000), indices)
y <- zoo(rnorm(90000), indices[sample(1:100000, 90000)])
for (i in 1:10) {
  gc()
  timings[i] <- system.time(x + y)[3]
}
In this benchmark, I get a timing of:
> mean(timings)
[1] 1.1518
So, 1.15 seconds per iteration. There are a couple of things to note here:
- The zoo package pre-sorts the objects by the index/label. As you will see below this makes a big performance difference as you can write a faster algorithm for ordered data.
- zoo returns an object whose index is the intersection of the indexes. I disagree with this design choice as I feel that it is discarding information. pandas returns the union (the "outer join", if you will) by default.
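To see that union behavior concretely, here is a tiny pandas example (written with the `import pandas as pd` convention rather than the star import used below): labels present in only one operand come back as NaN instead of being dropped.

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0], index=["a", "b"])
y = pd.Series([10.0, 20.0], index=["b", "c"])

# The result index is the union {"a", "b", "c"}; labels that appear
# in only one operand yield NaN rather than being silently discarded.
result = x + y
print(result)
# a     NaN
# b    12.0
# c     NaN
```

An intersection ("inner join") result can always be recovered by dropping the NaNs, but the reverse is not true, which is why I consider the union the less lossy default.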
Python benchmark
Here's the code doing basically the same thing, except using objects that are not pre-sorted by label:
from pandas import *
from pandas.util.testing import rands
import numpy as np

n = 100000
indices = Index([rands(10) for _ in xrange(n)])

def sample(values, k):
    # Take a random subsample of k values without replacement
    from random import shuffle
    sampler = np.arange(len(values))
    shuffle(sampler)
    return values.take(sampler[:k])

subsample_size = 90000
x = Series(np.random.randn(n), indices)
y = Series(np.random.randn(subsample_size),
           index=sample(indices, subsample_size))
And the timing:
In [11]: timeit x + y
10 loops, best of 3: 110 ms per loop
Now, if I first sort the objects by index, a more specialized algorithm will be used:
In [12]: xs = x.sort_index()
In [13]: ys = y.sort_index()
In [14]: timeit xs + ys
10 loops, best of 3: 44.1 ms per loop
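The reason sorting helps: aligning two sorted label arrays can be done in a single linear merge pass with two pointers, rather than building and probing a hash table. Here is a rough pure-Python sketch of that ordered-merge idea (my own illustration, not the actual pandas internals):

```python
def merge_add_sorted(ka, va, kb, vb):
    """Outer-join addition of two label/value sequences whose keys
    are already sorted, using a single two-pointer merge pass.
    Labels present on only one side produce None (think NaN)."""
    i = j = 0
    keys, out = [], []
    while i < len(ka) and j < len(kb):
        if ka[i] == kb[j]:
            keys.append(ka[i]); out.append(va[i] + vb[j]); i += 1; j += 1
        elif ka[i] < kb[j]:
            keys.append(ka[i]); out.append(None); i += 1
        else:
            keys.append(kb[j]); out.append(None); j += 1
    # Drain whichever side has leftover (unmatched) labels
    keys.extend(ka[i:]); out.extend([None] * (len(ka) - i))
    keys.extend(kb[j:]); out.extend([None] * (len(kb) - j))
    return keys, out

print(merge_add_sorted(["a", "b", "d"], [1, 2, 3],
                       ["b", "c"], [10, 20]))
# (['a', 'b', 'c', 'd'], [None, 12, None, None])
```

Each label is visited exactly once, so the pass is O(n + m); the hash-based path for unsorted labels does more work per element, which is consistent with the roughly 2.5x gap between the two timings above.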
Note that pandas is also the fastest (that I know of) among Python libraries. Here's the above example using the la (labeled array) package:
In [12]: import la
In [13]: lx = la.larry(x.values, [list(x.index)])
In [14]: ly = la.larry(y.values, [list(y.index)])
In [15]: timeit la.add(lx, ly, join="outer")
1 loops, best of 3: 214 ms per loop
In [16]: timeit la.add(lx, ly, join="inner")
10 loops, best of 3: 176 ms per loop
The verdict
So in an apples-to-apples comparison, pandas is 26x faster than zoo in this benchmark. Even in the completely unordered case (which is not apples-to-apples), it's roughly 10x faster. I actually have a few tricks up my sleeve (as soon as I can find the time to implement them) to make the above operations even faster still =)