WM in 2015: Woefully out of date, but I preserve this post for posterity.

generic pandas data alignment is about 10-15x faster than the #rstats zoo package in initial tests. interesting #python

— Wes McKinney (@wesmckinn) September 29, 2011

I tweeted that yesterday and figured it would be prudent to justify that with some code and real benchmarks. I'm really proud of pandas's performance after investing years of development building a tool that is both **easy-to-use** and **fast**. So here we go.

### The test case

The basic set-up is: you have two labeled vectors of different lengths and you add them together. The algorithm matches the labels and adds together the corresponding values. Simple, right?
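To make the operation concrete, here is a minimal sketch in plain Python (dicts instead of pandas/zoo objects, and not how either library implements it) of what "match the labels and add the corresponding values" means, keeping the union of the labels:

```python
def aligned_add(x, y):
    """x and y map label -> value; return the label-aligned sum.

    Labels present in only one input get None (a stand-in for NaN)."""
    labels = sorted(set(x) | set(y))  # union of the two label sets
    return {
        k: (x[k] + y[k]) if (k in x and k in y) else None
        for k in labels
    }
```

The real libraries do this over large arrays of labels, which is where the performance differences below come from.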

### R/zoo benchmarks

Here's the R code:

```
library(zoo)

indices <- rep(NA, 100000)
for (i in 1:100000)
  indices[i] <- paste(sample(letters, 10), collapse="")

x <- zoo(rnorm(100000), indices)
y <- zoo(rnorm(90000), indices[sample(1:100000, 90000)])

timings <- numeric()
for (i in 1:10) {
  gc()
  timings[i] <- system.time(x + y)[3]
}
```

In this benchmark, I get a timing of:

```
> mean(timings)
[1] 1.1518
```

So, 1.15 seconds per iteration. A couple of things to note here:

- The zoo package **pre-sorts** the objects by the index/label. As you will see below, this makes a **big** performance difference, as you can write a faster algorithm for ordered data.
- zoo returns an object whose index is the **intersection** of the indexes. I disagree with this design choice, as I feel that it discards information. pandas returns the union (the "outer join", if you will) by default.
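The union behavior is easy to see in a tiny example (written against the modern `import pandas as pd` convention; values chosen purely for illustration):

```python
import pandas as pd
import numpy as np

x = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
y = pd.Series([10.0, 20.0], index=["b", "c"])

# pandas aligns on the union of the labels; labels present in only
# one operand produce NaN rather than being silently dropped.
result = x + y  # index is ["a", "b", "c"]; "a" is NaN
```

With zoo's intersection semantics, the `"a"` entry would simply disappear from the output; here it survives as a NaN, so you can see that information was missing.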

### Python benchmark

Here's the code doing basically the same thing, except using objects that are **not** pre-sorted by label:

```
import numpy as np
from pandas import *
from pandas.util.testing import rands

n = 100000
indices = Index([rands(10) for _ in xrange(n)])

def sample(values, k):
    from random import shuffle
    sampler = np.arange(len(values))
    shuffle(sampler)
    return values.take(sampler[:k])

subsample_size = 90000
x = Series(np.random.randn(n), indices)
y = Series(np.random.randn(subsample_size),
           index=sample(indices, subsample_size))
```

And the timing:

```
In [11]: timeit x + y
10 loops, best of 3: 110 ms per loop
```

Now, if I first sort the objects by index, a more specialized algorithm will be used:

```
In [12]: xs = x.sort_index()
In [13]: ys = y.sort_index()
In [14]: timeit xs + ys
10 loops, best of 3: 44.1 ms per loop
```
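The reason sorting helps: with both indexes in sorted order, the union and the alignment positions can be computed in a single merge-style pass, O(n + m) with no hashing. A rough sketch of the idea (not the actual pandas internals, which do this in Cython over integer arrays):

```python
def merge_align(labels_x, labels_y):
    """Align two sorted, duplicate-free label sequences.

    Returns the sorted union plus, for each output slot, the position
    of that label in x and in y (-1 where the label is absent)."""
    out, ix, iy = [], [], []
    i = j = 0
    while i < len(labels_x) or j < len(labels_y):
        if j == len(labels_y) or (i < len(labels_x) and labels_x[i] < labels_y[j]):
            out.append(labels_x[i]); ix.append(i); iy.append(-1); i += 1
        elif i == len(labels_x) or labels_y[j] < labels_x[i]:
            out.append(labels_y[j]); ix.append(-1); iy.append(j); j += 1
        else:  # label present in both inputs
            out.append(labels_x[i]); ix.append(i); iy.append(j); i += 1; j += 1
    return out, ix, iy
```

Once you have the two position arrays, the addition itself is just a couple of vectorized `take` operations. The unordered case has to build a hash table of one index and probe it with the other, which is what the extra ~66 ms buys you.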

Note that pandas is also, as far as I know, the fastest among Python libraries. Here's the same example using the la (labeled array) package:

```
In [12]: import la
In [13]: lx = la.larry(x.values, [list(x.index)])
In [14]: ly = la.larry(y.values, [list(y.index)])
In [15]: timeit la.add(lx, ly, join="outer")
1 loops, best of 3: 214 ms per loop
In [16]: timeit la.add(lx, ly, join="inner")
10 loops, best of 3: 176 ms per loop
```

### The verdict

So in an apples-to-apples comparison, in this benchmark pandas is **26x** faster than zoo. Even in the completely unordered case (which is not apples-to-apples), it's 10x faster. I actually have a few tricks up my sleeve (as soon as I can find the time to implement them) to make the above operations faster still =)