WM in 2015: Woefully out of date, but I preserve this post for posterity.

generic pandas data alignment is about 10-15x faster than the #rstats zoo package in initial tests. interesting #python

— Wes McKinney (@wesmckinn) September 29, 2011

I tweeted that yesterday and figured it would be prudent to justify that with some code and real benchmarks. I'm really proud of pandas's performance after investing years of development building a tool that is both **easy-to-use** and **fast**. So here we go.

### The test case

The basic set-up is: you have two labeled vectors of different lengths and you add them together. The algorithm matches the labels and adds together the corresponding values. Simple, right?
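To make the operation concrete, here is a minimal sketch of label-aligned addition in pandas (toy data, not the benchmark itself): labels present in only one operand get a missing value in the result.

```python
import pandas as pd

# Two labeled vectors of different lengths
x = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
y = pd.Series([10.0, 20.0], index=["b", "c"])

# Addition matches on labels: "a" has no counterpart in y, so it becomes NaN
result = x + y
# a -> NaN, b -> 12.0, c -> 23.0
```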

### R/zoo benchmarks

Here's the R code:

```r
library(zoo)

indices = rep(NA, 100000)
for (i in 1:100000)
  indices[i] <- paste(sample(letters, 10), collapse="")

timings <- numeric()

x <- zoo(rnorm(100000), indices)
y <- zoo(rnorm(90000), indices[sample(1:100000, 90000)])

for (i in 1:10) {
  gc()
  timings[i] = system.time(x + y)[3]
}
```

In this benchmark, I get a timing of:

```r
> mean(timings)
[1] 1.1518
```

So, 1.15 seconds per iteration. There are a couple things to note here:

- The zoo package **pre-sorts** the objects by the index/label. As you will see below, this makes a **big** performance difference, as you can write a faster algorithm for ordered data.
- zoo returns an object whose index is the **intersection** of the indexes. I disagree with this design choice, as I feel that it discards information. pandas returns the union (the "outer join", if you will) by default.

### Python benchmark

Here's the code doing basically the same thing, except using objects that are **not** pre-sorted by label:

```python
import numpy as np

from pandas import *
from pandas.util.testing import rands

n = 100000
indices = Index([rands(10) for _ in xrange(n)])

def sample(values, k):
    from random import shuffle
    sampler = np.arange(len(values))
    shuffle(sampler)
    return values.take(sampler[:k])

subsample_size = 90000
x = Series(np.random.randn(100000), indices)
y = Series(np.random.randn(subsample_size),
           index=sample(indices, subsample_size))
```

And the timing:

```python
In [11]: timeit x + y
10 loops, best of 3: 110 ms per loop
```

Now, if I first sort the objects by index, a more specialized algorithm will be used:

```python
In [12]: xs = x.sort_index()

In [13]: ys = y.sort_index()

In [14]: timeit xs + ys
10 loops, best of 3: 44.1 ms per loop
```
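The specialized path is possible because a sorted index is monotonic, so alignment can proceed as an ordered merge rather than a hash-based lookup. A quick illustration with present-day pandas (the `is_monotonic_increasing` property is current API, not necessarily what existed in 2011):

```python
import numpy as np
import pandas as pd

x = pd.Series(np.random.randn(5), index=list("ecadb"))

# sort_index() returns a copy whose labels are in sorted order;
# pandas can detect this and take the faster ordered-merge code path
xs = x.sort_index()
print(xs.index.is_monotonic_increasing)  # True
```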

Note that pandas is also the fastest (that I know of) among Python libraries. Here's the above example using the `la` (labeled array) package:

```python
In [12]: import la

In [13]: lx = la.larry(x.values, [list(x.index)])

In [14]: ly = la.larry(y.values, [list(y.index)])

In [15]: timeit la.add(lx, ly, join="outer")
1 loops, best of 3: 214 ms per loop

In [16]: timeit la.add(lx, ly, join="inner")
10 loops, best of 3: 176 ms per loop
```

### The verdict

So in an apples-to-apples comparison, in this benchmark pandas is **26x** faster than zoo. Even in the completely unordered case (which is not apples-to-apples), it's 10x faster. I actually have a few tricks up my sleeve (as soon as I can find the time to implement them) to make the above operations faster still =)