## The pandas escaped the zoo: Python’s pandas vs. R’s zoo benchmarks

> generic pandas data alignment is about 10-15x faster than the #rstats zoo package in initial tests. interesting #python
>
> Wes McKinney (@wesmckinn)

I tweeted that yesterday and figured it would be prudent to back it up with some code and real benchmarks. I’m really proud of pandas’s performance after investing years of development in building a tool that is both easy to use and fast. So here we go.

### The test case

The basic set-up is: you have two labeled vectors of different lengths and you add them together. The algorithm matches the labels and adds together the corresponding values. Simple, right?
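
To make the operation concrete, here is a minimal sketch in plain Python, with dicts standing in for labeled vectors. The function name `add_aligned` and its `join` parameter are hypothetical and chosen only for illustration. This is not how pandas or zoo implement alignment internally.

```python
import math


def add_aligned(x, y, join="outer"):
    """Add two label -> value mappings, matching values by label.

    Hypothetical helper for illustration only.
    """
    if join == "outer":
        # Union of the labels: x's labels first, then those found only in y.
        labels = list(x) + [label for label in y if label not in x]
    elif join == "inner":
        # Intersection of the labels, in x's order.
        labels = [label for label in x if label in y]
    else:
        raise ValueError("join must be 'outer' or 'inner'")
    nan = float("nan")
    # A label missing from either side contributes NaN to the sum.
    return {label: x.get(label, nan) + y.get(label, nan) for label in labels}


x = {"a": 1.0, "b": 2.0, "c": 3.0}
y = {"c": 10.0, "a": 20.0}

outer = add_aligned(x, y)                # keeps "b", with a NaN value
inner = add_aligned(x, y, join="inner")  # drops "b" entirely
```

The order of the labels in `y` does not matter, because values are paired by label rather than by position.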

### R/zoo benchmarks

Here’s the R code:

    library(zoo)

    indices = rep(NA, 100000)
    for (i in 1:100000)
      indices[i] <- paste(sample(letters, 10), collapse="")

    timings <- numeric()

    x <- zoo(rnorm(100000), indices)
    y <- zoo(rnorm(90000), indices[sample(1:100000, 90000)])

    for (i in 1:10) {
      gc()
      timings[i] = system.time(x + y)[3]
    }

In this benchmark, I get a timing of:

    > mean(timings)
    [1] 1.1518

So, 1.15 seconds per iteration. There are a couple of things to note here:

• The zoo package pre-sorts the objects by the index/label. As you will see below, this makes a big performance difference, because you can write a faster algorithm for ordered data.
• zoo returns an object whose index is the intersection of the indexes. I disagree with this design choice as I feel that it is discarding information. pandas returns the union (the “outer join”, if you will) by default.
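
Why ordering helps can be sketched with a two-pointer merge: when both label sequences are sorted and unique, a single pass pairs up matching labels without building a hash table. The function `merge_add_sorted` is a hypothetical name, and this illustrates the general technique under those assumptions. It is not the actual code in zoo or pandas.

```python
def merge_add_sorted(x_labels, x_values, y_labels, y_values):
    """Outer-join add of two series whose labels are sorted and unique.

    Hypothetical helper for illustration only. Takes plain lists.
    """
    nan = float("nan")
    out_labels, out_values = [], []
    i = j = 0
    while i < len(x_labels) and j < len(y_labels):
        if x_labels[i] == y_labels[j]:
            # Label present on both sides: add the values.
            out_labels.append(x_labels[i])
            out_values.append(x_values[i] + y_values[j])
            i += 1
            j += 1
        elif x_labels[i] < y_labels[j]:
            # Label only in x: no partner, so the result is NaN.
            out_labels.append(x_labels[i])
            out_values.append(nan)
            i += 1
        else:
            # Label only in y.
            out_labels.append(y_labels[j])
            out_values.append(nan)
            j += 1
    # At most one side still has labels left; none of them has a partner.
    for label in x_labels[i:] + y_labels[j:]:
        out_labels.append(label)
        out_values.append(nan)
    return out_labels, out_values


labels, values = merge_add_sorted(
    ["a", "b", "d"], [1.0, 2.0, 4.0],
    ["b", "c", "d"], [10.0, 20.0, 40.0],
)
# labels is ["a", "b", "c", "d"]; the values for "b" and "d" are 12.0 and 44.0,
# and the values for "a" and "c" are NaN.
```

Returning only the matched labels instead would give the intersection rather than the union.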

### Python benchmark

Here’s the code doing basically the same thing, except using objects that are not pre-sorted by label:

    import numpy as np
    from pandas import *
    from pandas.util.testing import rands
    from random import shuffle

    n = 100000
    # n random 10-character string labels (xrange: this is Python 2 code)
    indices = Index([rands(10) for _ in xrange(n)])

    def sample(values, k):
        # Take k values at random positions, without replacement
        sampler = np.arange(len(values))
        shuffle(sampler)
        return values.take(sampler[:k])

    subsample_size = 90000

    x = Series(np.random.randn(n), indices)
    y = Series(np.random.randn(subsample_size),
               index=sample(indices, subsample_size))

And the timing:

    In [11]: timeit x + y
    10 loops, best of 3: 110 ms per loop
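
For readers outside IPython, a "best of 3" figure like this one can be approximated with the standard library's `timeit` module. The sketch below uses a hypothetical `best_per_loop` helper and a stand-in `workload` function in place of `x + y`; the loop and repeat counts are set to mirror the output above.

```python
import timeit


def best_per_loop(func, loops=10, repeat=3):
    """Return the best per-call time in seconds over `repeat` runs of `loops` calls.

    Hypothetical helper for illustration only.
    """
    totals = timeit.repeat(func, number=loops, repeat=repeat)
    return min(totals) / loops


def workload():
    # Stand-in for the operation being measured, e.g. lambda: x + y
    return sum(range(10000))


seconds = best_per_loop(workload)
print("10 loops, best of 3: %.3g ms per loop" % (seconds * 1e3))
```

Taking the minimum across repeats, rather than the mean, reduces the influence of other activity on the machine.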

Now, if I first sort the objects by index, a more specialized algorithm will be used:

    In [12]: xs = x.sort_index()

    In [13]: ys = y.sort_index()

    In [14]: timeit xs + ys
    10 loops, best of 3: 44.1 ms per loop

Note that pandas is also the fastest (that I know of) among Python libraries. Here’s the above example using the labeled array package (la):

    In [12]: import la

    In [13]: lx = la.larry(x.values, [list(x.index)])

    In [14]: ly = la.larry(y.values, [list(y.index)])

    In [15]: timeit la.add(lx, ly, join="outer")
    1 loops, best of 3: 214 ms per loop

    In [16]: timeit la.add(lx, ly, join="inner")
    10 loops, best of 3: 176 ms per loop

### The verdict

So in an apples-to-apples comparison (both sides working on sorted data), pandas is 26x faster than zoo in this benchmark. Even in the completely unordered case (which is not apples-to-apples), it’s 10x faster. I actually have a few tricks up my sleeve (as soon as I can find the time to implement them) to make the above operations even faster still =)
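
The two ratios follow directly from the timings reported earlier in the post. As a quick check of the arithmetic (the variable names are only for illustration):

```python
zoo_seconds = 1.1518             # mean of the ten R/zoo timings
pandas_sorted_seconds = 44.1e-3  # pandas, both Series sorted by index
pandas_unsorted_seconds = 110e-3  # pandas, unsorted labels

sorted_speedup = zoo_seconds / pandas_sorted_seconds
unsorted_speedup = zoo_seconds / pandas_unsorted_seconds

print("sorted: %.1fx, unsorted: %.1fx" % (sorted_speedup, unsorted_speedup))
# prints "sorted: 26.1x, unsorted: 10.5x"
```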