I tweeted that yesterday and figured it would be prudent to justify that with some code and real benchmarks. I’m really proud of pandas’s performance after investing years of development building a tool that is both **easy-to-use** and **fast**. So here we go.

### The test case

The basic set-up is: you have two labeled vectors of different lengths and you add them together. The algorithm matches the labels and adds together the corresponding values. Simple, right?
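To make "matches the labels" concrete, here’s a minimal sketch of the semantics using plain Python dicts (the hypothetical `add_labeled` helper is for illustration only, not any library’s API):

```python
def add_labeled(x, y):
    """Add two label -> value mappings, aligning on the union of labels.

    Labels present on only one side get None (a stand-in for NaN).
    """
    labels = sorted(set(x) | set(y))
    return {k: (x[k] + y[k]) if (k in x and k in y) else None
            for k in labels}

a = {"apple": 1.0, "banana": 2.0}
b = {"banana": 10.0, "cherry": 5.0}
print(add_labeled(a, b))
# {'apple': None, 'banana': 12.0, 'cherry': None}
```

The interesting part is doing this alignment fast when the vectors have 100,000 labels each.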

### R/zoo benchmarks

Here’s the R code:

```r
library(zoo)

# Generate 100,000 random 10-letter labels
indices <- rep(NA, 100000)
for (i in 1:100000)
  indices[i] <- paste(sample(letters, 10), collapse="")

timings <- numeric()

x <- zoo(rnorm(100000), indices)
y <- zoo(rnorm(90000), indices[sample(1:100000, 90000)])

for (i in 1:10) {
  gc()
  timings[i] <- system.time(x + y)[3]
}
```

In this benchmark, I get a timing of:

```
[1] 1.1518
```

So, 1.15 seconds per iteration. There are a couple things to note here:

- zoo **pre-sorts** the objects by the index/label. As you will see below, this makes a **big** performance difference, as you can write a faster algorithm for ordered data.
- zoo returns the **intersection** of the indexes. I disagree with this design choice as I feel that it is discarding information. pandas returns the union (the "outer join", if you will) by default.
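To illustrate the union-versus-intersection distinction, here’s a small sketch using the pandas API (written against a current pandas, not necessarily the version benchmarked here):

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
y = pd.Series([10.0, 20.0], index=["b", "c"])

# pandas aligns on the union of the labels: "a" has no match in y,
# so the result holds NaN there instead of silently dropping the label.
result = x + y
print(result)
# a     NaN
# b    12.0
# c    23.0

# zoo-style intersection behavior can be recovered by dropping the NaNs:
print(result.dropna())
```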

### Python benchmark

Here’s the code doing basically the same thing, except using objects that are **not** pre-sorted by label:

```python
import numpy as np
from pandas import Index, Series
from pandas.util.testing import rands

n = 100000
indices = Index([rands(10) for _ in xrange(n)])

def sample(values, k):
    # Random subsample of size k, without replacement
    from random import shuffle
    sampler = np.arange(len(values))
    shuffle(sampler)
    return values.take(sampler[:k])

subsample_size = 90000
x = Series(np.random.randn(100000), indices)
y = Series(np.random.randn(subsample_size),
           index=sample(indices, subsample_size))
```

And the timing:

```
10 loops, best of 3: 110 ms per loop
```

Now, if I first sort both objects by index, a more specialized algorithm will be used:

```
In [12]: xs = x.sort_index()

In [13]: ys = y.sort_index()

In [14]: timeit xs + ys
10 loops, best of 3: 44.1 ms per loop
```
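The reason sorted data is faster: when both indexes are ordered, their union can be computed in a single linear merge pass instead of via hash-table lookups. A rough sketch of that idea (not pandas’s actual implementation, which is compiled Cython):

```python
def merge_sorted_unique(left, right):
    """Union of two sorted sequences of unique labels, in one linear pass."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            # Label present on both sides: emit once, advance both cursors
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    # One side is exhausted; the remainder of the other is already sorted
    out.extend(left[i:])
    out.extend(right[j:])
    return out

print(merge_sorted_unique(["a", "c", "d"], ["b", "c", "e"]))
# ['a', 'b', 'c', 'd', 'e']
```

This is O(n + m) with excellent memory locality, which is where the roughly 2.5x gap between the sorted and unsorted pandas timings comes from.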

Note that pandas is also the fastest (that I know of) among Python libraries. Here’s the above example using the `la` (labeled array) package:

```
In [13]: lx = la.larry(x.values, [list(x.index)])

In [14]: ly = la.larry(y.values, [list(y.index)])

In [15]: timeit la.add(lx, ly, join="outer")
1 loops, best of 3: 214 ms per loop

In [16]: timeit la.add(lx, ly, join="inner")
10 loops, best of 3: 176 ms per loop
```

### The verdict

So in an apples-to-apples comparison (both indexes pre-sorted), pandas is **26x** faster than zoo in this benchmark: 44.1 ms versus 1.15 seconds per operation. Even in the completely unordered case (which is not apples-to-apples), it’s 10x faster. I actually have a few tricks up my sleeve (as soon as I can find the time to implement them) to make the above operations even faster still =)