WM in 2015: Woefully out of date, but I preserve this post for posterity.

I tweeted that yesterday and figured it would be prudent to justify it with some code and real benchmarks. I'm really proud of pandas's performance after years of development spent building a tool that is both easy to use and fast. So here we go.

The test case

The basic set-up is: you have two labeled vectors of different lengths and you add them together. The algorithm matches the labels and adds together the corresponding values. Simple, right?
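To make the label-matching concrete, here is a tiny illustration in present-day pandas syntax (the labels and values are made up for demonstration):

```python
import pandas as pd

# Two labeled vectors of different lengths
x = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
y = pd.Series([10.0, 20.0], index=["b", "c"])

# Addition aligns on labels; "a" has no match in y, so it becomes NaN
result = x + y
# a     NaN
# b    12.0
# c    23.0
```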

R/zoo benchmarks

Here's the R code:


library(zoo)

indices <- rep(NA, 100000)
for (i in 1:100000)
  indices[i] <- paste(sample(letters, 10), collapse="")

timings <- numeric()

x <- zoo(rnorm(100000), indices)
y <- zoo(rnorm(90000), indices[sample(1:100000, 90000)])

for (i in 1:10) {
  timings[i] <- system.time(x + y)[3]
}

In this benchmark, I get a timing of:

> mean(timings)
[1] 1.1518

So, 1.15 seconds per iteration. There are a couple of things to note here:

  • The zoo package pre-sorts the objects by the index/label. As you will see below, this makes a big performance difference, since a faster algorithm can be used for ordered data.
  • zoo returns an object whose index is the intersection of the indexes. I disagree with this design choice as I feel that it is discarding information. pandas returns the union (the "outer join", if you will) by default.
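To see the difference between the two conventions, here is a small sketch in current pandas syntax; zoo's intersection behavior is emulated by dropping the missing values:

```python
import pandas as pd

x = pd.Series([1.0, 2.0], index=["a", "b"])
y = pd.Series([5.0], index=["b"])

# pandas default: union of the labels (the "outer join"), NaN where missing
outer = x + y

# zoo-style intersection can be recovered by dropping the NaNs
inner = (x + y).dropna()
```

The outer result keeps the label "a" around (with a NaN marking the missing value), whereas the inner result discards it entirely.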

Python benchmark

Here's the code doing basically the same thing, except using objects that are not pre-sorted by label:

import numpy as np

from pandas import *
from pandas.util.testing import rands

n = 100000
indices = Index([rands(10) for _ in xrange(n)])

def sample(values, k):
    # Take a random subsample of size k, in random order
    from random import shuffle
    sampler = np.arange(len(values))
    shuffle(sampler)
    return values.take(sampler[:k])

subsample_size = 90000

x = Series(np.random.randn(n), index=indices)
y = Series(np.random.randn(subsample_size),
           index=sample(indices, subsample_size))

And the timing:

In [11]: timeit x + y
10 loops, best of 3: 110 ms per loop

Now, if I first sort the objects by index, a more specialized algorithm will be used:

In [12]: xs = x.sort_index()

In [13]: ys = y.sort_index()

In [14]: timeit xs + ys
10 loops, best of 3: 44.1 ms per loop
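The speedup comes from pandas knowing that both indexes are sorted. In current pandas you can inspect this yourself via Index.is_monotonic_increasing (that the aligned-addition fast path keys off this flag is my reading of the internals, but the attribute itself is real):

```python
import pandas as pd

# An unsorted index: alignment must fall back to a hash-based path
idx = pd.Index(["c", "a", "b"])
unsorted_check = idx.is_monotonic_increasing   # False

# After sorting, the monotonic flag permits an ordered-merge algorithm
sorted_idx = idx.sort_values()
sorted_check = sorted_idx.is_monotonic_increasing  # True
```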

Note that pandas is also the fastest (that I know of) among Python libraries. Here's the above example using the la (labeled array) package:

In [12]: import la

In [13]: lx = la.larry(x.values, [list(x.index)])

In [14]: ly = la.larry(y.values, [list(y.index)])

In [15]: timeit la.add(lx, ly, join="outer")
1 loops, best of 3: 214 ms per loop

In [16]: timeit la.add(lx, ly, join="inner")
10 loops, best of 3: 176 ms per loop

The verdict

So in an apples-to-apples comparison, pandas is 26x faster than zoo in this benchmark. Even in the completely unordered case (which is not apples-to-apples), it's 10x faster. I actually have a few tricks up my sleeve (as soon as I can find the time to implement them) to make the above operations even faster still =)