The pandas escaped the zoo: Python’s pandas vs. R’s zoo benchmarks

generic pandas data alignment is about 10-15x faster than the #rstats zoo package in initial tests. interesting #python
@wesmckinn
Wes McKinney

I tweeted that yesterday and figured it would be prudent to back it up with some code and real benchmarks. I’m really proud of pandas’s performance after investing years of development in building a tool that is both easy to use and fast. So here we go.

The test case

The basic set-up is: you have two labeled vectors of different lengths and you add them together. The algorithm matches the labels and adds together the corresponding values. Simple, right?
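
To make the operation concrete, here is a minimal sketch in plain Python, with dicts standing in for the labeled vectors. The function name `add_matching` is made up for illustration, and this is not how pandas or zoo implement alignment internally; it only shows what "match the labels, then add" means.

```python
def add_matching(x, y):
    """Add the values of two label -> value mappings wherever the labels match."""
    # Iterate over the smaller mapping and probe the larger one by label.
    small, large = (x, y) if len(x) <= len(y) else (y, x)
    return {label: value + large[label]
            for label, value in small.items()
            if label in large}


x = {"a": 1.0, "b": 2.0, "c": 3.0}
y = {"c": 10.0, "a": 20.0}   # different length, different label order

result = add_matching(x, y)
print(result)
```

Label "b" has no partner in `y`, so it does not appear in the matched result; what should happen to such labels is a design choice that comes up again below.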

R/zoo benchmarks

Here’s the R code:

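library(zoo)

# Build 100,000 random 10-letter labels
indices = rep(NA, 100000)
for (i in 1:100000)
  indices[i] <- paste(sample(letters, 10), collapse="")

timings <- numeric()

# x uses every label; y uses a random 90,000-label subset
x <- zoo(rnorm(100000), indices)
y <- zoo(rnorm(90000), indices[sample(1:100000, 90000)])

# Time the aligned addition 10 times, collecting garbage before each run
for (i in 1:10) {
  gc()
  timings[i] = system.time(x + y)[3]  # [3] is the elapsed time in seconds
}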

In this benchmark, I get a timing of:

> mean(timings)
[1] 1.1518

So, 1.15 seconds per iteration. There are a couple of things to note here:

  • The zoo package pre-sorts the objects by the index/label. As you will see below, this makes a big performance difference, because you can write a faster algorithm for ordered data.
  • zoo returns an object whose index is the intersection of the indexes. I disagree with this design choice as I feel that it is discarding information. pandas returns the union (the “outer join”, if you will) by default.
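
The difference between the two conventions can be sketched in plain Python. This is an illustration only: `add_aligned` and its `join` parameter are hypothetical names, not the zoo or pandas API, and dicts stand in for the labeled vectors.

```python
def add_aligned(x, y, join="outer"):
    """Add two label -> value dicts; `join` picks which labels survive."""
    if join == "inner":
        labels = x.keys() & y.keys()   # intersection: only shared labels
    elif join == "outer":
        labels = x.keys() | y.keys()   # union: every label from either side
    else:
        raise ValueError("join must be 'inner' or 'outer'")
    nan = float("nan")
    # A label missing on one side contributes NaN, so its sum is NaN too.
    return {label: x.get(label, nan) + y.get(label, nan)
            for label in sorted(labels)}


x = {"a": 1.0, "b": 2.0, "c": 3.0}
y = {"a": 20.0, "c": 10.0}

inner = add_aligned(x, y, join="inner")   # keeps a and c only
outer = add_aligned(x, y, join="outer")   # keeps a, b, c; b is NaN
print(inner)
print(outer)
```

With the inner join, the fact that "b" existed in `x` is gone from the result. With the outer join it stays visible as a missing value, and the caller can still drop it afterwards.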

Python benchmark

Here’s the code doing basically the same thing, except using objects that are not pre-sorted by label:

    import numpy as np
    from pandas import Index, Series
    from pandas.util.testing import rands

    n = 100000
    # 100,000 random 10-character string labels
    indices = Index([rands(10) for _ in range(n)])

    def sample(values, k):
        # Draw k labels at random, without replacement, in shuffled order
        sampler = np.random.permutation(len(values))
        return values.take(sampler[:k])

    subsample_size = 90000

    x = Series(np.random.randn(n), indices)
    y = Series(np.random.randn(subsample_size),
               index=sample(indices, subsample_size))

And the timing:

    In [11]: timeit x + y
    10 loops, best of 3: 110 ms per loop

Now, if I first sort the objects by index, a more specialized algorithm will be used:

    In [12]: xs = x.sort_index()

    In [13]: ys = y.sort_index()

    In [14]: timeit xs + ys
    10 loops, best of 3: 44.1 ms per loop
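
To see why ordered labels allow a faster algorithm, here is a rough sketch of a merge-style outer join in plain Python. It assumes both label lists are sorted and contain no duplicates. The function name `add_sorted` is hypothetical, and this is not a copy of the pandas internals, just the general idea of walking both sides once instead of looking every label up.

```python
def add_sorted(x_labels, x_values, y_labels, y_values):
    """Outer-join add of two series whose labels are sorted and unique."""
    nan = float("nan")
    labels, sums = [], []
    i = j = 0
    # Advance through both label lists in lockstep.
    while i < len(x_labels) and j < len(y_labels):
        if x_labels[i] == y_labels[j]:
            labels.append(x_labels[i])
            sums.append(x_values[i] + y_values[j])
            i += 1
            j += 1
        elif x_labels[i] < y_labels[j]:
            labels.append(x_labels[i])   # label only in x
            sums.append(nan)
            i += 1
        else:
            labels.append(y_labels[j])   # label only in y
            sums.append(nan)
            j += 1
    # Whatever is left on either side has no partner.
    for label in x_labels[i:] + y_labels[j:]:
        labels.append(label)
        sums.append(nan)
    return labels, sums


labels, sums = add_sorted(["a", "b", "d"], [1.0, 2.0, 4.0],
                          ["b", "c", "d", "e"], [10.0, 20.0, 30.0, 40.0])
print(labels)
print(sums)
```

Each label is visited once and the output comes out already sorted, so no hash lookups or final sort are needed. The unordered case cannot take this shortcut.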

Note that pandas is also the fastest (that I know of) among Python libraries. Here’s the above example using the labeled array package, la:

    In [12]: import la

    In [13]: lx = la.larry(x.values, [list(x.index)])

    In [14]: ly = la.larry(y.values, [list(y.index)])

    In [15]: timeit la.add(lx, ly, join="outer")
    1 loops, best of 3: 214 ms per loop

    In [16]: timeit la.add(lx, ly, join="inner")
    10 loops, best of 3: 176 ms per loop

The verdict

So in an apples-to-apples comparison, pandas is 26x faster than zoo in this benchmark. Even in the completely unordered case (which is not apples-to-apples), it’s 10x faster. I actually have a few tricks up my sleeve (as soon as I can find the time to implement them) to make the above operations even faster =)
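
For reference, the speedup figures follow directly from the timings reported above. The variable names here are just for illustration:

```python
zoo_seconds = 1.1518               # mean(timings) from the R run
pandas_unsorted_seconds = 0.110    # 110 ms per loop
pandas_sorted_seconds = 0.0441     # 44.1 ms per loop

sorted_speedup = zoo_seconds / pandas_sorted_seconds
unsorted_speedup = zoo_seconds / pandas_unsorted_seconds

print("sorted:   %.1fx" % sorted_speedup)     # sorted:   26.1x
print("unsorted: %.1fx" % unsorted_speedup)   # unsorted: 10.5x
```

Keep in mind that the R number is a mean over 10 runs while timeit reports the best of 3, so the ratios are rough.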