The pandas escaped the zoo: Python’s pandas vs. R’s zoo benchmarks

generic pandas data alignment is about 10-15x faster than the #rstats zoo package in initial tests. interesting #python
@wesmckinn
Wes McKinney

I tweeted that yesterday and figured it would be prudent to justify that with some code and real benchmarks. I’m really proud of pandas’s performance after investing years of development building a tool that is both easy-to-use and fast. So here we go.

The test case

The basic set-up is: you have two labeled vectors of different lengths and you add them together. The algorithm matches the labels and adds together the corresponding values. Simple, right?
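To make the operation concrete, here is a minimal plain-Python sketch of label-aligned addition. It is only an illustration of the idea, not how pandas or zoo implement it: the `add_aligned` helper and the toy labels are made up for this example, and filling unmatched labels with NaN is an assumption of the sketch.

```python
import math

def add_aligned(x, y):
    """Add two label -> value mappings, matching values by label.

    Hypothetical helper for illustration only. Labels found in just one
    input get NaN, because there is nothing to add them to.
    """
    labels = sorted(set(x) | set(y))
    return {k: x.get(k, math.nan) + y.get(k, math.nan) for k in labels}

# Two labeled vectors of different lengths, stored in different orders
x = {"a": 1.0, "b": 2.0, "c": 3.0}
y = {"c": 10.0, "a": 20.0}

result = add_aligned(x, y)
print(result)
```

Here "a" and "c" appear in both inputs, so their values are summed, while "b" appears only in `x` and ends up as NaN.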

R/zoo benchmarks

Here’s the R code:

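library(zoo)

# Build 100,000 random 10-letter labels to use as the index
n <- 100000
indices <- character(n)
for (i in 1:n)
  indices[i] <- paste(sample(letters, 10), collapse="")

timings <- numeric(10)

# x covers every label; y covers a random subset of 90,000 of them
x <- zoo(rnorm(n), indices)
y <- zoo(rnorm(90000), indices[sample(1:n, 90000)])

# Time the aligned addition 10 times, collecting garbage before each run
for (i in 1:10) {
  gc()
  timings[i] <- system.time(x + y)[3]  # element 3 is the elapsed time
}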

In this benchmark, I get a timing of:

> mean(timings)
[1] 1.1518

So, 1.15 seconds per iteration. There are a couple of things to note here:

  • The zoo package pre-sorts the objects by the index/label. As you will see below, this makes a big performance difference, because you can write a faster algorithm for ordered data.
  • zoo returns an object whose index is the intersection of the indexes. I disagree with this design choice as I feel that it is discarding information. pandas returns the union (the “outer join”, if you will) by default.
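Both points can be illustrated with a small plain-Python sketch. When the labels are already sorted, a single two-pointer pass can align the inputs without any hash lookups, and a `join` argument chooses between the intersection and the union of the labels. The `add_sorted` function is a hypothetical helper written for this illustration. It assumes sorted, unique labels and is not taken from the zoo or pandas source.

```python
import math

def add_sorted(x_labels, x_values, y_labels, y_values, join="outer"):
    """Add two labeled vectors whose labels are sorted and unique.

    join="inner" keeps only labels present in both inputs (intersection).
    join="outer" keeps every label (union) and fills the gaps with NaN.
    """
    if join not in ("inner", "outer"):
        raise ValueError("join must be 'inner' or 'outer'")
    keep_unmatched = join == "outer"
    labels, values = [], []
    i = j = 0
    # Walk both label sequences in step, always advancing the smaller label
    while i < len(x_labels) and j < len(y_labels):
        if x_labels[i] == y_labels[j]:
            labels.append(x_labels[i])
            values.append(x_values[i] + y_values[j])
            i += 1
            j += 1
        elif x_labels[i] < y_labels[j]:
            if keep_unmatched:
                labels.append(x_labels[i])
                values.append(math.nan)
            i += 1
        else:
            if keep_unmatched:
                labels.append(y_labels[j])
                values.append(math.nan)
            j += 1
    if keep_unmatched:
        # Whatever remains in either input has no partner in the other
        for label in list(x_labels[i:]) + list(y_labels[j:]):
            labels.append(label)
            values.append(math.nan)
    return labels, values

inner = add_sorted(["a", "b", "c"], [1.0, 2.0, 3.0],
                   ["a", "c", "d"], [10.0, 30.0, 40.0], join="inner")
outer = add_sorted(["a", "b", "c"], [1.0, 2.0, 3.0],
                   ["a", "c", "d"], [10.0, 30.0, 40.0], join="outer")
print(inner)
print(outer)
```

The inner join drops "b" and "d" entirely, while the outer join keeps them with NaN values, so the caller can still see which labels failed to match.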

Python benchmark

Here’s the code doing basically the same thing, except using objects that are not pre-sorted by label:

    import numpy as np
    from pandas import Index, Series
    from pandas.util.testing import rands

    n = 100000
    # 100,000 random 10-character labels
    indices = Index([rands(10) for _ in range(n)])

    def sample(values, k):
        # Take k labels at random, without replacement, in shuffled order
        from random import shuffle
        sampler = np.arange(len(values))
        shuffle(sampler)
        return values.take(sampler[:k])

    subsample_size = 90000

    x = Series(np.random.randn(n), indices)
    y = Series(np.random.randn(subsample_size),
               index=sample(indices, subsample_size))

And the timing:

    In [11]: timeit x + y
    10 loops, best of 3: 110 ms per loop
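For readers timing this outside IPython, the "10 loops, best of 3" figure can be approximated with the standard-library `timeit` module: run the statement 10 times per repeat, repeat 3 times, and report the fastest repeat divided by the loop count. The sketch below uses a stand-in `workload` function (an assumption made for this example) in place of `x + y`, so it runs without pandas.

```python
import timeit

def workload():
    # Stand-in for `x + y`; any zero-argument callable is timed the same way
    return sum(i * i for i in range(1000))

loops, repeats = 10, 3

# Each entry is the total time in seconds for `loops` calls
totals = timeit.repeat(workload, number=loops, repeat=repeats)

# Report the fastest repeat, expressed per loop
best_per_loop = min(totals) / loops
print("%d loops, best of %d: %.3g ms per loop"
      % (loops, repeats, best_per_loop * 1e3))
```

Note that this reports the best of several repeats, whereas the R loop above reports the mean of 10 single runs.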

Now, if I first sort the objects by index, a more specialized algorithm will be used:

    In [12]: xs = x.sort_index()

    In [13]: ys = y.sort_index()

    In [14]: timeit xs + ys
    10 loops, best of 3: 44.1 ms per loop

Note that pandas is also the fastest (that I know of) among Python libraries. Here’s the above example using the labeled array package (la):

    In [12]: import la

    In [13]: lx = la.larry(x.values, [list(x.index)])

    In [14]: ly = la.larry(y.values, [list(y.index)])

    In [15]: timeit la.add(lx, ly, join="outer")
    1 loops, best of 3: 214 ms per loop

    In [16]: timeit la.add(lx, ly, join="inner")
    10 loops, best of 3: 176 ms per loop

The verdict

So in an apples-to-apples comparison (both sides working on sorted data), pandas is 26x faster than zoo in this benchmark. Even in the completely unordered case (which is not apples-to-apples), it’s 10x faster. I actually have a few tricks up my sleeve (as soon as I can find the time to implement them) to make the above operations even faster =)
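The speedup figures follow directly from the timings reported above. As a quick check of the arithmetic (the variable names are just for this illustration):

```python
# Timings reported above, in seconds per operation
zoo_seconds = 1.1518              # mean of 10 runs in R
pandas_unsorted_seconds = 0.110   # timeit x + y
pandas_sorted_seconds = 0.0441    # timeit xs + ys

sorted_speedup = zoo_seconds / pandas_sorted_seconds
unsorted_speedup = zoo_seconds / pandas_unsorted_seconds

print("sorted:   %.1fx" % sorted_speedup)
print("unsorted: %.1fx" % unsorted_speedup)
```

This gives roughly 26x for the sorted case and a little over 10x for the unsorted case.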

    • http://dirk.eddelbuettel.com Dirk Eddelbuettel

      I know a lot of people who are very fond of zoo (and xts); not one of them would use _character labels_ as indices. Yes, you can. No, you shouldn’t. We use this with time-based indices, almost always POSIXct. So you’re making fruit salad here from apples and oranges.


      Wes McKinney Reply:

      I think the larger problem here is that the data alignment algorithms in zoo are not very good. Changing the index type from string to long int or double won’t make a bad algorithm good (though faster, yes).


    • Gabor Grothendieck

      Every xts object is also a zoo object. xts has hard-coded the time classes for speed (and also re-implemented certain performance-critical operations in C), so it’s not as general (e.g. it won’t support character string indexes), but it supports the important index classes. If speed were one’s main criterion, I think they would likely be using xts, so that would be the meaningful comparison. I tried xts/zoo and zoo alone with n = 5000, changing the character string class to the Date class, and xts/zoo ran 111x faster than zoo alone. So if pandas runs 26x faster, then you can get a 4x speedup over pandas while still staying in the zoo ecosystem by using xts/zoo. Regarding merges, merge.zoo handles many different types of merges.


      Wes McKinney Reply:

      Could you post the code for your benchmark someplace? It’s been over a year since this analysis, and it would be good to rerun it to see where things have improved in the meantime.
