Monthly Archives: September 2011

The pandas escaped the zoo: Python’s pandas vs. R’s zoo benchmarks

“generic pandas data alignment is about 10-15x faster than the #rstats zoo package in initial tests. interesting #python” (Wes McKinney, @wesmckinn)

I tweeted that yesterday and figured it would be prudent to justify it with some code and real benchmarks. I’m really proud of pandas’s performance after investing years of development in building a tool that is both easy to use and fast. So here we go.

The test case

The basic set-up is: you have two labeled vectors of different lengths and you add them together. The algorithm matches the labels and adds together the corresponding values. Simple, right?

R/zoo benchmarks

Here’s the R code:

library(zoo)

indices = rep(NA, 100000)
for (i in 1:100000)
  indices[i] <- paste(sample(letters, 10), collapse="")

timings <- numeric()

x <- zoo(rnorm(100000), indices)
y <- zoo(rnorm(90000), indices[sample(1:100000, 90000)])

for (i in 1:10) {
  gc()
  timings[i] = system.time(x + y)[3]
}

In this benchmark, I get a timing of:

> mean(timings)
[1] 1.1518

So, 1.15 seconds per iteration. There are a couple of things to note here:

  • The zoo package pre-sorts the objects by the index/label. As you will see below, this makes a big performance difference, since you can write a faster algorithm for ordered data.
  • zoo returns an object whose index is the intersection of the indexes. I disagree with this design choice, as I feel it discards information. pandas returns the union (the “outer join”, if you will) by default; see the small example below.
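
To make that difference concrete, here is a tiny, non-benchmark example of the union behavior (toy labels, not the benchmark data):

from pandas import Series

s1 = Series([1., 2., 3.], index=['a', 'b', 'c'])
s2 = Series([10., 20.], index=['b', 'c'])

result = s1 + s2
# result is indexed by the union ['a', 'b', 'c']; 'a' has no match in s2,
# so its value is NaN rather than being silently dropped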

    Python benchmark

    Here’s the code doing basically the same thing, except using objects that are not pre-sorted by label:

    import numpy as np
    from pandas import *
    from pandas.util.testing import rands

    n = 100000
    indices = Index([rands(10) for _ in xrange(n)])

    def sample(values, k):
        from random import shuffle
        sampler = np.arange(len(values))
        shuffle(sampler)
        return values.take(sampler[:k])

    subsample_size = 90000

    x = Series(np.random.randn(100000), indices)
    y = Series(np.random.randn(subsample_size),
               index=sample(indices, subsample_size))

    And the timing:

    In [11]: timeit x + y
    10 loops, best of 3: 110 ms per loop

    Now, if I first sort the objects by index, a more specialized algorithm will be used:

    In [12]: xs = x.sort_index()

    In [13]: ys = y.sort_index()

    In [14]: timeit xs + ys
    10 loops, best of 3: 44.1 ms per loop

    Note that pandas is also the fastest (that I know of) among Python libraries. Here’s the above example using la, the labeled array (“larry”) package:

    In [12]: import la

    In [13]: lx = la.larry(x.values, [list(x.index)])

    In [14]: ly = la.larry(y.values, [list(y.index)])

    In [15]: timeit la.add(lx, ly, join="outer")
    1 loops, best of 3: 214 ms per loop

    In [16]: timeit la.add(lx, ly, join="inner")
    10 loops, best of 3: 176 ms per loop

    The verdict

    So in an apples-to-apples comparison, pandas is 26x faster than zoo in this benchmark. Even in the completely unordered case (which is not apples-to-apples), it’s 10x faster. I actually have a few tricks up my sleeve (as soon as I can find the time to implement them) to make the above operations even faster still =)

    Python, R, and the allure of magic

    R is much more magical than Python. What do I mean by this? In R, things like this are a part of everyday life:

    > a <- rnorm(10)
    > b <- rnorm(10)
    > cbind(a, b)
                   a          b
     [1,]  0.8729978  0.5170078
     [2,] -0.6885048 -0.4430447
     [3,]  0.4017740  1.8985843
     [4,]  2.1088905 -1.4121763
     [5,]  0.9375273  0.4703302
     [6,]  0.5558276 -0.5825152
     [7,] -2.1606252  0.7379874
     [8,] -0.7651046 -0.4534345
     [9,] -4.2604901  0.9561077
    [10,]  0.3940632 -0.8331285

    If you’re a seasoned Python programmer, you might have the sort of visceral negative reaction to this that I do. Seriously, just where in the hell did those variable names come from? So when I say magic here, I’m talking about abusing the language’s parser. There is nothing special about R that makes the above behavior possible; it simply reflects a fundamentally different design philosophy from, say, Python’s. As any Python programmer knows: Explicit is better than implicit. I happen to agree. There is also a bit of a semantic difference between R and Python in that assignment in R typically copies data, whereas variables in Python are simply references (labels) for a particular object. So you could make the argument that the names a and b above are more strongly linked to the underlying data.
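
    A quick, minimal illustration of the Python side of that semantic difference:

    import numpy as np

    a = np.arange(3)
    b = a          # b is just another label bound to the same array object
    b[0] = 99      # a "sees" the change too: a is now array([99, 1, 2])
    c = a.copy()   # an explicit copy is what gives you R-like behavior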

    While building pandas over the last several years, I have occasionally grappled with issues like the above. Maybe I should just break from the Python ethos and embrace magic? I mean, how hard would it be to get the above behavior in Python? Python gives you stack frames and the ast module, after all. So I went down the rabbit hole and wrote a little code snippet:
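
    Something along these lines captures the idea (a simplified sketch, not the exact snippet; it assumes the call fits on a single source line and that the caller’s frame has source context available):

    import ast
    import inspect

    from pandas import DataFrame

    def merge(*args):
        # Peek at the source line of the call site via the caller's stack frame
        frame = inspect.currentframe().f_back
        try:
            line = ''.join(inspect.getframeinfo(frame).code_context).strip()
        finally:
            del frame  # don't keep the caller's frame alive via a cycle

        # Take the text between the outermost parentheses and split it on
        # top-level commas, so an expression like np.log(b) stays intact
        inner = line[line.index('(') + 1:line.rindex(')')]
        labels, depth, start = [], 0, 0
        for i, ch in enumerate(inner):
            if ch in '([{':
                depth += 1
            elif ch in ')]}':
                depth -= 1
            elif ch == ',' and depth == 0:
                labels.append(inner[start:i].strip())
                start = i + 1
        labels.append(inner[start:].strip())

        # Sanity check: each recovered label should parse as a Python expression
        for label in labels:
            ast.parse(label, mode='eval')

        return DataFrame(dict(zip(labels, args)), columns=labels)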

    While this is woefully unpythonic, it’s also kind of cool:

    In [27]: merge(a, b)
    Out[27]:
                a         b      
    2000-01-03 -1.35      0.8398
    2000-01-04  0.999    -1.617  
    2000-01-05  0.2537    1.433  
    2000-01-06  0.6273   -0.3959
    2000-01-07  0.7963   -0.789  
    2000-01-10  0.004295 -1.446

    This can even parse and format more complicated expressions (harder than it looks, because you have to walk the whole AST):

    In [30]: merge(a, np.log(b))
    Out[30]:
                a        np.log(b)
    2000-01-03  0.6243   0.7953  
    2000-01-04  0.3593  -1.199    
    2000-01-05  2.805   -1.059    
    2000-01-06  0.6369  -0.9067  
    2000-01-07 -0.2734   NaN      
    2000-01-10 -1.023    0.3326

    Now, I am *not* suggesting we do this any time soon. I’m going to prefer the explicit approach (cf. the Zen of Python) any day of the week:

    In [32]: DataFrame({'a' : a, 'log(b)' : np.log(b)})
    Out[32]:
                a        log(b)
    2000-01-03  0.6243   0.7953
    2000-01-04  0.3593  -1.199  
    2000-01-05  2.805   -1.059  
    2000-01-06  0.6369  -0.9067
    2000-01-07 -0.2734   NaN    
    2000-01-10 -1.023    0.3326

    Faster time series alignment / joins for pandas, beating R’s xts package

    In anticipation of integrating NumPy’s shiny new datetime64 dtype into pandas, I set about writing some faster alignment and merging functions for ordered time series data indexed by datetime64 timestamps. Many people have pointed me to the widely used R xts package as a baseline for highly optimized joining functions.

    Anyway, long story short, with a little NumPy- and Cython-fu I think I’ve matched or beaten xts (specifically its merge.xts function) by up to 40% across almost all of its supported join types (left/outer/inner).

    In a blog article earlier today I wrote about some of the performance problems I had to address to do this. The rest of the joining code is pretty straightforward Cython, though it’ll probably be a few weeks before this new code gets incorporated into DataFrame.join. You’ll just have to take my word for it that I’m doing an apples-to-apples comparison (or read the source yourself) =)
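
    To give a flavor of why pre-sorted data helps, here is a rough pure-Python sketch of an ordered left-join indexer over sorted, unique labels (the real code is Cython and handles more cases, but the single-pass idea is the same):

    import numpy as np

    def left_join_indexer(left, right):
        # For each label in the sorted array `left`, record the position of
        # the matching label in the sorted array `right`, or -1 if there is
        # no match: one linear pass, no hashing or binary search needed.
        indexer = np.empty(len(left), dtype=np.int64)
        j = 0
        for i, val in enumerate(left):
            while j < len(right) and right[j] < val:
                j += 1
            indexer[i] = j if j < len(right) and right[j] == val else -1
        return indexer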

    Python benchmarks

    Here are the Python timings in milliseconds for joining two time series data sets. The column labels are the lengths (in scientific notation, from 100 through 1,000,000). Timings are given for joining two univariate time series and for joining two collections of 5 time series.

    EDIT (9/24/2011): after corresponding with the xts author, Jeff Ryan, I reran the benchmarks with the code modified to ensure that garbage collection time isn’t being included in the runtime. The results after the change to the benchmark are less disparate than before. I also tweaked the Cython algos that determine the outer/inner join time index and re-ran the benchmarks. The new algo trimmed off 8 ms in the 1e6 outer join case and 4-5 ms in the inner join case. Whenever I develop a strong desire to hack up a pile of spaghetti-like Cython code (combining the index union/intersection routines with the take / row-copying code), I can probably shave off another few millis…

    Python Benchmark on 9/25

    Joining two univariate time series
           1e2      1e3      1e4     1e5    1e6    
    outer  0.0605   0.0637   0.1966  1.898  26.26
    left   0.0187   0.02282  0.1157  1.023  13.89
    inner  0.04526  0.05052  0.1523  1.382  22.25

    Joining two 5-variate time series
           1e2      1e3      1e4     1e5    1e6    
    outer  0.07944  0.0638   0.3178  6.498  67.46
    left   0.0255   0.03512  0.2467  4.711  51.88
    inner  0.06176  0.05262  0.2283  5.267  56.46

    EDIT 9/28: I put in some work integrating the new merging routines throughout DataFrame and friends in pandas and added a new Int64Index class to facilitate fast joining of time series data. Here are the updated benchmarks, which now have pandas a bit slower than xts for outer/inner joins in the univariate case but still significantly faster in the multivariate case:

    Python Benchmark on 9/28 post integration

    Joining two univariate time series
           1e2     1e3     1e4     1e5    1e6    
    outer  0.4501  0.314   0.5362  3.162  30.78
    left   0.2742  0.2879  0.408   2.025  19.84
    inner  0.2715  0.2863  0.4306  2.504  26.64

    Joining two 5-variate time series
           1e2      1e3    1e4     1e5    1e6    
    outer  0.4507  0.3375  0.6158  7.028  71.34
    left   0.2959  0.3184  0.4927  5.068  55.2
    inner  0.2767  0.305   0.5368  5.782  59.65

    As you can see in the 1 million row case there is an additional 4-5 ms of overhead across the board which largely has to do with handling types other than floating point. With some effort I could eliminate this overhead but I’m going to leave it for now.

    And the source code for the benchmark:

    from pandas import *
    import gc
    import time
    import numpy as np

    def bench_python(pct_overlap=0.20, K=1):
        ns = [2, 3, 4, 5, 6]
        iterations = 50
        pct_overlap = 0.2
        kinds = ['outer', 'left', 'inner']

        all_results = {}
        for logn in ns:
            n = 10**logn
            a = np.arange(n, dtype=np.int64)
            b = np.arange(n * pct_overlap, n * pct_overlap + n, dtype=np.int64)
            a_frame = DataFrame(np.random.randn(n, K), index=a, columns=range(K))
            b_frame = DataFrame(np.random.randn(n, K), index=b, columns=range(K, 2 * K))
            all_results[logn] = result = {}
            for kind in kinds:
                gc.disable(); _s = time.clock()
                # do the join
                for _ in range(iterations):
                    a_frame.join(b_frame, how=kind)
               
                elapsed = time.clock() - _s; gc.enable()
                result[kind] = (elapsed / iterations) * 1000
        return DataFrame(all_results, index=kinds)

    R/xts benchmarks

    And the R benchmark using xts. The results for the smaller datasets are unreliable due to the low precision of system.time.

    R Benchmark

    Joining two univariate time series
           1e2  1e3  1e4  1e5   1e6
    outer 0.30 0.26 0.48 3.12 28.58
    left  0.22 0.24 0.36 2.78 24.18
    inner 0.20 0.24 0.40 2.42 21.06

    Joining two 5-variate time series
           1e2  1e3  1e4   1e5   1e6
    outer 0.26 0.46 1.30 11.56 97.02
    left  0.34 0.28 1.06 10.04 85.72
    inner 0.30 0.28 0.94  8.02 67.22

    The Python code for the benchmark is all found here.

    Here is the R code for the benchmark (GitHub link):

    library(xts)

    iterations <- 50

    ns = c(100, 1000, 10000, 100000, 1000000)
    kinds = c("outer", "left", "inner")

    result = matrix(0, nrow=3, ncol=length(ns))
    n <- 100000
    pct.overlap <- 0.2

    k <- 1

    for (ni in 1:length(ns)){
     n <- ns[ni]
     rng1 <- 1:n
     offset <- as.integer(n * pct.overlap)
     rng2 <- rng1 + offset
     x <- xts(matrix(rnorm(n * k), nrow=n, ncol=k),
              as.POSIXct(Sys.Date()) + rng1)
     y <- xts(matrix(rnorm(n * k), nrow=n, ncol=k),
              as.POSIXct(Sys.Date()) + rng2)
     timing <- numeric()
     for (i in 1:3) {
         kind = kinds[i]
         for(j in 1:iterations) {
           gc()  # just to be sure
           timing[j] <- system.time(merge(x,y,join=kind))[3]
         }
         #timing <- system.time(for (j in 1:iterations) merge.xts(x, y, join=kind),
         #                      gcFirst=F)
         #timing <- as.list(timing)
         result[i, ni] <- mean(timing) * 1000
         #result[i, ni] = (timing$elapsed / iterations) * 1000
       }
    }

    rownames(result) <- kinds
    colnames(result) <- log10(ns)

    NumPy indexing peculiarities

    Many scientific Python users are surprised when I tell them that ndarray.take is faster than __getitem__-based (a.k.a. “fancy” as I call it) indexing.

    import numpy as np
    import random

    arr = np.random.randn(10000, 5)
    indexer = np.arange(10000)
    random.shuffle(indexer)

    In [26]: timeit arr[indexer]
    1000 loops, best of 3: 1.25 ms per loop

    In [27]: timeit arr.take(indexer, axis=0)
    10000 loops, best of 3: 127 us per loop

    It’s actually kind of unbelievable when you think about it. What’s going on here that makes take almost 10x faster? I really should take a closer look at the internals of what __getitem__ does, because this has always struck me as pretty bad. Maybe I shouldn’t be complaining? I mean, R 2.13’s indexing falls somewhere in the middle:

    mat <- matrix(rnorm(50000), nrow=10000, ncol=5)
    set.seed(12345)
    indexer <- sample(1:10000)
    > system.time(for (i in 1:1000) mat[indexer,])
       user  system elapsed
      0.460   0.197   0.656

    So 656 microseconds per iteration. (In an earlier version of this post I used rpy2 to do the benchmark and got 1.05 ms, but there was apparently some overhead from rpy2.)

    Another peculiarity that I noticed with take is that performance gets worse when you use the out argument, which tells the function to write the result into an array you pass in:

    out = np.empty_like(arr)

    In [50]: timeit np.take(arr, indexer, axis=0, out=out)
    10000 loops, best of 3: 200 us per loop

    EDIT: I’ve been informed that using mode='clip' or mode='wrap' makes this run as fast as without the out argument.
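
    For reference, that variant looks like this (I haven’t re-run the numbers here):

    # reportedly as fast as a plain take once a mode other than 'raise' is given
    np.take(arr, indexer, axis=0, out=out, mode='clip')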

    Weird! I was dissatisfied by this, so I got curious how fast a hand-coded little Cython function can do this:

    # imports needed to compile this as a standalone Cython module
    cimport cython
    cimport numpy as cnp
    import numpy as np
    from numpy cimport ndarray, float64_t, int32_t

    cnp.import_array()

    # inside pandas NaN comes from a shared header; defined here so the
    # snippet stands alone
    cdef double NaN = np.nan

    @cython.wraparound(False)
    @cython.boundscheck(False)
    def take_axis0(ndarray[float64_t, ndim=2] values,
                   ndarray[int32_t] indexer,
                   out=None):
        cdef:
            Py_ssize_t i, j, k, n, idx
            ndarray[float64_t, ndim=2] outbuf

        if out is None:
            outbuf = np.empty_like(values)
        else:
            outbuf = out

        n = len(indexer)
        k = values.shape[1]
        for i from 0 <= i < n:
            idx = indexer[i]

            if idx == -1:
                for j from 0 <= j < k:
                    outbuf[i, j] = NaN
            else:
                for j from 0 <= j < k:
                    outbuf[i, j] = values[idx, j]

        return outbuf

    Don’t worry about the -1 handling; that’s a specialization that I’m using inside pandas. Curiously, this function is a lot faster than take with the out argument, and a handful of microseconds faster than the regular take.

    In [53]: timeit lib.take_axis0(arr, indexer)
    10000 loops, best of 3: 115 us per loop

    In [54]: timeit lib.take_axis0(arr, indexer, out)
    10000 loops, best of 3: 109 us per loop

    Very interesting.

    TL;DR

  • Use take, not []-based indexing, to get the best performance
  • Cython is just as fast for my specific application and a lot faster if you’re passing an out array (which I will be for the application that I needed this for)
  • R’s matrix indexing performance is better than NumPy’s fancy indexing, but about 5-6x slower than ndarray.take. This can probably be improved.

    NYC Open Statistical Programming Meetup on 9/14/2011

    Link to recording of talk

    I had the privilege of speaking yesterday at the first meeting of the former NYC R Meetup, now dubbed the NYC Open Statistical Programming Meetup. I was really excited to share some of the new stuff I’ve built lately in pandas, though I didn’t have enough time to demo as many features as I would have liked :) I also ran out of time to talk about statsmodels, but I’ll post some materials here for those who are interested.

    One thing I showed off, which a lot of people have been asking me about, is the totally awesome new IPython HTML notebook. I don’t think I’ll ever do a Python demo in anything else; it’s just that good for teaching and doing interactive demos. Brian Granger, Fernando Perez, and crew have developed a notebook file format (.ipynb) for saving and sharing these interactive notebooks. At the end of this post I’ll link to my IPython notebooks for the pandas demo and for the statsmodels demo I didn’t have time for, along with instructions for running them.

    Here are my slides from the talk:

    Now, here’s the exciting part: the IPython notebooks. You’ll need to install the latest development version of IPython from GitHub. Make sure you have the Python ZeroMQ bindings and the tornado web server installed. Using setuptools, this ought to work:

    easy_install pyzmq
    easy_install tornado

    Once you’ve done all that, fire up a terminal and run:

    ipython notebook --pylab=inline

    You should see the server fire up and give you a web address like 127.0.0.1:8888. Now, you need my demo workbooks. Here is a zip file with everything you need:

    Demo Notebooks
    PDF of pandas demo notebook
    PDF of statsmodels demo notebook

    Expand this zip file and launch the IPython notebook in the same directory where the .ipynb files live. You should be able to run all the code in the pandas demo notebook yourself if you have pandas 0.4 (click “Run All”), but the statsmodels notebook uses some very new code from the pandas-integration branch on GitHub and some formula code that is not part of the statsmodels repo. So you can see all the code and output, but if you want to know more, I encourage you to get involved in the project!

    IPython HTML Notebook

    Leaking memory like a pro in Python

    For all my Python zealotry, I never claimed it was perfect. In fact, there are a number of places you can shoot yourself in the foot if you aren’t careful. But I was kind of bummed about this one:

    import numpy as np

    class MainScreen(object):

        def __init__(self, cats):
            self.cats = cats
            self._signal = None

        def turn_on(self):
            if self._signal is None:
                signal = Signal(self)
                self._signal = signal
            return self._signal

    class Signal(object):

        def __init__(self, main_screen):
            # well we want to know which screen we came from!
            self.main_screen = main_screen

        def we_get(self):
            return True

    for i in xrange(50):
        cats = np.empty((10000, 500), dtype=float)
        cats.fill(0)
        main_screen = MainScreen(cats)
        signal = main_screen.turn_on()
        # all your base...

    Try it.

    EDIT: I’ve been informed by an astute reader that the GC is actually capable of dealing with the above case. Yang Zhang says “The collector actually works with cycles, including this case. (Actually, most objects are managed via reference-counting; the gc is specifically for cyclic references.) The problem is that it’s triggered by # allocations, not # bytes. In this case, each cats array is large enough that you need several GB of allocations before the threshold is triggered. Try adding gc.collect() or gc.get_count() in the loop (and inspect gc.get_threshold()).”
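
    Following that suggestion, a minimal tweak to the loop above keeps memory bounded (a sketch; I haven’t re-measured it here):

    import gc

    for i in xrange(50):
        cats = np.empty((10000, 500), dtype=float)
        cats.fill(0)
        main_screen = MainScreen(cats)
        signal = main_screen.turn_on()
        # collect the MainScreen <-> Signal cycle right away instead of
        # waiting for the allocation-count threshold to trigger a collection
        gc.collect()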

    All is not lost however. This pattern can be implemented using weak references:

    class MainScreen(object):

        def __init__(self, cats):
            self.cats = cats
            self._signal = None

        def turn_on(self):
            import weakref
            if self._signal is None:
                signal = Signal(self)
                self._signal = weakref.ref(signal)
            return self._signal()