Tag Archives: python

Mastering high performance data algorithms I: Group By

I’m on my way back from R/Finance 2012. Those guys did a nice job of organizing the conference, and it was great to meet everyone there.

As part of pandas development, I have had to develop a suite of high performance data algorithms and implementation strategies which are the heart and soul of the library. I get asked a lot why pandas’s performance is much better than R and other data manipulation tools. The reasons are two-fold: careful implementation (in Cython and C, minimizing the computational friction) and carefully-thought-out algorithms.

Here are some of the more important tools and ideas that I make use of on a day-to-day basis:

  • Hash tables: I use klib. It’s awesome and easy to use. I rely on it for critical algorithms like unique and factorize (unique + integer label assignment)
  • O(n) sort, known in pandas as groupsort, on integers with known range. This is a variant of counting sort; if you have N integers with known range from 0 to K – 1, this can be sorted in O(N) time. Combining this tool with factorize (hash table-based), you can categorize and sort a large data set in linear time. Failure to understand these two algorithms will force you to pay O(N log N), dominating the runtime of your algorithm.
  • Vectorized data movement and subsetting routines: take, put, putmask, replace, etc.
  • Let me give you a prime example, from a commit yesterday, of applying these ideas to great effect. Suppose I had a time series (or DataFrame containing time series) that I want to group by year, month, and day:
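
    Something along these lines (a minimal, hypothetical stand-in for the kind of grouping described, not the original snippet):

    import numpy as np
    from pandas import Series, date_range

    # hourly data so each (year, month, day) group has many observations
    rng = date_range('1/1/2012', periods=1000, freq='H')
    ts = Series(np.random.randn(len(rng)), index=rng)

    # the group keys are functions applied to each timestamp in the index
    grouped = ts.groupby([lambda x: x.year, lambda x: x.month, lambda x: x.day])
    daily_means = grouped.mean()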

    This is a small dataset, but imagine you have millions of observations and thousands or even millions of groups. How does that look algorithmically? I guarantee if you take a naive approach, you will crash and burn when the data increases in size. I know, because I did just that (take a look at the vbenchmarks). Laying down the infrastructure for doing a better job is not simple. Here are the steps for efficiently aggregating data like this:

  • Factorize the group keys, computing integer label arrays (O(N) per key)
  • Compute a “cartesian product number” or group index for each observation (since you could theoretically have K_1 * … * K_p groups observed, where K_i is the number of unique values in key i). This is again O(N) work.
  • If the maximum number of groups is large, “compress” the group index by applying the factorize algorithm to it again. Imagine you have 1000 unique values per key and 3 keys; most likely you do not actually observe 1 billion key combinations but rather some much smaller number. This is O(N) work again.
  • For simple aggregations, like mean, go ahead and aggregate the data in one linear sweep without moving anything around.
  • Otherwise, for arbitrary user-defined aggregations, we have two options for chopping up the data set: sorting it (O(N) using groupsort!) or creating index arrays indicating which observations belong to each group. The latter is preferable in many cases to sorting a large data set (reordering the data, even though it's linear time, can be very costly).
  • I worked on speeding up the latter approach (building the index arrays) yesterday. The resulting code looked like this:
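
    The exact code from that commit isn't reproduced here; the following is a rough, pure-Python/NumPy sketch of the whole pipeline described above (the real implementation lives in pandas's Cython internals, and these helper names are made up):

    import numpy as np

    def factorize(values):
        # hash table based: map each distinct value to an integer label, O(N)
        uniques = {}
        labels = np.empty(len(values), dtype=np.int64)
        for i, v in enumerate(values):
            labels[i] = uniques.setdefault(v, len(uniques))
        return labels, len(uniques)

    def groupsort_indexer(labels, ngroups):
        # counting-sort variant: indirect O(N + ngroups) sort of the labels
        counts = np.bincount(labels, minlength=ngroups)
        where = np.zeros(ngroups + 1, dtype=np.int64)
        where[1:] = counts.cumsum()
        sorter = np.empty(len(labels), dtype=np.int64)
        for i, lab in enumerate(labels):
            sorter[where[lab]] = i
            where[lab] += 1
        return sorter, counts

    def group_indices(*keys):
        # factorize each key, form the group index, compress it, then chop
        # the sorted indirect indices into one index array per group
        labels_list, sizes = zip(*(factorize(key) for key in keys))
        group_index = np.zeros(len(labels_list[0]), dtype=np.int64)
        for labels, size in zip(labels_list, sizes):
            group_index = group_index * size + labels
        comp_ids, ngroups = factorize(group_index)   # "compress"
        sorter, counts = groupsort_indexer(comp_ids, ngroups)
        indices = np.arange(len(comp_ids)).take(sorter)
        result, start = {}, 0
        for count in counts:
            block = indices[start:start + count]
            result[tuple(key[block[0]] for key in keys)] = block
            start += count
        return result

    key1 = np.array(['a', 'a', 'b', 'b', 'a'], dtype=object)
    key2 = np.array(['x', 'y', 'x', 'x', 'x'], dtype=object)
    print(group_indices(key1, key2))
    # -> ('a', 'x'): [0, 4], ('a', 'y'): [1], ('b', 'x'): [2, 3]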

    The details of lib.indices_fast aren’t that interesting; it chops up np.arange(len(group_index)).take(sorter), the sorted indirect indices, to produce the index dictionary. Running %lprun to get a line profile on a large-ish data set:

    You might say, well, this seems like a lot of work; maybe we should just zip the keys (forming an array of Python tuples) and use a dumb algorithm? By being careful in this way and working with indirect integer index arrays, the end result is something like an order of magnitude faster, or more.

    Anyway, in conclusion, it’s these kinds of algorithms and ideas that make pandas perhaps the best-performing open-source data analysis toolkit for in-memory data (I’m going to get to out-of-core data processing and “big data” eventually, just hang tight). It goes beyond language features and data structure internals (though those naturally also have a major impact; a lot of the things I do are easy to express in Python but would be very awkward or impossible to do in, say, R. Maybe I should write a whole article on this.). Carefully thought-out algorithms are a major piece of the puzzle.

    Contingency tables and cross-tabulations in pandas

    Someone recently asked me about creating cross-tabulations and contingency tables using pandas. I wrote a bit about this in October after implementing the pivot_table function for DataFrame. So I thought I would give a few more examples and show R code vs. the equivalent pandas code which will be helpful for those already used to the R way of doing things.

    When calling help(xtabs) (or help(table)) in R, you get a number of examples for using table, xtabs, and ftable:

    Here are the same examples using pandas:

    One of the really great things about pandas is that the object produced is still a DataFrame (the R object is of a special class)! So we could do anything with it that we do with normal DataFrame objects:
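
    For instance (a made-up illustration rather than the original example, using the rows/cols pivot_table interface from this era of pandas):

    import numpy as np
    from pandas import DataFrame, pivot_table

    # hypothetical data: a small contingency table built with pivot_table,
    # then manipulated like any other DataFrame
    df = DataFrame({'gender': ['Male', 'Female'] * 4,
                    'admitted': ['Yes', 'Yes', 'No', 'No'] * 2,
                    'count': [12, 9, 5, 8, 11, 7, 6, 10]})

    table = pivot_table(df, 'count', rows='gender', cols='admitted',
                        aggfunc=np.sum)

    # it's still a DataFrame, so the usual operations apply
    row_totals = table.sum(axis=1)
    admit_rate = table['Yes'] / row_totals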

    Here is another fun example:

    Here is the equivalent R code:

    As soon as we add a formula framework to statsmodels I could see incorporating that into this kind of function in Python.

    NYCPython 1/10/2012: A look inside pandas design and development

    I had the privilege of speaking last night at the NYCPython meetup group. I’ve given tons of “use pandas!” talks so I thought I would take a slightly different angle and talk about some of the design and implementation work that I’ve done for getting good performance in critical data manipulations. I’ll turn some of this material into some blog articles in the near future.

    Wes McKinney: pandas design and development from Adam Klein on Vimeo.

    Here’s some more video footage shot by my awesome friend Emily Paup with her HDSLR.

    I did a little interactive demo (using the ever-amazing IPython HTML Notebook) on Ashley Williams’s Food Nutrient JSON Database:

    Link to PDF output of Demo

    Link to IPython Notebook file

    If you want to run the code in the IPython notebook, you’ll have to download the food database file above.

    The audience and I learned from this demo that if you’re after Tryptophan, “Sea lion, Steller, meat with fat (Alaska Native)” is where to get it (in the highest density).
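
    A rough sketch of that lookup (the file name and record layout are assumed: each record has a 'description' plus a list of 'nutrients' dicts with 'description', 'value', and 'units' fields):

    import json

    db = json.load(open('foods-2011-10-03.json'))  # file name assumed

    best_food, best_value = None, 0.0
    for rec in db:
        for nutrient in rec['nutrients']:
            if nutrient['description'] == 'Tryptophan' and nutrient['value'] > best_value:
                best_food, best_value = rec['description'], nutrient['value']

    # best_food now holds the most tryptophan-dense item (per 100 g)
    print(best_food)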

    Some pandas Database Join (merge) Benchmarks vs. R base::merge

    Over the last week I have completely retooled pandas’s “database” join infrastructure / algorithms in order to support the full gamut of SQL-style many-to-many merges (pandas has had one-to-one and many-to-one joins for a long time). I was curious about the performance with reasonably large data sets as compared with base::merge.data.frame which I’ve used many times in R. So I just ran a little benchmark for joining a 100000-row DataFrame with a 10000-row DataFrame on two keys with about 10000 unique key combinations overall, simulating a somewhat large SQL join.

    Note this new functionality will be shipping in the upcoming 0.7.0 release (!).

    There is a major factor affecting performance of the algorithms in pandas, namely whether the result data needs to be sorted lexicographically (which is the default behavior) by the join columns. R also offers the same option so it’s completely apples to apples. These are mean runtimes in seconds:


    Sort by key columns

             pandas       R    Ratio
    inner   0.07819  0.2286    2.924
    outer   0.09013  1.284    14.25
    left    0.0853   0.7766    9.104
    right   0.08112  0.3371    4.156


    Don’t sort by key columns

             pandas       R    Ratio
    inner   0.02808  0.2297    8.18
    outer   0.03879  1.181    30.45
    left    0.0365   0.6706   18.37
    right   0.03021  0.2995    9.912

    As you can see, the sorting time in pandas completely dominates the runtime of the join (sorting 10000 strings is a lot of string comparisons). This isn’t really an indictment of R’s merge algorithm, but rather speaks to the strength of the merge strategy I devised in pandas. After spending a few days working on this problem I definitely think R could do a lot better. I haven’t run benchmarks against SQLite or another SQL database yet; that will happen eventually.

    I’m going to write a blog article in the next few days going into algorithmic detail about how I got merging pretty big data sets to be so fast–it was not easy!

    Here is the R code.

    N <- 10000
    indices = rep(NA, N)
    for (i in 1:N)
      indices[i] <- paste(sample(letters, 10), collapse="")

    left <- data.frame(key=rep(indices, 10),
                       key2=sample(rep(indices, 10)),
                       value=rnorm(100000))
    right <- data.frame(key=indices,
                        key2=sample(indices),
                        value2=rnorm(10000))
    timeit <- function(func, niter=10) {
      timing = rep(NA, niter)
      for (i in 1:niter) {
        gc()
        timing[i] <- system.time(func())[3]
      }
      mean(timing)
    }

    left.join <- function(sort=TRUE) {
      result <- merge(left, right, all.x=TRUE, sort=sort)
    }
    right.join <- function(sort=TRUE) {
      result <- merge(left, right, all.y=TRUE, sort=sort)
    }
    outer.join <- function(sort=TRUE) {
      result <- merge(left, right, all=TRUE, sort=sort)
    }
    inner.join <- function(sort=TRUE) {
      result <- merge(left, right, sort=sort)
    }

    sort.options <- c(FALSE, TRUE)

    results <- matrix(nrow=4, ncol=2)
    colnames(results) <- c("dont_sort", "sort")
    rownames(results) <- c("inner", "outer", "left", "right")

    join.functions <- c(inner.join, outer.join, left.join, right.join)
    for (i in 1:4) {
      results[i, 1] <- timeit(function() join.functions[[i]](sort=sort.options[1]))
      results[i, 2] <- timeit(function() join.functions[[i]](sort=sort.options[2]))
    }

    And the Python code

    from collections import defaultdict
    import gc
    import random
    import time

    import numpy as np
    from pandas import DataFrame, merge, read_table
    from pandas.util.testing import rands

    N = 10000
    indices = np.array([rands(10) for _ in xrange(N)], dtype='O')

    key = np.tile(indices, 10)
    key2 = key.copy()
    random.shuffle(key2)
    indices2 = indices.copy()
    random.shuffle(indices2)
    left = DataFrame({'key' : key, 'key2':key2,
                      'value' : np.random.randn(100000)})
    right = DataFrame({'key': indices, 'key2':indices2,
                       'value2' : np.random.randn(10000)})
    join_methods = ['inner', 'outer', 'left', 'right']
    results = DataFrame(index=join_methods, columns=[False, True])
    niter = 10
    for sort in [False, True]:
        for join_method in join_methods:
            f = lambda: merge(left, right, how=join_method, sort=sort)
            gc.disable()
            start = time.time()
            for _ in xrange(niter):
                f()
            elapsed = (time.time() - start) / niter
            gc.enable()
            results[sort][join_method] = elapsed
    results.columns = ['dont_sort', 'sort']

    # R results
    from StringIO import StringIO
    r_results = read_table(StringIO("""dont_sort   sort
    inner    0.2297 0.2286
    outer    1.1811 1.2843
    left     0.6706 0.7766
    right    0.2995 0.3371
    """
    ), sep='\s+')

    sort_results = DataFrame.from_items([('pandas', results['sort']),
                                         ('R', r_results['sort'])])
    sort_results['Ratio'] = sort_results['R'] / sort_results['pandas']

    nosort_results = DataFrame.from_items([('pandas', results['dont_sort']),
                                           ('R', r_results['dont_sort'])])
    nosort_results['Ratio'] = nosort_results['R'] / nosort_results['pandas']

    Introducing vbench, new code performance analysis and monitoring tool

    Do you know how fast your code is? Is it faster than it was last week? Or a month ago? How do you know if you accidentally made a function slower by changes elsewhere? Unintentional performance regressions are extremely common in my experience: it’s hard to unit test the performance of your code. Over time I have gotten tired of playing the game of “performance whack-a-mole”. Thus, I started hacking together a little weekend project that I’m calling vbench. If someone thinks up a cleverer name, I’m all ears.

    Link to pandas benchmarks page produced using vbench

    What is vbench?

    vbench is a super-lightweight Python library for running a collection of performance benchmarks over the course of your source repository’s history. Since I’m a GitHub user, it only does git for now, but it could be generalized to support other VCSs. Basically, you define a benchmark:

    from datetime import datetime

    from vbench.api import Benchmark, BenchmarkRunner

    common_setup = """
    from pandas import *
    import pandas.util.testing as tm
    import random
    import numpy as np
    """


    setup = common_setup + """

    N = 100000
    ngroups = 100

    def get_test_data(ngroups=100, n=N):
        unique_groups = range(ngroups)
        arr = np.asarray(np.tile(unique_groups, n / ngroups), dtype=object)

        if len(arr) < n:
            arr = np.asarray(list(arr) + unique_groups[:n - len(arr)],
                             dtype=object)

        random.shuffle(arr)
        return arr

    df = DataFrame({'key1' : get_test_data(ngroups=ngroups),
                    'key2' : get_test_data(ngroups=ngroups),
                    'data' : np.random.randn(N)})
    def f():
        df.groupby(['key1', 'key2']).agg(lambda x: x.values.sum())
    """

    stmt2 = "df.groupby(['key1', 'key2']).sum()"
    bm_groupby2 = Benchmark(stmt2, setup, name="GroupBy test 2",
                            start_date=datetime(2011, 7, 1))

    Then you write down the information about your repository and how to build any relevant DLLs, etc., that vary from revision to revision:

    REPO_PATH = '/home/wesm/code/pandas'
    REPO_URL = 'git@github.com:wesm/pandas.git'
    DB_PATH = '/home/wesm/code/pandas/gb_suite/benchmarks.db'
    TMP_DIR = '/home/wesm/tmp/gb_pandas'
    PREPARE = """
    python setup.py clean
    """
    BUILD = """
    python setup.py build_ext --inplace
    """
    START_DATE = datetime(2011, 3, 1)

    Then you pass this info, plus a list of your benchmark objects, to the BenchmarkRunner class:

    runner = BenchmarkRunner(benchmarks, REPO_PATH, REPO_URL,
                             BUILD, DB_PATH, TMP_DIR, PREPARE,
                             run_option='eod', start_date=START_DATE)
    runner.run()

    Now, the BenchmarkRunner makes a clone of your repo, then runs all of the benchmarks once for each revision in the repository (or by some other rule; e.g. I’ve set run_option='eod' to take only the last snapshot on each day). It persists the results in a SQLite database so that you can rerun the process and it will skip benchmarks it’s already run (this is key: when you add new benchmarks, only the new ones will be run). Benchmarks are uniquely identified by the MD5 hash of their source code.
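
    As a sketch of that last point (not vbench's actual code), the identifier could be computed along these lines, reusing setup and stmt2 from the benchmark above:

    import hashlib

    # a stable key for the benchmark: hash of the code that defines it,
    # so rerunning the suite can skip results already stored in the DB
    checksum = hashlib.md5((setup + stmt2).encode('utf-8')).hexdigest()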

    This is the resulting plot over time for the above GroupBy benchmark, related to some Cython code I worked on late last week (where I made a major performance improvement):

    Here is a fully-formed vbench suite in the pandas git repository.

    Kind of like codespeed and speed.pypy.org?

    Before starting to write a new project I looked briefly at codespeed and the excellent work that the PyPy guys have done with speed.pypy.org. But then I started thinking, you know, Django web application, JSON requests to upload benchmark results? Seemed like far too much work to do something relatively simple. The dealbreaker is that codespeed is just a web application. It doesn’t actually (to my knowledge, someone correct me if I’m wrong?) have any kind of a framework for orchestrating the running of benchmarks throughout your code history. That is what this new project is for. I actually see a natural connection between vbench and codespeed, all you need to do is write a script to upload your vbench results to a codespeed web app!

    At some point I’d like to build a simple web front end or wx/Qt viewer for the generated vbench database. I’ve never done any JavaScript, but it would be a good opportunity to learn. Knowing me, I might break down and hack out a stupid little wxPython app with an embedded matplotlib widget anyway.

    Anyway, I’m really excited about this project. It’s very prototype-y at the moment but I tried to design it in a nice and extensible way. I also plan to put all my git repo analysis tools in there (like code churn graphs etc.) so it should become a nice little collection of tools.

    Other ideas for extending

  • Dealing with API changes
  • Multiple library versions (e.g. NumPy) or Python versions

    PyHPC 2011 Pre-print paper on pandas

    I’ll be presenting this paper in a couple of weeks at SC11. It’s the most up-to-date white paper about pandas.

    PDF Download Link

    pandas: a Foundational Python Library for Data Analysis and Statistics

    Fast and easy pivot tables in pandas 0.5.0

    Over the last several months, I’ve invested a great deal in the GroupBy and indexing infrastructure of pandas. GroupBy in particular could still use more work to reach even higher performance, but even for quite large data sets, most users will find it more than satisfactory compared with other Python alternatives (which are mainly DIY approaches these days) or with other data analysis packages, e.g. those in R, which I’ll talk a bit about in this article.

    An easy application of GroupBy is creating pivot tables from a tabular data set. This involves aggregating a data set on some number of keys, then reshaping it to form a 2D table with the stratified aggregated values. I added a new function, pivot_table, which provides a high level interface for creating pivot tables. Let’s consider the tips data set used in some examples of Hadley Wickham’s excellent R reshape2 package:

    In [1]: import pandas.rpy.common as com

    In [2]: tips = com.load_data('tips', package='reshape2')

    In [3]: tips['tip_pct'] = tips['tip'] / tips['total_bill']

    In [4]: tips.head()
    Out[4]:
       total_bill  tip   sex     smoker  day  time    size  tip_pct
    1  16.99       1.01  Female  No      Sun  Dinner  2     0.05945
    2  10.34       1.66  Male    No      Sun  Dinner  3     0.1605
    3  21.01       3.5   Male    No      Sun  Dinner  3     0.1666
    4  23.68       3.31  Male    No      Sun  Dinner  2     0.1398
    5  24.59       3.61  Female  No      Sun  Dinner  4     0.1468

    Since most readers won’t have rpy2, you can get the data file here and read it with:

    tips = read_csv('tips.csv')

    All of this new functionality is available now on GitHub but will be a part of the upcoming pandas 0.5.0 release. Note that I am only using rpy2 here to pull the R data set into my Python session; all of the code below is implemented in pure Python (with a little Cython magic to make things fast).

    Pivot table basics

    To use pivot_table, pass the pandas DataFrame and specify the column you wish to aggregate and the columns to group on the rows and columns of the table. Note the default aggregation is the mean, though another aggregation function can be specified:

    In [9]: table = pivot_table(tips, 'tip_pct', rows=['time', 'sex'],
                                cols='smoker')

    In [10]: table
    Out[10]:
      smoker       No      Yes  
    time   sex                  
    Dinner Female  0.1568  0.1851
           Male    0.1594  0.1489
    Lunch  Female  0.1571  0.1753
           Male    0.1657  0.1667

    In [17]: pivot_table(tips, 'tip_pct', rows=['day', 'time'],
                         cols='sex')
    Out[17]:
      sex        Female  Male  
    day  time                  
    Fri  Dinner  0.1991  0.1302
         Lunch   0.1997  0.1741
    Sat  Dinner  0.1565  0.1516
    Sun  Dinner  0.1816  0.1623
    Thur Dinner  0.1597  NaN  
         Lunch   0.1575  0.1653

    Note that the returned object is still a DataFrame, so you can use all of the standard indexing technology to select portions of the resulting table:

    In [22]: table.ix['Dinner']
    Out[22]:
      smoker  No      Yes  
    sex                    
    Female    0.1568  0.1851
    Male      0.1594  0.1489

    If you don’t specify a particular values column, it will try to aggregate all the other columns:

    In [24]: pivot_table(tips, rows=['sex', 'smoker'])
    Out[24]:
                   size   tip    tip_pct  total_bill
    sex    smoker                                  
    Female No      2.593  2.774  0.1569   18.11    
           Yes     2.242  2.932  0.1822   17.98    
    Male   No      2.711  3.113  0.1607   19.79    
           Yes     2.5    3.051  0.1528   22.28

    Other aggregations can be performed, like using len to get group counts:

    In [26]: pivot_table(tips, 'tip_pct', rows='sex', cols='smoker', aggfunc=len)
    Out[26]:
      smoker  No  Yes
    sex              
    Female    54  33
    Male      97  60

    Once you’ve done the aggregation, you’re free to use the built-in reshaping capability to reshape the data in the table:

    In [29]: table = pivot_table(tips, 'tip_pct', rows=['sex', 'day'],
                                 cols='smoker', aggfunc=len)

    In [30]: table
    Out[30]:
      smoker     No  Yes
    sex    day          
    Female Fri   2   7  
           Sat   13  15
           Sun   14  4  
           Thur  25  7  
    Male   Fri   2   8  
           Sat   32  27
           Sun   43  15
           Thur  20  10

    In [31]: table.unstack('sex')
    Out[31]:
      smoker  No            Yes        
      sex     Female  Male  Female  Male
    day                                
    Fri       2       2     7       8  
    Sat       13      32    15      27  
    Sun       14      43    4       15  
    Thur      25      20    7       10

    If you know a bit more about groupby, you can get clever and pass a dict of functions to perform multiple aggregations at once:

    In [59]: pivot_table(tips, rows=['sex', 'smoker'],
                         aggfunc={'tip_pct' : 'mean', 'size' : 'sum'})
    Out[59]:
                   size  tip_pct
    sex    smoker              
    Female No      140   0.1569
           Yes     74    0.1822
    Male   No      263   0.1607
           Yes     150   0.1528

    I thought that was pretty cool.

    Comparison with reshape2

    R’s reshape2 package relies on two functions, melt and cast, to do the reshaping. Its implementation is, at least in my opinion, a slightly more ad hoc approach than in pandas (the actual pivot_table function is only about 10 lines and basically falls out of the groupby and hierarchical-indexing based reshaping). However, it can take advantage of R’s built-in formulas which makes the syntax very intuitive. Here are some of the same examples using reshape2:

    > tips$tip.pct <- tips$tip / tips$total_bill
    > mtips = melt(tips, id=c("sex", "smoker", "day", "time"))
    > acast(mtips, day ~ sex + smoker, length, subset=.(variable=="tip"))
         Female_No Female_Yes Male_No Male_Yes
    Fri          2          7       2        8
    Sat         13         15      32       27
    Sun         14          4      43       15
    Thur        25          7      20       10

    > acast(mtips, sex + smoker ~ variable, mean)
               total_bill      tip     size   tip.pct
    Female_No    18.10519 2.773519 2.592593 0.1569210
    Female_Yes   17.97788 2.931515 2.242424 0.1821504
    Male_No      19.79124 3.113402 2.711340 0.1606687
    Male_Yes     22.28450 3.051167 2.500000 0.1527712

    R lacks hierarchical indexing as far as I know, so reshape2 munges the group labels together into a row label.

    In R there is also the xtabs function for doing cross-tabulation:

    > ftable(xtabs(size ~ time + sex + smoker + day, data=tips))
                         day Fri Sat Sun Thur
    time   sex    smoker                    
    Dinner Female No           2  30  43    2
                  Yes          8  33  10    0
           Male   No           4  85 124    0
                  Yes         12  71  39    0
    Lunch  Female No           3   0   0   60
                  Yes          6   0   0   17
           Male   No           0   0   0   50
                  Yes          5   0   0   23

    ftable just creates a flat table which can be more easily viewed. This table could be created with pandas by:

    In [54]: pivot_table(tips, 'size', rows=['time', 'sex', 'smoker'],
       ....:             cols='day', aggfunc=np.sum, fill_value=0)
    Out[54]:
      day                 Fri  Sat  Sun  Thur
    time   sex    smoker                    
    Dinner Female No      2    30   43   2  
                  Yes     8    33   10   0  
    Dinner Male   No      4    85   124  0  
                  Yes     12   71   39   0  
    Lunch  Female No      3    0    0    60  
                  Yes     6    0    0    17  
    Lunch  Male   No      0    0    0    50  
                  Yes     5    0    0    23

    Performance

    pandas is very fast here, as I’ve invested a great deal in optimizing the indexing infrastructure and other core algorithms. I don’t have a lot of points of comparison, but here is a simple benchmark of reshape2 versus pandas.pivot_table on a data set with 100000 entries and 25 groups. Here is the R code for the benchmark:

    library(reshape2)
    n <- 100000
    a.size <- 5
    b.size <- 5
    data <- data.frame(a=sample(letters[1:a.size], n, replace=T),
                       b=sample(letters[1:b.size], n, replace=T),
                       c=rnorm(n),
                       d=rnorm(n))
    timings <- numeric()

    for (i in 1:10) {
      gc()
      tim <- system.time(acast(melt(data, id=c("a", "b")), a + b ~ variable, mean))
      timings[i] = tim[3]
    }
    mean(timings)

    Here is what the actual table looks like:

    > table <- acast(melt(data, id=c("a", "b")), a + b ~ variable, mean)
    > head(table)
                   c            d
    a_a  0.002761963 -0.008793516
    a_b -0.004618980 -0.026978744
    a_c  0.022491767 -0.002163986
    a_d  0.005027362 -0.002532782
    a_e  0.006146438  0.031699238
    b_a -0.025094899  0.003334301

    Here’s the equivalent Python code (sans timings, which I’ll do in IPython):

    from pandas import *
    import numpy as np
    import string
    n = 100000
    asize = 5
    bsize = 5

    letters = np.asarray(list(string.letters), dtype=object)

    data = DataFrame(dict(foo=letters[:asize][np.random.randint(0, asize, n)],
                          bar=letters[:bsize][np.random.randint(0, bsize, n)],
                          baz=np.random.randn(n),
                          qux=np.random.randn(n)))

    table = pivot_table(data, rows=['foo', 'bar'])

    Similarly:

    In [36]: table = pivot_table(data, rows=['foo', 'bar'])

    In [37]: table[:10]
    Out[37]:
             baz       qux      
    foo bar                    
    a   a   -0.02162   0.02186  
        b   -0.002155  0.01173  
        c   -0.002331  0.006608
        d   -0.01734   0.01109  
        e    0.02837   0.01093  
    b   a    0.01703  -0.005845
        b   -0.001043  0.01283  
        c   -0.006015 -0.0005866
        d    0.009683  0.02211  
        e    0.02347  -8.346e-05

    And the R timing

    > mean(timings)
    [1] 0.42

    And pandas:

    In [38]: timeit table = pivot_table(data, rows=['foo', 'bar'])
    10 loops, best of 3: 117 ms per loop

    So it’s about 3-4x faster than reshape2. This obviously depends on the data set and how you structure the aggregation. Here are a couple of slightly different benchmarks on the same data:

    > acast(melt(data, id=c("a", "b")), a~b, mean, subset=.(variable=="c"))
                 a            b            c           d            e
    a  0.002761963 -0.004618980  0.022491767 0.005027362  0.006146438
    b -0.025094899  0.015966167 -0.007663059 0.006795551  0.006648496
    c  0.012273797  0.006139820 -0.019169968 0.009035374 -0.006079683
    d -0.014174747 -0.009844566  0.022450738 0.025967638 -0.006572791
    e -0.003623382  0.005924704  0.008156482 0.013288116  0.009839315

    Timing
    > mean(timings)
    [1] 0.3036

    and pandas:

    In [41]: pivot_table(data, 'baz', rows='foo', cols='bar')
    Out[41]:
      bar  a         b          c         d         e      
    foo                                                    
    a     -0.02162  -0.002155  -0.002331 -0.01734   0.02837
    b      0.01703  -0.001043  -0.006015  0.009683  0.02347
    c      0.01561   0.0003992  0.01451  -0.01046  -0.00368
    d      0.003504 -0.01437   -0.03265   0.003471 -0.004638
    e     -0.01656  -0.01276   -0.02481   0.008629  0.011  

    In [42]: timeit pivot_table(data, 'baz', rows='foo', cols='bar')
    10 loops, best of 3: 55 ms per loop

    I suspect one source of slowdown in the reshape/reshape2 approach is that the melt operation is very expensive, as is variable subsetting. However, it enables easier computation of group margins, something which I have not yet addressed (but will, eventually) in pandas.
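
    In the meantime, a margin over the columns can be tacked on by hand (a rough sketch only, reusing data, foo, bar, and baz from the benchmark above; the margin is computed from the raw data rather than by averaging the cell means):

    table = pivot_table(data, 'baz', rows='foo', cols='bar')
    # overall mean of baz per foo level, i.e. the margin over bar
    table['All'] = data.groupby('foo')['baz'].mean()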

    Python, R, and the allure of magic

    R is much more magical than Python. What do I mean by this? In R, things like this are a part of everyday life:

    > a <- rnorm(10)
    > b <- rnorm(10)
    > cbind(a, b)
                   a          b
     [1,]  0.8729978  0.5170078
     [2,] -0.6885048 -0.4430447
     [3,]  0.4017740  1.8985843
     [4,]  2.1088905 -1.4121763
     [5,]  0.9375273  0.4703302
     [6,]  0.5558276 -0.5825152
     [7,] -2.1606252  0.7379874
     [8,] -0.7651046 -0.4534345
     [9,] -4.2604901  0.9561077
    [10,]  0.3940632 -0.8331285

    If you’re a seasoned Python programmer, you might have the sort of visceral negative reaction that I do to this. Seriously, just where in the hell did those variable names come from? So when I say magic here I’m talking about abusing the language’s parser. There is nothing special about R that makes the above behavior possible; it simply takes a fundamentally different design philosophy from, say, Python’s. As any Python programmer knows: Explicit is better than implicit. I happen to agree. There is also a bit of a semantic difference between R and Python in that assignment in R typically copies data, whereas variables in Python are simply references (labels) for a particular object. So you could make the argument that the names a and b above are more strongly linked to the underlying data.

    While building pandas over the last several years, I occasionally grapple with issues like the above. Maybe I should just break from the Python ethos and embrace magic? I mean, how hard would it be to get the above behavior in Python? Python gives you stack frames and the ast module, after all. So I went down the rabbit hole and wrote this little code snippet:

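    That snippet isn't reproduced here, but the core of the trick looks something like the following simplified sketch (regex-based rather than the full AST walk the post goes on to describe; merge_named is a made-up name, the post's version was called merge):

    import inspect
    import re

    import numpy as np
    from pandas import DataFrame

    def merge_named(*args):
        # look at the source line of the caller (works when source is
        # available, e.g. when running a script)...
        caller = inspect.currentframe().f_back
        line = inspect.getframeinfo(caller).code_context[0]
        # ...and pull out the argument expressions, e.g. "a, np.log(b)";
        # a naive comma split breaks on nested commas, which is where
        # walking the AST comes in
        inner = re.search(r'merge_named\((.*)\)', line).group(1)
        names = [expr.strip() for expr in inner.split(',')]
        return DataFrame(dict(zip(names, args)), columns=names)

    a = np.random.randn(6)
    b = np.random.randn(6)
    print(merge_named(a, np.abs(b)))
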
    While this is woefully unpythonic, it’s also kind of cool:

    In [27]: merge(a, b)
    Out[27]:
                a         b      
    2000-01-03 -1.35      0.8398
    2000-01-04  0.999    -1.617  
    2000-01-05  0.2537    1.433  
    2000-01-06  0.6273   -0.3959
    2000-01-07  0.7963   -0.789  
    2000-01-10  0.004295 -1.446

    This can even parse and format more complicated expressions (harder than it looks, because you have to walk the whole AST):

    In [30]: merge(a, np.log(b))
    Out[30]:
                a        np.log(b)
    2000-01-03  0.6243   0.7953  
    2000-01-04  0.3593  -1.199    
    2000-01-05  2.805   -1.059    
    2000-01-06  0.6369  -0.9067  
    2000-01-07 -0.2734   NaN      
    2000-01-10 -1.023    0.3326

    Now, I am *not* suggesting we do this any time soon. I’m going to prefer the explicit approach (cf. the Zen of Python) any day of the week:

    In [32]: DataFrame({'a' : a, 'log(b)' : np.log(b)})
    Out[32]:
                a        log(b)
    2000-01-03  0.6243   0.7953
    2000-01-04  0.3593  -1.199  
    2000-01-05  2.805   -1.059  
    2000-01-06  0.6369  -0.9067
    2000-01-07 -0.2734   NaN    
    2000-01-10 -1.023    0.3326

    NumPy indexing peculiarities

    Many scientific Python users are surprised when I tell them that ndarray.take is faster than __getitem__-based (a.k.a. “fancy” as I call it) indexing.

    import numpy as np
    import random

    arr = np.random.randn(10000, 5)
    indexer = np.arange(10000)
    random.shuffle(indexer)

    In [26]: timeit arr[indexer]
    1000 loops, best of 3: 1.25 ms per loop

    In [27]: timeit arr.take(indexer, axis=0)
    10000 loops, best of 3: 127 us per loop

    It’s actually kind of unbelievable when you think about it. What’s going on here that take is almost 10x faster? I really should take a closer look at the internals of what __getitem__ does because this has always struck me as pretty bad. Maybe I shouldn’t be complaining? I mean, R 2.13’s indexing falls somewhere in the middle:

    mat <- matrix(rnorm(50000), nrow=10000, ncol=5)
    set.seed(12345)
    indexer <- sample(1:10000)
    > system.time(for (i in 1:1000) mat[indexer,])
       user  system elapsed
      0.460   0.197   0.656

    So 656 microseconds per iteration. (In an earlier version of this post I used rpy2 to do the benchmark and got 1.05 ms, but there was apparently some overhead from rpy2)

    Another peculiarity that I noticed with take is that performance gets worse when you use the out argument, which tells the function to use an array you pass in to write out the result:

    out = np.empty_like(arr)

    In [50]: timeit np.take(arr, indexer, axis=0, out=out)
    10000 loops, best of 3: 200 us per loop

    EDIT: I’ve been informed that using mode='clip' or mode='wrap' makes this run as fast as without the out argument.

    Weird! I was dissatisfied by this, so I got curious how fast a hand-coded little Cython function can do this:

    cimport cython
    cimport numpy as cnp
    from numpy cimport ndarray, float64_t, int32_t
    import numpy as np

    cnp.import_array()

    cdef double NaN = np.nan

    @cython.wraparound(False)
    @cython.boundscheck(False)
    def take_axis0(ndarray[float64_t, ndim=2] values,
                   ndarray[int32_t] indexer,
                   out=None):
        cdef:
            Py_ssize_t i, j, k, n, idx
            ndarray[float64_t, ndim=2] outbuf

        if out is None:
            outbuf = np.empty_like(values)
        else:
            outbuf = out

        n = len(indexer)
        k = values.shape[1]
        for i from 0 <= i < n:
            idx = indexer[i]

            if idx == -1:
                for j from 0 <= j < k:
                    outbuf[i, j] = NaN
            else:
                for j from 0 <= j < k:
                    outbuf[i, j] = values[idx, j]

    Don’t worry about the -1 thing; that’s a specialization that I’m using inside pandas. Curiously, this function is a lot faster than take with the out argument, and faster than the regular take by a handful of microseconds.

    In [53]: timeit lib.take_axis0(arr, indexer)
    10000 loops, best of 3: 115 us per loop

    In [54]: timeit lib.take_axis0(arr, indexer, out)
    10000 loops, best of 3: 109 us per loop

    Very interesting.

    TL;DR

  • Use take, not []-based indexing, to get the best performance
  • Cython is just as fast for my specific application and a lot faster if you’re passing an out array (which I will be for the application that I needed this for)
  • R’s matrix indexing performance is better than NumPy’s fancy indexing, but about 5-6x slower than ndarray.take. This can probably be improved.

    Leaking memory like a pro in Python

    For all my Python zealotry, I never claimed it was perfect. In fact, there are a number of places you can shoot yourself in the foot if you aren’t careful. But I was kind of bummed about this one:

    import numpy as np

    class MainScreen(object):

        def __init__(self, cats):
            self.cats = cats
            self._signal = None

        def turn_on(self):
            if self._signal is None:
                signal = Signal(self)
                self._signal = signal
            return self._signal

    class Signal(object):

        def __init__(self, main_screen):
            # well we want to know which screen we came from!
            self.main_screen = main_screen

        def we_get(self):
            return True

    for i in xrange(50):
        cats = np.empty((10000, 500), dtype=float)
        cats.fill(0)
        main_screen = MainScreen(cats)
        signal = main_screen.turn_on()
        # all your base...
        # each pass leaves a MainScreen <-> Signal reference cycle holding
        # a ~40 MB array; reference counting alone can't reclaim it

    Try it.

    EDIT: I’ve been informed by an astute reader that the GC is actually capable of dealing with the above case. Yang Zhang says “The collector actually works with cycles, including this case. (Actually, most objects are managed via reference-counting; the gc is specifically for cyclic references.) The problem is that it’s triggered by # allocations, not # bytes. In this case, each cats array is large enough that you need several GB of allocations before the threshold is triggered. Try adding gc.collect() or gc.get_count() in the loop (and inspect gc.get_threshold()).”
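
    Following that suggestion, a small tweak to the loop (just a sketch) keeps memory bounded:

    import gc

    for i in xrange(50):
        cats = np.empty((10000, 500), dtype=float)
        cats.fill(0)
        main_screen = MainScreen(cats)
        signal = main_screen.turn_on()
        gc.collect()  # collects the MainScreen <-> Signal cycles right away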

    All is not lost however. This pattern can be implemented using weak references:

    import weakref

    class MainScreen(object):

        def __init__(self, cats):
            self.cats = cats
            self._signal = None

        def turn_on(self):
            if self._signal is None:
                signal = Signal(self)
                # hold the Signal only weakly, so no MainScreen <-> Signal
                # reference cycle is created; the caller owns the Signal
                self._signal = weakref.ref(signal)
            return self._signal()