High performance database joins with pandas DataFrame, more benchmarks

I posted a brief article with some preliminary benchmarks for the new merge/join infrastructure that I’ve built in pandas. I compared the performance with base::merge in R which, as various folks in the R community have pointed out, is fairly slow. There are a few other, more intelligently implemented functions available on CRAN, in particular plyr::join and the merge method for data.table objects in the data.table package.

Lastly, Sean Taylor contributed a benchmark for SQLite3, by all accounts the most widely deployed SQL engine.

So anyway, here are the two benchmarks I’m interested in to get a sense of the large-ish data runtime of these algorithms:

  • Many-to-one joins. In these benchmarks I have an 80,000 row table with 10 copies of 8,000 key pairs and an 8,000 row table with a single copy of another 8,000 key pairs, only 6,000 of which are found in the larger table. The result of a left join between these tables should have 80,000 rows, an inner join 60,000, and an outer join 82,000. (A sketch of how such tables might be constructed follows this list.)
  • Many-to-many joins. To keep things simple I use the same tables as above, except the right table is the table above stacked on itself. You can do the combinatorics here, but the outer join between these two tables has 144,000 rows.
  • Note that plyr::join does not implement (or at least I’ve been told to avoid) many-to-many joins, so I only run the many-to-one benchmarks for it.
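
    Here is a rough sketch of how tables with those properties might be constructed in pandas (my own reconstruction for illustration, not the actual benchmark script; for simplicity, key2 just mirrors key):

    import numpy as np
    import pandas as pd

    keys = np.array(['k%06d' % i for i in range(10000)], dtype=object)

    # larger table: 10 copies of 8,000 (key, key2) pairs -> 80,000 rows
    left = pd.DataFrame({'key': np.tile(keys[:8000], 10),
                         'key2': np.tile(keys[:8000], 10),
                         'value': np.random.randn(80000)})

    # smaller table: 8,000 unique pairs, 6,000 of which appear in the larger table
    right = pd.DataFrame({'key': keys[2000:],
                          'key2': keys[2000:],
                          'value2': np.random.randn(8000)})

    for how in ['left', 'inner', 'outer']:
        joined = pd.merge(left, right, on=['key', 'key2'], how=how)
        print('%s join: %d rows' % (how, len(joined)))   # expect 80000, 60000, 82000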

    I’ve normalized the results by the minimum runtime (which is pandas in all cases):

    UPDATE (1/6/2012): Jared Lander pointed out that data.table is capable of much faster left and right joins by using the syntax left[right] instead of merge(left, right, all.y=TRUE). I updated the benchmarks below and added the right join results, which show data.table performing very admirably.

    pandas vs R merge benchmarks
    Many-to-one

            pandas  data.table  plyr   base::merge
    inner   1       5.905       6.35   13.29
    outer   1       10.05       9.209  20.25
    left    1       2.849       5.918  14.93
    right   1       2.05        2.923  16.91

    Many-to-many

            pandas  data.table  plyr   base::merge
    inner   1       5.194       5.223  18.87
    outer   1       10          6.903  33.75
    left    1       2.528       4.688  24.46
    right   1       1.681       2.05   25.24

    SQLite3 Benchmarks

    A bit of care needs to be taken with SQLite3 benchmarks because the time to fetch the table from the database cursor (even though this is an in-memory SQLite database) is very significant. The performance as you can imagine is also quite different with and without indexes.

    Here is the basic code to insert the data into SQLite:

    import sqlite3

    # key, key2 and the left/right DataFrames are defined in the benchmark data setup (not shown here)
    create_sql_indexes = False

    conn = sqlite3.connect(':memory:')
    conn.execute('create table left( key varchar(10), key2 varchar(10), value int);')
    conn.execute('create table right( key varchar(10), key2 varchar(10), value2 int);')
    conn.executemany('insert into left values (?, ?, ?)',
                     zip(key, key2, left['value']))
    conn.executemany('insert into right values (?, ?, ?)',
                     zip(right['key'], right['key2'], right['value2']))
    # Create Indices
    if create_sql_indexes:
        conn.execute('create index left_ix on left(key, key2)')
        conn.execute('create index right_ix on right(key, key2)')

    With no indexes, here is a comparison of just the SQL execution time vs. the total pandas execution time for the many-to-one case described above:


            sqlite3  pandas
    inner   0.02328  0.01799
    outer   0.02324  0.01943
    left    0.02324  0.01923

    So pandas very slightly edges out SQLite with no indexes. Note that it takes ~300ms (on my machine) to fetch the data out of the database into a list of tuples, so if you consider that, pandas is really crushing it by doing everything in memory.

    The results after adding indexes are pretty indistinguishable as far as I can tell, because the fetch time (which involves creating and populating a large number of Python tuples) appears to dwarf the join time. The execute + fetch time varies between 310-340 ms for all three join types, with and without indexes, for the many-to-one case. The many-to-many case varies between 420-490 ms, whereas pandas takes 22-25 ms!
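
    To make the fetch overhead concrete, here is roughly how one might time the execute-plus-fetch step against the in-memory database created above (my sketch; the actual benchmark harness differs):

    import time

    # join the two tables created above and materialize the result as Python tuples
    join_sql = """
    SELECT left.key, left.key2, left.value, right.value2
      FROM left
     INNER JOIN right
        ON left.key = right.key AND left.key2 = right.key2
    """

    start = time.time()
    rows = conn.execute(join_sql).fetchall()   # fetchall builds one Python tuple per row
    elapsed = time.time() - start
    print('execute + fetch: %.3f s for %d rows' % (elapsed, len(rows)))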

    UPDATE: After some thought and discussions with people, I’ve concluded that these benchmarks are not fair to SQLite. A more appropriate benchmark would be to create the joined table inside the database using a statement like

    CREATE TABLE test AS
    SELECT *
       FROM LEFT
       INNER JOIN RIGHT
         ON LEFT.KEY=RIGHT.KEY
           AND LEFT.key2 = RIGHT.key2;

    Here are new benchmarks using this SQL statement:

    Many-to-one

            pandas   sqlite3
    inner   0.01674  0.04187
    outer   0.01882  0.04534
    left    0.01851  0.04514

    Many-to-many

            pandas   sqlite3
    inner   0.02091  0.06733
    outer   0.02266  0.06968
    left    0.02203  0.06882

    So pandas still significantly outperforms SQLite3 (even with SQL indexes as in these benchmarks). But it’s not totally apples-to-apples as SQLite3 is able to perform joins on extremely large data sets on disk.

    Conclusions

    As far as I can tell, pandas now has one of the fastest in-memory database join operators out there. If you’re using Python to do relational algebra, you’d be crazy to pick SQLite3 over pandas due to the high cost of reading and writing large data sets (in the form of Python tuples) to SQL format.

    Code links

  • R code
  • pandas code
  • SQLite3 (Python) code

    Some pandas Database Join (merge) Benchmarks vs. R base::merge

    Over the last week I have completely retooled pandas’s “database” join infrastructure / algorithms in order to support the full gamut of SQL-style many-to-many merges (pandas has had one-to-one and many-to-one joins for a long time). I was curious about the performance with reasonably large data sets as compared with base::merge.data.frame, which I’ve used many times in R. So I ran a little benchmark joining a 100000-row DataFrame with a 10000-row DataFrame on two keys, with about 10000 unique key combinations overall, simulating a somewhat large SQL join.

    Note this new functionality will be shipping in the upcoming 0.7.0 release (!).

    There is a major factor affecting performance of the algorithms in pandas, namely whether the result data needs to be sorted lexicographically (which is the default behavior) by the join columns. R also offers the same option so it’s completely apples to apples. These are mean runtimes in seconds:


    Sort by key columns

            pandas   R       Ratio
    inner   0.07819  0.2286  2.924
    outer   0.09013  1.284   14.25
    left    0.0853   0.7766  9.104
    right   0.08112  0.3371  4.156


    Don’t sort by key columns

            pandas   R       Ratio
    inner   0.02808  0.2297  8.18
    outer   0.03879  1.181   30.45
    left    0.0365   0.6706  18.37
    right   0.03021  0.2995  9.912

    As you can see, the sorting time in pandas completely dominates the runtime of the join (sorting 10000 strings is a lot of string comparisons). This isn’t really an indictment of R’s merge algorithm, but rather speaks to the strength of the merge strategy I devised in pandas. After spending a few days working on this problem I definitely think R could do a lot better. I haven’t run benchmarks against SQLite or another SQL database yet; that will happen eventually.

    I’m going to write a blog article in the next few days going into algorithmic detail about how I got merging pretty big data sets to be so fast; it was not easy!

    Here is the R code.

    N <- 10000
    indices = rep(NA, N)
    for (i in 1:N)
      indices[i] <- paste(sample(letters, 10), collapse="")

    left <- data.frame(key=rep(indices, 10),
                       key2=sample(rep(indices, 10)),
                       value=rnorm(100000))
    right <- data.frame(key=indices,
                        key2=sample(indices),
                        value2=rnorm(10000))
    timeit <- function(func, niter=10) {
      timing = rep(NA, niter)
      for (i in 1:niter) {
        gc()
        timing[i] <- system.time(func())[3]
      }
      mean(timing)
    }

    left.join <- function(sort=TRUE) {
      result <- merge(left, right, all.x=TRUE, sort=sort)
    }
    right.join <- function(sort=TRUE) {
      result <- merge(left, right, all.y=TRUE, sort=sort)
    }
    outer.join <- function(sort=TRUE) {
      result <- merge(left, right, all=TRUE, sort=sort)
    }
    inner.join <- function(sort=TRUE) {
      result <- merge(left, right, sort=sort)
    }

    sort.options <- c(FALSE, TRUE)

    results <- matrix(nrow=4, ncol=2)
    colnames(results) <- c("dont_sort", "sort")
    rownames(results) <- c("inner", "outer", "left", "right")

    join.functions <- list(inner.join, outer.join, left.join, right.join)
    for (i in 1:4) {
      # time each join type without and then with sorting by the key columns
      results[i, 1] <- timeit(function() join.functions[[i]](sort=sort.options[1]))
      results[i, 2] <- timeit(function() join.functions[[i]](sort=sort.options[2]))
    }

    And the Python code

    import gc
    import random
    import time

    import numpy as np

    from pandas import DataFrame, merge, read_table
    from pandas.util.testing import rands

    N = 10000
    indices = np.array([rands(10) for _ in xrange(N)], dtype='O')

    key = np.tile(indices, 10)
    key2 = key.copy()
    random.shuffle(key2)
    indices2 = indices.copy()
    random.shuffle(indices2)
    left = DataFrame({'key' : key, 'key2':key2,
                      'value' : np.random.randn(100000)})
    right = DataFrame({'key': indices, 'key2':indices2,
                       'value2' : np.random.randn(10000)})
    join_methods = ['inner', 'outer', 'left', 'right']
    results = DataFrame(index=join_methods, columns=[False, True])
    niter = 10
    for sort in [False, True]:
        for join_method in join_methods:
            f = lambda: merge(left, right, how=join_method, sort=sort)
            gc.disable()
            start = time.time()
            for _ in xrange(niter):
                f()
            elapsed = (time.time() - start) / niter
            gc.enable()
            results[sort][join_method] = elapsed
    results.columns = ['dont_sort', 'sort']

    # R results
    from StringIO import StringIO
    r_results = read_table(StringIO("""dont_sort   sort
    inner    0.2297 0.2286
    outer    1.1811 1.2843
    left     0.6706 0.7766
    right    0.2995 0.3371
    """
    ), sep='\s+')

    sort_results = DataFrame.from_items([('pandas', results['sort']),
                                         ('R', r_results['sort'])])
    sort_results['Ratio'] = sort_results['R'] / sort_results['pandas']

    nosort_results = DataFrame.from_items([('pandas', results['dont_sort']),
                                           ('R', r_results['dont_sort'])])
    nosort_results['Ratio'] = nosort_results['R'] / nosort_results['pandas']

    Introducing vbench, new code performance analysis and monitoring tool

    Do you know how fast your code is? Is it faster than it was last week? Or a month ago? How do you know if you accidentally made a function slower by changes elsewhere? Unintentional performance regressions are extremely common in my experience: it’s hard to unit test the performance of your code. Over time I have gotten tired of playing the game of “performance whack-a-mole”. Thus, I started hacking together a little weekend project that I’m calling vbench. If someone thinks up a cleverer name, I’m all ears.

    Link to pandas benchmarks page produced using vbench

    What is vbench?

    vbench is a super-lightweight Python library for running a collection of performance benchmarks over the course of your source repository’s history. Since I’m a GitHub user, it only does git for now, but it could be generalized to support other VCSs. Basically, you define a benchmark:

    from datetime import datetime
    from vbench.api import Benchmark   # assumed import location for vbench's Benchmark class

    common_setup = """
    from pandas import *
    import pandas.util.testing as tm
    import random
    import numpy as np
    """


    setup = common_setup + """

    N = 100000
    ngroups = 100

    def get_test_data(ngroups=100, n=N):
        unique_groups = range(ngroups)
        arr = np.asarray(np.tile(unique_groups, n / ngroups), dtype=object)

        if len(arr) < n:
            arr = np.asarray(list(arr) + unique_groups[:n - len(arr)],
                             dtype=object)

        random.shuffle(arr)
        return arr

    df = DataFrame({'key1' : get_test_data(ngroups=ngroups),
                    'key2' : get_test_data(ngroups=ngroups),
                    'data' : np.random.randn(N)})
    def f():
        df.groupby(['key1', 'key2']).agg(lambda x: x.values.sum())
    """

    stmt2 = "df.groupby(['key1', 'key2']).sum()"
    bm_groupby2 = Benchmark(stmt2, setup, name="GroupBy test 2",
                            start_date=datetime(2011, 7, 1))

    Then you write down the information about your repository and how to build any relevant DLLs, etc., that vary from revision to revision:

    REPO_PATH = '/home/wesm/code/pandas'
    REPO_URL = 'git@github.com:wesm/pandas.git'
    DB_PATH = '/home/wesm/code/pandas/gb_suite/benchmarks.db'
    TMP_DIR = '/home/wesm/tmp/gb_pandas'
    PREPARE = """
    python setup.py clean
    """
    BUILD = """
    python setup.py build_ext --inplace
    """
    START_DATE = datetime(2011, 3, 1)

    Then you pass this info, plus a list of your benchmark objects, to the BenchmarkRunner class:

    runner = BenchmarkRunner(benchmarks, REPO_PATH, REPO_URL,
                             BUILD, DB_PATH, TMP_DIR, PREPARE,
                             run_option='eod', start_date=START_DATE)
    runner.run()

    Now, the BenchmarkRunner makes a clone of your repo, then runs all of the benchmarks once for each revision in the repository (or some other rule; e.g. I’ve set run_option='eod' to only take the last snapshot on each day). It persists the results in a SQLite database so that you can rerun the process and it will skip benchmarks it’s already run; this is key when you add new benchmarks, since only the new ones will be run. Benchmarks are uniquely identified by the MD5 hash of their source code.
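
    The checksum idea is simple to illustrate; roughly something like this (my sketch of the concept, not vbench’s actual code):

    import hashlib

    def benchmark_checksum(setup, stmt):
        # a benchmark is identified by the hash of its source text, so editing
        # either the setup or the statement effectively creates a new benchmark
        return hashlib.md5((setup + stmt).encode('utf-8')).hexdigest()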

    This is the resulting plot over time for the above GroupBy benchmark, related to some Cython code that I worked on late last week (where I made a major performance improvement).

    Here is a fully-formed vbench suite in the pandas git repository.

    Kind of like codespeed and speed.pypy.org?

    Before starting to write a new project I looked briefly at codespeed and the excellent work that the PyPy guys have done with speed.pypy.org. But then I started thinking, you know, Django web application, JSON requests to upload benchmark results? Seemed like far too much work to do something relatively simple. The dealbreaker is that codespeed is just a web application. It doesn’t actually (to my knowledge, someone correct me if I’m wrong?) have any kind of a framework for orchestrating the running of benchmarks throughout your code history. That is what this new project is for. I actually see a natural connection between vbench and codespeed, all you need to do is write a script to upload your vbench results to a codespeed web app!

    At some point I’d like to build a simple web front end or wx/Qt viewer for the generated vbench database. I’ve never done any JavaScript, but it would be a good opportunity to learn. Knowing me, I might break down and hack out a stupid little wxPython app with an embedded matplotlib widget anyway.

    Anyway, I’m really excited about this project. It’s very prototype-y at the moment but I tried to design it in a nice and extensible way. I also plan to put all my git repo analysis tools in there (like code churn graphs etc.) so it should become a nice little collection of tools.

    Other ideas for extending

  • Dealing with API changes
  • Multiple library versions (e.g. NumPy) or Python versions

    Formatting DataFrame as HTML

    Recently, a pandas user, Wouter Overmeire, contributed a to_html function to DataFrame in a pull request. It’s very handy:

    In [31]: df
    Out[31]:
                            ls    lsc   pop   ccode
    year cname     agefrom                        
    1950 Australia 15       64.3  15.4  558   AUS  
                   40       38.9  20.1  555   AUS  
                   65       24.7  10.7  273   AUS  
    1955 Australia 25       48.4  26.2  705   AUS  
                   50       34    16.6  478   AUS  
                   75       26.3  11.7  240   AUS  
    1960 Australia 35       47.9  24.7  755   AUS  
                   60       29.6  13.5  389   AUS  
    1965 Australia 20       57.7  31.1  832   AUS  
                   45       41.3  20.3  665   AUS  
                   70       30.2  12.9  273   AUS  
    1970 Australia 30       57.7  31.1  789   AUS  
                   55       42.1  19.4  619   AUS  
    1975 Australia 15       91.1  23.1  1225  AUS  
                   40       56.3  30.9  744   AUS  
                   65       48.8  22.5  440   AUS  
    1980 Australia 25       74    44.5  1205  AUS  
                   50       60.2  32.8  774   AUS  
                   75       62.5  30.8  500   AUS  
    1985 Australia 35       72.7  42.8  1180  AUS  

    In [32]: df.to_html()

    which outputs:


                            ls    lsc   pop   ccode
    year cname     agefrom
    1950 Australia 15       64.3  15.4  558   AUS
    1955 Australia 50       34    16.6  478   AUS
    1965 Australia 20       57.7  31.1  832   AUS
    1970 Australia 55       42.1  19.4  619   AUS
    1980 Australia 25       74    44.5  1205  AUS
    1985 Australia 60       59.1  30.8  685   AUS
    1995 Australia 30       72    45.7  1430  AUS
    2000 Australia 65       60    31.5  663   AUS
    2010 Australia 35       59.9  39.1  1506  AUS
    1950 Austria   70       2.8   1.9   218   AUT
    1960 Austria   40       4.5   3     324   AUT
    1965 Austria   75       13    8.6   322   AUT
    1975 Austria   45       42.1  28    455   AUT
    1985 Austria   15       22.4  4.2   626   AUT
    1990 Austria   50       47.3  23.4  448   AUT
    2000 Austria   20       73.2  63.2  465   AUT
    2005 Austria   55       55.1  36.2  486   AUT

    It also renders a tidy representation with hierarchical columns:

    In [14]: table
    Out[14]:
      sex     Female          Male          
      smoker  No      Yes     No      Yes  
    day                                    
    Fri       0.1653  0.2091  0.138   0.1447
    Sat       0.148   0.1638  0.1621  0.1391
    Sun       0.1657  0.2371  0.1583  0.174
    Thur      0.156   0.1631  0.1657  0.1644


    and the rendered HTML version of the same table:

    sex     Female          Male
    smoker  No      Yes     No      Yes
    day
    Fri     0.1653  0.2091  0.138   0.1447
    Sat     0.148   0.1638  0.1621  0.1391
    Sun     0.1657  0.2371  0.1583  0.174
    Thur    0.156   0.1631  0.1657  0.1644

    Talk at Rice Stats on structured data analysis, pandas, 11/21/2011

    I had the privilege of speaking today at Rice at Hadley Wickham’s (of R fame) invitation. I talked broadly about the problems faced in structured data manipulation problems and how I’ve worked to address them in pandas. I enjoyed speaking with Hadley (whom I’d never met before in person) about these problems as we’ve come up with nice solutions to many of them independently, and in different languages (R and Python, respectively). A lot of folks assume that parts of pandas are inspired by his plyr and reshape2 packages, but I’ve only started playing around with them recently. I think we all have a lot we can learn from each other on the road toward building even better high performance, easy-to-use, and expressive data manipulation tools.

    pandas talk at PyHPC 2011 workshop in SC11, thoughts on hash tables

    Here are the slides from my talk at PyHPC2011. It’s not really my usual talk for data crowds; it’s a little more nuts and bolts about some of the indexing and GroupBy implementation details. Some people might be interested in the illustrative data alignment benchmarks, which show the relative weakness of Python’s dict implementation (in both speed and memory usage) for lookups and alignment. After these benchmarks I think it’s pretty much inevitable that I’m going to end up writing a custom hash table implementation in C for data alignment on primitive types. Now, if I wanted a threadsafe hash table that I could use OpenMP on, that would be a serious undertaking. Anyone want to help?

    The basic problem is that Python dicts are not designed for my use case, namely very large dicts used to perform data alignment operations.
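
    To give a flavor of the workload, the alignment step is essentially a bulk hash-table lookup; here is a simplified sketch (the names and sizes are illustrative, not taken from the talk):

    import numpy as np

    n = 1000000
    labels = np.arange(n, dtype=np.int64)
    other = np.random.permutation(labels)

    # build a hash table mapping label -> position, then translate another
    # array of labels into integer positions (-1 where missing)
    positions = dict(zip(labels.tolist(), range(n)))
    indexer = np.fromiter((positions.get(lab, -1) for lab in other.tolist()),
                          dtype=np.int64, count=n)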

    PyHPC2011

    Filtering out duplicate pandas.DataFrame rows

    Sean Taylor recently alerted me to the fact that there wasn’t an easy way to filter out duplicate rows in a pandas DataFrame. R has the duplicated function, which serves this purpose quite nicely. The R method’s implementation is kind of kludgy in my opinion (from the R docs: “The data frame method works by pasting together a character representation of the rows”), but in any case I set about writing a Python version from first principles.

    A drop_duplicates method should be able to consider either a subset of the columns or all of the columns when determining which rows are “duplicates”. It occurred to me that a reasonably fast and efficient way to do this was to use GroupBy. You have to know a bit about the internals, but it’s fairly straightforward otherwise:

    grouped = df.groupby(cols_to_consider)
    # take the first original row label from each group of identical key values
    index = [gp_keys[0] for gp_keys in grouped.groups.values()]
    unique_df = df.reindex(index)

    For a faster and more direct implementation in Cython, I decided to use a list of tuples and a dict to keep track of whether a row has been “seen” or not:

    def duplicated(list values, take_last=False):
        cdef:
            Py_ssize_t i, n
            dict seen = {}
            object row

        n = len(values)
        cdef ndarray[uint8_t] result = np.zeros(n, dtype=np.uint8)

        if take_last:
            for i from n > i >= 0:
                row = values[i]
                if row in seen:
                    result[i] = 1
                else:
                    seen[row] = None
                    result[i] = 0
        else:
            for i from 0 <= i < n:
                row = values[i]
                if row in seen:
                    result[i] = 1
                else:
                    seen[row] = None
                    result[i] = 0

        return result.view(np.bool_)

    Aside: the uint8 business and the cast to np.bool_ are there because the buffer interface doesn’t work quite right with boolean arrays in Python 2.5. So as soon as I drop Python 2.5 compatibility (probably around the end of the year), I’ll go through and fix that stuff up. As a second aside, using a dict with dummy values was coming out a bit faster than using a set in Cython, for reasons unknown to me.

    So, on the Python side, the new DataFrame function just takes the boolean vector returned by that Cython function and uses it to select out the rows:

    In [21]: df
    Out[21]:
        A  B  C
    0   a  c  0
    1   b  c  1
    2   c  b  2
    3   d  a  0
    4   e  c  1
    5   a  a  2
    6   b  b  0
    7   c  b  1
    8   d  b  2
    9   e  b  0
    10  a  a  1
    11  b  a  2
    12  c  c  0
    13  d  a  1
    14  e  c  2

    In [22]: df.drop_duplicates('A')
    Out[22]:
       A  B  C
    0  a  c  0
    1  b  c  1
    2  c  b  2
    3  d  a  0
    4  e  c  1

    In [23]: df.drop_duplicates(['A', 'B'])
    Out[23]:
        A  B  C
    0   a  c  0
    1   b  c  1
    2   c  b  2
    3   d  a  0
    4   e  c  1
    5   a  a  2
    6   b  b  0
    8   d  b  2
    9   e  b  0
    11  b  a  2
    12  c  c  0

    Not passing any particular column or columns is the same as passing all of them.
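
    For the curious, the Python-side wrapper amounts to something like the following (a simplified sketch of my own, not the actual pandas source; it relies on the Cython duplicated function above):

    def drop_duplicates(df, cols=None, take_last=False):
        if cols is None:
            cols = list(df.columns)
        # one tuple of key values per row, fed to the Cython duplicated function
        rows = list(zip(*(df[c] for c in cols)))
        mask = duplicated(rows, take_last=take_last)
        return df[~mask]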

    I was impressed by the performance:

    In [26]: from pandas.util.testing import rands

    In [27]: k1 = [rands(1) for _ in xrange(100000)]

    In [28]: k2 = [rands(1) for _ in xrange(100000)]

    In [29]: df = DataFrame({'A' : k1, 'B' : k2, 'C' : np.tile(np.arange(100), 1000)})

    In [30]: timeit df.drop_duplicates(['A', 'B'])
    100 loops, best of 3: 18.2 ms per loop

    In [31]: timeit df.drop_duplicates()
    10 loops, best of 3: 49.8 ms per loop

    # the groupby method
    In [32]: timeit df.reindex([v[0] for v in df.groupby(['A', 'B']).groups.values()])
    10 loops, best of 3: 56.5 ms per loop

    For reference, it’s roughly twice as fast as the equivalent R code (though I’m still using R 2.11.1…time to upgrade to 2.14!), but perhaps not too shocking as the R function is converting each row into a string first.

    N <- 100000

    k1 = rep(NA, N)
    k2 = rep(NA, N)
    for (i in 1:N){
      k1[i] <- paste(sample(letters, 1), collapse="")
      k2[i] <- paste(sample(letters, 1), collapse="")
    }
    df <- data.frame(a=k1, b=k2, c=rep(1:100, N / 100))
    df2 <- data.frame(a=k1, b=k2)

    timings <- numeric()
    timings2 <- numeric()
    for (i in 1:50) {
      gc()
      timings[i] = system.time(deduped <- df[!duplicated(df),])[3]
      gc()
      timings2[i] = system.time(deduped <- df[!duplicated(df[,c("a", "b")]),])[3]
    }

    > mean(timings)
    [1] 0.09312
    > mean(timings2)
    [1] 0.04312

    PyHPC 2011 Pre-print paper on pandas

    I’ll be presenting this paper in a couple of weeks at SC11. It’s the most up-to-date white paper about pandas.

    PDF Download Link

    pandas: a Foundational Python Library for Data Analysis and Statistics

    Fast and easy pivot tables in pandas 0.5.0

    Over the last several months, I’ve invested a great deal in the GroupBy and indexing infrastructure of pandas. GroupBy in particular could still use more work to be made even faster, but even for quite large data sets, most users will find it more than satisfactory compared with other Python alternatives (which are mainly DIY approaches these days) or even with the tools available in other data analysis packages, e.g. those in R, which I’ll talk a bit about in this article.

    An easy application of GroupBy is creating pivot tables from a tabular data set. This involves aggregating a data set on some number of keys, then reshaping it to form a 2D table with the stratified aggregated values. I added a new function, pivot_table, which provides a high level interface for creating pivot tables. Let’s consider the tips data set used in some examples of Hadley Wickham’s excellent R reshape2 package:

    In [1]: import pandas.rpy.common as com

    In [2]: tips = com.load_data('tips', package='reshape2')

    In [3]: tips['tip_pct'] = tips['tip'] / tips['total_bill']

    In [4]: tips.head()
    Out[4]:
       total_bill  tip   sex     smoker  day  time    size  tip_pct
    1  16.99       1.01  Female  No      Sun  Dinner  2     0.05945
    2  10.34       1.66  Male    No      Sun  Dinner  3     0.1605
    3  21.01       3.5   Male    No      Sun  Dinner  3     0.1666
    4  23.68       3.31  Male    No      Sun  Dinner  2     0.1398
    5  24.59       3.61  Female  No      Sun  Dinner  4     0.1468

    Since most readers won’t have rpy2, you can get the data file here and read it with:

    tips = read_csv('tips.csv')

    All of this new functionality is available now on GitHub and will be a part of the upcoming pandas 0.5.0 release. Note that I am only using rpy2 here to pull the R data set into my Python session; all of the code below is implemented in pure Python (with a little Cython magic to make things fast).

    Pivot table basics

    To use pivot_table, pass the pandas DataFrame and specify the column you wish to aggregate and the columns to group on the rows and columns of the table. Note that the default aggregation is the mean, though a different function can be specified:

    In [9]: table = pivot_table(tips, 'tip_pct', rows=['time', 'sex'],
                                cols='smoker')

    In [10]: table
    Out[10]:
      smoker       No      Yes  
    time   sex                  
    Dinner Female  0.1568  0.1851
           Male    0.1594  0.1489
    Lunch  Female  0.1571  0.1753
           Male    0.1657  0.1667

    In [17]: pivot_table(tips, 'tip_pct', rows=['day', 'time'],
                         cols='sex')
    Out[17]:
      sex        Female  Male  
    day  time                  
    Fri  Dinner  0.1991  0.1302
         Lunch   0.1997  0.1741
    Sat  Dinner  0.1565  0.1516
    Sun  Dinner  0.1816  0.1623
    Thur Dinner  0.1597  NaN  
         Lunch   0.1575  0.1653

    Note that the returned object is still a DataFrame, so you can use all of the standard indexing technology to select portions of the resulting table:

    In [22]: table.ix['Dinner']
    Out[22]:
      smoker  No      Yes  
    sex                    
    Female    0.1568  0.1851
    Male      0.1594  0.1489

    If you don’t specify a particular values column, it will try to aggregate all the other columns:

    In [24]: pivot_table(tips, rows=['sex', 'smoker'])
    Out[24]:
                   size   tip    tip_pct  total_bill
    sex    smoker                                  
    Female No      2.593  2.774  0.1569   18.11    
           Yes     2.242  2.932  0.1822   17.98    
    Male   No      2.711  3.113  0.1607   19.79    
           Yes     2.5    3.051  0.1528   22.28

    Other aggregations can be performed, like using len to get group counts:

    In [26]: pivot_table(tips, 'tip_pct', rows='sex', cols='smoker', aggfunc=len)
    Out[26]:
      smoker  No  Yes
    sex              
    Female    54  33
    Male      97  60

    Once you’ve done the aggregation, you’re free to use the built-in reshaping capability to reshape the data in the table:

    In [29]: table = pivot_table(tips, 'tip_pct', rows=['sex', 'day'],
                                 cols='smoker', aggfunc=len)

    In [30]: table
    Out[30]:
      smoker     No  Yes
    sex    day          
    Female Fri   2   7  
           Sat   13  15
           Sun   14  4  
           Thur  25  7  
    Male   Fri   2   8  
           Sat   32  27
           Sun   43  15
           Thur  20  10

    In [31]: table.unstack('sex')
    Out[31]:
      smoker  No            Yes        
      sex     Female  Male  Female  Male
    day                                
    Fri       2       2     7       8  
    Sat       13      32    15      27  
    Sun       14      43    4       15  
    Thur      25      20    7       10

    If you know a bit more about groupby, you can get clever and pass a dict of functions to perform multiple aggregations at once:

    In [59]: pivot_table(tips, rows=['sex', 'smoker'],
                         aggfunc={'tip_pct' : 'mean', 'size' : 'sum'})
    Out[59]:
                   size  tip_pct
    sex    smoker              
    Female No      140   0.1569
           Yes     74    0.1822
    Male   No      263   0.1607
           Yes     150   0.1528

    I thought that was pretty cool.

    Comparison with reshape2

    R’s reshape2 package relies on two functions, melt and cast, to do the reshaping. Its implementation is, at least in my opinion, a slightly more ad hoc approach than in pandas (the actual pivot_table function is only about 10 lines and basically falls out of the groupby and hierarchical-indexing based reshaping). However, it can take advantage of R’s built-in formulas which makes the syntax very intuitive. Here are some of the same examples using reshape2:

    > tips$tip.pct <- tips$tip / tips$total_bill
    > mtips = melt(tips, id=c("sex", "smoker", "day", "time"))
    > acast(mtips, day ~ sex + smoker, length, subset=.(variable=="tip"))
         Female_No Female_Yes Male_No Male_Yes
    Fri          2          7       2        8
    Sat         13         15      32       27
    Sun         14          4      43       15
    Thur        25          7      20       10

    > acast(mtips, sex + smoker ~ variable, mean)
               total_bill      tip     size   tip.pct
    Female_No    18.10519 2.773519 2.592593 0.1569210
    Female_Yes   17.97788 2.931515 2.242424 0.1821504
    Male_No      19.79124 3.113402 2.711340 0.1606687
    Male_Yes     22.28450 3.051167 2.500000 0.1527712

    R lacks hierarchical indexing as far as I know, so reshape2 munges the group labels together into a row label.

    In R there is also the xtabs function for doing cross-tabulation:

    > ftable(xtabs(size ~ time + sex + smoker + day, data=tips))
                         day Fri Sat Sun Thur
    time   sex    smoker                    
    Dinner Female No           2  30  43    2
                  Yes          8  33  10    0
           Male   No           4  85 124    0
                  Yes         12  71  39    0
    Lunch  Female No           3   0   0   60
                  Yes          6   0   0   17
           Male   No           0   0   0   50
                  Yes          5   0   0   23

    ftable just creates a flat table which can be more easily viewed. This table could be created with pandas by:

    In [54]: pivot_table(tips, 'size', rows=['time', 'sex', 'smoker'],
       ....:             cols='day', aggfunc=np.sum, fill_value=0)
    Out[54]:
      day                 Fri  Sat  Sun  Thur
    time   sex    smoker                    
    Dinner Female No      2    30   43   2  
                  Yes     8    33   10   0  
    Dinner Male   No      4    85   124  0  
                  Yes     12   71   39   0  
    Lunch  Female No      3    0    0    60  
                  Yes     6    0    0    17  
    Lunch  Male   No      0    0    0    50  
                  Yes     5    0    0    23

    Performance

    pandas is very fast as I’ve invested a great deal in optimizing the indexing infrastructure and other core algorithms related to things such as this. I don’t have a lot of points of comparison, but here is a simple benchmark of reshape2 versus pandas.pivot_table on a data set with 100000 entries and 25 groups. Here is the R code for the benchmark:

    library(reshape2)
    n <- 100000
    a.size <- 5
    b.size <- 5
    data <- data.frame(a=sample(letters[1:a.size], n, replace=T),
                       b=sample(letters[1:b.size], n, replace=T),
                       c=rnorm(n),
                       d=rnorm(n))
    timings <- numeric()

    for (i in 1:10) {
      gc()
      tim <- system.time(acast(melt(data, id=c("a", "b")), a + b ~ variable, mean))
      timings[i] = tim[3]
    }
    mean(timings)

    Here is what the actual table looks like:

    > table <- acast(melt(data, id=c("a", "b")), a + b ~ variable, mean)
    > head(table)
                   c            d
    a_a  0.002761963 -0.008793516
    a_b -0.004618980 -0.026978744
    a_c  0.022491767 -0.002163986
    a_d  0.005027362 -0.002532782
    a_e  0.006146438  0.031699238
    b_a -0.025094899  0.003334301

    Here’s the equivalent Python code (sans timings, which I’ll do in IPython):

    from pandas import *
    import string

    import numpy as np

    n = 100000
    asize = 5
    bsize = 5

    letters = np.asarray(list(string.letters), dtype=object)

    data = DataFrame(dict(foo=letters[:asize][np.random.randint(0, asize, n)],
                          bar=letters[:bsize][np.random.randint(0, bsize, n)],
                          baz=np.random.randn(n),
                          qux=np.random.randn(n)))

    table = pivot_table(data, rows=['foo', 'bar'])

    Similarly:

    In [36]: table = pivot_table(data, rows=['foo', 'bar'])

    In [37]: table[:10]
    Out[37]:
             baz       qux      
    foo bar                    
    a   a   -0.02162   0.02186  
        b   -0.002155  0.01173  
        c   -0.002331  0.006608
        d   -0.01734   0.01109  
        e    0.02837   0.01093  
    b   a    0.01703  -0.005845
        b   -0.001043  0.01283  
        c   -0.006015 -0.0005866
        d    0.009683  0.02211  
        e    0.02347  -8.346e-05

    And the R timing

    > mean(timings)
    [1] 0.42

    And pandas:

    In [38]: timeit table = pivot_table(data, rows=['foo', 'bar'])
    10 loops, best of 3: 117 ms per loop

    So it’s about 3-4x faster than reshape2. This obviously depends on the data set and how you structure the aggregation. Here are a couple of slightly different benchmarks on the same data:

    > acast(melt(data, id=c("a", "b")), a~b, mean, subset=.(variable=="c"))
                 a            b            c           d            e
    a  0.002761963 -0.004618980  0.022491767 0.005027362  0.006146438
    b -0.025094899  0.015966167 -0.007663059 0.006795551  0.006648496
    c  0.012273797  0.006139820 -0.019169968 0.009035374 -0.006079683
    d -0.014174747 -0.009844566  0.022450738 0.025967638 -0.006572791
    e -0.003623382  0.005924704  0.008156482 0.013288116  0.009839315

    Timing
    > mean(timings)
    [1] 0.3036

    and pandas:

    In [41]: pivot_table(data, 'baz', rows='foo', cols='bar')
    Out[41]:
      bar  a         b          c         d         e      
    foo                                                    
    a     -0.02162  -0.002155  -0.002331 -0.01734   0.02837
    b      0.01703  -0.001043  -0.006015  0.009683  0.02347
    c      0.01561   0.0003992  0.01451  -0.01046  -0.00368
    d      0.003504 -0.01437   -0.03265   0.003471 -0.004638
    e     -0.01656  -0.01276   -0.02481   0.008629  0.011  

    In [42]: timeit pivot_table(data, 'baz', rows='foo', cols='bar')
    10 loops, best of 3: 55 ms per loop

    I suspect one source of slowdown in the reshape/reshape2 approach is that the melt operation is very expensive, as is variable subsetting. However, it enables easier computation of group margins, something which I have not yet addressed (but will, eventually) in pandas.

    Performance quirk: making a 1D object ndarray of tuples

    Getting a 1-dimensional ndarray of object dtype containing Python tuples is, unless I’m missing something, rather difficult. Take this simple example:

    In [1]: tuples = zip(range(100000), range(100000))
    In [2]: arr = np.array(tuples, dtype=object, ndmin=1)
    In [3]: arr
    Out[3]:
    array([[0, 0],
           [1, 1],
           [2, 2],
           ...,
           [99997, 99997],
           [99998, 99998],
           [99999, 99999]], dtype=object)

    In [5]: arr.ndim
    Out[5]: 2

    OK, that didn’t work so well. The only way I’ve figured out how to get what I want is:

    In [6]: arr = np.empty(len(tuples), dtype='O')
    In [7]: arr[:] = tuples
    In [8]: arr
    Out[8]:
    array([(0, 0), (1, 1), (2, 2), ..., (99997, 99997), (99998, 99998),
           (99999, 99999)], dtype=object)

    Yahtzee. But the kids aren’t alright:

    In [9]: timeit arr[:] = tuples
    10 loops, best of 3: 133 ms per loop

    Maybe it’s just me but that strikes me as being outrageously slow. Someday I’ll look at what’s going on under the hood, but a quickie Cython function comes to the rescue:

    def list_to_object_array(list obj):
        '''
        Convert list to object ndarray.
        Seriously can't believe I had to write this function
        '''

        cdef:
            Py_ssize_t i, n
            ndarray[object] arr
        n = len(obj)
        arr = np.empty(n, dtype=object)
        for i from 0 <= i < n:
            arr[i] = obj[i]
        return arr

    You would hope this is faster, and indeed it’s about 85x faster:

    In [12]: timeit arr = lib.list_to_object_array(tuples)
    1000 loops, best of 3: 1.56 ms per loop

    Scratching my head here, but I’ll take it. I suspect there might be some object copying going on under the hood. Anyone know?