
Contingency tables and cross-tabulations in pandas

Someone recently asked me about creating cross-tabulations and contingency tables using pandas. I wrote a bit about this in October after implementing the pivot_table function for DataFrame. So I thought I would give a few more examples and show R code vs. the equivalent pandas code, which will be helpful for those already used to the R way of doing things.

When calling help(xtabs) (or help(table)) in R, you get a number of examples for using table, xtabs, and ftable:

Here are the same examples using pandas:
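As a minimal sketch (with made-up data, not the datasets from R's help pages), the basic `table()`/`xtabs()` workflow maps onto `pandas.crosstab`:

```python
# Equivalent of R's table(gender, handed), using illustrative toy data
import pandas as pd

df = pd.DataFrame({
    'gender': ['M', 'F', 'M', 'F', 'M', 'F'],
    'handed': ['R', 'R', 'L', 'R', 'L', 'L'],
})

# Cross-tabulate counts of gender vs. handedness
counts = pd.crosstab(df['gender'], df['handed'])
print(counts)
```

And because the result is itself a DataFrame, things like `counts.sum(1)` or `counts / counts.sum()` work exactly as they do on any other frame.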

One of the really great things about pandas is that the object produced is still a DataFrame (the R object is of a special class)! So we could do anything with it that we do with normal DataFrame objects:

Here is another fun example:

Here is the equivalent R code:

As soon as we add a formula framework to statsmodels I could see incorporating that into this kind of function in Python.

NYCPython 1/10/2012: A look inside pandas design and development

I had the privilege of speaking last night at the NYCPython meetup group. I’ve given tons of “use pandas!” talks so I thought I would take a slightly different angle and talk about some of the design and implementation work that I’ve done for getting good performance in critical data manipulations. I’ll turn some of this material into some blog articles in the near future.

Wes McKinney: pandas design and development from Adam Klein on Vimeo.

Here’s some more video footage shot by my awesome friend Emily Paup with her HDSLR.

I did a little interactive demo (using the ever-amazing IPython HTML Notebook) on Ashley Williams’s Food Nutrient JSON Database:

Link to PDF output of Demo

Link to IPython Notebook file

If you want to run the code in the IPython notebook, you’ll have to download the food database file above.

The audience and I learned from this demo that if you’re after Tryptophan, “Sea lion, Steller, meat with fat (Alaska Native)” is where to get it (in the highest density).

High performance database joins with pandas DataFrame, more benchmarks

I posted a brief article with some preliminary benchmarks for the new merge/join infrastructure that I’ve built in pandas. I compared the performance with base::merge in R which, as various folks in the R community have pointed out, is fairly slow. There are a few other more intelligently-implemented functions available in CRAN, in particular plyr::join in plyr and the merge implemented for data.table objects in the data.table package.

Lastly, Sean Taylor contributed a benchmark for SQLite3, by all accounts the most widely deployed SQL engine.

So anyway, here are the two benchmarks I’m interested in to get a sense of the large-ish data runtime of these algorithms:

  • Many-to-one joins. In these benchmarks I have an 80,000-row table with 10 copies of 8,000 key pairs and an 8,000-row table with a single copy of another 8,000 key pairs, only 6,000 of which are found in the larger table. The result of a left join between these tables should have 80,000 rows, an inner join 60,000, and an outer join 82,000.
  • Many-to-many joins. To keep things simple I use the same tables as above except the right table is the table above stacked on itself. You can do the combinatorics here, but the outer join between these two tables has 144,000 rows.
  • Note that plyr::join does not implement (or at least I’ve been told to avoid) many-to-many joins, so I only run the many-to-one benchmarks for it.
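To make the setup concrete, here is a small script that reproduces the expected result sizes with pandas.merge. It uses a single join key for simplicity (the benchmarks use key pairs), and the key names are illustrative:

```python
# Sketch of the many-to-one benchmark setup: verify that the join result
# sizes match the counts quoted above (80,000 / 60,000 / 82,000 rows).
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
keys = np.array(['k%04d' % i for i in range(10000)])

# Larger table: 10 copies each of 8,000 keys -> 80,000 rows
left = pd.DataFrame({'key': np.tile(keys[:8000], 10),
                     'value': rng.randn(80000)})
# Smaller table: 8,000 unique keys, only 6,000 shared with `left`
right = pd.DataFrame({'key': keys[2000:10000],
                      'value2': rng.randn(8000)})

print(len(pd.merge(left, right, on='key', how='left')))   # 80000
print(len(pd.merge(left, right, on='key', how='inner')))  # 60000
print(len(pd.merge(left, right, on='key', how='outer')))  # 82000
```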

    I’ve normalized the results by the minimum runtime (which is pandas in all cases):

    UPDATE (1/6/2012): Jared Lander pointed out that data.table is capable of much faster left and right joins by using the syntax left[right] instead of merge(left, right, all.y=True). I updated the benchmarks below and added the right join results which shows data.table performing very admirably.

    pandas vs R merge benchmarks

    Many-to-one:

    pandas data.table plyr base::merge
    inner 1 5.905 6.35 13.29
    outer 1 10.05 9.209 20.25
    left 1 2.849 5.918 14.93
    right 1 2.05 2.923 16.91


    Many-to-many:

    pandas data.table plyr base::merge
    inner 1 5.194 5.223 18.87
    outer 1 10 6.903 33.75
    left 1 2.528 4.688 24.46
    right 1 1.681 2.05 25.24

    SQLite3 Benchmarks

    A bit of care needs to be taken with SQLite3 benchmarks because the time to fetch the table from the database cursor (even though this is an in-memory SQLite database) is very significant. The performance as you can imagine is also quite different with and without indexes.

    Here is the basic code to insert the data into SQLite:

    import sqlite3
    create_sql_indexes = False
    conn = sqlite3.connect(':memory:')
    conn.execute('create table left( key varchar(10), key2 varchar(10), value int);')
    conn.execute('create table right( key varchar(10), key2 varchar(10), value2 int);')
    conn.executemany('insert into left values (?, ?, ?)',
                     zip(left['key'], left['key2'], left['value']))
    conn.executemany('insert into right values (?, ?, ?)',
                     zip(right['key'], right['key2'], right['value2']))
    # Create Indices
    if create_sql_indexes:
        conn.execute('create index left_ix on left(key, key2)')
        conn.execute('create index right_ix on right(key, key2)')

    With no indexes, here is a comparison of just the SQL execution time vs. the total pandas execution time for the many-to-one case described above:

    sqlite3 pandas
    inner 0.02328 0.01799
    outer 0.02324 0.01943
    left 0.02324 0.01923

    So pandas very slightly edges out SQLite with no indexes. Note that it takes ~300ms (on my machine) to fetch the data out of the database into a list of tuples, so if you consider that, pandas is really crushing it by doing everything in memory.

    The results after adding indexes are pretty indistinguishable as far as I can tell, because the fetch time (which involves creating and populating a large number of Python tuples) appears to dwarf the join time. The execute + fetch time varies between 310-340 ms for all three join types, with and without indexes, for the many-to-one case. The many-to-many case varies between 420-490 ms, whereas pandas is 22-25 ms!

    UPDATE: After some thought and discussions with people, these benchmarks are not fair to SQLite. A more appropriate benchmark would be to create the joined table inside the database using a statement like

    SELECT *
       FROM LEFT
       INNER JOIN RIGHT
          ON LEFT.key = RIGHT.key
         AND LEFT.key2 = RIGHT.key2;
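As a toy, self-contained illustration of performing the join inside SQLite rather than fetching both tables out first (the table names `ltab`/`rtab` are made up here to sidestep the LEFT/RIGHT SQL keywords; this is not the benchmark code):

```python
# Build two tiny tables in an in-memory SQLite database and join them
# on the (key, key2) pair entirely inside the database.
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('create table ltab (key text, key2 text, value int)')
conn.execute('create table rtab (key text, key2 text, value2 int)')
conn.executemany('insert into ltab values (?, ?, ?)',
                 [('a', 'x', 1), ('a', 'y', 2), ('b', 'x', 3)])
conn.executemany('insert into rtab values (?, ?, ?)',
                 [('a', 'x', 10), ('b', 'x', 30)])

rows = conn.execute("""
    SELECT *
      FROM ltab
      JOIN rtab
        ON ltab.key = rtab.key
       AND ltab.key2 = rtab.key2
""").fetchall()
print(len(rows))  # 2 matching (key, key2) pairs
```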

    Here are new benchmarks using this SQL statement:

    Many-to-one:

    pandas sqlite3
    inner 0.01674 0.04187
    outer 0.01882 0.04534
    left 0.01851 0.04514

    Many-to-many:

    pandas sqlite3
    inner 0.02091 0.06733
    outer 0.02266 0.06968
    left 0.02203 0.06882

    So pandas still significantly outperforms SQLite3 (even with SQL indexes as in these benchmarks). But it’s not totally apples-to-apples as SQLite3 is able to perform joins on extremely large data sets on disk.


    As far as I can tell, pandas now has one of the fastest in-memory database join operators out there. If you’re using Python to do relational algebra, you’d be crazy to pick SQLite3 over pandas due to the high cost of reading and writing large data sets (in the form of Python tuples) to SQL format.

    Code links

  • R code
  • pandas code
  • SQLite3 (Python) code

    Some pandas Database Join (merge) Benchmarks vs. R base::merge

    Over the last week I have completely retooled pandas’s “database” join infrastructure / algorithms in order to support the full gamut of SQL-style many-to-many merges (pandas has had one-to-one and many-to-one joins for a long time). I was curious about the performance with reasonably large data sets as compared with base::merge, which I’ve used many times in R. So I just ran a little benchmark: joining a 100,000-row DataFrame with a 10,000-row DataFrame on two keys with about 10,000 unique key combinations overall, simulating a somewhat large SQL join.

    Note this new functionality will be shipping in the upcoming 0.7.0 release (!).

    There is a major factor affecting performance of the algorithms in pandas, namely whether the result data needs to be sorted lexicographically (which is the default behavior) by the join columns. R also offers the same option so it’s completely apples to apples. These are mean runtimes in seconds:
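To see what the sort option does, here is a tiny illustration (toy frames, not the benchmark data): with `sort=True` the merged result is ordered lexicographically by the join keys, and with `sort=False` pandas can skip that work.

```python
# Demonstrate pandas.merge's `sort` option on a toy pair of frames
import pandas as pd

left = pd.DataFrame({'key': ['b', 'a', 'b'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'b'], 'rval': [10, 20]})

# sort=True orders the result by the join key
result = pd.merge(left, right, on='key', sort=True)
print(list(result['key']))  # ['a', 'b', 'b']
```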

    Sort by key columns
    pandas R Ratio
    inner 0.07819 0.2286 2.924
    outer 0.09013 1.284 14.25
    left 0.0853 0.7766 9.104
    right 0.08112 0.3371 4.156

    Don’t sort by key columns
    pandas R Ratio
    inner 0.02808 0.2297 8.18
    outer 0.03879 1.181 30.45
    left 0.0365 0.6706 18.37
    right 0.03021 0.2995 9.912

    As you can see, the sorting time in pandas completely dominates the runtime of the join (sorting 10000 strings is a lot of string comparisons). This isn’t really an indictment of R’s merge algorithm, but rather speaks to the strength of the merge strategy I devised in pandas. After spending a few days working on this problem I definitely think R could do a lot better. I haven’t run benchmarks against SQLite or another SQL database yet; that will happen eventually.

    I’m going to write a blog article in the next few days going into algorithmic detail about how I got merging pretty big data sets to be so fast–it was not easy!

    Here is the R code.

    N <- 10000
    indices <- rep(NA, N)
    for (i in 1:N)
      indices[i] <- paste(sample(letters, 10), collapse="")

    left <- data.frame(key=rep(indices, 10),
                       key2=sample(rep(indices, 10)),
                       value=rnorm(10 * N))
    right <- data.frame(key=indices,
                        key2=sample(indices),
                        value2=rnorm(N))

    timeit <- function(func, niter=10) {
      timing <- rep(NA, niter)
      for (i in 1:niter) {
        timing[i] <- system.time(func())[3]
      }
      mean(timing)
    }

    left.join <- function(sort=TRUE) {
      result <- merge(left, right, all.x=TRUE, sort=sort)
    }
    right.join <- function(sort=TRUE) {
      result <- merge(left, right, all.y=TRUE, sort=sort)
    }
    outer.join <- function(sort=TRUE) {
      result <- merge(left, right, all=TRUE, sort=sort)
    }
    inner.join <- function(sort=TRUE) {
      result <- merge(left, right, sort=sort)
    }

    sort.options <- c(FALSE, TRUE)

    results <- matrix(nrow=4, ncol=2)
    colnames(results) <- c("dont_sort", "sort")
    rownames(results) <- c("inner", "outer", "left", "right")

    join.functions <- c(inner.join, outer.join, left.join, right.join)
    for (i in 1:4) {
      results[i, 1] <- timeit(function() join.functions[[i]](sort=sort.options[1]))
      results[i, 2] <- timeit(function() join.functions[[i]](sort=sort.options[2]))
    }

    And the Python code

    import gc
    import time

    import numpy as np
    from pandas import DataFrame, merge, read_table
    from pandas.util.testing import rands

    N = 10000
    indices = np.array([rands(10) for _ in xrange(N)], dtype='O')

    key = np.tile(indices, 10)
    key2 = key.copy()
    indices2 = indices.copy()
    left = DataFrame({'key' : key, 'key2':key2,
                      'value' : np.random.randn(100000)})
    right = DataFrame({'key': indices, 'key2':indices2,
                       'value2' : np.random.randn(10000)})
    join_methods = ['inner', 'outer', 'left', 'right']
    results = DataFrame(index=join_methods, columns=[False, True])
    niter = 10
    for sort in [False, True]:
        for join_method in join_methods:
            f = lambda: merge(left, right, how=join_method, sort=sort)
            gc.disable()
            start = time.time()
            for _ in xrange(niter):
                f()
            elapsed = (time.time() - start) / niter
            gc.enable()
            results[sort][join_method] = elapsed
    results.columns = ['dont_sort', 'sort']

    # R results
    from StringIO import StringIO
    r_results = read_table(StringIO("""dont_sort   sort
    inner    0.2297 0.2286
    outer    1.1811 1.2843
    left     0.6706 0.7766
    right    0.2995 0.3371
    """), sep='\s+')

    sort_results = DataFrame.from_items([('pandas', results['sort']),
                                         ('R', r_results['sort'])])
    sort_results['Ratio'] = sort_results['R'] / sort_results['pandas']

    nosort_results = DataFrame.from_items([('pandas', results['dont_sort']),
                                           ('R', r_results['dont_sort'])])
    nosort_results['Ratio'] = nosort_results['R'] / nosort_results['pandas']

    Introducing vbench, new code performance analysis and monitoring tool

    Do you know how fast your code is? Is it faster than it was last week? Or a month ago? How do you know if you accidentally made a function slower by changes elsewhere? Unintentional performance regressions are extremely common in my experience: it’s hard to unit test the performance of your code. Over time I have gotten tired of playing the game of “performance whack-a-mole”. Thus, I started hacking together a little weekend project that I’m calling vbench. If someone thinks up a cleverer name, I’m all ears.

    Link to pandas benchmarks page produced using vbench

    What is vbench?

    vbench is a super-lightweight Python library for running a collection of performance benchmarks over the course of your source repository’s history. Since I’m a GitHub user, it only does git for now, but it could be generalized to support other VCSs. Basically, you define a benchmark:

    from datetime import datetime
    from vbench.api import Benchmark

    common_setup = """
    from pandas import *
    import pandas.util.testing as tm
    import random
    import numpy as np
    """

    setup = common_setup + """
    N = 100000
    ngroups = 100

    def get_test_data(ngroups=100, n=N):
        unique_groups = range(ngroups)
        arr = np.asarray(np.tile(unique_groups, n / ngroups), dtype=object)

        if len(arr) < n:
            arr = np.asarray(list(arr) + unique_groups[:n - len(arr)],
                             dtype=object)

        return arr

    df = DataFrame({'key1' : get_test_data(ngroups=ngroups),
                    'key2' : get_test_data(ngroups=ngroups),
                    'data' : np.random.randn(N)})
    def f():
        df.groupby(['key1', 'key2']).agg(lambda x: x.values.sum())
    """

    stmt2 = "df.groupby(['key1', 'key2']).sum()"
    bm_groupby2 = Benchmark(stmt2, setup, name="GroupBy test 2",
                            start_date=datetime(2011, 7, 1))

    Then you write down the information about your repository and how to build any relevant DLLs, etc., that vary from revision to revision:

    REPO_PATH = '/home/wesm/code/pandas'
    REPO_URL = ''
    DB_PATH = '/home/wesm/code/pandas/gb_suite/benchmarks.db'
    TMP_DIR = '/home/wesm/tmp/gb_pandas'
    PREPARE = """
    python setup.py clean
    """
    BUILD = """
    python setup.py build_ext --inplace
    """
    START_DATE = datetime(2011, 3, 1)

    Then you pass this info, plus a list of your benchmark objects, to the BenchmarkRunner class:

    runner = BenchmarkRunner(benchmarks, REPO_PATH, REPO_URL,
                             BUILD, DB_PATH, TMP_DIR, PREPARE,
                             run_option='eod', start_date=START_DATE)

    Now, the BenchmarkRunner makes a clone of your repo, then runs all of the benchmarks once for each revision in the repository (or some other rule; e.g., I’ve set run_option='eod' to only take the last snapshot on each day). It persists the results in a SQLite database so that you can rerun the process and it will skip benchmarks it’s already run (this is key: when you add new benchmarks, only the new ones will be run). Benchmarks are uniquely identified by the MD5 hash of their source code.
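The checksum idea can be sketched in a few lines (illustrative only, not vbench's actual implementation): key cached results by the MD5 of the benchmark's source, so an edited benchmark gets a new key and is automatically re-run, while unchanged ones are skipped.

```python
# Sketch: cache benchmark results keyed by the MD5 of their source code
import hashlib

def benchmark_id(source):
    return hashlib.md5(source.encode('utf-8')).hexdigest()

db = {}  # stands in for the SQLite results table

def run_if_needed(source, run):
    key = benchmark_id(source)
    if key not in db:       # skip benchmarks that have already been run
        db[key] = run()
    return db[key]

calls = []
src = "df.groupby(['key1', 'key2']).sum()"
run_if_needed(src, lambda: calls.append(1) or 1.23)
run_if_needed(src, lambda: calls.append(1) or 9.99)  # cached: not re-run
print(len(calls))  # 1
```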

    This is the resulting plot over time for the above GroupBy benchmark related to some Cython code that I worked on late last week (where I made a major performance improvement in this case):

    Here is a fully-formed vbench suite in the pandas git repository.

    Kind of like codespeed and speed.pypy.org

    Before starting to write a new project I looked briefly at codespeed and the excellent work that the PyPy guys have done with speed.pypy.org. But then I started thinking: a Django web application, JSON requests to upload benchmark results? It seemed like far too much work for something relatively simple. The dealbreaker is that codespeed is just a web application. It doesn’t actually (to my knowledge, someone correct me if I’m wrong?) have any kind of framework for orchestrating the running of benchmarks throughout your code history. That is what this new project is for. I actually see a natural connection between vbench and codespeed: all you need to do is write a script to upload your vbench results to a codespeed web app!

    At some point I’d like to build a simple web front end or wx/Qt viewer for the generated vbench database. I’ve never done any JavaScript, but it would be a good opportunity to learn. Knowing me, I might break down and hack out a stupid little wxPython app with an embedded matplotlib widget anyway.

    Anyway, I’m really excited about this project. It’s very prototype-y at the moment but I tried to design it in a nice and extensible way. I also plan to put all my git repo analysis tools in there (like code churn graphs etc.) so it should become a nice little collection of tools.

    Other ideas for extending

  • Dealing with API changes
  • Multiple library versions (e.g. NumPy) or Python versions

    Formatting DataFrame as HTML

    Recently, a pandas user, Wouter Overmeire, contributed a to_html function to DataFrame in a pull request. It’s very handy:

    In [31]: df
                            ls    lsc   pop   ccode
    year cname     agefrom                        
    1950 Australia 15       64.3  15.4  558   AUS  
                   40       38.9  20.1  555   AUS  
                   65       24.7  10.7  273   AUS  
    1955 Australia 25       48.4  26.2  705   AUS  
                   50       34    16.6  478   AUS  
                   75       26.3  11.7  240   AUS  
    1960 Australia 35       47.9  24.7  755   AUS  
                   60       29.6  13.5  389   AUS  
    1965 Australia 20       57.7  31.1  832   AUS  
                   45       41.3  20.3  665   AUS  
                   70       30.2  12.9  273   AUS  
    1970 Australia 30       57.7  31.1  789   AUS  
                   55       42.1  19.4  619   AUS  
    1975 Australia 15       91.1  23.1  1225  AUS  
                   40       56.3  30.9  744   AUS  
                   65       48.8  22.5  440   AUS  
    1980 Australia 25       74    44.5  1205  AUS  
                   50       60.2  32.8  774   AUS  
                   75       62.5  30.8  500   AUS  
    1985 Australia 35       72.7  42.8  1180  AUS  

    In [32]: df.to_html()

    which outputs:

    ls lsc pop ccode
    year cname agefrom
    1950 Australia 15 64.3 15.4 558 AUS
    1955 Australia 50 34 16.6 478 AUS
    1965 Australia 20 57.7 31.1 832 AUS
    1970 Australia 55 42.1 19.4 619 AUS
    1980 Australia 25 74 44.5 1205 AUS
    1985 Australia 60 59.1 30.8 685 AUS
    1995 Australia 30 72 45.7 1430 AUS
    2000 Australia 65 60 31.5 663 AUS
    2010 Australia 35 59.9 39.1 1506 AUS
    1950 Austria 70 2.8 1.9 218 AUT
    1960 Austria 40 4.5 3 324 AUT
    1965 Austria 75 13 8.6 322 AUT
    1975 Austria 45 42.1 28 455 AUT
    1985 Austria 15 22.4 4.2 626 AUT
    1990 Austria 50 47.3 23.4 448 AUT
    2000 Austria 20 73.2 63.2 465 AUT
    2005 Austria 55 55.1 36.2 486 AUT

    It also renders a tidy representation with hierarchical columns:

    In [14]: table
      sex     Female          Male          
      smoker  No      Yes     No      Yes  
    Fri       0.1653  0.2091  0.138   0.1447
    Sat       0.148   0.1638  0.1621  0.1391
    Sun       0.1657  0.2371  0.1583  0.174
    Thur      0.156   0.1631  0.1657  0.1644


    Talk at Rice Stats on structured data analysis, pandas, 11/21/2011

    I had the privilege of speaking today at Rice at Hadley Wickham’s (of R fame) invitation. I talked broadly about the problems faced in structured data manipulation problems and how I’ve worked to address them in pandas. I enjoyed speaking with Hadley (whom I’d never met before in person) about these problems as we’ve come up with nice solutions to many of them independently, and in different languages (R and Python, respectively). A lot of folks assume that parts of pandas are inspired by his plyr and reshape2 packages, but I’ve only started playing around with them recently. I think we all have a lot we can learn from each other on the road toward building even better high performance, easy-to-use, and expressive data manipulation tools.

    pandas talk at PyHPC 2011 workshop in SC11, thoughts on hash tables

    Here are the slides from my talk at PyHPC2011. Not really my usual talk for data crowds– a little more nuts and bolts about some of the indexing and GroupBy implementation details. Some people might be interested in the illustrative data alignment benchmarks which show the relative weakness of Python’s dict implementation (in both speed and memory usage) for lookups and alignment. After these benchmarks I think it’s pretty much inevitable that I’m going to end up writing a custom hash table implementation in C for the data alignment on primitive types. Now, if I wanted a threadsafe hash table that I could use OpenMP on, that would be a serious undertaking. Anyone want to help?

    The basic problem is that Python dicts are not designed for my use case– namely very large dicts that I use to perform data alignment operations.
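The alignment use case can be sketched as follows: a dict maps each label in one index to its integer position, so another index can be translated into an indexer in one pass. This is a toy version of the alignment step (not pandas internals), but it shows why lookup speed and memory on very large dicts is the bottleneck:

```python
# Toy label alignment: map labels to positions, then translate another
# index into integer positions, with -1 marking missing labels.
labels = ['a', 'b', 'c', 'd']
other = ['d', 'b', 'e']

positions = dict((lab, i) for i, lab in enumerate(labels))
indexer = [positions.get(lab, -1) for lab in other]
print(indexer)  # [3, 1, -1]
```

With millions of labels, the dict build and the per-label lookups dominate, which is where a specialized C hash table for primitive types pays off.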


    Filtering out duplicate pandas.DataFrame rows

    Sean Taylor recently alerted me to the fact that there wasn’t an easy way to filter out duplicate rows in a pandas DataFrame. R has the duplicated function which serves this purpose quite nicely. The R method’s implementation is kind of kludgy in my opinion (from “The data frame method works by pasting together a character representation of the rows”), but in any case I set about writing a Python version from first principles.

    So a drop_duplicates method should be able to either consider a subset of the columns or all of the columns for determining which are “duplicates”. It occurred to me that a reasonably fast and efficient way to do this was to use GroupBy. You have to know a bit about the internals, but it’s fairly straightforward otherwise:

    grouped = df.groupby(cols_to_consider)
    index = [gp_keys[0] for gp_keys in grouped.groups.values()]
    unique_df = df.reindex(index)
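Here is the recipe applied to a tiny frame. The groups dict is unordered, so this version sorts the collected index before reindexing to keep the original row order stable:

```python
# Deduplicate rows on a subset of columns using GroupBy: take the first
# row label from each group, then reindex the frame on those labels.
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a', 'b'],
                   'B': ['x', 'x', 'x', 'y'],
                   'C': [1, 2, 3, 4]})

grouped = df.groupby(['A', 'B'])
index = sorted(gp_keys[0] for gp_keys in grouped.groups.values())
unique_df = df.reindex(index)
print(unique_df)
```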

    For a faster and more direct implementation in Cython, I decided to use a list of tuples and a dict to keep track of whether a row has been “seen” or not:

    import numpy as np
    from numpy cimport ndarray, uint8_t

    def duplicated(list values, take_last=False):
        cdef:
            Py_ssize_t i, n
            dict seen = {}
            object row

        n = len(values)
        cdef ndarray[uint8_t] result = np.zeros(n, dtype=np.uint8)

        if take_last:
            for i from n > i >= 0:
                row = values[i]
                if row in seen:
                    result[i] = 1
                else:
                    seen[row] = None
                    result[i] = 0
        else:
            for i from 0 <= i < n:
                row = values[i]
                if row in seen:
                    result[i] = 1
                else:
                    seen[row] = None
                    result[i] = 0

        return result.view(np.bool_)

    Aside: the uint8 and casting to np.bool_ thing is there because the buffer interface doesn’t work quite right with boolean arrays in Python 2.5. So as soon as I drop Python 2.5 compatibility (probably around the end of the year), I’ll go through and fix that stuff up. As a second aside, using a dict with dummy keys was coming out a bit faster than using a set in Cython, for reasons unknown to me.

    So, on the Python side, the new DataFrame function just takes the boolean vector returned by that Cython function and uses it to select out the rows:

    In [21]: df
        A  B  C
    0   a  c  0
    1   b  c  1
    2   c  b  2
    3   d  a  0
    4   e  c  1
    5   a  a  2
    6   b  b  0
    7   c  b  1
    8   d  b  2
    9   e  b  0
    10  a  a  1
    11  b  a  2
    12  c  c  0
    13  d  a  1
    14  e  c  2

    In [22]: df.drop_duplicates('A')
       A  B  C
    0  a  c  0
    1  b  c  1
    2  c  b  2
    3  d  a  0
    4  e  c  1

    In [23]: df.drop_duplicates(['A', 'B'])
        A  B  C
    0   a  c  0
    1   b  c  1
    2   c  b  2
    3   d  a  0
    4   e  c  1
    5   a  a  2
    6   b  b  0
    8   d  b  2
    9   e  b  0
    11  b  a  2
    12  c  c  0

    Not passing any particular column or columns is the same as passing all of them.
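A quick check of that equivalence on a toy frame:

```python
# drop_duplicates() with no arguments considers all columns, the same
# as passing every column name explicitly.
import pandas as pd

df = pd.DataFrame({'A': ['a', 'a', 'b'],
                   'B': [1, 1, 2]})

implicit = df.drop_duplicates()
explicit = df.drop_duplicates(['A', 'B'])
print(implicit.equals(explicit))  # True
```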

    I was impressed by the performance:

    In [26]: from pandas.util.testing import rands

    In [27]: k1 = [rands(1) for _ in xrange(100000)]

    In [28]: k2 = [rands(1) for _ in xrange(100000)]

    In [29]: df = DataFrame({'A' : k1, 'B' : k2, 'C' : np.tile(np.arange(100), 1000)})

    In [30]: timeit df.drop_duplicates(['A', 'B'])
    100 loops, best of 3: 18.2 ms per loop

    In [31]: timeit df.drop_duplicates()
    10 loops, best of 3: 49.8 ms per loop

    # the groupby method
    In [32]: timeit df.reindex([v[0] for v in df.groupby(['A', 'B']).groups.values()])
    10 loops, best of 3: 56.5 ms per loop

    For reference, it’s roughly twice as fast as the equivalent R code (though I’m still using R 2.11.1…time to upgrade to 2.14!), but perhaps not too shocking as the R function is converting each row into a string first.

    N <- 100000

    k1 <- rep(NA, N)
    k2 <- rep(NA, N)
    for (i in 1:N) {
      k1[i] <- paste(sample(letters, 1), collapse="")
      k2[i] <- paste(sample(letters, 1), collapse="")
    }
    df <- data.frame(a=k1, b=k2, c=rep(1:100, N / 100))
    df2 <- data.frame(a=k1, b=k2)

    timings <- numeric()
    timings2 <- numeric()
    for (i in 1:50) {
      timings[i] <- system.time(deduped <- df[!duplicated(df),])[3]
      timings2[i] <- system.time(deduped2 <- df[!duplicated(df[,c("a", "b")]),])[3]
    }

    > mean(timings)
    [1] 0.09312
    > mean(timings2)
    [1] 0.04312

    PyHPC 2011 Pre-print paper on pandas

    I’ll be presenting this paper in a couple of weeks at SC11. It’s the most up-to-date white paper about pandas.

    PDF Download Link

    pandas: a Foundational Python Library for Data Analysis and Statistics

    © 2015 Wes McKinney
