Monthly Archives: January 2012

Even easier frequency tables in pandas 0.7.0

I put in a little work on a new crosstab function in the main pandas namespace. It’s basically a convenient shortcut to calling pivot_table that makes it easy to compute cross-tabulations for a set of factors, whether they live in a pandas DataFrame or in vanilla NumPy arrays!
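
To make the relationship concrete, here is a rough sketch (my own illustration, not the actual implementation) of what crosstab computes, expressed as a group count:

Python
import numpy as np
from pandas import DataFrame, crosstab

a = np.random.randint(0, 2, size=20)
b = np.random.randint(0, 3, size=20)

# crosstab counts how often each (a, b) level combination occurs
table = crosstab(a, b, rownames=['a'], colnames=['b'])

# which should give the same counts as grouping and unstacking
df = DataFrame({'a': a, 'b': b})
counts = df.groupby(['a', 'b']).size().unstack('b').fillna(0)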

Here’s an example:

Python
In [1]: from pandas import *

In [2]: df = DataFrame({'a' : np.random.randint(0, 2, size=20),
   ...:                 'b' : np.random.randint(0, 5, size=20),
   ...:                 'c' : np.random.randint(0, 3, size=20)})

In [3]: df
Out[3]:
    a  b  c
0   0  1  2
1   1  4  1
2   1  4  2
3   1  2  2
4   0  0  0
5   0  0  2
6   1  2  2
7   1  2  0
8   1  1  1
9   1  2  0
10  1  4  1
11  1  1  2
12  0  1  2
13  1  1  0
14  1  4  1
15  1  4  0
16  0  4  1
17  0  3  1
18  1  0  1
19  0  4  1

In [4]: crosstab(df['b'], [df['a'], df['c']])
Out[4]:
a  0        1
c  0  1  2  0  1  2
b
0  1  0  1  0  1  0
1  0  0  2  1  1  1
2  0  0  0  2  0  2
3  0  1  0  0  0  0
4  0  2  0  1  3  1

This makes it a snap to produce an easy-on-the-eyes frequency table. crosstab can also take NumPy arrays directly. Suppose we have 1 million draws from a normal distribution and we wish to produce a histogram-like table showing the number of draws whose absolute values fall into the bins defined by [0, 1, 2, 3]. Also, let’s divide things up by sign. Using crosstab and np.digitize (which returns the index of the bin each value falls into, so bucket 4 collects values of 3 or greater) this becomes dead simple.

Python
In [19]: arr = np.random.randn(1000000)

In [20]: buckets = [0, 1, 2, 3]

In [22]: crosstab(np.digitize(np.abs(arr), buckets), np.sign(arr),
   ....:          rownames=['bucket'], colnames=['sign'])
Out[22]:
sign      -1.0     1.0
bucket
1       341678  340498
2       136104  135999
3        21424   21607
4         1334    1356

Of course, since the result is a DataFrame, you can customize the index and columns:

Python
In [28]: table = crosstab(np.digitize(np.abs(arr), buckets),
   ....:                  np.sign(arr), rownames=['bucket'],
   ....:                  colnames=['sign'])

In [29]: table.index = Index(['[0, 1)', '[1, 2)', '[2, 3)',
   ....:                      '[3, inf)'], name="bin")

In [30]: table.columns = Index(['negative', 'positive'], name="sign")

In [31]: table
Out[31]:
sign      negative  positive
bin
[0, 1)      341678    340498
[1, 2)      136104    135999
[2, 3)       21424     21607
[3, inf)      1334      1356

Contingency tables and cross-tabulations in pandas

Someone recently asked me about creating cross-tabulations and contingency tables using pandas. I wrote a bit about this in October after implementing the pivot_table function for DataFrame, so I thought I would give a few more examples and show R code alongside the equivalent pandas code, which will be helpful for those already used to the R way of doing things.

When calling help(xtabs) (or help(table)) in R, you get a number of examples for using table, xtabs, and ftable:

> head(esoph)
  agegp     alcgp    tobgp ncases ncontrols
1 25-34 0-39g/day 0-9g/day      0        40
2 25-34 0-39g/day    10-19      0        10
3 25-34 0-39g/day    20-29      0         6
4 25-34 0-39g/day      30+      0         5
5 25-34     40-79 0-9g/day      0        27
6 25-34     40-79    10-19      0         7
> ftable(xtabs(cbind(ncases, ncontrols) ~ agegp + tobgp, data = esoph))
               ncases ncontrols
agegp tobgp
25-34 0-9g/day      0        70
      10-19         1        19
      20-29         0        11
      30+           0        16
35-44 0-9g/day      2       109
      10-19         4        46
      20-29         3        27
      30+           0        17
45-54 0-9g/day     14       104
      10-19        13        57
      20-29         8        33
      30+          11        19
55-64 0-9g/day     25       117
      10-19        23        65
      20-29        12        38
      30+          16        22
65-74 0-9g/day     31        99
      10-19        12        38
      20-29        10        20
      30+           2         4
75+   0-9g/day      6        26
      10-19         5        11
      20-29         0         3
      30+           2         4
> ftable(xtabs(cbind(ncases, ncontrols) ~ agegp, data = esoph))
      ncases ncontrols
agegp
25-34      1       116
35-44      9       199
45-54     46       213
55-64     76       242
65-74     55       161
75+       13        44

Here are the same examples using pandas:

Python
In [19]: table = esoph.pivot_table(['ncases', 'ncontrols'],
                                   rows=['agegp', 'tobgp'],
                                   aggfunc=np.sum)

In [20]: table
Out[20]:
                ncases  ncontrols
agegp tobgp
25-34 0-9g/day       0         70
      10-19          1         19
      20-29          0         11
      30+            0         16
35-44 0-9g/day       2        109
      10-19          4         46
      20-29          3         27
      30+            0         17
45-54 0-9g/day      14        104
      10-19         13         57
      20-29          8         33
      30+           11         19
55-64 0-9g/day      25        117
      10-19         23         65
      20-29         12         38
      30+           16         22
65-74 0-9g/day      31         99
      10-19         12         38
      20-29         10         20
      30+            2          4
75+   0-9g/day       6         26
      10-19          5         11
      20-29          0          3
      30+            2          4

In [22]: table2 = esoph.pivot_table(['ncases', 'ncontrols'],
                                    rows='agegp', aggfunc=np.sum)

In [23]: table2
Out[23]:
       ncases  ncontrols
agegp
25-34       1        116
35-44       9        199
45-54      46        213
55-64      76        242
65-74      55        161
75+        13         44

One of the really great things about pandas is that the object produced is still a DataFrame (the R object is of a special class), so we can do anything with it that we would do with normal DataFrame objects:

Python
8
In [26]: table['ncases'].unstack('agegp')
Out[26]:
agegp     25-34  35-44  45-54  55-64  65-74  75+
tobgp
0-9g/day      0      2     14     25     31    6
10-19         1      4     13     23     12    5
20-29         0      3      8     12     10    0
30+           0      0     11     16      2    2

Here is another fun example:

Python
In [31]: import pandas.rpy.common as com

In [32]: wp = com.load_data('warpbreaks')

In [33]: wp['replicate'] = np.tile(range(1, 10), 6)

In [34]: wp.pivot_table('breaks', rows=['wool', 'tension'],
                        cols='replicate', aggfunc=np.sum)
Out[34]:
replicate      1   2   3   4   5   6   7   8   9
wool tension
A    H        36  21  24  18  10  43  28  15  26
     L        26  30  54  25  70  52  51  26  67
     M        18  21  29  17  12  18  35  30  36
B    H        20  21  24  17  13  15  15  16  28
     L        27  14  29  19  29  31  41  20  44
     M        42  26  19  16  39  28  21  39  29

Here is the equivalent R code:

> warpbreaks$replicate <- rep(1:9, len = 54)
> ftable(xtabs(breaks ~ wool + tension + replicate, data = warpbreaks))
             replicate  1  2  3  4  5  6  7  8  9
wool tension
A    L                 26 30 54 25 70 52 51 26 67
     M                 18 21 29 17 12 18 35 30 36
     H                 36 21 24 18 10 43 28 15 26
B    L                 27 14 29 19 29 31 41 20 44
     M                 42 26 19 16 39 28 21 39 29
     H                 20 21 24 17 13 15 15 16 28

As soon as we add a formula framework to statsmodels I could see incorporating that into this kind of function in Python.

NYCPython 1/10/2012: A look inside pandas design and development

I had the privilege of speaking last night at the NYCPython meetup group. I’ve given tons of “use pandas!” talks so I thought I would take a slightly different angle and talk about some of the design and implementation work that I’ve done for getting good performance in critical data manipulations. I’ll turn some of this material into some blog articles in the near future.

Wes McKinney: pandas design and development from Adam Klein on Vimeo.

Here’s some more video footage shot by my awesome friend Emily Paup with her HDSLR.

I did a little interactive demo (using the ever-amazing IPython HTML Notebook) on Ashley Williams’s Food Nutrient JSON Database:

Link to PDF output of Demo

Link to IPython Notebook file

If you want to run the code in the IPython notebook, you’ll have to download the food database file above.

The audience and I learned from this demo that if you’re after Tryptophan, “Sea lion, Steller, meat with fat (Alaska Native)” is where to get it (in the highest density).
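
For the curious, here is a rough sketch of how a lookup like that might be done against the JSON database. This is not the notebook’s actual code; the filename and the field names ('description', 'nutrients', 'value') are my assumptions about the schema, so adjust them to the real file.

Python
import json

# hypothetical filename -- use the database file linked above
db = json.load(open('foods.json'))

matches = []
for food in db:
    # assumed schema: each food record carries a list of nutrient records
    for nutrient in food['nutrients']:
        if nutrient['description'] == 'Tryptophan':
            matches.append((nutrient['value'], food['description']))

# the food with the highest tryptophan value per reference amount
density, food_name = max(matches)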

High performance database joins with pandas DataFrame, more benchmarks

I posted a brief article with some preliminary benchmarks for the new merge/join infrastructure that I’ve built in pandas. I compared the performance with base::merge in R which, as various folks in the R community have pointed out, is fairly slow. There are a few other more intelligently implemented functions available on CRAN, in particular plyr::join in plyr and the merge method implemented for data.table objects in the data.table package.

Lastly, Sean Taylor contributed a benchmark for SQLite3, by all accounts the most widely deployed SQL engine.

So anyway, here are the two benchmarks I’m interested in to get a sense of the large-ish data runtime of these algorithms:

  • Many-to-one joins. In these benchmarks I have an 80,000-row table with 10 copies of 8,000 key pairs and an 8,000-row table with a single copy of another 8,000 key pairs, only 6,000 of which are found in the larger table. The result of a left join between these tables should have 80,000 rows, an inner join 60,000, and an outer join 82,000. (A sketch of this setup appears after this list.)
  • Many-to-many joins. To keep things simple I use the same tables as above, except the right table is the table above stacked on itself. You can do the combinatorics here, but the outer join between these two tables has 144,000 rows.
  • Note that plyr::join does not implement (or at least I’ve been told to avoid) many-to-many joins, so I only run the many-to-one benchmarks for it.
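
    Here is a sketch of the many-to-one setup, my own reconstruction of the tables described above (using a single key column for simplicity rather than key pairs; rands comes from pandas.util.testing, as in the benchmark code at the end of this post):

    import numpy as np
    from pandas import DataFrame, merge
    from pandas.util.testing import rands

    # 8,000 random 10-character keys (assumed distinct)
    keys = np.array([rands(10) for _ in xrange(8000)], dtype=object)

    # left: 80,000 rows = 10 copies of the 8,000 keys
    left = DataFrame({'key': np.tile(keys, 10),
                      'value': np.random.randn(80000)})

    # right: 8,000 unique keys, only 6,000 of which appear in left
    right_keys = np.concatenate([keys[:6000],
                                 np.array([rands(10) for _ in xrange(2000)],
                                          dtype=object)])
    right = DataFrame({'key': right_keys,
                       'value2': np.random.randn(8000)})

    # row counts described above
    assert len(merge(left, right, how='left')) == 80000
    assert len(merge(left, right, how='inner')) == 60000
    assert len(merge(left, right, how='outer')) == 82000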

    I’ve normalized the results by the minimum runtime (which is pandas in all cases):

    UPDATE (1/6/2012): Jared Lander pointed out that data.table is capable of much faster left and right joins by using the syntax left[right] instead of merge(left, right, all.y=TRUE). I updated the benchmarks below and added the right join results, which show data.table performing very admirably.

    pandas vs R merge benchmarks

    Many-to-one

            pandas  data.table   plyr  base::merge
    inner        1       5.905   6.35        13.29
    outer        1       10.05  9.209        20.25
    left         1       2.849  5.918        14.93
    right        1        2.05  2.923        16.91

    Many-to-many

            pandas  data.table   plyr  base::merge
    inner        1       5.194  5.223        18.87
    outer        1          10  6.903        33.75
    left         1       2.528  4.688        24.46
    right        1       1.681   2.05        25.24

    SQLite3 Benchmarks

    A bit of care needs to be taken with SQLite3 benchmarks because the time to fetch the table from the database cursor (even though this is an in-memory SQLite database) is very significant. The performance as you can imagine is also quite different with and without indexes.

    Here is the basic code to insert the data into SQLite (key, key2, left, and right are the key arrays and DataFrames from the benchmark tables described above):

    import sqlite3
    create_sql_indexes = False
    conn = sqlite3.connect(':memory:')
    conn.execute('create table left( key varchar(10), key2 varchar(10), value int);')
    conn.execute('create table right( key varchar(10), key2 varchar(10), value2 int);')
    conn.executemany('insert into left values (?, ?, ?)',
                     zip(key, key2, left['value']))
    conn.executemany('insert into right values (?, ?, ?)',
                     zip(right['key'], right['key2'], right['value2']))
    # Create Indices
    if create_sql_indexes:
        conn.execute('create index left_ix on left(key, key2)')
        conn.execute('create index right_ix on right(key, key2)')

    With no indexes, here is a comparison of just the SQL execution time vs. the total pandas execution time for the many-to-one case described above:


           sqlite3   pandas
    inner  0.02328  0.01799
    outer  0.02324  0.01943
    left   0.02324  0.01923

    So pandas very slightly edges out SQLite with no indexes. Note that it takes ~300ms (on my machine) to fetch the data out of the database into a list of tuples, so if you consider that, pandas is really crushing it by doing everything in memory.

    The results after adding indexes are pretty indistinguishable as far as I can tell, because the fetch time (which involves creating and populating a large number of Python tuples) appears to dwarf the join time. The execute + fetch time varies between 310 and 340 ms for all three join types, with and without indexes, for the many-to-one case. The many-to-many case varies between 420 and 490 ms, whereas pandas takes 22-25 ms!
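
    Here is a sketch of how the execute + fetch time can be measured (an illustration of the approach, not Sean's actual harness):

    import time

    def time_execute_fetch(conn, sql, niter=10):
        # average seconds to run `sql` and pull every row back as Python tuples
        start = time.time()
        for _ in xrange(niter):
            rows = conn.execute(sql).fetchall()
        return (time.time() - start) / niter

    # e.g., with the in-memory connection created in the snippet above:
    # time_execute_fetch(conn, """select * from left inner join right
    #                             on left.key = right.key
    #                             and left.key2 = right.key2""")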

    UPDATE: After some thought and discussions with people, these benchmarks are not fair to SQLite. A more appropriate benchmark would be to create the joined table inside the database using a statement like

    CREATE TABLE test AS
    SELECT *
       FROM LEFT
       INNER JOIN RIGHT
         ON LEFT.KEY=RIGHT.KEY
           AND LEFT.key2 = RIGHT.key2;

    Here are new benchmarks using this SQL statement:

    Many-to-one

            pandas  sqlite3
    inner  0.01674  0.04187
    outer  0.01882  0.04534
    left   0.01851  0.04514

    Many-to-many

            pandas  sqlite3
    inner  0.02091  0.06733
    outer  0.02266  0.06968
    left   0.02203  0.06882

    So pandas still significantly outperforms SQLite3 (even with SQL indexes, as in these benchmarks). But it’s not a totally apples-to-apples comparison, as SQLite3 is able to perform joins on extremely large data sets on disk.

    Conclusions

    As far as I can tell, pandas now has one of the fastest in-memory database join operators out there. If you’re using Python to do relational algebra, you’d be crazy to pick SQLite3 over pandas due to the high cost of reading and writing large data sets (in the form of Python tuples) to SQL format.

    Code links

  • R code
  • pandas code
  • SQLite3 (Python) code

    Some pandas Database Join (merge) Benchmarks vs. R base::merge

    Over the last week I have completely retooled pandas’s “database” join infrastructure / algorithms in order to support the full gamut of SQL-style many-to-many merges (pandas has had one-to-one and many-to-one joins for a long time). I was curious about the performance on reasonably large data sets compared with base::merge.data.frame, which I’ve used many times in R. So I ran a little benchmark joining a 100,000-row DataFrame with a 10,000-row DataFrame on two keys, with about 10,000 unique key combinations overall, simulating a somewhat large SQL join.

    Note this new functionality will be shipping in the upcoming 0.7.0 release (!).

    There is a major factor affecting the performance of the algorithms in pandas, namely whether the result data needs to be sorted lexicographically (which is the default behavior) by the join columns. R offers the same option, so it’s completely apples to apples. These are mean runtimes in seconds:


    Sort by key columns

            pandas       R  Ratio
    inner  0.07819  0.2286  2.924
    outer  0.09013   1.284  14.25
    left    0.0853  0.7766  9.104
    right  0.08112  0.3371  4.156

    Don't sort by key columns

            pandas       R  Ratio
    inner  0.02808  0.2297   8.18
    outer  0.03879   1.181  30.45
    left    0.0365  0.6706  18.37
    right  0.03021  0.2995  9.912

    As you can see, the sorting time in pandas completely dominates the runtime of the join (sorting 10000 strings is a lot of string comparisons). This isn’t really an indictment of R’s merge algorithm, but rather speaks to the strength of the merge strategy I devised in pandas. After spending a few days working on this problem I definitely think R could do a lot better. I haven’t run benchmarks against SQLite or another SQL database yet; that will happen eventually.
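
    For reference, the sorting behavior being benchmarked is controlled by merge’s sort keyword; here is a minimal usage sketch (toy frames, not the benchmark data):

    from pandas import DataFrame, merge

    left = DataFrame({'key': ['b', 'a', 'c'], 'value': [1, 2, 3]})
    right = DataFrame({'key': ['c', 'b', 'd'], 'value2': [4, 5, 6]})

    # result sorted lexicographically by the join key (the default)
    merge(left, right, how='outer', sort=True)

    # skip the sort for a faster join (row order is not sorted by key)
    merge(left, right, how='outer', sort=False)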

    I’m going to write a blog article in the next few days going into algorithmic detail about how I got merging pretty big data sets to be so fast–it was not easy!

    Here is the R code.

    N <- 10000
    indices = rep(NA, N)
    for (i in 1:N)
      indices[i] <- paste(sample(letters, 10), collapse="")

    left <- data.frame(key=rep(indices, 10),
                       key2=sample(rep(indices, 10)),
                       value=rnorm(100000))
    right <- data.frame(key=indices,
                        key2=sample(indices),
                        value2=rnorm(10000))
    timeit <- function(func, niter=10) {
      timing = rep(NA, niter)
      for (i in 1:niter) {
        gc()
        timing[i] <- system.time(func())[3]
      }
      mean(timing)
    }

    left.join <- function(sort=TRUE) {
      result <- merge(left, right, all.x=TRUE, sort=sort)
    }
    right.join <- function(sort=TRUE) {
      result <- merge(left, right, all.y=TRUE, sort=sort)
    }
    outer.join <- function(sort=TRUE) {
      result <- merge(left, right, all=TRUE, sort=sort)
    }
    inner.join <- function(sort=TRUE) {
      result <- merge(left, right, sort=sort)
    }

    sort.options <- c(FALSE, TRUE)

    results <- matrix(nrow=4, ncol=2)
    colnames(results) <- c("dont_sort", "sort")
    rownames(results) <- c("inner", "outer", "left", "right")

    join.functions <- c(inner.join, outer.join, left.join, right.join)
    for (i in 1:4) {
      results[i, 1] <- timeit(function() join.functions[[i]](sort=sort.options[1]))
      results[i, 2] <- timeit(function() join.functions[[i]](sort=sort.options[2]))
    }

    And the Python code

    from collections import defaultdict
    import gc
    import random
    import time

    import numpy as np
    from pandas import DataFrame, merge, read_table
    from pandas.util.testing import rands

    N = 10000
    indices = np.array([rands(10) for _ in xrange(N)], dtype='O')

    key = np.tile(indices, 10)
    key2 = key.copy()
    random.shuffle(key2)
    indices2 = indices.copy()
    random.shuffle(indices2)
    left = DataFrame({'key' : key, 'key2':key2,
                      'value' : np.random.randn(100000)})
    right = DataFrame({'key': indices, 'key2':indices2,
                       'value2' : np.random.randn(10000)})
    join_methods = ['inner', 'outer', 'left', 'right']
    results = DataFrame(index=join_methods, columns=[False, True])
    niter = 10
    for sort in [False, True]:
        for join_method in join_methods:
            f = lambda: merge(left, right, how=join_method, sort=sort)
            gc.disable()
            start = time.time()
            for _ in xrange(niter):
                f()
            elapsed = (time.time() - start) / niter
            gc.enable()
            results[sort][join_method] = elapsed
    results.columns = ['dont_sort', 'sort']

    # R results
    from StringIO import StringIO
    r_results = read_table(StringIO("""dont_sort   sort
    inner    0.2297 0.2286
    outer    1.1811 1.2843
    left     0.6706 0.7766
    right    0.2995 0.3371
    """
    ), sep='\s+')

    sort_results = DataFrame.from_items([('pandas', results['sort']),
                                         ('R', r_results['sort'])])
    sort_results['Ratio'] = sort_results['R'] / sort_results['pandas']

    nosort_results = DataFrame.from_items([('pandas', results['dont_sort']),
                                           ('R', r_results['dont_sort'])])
    nosort_results['Ratio'] = nosort_results['R'] / nosort_results['pandas']