Someone recently asked me about creating cross-tabulations and contingency tables using pandas. I wrote a bit about this in October after implementing the pivot_table function for DataFrame. So I thought I would give a few more examples and show R code vs. the equivalent pandas code which will be helpful for those already used to the R way of doing things.

When calling help(xtabs) (or help(table)) in R, you get a number of examples for using table, xtabs, and ftable:

> head(esoph)
  agegp     alcgp    tobgp ncases ncontrols
1 25-34 0-39g/day 0-9g/day      0        40
2 25-34 0-39g/day    10-19      0        10
3 25-34 0-39g/day    20-29      0         6
4 25-34 0-39g/day      30+      0         5
5 25-34     40-79 0-9g/day      0        27
6 25-34     40-79    10-19      0         7

> ftable(xtabs(cbind(ncases, ncontrols) ~ agegp + tobgp, data = esoph))
                ncases ncontrols
agegp tobgp
25-34 0-9g/day       0        70
      10-19          1        19
      20-29          0        11
      30+            0        16
35-44 0-9g/day       2       109
      10-19          4        46
      20-29          3        27
      30+            0        17
45-54 0-9g/day      14       104
      10-19         13        57
      20-29          8        33
      30+           11        19
55-64 0-9g/day      25       117
      10-19         23        65
      20-29         12        38
      30+           16        22
65-74 0-9g/day      31        99
      10-19         12        38
      20-29         10        20
      30+            2         4
75+   0-9g/day       6        26
      10-19          5        11
      20-29          0         3
      30+            2         4

> ftable(xtabs(cbind(ncases, ncontrols) ~ agegp, data = esoph))
       ncases ncontrols
agegp
25-34       1       116
35-44       9       199
45-54      46       213
55-64      76       242
65-74      55       161
75+        13        44

Here are the same examples using pandas:

In [19]: table = esoph.pivot_table(['ncases', 'ncontrols'],
                                    rows=['agegp', 'tobgp'],
                                   aggfunc=np.sum)

In [20]: table
Out[20]:
                ncases  ncontrols
agegp tobgp
25-34 0-9g/day  0.      70.
      10-19     1.      19.
      20-29     0.      11.
      30+       0.      16.
35-44 0-9g/day  2.      109
      10-19     4.      46.
      20-29     3.      27.
      30+       0.      17.
45-54 0-9g/day  14      104
      10-19     13      57.
      20-29     8.      33.
      30+       11      19.
55-64 0-9g/day  25      117
      10-19     23      65.
      20-29     12      38.
      30+       16      22.
65-74 0-9g/day  31      99.
      10-19     12      38.
      20-29     10      20.
      30+       2.      4.0
75+   0-9g/day  6.      26.
      10-19     5.      11.
      20-29     0.      3.0
      30+       2.      4.0

In [22]: table2 = esoph.pivot_table(['ncases', 'ncontrols'],
                                    rows='agegp', aggfunc=np.sum)
In [23]: table2
Out[23]:
       ncases  ncontrols
agegp
25-34  1.      116
35-44  9.      199
45-54  46      213
55-64  76      242
65-74  55      161
75+    13      44.

One of the really great things about pandas is that the object produced is still a DataFrame (the R object is of a special class)! So we could do anything with it that we do with normal DataFrame objects:

In [26]: table['ncases'].unstack('agegp')
Out[26]:
agegp     25-34  35-44  45-54  55-64  65-74  75+
tobgp
0-9g/day  0      2      14     25     31     6
10-19     1      4      13     23     12     5
20-29     0      3      8.     12     10     0
30+       0      0      11     16     2.     2

Here is another fun example:

In [31]: import pandas.rpy.common as com
In [32]: wp = com.load_data('warpbreaks')
In [33]: wp['replicate'] = np.tile(range(1, 10), 6)

In [34]: wp.pivot_table('breaks', rows=['wool', 'tension'],
                        cols='replicate', aggfunc=np.sum)
Out[34]:
replicate     1   2   3   4   5   6   7   8   9
wool tension
A    H        36  21  24  18  10  43  28  15  26
     L        26  30  54  25  70  52  51  26  67
     M        18  21  29  17  12  18  35  30  36
B    H        20  21  24  17  13  15  15  16  28
     L        27  14  29  19  29  31  41  20  44
     M        42  26  19  16  39  28  21  39  29

Here is the equivalent R code:

> warpbreaks$replicate <- rep(1:9, len = 54)
> ftable(xtabs(breaks ~ wool + tension + replicate, data = warpbreaks))
             replicate  1  2  3  4  5  6  7  8  9
wool tension
A    L                 26 30 54 25 70 52 51 26 67
     M                 18 21 29 17 12 18 35 30 36
     H                 36 21 24 18 10 43 28 15 26
B    L                 27 14 29 19 29 31 41 20 44
     M                 42 26 19 16 39 28 21 39 29
     H                 20 21 24 17 13 15 15 16 28

As soon as we add a formula framework to statsmodels I could see incorporating that into this kind of function in Python.