Someone recently asked me about creating cross-tabulations and contingency tables using pandas. I wrote a bit about this in October after implementing the pivot_table
function for DataFrame. So I thought I would give a few more examples and show R code vs. the equivalent pandas code which will be helpful for those already used to the R way of doing things.
When calling help(xtabs)
(or help(table)
) in R, you get a number of examples for using table, xtabs,
and ftable
:
> head(esoph)
agegp alcgp tobgp ncases ncontrols
1 25-34 0-39g/day 0-9g/day 0 40
2 25-34 0-39g/day 10-19 0 10
3 25-34 0-39g/day 20-29 0 6
4 25-34 0-39g/day 30+ 0 5
5 25-34 40-79 0-9g/day 0 27
6 25-34 40-79 10-19 0 7
> ftable(xtabs(cbind(ncases, ncontrols) ~ agegp + tobgp, data = esoph))
ncases ncontrols
agegp tobgp
25-34 0-9g/day 0 70
10-19 1 19
20-29 0 11
30+ 0 16
35-44 0-9g/day 2 109
10-19 4 46
20-29 3 27
30+ 0 17
45-54 0-9g/day 14 104
10-19 13 57
20-29 8 33
30+ 11 19
55-64 0-9g/day 25 117
10-19 23 65
20-29 12 38
30+ 16 22
65-74 0-9g/day 31 99
10-19 12 38
20-29 10 20
30+ 2 4
75+ 0-9g/day 6 26
10-19 5 11
20-29 0 3
30+ 2 4
> ftable(xtabs(cbind(ncases, ncontrols) ~ agegp, data = esoph))
ncases ncontrols
agegp
25-34 1 116
35-44 9 199
45-54 46 213
55-64 76 242
65-74 55 161
75+ 13 44
Here are the same examples using pandas:
In [19]: table = esoph.pivot_table(['ncases', 'ncontrols'],
rows=['agegp', 'tobgp'],
aggfunc=np.sum)
In [20]: table
Out[20]:
ncases ncontrols
agegp tobgp
25-34 0-9g/day 0. 70.
10-19 1. 19.
20-29 0. 11.
30+ 0. 16.
35-44 0-9g/day 2. 109
10-19 4. 46.
20-29 3. 27.
30+ 0. 17.
45-54 0-9g/day 14 104
10-19 13 57.
20-29 8. 33.
30+ 11 19.
55-64 0-9g/day 25 117
10-19 23 65.
20-29 12 38.
30+ 16 22.
65-74 0-9g/day 31 99.
10-19 12 38.
20-29 10 20.
30+ 2. 4.0
75+ 0-9g/day 6. 26.
10-19 5. 11.
20-29 0. 3.0
30+ 2. 4.0
In [22]: table2 = esoph.pivot_table(['ncases', 'ncontrols'],
rows='agegp', aggfunc=np.sum)
In [23]: table2
Out[23]:
ncases ncontrols
agegp
25-34 1. 116
35-44 9. 199
45-54 46 213
55-64 76 242
65-74 55 161
75+ 13 44.
One of the really great things about pandas is that the object produced is still a DataFrame (the R object is of a special class)! So we could do anything with it that we do with normal DataFrame objects:
In [26]: table['ncases'].unstack('agegp')
Out[26]:
agegp 25-34 35-44 45-54 55-64 65-74 75+
tobgp
0-9g/day 0 2 14 25 31 6
10-19 1 4 13 23 12 5
20-29 0 3 8. 12 10 0
30+ 0 0 11 16 2. 2
Here is another fun example:
In [31]: import pandas.rpy.common as com
In [32]: wp = com.load_data('warpbreaks')
In [33]: wp['replicate'] = np.tile(range(1, 10), 6)
In [34]: wp.pivot_table('breaks', rows=['wool', 'tension'],
cols='replicate', aggfunc=np.sum)
Out[34]:
replicate 1 2 3 4 5 6 7 8 9
wool tension
A H 36 21 24 18 10 43 28 15 26
L 26 30 54 25 70 52 51 26 67
M 18 21 29 17 12 18 35 30 36
B H 20 21 24 17 13 15 15 16 28
L 27 14 29 19 29 31 41 20 44
M 42 26 19 16 39 28 21 39 29
Here is the equivalent R code:
> warpbreaks$replicate <- rep(1:9, len = 54)
> ftable(xtabs(breaks ~ wool + tension + replicate, data = warpbreaks))
replicate 1 2 3 4 5 6 7 8 9
wool tension
A L 26 30 54 25 70 52 51 26 67
M 18 21 29 17 12 18 35 30 36
H 36 21 24 18 10 43 28 15 26
B L 27 14 29 19 29 31 41 20 44
M 42 26 19 16 39 28 21 39 29
H 20 21 24 17 13 15 15 16 28
As soon as we add a formula framework to statsmodels I could see incorporating that into this kind of function in Python.