Someone recently asked me about creating cross-tabulations and contingency tables using pandas. I wrote a bit about this in October after implementing the pivot_table
function for DataFrame. So I thought I would give a few more examples and show R code vs. the equivalent pandas code which will be helpful for those already used to the R way of doing things.
When calling help(xtabs)
(or help(table)
) in R, you get a number of examples for using table, xtabs,
and ftable
:
> head(esoph) agegp alcgp tobgp ncases ncontrols 1 25-34 0-39g/day 0-9g/day 0 40 2 25-34 0-39g/day 10-19 0 10 3 25-34 0-39g/day 20-29 0 6 4 25-34 0-39g/day 30+ 0 5 5 25-34 40-79 0-9g/day 0 27 6 25-34 40-79 10-19 0 7 > ftable(xtabs(cbind(ncases, ncontrols) ~ agegp + tobgp, data = esoph)) ncases ncontrols agegp tobgp 25-34 0-9g/day 0 70 10-19 1 19 20-29 0 11 30+ 0 16 35-44 0-9g/day 2 109 10-19 4 46 20-29 3 27 30+ 0 17 45-54 0-9g/day 14 104 10-19 13 57 20-29 8 33 30+ 11 19 55-64 0-9g/day 25 117 10-19 23 65 20-29 12 38 30+ 16 22 65-74 0-9g/day 31 99 10-19 12 38 20-29 10 20 30+ 2 4 75+ 0-9g/day 6 26 10-19 5 11 20-29 0 3 30+ 2 4 > ftable(xtabs(cbind(ncases, ncontrols) ~ agegp, data = esoph)) ncases ncontrols agegp 25-34 1 116 35-44 9 199 45-54 46 213 55-64 76 242 65-74 55 161 75+ 13 44
Here are the same examples using pandas:
In [19]: table = esoph.pivot_table(['ncases', 'ncontrols'], rows=['agegp', 'tobgp'], aggfunc=np.sum) In [20]: table Out[20]: ncases ncontrols agegp tobgp 25-34 0-9g/day 0. 70. 10-19 1. 19. 20-29 0. 11. 30+ 0. 16. 35-44 0-9g/day 2. 109 10-19 4. 46. 20-29 3. 27. 30+ 0. 17. 45-54 0-9g/day 14 104 10-19 13 57. 20-29 8. 33. 30+ 11 19. 55-64 0-9g/day 25 117 10-19 23 65. 20-29 12 38. 30+ 16 22. 65-74 0-9g/day 31 99. 10-19 12 38. 20-29 10 20. 30+ 2. 4.0 75+ 0-9g/day 6. 26. 10-19 5. 11. 20-29 0. 3.0 30+ 2. 4.0 In [22]: table2 = esoph.pivot_table(['ncases', 'ncontrols'], rows='agegp', aggfunc=np.sum) In [23]: table2 Out[23]: ncases ncontrols agegp 25-34 1. 116 35-44 9. 199 45-54 46 213 55-64 76 242 65-74 55 161 75+ 13 44.
One of the really great things about pandas is that the object produced is still a DataFrame (the R object is of a special class)! So we could do anything with it that we do with normal DataFrame objects:
In [26]: table['ncases'].unstack('agegp') Out[26]: agegp 25-34 35-44 45-54 55-64 65-74 75+ tobgp 0-9g/day 0 2 14 25 31 6 10-19 1 4 13 23 12 5 20-29 0 3 8. 12 10 0 30+ 0 0 11 16 2. 2
Here is another fun example:
In [31]: import pandas.rpy.common as com In [32]: wp = com.load_data('warpbreaks') In [33]: wp['replicate'] = np.tile(range(1, 10), 6) In [34]: wp.pivot_table('breaks', rows=['wool', 'tension'], cols='replicate', aggfunc=np.sum) Out[34]: replicate 1 2 3 4 5 6 7 8 9 wool tension A H 36 21 24 18 10 43 28 15 26 L 26 30 54 25 70 52 51 26 67 M 18 21 29 17 12 18 35 30 36 B H 20 21 24 17 13 15 15 16 28 L 27 14 29 19 29 31 41 20 44 M 42 26 19 16 39 28 21 39 29
Here is the equivalent R code:
> warpbreaks$replicate <- rep(1:9, len = 54) > ftable(xtabs(breaks ~ wool + tension + replicate, data = warpbreaks)) replicate 1 2 3 4 5 6 7 8 9 wool tension A L 26 30 54 25 70 52 51 26 67 M 18 21 29 17 12 18 35 30 36 H 36 21 24 18 10 43 28 15 26 B L 27 14 29 19 29 31 41 20 44 M 42 26 19 16 39 28 21 39 29 H 20 21 24 17 13 15 15 16 28
As soon as we add a formula framework to statsmodels I could see incorporating that into this kind of function in Python.