10/2/2015: There have been some API changes in pandas in the last 4 years, but the general advice stands.

Over the last several months, I've invested a great deal of effort in the GroupBy and indexing infrastructure of pandas. GroupBy in particular could still be made faster, but even for quite large data sets, most users will find it more than satisfactory compared with other Python alternatives (which are mainly DIY approaches these days) or with tools in other data analysis packages, e.g. those in R, which I'll talk a bit about in this article.

An easy application of GroupBy is creating pivot tables from a tabular data set. This involves aggregating a data set on some number of keys and then reshaping the result to form a 2D table with the stratified aggregated values. I've added a new function, pivot_table, which provides a high-level interface for creating pivot tables. Let's consider the tips data set used in some examples from Hadley Wickham's excellent R reshape2 package:

In [1]: import pandas.rpy.common as com

In [2]: tips = com.load_data('tips', package='reshape2')

In [3]: tips['tip_pct'] = tips['tip'] / tips['total_bill']

In [4]: tips.head()
Out[4]:
   total_bill  tip   sex     smoker  day  time    size  tip_pct
1  16.99       1.01  Female  No      Sun  Dinner  2     0.05945
2  10.34       1.66  Male    No      Sun  Dinner  3     0.1605
3  21.01       3.5   Male    No      Sun  Dinner  3     0.1666
4  23.68       3.31  Male    No      Sun  Dinner  2     0.1398
5  24.59       3.61  Female  No      Sun  Dinner  4     0.1468

Since most readers won't have rpy2, you can get the data file here and read it with (recomputing the tip_pct column from In [3] above):

from pandas import read_csv
tips = read_csv('tips.csv')

All of this new functionality is available now on GitHub and will be a part of the upcoming pandas 0.5.0 release. Note that I am only using rpy2 here to pull the R data set into my Python session; all of the code below is implemented in pure Python (with a little Cython magic to make things fast).

Pivot table basics

To use pivot_table, pass the pandas DataFrame and specify the column you wish to aggregate plus the columns to group on the rows and columns of the table. Note that the default aggregation function is the mean, though this can be changed with the aggfunc argument:

In [9]: table = pivot_table(tips, 'tip_pct', rows=['time', 'sex'],
                            cols='smoker')

In [10]: table
Out[10]:
  smoker       No      Yes
time   sex
Dinner Female  0.1568  0.1851
       Male    0.1594  0.1489
Lunch  Female  0.1571  0.1753
       Male    0.1657  0.1667
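
A quick aside for anyone reading this on a current pandas release (see the note at the top): the rows and cols keywords were later renamed to index and columns, and pivot_table is also available as a DataFrame method. The call above would now be spelled roughly like so:

# modern-pandas spelling of the same pivot (keywords renamed in later releases)
table = tips.pivot_table('tip_pct', index=['time', 'sex'], columns='smoker')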

In [17]: pivot_table(tips, 'tip_pct', rows=['day', 'time'],
                     cols='sex')
Out[17]:
  sex        Female  Male
day  time
Fri  Dinner  0.1991  0.1302
     Lunch   0.1997  0.1741
Sat  Dinner  0.1565  0.1516
Sun  Dinner  0.1816  0.1623
Thur Dinner  0.1597  NaN
     Lunch   0.1575  0.1653
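
The NaN marks a (day, time, sex) combination with no observations. pivot_table also takes a fill_value argument (used again near the end of this post) if you'd rather show something else:

# substitute 0 for empty cells instead of NaN
pivot_table(tips, 'tip_pct', rows=['day', 'time'], cols='sex', fill_value=0)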

Note that the returned object is still a DataFrame, so you can use all of the standard indexing technology to select portions of the resulting table:

In [22]: table.ix['Dinner']
Out[22]:
  smoker  No      Yes
sex
Female    0.1568  0.1851
Male      0.1594  0.1489
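
(On current pandas, .ix has since been deprecated and removed; the same selection is written with .loc:)

table.loc['Dinner']  # modern spelling of table.ix['Dinner']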

If you don't specify a particular values column, it will try to aggregate all the other columns:

In [24]: pivot_table(tips, rows=['sex', 'smoker'])
Out[24]:
               size   tip    tip_pct  total_bill
sex    smoker
Female No      2.593  2.774  0.1569   18.11
       Yes     2.242  2.932  0.1822   17.98
Male   No      2.711  3.113  0.1607   19.79
       Yes     2.5    3.051  0.1528   22.28
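
With no cols argument this is just a groupby mean over the remaining columns, which gives a sense of how thin the pivot_table layer is. A minimal sketch (on current pandas you may need to pass numeric_only=True to mean):

# equivalent formulation in terms of groupby
tips.groupby(['sex', 'smoker']).mean()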

Other aggregations can be performed, like using len to get group counts:

In [26]: pivot_table(tips, 'tip_pct', rows='sex', cols='smoker', aggfunc=len)
Out[26]:
  smoker  No  Yes
sex              
Female    54  33
Male      97  60
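
The same counts fall out of groupby and reshaping directly; for example:

# group sizes, with smoker pivoted into the columns
tips.groupby(['sex', 'smoker']).size().unstack('smoker')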

Once you've done the aggregation, you're free to use the built-in reshaping capability to reshape the data in the table:

In [29]: table = pivot_table(tips, 'tip_pct', rows=['sex', 'day'],
                             cols='smoker', aggfunc=len)

In [30]: table
Out[30]:
  smoker     No  Yes
sex    day
Female Fri   2   7
       Sat   13  15
       Sun   14  4
       Thur  25  7
Male   Fri   2   8
       Sat   32  27
       Sun   43  15
       Thur  20  10

In [31]: table.unstack('sex')
Out[31]:
  smoker  No            Yes
  sex     Female  Male  Female  Male
day
Fri       2       2     7       8
Sat       13      32    15      27
Sun       14      43    4       15
Thur      25      20    7       10
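
stack is the inverse operation, so you can round-trip; stacking 'sex' moves it back into the rows as the innermost level:

# undo the unstack; 'sex' returns to the rows as the innermost level
table.unstack('sex').stack('sex')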

If you know a bit more about groupby, you can get clever and pass a dict of functions to perform multiple aggregations at once:

In [59]: pivot_table(tips, rows=['sex', 'smoker'],
                     aggfunc={'tip_pct' : 'mean', 'size' : 'sum'})
Out[59]:
               size  tip_pct
sex    smoker
Female No      140   0.1569
       Yes     74    0.1822
Male   No      263   0.1607
       Yes     150   0.1528

I thought that was pretty cool.
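
For the curious, the same result comes straight out of groupby, since pivot_table basically falls out of groupby plus reshaping:

# per-column aggregation functions, keyed by column name
tips.groupby(['sex', 'smoker']).agg({'tip_pct': 'mean', 'size': 'sum'})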

Comparison with reshape2

R's reshape2 package relies on two functions, melt and cast, to do the reshaping. Its approach is, at least in my opinion, slightly more ad hoc than the one in pandas (the actual pivot_table function is only about 10 lines and basically falls out of the groupby and hierarchical-indexing-based reshaping). However, reshape2 can take advantage of R's built-in formulas, which makes the syntax very intuitive. Here are some of the same examples using reshape2:

> tips$tip.pct <- tips$tip / tips$total_bill
> mtips = melt(tips, id=c("sex", "smoker", "day", "time"))
> acast(mtips, day ~ sex + smoker, length, subset=.(variable=="tip"))
     Female_No Female_Yes Male_No Male_Yes
Fri          2          7       2        8
Sat         13         15      32       27
Sun         14          4      43       15
Thur        25          7      20       10

> acast(mtips, sex + smoker ~ variable, mean)
           total_bill      tip     size   tip.pct
Female_No    18.10519 2.773519 2.592593 0.1569210
Female_Yes   17.97788 2.931515 2.242424 0.1821504
Male_No      19.79124 3.113402 2.711340 0.1606687
Male_Yes     22.28450 3.051167 2.500000 0.1527712

R lacks hierarchical indexing as far as I know, so reshape2 munges the group labels together into a row label.

In R there is also the xtabs function for doing cross-tabulation:

> ftable(xtabs(size ~ time + sex + smoker + day, data=tips))
                     day Fri Sat Sun Thur
time   sex    smoker
Dinner Female No           2  30  43    2
              Yes          8  33  10    0
       Male   No           4  85 124    0
              Yes         12  71  39    0
Lunch  Female No           3   0   0   60
              Yes          6   0   0   17
       Male   No           0   0   0   50
              Yes          5   0   0   23

ftable just flattens the table so it can be viewed more easily. The same table can be created with pandas by:

In [54]: pivot_table(tips, 'size', rows=['time', 'sex', 'smoker'],
   ....:             cols='day', aggfunc=np.sum, fill_value=0)
Out[54]:
  day                 Fri  Sat  Sun  Thur
time   sex    smoker
Dinner Female No      2    30   43   2
              Yes     8    33   10   0
Dinner Male   No      4    85   124  0
              Yes     12   71   39   0
Lunch  Female No      3    0    0    60
              Yes     6    0    0    17
Lunch  Male   No      0    0    0    50
              Yes     5    0    0    23
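
Current pandas also ships a crosstab convenience function that wraps this pattern; a sketch of the same table using it (with the modern index/columns keywords):

import pandas as pd

# cross-tabulate party size, summing within each cell and zero-filling
pd.crosstab(index=[tips['time'], tips['sex'], tips['smoker']],
            columns=tips['day'],
            values=tips['size'], aggfunc='sum').fillna(0)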

Performance

pandas is very fast here, as I've invested a great deal in optimizing the indexing infrastructure and the other core algorithms that operations like this rely on. I don't have a lot of points of comparison, but here is a simple benchmark of reshape2 versus pandas.pivot_table on a data set with 100,000 rows and 25 groups. Here is the R code for the benchmark:

library(reshape2)
n <- 100000
a.size <- 5
b.size <- 5
data <- data.frame(a=sample(letters[1:a.size], n, replace=T),
                   b=sample(letters[1:b.size], n, replace=T),
                   c=rnorm(n),
                   d=rnorm(n))
timings <- numeric()

for (i in 1:10) {
  gc()
  tim <- system.time(acast(melt(data, id=c("a", "b")), a + b ~ variable, mean))
  timings[i] = tim[3]
}
mean(timings)

Here is what the actual table looks like:

> table <- acast(melt(data, id=c("a", "b")), a + b ~ variable, mean)
> head(table)
               c            d
a_a  0.002761963 -0.008793516
a_b -0.004618980 -0.026978744
a_c  0.022491767 -0.002163986
a_d  0.005027362 -0.002532782
a_e  0.006146438  0.031699238
b_a -0.025094899  0.003334301

Here's the equivalent Python code (sans timings, which I'll do in IPython):

import numpy as np
from pandas import DataFrame, pivot_table
import string

n = 100000
asize = 5
bsize = 5

# pool of single-character labels to draw the group keys from
letters = np.asarray(list(string.ascii_letters), dtype=object)

data = DataFrame(dict(foo=letters[:asize][np.random.randint(0, asize, n)],
                      bar=letters[:bsize][np.random.randint(0, bsize, n)],
                      baz=np.random.randn(n),
                      qux=np.random.randn(n)))

table = pivot_table(data, rows=['foo', 'bar'])

And the resulting table:

In [36]: table = pivot_table(data, rows=['foo', 'bar'])

In [37]: table[:10]
Out[37]:
         baz       qux
foo bar
a   a   -0.02162   0.02186
    b   -0.002155  0.01173
    c   -0.002331  0.006608
    d   -0.01734   0.01109
    e    0.02837   0.01093
b   a    0.01703  -0.005845
    b   -0.001043  0.01283
    c   -0.006015 -0.0005866
    d    0.009683  0.02211
    e    0.02347  -8.346e-05

And the R timing:

> mean(timings)
[1] 0.42

And pandas:

In [38]: timeit table = pivot_table(data, rows=['foo', 'bar'])
10 loops, best of 3: 117 ms per loop
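
(Outside of IPython, the standard-library timeit module gives a comparable number; a minimal sketch, assuming the data and pivot_table names from the script above:)

import timeit

# ten single-shot runs, mirroring the R system.time loop
timings = timeit.repeat("pivot_table(data, rows=['foo', 'bar'])",
                        setup='from __main__ import pivot_table, data',
                        repeat=10, number=1)
print(sum(timings) / len(timings))  # mean, as in the R benchmark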

So it's about 3-4x faster than reshape2. This obviously depends on the data set and how you structure the aggregation. Here are a couple of slightly different benchmarks on the same data:

> acast(melt(data, id=c("a", "b")), a~b, mean, subset=.(variable=="c"))
             a            b            c           d            e
a  0.002761963 -0.004618980  0.022491767 0.005027362  0.006146438
b -0.025094899  0.015966167 -0.007663059 0.006795551  0.006648496
c  0.012273797  0.006139820 -0.019169968 0.009035374 -0.006079683
d -0.014174747 -0.009844566  0.022450738 0.025967638 -0.006572791
e -0.003623382  0.005924704  0.008156482 0.013288116  0.009839315

Timing:
> mean(timings)
[1] 0.3036

and pandas:

In [41]: pivot_table(data, 'baz', rows='foo', cols='bar')
Out[41]:
  bar  a         b          c         d         e
foo
a     -0.02162  -0.002155  -0.002331 -0.01734   0.02837
b      0.01703  -0.001043  -0.006015  0.009683  0.02347
c      0.01561   0.0003992  0.01451  -0.01046  -0.00368
d      0.003504 -0.01437   -0.03265   0.003471 -0.004638
e     -0.01656  -0.01276   -0.02481   0.008629  0.011

In [42]: timeit pivot_table(data, 'baz', rows='foo', cols='bar')
10 loops, best of 3: 55 ms per loop

I suspect one source of slowdown in the reshape/reshape2 approach is that the melt operation is very expensive, as is variable subsetting. However, melting enables easier computation of group margins, something which I have not yet addressed (but will, eventually) in pandas.
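
(Postscript from the vantage point of the 2015 note up top: pivot_table did eventually grow margins support, so group margins now come for free:)

# adds an 'All' row and column holding the grand aggregates
tips.pivot_table('tip_pct', index='sex', columns='smoker', margins=True)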