10/2/2015: There have been some API changes in pandas over the last 4 years, but the general advice stands.
Over the last several months, I've invested a great deal in the GroupBy and indexing infrastructure of pandas. GroupBy in particular could still be made even faster, but even for quite large data sets, most users will find it more than satisfactory compared with other Python alternatives (which are mainly DIY approaches these days) or even with other data analysis packages, e.g. those in R, which I'll talk a bit about in this article.
An easy application of GroupBy is creating pivot tables from a tabular data set. This involves aggregating a data set on some number of keys, then reshaping it to form a 2D table with the stratified aggregated values. I added a new function, pivot_table, which provides a high-level interface for creating pivot tables. Let's consider the tips data set used in some of the examples in Hadley Wickham's excellent R reshape2 package:
In [1]: import pandas.rpy.common as com
In [2]: tips = com.load_data('tips', package='reshape2')
In [3]: tips['tip_pct'] = tips['tip'] / tips['total_bill']
In [4]: tips.head()
Out[4]:
   total_bill   tip     sex smoker  day    time  size  tip_pct
1       16.99  1.01  Female     No  Sun  Dinner     2  0.05945
2       10.34  1.66    Male     No  Sun  Dinner     3   0.1605
3       21.01   3.5    Male     No  Sun  Dinner     3   0.1666
4       23.68  3.31    Male     No  Sun  Dinner     2   0.1398
5       24.59  3.61  Female     No  Sun  Dinner     4   0.1468
Since most readers won't have rpy2, you can get the data file here and read it with:
tips = read_csv('tips.csv')
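If you go that route, remember to also add the derived tip_pct column from In [3] above before continuing:

tips['tip_pct'] = tips['tip'] / tips['total_bill']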
All of this new functionality is available now on GitHub and will be part of the upcoming pandas 0.5.0 release. Note that I am only using rpy2 here to pull the R data set into my Python session; all of the code below is implemented in pure Python (with a little Cython magic to make things fast).
Pivot table basics
To use pivot_table, pass the pandas DataFrame and specify the column you wish to aggregate and the columns to group on the rows and columns of the table. Note that the default aggregation is mean, though this can be specified:
In [9]: table = pivot_table(tips, 'tip_pct', rows=['time', 'sex'],
   ....: cols='smoker')
In [10]: table
Out[10]:
smoker            No     Yes
time   sex
Dinner Female  0.1568  0.1851
       Male    0.1594  0.1489
Lunch  Female  0.1571  0.1753
       Male    0.1657  0.1667
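As an aside, pivot_table is essentially a thin convenience layer over GroupBy and the hierarchical-indexing-based reshaping, so the same table can be built by hand. A rough sketch of the idea (not the actual implementation):

# Group on all three keys, aggregate, then pivot the innermost key
# (smoker) out of the row index and into the columns
grouped = tips.groupby(['time', 'sex', 'smoker'])['tip_pct']
table = grouped.mean().unstack('smoker')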
In [17]: pivot_table(tips, 'tip_pct', rows=['day', 'time'],
   ....: cols='sex')
Out[17]:
sex          Female    Male
day  time
Fri  Dinner  0.1991  0.1302
     Lunch   0.1997  0.1741
Sat  Dinner  0.1565  0.1516
Sun  Dinner  0.1816  0.1623
Thur Dinner  0.1597     NaN
     Lunch   0.1575  0.1653
Note that the returned object is still a DataFrame, so you can use all of the standard indexing technology to select portions of the resulting table:
In [22]: table.ix['Dinner']
Out[22]:
smoker      No     Yes
sex
Female  0.1568  0.1851
Male    0.1594  0.1489
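Column selection works the same way; a hypothetical continuation of the same session:

# The smoker == 'No' column as a Series with a (time, sex) MultiIndex
table['No']
# Chained selection: the Dinner cross-section, then one of its columns
table.ix['Dinner']['No']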
If you don't specify a particular values column, it will try to aggregate all the other columns:
In [24]: pivot_table(tips, rows=['sex', 'smoker'])
Out[24]:
                size    tip  tip_pct  total_bill
sex    smoker
Female No      2.593  2.774   0.1569       18.11
       Yes     2.242  2.932   0.1822       17.98
Male   No      2.711  3.113   0.1607       19.79
       Yes       2.5  3.051   0.1528       22.28
Other aggregations can be performed, like using len to get group counts:
In [26]: pivot_table(tips, 'tip_pct', rows='sex', cols='smoker', aggfunc=len)
Out[26]:
smoker  No  Yes
sex
Female  54   33
Male    97   60
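If you already know groupby, the same counts fall out of it directly; a sketch:

# GroupBy.size counts the rows in each group; unstack then pivots the
# smoker key into the columns, matching the table above
tips.groupby(['sex', 'smoker']).size().unstack('smoker')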
Once you've done the aggregation, you're free to use the built-in reshaping capability to reshape the data in the table:
In [29]: table = pivot_table(tips, 'tip_pct', rows=['sex', 'day'],
   ....: cols='smoker', aggfunc=len)
In [30]: table
Out[30]:
smoker       No  Yes
sex    day
Female Fri    2    7
       Sat   13   15
       Sun   14    4
       Thur  25    7
Male   Fri    2    8
       Sat   32   27
       Sun   43   15
       Thur  20   10
In [31]: table.unstack('sex')
Out[31]:
smoker       No           Yes
sex      Female Male  Female Male
day
Fri           2    2       7    8
Sat          13   32      15   27
Sun          14   43       4   15
Thur         25   20       7   10
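And stack is the inverse of unstack, so the sex level can be moved back into the row index (now as the inner level):

# Round trip: the result has a (day, sex) row index and smoker as the
# only column level again
table.unstack('sex').stack('sex')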
If you know a bit more about groupby, you can get clever and pass a dict of functions to perform multiple aggregations at once:
In [59]: pivot_table(tips, rows=['sex', 'smoker'],
   ....: aggfunc={'tip_pct' : 'mean', 'size' : 'sum'})
Out[59]:
               size  tip_pct
sex    smoker
Female No       140   0.1569
       Yes       74   0.1822
Male   No       263   0.1607
       Yes      150   0.1528
I thought that was pretty cool.
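Roughly speaking, this is the same as passing the dict of functions to GroupBy.agg yourself; a sketch of the equivalent (not necessarily what pivot_table does internally):

# Aggregate each column with a different function, per the dict
tips.groupby(['sex', 'smoker']).agg({'tip_pct': 'mean', 'size': 'sum'})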
Comparison with reshape2
R's reshape2 package relies on two functions, melt and cast, to do the reshaping. Its implementation is, at least in my opinion, a slightly more ad hoc approach than in pandas (the actual pivot_table function is only about 10 lines and basically falls out of the groupby- and hierarchical-indexing-based reshaping). However, it can take advantage of R's built-in formulas, which makes the syntax very intuitive. Here are some of the same examples using reshape2:
> tips$tip.pct <- tips$tip / tips$total_bill
> mtips = melt(tips, id=c("sex", "smoker", "day", "time"))
> acast(mtips, day ~ sex + smoker, length, subset=.(variable=="tip"))
     Female_No Female_Yes Male_No Male_Yes
Fri          2          7       2        8
Sat         13         15      32       27
Sun         14          4      43       15
Thur        25          7      20       10
> acast(mtips, sex + smoker ~ variable, mean)
           total_bill      tip     size   tip.pct
Female_No    18.10519 2.773519 2.592593 0.1569210
Female_Yes   17.97788 2.931515 2.242424 0.1821504
Male_No      19.79124 3.113402 2.711340 0.1606687
Male_Yes     22.28450 3.051167 2.500000 0.1527712
R lacks hierarchical indexing as far as I know, so reshape2 munges the group labels together into a row label.
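If you wanted reshape2-style munged labels in pandas, you could collapse the hierarchical index yourself; a sketch (the '%s_%s' format is my own choice, not a pandas convention):

# Flatten the (sex, smoker) MultiIndex into 'Female_No'-style labels
table = pivot_table(tips, rows=['sex', 'smoker'])
table.index = ['%s_%s' % (sex, smoker) for sex, smoker in table.index]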
In R there is also the xtabs function for doing cross-tabulation:
> ftable(xtabs(size ~ time + sex + smoker + day, data=tips))
                     day  Fri  Sat  Sun  Thur
time   sex    smoker
Dinner Female No            2   30   43     2
              Yes           8   33   10     0
       Male   No            4   85  124     0
              Yes          12   71   39     0
Lunch  Female No            3    0    0    60
              Yes           6    0    0    17
       Male   No            0    0    0    50
              Yes           5    0    0    23
ftable just creates a flat table which can be more easily viewed. This table could be created with pandas by:
In [54]: pivot_table(tips, 'size', rows=['time', 'sex', 'smoker'],
....: cols='day', aggfunc=np.sum, fill_value=0)
Out[54]:
day                  Fri  Sat  Sun  Thur
time   sex    smoker
Dinner Female No       2   30   43     2
              Yes      8   33   10     0
       Male   No       4   85  124     0
              Yes     12   71   39     0
Lunch  Female No       3    0    0    60
              Yes      6    0    0    17
       Male   No       0    0    0    50
              Yes      5    0    0    23
Performance
pandas is very fast, as I've invested a great deal in optimizing the indexing infrastructure and the other core algorithms involved here. I don't have a lot of points of comparison, but here is a simple benchmark of reshape2 versus pandas.pivot_table on a data set with 100000 entries and 25 groups. Here is the R code for the benchmark:
library(reshape2)
n <- 100000
a.size <- 5
b.size <- 5
data <- data.frame(a=sample(letters[1:a.size], n, replace=T),
                   b=sample(letters[1:b.size], n, replace=T),
                   c=rnorm(n),
                   d=rnorm(n))
timings <- numeric()
for (i in 1:10) {
  gc()
  tim <- system.time(acast(melt(data, id=c("a", "b")), a + b ~ variable, mean))
  timings[i] = tim[3]
}
mean(timings)
Here is what the actual table looks like:
> table <- acast(melt(data, id=c("a", "b")), a + b ~ variable, mean)
> head(table)
               c            d
a_a  0.002761963 -0.008793516
a_b -0.004618980 -0.026978744
a_c  0.022491767 -0.002163986
a_d  0.005027362 -0.002532782
a_e  0.006146438  0.031699238
b_a -0.025094899  0.003334301
Here's the equivalent Python code (sans timings, which I'll do in IPython):
from pandas import *
import numpy as np
import string

n = 100000
asize = 5
bsize = 5

# string.letters is Python 2; use string.ascii_letters on Python 3
letters = np.asarray(list(string.letters), dtype=object)

data = DataFrame(dict(foo=letters[:asize][np.random.randint(0, asize, n)],
                      bar=letters[:bsize][np.random.randint(0, bsize, n)],
                      baz=np.random.randn(n),
                      qux=np.random.randn(n)))
table = pivot_table(data, rows=['foo', 'bar'])
Similarly:
In [36]: table = pivot_table(data, rows=['foo', 'bar'])
In [37]: table[:10]
Out[37]:
              baz         qux
foo bar
a   a    -0.02162     0.02186
    b   -0.002155     0.01173
    c   -0.002331    0.006608
    d    -0.01734     0.01109
    e     0.02837     0.01093
b   a     0.01703   -0.005845
    b   -0.001043     0.01283
    c   -0.006015  -0.0005866
    d    0.009683     0.02211
    e     0.02347  -8.346e-05
And the R timing:
> mean(timings)
[1] 0.42
And pandas:
In [38]: timeit table = pivot_table(data, rows=['foo', 'bar'])
10 loops, best of 3: 117 ms per loop
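If you're not in IPython, a crude stand-in for timeit that mirrors the R timing loop above (only the standard library is assumed):

import time

timings = []
for _ in range(10):
    start = time.time()
    pivot_table(data, rows=['foo', 'bar'])
    timings.append(time.time() - start)
mean_timing = sum(timings) / len(timings)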
So it's about 3-4x faster than reshape2. This obviously depends on the data set and how you structure the aggregation. Here are a couple of slightly different benchmarks on the same data:
> acast(melt(data, id=c("a", "b")), a~b, mean, subset=.(variable=="c"))
             a            b            c           d            e
a  0.002761963 -0.004618980  0.022491767 0.005027362  0.006146438
b -0.025094899  0.015966167 -0.007663059 0.006795551  0.006648496
c  0.012273797  0.006139820 -0.019169968 0.009035374 -0.006079683
d -0.014174747 -0.009844566  0.022450738 0.025967638 -0.006572791
e -0.003623382  0.005924704  0.008156482 0.013288116  0.009839315
Timing:
> mean(timings)
[1] 0.3036
and pandas:
In [41]: pivot_table(data, 'baz', rows='foo', cols='bar')
Out[41]:
bar            a          b          c         d         e
foo
a       -0.02162  -0.002155  -0.002331  -0.01734   0.02837
b        0.01703  -0.001043  -0.006015  0.009683   0.02347
c        0.01561  0.0003992    0.01451  -0.01046  -0.00368
d       0.003504   -0.01437   -0.03265  0.003471 -0.004638
e       -0.01656   -0.01276   -0.02481  0.008629     0.011
In [42]: timeit pivot_table(data, 'baz', rows='foo', cols='bar')
10 loops, best of 3: 55 ms per loop
I suspect one source of slowdown in the reshape/reshape2 approach is that the melt operation is very expensive, as is variable subsetting. However, it enables easier computation of group margins, something which I have not yet addressed (but will, eventually) in pandas.
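In the meantime, margins can be tacked on by hand with a couple of extra groupbys; a sketch using the tips table from earlier (the 'All' label is my own convention, not a pandas feature):

# Start from a simple pivot table of mean tip_pct
table = pivot_table(tips, 'tip_pct', rows='sex', cols='smoker')
# Column margin: mean tip_pct per sex, pooling smokers and non-smokers
table['All'] = tips.groupby('sex')['tip_pct'].mean()
# Row margin: mean tip_pct per smoker status, pooling both sexes
row_margin = tips.groupby('smoker')['tip_pct'].mean()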