GroupBy-fu: improvements in grouping and aggregating data in pandas

A couple of weeks ago, in my inaugural blog post, I wrote about the state of GroupBy in pandas and gave an example application. However, I was dissatisfied with its limited expressiveness (see the end of that article), so over the last two weeks I invested some serious time in beefing up what you can do with the groupby functionality in pandas. So this article is part show-and-tell, part quick tutorial on the new features. Note that I haven’t added a lot of this to the official documentation yet.

GroupBy primer

GroupBy may be one of the least well-understood features in pandas. Let’s consider a DataFrame (basically a labeled 2D array):

In [18]: df
Out[18]:
                      A         B        C         D
2000-03-31 00:00:00   0.6304   -1.493    0.2082    0.3378
2000-06-30 00:00:00  -0.1041    1.093    0.1227   -0.2362
2000-09-29 00:00:00  -1.822     0.7325  -0.4964   -1.707
2000-12-29 00:00:00   0.3325   -0.2213   0.5465    1.265
2001-03-30 00:00:00   0.8422   -0.1265   0.1234    0.7743
2001-06-29 00:00:00  -0.07774  -0.3906  -0.6328   -1.259
2001-09-28 00:00:00  -0.3417   -0.4209   1.724     0.3069
2001-12-31 00:00:00   0.493    -0.2815   1.314     1.004
2002-03-29 00:00:00   1.003    -0.4807   0.7232    1.079
2002-06-28 00:00:00   0.1415   -2.275    0.6178    0.8817
2002-09-30 00:00:00  -0.6855    0.3921  -1.468     1.522
2002-12-31 00:00:00  -1.635     0.3083   0.1209   -0.5606
2003-03-31 00:00:00   0.5707   -0.7125  -0.07677   0.5154
2003-06-30 00:00:00   0.3646   -0.8574  -0.562     0.2176
2003-09-30 00:00:00   1.024    -0.8113   0.6771   -1.261

Here, the index (row labels) contains dates and the columns are names for each time series. When performing a groupby operation, we may have different goals:

  • Perform an aggregation, like computing the sum or mean of each group. Functionally this means applying a function to each group and putting the aggregated results into a DataFrame
  • Slice the DataFrame into chunks (groups) and then do something with the slices
  • Perform a transformation, like standardizing each group (computing the z-score)

So there are two tasks: first, grouping the data; second, doing something with the grouped data. As for the grouping, in a totally abstract sense we have to define some kind of mapping that assigns the labels along one of the axes into group buckets. A mapping could be one of many things:

  • A Python function, to be called on each label
  • A dict, containing {label : group_name} mappings
  • An array the same length as the axis containing the group correspondences. This may often be one of the columns in a DataFrame, but need not be. Every other mapping type gets rendered to this form eventually.
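
To make the dict and array flavors concrete, here is a quick sketch (not from the original post) grouping the df shown above into calendar halves, assuming numpy is imported as np as in the later examples:

# an array of group labels, one per row: first vs. second half of the year
halves = np.array(['H1' if d.month <= 6 else 'H2' for d in df.index])
df.groupby(halves).mean()

# a dict works the same way: {row_label : group_name}
half_map = dict(zip(df.index, halves))
df.groupby(half_map).mean()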

So in the above data, let’s say we wanted to group the data by year. Since the row labels are Python datetime objects, we can access the year attribute when calling groupby:

In [21]: grouped = df.groupby(lambda x: x.year)

This returns an object of type GroupBy. With this object, you can do a lot of things, including iteration, aggregation, and transformation. So we could iterate like so:

In [20]: for year, group in grouped:
   ....:     print year
   ....:     print group
   ....:
2000
                      A        B        C        D
2000-03-31 00:00:00   0.6304  -1.493    0.2082   0.3378
2000-06-30 00:00:00  -0.1041   1.093    0.1227  -0.2362
2000-09-29 00:00:00  -1.822    0.7325  -0.4964  -1.707
2000-12-29 00:00:00   0.3325  -0.2213   0.5465   1.265

2001
                      A         B        C        D
2001-03-30 00:00:00   0.8422   -0.1265   0.1234   0.7743
2001-06-29 00:00:00  -0.07774  -0.3906  -0.6328  -1.259
2001-09-28 00:00:00  -0.3417   -0.4209   1.724    0.3069
2001-12-31 00:00:00   0.493    -0.2815   1.314    1.004

2002
                      A        B        C        D
2002-03-29 00:00:00   1.003   -0.4807   0.7232   1.079
2002-06-28 00:00:00   0.1415  -2.275    0.6178   0.8817
2002-09-30 00:00:00  -0.6855   0.3921  -1.468    1.522
2002-12-31 00:00:00  -1.635    0.3083   0.1209  -0.5606

2003
                      A        B        C         D
2003-03-31 00:00:00   0.5707  -0.7125  -0.07677   0.5154
2003-06-30 00:00:00   0.3646  -0.8574  -0.562     0.2176
2003-09-30 00:00:00   1.024   -0.8113   0.6771   -1.261

Or aggregate like so:

In [21]: grouped.aggregate(np.sum)
Out[21]:
       A        B        C          D
2000  -0.9629   0.1108   0.3811    -0.3405
2001   0.9159  -1.219    2.529      0.8268
2002  -1.176   -2.055   -0.006253   2.921
2003   1.959   -2.381    0.03838   -0.528

Or transform groups like so (here I standardize each group):

In [22]: zscore = lambda x: (x - x.mean()) / x.std()

In [23]: grouped.transform(zscore)
Out[23]:
                      A        B         C         D
2000-03-31 00:00:00   0.7946  -1.316     0.2598    0.3396
2000-06-30 00:00:00   0.1246   0.9216    0.06319  -0.1213
2000-09-29 00:00:00  -1.442    0.61     -1.361    -1.302
2000-12-29 00:00:00   0.5228  -0.2155    1.038     1.084
2001-03-30 00:00:00   1.142    1.34     -0.47      0.5569
2001-06-29 00:00:00  -0.571   -0.644    -1.168    -1.438
2001-09-28 00:00:00  -1.062   -0.8715    1.009     0.0983
2001-12-31 00:00:00   0.4917   0.1758    0.6299    0.7827
2002-03-29 00:00:00   1.149    0.02675   0.7159    0.3864
2002-06-28 00:00:00   0.3857  -1.422     0.6118    0.168
2002-09-30 00:00:00  -0.3469   0.7316   -1.449     0.878
2002-12-31 00:00:00  -1.188    0.664     0.1209   -1.432
2003-03-31 00:00:00  -0.2443   1.097    -0.1434    0.7267
2003-06-30 00:00:00  -0.8552  -0.8598   -0.9205    0.4138
2003-09-30 00:00:00   1.099   -0.2376    1.064    -1.14
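
As a quick sanity check (a sketch, not output from the original post), re-grouping the transformed data by year should give means near 0 and standard deviations near 1 in every column:

transformed = grouped.transform(zscore)
check = transformed.groupby(lambda x: x.year)

print check.mean()   # should be roughly 0 everywhere
print check.std()    # should be roughly 1 everywhere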

Some aggregations are so common that they’re provided as instance methods on the GroupBy object:

In [24]: grouped.sum()
Out[24]:
       A        B        C          D
2000  -0.9629   0.1108   0.3811    -0.3405
2001   0.9159  -1.219    2.529      0.8268
2002  -1.176   -2.055   -0.006253   2.921
2003   1.959   -2.381    0.03838   -0.528

In [25]: grouped.mean()
Out[25]:
       A        B         C          D
2000  -0.2407   0.02769   0.09526   -0.08512
2001   0.229   -0.3049    0.6324     0.2067
2002  -0.2939  -0.5138   -0.001563   0.7303
2003   0.6531  -0.7938    0.01279   -0.176

For those with SQL experience, a DataFrame can be treated like an SQL table. As such, grouping “by columns” is quite natural:

In [27]: df2
Out[27]:
    A     B       C        D
0   foo   one    -0.7883   0.7743
1   bar   one    -0.5866   0.06009
2   foo   two     0.9312   1.2
3   bar   three  -0.6417   0.3444
4   foo   two    -0.8841  -0.08126
5   bar   two     1.194   -0.7933
6   foo   one    -1.624   -0.1403
7   foo   three   0.5046   0.5833

In [28]: for name, group in df2.groupby('A'):
   ....:     print name
   ....:     print group
   ....:
bar
    A     B       C        D
1   bar   one    -0.5866   0.06009
3   bar   three  -0.6417   0.3444
5   bar   two     1.194   -0.7933

foo
    A     B       C        D
0   foo   one    -0.7883   0.7743
2   foo   two     0.9312   1.2
4   foo   two    -0.8841  -0.08126
6   foo   one    -1.624   -0.1403
7   foo   three   0.5046   0.5833

Note the presence of the “nuisance” B column when we group by A. I want this to work:

In [30]: df2.groupby('A').mean()
Out[30]:
      C         D
bar  -0.01137  -0.1296
foo  -0.3722    0.4671

More on dealing with this below.

New: Column selection

In my last post I mentioned wanting to be able to do a groupby operation on a single column of a DataFrame, using another column or set of columns as the groupers. So in the df2 example above, what I want is:

grouped = df2['C'].groupby(df2['A'])

# or
grouped = df2['C'].groupby([df2['A'], df2['B']])

This is too verbose / awkward for my tastes. So now you can do:

In [111]: df2.groupby('A')['C'].mean()
Out[111]:
bar    -0.447919901908
foo    -0.158016186685

Behind the scenes, this simply passes the C column to a Series GroupBy object along with the already-computed grouping(s).
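
In other words, the shorthand should match doing the selection and grouping by hand; a quick sketch:

# equivalent, if wordier, way of writing the same thing
df2['C'].groupby(df2['A']).mean()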

New: Group by multiple columns / key functions

The ability to group by multiple criteria (just like SQL) has been one of my most desired GroupBy features for a long time. It’s also very hard to implement efficiently. So I bit the bullet and carefully crafted Cython code to speedily do aggregations on data grouped by multiple columns. So now you can do:

In [113]: df2.groupby(['A', 'B']).mean()
Out[113]:
    A     B       C        D
0   foo   one     0.3558  -0.4818
1   foo   two    -0.2758  -1.271
2   foo   three  -0.95     1.709
3   bar   one     1.916   -0.5093
4   bar   two    -2.315    0.1177
5   bar   three  -0.9447   0.2519

Iteration with multiple groups flattens the key pairs:

In [114]: for k1, k2, group in df2.groupby(['A', 'B']):
   .....:     print k1, k2
   .....:     print group
   .....:
foo one
    A     B     C        D
0   foo   one   0.8248   0.3856
6   foo   one  -0.1133  -1.349

foo two
    A     B     C        D
2   foo   two  -1.175   -1.036
4   foo   two   0.6239  -1.506

foo three
    A     B       C      D
7   foo   three  -0.95   1.709

bar one
    A     B     C       D
1   bar   one   1.916  -0.5093

bar two
    A     B     C       D
5   bar   two  -2.315   0.1177

bar three
    A     B       C        D
3   bar   three  -0.9447   0.2519

As a less trivial example, I loaded some data from the Bureau of Labor Statistics, which I’ll write more about another time:

In [126]: df = read_table('ce.data.44.FinActOtherCurr.txt',
                          sep='\s+', index_col=None)

In [127]: df
Out[127]:
<class 'pandas.core.frame.DataFrame'>
Index: 32806 entries, 0 to 32805
Data columns:
series_id    32806  non-null values
year         32806  non-null values
period       32806  non-null values
value        32806  non-null values
dtypes: int64(1), float64(1), object(2)

In [128]: df[:15]
Out[128]:
     series_id       year   period   value
0    CEU5500000034   1990   M01      83
1    CEU5500000034   1990   M02      83.6
2    CEU5500000034   1990   M03      83.4
3    CEU5500000034   1990   M04      84.9
4    CEU5500000034   1990   M05      84.1
5    CEU5500000034   1990   M06      85.4
6    CEU5500000034   1990   M07      86.5
7    CEU5500000034   1990   M08      86
8    CEU5500000034   1990   M09      85.5
9    CEU5500000034   1990   M10      83.6
10   CEU5500000034   1990   M11      83.6
11   CEU5500000034   1990   M12      84.7
12   CEU5500000034   1990   M13      84.5
13   CEU5500000034   1991   M01      82.9
14   CEU5500000034   1991   M02      83.1

This data set has 118 unique series_id values and spans 22 years (the period column gives the month of observation), so we’re talking about 2596 unique groups. Let’s say we wanted to compute the mean value for each series in each year; it’s pretty simple with the improved groupby:

In [129]: df.groupby(['series_id', 'year']).mean()
Out[129]:
<class 'pandas.core.frame.DataFrame'>
Index: 2596 entries, 0 to 2595
Data columns:
series_id    2596  non-null values
value        2596  non-null values
year         2596  non-null values
dtypes: int64(1), float64(1), object(1)

I was also pretty impressed with how fast this operation is (the power of Cython!):

In [130]: timeit df.groupby(['series_id', 'year']).mean()
10 loops, best of 3: 22.9 ms per loop

I haven’t compared this with an SQL engine (SQLite, MySQL, Postgres, …) but I hope that it’s competitive. If you call a non-Cythonized function, the speed goes down considerably:

In [132]: timeit df.groupby(['series_id', 'year'])['value'].std()
1 loops, best of 3: 232 ms per loop
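
To illustrate the difference (just a sketch, no timings claimed), the built-in aggregation methods hit the Cython fast path, while an arbitrary Python function passed to agg falls back to the slower per-group code:

# fast path: Cython-backed built-in aggregation
df.groupby(['series_id', 'year'])['value'].mean()

# slower path: a plain Python function applied to each group
df.groupby(['series_id', 'year'])['value'].agg(lambda x: x.mean())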

Edit: It’s also possible to pass multiple functions instead of column names or to mix and match column names with functions:

In [22]: df.groupby([lambda x: x.year, lambda x: x.month]).mean()
Out[22]:
     A          B         C          D         key_0   key_1
0    0.09433    0.06146   0.1891     0.2079    2000    1    
1    0.02674   -0.09411   0.1557    -0.07926   2000    2    
2    0.07089    0.3938   -0.08026   -0.416     2000    3    
3    0.133     -0.04532   0.03341   -0.2866    2000    4    
4    0.1301     0.1803   -0.4241     0.1482    2000    5    
5   -0.3804    -0.08935   0.04791    0.08268   2000    6    
6    0.2211     0.2483   -0.3559    -0.06205   2000    7    
7   -0.2908     0.07606  -0.2001    -0.05225   2000    8    
8   -0.07888   -0.3915    0.06442    0.5109    2000    9    
9    0.3809     0.4037   -0.2431    -0.2892    2000    10  
10  -0.2765     0.01461  -0.2653     0.1335    2000    11  
11   0.6983    -0.1167    0.3964     0.03776   2000    12  
12  -0.2347     0.2333   -0.06508    0.3       2001    1    
13   0.1751     0.05545  -0.004038   0.06194   2001    2    
14   0.08792    0.03517   0.05583   -0.4356    2001    3    
15  -0.08408    0.05567  -0.1705     0.1914    2001    4    
16   0.008912   0.2317    0.08875    0.08637   2001    5    
17   0.3304     0.3635    0.005158   0.3588    2001    6    
18  -0.1324     0.3419   -0.2527    -0.2249    2001    7    
19   0.0472     0.2552   -0.07589   -0.07992   2001    8    
20   0.07582   -0.2295    0.1002     0.1077    2001    9    
21   0.2368     0.02214  -0.3866    -0.2671    2001    10  
22   0.2116    -0.07872  -0.06399    0.2309    2001    11
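
Mixing a column name with a function looks like the following purely illustrative sketch, using df2 from earlier; the function is assumed to be applied to the row labels (here the integer index), just as the lambdas above are applied to the dates:

# group by the A column and by the parity of the row label
df2.groupby(['A', lambda x: x % 2]).mean()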

New: Omitting “nuisance” columns

It’s pretty common to group by a column and ignore other columns containing non-floating-point data. It used to be that you’d get an error, forcing you to first drop the “nuisance” columns yourself. For example, suppose we only want to group by the A column here:

In [133]: df2
Out[133]:
    A     B       C        D
0   foo   one     0.8248   0.3856
1   bar   one     1.916   -0.5093
2   foo   two    -1.175   -1.036
3   bar   three  -0.9447   0.2519
4   foo   two     0.6239  -1.506
5   bar   two    -2.315    0.1177
6   foo   one    -0.1133  -1.349
7   foo   three  -0.95     1.709

If you try to compute the mean of the B column, things won’t quite work out:

In [134]: grouped = df2.groupby('A')['B'].mean()
TypeError: unsupported operand type(s) for /: 'str' and 'float'

To cope with this, so-called “nuisance” columns are now automatically dropped. While I don’t like discarding information, I think “practicality beats purity” in this case:

In [135]: df2.groupby('A').mean()
Out[135]:
      C        D
bar  -0.4479  -0.0466
foo  -0.158   -0.3593

New: “auto-dispatching” to DataFrame / Series

Since very few of the functions in DataFrame and Series/TimeSeries are implemented on the GroupBy classes, you may wish to simply invoke a Series instance method on each group. For example:

In [140]: df.groupby('series_id')['value'].agg(Series.skew)
Out[140]:
CEU5500000034    -0.253887792525
CEU5500000035    0.0938252241898
CEU5552200034    0.387563548587
CEU5552200035    0.0868489117242
CEU5552210034    0.74506367536
CEU5552210035    0.551290226183
<snip>

But since we’ve got Python at our disposal, it’s relatively straightforward to override __getattribute__ to wrap and dispatch calls to Series/DataFrame instance methods so that you can do this:

In [141]: df.groupby('series_id')['value'].skew()
Out[141]:
CEU5500000034    -0.253887792525
CEU5500000035    0.0938252241898
CEU5552200034    0.387563548587
CEU5552200035    0.0868489117242
CEU5552210034    0.74506367536
<snip>

Some of the more complicated methods, like the super-useful describe method, can in fact be used as well.
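
For example, a sketch of the call on the BLS frame from above (output omitted here):

df.groupby('series_id')['value'].describe()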

For those interested, the internal code that does this dispatching is short and simple, by the way.
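
What follows is not the actual pandas source, just a rough sketch of the idea using __getattr__: unknown attribute lookups get wrapped so that the named method is called on each group and the results are collected.

class ToyGroupBy(object):
    # toy stand-in for a Series GroupBy: groups is a dict of {name : Series}
    def __init__(self, groups):
        self.groups = groups

    def aggregate(self, func):
        # apply func to each group and collect the results in a dict
        return dict((name, func(group)) for name, group in self.groups.items())

    def __getattr__(self, name):
        # dispatch unknown attributes: call the method of that name per group
        def wrapper(*args, **kwargs):
            return self.aggregate(lambda g: getattr(g, name)(*args, **kwargs))
        return wrapper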

Old but not well known: apply multiple functions

This has been around for a while but isn’t well known. Suppose you want to apply multiple functions to a group and collect all the results. You can pass a dict of functions (here grouped refers to the grouped BLS values from above, df.groupby('series_id')['value']):

In [144]: grouped.agg({'mean' : np.mean, 'variance' : np.var})
Out[144]:
                mean    variance
CEU5500000034   96.54   68.92  
CEU5500000035   93.88   832.3  
CEU5552200034   97.95   46.6    
CEU5552200035   93.58   683.1  
CEU5552210034   100.7   15.54  
CEU5552210035   98.84   567.1  
CEU5552211034   100.9   13.73  
CEU5552211035   99.34   622    
<snip>

I have a mind to expand on this idea for DataFrame objects to get us closer to the full flexibility of SQL with Pythonic syntax. If anyone has ideas or suggestions, please let me know.

Conclusions

GroupBy is certainly not done. If you use these tools and find them useful, please let me know. If you can think of ways to make them better, that would be nice information too. I think it would be great to implement a full SQL engine on top of pandas (similar to the SAS “proc sql”), and this new GroupBy functionality gets us closer to that goal.