A couple of weeks ago, in my inaugural blog post, I wrote about the state of GroupBy in pandas and gave an example application. However, I was dissatisfied with its limited expressiveness (see the end of that article), so over the last two weeks I've invested some serious time in beefing up what you can do with groupby. This article is therefore part show-and-tell, part quick tutorial on the new features. Note that I haven't added much of this to the official documentation yet.
GroupBy primer
GroupBy may be one of the least well-understood features in pandas. Let's consider a DataFrame (basically a labeled 2D array):
In [18]: df
Out[18]:
A B C D
2000-03-31 00:00:00 0.6304 -1.493 0.2082 0.3378
2000-06-30 00:00:00 -0.1041 1.093 0.1227 -0.2362
2000-09-29 00:00:00 -1.822 0.7325 -0.4964 -1.707
2000-12-29 00:00:00 0.3325 -0.2213 0.5465 1.265
2001-03-30 00:00:00 0.8422 -0.1265 0.1234 0.7743
2001-06-29 00:00:00 -0.07774 -0.3906 -0.6328 -1.259
2001-09-28 00:00:00 -0.3417 -0.4209 1.724 0.3069
2001-12-31 00:00:00 0.493 -0.2815 1.314 1.004
2002-03-29 00:00:00 1.003 -0.4807 0.7232 1.079
2002-06-28 00:00:00 0.1415 -2.275 0.6178 0.8817
2002-09-30 00:00:00 -0.6855 0.3921 -1.468 1.522
2002-12-31 00:00:00 -1.635 0.3083 0.1209 -0.5606
2003-03-31 00:00:00 0.5707 -0.7125 -0.07677 0.5154
2003-06-30 00:00:00 0.3646 -0.8574 -0.562 0.2176
2003-09-30 00:00:00 1.024 -0.8113 0.6771 -1.261
Here, the index (row labels) contains dates and the columns are names for each time series. When performing a groupby operation, we may have different goals:
- Perform an aggregation, like computing the sum or mean of each group. Functionally this means applying a function to each group and putting the aggregated results into a DataFrame
- Slice the DataFrame into chunks (groups) and then do something with the slices
- Perform a transformation, like standardizing each group (computing the z-score)
So there are two tasks: first, grouping the data; second, doing something with the grouped data. As far as grouping goes, in a totally abstract sense we have to define some kind of mapping that assigns the labels along one of the axes to group buckets. A mapping could be one of many things:
- A Python function, to be called on each label
- A dict, containing {label : group_name} mappings
- An array the same length as the axis containing the group correspondences. This may often be one of the columns in a DataFrame, but need not be. Every other mapping type eventually gets converted to this form (see the sketch just below).
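To make that concrete, the dict and array forms of a by-year grouping of the DataFrame above would look roughly like this (the function form is what the next example actually uses):
year_map = dict((date, date.year) for date in df.index)  # {label : group_name}
grouped = df.groupby(year_map)
# or an array of group labels aligned with the rows
grouped = df.groupby(np.array([date.year for date in df.index]))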
So in the above data, let's say we wanted to group the data by year. Since the row labels are Python datetime objects, we can simply access the year attribute in the key function passed to groupby:
In [21]: grouped = df.groupby(lambda x: x.year)
This returns an object of type GroupBy. With this object you can do a lot of things, including iteration, aggregation, and transformation. So we could iterate like so:
In [20]: for year, group in grouped:
....: print year
....: print group
....:
2000
A B C D
2000-03-31 00:00:00 0.6304 -1.493 0.2082 0.3378
2000-06-30 00:00:00 -0.1041 1.093 0.1227 -0.2362
2000-09-29 00:00:00 -1.822 0.7325 -0.4964 -1.707
2000-12-29 00:00:00 0.3325 -0.2213 0.5465 1.265
2001
A B C D
2001-03-30 00:00:00 0.8422 -0.1265 0.1234 0.7743
2001-06-29 00:00:00 -0.07774 -0.3906 -0.6328 -1.259
2001-09-28 00:00:00 -0.3417 -0.4209 1.724 0.3069
2001-12-31 00:00:00 0.493 -0.2815 1.314 1.004
2002
A B C D
2002-03-29 00:00:00 1.003 -0.4807 0.7232 1.079
2002-06-28 00:00:00 0.1415 -2.275 0.6178 0.8817
2002-09-30 00:00:00 -0.6855 0.3921 -1.468 1.522
2002-12-31 00:00:00 -1.635 0.3083 0.1209 -0.5606
2003
A B C D
2003-03-31 00:00:00 0.5707 -0.7125 -0.07677 0.5154
2003-06-30 00:00:00 0.3646 -0.8574 -0.562 0.2176
2003-09-30 00:00:00 1.024 -0.8113 0.6771 -1.261
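If all you want are the slices themselves, it's easy enough to collect them into a dict keyed by year, something like:
# stash each year's slice in a dict of smaller DataFrames
pieces = dict((year, group) for year, group in grouped)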
Or aggregate like so:
In [21]: grouped.aggregate(np.sum)
Out[21]:
A B C D
2000 -0.9629 0.1108 0.3811 -0.3405
2001 0.9159 -1.219 2.529 0.8268
2002 -1.176 -2.055 -0.006253 2.921
2003 1.959 -2.381 0.03838 -0.528
Or transform groups like so (here I standardize each group):
In [22]: zscore = lambda x: (x - x.mean()) / x.std()
In [23]: grouped.transform(zscore)
Out[23]:
A B C D
2000-03-31 00:00:00 0.7946 -1.316 0.2598 0.3396
2000-06-30 00:00:00 0.1246 0.9216 0.06319 -0.1213
2000-09-29 00:00:00 -1.442 0.61 -1.361 -1.302
2000-12-29 00:00:00 0.5228 -0.2155 1.038 1.084
2001-03-30 00:00:00 1.142 1.34 -0.47 0.5569
2001-06-29 00:00:00 -0.571 -0.644 -1.168 -1.438
2001-09-28 00:00:00 -1.062 -0.8715 1.009 0.0983
2001-12-31 00:00:00 0.4917 0.1758 0.6299 0.7827
2002-03-29 00:00:00 1.149 0.02675 0.7159 0.3864
2002-06-28 00:00:00 0.3857 -1.422 0.6118 0.168
2002-09-30 00:00:00 -0.3469 0.7316 -1.449 0.878
2002-12-31 00:00:00 -1.188 0.664 0.1209 -1.432
2003-03-31 00:00:00 -0.2443 1.097 -0.1434 0.7267
2003-06-30 00:00:00 -0.8552 -0.8598 -0.9205 0.4138
2003-09-30 00:00:00 1.099 -0.2376 1.064 -1.14
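As a quick sanity check, re-grouping the transformed data by year should give group means of roughly 0 and group standard deviations of roughly 1, along these lines:
transformed = grouped.transform(zscore)
# each yearly group should now be standardized: mean ~ 0, std ~ 1
transformed.groupby(lambda x: x.year).aggregate(lambda x: x.mean())
transformed.groupby(lambda x: x.year).aggregate(lambda x: x.std())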
Some aggregations are so common that they're provided as instance methods on the GroupBy object:
In [24]: grouped.sum()
Out[24]:
A B C D
2000 -0.9629 0.1108 0.3811 -0.3405
2001 0.9159 -1.219 2.529 0.8268
2002 -1.176 -2.055 -0.006253 2.921
2003 1.959 -2.381 0.03838 -0.528
In [25]: grouped.mean()
Out[25]:
A B C D
2000 -0.2407 0.02769 0.09526 -0.08512
2001 0.229 -0.3049 0.6324 0.2067
2002 -0.2939 -0.5138 -0.001563 0.7303
2003 0.6531 -0.7938 0.01279 -0.176
For those with SQL experience, a DataFrame can be treated like an SQL table. As such, grouping "by columns" is quite natural:
In [27]: df2
Out[27]:
A B C D
0 foo one -0.7883 0.7743
1 bar one -0.5866 0.06009
2 foo two 0.9312 1.2
3 bar three -0.6417 0.3444
4 foo two -0.8841 -0.08126
5 bar two 1.194 -0.7933
6 foo one -1.624 -0.1403
7 foo three 0.5046 0.5833
In [28]: for name, group in df2.groupby('A'):
....: print name
....: print group
....:
bar
A B C D
1 bar one -0.5866 0.06009
3 bar three -0.6417 0.3444
5 bar two 1.194 -0.7933
foo
A B C D
0 foo one -0.7883 0.7743
2 foo two 0.9312 1.2
4 foo two -0.8841 -0.08126
6 foo one -1.624 -0.1403
7 foo three 0.5046 0.5833
Note the presence of the "nuisance" B column when we group by A. I want this to work:
In [30]: df2.groupby('A').mean()
Out[30]:
C D
bar -0.01137 -0.1296
foo -0.3722 0.4671
More on dealing with this below.
New: Column selection
In my last post I mentioned wanting to be able to do a groupby operation on a single column of a DataFrame, using another column or set of columns as the groupers. So in the above example, what I want is:
grouped = df2['C'].groupby(df2['A'])
# or
grouped = df2['C'].groupby([df2['A'], df2['B']])
This is too verbose / awkward for my tastes. So now you can do:
In [111]: df2.groupby('A')['C'].mean()
Out[111]:
bar -0.447919901908
foo -0.158016186685
Behind the scenes, this simply passes the C column to a Series GroupBy object along with the already-computed grouping(s).
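In other words, the shorthand should give the same answer as the more verbose spelling from before:
# these two should be equivalent
df2['C'].groupby(df2['A']).mean()
df2.groupby('A')['C'].mean()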
New: Group by multiple columns / key functions
The ability to group by multiple criteria (just like SQL) has been one of my most desired GroupBy features for a long time. It's also very hard to implement efficiently. So I bit the bullet and carefully crafted Cython code to speedily do aggregations on data grouped by multiple columns. So now you can do:
In [113]: df2.groupby(['A', 'B']).mean()
Out[113]:
A B C D
0 foo one 0.3558 -0.4818
1 foo two -0.2758 -1.271
2 foo three -0.95 1.709
3 bar one 1.916 -0.5093
4 bar two -2.315 0.1177
5 bar three -0.9447 0.2519
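Column selection composes with multiple keys too, so pulling out a single column's group statistics should work the same way, e.g.:
# mean of just the C column within each (A, B) group
df2.groupby(['A', 'B'])['C'].mean()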
When grouping by multiple keys, iteration yields the keys and the group together in one flat tuple:
In [114]: for k1, k2, group in df2.groupby(['A', 'B']):
.....: print k1, k2
.....: print group
.....:
foo one
A B C D
0 foo one 0.8248 0.3856
6 foo one -0.1133 -1.349
foo two
A B C D
2 foo two -1.175 -1.036
4 foo two 0.6239 -1.506
foo three
A B C D
7 foo three -0.95 1.709
bar one
A B C D
1 bar one 1.916 -0.5093
bar two
A B C D
5 bar two -2.315 0.1177
bar three
A B C D
3 bar three -0.9447 0.2519
As a less trivial example, I loaded some data from the Bureau of Labor Statistics, which I'll write more about another time:
In [126]: df = read_table('ce.data.44.FinActOtherCurr.txt',
sep='\s+', index_col=None)
In [127]: df
Out[127]:
<class 'pandas.core.frame.DataFrame'>
Index: 32806 entries, 0 to 32805
Data columns:
series_id 32806 non-null values
year 32806 non-null values
period 32806 non-null values
value 32806 non-null values
dtypes: int64(1), float64(1), object(2)
In [128]: df[:15]
Out[128]:
series_id year period value
0 CEU5500000034 1990 M01 83
1 CEU5500000034 1990 M02 83.6
2 CEU5500000034 1990 M03 83.4
3 CEU5500000034 1990 M04 84.9
4 CEU5500000034 1990 M05 84.1
5 CEU5500000034 1990 M06 85.4
6 CEU5500000034 1990 M07 86.5
7 CEU5500000034 1990 M08 86
8 CEU5500000034 1990 M09 85.5
9 CEU5500000034 1990 M10 83.6
10 CEU5500000034 1990 M11 83.6
11 CEU5500000034 1990 M12 84.7
12 CEU5500000034 1990 M13 84.5
13 CEU5500000034 1991 M01 82.9
14 CEU5500000034 1991 M02 83.1
This data set has 118 unique series_ids and spans 22 years (the period column gives the month of observation), so we're talking about 2596 unique groups. Say we want to compute the mean value for each year; that's pretty simple with the improved groupby:
In [129]: df.groupby(['series_id', 'year']).mean()
Out[129]:
<class 'pandas.core.frame.DataFrame'>
Index: 2596 entries, 0 to 2595
Data columns:
series_id 2596 non-null values
value 2596 non-null values
year 2596 non-null values
dtypes: int64(1), float64(1), object(1)
I was also pretty impressed with how fast this operation is (the power of Cython!):
In [130]: timeit df.groupby(['series_id', 'year']).mean()
10 loops, best of 3: 22.9 ms per loop
I haven't compared this with an SQL engine (SQLite, MySQL, Postgres, ...) but I hope that it's competitive. If you call a non-Cythonized function, the speed goes down considerably:
In [132]: timeit df.groupby(['series_id', 'year'])['value'].std()
1 loops, best of 3: 232 ms per loop
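If you want to try the SQL comparison yourself, here's a rough sketch using the standard library's sqlite3 module (the table name ce and the timing scaffolding are just for illustration):
import time
import sqlite3

# load the three columns we need into an in-memory SQLite table
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE ce (series_id TEXT, year INTEGER, value REAL)')
rows = [(str(s), int(y), float(v)) for s, y, v in
        zip(df['series_id'], df['year'], df['value'])]
conn.executemany('INSERT INTO ce VALUES (?, ?, ?)', rows)

# time the equivalent GROUP BY query
start = time.time()
result = conn.execute('SELECT series_id, year, AVG(value) '
                      'FROM ce GROUP BY series_id, year').fetchall()
print 'SQLite: %.1f ms' % ((time.time() - start) * 1000)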
Edit: It's also possible to pass multiple functions instead of column names or to mix and match column names with functions:
In [22]: df.groupby([lambda x: x.year, lambda x: x.month]).mean()
Out[22]:
A B C D key_0 key_1
0 0.09433 0.06146 0.1891 0.2079 2000 1
1 0.02674 -0.09411 0.1557 -0.07926 2000 2
2 0.07089 0.3938 -0.08026 -0.416 2000 3
3 0.133 -0.04532 0.03341 -0.2866 2000 4
4 0.1301 0.1803 -0.4241 0.1482 2000 5
5 -0.3804 -0.08935 0.04791 0.08268 2000 6
6 0.2211 0.2483 -0.3559 -0.06205 2000 7
7 -0.2908 0.07606 -0.2001 -0.05225 2000 8
8 -0.07888 -0.3915 0.06442 0.5109 2000 9
9 0.3809 0.4037 -0.2431 -0.2892 2000 10
10 -0.2765 0.01461 -0.2653 0.1335 2000 11
11 0.6983 -0.1167 0.3964 0.03776 2000 12
12 -0.2347 0.2333 -0.06508 0.3 2001 1
13 0.1751 0.05545 -0.004038 0.06194 2001 2
14 0.08792 0.03517 0.05583 -0.4356 2001 3
15 -0.08408 0.05567 -0.1705 0.1914 2001 4
16 0.008912 0.2317 0.08875 0.08637 2001 5
17 0.3304 0.3635 0.005158 0.3588 2001 6
18 -0.1324 0.3419 -0.2527 -0.2249 2001 7
19 0.0472 0.2552 -0.07589 -0.07992 2001 8
20 0.07582 -0.2295 0.1002 0.1077 2001 9
21 0.2368 0.02214 -0.3866 -0.2671 2001 10
22 0.2116 -0.07872 -0.06399 0.2309 2001 11
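And here's a contrived sketch of the mix-and-match case, grouping by the A column of df2 together with a key function on its (integer) row labels:
# group by column 'A' and by whether the row label is even or odd
df2.groupby(['A', lambda i: i % 2]).mean()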
New: Omitting "nuisance" columns
It's pretty common to group by a column and ignore other columns containing non-floating-point data. It used to be that you'd get an error, forcing you to first drop the "nuisance" columns yourself. For example, suppose we only want to group by the A column here:
In [133]: df2
Out[133]:
A B C D
0 foo one 0.8248 0.3856
1 bar one 1.916 -0.5093
2 foo two -1.175 -1.036
3 bar three -0.9447 0.2519
4 foo two 0.6239 -1.506
5 bar two -2.315 0.1177
6 foo one -0.1133 -1.349
7 foo three -0.95 1.709
If you try to compute the mean of the B column, things won't quite work out:
In [134]: grouped = df2.groupby('A')['B'].mean()
TypeError: unsupported operand type(s) for /: 'str' and 'float'
To cope with this, so-called "nuisance" columns are now automatically dropped. While I don't like discarding information, I think "practicality beats purity" in this case:
In [135]: df2.groupby('A').mean()
Out[135]:
C D
bar -0.4479 -0.0466
foo -0.158 -0.3593
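Previously the workaround was to grab just the columns you cared about and group them yourself, something like:
# the old workaround: group each numeric column separately by the A column
df2['C'].groupby(df2['A']).mean()
df2['D'].groupby(df2['A']).mean()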
New: "auto-dispatching" to DataFrame / Series
Since very few functions in DataFrame and Series/TimeSeries are implemented on the GroupBy classes, you may wish to simply invoke a Series instance method on each group. For example:
In [140]: df.groupby('series_id')['value'].agg(Series.skew)
Out[140]:
CEU5500000034 -0.253887792525
CEU5500000035 0.0938252241898
CEU5552200034 0.387563548587
CEU5552200035 0.0868489117242
CEU5552210034 0.74506367536
CEU5552210035 0.551290226183
<snip>
But since we've got Python at our disposal, it's relatively straightforward to override __getattribute__ to wrap and dispatch calls to Series/DataFrame instance methods, so you can do this:
In [141]: df.groupby('series_id')['value'].skew()
Out[141]:
CEU5500000034 -0.253887792525
CEU5500000035 0.0938252241898
CEU5552200034 0.387563548587
CEU5552200035 0.0868489117242
CEU5552210034 0.74506367536
<snip>
In fact, even some of the more complicated methods, like the super-useful describe, can be used:
In [142]: df.groupby('series_id')['value'].describe()
Out[142]:
count mean std min 10% 50% 90% max
CEU5500000034 279 96.54 8.317 81.7 84 98.6 107.2 112
CEU5500000035 279 93.88 28.9 50.4 55.28 91.9 134.5 140.2
CEU5552200034 278 97.95 6.839 86.9 89.47 97.15 108.2 113.9
CEU5552200035 278 93.58 26.18 58 60.34 89.2 127.4 137.3
CEU5552210034 278 100.7 3.95 94.1 95.87 99.8 106 113.1
CEU5552210035 278 98.84 23.86 71.6 73.77 92.6 137.2 145.8
...
For those interested, the internal code to do this dispatching is short and simple, by the way:
def __getattribute__(self, attr):
    try:
        # normal attribute lookup first
        return object.__getattribute__(self, attr)
    except AttributeError:
        # fall back to dispatching to the grouped object (Series / DataFrame)
        if hasattr(self.obj, attr):
            return self._make_wrapper(attr)
        raise

def _make_wrapper(self, name):
    f = getattr(self.obj, name)
    if not isinstance(f, types.MethodType):
        # plain attribute: just aggregate it from each group
        return self.aggregate(lambda self: getattr(self, name))

    # return the class reference
    f = getattr(type(self.obj), name)

    def wrapper(*args, **kwargs):
        # curry the arguments and apply the method group by group
        curried = lambda self: f(self, *args, **kwargs)
        return self.aggregate(curried)
    return wrapper
Old but not well known: apply multiple functions
This has always been around but isn't well known. Suppose you want to apply multiple functions to a group and collect all the results. You can pass a dict of functions (here grouped is df.groupby('series_id')['value'], the same grouping used above):
In [144]: grouped.agg({'mean' : np.mean, 'variance' : np.var})
Out[144]:
mean variance
CEU5500000034 96.54 68.92
CEU5500000035 93.88 832.3
CEU5552200034 97.95 46.6
CEU5552200035 93.58 683.1
CEU5552210034 100.7 15.54
CEU5552210035 98.84 567.1
CEU5552211034 100.9 13.73
CEU5552211035 99.34 622
<snip>
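The functions don't have to come from NumPy, either; mixing in an ad-hoc lambda should work the same way, something like:
# any function of the group values works; name the outputs as you like
grouped.agg({'mean' : np.mean,
             'range' : lambda x: x.max() - x.min()})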
I have a mind to expand on this idea for DataFrame objects to get us closer to the full flexibility of SQL with Pythonic syntax. If anyone has ideas or suggestions please let me know.
Conclusions
GroupBy is certainly not done. If you use these tools and find them useful, please let me know. If you can think of ways to make them better, that would be nice information too. I think it would be great to implement a full SQL engine on top of pandas (similar to the SAS "proc sql"), and this new GroupBy functionality gets us closer to that goal.