10 Data Aggregation and Group Operations
Categorizing a dataset and applying a function to each group, whether an aggregation or transformation, can be a critical component of a data analysis workflow. After loading, merging, and preparing a dataset, you may need to compute group statistics or possibly pivot tables for reporting or visualization purposes. pandas provides a versatile groupby
interface, enabling you to slice, dice, and summarize datasets in a natural way.
One reason for the popularity of relational databases and SQL (which stands for “structured query language”) is the ease with which data can be joined, filtered, transformed, and aggregated. However, query languages like SQL impose certain limitations on the kinds of group operations that can be performed. As you will see, with the expressiveness of Python and pandas, we can perform quite complex group operations by expressing them as custom Python functions that manipulate the data associated with each group. In this chapter, you will learn how to:
Split a pandas object into pieces using one or more keys (in the form of functions, arrays, or DataFrame column names)
Calculate group summary statistics, like count, mean, or standard deviation, or a user-defined function
Apply within-group transformations or other manipulations, like normalization, linear regression, rank, or subset selection
Compute pivot tables and cross-tabulations
Perform quantile analysis and other statistical group analyses
Time-based aggregation of time series data, a special use case of groupby
, is referred to as resampling in this book and will receive separate treatment in Ch 11: Time Series.
As with the rest of the chapters, we start by importing NumPy and pandas:
In [12]: import numpy as np

In [13]: import pandas as pd
10.1 How to Think About Group Operations
Hadley Wickham, an author of many popular packages for the R programming language, coined the term split-apply-combine for describing group operations. In the first stage of the process, data contained in a pandas object, whether a Series, DataFrame, or otherwise, is split into groups based on one or more keys that you provide. The splitting is performed on a particular axis of an object. For example, a DataFrame can be grouped on its rows (axis="index"
) or its columns (axis="columns"
). Once this is done, a function is applied to each group, producing a new value. Finally, the results of all those function applications are combined into a result object. The form of the resulting object will usually depend on what’s being done to the data. See Figure 10.1 for a mockup of a simple group aggregation.
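To make the three stages concrete before diving into groupby itself, here is a minimal hand-rolled sketch (using a hypothetical toy DataFrame; the name df_demo is just for illustration) of what split, apply, and combine mean. groupby performs all of this in a single call:

import pandas as pd

df_demo = pd.DataFrame({"key": ["x", "y", "x", "y"],
                        "value": [1.0, 2.0, 3.0, 4.0]})

# split: one chunk of rows per distinct key value
pieces = {k: df_demo[df_demo["key"] == k] for k in df_demo["key"].unique()}

# apply: compute something on each chunk
sums = {k: chunk["value"].sum() for k, chunk in pieces.items()}

# combine: assemble the per-group results into a new pandas object
result = pd.Series(sums)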
Each grouping key can take many forms, and the keys do not have to be all of the same type:
A list or array of values that is the same length as the axis being grouped
A value indicating a column name in a DataFrame
A dictionary or Series giving a correspondence between the values on the axis being grouped and the group names
A function to be invoked on the axis index or the individual labels in the index
Note that the latter three methods are shortcuts for producing an array of values to be used to split up the object. Don’t worry if this all seems abstract. Throughout this chapter, I will give many examples of all these methods. To get started, here is a small tabular dataset as a DataFrame:
In [14]: df = pd.DataFrame({"key1" : ["a", "a", None, "b", "b", "a", None],
   ....:                    "key2" : pd.Series([1, 2, 1, 2, 1, None, 1],
   ....:                                       dtype="Int64"),
   ....:                    "data1" : np.random.standard_normal(7),
   ....:                    "data2" : np.random.standard_normal(7)})

In [15]: df
Out[15]: 
   key1  key2     data1     data2
0     a     1 -0.204708  0.281746
1     a     2  0.478943  0.769023
2  None     1 -0.519439  1.246435
3     b     2 -0.555730  1.007189
4     b     1  1.965781 -1.296221
5     a  <NA>  1.393406  0.274992
6  None     1  0.092908  0.228913
Suppose you wanted to compute the mean of the data1
column using the labels from key1
. There are a number of ways to do this. One is to access data1
and call groupby
with the column (a Series) at key1
:
In [16]: grouped = df["data1"].groupby(df["key1"])

In [17]: grouped
Out[17]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x17b7913f0>
This grouped
variable is now a special "GroupBy" object. It has not actually computed anything yet except for some intermediate data about the group key df["key1"]
. The idea is that this object has all of the information needed to then apply some operation to each of the groups. For example, to compute group means we can call the GroupBy’s mean
method:
In [18]: grouped.mean()
Out[18]: 
key1
a    0.555881
b    0.705025
Name: data1, dtype: float64
Later in Data Aggregation, I'll explain more about what happens when you call .mean()
. The important thing here is that the data (a Series) has been aggregated by splitting the data on the group key, producing a new Series that is now indexed by the unique values in the key1
column. The result index has the name "key1"
because the DataFrame column df["key1"]
did.
If instead we had passed multiple arrays as a list, we'd get something different:
In [19]: means = df["data1"].groupby([df["key1"], df["key2"]]).mean()

In [20]: means
Out[20]: 
key1  key2
a     1      -0.204708
      2       0.478943
b     1       1.965781
      2      -0.555730
Name: data1, dtype: float64
Here we grouped the data using two keys, and the resulting Series now has a hierarchical index consisting of the unique pairs of keys observed:
In [21]: means.unstack()
Out[21]: 
key2         1         2
key1                    
a    -0.204708  0.478943
b     1.965781 -0.555730
In this example, the group keys are all Series, though they could be any arrays of the right length:
In [22]: states = np.array(["OH", "CA", "CA", "OH", "OH", "CA", "OH"])

In [23]: years = [2005, 2005, 2006, 2005, 2006, 2005, 2006]

In [24]: df["data1"].groupby([states, years]).mean()
Out[24]: 
CA  2005    0.936175
    2006   -0.519439
OH  2005   -0.380219
    2006    1.029344
Name: data1, dtype: float64
Frequently, the grouping information is found in the same DataFrame as the data you want to work on. In that case, you can pass column names (whether those are strings, numbers, or other Python objects) as the group keys:
In [25]: df.groupby("key1").mean()
Out[25]: 
      key2     data1     data2
key1                          
a      1.5  0.555881  0.441920
b      1.5  0.705025 -0.144516

In [26]: df.groupby("key2").mean(numeric_only=True)
Out[26]: 
         data1     data2
key2                    
1     0.333636  0.115218
2    -0.038393  0.888106

In [27]: df.groupby(["key1", "key2"]).mean()
Out[27]: 
              data1     data2
key1 key2                    
a    1    -0.204708  0.281746
     2     0.478943  0.769023
b    1     1.965781 -1.296221
     2    -0.555730  1.007189
You may notice that in the second case, it is necessary to pass numeric_only=True
because the key1
column is not numeric and thus cannot be aggregated with mean()
.
Regardless of the objective in using groupby
, a generally useful GroupBy method is size
, which returns a Series containing group sizes:
In [28]: df.groupby(["key1", "key2"]).size()
Out[28]: 
key1  key2
a     1       1
      2       1
b     1       1
      2       1
dtype: int64
Note that any missing values in a group key are excluded from the result by default. This behavior can be disabled by passing dropna=False
to groupby
:
In [29]: df.groupby("key1", dropna=False).size()
Out[29]: 
key1
a      3
b      2
NaN    2
dtype: int64

In [30]: df.groupby(["key1", "key2"], dropna=False).size()
Out[30]: 
key1  key2
a     1       1
      2       1
      <NA>    1
b     1       1
      2       1
NaN   1       2
dtype: int64
A group function similar in spirit to size
is count, which computes the number of nonnull values in each group:
In [31]: df.groupby("key1").count()
Out[31]: 
      key2  data1  data2
key1                    
a        2      3      3
b        2      2      2
Iterating over Groups
The object returned by groupby
supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data. Consider the following:
In [32]: for name, group in df.groupby("key1"):
   ....:     print(name)
   ....:     print(group)
   ....:
a
  key1  key2     data1     data2
0    a     1 -0.204708  0.281746
1    a     2  0.478943  0.769023
5    a  <NA>  1.393406  0.274992
b
  key1  key2     data1     data2
3    b     2 -0.555730  1.007189
4    b     1  1.965781 -1.296221
In the case of multiple keys, the first element in the tuple will be a tuple of key values:
In [33]: for (k1, k2), group in df.groupby(["key1", "key2"]):
   ....:     print((k1, k2))
   ....:     print(group)
   ....:
('a', 1)
  key1  key2     data1     data2
0    a     1 -0.204708  0.281746
('a', 2)
  key1  key2     data1     data2
1    a     2  0.478943  0.769023
('b', 1)
  key1  key2     data1     data2
4    b     1  1.965781 -1.296221
('b', 2)
  key1  key2    data1     data2
3    b     2 -0.55573  1.007189
Of course, you can choose to do whatever you want with the pieces of data. A recipe you may find useful is computing a dictionary of the data pieces as a one-liner:
In [34]: pieces = {name: group for name, group in df.groupby("key1")}

In [35]: pieces["b"]
Out[35]: 
  key1  key2     data1     data2
3    b     2 -0.555730  1.007189
4    b     1  1.965781 -1.296221
By default groupby
groups on axis="index"
, but you can group on any of the other axes. For example, we could group the columns of our example df
here by whether they start with "key"
or "data"
:
In [36]: grouped = df.groupby({"key1": "key", "key2": "key",
   ....:                       "data1": "data", "data2": "data"}, axis="columns")
We can print out the groups like so:
In [37]: for group_key, group_values in grouped:
   ....:     print(group_key)
   ....:     print(group_values)
   ....:
data
      data1     data2
0 -0.204708  0.281746
1  0.478943  0.769023
2 -0.519439  1.246435
3 -0.555730  1.007189
4  1.965781 -1.296221
5  1.393406  0.274992
6  0.092908  0.228913
key
   key1  key2
0     a     1
1     a     2
2  None     1
3     b     2
4     b     1
5     a  <NA>
6  None     1
Selecting a Column or Subset of Columns
Indexing a GroupBy object created from a DataFrame with a column name or array of column names has the effect of column subsetting for aggregation. This means that:
"key1")["data1"]
df.groupby("key1")[["data2"]] df.groupby(
are conveniences for:
"data1"].groupby(df["key1"])
df["data2"]].groupby(df["key1"]) df[[
Especially for large datasets, it may be desirable to aggregate only a few columns. For example, in the preceding dataset, to compute the means for just the data2
column and get the result as a DataFrame, we could write:
In [38]: df.groupby(["key1", "key2"])[["data2"]].mean()
Out[38]: 
              data2
key1 key2          
a    1     0.281746
     2     0.769023
b    1    -1.296221
     2     1.007189
The object returned by this indexing operation is a grouped DataFrame if a list or array is passed, or a grouped Series if only a single column name is passed as a scalar:
In [39]: s_grouped = df.groupby(["key1", "key2"])["data2"]

In [40]: s_grouped
Out[40]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x17b8356c0>

In [41]: s_grouped.mean()
Out[41]: 
key1  key2
a     1       0.281746
      2       0.769023
b     1      -1.296221
      2       1.007189
Name: data2, dtype: float64
Grouping with Dictionaries and Series
Grouping information may exist in a form other than an array. Let’s consider another example DataFrame:
In [42]: people = pd.DataFrame(np.random.standard_normal((5, 5)),
   ....:                       columns=["a", "b", "c", "d", "e"],
   ....:                       index=["Joe", "Steve", "Wanda", "Jill", "Trey"])

In [43]: people.iloc[2:3, [1, 2]] = np.nan  # Add a few NA values

In [44]: people
Out[44]: 
              a         b         c         d         e
Joe    1.352917  0.886429 -2.001637 -0.371843  1.669025
Steve -0.438570 -0.539741  0.476985  3.248944 -1.021228
Wanda -0.577087       NaN       NaN  0.523772  0.000940
Jill   1.343810 -0.713544 -0.831154 -2.370232 -1.860761
Trey  -0.860757  0.560145 -1.265934  0.119827 -1.063512
Now, suppose I have a group correspondence for the columns and want to sum the columns by group:
In [45]: mapping = {"a": "red", "b": "red", "c": "blue",
   ....:            "d": "blue", "e": "red", "f" : "orange"}
Now, you could construct an array from this dictionary to pass to groupby
, but instead we can just pass the dictionary (I included the key "f"
to highlight that unused grouping keys are OK):
In [46]: by_column = people.groupby(mapping, axis="columns")

In [47]: by_column.sum()
Out[47]: 
           blue       red
Joe   -2.373480  3.908371
Steve  3.725929 -1.999539
Wanda  0.523772 -0.576147
Jill  -3.201385 -1.230495
Trey  -1.146107 -1.364125
The same functionality holds for Series, which can be viewed as a fixed-size mapping:
In [48]: map_series = pd.Series(mapping)

In [49]: map_series
Out[49]: 
a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

In [50]: people.groupby(map_series, axis="columns").count()
Out[50]: 
       blue  red
Joe       2    3
Steve     2    3
Wanda     1    2
Jill      2    3
Trey      2    3
Grouping with Functions
Using Python functions is a more generic way of defining a group mapping compared with a dictionary or Series. Any function passed as a group key will be called once per index value (or once per column value if using axis="columns"
), with the return values being used as the group names. More concretely, consider the example DataFrame from the previous section, which has people’s first names as index values. Suppose you wanted to group by name length. While you could compute an array of string lengths, it's simpler to just pass the len
function:
In [51]: people.groupby(len).sum()
Out[51]: 
          a         b         c         d         e
3  1.352917  0.886429 -2.001637 -0.371843  1.669025
4  0.483052 -0.153399 -2.097088 -2.250405 -2.924273
5 -1.015657 -0.539741  0.476985  3.772716 -1.020287
Mixing functions with arrays, dictionaries, or Series is not a problem, as everything gets converted to arrays internally:
In [52]: key_list = ["one", "one", "one", "two", "two"]

In [53]: people.groupby([len, key_list]).min()
Out[53]: 
              a         b         c         d         e
3 one  1.352917  0.886429 -2.001637 -0.371843  1.669025
4 two -0.860757 -0.713544 -1.265934 -2.370232 -1.860761
5 one -0.577087 -0.539741  0.476985  0.523772 -1.021228
Grouping by Index Levels
A final convenience for hierarchically indexed datasets is the ability to aggregate using one of the levels of an axis index. Let's look at an example:
In [54]: columns = pd.MultiIndex.from_arrays([["US", "US", "US", "JP", "JP"],
   ....:                                      [1, 3, 5, 1, 3]],
   ....:                                     names=["cty", "tenor"])

In [55]: hier_df = pd.DataFrame(np.random.standard_normal((4, 5)), columns=columns)

In [56]: hier_df
Out[56]: 
cty          US                            JP          
tenor         1         3         5         1         3
0      0.332883 -2.359419 -0.199543 -1.541996 -0.970736
1     -1.307030  0.286350  0.377984 -0.753887  0.331286
2      1.349742  0.069877  0.246674 -0.011862  1.004812
3      1.327195 -0.919262 -1.549106  0.022185  0.758363
To group by level, pass the level number or name using the level
keyword:
In [57]: hier_df.groupby(level="cty", axis="columns").count()
Out[57]: 
cty  JP  US
0     2   3
1     2   3
2     2   3
3     2   3
10.2 Data Aggregation
Aggregations refer to any data transformation that produces scalar values from arrays. The preceding examples have used several of them, including mean
, count
, min
, and sum
. You may wonder what is going on when you invoke mean()
on a GroupBy object. Many common aggregations, such as those found in Table 10.1, have optimized implementations. However, you are not limited to only this set of methods.
Function name | Description |
---|---|
any, all | Return True if any (one or more values) or all non-NA values are "truthy" |
count | Number of non-NA values |
cummin, cummax | Cumulative minimum and maximum of non-NA values |
cumsum | Cumulative sum of non-NA values |
cumprod | Cumulative product of non-NA values |
first, last | First and last non-NA values |
mean | Mean of non-NA values |
median | Arithmetic median of non-NA values |
min, max | Minimum and maximum of non-NA values |
nth | Retrieve value that would appear at position n with the data in sorted order |
ohlc | Compute four "open-high-low-close" statistics for time series-like data |
prod | Product of non-NA values |
quantile | Compute sample quantile |
rank | Ordinal ranks of non-NA values, like calling Series.rank |
size | Compute group sizes, returning result as a Series |
sum | Sum of non-NA values |
std, var | Sample standard deviation and variance |
You can use aggregations of your own devising and additionally call any method that is also defined on the object being grouped. For example, the nsmallest
Series method selects the smallest requested number of values from the data. While nsmallest
is not explicitly implemented for GroupBy, we can still use it with a nonoptimized implementation. Internally, GroupBy slices up the Series, calls piece.nsmallest(n)
for each piece, and then assembles those results into the result object:
In [58]: df
Out[58]: 
   key1  key2     data1     data2
0     a     1 -0.204708  0.281746
1     a     2  0.478943  0.769023
2  None     1 -0.519439  1.246435
3     b     2 -0.555730  1.007189
4     b     1  1.965781 -1.296221
5     a  <NA>  1.393406  0.274992
6  None     1  0.092908  0.228913

In [59]: grouped = df.groupby("key1")

In [60]: grouped["data1"].nsmallest(2)
Out[60]: 
key1   
a     0   -0.204708
      1    0.478943
b     3   -0.555730
      4    1.965781
Name: data1, dtype: float64
To use your own aggregation functions, pass any function that aggregates an array to the aggregate
method or its short alias agg
:
In [61]: def peak_to_peak(arr):
   ....:     return arr.max() - arr.min()

In [62]: grouped.agg(peak_to_peak)
Out[62]: 
      key2     data1     data2
key1                          
a        1  1.598113  0.494031
b        1  2.521511  2.303410
You may notice that some methods, like describe
, also work, even though they are not aggregations, strictly speaking:
In [63]: grouped.describe()
Out[63]: 
      key2                                          data1            ...  \
     count mean       std  min   25%  50%   75%  max count      mean  ...   
key1                                                                  ...   
a      2.0  1.5  0.707107  1.0  1.25  1.5  1.75  2.0   3.0  0.555881  ...   
b      2.0  1.5  0.707107  1.0  1.25  1.5  1.75  2.0   2.0  0.705025  ...   

                          data2                                          \
           75%       max  count      mean       std       min       25%   
key1                                                                      
a     0.936175  1.393406    3.0  0.441920  0.283299  0.274992  0.278369   
b     1.335403  1.965781    2.0 -0.144516  1.628757 -1.296221 -0.720368   

           50%       75%       max  
key1                                
a     0.281746  0.525384  0.769023  
b    -0.144516  0.431337  1.007189  

[2 rows x 24 columns]
I will explain in more detail what has happened here in Apply: General split-apply-combine.
Custom aggregation functions are generally much slower than the optimized functions found in Table 10.1. This is because there is some extra overhead (function calls, data rearrangement) in constructing the intermediate group data chunks.
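As a rough illustration of that difference, the following two calls compute the same group means, but the first uses the optimized built-in implementation while the second routes a Python function call through the slower generic machinery (a sketch; actual timings depend on your data):

grouped.agg("mean")                  # optimized built-in aggregation
grouped.agg(lambda arr: arr.mean())  # same result via a custom Python function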
Column-Wise and Multiple Function Application
Let's return to the tipping dataset used in the last chapter. After loading it with pandas.read_csv
, we add a tipping percentage column:
In [64]: tips = pd.read_csv("examples/tips.csv")

In [65]: tips.head()
Out[65]: 
   total_bill   tip smoker  day    time  size
0       16.99  1.01     No  Sun  Dinner     2
1       10.34  1.66     No  Sun  Dinner     3
2       21.01  3.50     No  Sun  Dinner     3
3       23.68  3.31     No  Sun  Dinner     2
4       24.59  3.61     No  Sun  Dinner     4
Now I will add a tip_pct
column with the tip percentage of the total bill:
66]: tips["tip_pct"] = tips["tip"] / tips["total_bill"]
In [
67]: tips.head()
In [67]:
Out[
total_bill tip smoker day time size tip_pct0 16.99 1.01 No Sun Dinner 2 0.059447
1 10.34 1.66 No Sun Dinner 3 0.160542
2 21.01 3.50 No Sun Dinner 3 0.166587
3 23.68 3.31 No Sun Dinner 2 0.139780
4 24.59 3.61 No Sun Dinner 4 0.146808
As you’ve already seen, aggregating a Series or all of the columns of a DataFrame is a matter of using aggregate
(or agg
) with the desired function or calling a method like mean
or std
. However, you may want to aggregate using a different function, depending on the column, or multiple functions at once. Fortunately, this is possible to do, which I’ll illustrate through a number of examples. First, I’ll group the tips
by day
and smoker
:
In [68]: grouped = tips.groupby(["day", "smoker"])
Note that for descriptive statistics like those in Table 10.1, you can pass the name of the function as a string:
In [69]: grouped_pct = grouped["tip_pct"]

In [70]: grouped_pct.agg("mean")
Out[70]: 
day   smoker
Fri   No        0.151650
      Yes       0.174783
Sat   No        0.158048
      Yes       0.147906
Sun   No        0.160113
      Yes       0.187250
Thur  No        0.160298
      Yes       0.163863
Name: tip_pct, dtype: float64
If you pass a list of functions or function names instead, you get back a DataFrame with column names taken from the functions:
In [71]: grouped_pct.agg(["mean", "std", peak_to_peak])
Out[71]: 
                 mean       std  peak_to_peak
day  smoker                                  
Fri  No      0.151650  0.028123      0.067349
     Yes     0.174783  0.051293      0.159925
Sat  No      0.158048  0.039767      0.235193
     Yes     0.147906  0.061375      0.290095
Sun  No      0.160113  0.042347      0.193226
     Yes     0.187250  0.154134      0.644685
Thur No      0.160298  0.038774      0.193350
     Yes     0.163863  0.039389      0.151240
Here we passed a list of aggregation functions to agg
to evaluate independently on the data groups.
You don’t need to accept the names that GroupBy gives to the columns; notably, lambda
functions have the name "<lambda>"
, which makes them hard to identify (you can see for yourself by looking at a function’s __name__
attribute). Thus, if you pass a list of (name, function)
tuples, the first element of each tuple will be used as the DataFrame column names (you can think of a list of 2-tuples as an ordered mapping):
In [72]: grouped_pct.agg([("average", "mean"), ("stdev", np.std)])
Out[72]: 
              average     stdev
day  smoker                    
Fri  No      0.151650  0.028123
     Yes     0.174783  0.051293
Sat  No      0.158048  0.039767
     Yes     0.147906  0.061375
Sun  No      0.160113  0.042347
     Yes     0.187250  0.154134
Thur No      0.160298  0.038774
     Yes     0.163863  0.039389
With a DataFrame you have more options, as you can specify a list of functions to apply to all of the columns or different functions per column. To start, suppose we wanted to compute the same three statistics for the tip_pct
and total_bill
columns:
In [73]: functions = ["count", "mean", "max"]

In [74]: result = grouped[["tip_pct", "total_bill"]].agg(functions)

In [75]: result
Out[75]: 
            tip_pct                     total_bill                  
              count      mean       max      count       mean    max
day  smoker                                                         
Fri  No           4  0.151650  0.187735          4  18.420000  22.75
     Yes         15  0.174783  0.263480         15  16.813333  40.17
Sat  No          45  0.158048  0.291990         45  19.661778  48.33
     Yes         42  0.147906  0.325733         42  21.276667  50.81
Sun  No          57  0.160113  0.252672         57  20.506667  48.17
     Yes         19  0.187250  0.710345         19  24.120000  45.35
Thur No          45  0.160298  0.266312         45  17.113111  41.19
     Yes         17  0.163863  0.241255         17  19.190588  43.11
As you can see, the resulting DataFrame has hierarchical columns, the same as you would get aggregating each column separately and using concat
to glue the results together using the column names as the keys
argument:
76]: result["tip_pct"]
In [76]:
Out[max
count mean
day smoker 4 0.151650 0.187735
Fri No 15 0.174783 0.263480
Yes 45 0.158048 0.291990
Sat No 42 0.147906 0.325733
Yes 57 0.160113 0.252672
Sun No 19 0.187250 0.710345
Yes 45 0.160298 0.266312
Thur No 17 0.163863 0.241255 Yes
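To make the connection to concat explicit, here is a sketch of how an equivalent hierarchical-column result could be assembled by hand (it mirrors the result above; this is not how pandas computes it internally):

pieces = [grouped["tip_pct"].agg(functions),
          grouped["total_bill"].agg(functions)]
manual_result = pd.concat(pieces, axis="columns", keys=["tip_pct", "total_bill"])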
As before, a list of tuples with custom names can be passed:
In [77]: ftuples = [("Average", "mean"), ("Variance", np.var)]

In [78]: grouped[["tip_pct", "total_bill"]].agg(ftuples)
Out[78]: 
              tip_pct           total_bill            
              Average  Variance    Average    Variance
day  smoker                                           
Fri  No      0.151650  0.000791  18.420000   25.596333
     Yes     0.174783  0.002631  16.813333   82.562438
Sat  No      0.158048  0.001581  19.661778   79.908965
     Yes     0.147906  0.003767  21.276667  101.387535
Sun  No      0.160113  0.001793  20.506667   66.099980
     Yes     0.187250  0.023757  24.120000  109.046044
Thur No      0.160298  0.001503  17.113111   59.625081
     Yes     0.163863  0.001551  19.190588   69.808518
Now, suppose you wanted to apply potentially different functions to one or more of the columns. To do this, pass a dictionary to agg
that contains a mapping of column names to any of the function specifications listed so far:
In [79]: grouped.agg({"tip" : np.max, "size" : "sum"})
Out[79]: 
               tip  size
day  smoker             
Fri  No       3.50     9
     Yes      4.73    31
Sat  No       9.00   115
     Yes     10.00   104
Sun  No       6.00   167
     Yes      6.50    49
Thur No       6.70   112
     Yes      5.00    40

In [80]: grouped.agg({"tip_pct" : ["min", "max", "mean", "std"],
   ....:              "size" : "sum"})
Out[80]: 
              tip_pct                                size
                  min       max      mean       std  sum
day  smoker                                             
Fri  No      0.120385  0.187735  0.151650  0.028123    9
     Yes     0.103555  0.263480  0.174783  0.051293   31
Sat  No      0.056797  0.291990  0.158048  0.039767  115
     Yes     0.035638  0.325733  0.147906  0.061375  104
Sun  No      0.059447  0.252672  0.160113  0.042347  167
     Yes     0.065660  0.710345  0.187250  0.154134   49
Thur No      0.072961  0.266312  0.160298  0.038774  112
     Yes     0.090014  0.241255  0.163863  0.039389   40
A DataFrame will have hierarchical columns only if multiple functions are applied to at least one column.
Returning Aggregated Data Without Row Indexes
In all of the examples up until now, the aggregated data comes back with an index, potentially hierarchical, composed from the unique group key combinations. Since this isn’t always desirable, you can disable this behavior in most cases by passing as_index=False
to groupby
:
In [81]: grouped = tips.groupby(["day", "smoker"], as_index=False)

In [82]: grouped.mean(numeric_only=True)
Out[82]: 
    day smoker  total_bill       tip      size   tip_pct
0   Fri     No   18.420000  2.812500  2.250000  0.151650
1   Fri    Yes   16.813333  2.714000  2.066667  0.174783
2   Sat     No   19.661778  3.102889  2.555556  0.158048
3   Sat    Yes   21.276667  2.875476  2.476190  0.147906
4   Sun     No   20.506667  3.167895  2.929825  0.160113
5   Sun    Yes   24.120000  3.516842  2.578947  0.187250
6  Thur     No   17.113111  2.673778  2.488889  0.160298
7  Thur    Yes   19.190588  3.030000  2.352941  0.163863
Of course, it’s always possible to obtain the result in this format by calling reset_index
on the result. Using the as_index=False
argument avoids some unnecessary computations.
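For reference, the same table as above can be produced by aggregating with the default hierarchical index and then moving the group keys back into columns (a sketch of the equivalent computation):

tips.groupby(["day", "smoker"]).mean(numeric_only=True).reset_index()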
10.3 Apply: General split-apply-combine
The most general-purpose GroupBy method is apply
, which is the subject of this section. apply
splits the object being manipulated into pieces, invokes the passed function on each piece, and then attempts to concatenate the pieces.
Returning to the tipping dataset from before, suppose you wanted to select the top five tip_pct
values by group. First, write a function that selects the rows with the largest values in a particular column:
In [83]: def top(df, n=5, column="tip_pct"):
   ....:     return df.sort_values(column, ascending=False)[:n]

In [84]: top(tips, n=6)
Out[84]: 
     total_bill   tip smoker  day    time  size   tip_pct
172        7.25  5.15    Yes  Sun  Dinner     2  0.710345
178        9.60  4.00    Yes  Sun  Dinner     2  0.416667
67         3.07  1.00    Yes  Sat  Dinner     1  0.325733
232       11.61  3.39     No  Sat  Dinner     2  0.291990
183       23.17  6.50    Yes  Sun  Dinner     4  0.280535
109       14.31  4.00    Yes  Sat  Dinner     2  0.279525
Now, if we group by smoker
, say, and call apply
with this function, we get the following:
In [85]: tips.groupby("smoker").apply(top)
Out[85]: 
            total_bill   tip smoker   day    time  size   tip_pct
smoker                                                           
No     232       11.61  3.39     No   Sat  Dinner     2  0.291990
       149        7.51  2.00     No  Thur   Lunch     2  0.266312
       51        10.29  2.60     No   Sun  Dinner     2  0.252672
       185       20.69  5.00     No   Sun  Dinner     5  0.241663
       88        24.71  5.85     No  Thur   Lunch     2  0.236746
Yes    172        7.25  5.15    Yes   Sun  Dinner     2  0.710345
       178        9.60  4.00    Yes   Sun  Dinner     2  0.416667
       67         3.07  1.00    Yes   Sat  Dinner     1  0.325733
       183       23.17  6.50    Yes   Sun  Dinner     4  0.280535
       109       14.31  4.00    Yes   Sat  Dinner     2  0.279525
What has happened here? First, the tips
DataFrame is split into groups based on the value of smoker
. Then the top
function is called on each group, and the results of each function call are glued together using pandas.concat
, labeling the pieces with the group names. The result therefore has a hierarchical index with an inner level that contains index values from the original DataFrame.
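Conceptually, the result above is assembled roughly as in this sketch, which splits manually and glues the per-group top results back together with pandas.concat (an illustration of the equivalent construction, not the actual internal code path):

pieces = {name: top(group) for name, group in tips.groupby("smoker")}
manual = pd.concat(pieces, names=["smoker"])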
If you pass a function to apply
that takes other arguments or keywords, you can pass these after the function:
In [86]: tips.groupby(["smoker", "day"]).apply(top, n=1, column="total_bill")
Out[86]: 
                 total_bill    tip smoker   day    time  size   tip_pct
smoker day                                                             
No     Fri  94        22.75   3.25     No   Fri  Dinner     2  0.142857
       Sat  212       48.33   9.00     No   Sat  Dinner     4  0.186220
       Sun  156       48.17   5.00     No   Sun  Dinner     6  0.103799
       Thur 142       41.19   5.00     No  Thur   Lunch     5  0.121389
Yes    Fri  95        40.17   4.73    Yes   Fri  Dinner     4  0.117750
       Sat  170       50.81  10.00    Yes   Sat  Dinner     3  0.196812
       Sun  182       45.35   3.50    Yes   Sun  Dinner     3  0.077178
       Thur 197       43.11   5.00    Yes  Thur   Lunch     4  0.115982
Beyond these basic usage mechanics, getting the most out of apply
may require some creativity. What occurs inside the function passed is up to you; it must either return a pandas object or a scalar value. The rest of this chapter will consist mainly of examples showing you how to solve various problems using groupby
.
For example, you may recall that I earlier called describe
on a GroupBy object:
In [87]: result = tips.groupby("smoker")["tip_pct"].describe()

In [88]: result
Out[88]: 
        count      mean       std       min       25%       50%       75%  \
smoker                                                                      
No      151.0  0.159328  0.039910  0.056797  0.136906  0.155625  0.185014   
Yes      93.0  0.163196  0.085119  0.035638  0.106771  0.153846  0.195059   

             max  
smoker            
No      0.291990  
Yes     0.710345  

In [89]: result.unstack("smoker")
Out[89]: 
       smoker
count  No        151.000000
       Yes        93.000000
mean   No          0.159328
       Yes         0.163196
std    No          0.039910
       Yes         0.085119
min    No          0.056797
       Yes         0.035638
25%    No          0.136906
       Yes         0.106771
50%    No          0.155625
       Yes         0.153846
75%    No          0.185014
       Yes         0.195059
max    No          0.291990
       Yes         0.710345
dtype: float64
Inside GroupBy, when you invoke a method like describe
, it is actually just a shortcut for:
def f(group):
    return group.describe()

grouped.apply(f)
Suppressing the Group Keys
In the preceding examples, you see that the resulting object has a hierarchical index formed from the group keys, along with the indexes of each piece of the original object. You can disable this by passing group_keys=False
to groupby
:
In [90]: tips.groupby("smoker", group_keys=False).apply(top)
Out[90]: 
     total_bill   tip smoker   day    time  size   tip_pct
232       11.61  3.39     No   Sat  Dinner     2  0.291990
149        7.51  2.00     No  Thur   Lunch     2  0.266312
51        10.29  2.60     No   Sun  Dinner     2  0.252672
185       20.69  5.00     No   Sun  Dinner     5  0.241663
88        24.71  5.85     No  Thur   Lunch     2  0.236746
172        7.25  5.15    Yes   Sun  Dinner     2  0.710345
178        9.60  4.00    Yes   Sun  Dinner     2  0.416667
67         3.07  1.00    Yes   Sat  Dinner     1  0.325733
183       23.17  6.50    Yes   Sun  Dinner     4  0.280535
109       14.31  4.00    Yes   Sat  Dinner     2  0.279525
Quantile and Bucket Analysis
As you may recall from Ch 8: Data Wrangling: Join, Combine, and Reshape, pandas has some tools, in particular pandas.cut
and pandas.qcut
, for slicing data up into buckets with bins of your choosing, or by sample quantiles. Combining these functions with groupby
makes it convenient to perform bucket or quantile analysis on a dataset. Consider a simple random dataset and an equal-length bucket categorization using pandas.cut
:
In [91]: frame = pd.DataFrame({"data1": np.random.standard_normal(1000),
   ....:                       "data2": np.random.standard_normal(1000)})

In [92]: frame.head()
Out[92]: 
      data1     data2
0 -0.660524 -0.612905
1  0.862580  0.316447
2 -0.010032  0.838295
3  0.050009 -1.034423
4  0.670216  0.434304

In [93]: quartiles = pd.cut(frame["data1"], 4)

In [94]: quartiles.head(10)
Out[94]: 
0     (-1.23, 0.489]
1     (0.489, 2.208]
2     (-1.23, 0.489]
3     (-1.23, 0.489]
4     (0.489, 2.208]
5     (0.489, 2.208]
6     (-1.23, 0.489]
7     (-1.23, 0.489]
8    (-2.956, -1.23]
9     (-1.23, 0.489]
Name: data1, dtype: category
Categories (4, interval[float64, right]): [(-2.956, -1.23] < (-1.23, 0.489] < (0.489, 2.208] < (2.208, 3.928]]
The Categorical
object returned by cut
can be passed directly to groupby
. So we could compute a set of group statistics for the quartiles, like so:
In [95]: def get_stats(group):
   ....:     return pd.DataFrame(
   ....:         {"min": group.min(), "max": group.max(),
   ....:          "count": group.count(), "mean": group.mean()}
   ....:     )

In [96]: grouped = frame.groupby(quartiles)

In [97]: grouped.apply(get_stats)
Out[97]: 
                             min       max  count      mean
data1                                                       
(-2.956, -1.23] data1 -2.949343 -1.230179     94 -1.658818
                data2 -3.399312  1.670835     94 -0.033333
(-1.23, 0.489]  data1 -1.228918  0.488675    598 -0.329524
                data2 -2.989741  3.260383    598 -0.002622
(0.489, 2.208]  data1  0.489965  2.200997    298  1.065727
                data2 -3.745356  2.954439    298  0.078249
(2.208, 3.928]  data1  2.212303  3.927528     10  2.644253
                data2 -1.929776  1.765640     10  0.024750
Keep in mind the same result could have been computed more simply with:
In [98]: grouped.agg(["min", "max", "count", "mean"])
Out[98]: 
                     data1                                data2            \
                       min       max count      mean       min       max   
data1                                                                      
(-2.956, -1.23]  -2.949343 -1.230179    94 -1.658818 -3.399312  1.670835   
(-1.23, 0.489]   -1.228918  0.488675   598 -0.329524 -2.989741  3.260383   
(0.489, 2.208]    0.489965  2.200997   298  1.065727 -3.745356  2.954439   
(2.208, 3.928]    2.212303  3.927528    10  2.644253 -1.929776  1.765640   

                                  
                 count      mean  
data1                             
(-2.956, -1.23]     94 -0.033333  
(-1.23, 0.489]     598 -0.002622  
(0.489, 2.208]     298  0.078249  
(2.208, 3.928]      10  0.024750  
These were equal-length buckets; to compute equal-size buckets based on sample quantiles, use pandas.qcut. We can pass 4 as the number of buckets to compute sample quartiles, and pass labels=False to obtain just the quartile indices instead of intervals:
In [99]: quartiles_samp = pd.qcut(frame["data1"], 4, labels=False)

In [100]: quartiles_samp.head()
Out[100]: 
0    1
1    3
2    2
3    2
4    3
Name: data1, dtype: int64
In [101]: grouped = frame.groupby(quartiles_samp)

In [102]: grouped.apply(get_stats)
Out[102]: 
                   min       max  count      mean
data1                                            
0     data1  -2.949343 -0.685484    250 -1.212173
      data2  -3.399312  2.628441    250 -0.027045
1     data1  -0.683066 -0.030280    250 -0.368334
      data2  -2.630247  3.260383    250 -0.027845
2     data1  -0.027734  0.618965    250  0.295812
      data2  -3.056990  2.458842    250  0.014450
3     data1   0.623587  3.927528    250  1.248875
      data2  -3.745356  2.954439    250  0.115899
Example: Filling Missing Values with Group-Specific Values
When cleaning up missing data, in some cases you will remove data observations using dropna
, but in others you may want to fill in the null (NA) values using a fixed value or some value derived from the data. fillna
is the right tool to use; for example, here I fill in the null values with the mean:
In [103]: s = pd.Series(np.random.standard_normal(6))

In [104]: s[::2] = np.nan

In [105]: s
Out[105]: 
0         NaN
1    0.227290
2         NaN
3   -2.153545
4         NaN
5   -0.375842
dtype: float64

In [106]: s.fillna(s.mean())
Out[106]: 
0   -0.767366
1    0.227290
2   -0.767366
3   -2.153545
4   -0.767366
5   -0.375842
dtype: float64
Suppose you need the fill value to vary by group. One way to do this is to group the data and use apply
with a function that calls fillna
on each data chunk. Here is some sample data on US states divided into eastern and western regions:
In [107]: states = ["Ohio", "New York", "Vermont", "Florida",
   .....:           "Oregon", "Nevada", "California", "Idaho"]

In [108]: group_key = ["East", "East", "East", "East",
   .....:              "West", "West", "West", "West"]

In [109]: data = pd.Series(np.random.standard_normal(8), index=states)

In [110]: data
Out[110]: 
Ohio          0.329939
New York      0.981994
Vermont       1.105913
Florida      -1.613716
Oregon        1.561587
Nevada        0.406510
California    0.359244
Idaho        -0.614436
dtype: float64
Let's set some values in the data to be missing:
In [111]: data[["Vermont", "Nevada", "Idaho"]] = np.nan

In [112]: data
Out[112]: 
Ohio          0.329939
New York      0.981994
Vermont            NaN
Florida      -1.613716
Oregon        1.561587
Nevada             NaN
California    0.359244
Idaho              NaN
dtype: float64

In [113]: data.groupby(group_key).size()
Out[113]: 
East    4
West    4
dtype: int64

In [114]: data.groupby(group_key).count()
Out[114]: 
East    3
West    2
dtype: int64

In [115]: data.groupby(group_key).mean()
Out[115]: 
East   -0.100594
West    0.960416
dtype: float64
We can fill the NA values using the group means, like so:
In [116]: def fill_mean(group):
   .....:     return group.fillna(group.mean())

In [117]: data.groupby(group_key).apply(fill_mean)
Out[117]: 
East  Ohio          0.329939
      New York      0.981994
      Vermont      -0.100594
      Florida      -1.613716
West  Oregon        1.561587
      Nevada        0.960416
      California    0.359244
      Idaho         0.960416
dtype: float64
In another case, you might have predefined fill values in your code that vary by group. Since the groups have a name
attribute set internally, we can use that:
In [118]: fill_values = {"East": 0.5, "West": -1}

In [119]: def fill_func(group):
   .....:     return group.fillna(fill_values[group.name])

In [120]: data.groupby(group_key).apply(fill_func)
Out[120]: 
East  Ohio          0.329939
      New York      0.981994
      Vermont       0.500000
      Florida      -1.613716
West  Oregon        1.561587
      Nevada       -1.000000
      California    0.359244
      Idaho        -1.000000
dtype: float64
Example: Random Sampling and Permutation
Suppose you wanted to draw a random sample (with or without replacement) from a large dataset for Monte Carlo simulation purposes or some other application. There are a number of ways to perform the “draws”; here we use the sample
method for Series.
To demonstrate, here’s a way to construct a deck of English-style playing cards:
= ["H", "S", "C", "D"] # Hearts, Spades, Clubs, Diamonds
suits = (list(range(1, 11)) + [10] * 3) * 4
card_val = ["A"] + list(range(2, 11)) + ["J", "K", "Q"]
base_names = []
cards for suit in suits:
str(num) + suit for num in base_names)
cards.extend(
= pd.Series(card_val, index=cards) deck
Now we have a Series of length 52 whose index contains card names, and values are the ones used in blackjack and other games (to keep things simple, I let the ace "A"
be 1):
In [122]: deck.head(13)
Out[122]: 
AH      1
2H      2
3H      3
4H      4
5H      5
6H      6
7H      7
8H      8
9H      9
10H    10
JH     10
KH     10
QH     10
dtype: int64
Now, based on what I said before, drawing a hand of five cards from the deck could be written as:
In [123]: def draw(deck, n=5):
   .....:     return deck.sample(n)

In [124]: draw(deck)
Out[124]: 
4D      4
QH     10
8S      8
7D      7
9C      9
dtype: int64
Suppose you wanted two random cards from each suit. Because the suit is the last character of each card name, we can group based on this and use apply
:
In [125]: def get_suit(card):
   .....:     # last letter is suit
   .....:     return card[-1]

In [126]: deck.groupby(get_suit).apply(draw, n=2)
Out[126]: 
C  6C      6
   KC     10
D  7D      7
   3D      3
H  7H      7
   9H      9
S  2S      2
   QS     10
dtype: int64
Alternatively, we could pass group_keys=False
to drop the outer suit index, leaving in just the selected cards:
In [127]: deck.groupby(get_suit, group_keys=False).apply(draw, n=2)
Out[127]: 
AC      1
3C      3
5D      5
4D      4
10H    10
7H      7
QS     10
7S      7
dtype: int64
Example: Group Weighted Average and Correlation
Under the split-apply-combine paradigm of groupby
, operations between columns in a DataFrame or two Series, such as a group weighted average, are possible. As an example, take this dataset containing group keys, values, and some weights:
In [128]: df = pd.DataFrame({"category": ["a", "a", "a", "a",
   .....:                                 "b", "b", "b", "b"],
   .....:                    "data": np.random.standard_normal(8),
   .....:                    "weights": np.random.uniform(size=8)})

In [129]: df
Out[129]: 
  category      data   weights
0        a -1.691656  0.955905
1        a  0.511622  0.012745
2        a -0.401675  0.137009
3        a  0.968578  0.763037
4        b -1.818215  0.492472
5        b  0.279963  0.832908
6        b -0.200819  0.658331
7        b -0.217221  0.612009
The weighted average by category
would then be:
In [130]: grouped = df.groupby("category")

In [131]: def get_wavg(group):
   .....:     return np.average(group["data"], weights=group["weights"])

In [132]: grouped.apply(get_wavg)
Out[132]: 
category
a   -0.495807
b   -0.357273
dtype: float64
As another example, consider a financial dataset originally obtained from Yahoo! Finance containing end-of-day prices for a few stocks and the S&P 500 index (the SPX
symbol):
In [133]: close_px = pd.read_csv("examples/stock_px.csv", parse_dates=True,
   .....:                        index_col=0)

In [134]: close_px.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2214 entries, 2003-01-02 to 2011-10-14
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   AAPL    2214 non-null   float64
 1   MSFT    2214 non-null   float64
 2   XOM     2214 non-null   float64
 3   SPX     2214 non-null   float64
dtypes: float64(4)
memory usage: 86.5 KB

In [135]: close_px.tail(4)
Out[135]: 
              AAPL   MSFT    XOM      SPX
2011-10-11  400.29  27.00  76.27  1195.54
2011-10-12  402.19  26.96  77.16  1207.25
2011-10-13  408.43  27.18  76.37  1203.66
2011-10-14  422.00  27.27  78.11  1224.58
The DataFrame info()
method here is a convenient way to get an overview of the contents of a DataFrame.
One task of interest might be to compute a DataFrame consisting of the yearly correlations of daily returns (computed from percent changes) with SPX
. As one way to do this, we first create a function that computes the pair-wise correlation of each column with the "SPX"
column:
In [136]: def spx_corr(group):
   .....:     return group.corrwith(group["SPX"])
Next, we compute percent change on close_px
using pct_change
:
In [137]: rets = close_px.pct_change().dropna()
Lastly, we group these percent changes by year, which can be extracted from each row label with a one-line function that returns the year
attribute of each datetime
label:
In [138]: def get_year(x):
   .....:     return x.year

In [139]: by_year = rets.groupby(get_year)

In [140]: by_year.apply(spx_corr)
Out[140]: 
          AAPL      MSFT       XOM  SPX
2003  0.541124  0.745174  0.661265  1.0
2004  0.374283  0.588531  0.557742  1.0
2005  0.467540  0.562374  0.631010  1.0
2006  0.428267  0.406126  0.518514  1.0
2007  0.508118  0.658770  0.786264  1.0
2008  0.681434  0.804626  0.828303  1.0
2009  0.707103  0.654902  0.797921  1.0
2010  0.710105  0.730118  0.839057  1.0
2011  0.691931  0.800996  0.859975  1.0
You could also compute intercolumn correlations. Here we compute the annual correlation between Apple and Microsoft:
In [141]: def corr_aapl_msft(group):
   .....:     return group["AAPL"].corr(group["MSFT"])

In [142]: by_year.apply(corr_aapl_msft)
Out[142]: 
2003    0.480868
2004    0.259024
2005    0.300093
2006    0.161735
2007    0.417738
2008    0.611901
2009    0.432738
2010    0.571946
2011    0.581987
dtype: float64
Example: Group-Wise Linear Regression
In the same theme as the previous example, you can use groupby
to perform more complex group-wise statistical analysis, as long as the function returns a pandas object or scalar value. For example, I can define the following regress
function (using the statsmodels
econometrics library), which executes an ordinary least squares (OLS) regression on each chunk of data:
import statsmodels.api as sm

def regress(data, yvar=None, xvars=None):
    Y = data[yvar]
    X = data[xvars]
    X["intercept"] = 1.
    result = sm.OLS(Y, X).fit()
    return result.params
You can install statsmodels
with conda if you don't have it already:
conda install statsmodels
Now, to run a yearly linear regression of AAPL
on SPX
returns, execute:
In [144]: by_year.apply(regress, yvar="AAPL", xvars=["SPX"])
Out[144]: 
           SPX  intercept
2003  1.195406   0.000710
2004  1.363463   0.004201
2005  1.766415   0.003246
2006  1.645496   0.000080
2007  1.198761   0.003438
2008  0.968016  -0.001110
2009  0.879103   0.002954
2010  1.052608   0.001261
2011  0.806605   0.001514
10.4 Group Transforms and "Unwrapped" GroupBys
In Apply: General split-apply-combine, we looked at the apply
method in grouped operations for performing transformations. There is another built-in method called transform
, which is similar to apply
but imposes more constraints on the kind of function you can use:
It can produce a scalar value to be broadcast to the shape of the group.
It can produce an object of the same shape as the input group.
It must not mutate its input.
Let's consider a simple example for illustration:
In [145]: df = pd.DataFrame({'key': ['a', 'b', 'c'] * 4,
   .....:                    'value': np.arange(12.)})

In [146]: df
Out[146]: 
   key  value
0    a    0.0
1    b    1.0
2    c    2.0
3    a    3.0
4    b    4.0
5    c    5.0
6    a    6.0
7    b    7.0
8    c    8.0
9    a    9.0
10   b   10.0
11   c   11.0
Here are the group means by key:
In [147]: g = df.groupby('key')['value']

In [148]: g.mean()
Out[148]: 
key
a    4.5
b    5.5
c    6.5
Name: value, dtype: float64
Suppose instead we wanted to produce a Series of the same shape as df['value']
but with values replaced by the average grouped by 'key'
. We can pass a function that computes the mean of a single group to transform
:
In [149]: def get_mean(group):
   .....:     return group.mean()

In [150]: g.transform(get_mean)
Out[150]: 
0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64
For built-in aggregation functions, we can pass a string alias as with the GroupBy agg
method:
In [151]: g.transform('mean')
Out[151]: 
0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64
Like apply
, transform
works with functions that return Series, but the result must be the same size as the input. For example, we can multiply each group by 2 using a helper function:
In [152]: def times_two(group):
   .....:     return group * 2

In [153]: g.transform(times_two)
Out[153]: 
0      0.0
1      2.0
2      4.0
3      6.0
4      8.0
5     10.0
6     12.0
7     14.0
8     16.0
9     18.0
10    20.0
11    22.0
Name: value, dtype: float64
As a more complicated example, we can compute the ranks in descending order for each group:
In [154]: def get_ranks(group):
   .....:     return group.rank(ascending=False)

In [155]: g.transform(get_ranks)
Out[155]: 
0     4.0
1     4.0
2     4.0
3     3.0
4     3.0
5     3.0
6     2.0
7     2.0
8     2.0
9     1.0
10    1.0
11    1.0
Name: value, dtype: float64
Consider a group transformation function composed from simple aggregations:
In [156]: def normalize(x):
   .....:     return (x - x.mean()) / x.std()
We can obtain equivalent results in this case using either transform
or apply
:
In [157]: g.transform(normalize)
Out[157]: 
0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64

In [158]: g.apply(normalize)
Out[158]: 
key    
a    0    -1.161895
     3    -0.387298
     6     0.387298
     9     1.161895
b    1    -1.161895
     4    -0.387298
     7     0.387298
     10    1.161895
c    2    -1.161895
     5    -0.387298
     8     0.387298
     11    1.161895
Name: value, dtype: float64
Built-in aggregate functions like 'mean'
or 'sum'
are often much faster than a general apply
function. These also have a "fast path" when used with transform
. This allows us to perform what is called an unwrapped group operation:
In [159]: g.transform('mean')
Out[159]: 
0     4.5
1     5.5
2     6.5
3     4.5
4     5.5
5     6.5
6     4.5
7     5.5
8     6.5
9     4.5
10    5.5
11    6.5
Name: value, dtype: float64

In [160]: normalized = (df['value'] - g.transform('mean')) / g.transform('std')

In [161]: normalized
Out[161]: 
0    -1.161895
1    -1.161895
2    -1.161895
3    -0.387298
4    -0.387298
5    -0.387298
6     0.387298
7     0.387298
8     0.387298
9     1.161895
10    1.161895
11    1.161895
Name: value, dtype: float64
Here, we are doing arithmetic between the outputs of multiple GroupBy operations instead of writing a function and passing it to groupby(...).apply
. That is what is meant by "unwrapped."
While an unwrapped group operation may involve multiple group aggregations, the overall benefit of vectorized operations often outweighs this.
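If you want to check the speed difference on your own machine, a comparison along these lines (run in IPython with %timeit; exact numbers depend on the data size) contrasts the apply-based version with the unwrapped one:

%timeit g.apply(normalize)
%timeit (df['value'] - g.transform('mean')) / g.transform('std')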
10.5 Pivot Tables and Cross-Tabulation
A pivot table is a data summarization tool frequently found in spreadsheet programs and other data analysis software. It aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along the columns. Pivot tables in Python with pandas are made possible through the groupby
facility described in this chapter, combined with reshape operations utilizing hierarchical indexing. DataFrame also has a pivot_table
method, and there is also a top-level pandas.pivot_table
function. In addition to providing a convenience interface to groupby
, pivot_table
can add partial totals, also known as margins.
Returning to the tipping dataset, suppose you wanted to compute a table of group means (the default pivot_table
aggregation type) arranged by day
and smoker
on the rows:
In [162]: tips.head()
Out[162]: 
   total_bill   tip smoker  day    time  size   tip_pct
0       16.99  1.01     No  Sun  Dinner     2  0.059447
1       10.34  1.66     No  Sun  Dinner     3  0.160542
2       21.01  3.50     No  Sun  Dinner     3  0.166587
3       23.68  3.31     No  Sun  Dinner     2  0.139780
4       24.59  3.61     No  Sun  Dinner     4  0.146808

In [163]: tips.pivot_table(index=["day", "smoker"],
   .....:                  values=["size", "tip", "tip_pct", "total_bill"])
Out[163]: 
                 size       tip   tip_pct  total_bill
day  smoker                                          
Fri  No      2.250000  2.812500  0.151650   18.420000
     Yes     2.066667  2.714000  0.174783   16.813333
Sat  No      2.555556  3.102889  0.158048   19.661778
     Yes     2.476190  2.875476  0.147906   21.276667
Sun  No      2.929825  3.167895  0.160113   20.506667
     Yes     2.578947  3.516842  0.187250   24.120000
Thur No      2.488889  2.673778  0.160298   17.113111
     Yes     2.352941  3.030000  0.163863   19.190588
This could have been produced with groupby
directly, using tips.groupby(["day", "smoker"]).mean()
. Now, suppose we want to take the average of only tip_pct
and size
, and additionally group by time
. I’ll put smoker
in the table columns and time
and day
in the rows:
In [164]: tips.pivot_table(index=["time", "day"], columns="smoker",
   .....:                  values=["tip_pct", "size"])
Out[164]: 
                 size            tip_pct          
smoker             No       Yes       No       Yes
time   day                                        
Dinner Fri   2.000000  2.222222  0.139622  0.165347
       Sat   2.555556  2.476190  0.158048  0.147906
       Sun   2.929825  2.578947  0.160113  0.187250
       Thur  2.000000       NaN  0.159744       NaN
Lunch  Fri   3.000000  1.833333  0.187735  0.188937
       Thur  2.500000  2.352941  0.160311  0.163863
We could augment this table to include partial totals by passing margins=True
. This has the effect of adding All
row and column labels, with corresponding values being the group statistics for all the data within a single tier:
In [165]: tips.pivot_table(index=["time", "day"], columns="smoker",
   .....:                  values=["tip_pct", "size"], margins=True)
Out[165]: 
                 size                      tip_pct                    
smoker             No       Yes       All       No       Yes       All
time   day                                                            
Dinner Fri   2.000000  2.222222  2.166667  0.139622  0.165347  0.158916
       Sat   2.555556  2.476190  2.517241  0.158048  0.147906  0.153152
       Sun   2.929825  2.578947  2.842105  0.160113  0.187250  0.166897
       Thur  2.000000       NaN  2.000000  0.159744       NaN  0.159744
Lunch  Fri   3.000000  1.833333  2.000000  0.187735  0.188937  0.188765
       Thur  2.500000  2.352941  2.459016  0.160311  0.163863  0.161301
All          2.668874  2.408602  2.569672  0.159328  0.163196  0.160803
Here, the All
values are means without taking into account smoker versus non-smoker (the All
columns) or any of the two levels of grouping on the rows (the All
row).
To use an aggregation function other than mean
, pass it to the aggfunc
keyword argument. For example, "count"
or len
will give you a cross-tabulation (count or frequency) of group sizes (though "count"
will exclude null values from the count within data groups, while len
will not):
In [166]: tips.pivot_table(index=["time", "smoker"], columns="day",
   .....:                  values="tip_pct", aggfunc=len, margins=True)
Out[166]: 
day             Fri   Sat   Sun  Thur  All
time   smoker                             
Dinner No       3.0  45.0  57.0   1.0  106
       Yes      9.0  42.0  19.0   NaN   70
Lunch  No       1.0   NaN   NaN  44.0   45
       Yes      6.0   NaN   NaN  17.0   23
All            19.0  87.0  76.0  62.0  244
If some combinations are empty (or otherwise NA), you may wish to pass a fill_value
:
In [167]: tips.pivot_table(index=["time", "size", "smoker"], columns="day",
   .....:                  values="tip_pct", fill_value=0)
Out[167]: 
day                      Fri       Sat       Sun      Thur
time   size smoker                                        
Dinner 1    No      0.000000  0.137931  0.000000  0.000000
            Yes     0.000000  0.325733  0.000000  0.000000
       2    No      0.139622  0.162705  0.168859  0.159744
            Yes     0.171297  0.148668  0.207893  0.000000
       3    No      0.000000  0.154661  0.152663  0.000000
...                      ...       ...       ...       ...
Lunch  3    Yes     0.000000  0.000000  0.000000  0.204952
       4    No      0.000000  0.000000  0.000000  0.138919
            Yes     0.000000  0.000000  0.000000  0.155410
       5    No      0.000000  0.000000  0.000000  0.121389
       6    No      0.000000  0.000000  0.000000  0.173706

[21 rows x 4 columns]
See Table 10.2 for a summary of pivot_table
options.
Argument | Description |
---|---|
values | Column name or names to aggregate; by default, aggregates all numeric columns |
index | Column names or other group keys to group on the rows of the resulting pivot table |
columns | Column names or other group keys to group on the columns of the resulting pivot table |
aggfunc | Aggregation function or list of functions ("mean" by default); can be any function valid in a groupby context |
fill_value | Replace missing values in the result table |
dropna | If True, do not include columns whose entries are all NA |
margins | Add row/column subtotals and grand total (False by default) |
margins_name | Name to use for the margin row/column labels when passing margins=True; defaults to "All" |
observed | With Categorical group keys, if True, show only the observed category values in the keys rather than all categories |
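As one illustration of a Table 10.2 option not shown above, margins_name relabels the subtotal row and column (a sketch; the earlier examples use the default "All"):

tips.pivot_table(index="day", columns="smoker", values="tip_pct",
                 aggfunc="mean", margins=True, margins_name="Total")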
Cross-Tabulations: Crosstab
A cross-tabulation (or crosstab for short) is a special case of a pivot table that computes group frequencies. Here is an example:
In [168]: from io import StringIO

In [169]: data = """Sample  Nationality  Handedness
   .....: 1   USA  Right-handed
   .....: 2   Japan    Left-handed
   .....: 3   USA  Right-handed
   .....: 4   Japan    Right-handed
   .....: 5   Japan    Left-handed
   .....: 6   Japan    Right-handed
   .....: 7   USA  Right-handed
   .....: 8   USA  Left-handed
   .....: 9   Japan    Right-handed
   .....: 10  USA  Right-handed"""

In [170]: data = pd.read_table(StringIO(data), sep="\s+")
In [171]: data
Out[171]: 
   Sample Nationality    Handedness
0       1         USA  Right-handed
1       2       Japan   Left-handed
2       3         USA  Right-handed
3       4       Japan  Right-handed
4       5       Japan   Left-handed
5       6       Japan  Right-handed
6       7         USA  Right-handed
7       8         USA   Left-handed
8       9       Japan  Right-handed
9      10         USA  Right-handed
As part of some survey analysis, we might want to summarize this data by nationality and handedness. You could use pivot_table
to do this, but the pandas.crosstab
function can be more convenient:
In [172]: pd.crosstab(data["Nationality"], data["Handedness"], margins=True)
Out[172]: 
Handedness   Left-handed  Right-handed  All
Nationality                                
Japan                  2             3    5
USA                    1             4    5
All                    3             7   10
The first two arguments to crosstab
can each be an array or Series or a list of arrays. As in the tips data:
In [173]: pd.crosstab([tips["time"], tips["day"]], tips["smoker"], margins=True)
Out[173]: 
smoker        No  Yes  All
time   day                
Dinner Fri     3    9   12
       Sat    45   42   87
       Sun    57   19   76
       Thur    1    0    1
Lunch  Fri     1    6    7
       Thur   44   17   61
All          151   93  244
10.6 Conclusion
Mastering pandas's data grouping tools can help with data cleaning and modeling or statistical analysis work. In Ch 13: Data Analysis Examples we will look at several more example use cases for groupby
on real data.
In the next chapter, we turn our attention to time series data.