8 Data Wrangling: Join, Combine, and Reshape
In many applications, data may be spread across a number of files or databases, or be arranged in a form that is not convenient to analyze. This chapter focuses on tools to help combine, join, and rearrange data.
First, I introduce the concept of hierarchical indexing in pandas, which is used extensively in some of these operations. I then dig into the particular data manipulations. You can see various applied usages of these tools in Ch 13: Data Analysis Examples.
8.1 Hierarchical Indexing
Hierarchical indexing is an important feature of pandas that enables you to have multiple (two or more) index levels on an axis. Another way of thinking about it is that it provides a way for you to work with higher dimensional data in a lower dimensional form. Let’s start with a simple example: create a Series with a list of lists (or arrays) as the index:
In [11]: data = pd.Series(np.random.uniform(size=9),
   ....:                  index=[["a", "a", "a", "b", "b", "c", "c", "d", "d"],
   ....:                         [1, 2, 3, 1, 3, 1, 2, 2, 3]])

In [12]: data
Out[12]:
a  1    0.929616
   2    0.316376
   3    0.183919
b  1    0.204560
   3    0.567725
c  1    0.595545
   2    0.964515
d  2    0.653177
   3    0.748907
dtype: float64
What you’re seeing is a prettified view of a Series with a MultiIndex as its index. The “gaps” in the index display mean “use the label directly above”:
In [13]: data.index
Out[13]:
MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )
With a hierarchically indexed object, so-called partial indexing is possible, enabling you to concisely select subsets of the data:
In [14]: data["b"]
Out[14]:
1    0.204560
3    0.567725
dtype: float64

In [15]: data["b":"c"]
Out[15]:
b  1    0.204560
   3    0.567725
c  1    0.595545
   2    0.964515
dtype: float64

In [16]: data.loc[["b", "d"]]
Out[16]:
b  1    0.204560
   3    0.567725
d  2    0.653177
   3    0.748907
dtype: float64
Selection is even possible from an “inner” level. Here I select all of the values having the value 2 from the second index level:
In [17]: data.loc[:, 2]
Out[17]:
a    0.316376
c    0.964515
d    0.653177
dtype: float64
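The same inner-level selection can also be written with the xs (cross-section) method. What follows is a minimal sketch on a small Series with fixed values; the name s and its contents are made up for illustration and are not part of the example above.

```python
import pandas as pd

# Hypothetical two-level Series with fixed values so the result is reproducible
s = pd.Series(
    [10, 20, 30, 40, 50],
    index=[["a", "a", "b", "c", "c"], [1, 2, 2, 1, 2]],
)

# Select every row whose second-level label is 2, in two equivalent ways
by_loc = s.loc[:, 2]
by_xs = s.xs(2, level=1)

print(by_loc)
print(by_xs.equals(by_loc))
```

Both forms drop the selected level from the result, leaving only the outer labels as the index.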
Hierarchical indexing plays an important role in reshaping data and in group-based operations like forming a pivot table. For example, you can rearrange this data into a DataFrame using its unstack method:
In [18]: data.unstack()
Out[18]:
          1         2         3
a  0.929616  0.316376  0.183919
b  0.204560       NaN  0.567725
c  0.595545  0.964515       NaN
d       NaN  0.653177  0.748907
The inverse operation of unstack is stack:
In [19]: data.unstack().stack()
Out[19]:
a  1    0.929616
   2    0.316376
   3    0.183919
b  1    0.204560
   3    0.567725
c  1    0.595545
   2    0.964515
d  2    0.653177
   3    0.748907
dtype: float64
stack and unstack will be explored in more detail later in Reshaping and Pivoting.
With a DataFrame, either axis can have a hierarchical index:
In [20]: frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
   ....:                      index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
   ....:                      columns=[["Ohio", "Ohio", "Colorado"],
   ....:                               ["Green", "Red", "Green"]])

In [21]: frame
Out[21]:
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11
The hierarchical levels can have names (as strings or any Python objects). If so, these will show up in the console output:
In [22]: frame.index.names = ["key1", "key2"]

In [23]: frame.columns.names = ["state", "color"]

In [24]: frame
Out[24]:
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
These names supersede the name attribute, which is used only with single-level indexes.
Be careful to note that the index names "state" and "color" are not part of the row labels (the frame.index values).
You can see how many levels an index has by accessing its nlevels attribute:
In [25]: frame.index.nlevels
Out[25]: 2
With partial column indexing you can similarly select groups of columns:
In [26]: frame["Ohio"]
Out[26]:
color      Green  Red
key1 key2
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10
A MultiIndex can be created by itself and then reused; the columns in the preceding DataFrame with level names could also be created like this:
pd.MultiIndex.from_arrays([["Ohio", "Ohio", "Colorado"],
                           ["Green", "Red", "Green"]],
                          names=["state", "color"])
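As a small illustration of building a MultiIndex on its own, the sketch below constructs the same column index from parallel arrays and from a list of tuples, then checks that the two are equal. The variable names are illustrative only.

```python
import pandas as pd

# The same two-level column index built from parallel arrays...
from_arrays = pd.MultiIndex.from_arrays(
    [["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
    names=["state", "color"],
)

# ...and from one tuple per column
from_tuples = pd.MultiIndex.from_tuples(
    [("Ohio", "Green"), ("Ohio", "Red"), ("Colorado", "Green")],
    names=["state", "color"],
)

print(from_arrays.equals(from_tuples))
print(from_arrays.nlevels)
```

Either object can then be passed as the index or columns argument when constructing a DataFrame.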
Reordering and Sorting Levels
At times you may need to rearrange the order of the levels on an axis or sort the data by the values in one specific level. The swaplevel method takes two level numbers or names and returns a new object with the levels interchanged (but the data is otherwise unaltered):
In [27]: frame.swaplevel("key1", "key2")
Out[27]:
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11
sort_index by default sorts the data lexicographically using all the index levels, but you can choose to use only a single level or a subset of levels to sort by passing the level argument. For example:
In [28]: frame.sort_index(level=1)
Out[28]:
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11

In [29]: frame.swaplevel(0, 1).sort_index(level=0)
Out[29]:
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11
Data selection performance is much better on hierarchically indexed objects if the index is lexicographically sorted starting with the outermost level; that is, the result of calling sort_index(level=0) or sort_index().
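One way to check whether an index is already in sorted order is its is_monotonic_increasing attribute. The sketch below uses a small frame with made-up column names ("x", "y", "z") to show the attribute before and after sorting.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the column names are placeholders
frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=["x", "y", "z"],
)

# After swapping, the row labels run (1, a), (2, a), (1, b), (2, b)
swapped = frame.swaplevel(0, 1)
print(swapped.index.is_monotonic_increasing)

# sort_index() restores lexicographic order across all levels
sorted_frame = swapped.sort_index()
print(sorted_frame.index.is_monotonic_increasing)
```

Sorting once after a swaplevel, rather than before every selection, keeps later lookups on the frame cheap.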
Summary Statistics by Level
Many descriptive and summary statistics on DataFrame and Series have a level option in which you can specify the level you want to aggregate by on a particular axis. Consider the above DataFrame; we can aggregate by level on either the rows or columns, like so:
In [30]: frame.groupby(level="key2").sum()
Out[30]:
state  Ohio     Colorado
color Green Red    Green
key2
1         6   8       10
2        12  14       16

In [31]: frame.groupby(level="color", axis="columns").sum()
Out[31]:
color      Green  Red
key1 key2
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10
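Recent pandas releases (2.1 and later, as far as I am aware) emit a deprecation warning when axis="columns" is passed to groupby. One workaround, sketched below under that assumption, is to transpose, group on the row axis, and transpose back. The frame is rebuilt here so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays(
        [["a", "a", "b", "b"], [1, 2, 1, 2]], names=["key1", "key2"]
    ),
    columns=pd.MultiIndex.from_arrays(
        [["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
        names=["state", "color"],
    ),
)

# Transpose so the column levels become row levels, aggregate, then transpose back
by_color = frame.T.groupby(level="color").sum().T
print(by_color)
```

The result has one column per color, matching the column-wise aggregation shown above.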
We will discuss groupby in much more detail later in Ch 10: Data Aggregation and Group Operations.
Indexing with a DataFrame's Columns
It’s not unusual to want to use one or more columns from a DataFrame as the row index; alternatively, you may wish to move the row index into the DataFrame’s columns. Here’s an example DataFrame:
In [32]: frame = pd.DataFrame({"a": range(7), "b": range(7, 0, -1),
   ....:                       "c": ["one", "one", "one", "two", "two",
   ....:                             "two", "two"],
   ....:                       "d": [0, 1, 2, 0, 1, 2, 3]})

In [33]: frame
Out[33]:
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3
DataFrame’s set_index function will create a new DataFrame using one or more of its columns as the index:
In [34]: frame2 = frame.set_index(["c", "d"])

In [35]: frame2
Out[35]:
       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1
By default, the columns are removed from the DataFrame, though you can leave them in by passing drop=False to set_index:
In [36]: frame.set_index(["c", "d"], drop=False)
Out[36]:
       a  b    c  d
c   d
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3
reset_index, on the other hand, does the opposite of set_index; the hierarchical index levels are moved into the columns:
In [37]: frame2.reset_index()
Out[37]:
     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1
8.2 Combining and Merging Datasets
Data contained in pandas objects can be combined in a number of ways:
pandas.merge: Connect rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.
pandas.concat: Concatenate or "stack" objects together along an axis.
combine_first: Splice together overlapping data to fill in missing values in one object with values from another.
I will address each of these and give a number of examples. They’ll be utilized in examples throughout the rest of the book.
Database-Style DataFrame Joins
Merge or join operations combine datasets by linking rows using one or more keys. These operations are particularly important in relational databases (e.g., SQL-based). The pandas.merge function in pandas is the main entry point for using these algorithms on your data.
Let’s start with a simple example:
In [38]: df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "a", "b"],
   ....:                     "data1": pd.Series(range(7), dtype="Int64")})

In [39]: df2 = pd.DataFrame({"key": ["a", "b", "d"],
   ....:                     "data2": pd.Series(range(3), dtype="Int64")})

In [40]: df1
Out[40]:
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   a      5
6   b      6

In [41]: df2
Out[41]:
  key  data2
0   a      0
1   b      1
2   d      2
Here I am using pandas's Int64 extension type for nullable integers, discussed in Ch 7.3: Extension Data Types.
This is an example of a many-to-one join; the data in df1 has multiple rows labeled a and b, whereas df2 has only one row for each value in the key column. Calling pandas.merge with these objects, we obtain:
In [42]: pd.merge(df1, df2)
Out[42]:
  key  data1  data2
0   b      0      1
1   b      1      1
2   b      6      1
3   a      2      0
4   a      4      0
5   a      5      0
Note that I didn’t specify which column to join on. If that information is not specified, pandas.merge uses the overlapping column names as the keys. It’s a good practice to specify explicitly, though:
In [43]: pd.merge(df1, df2, on="key")
Out[43]:
  key  data1  data2
0   b      0      1
1   b      1      1
2   b      6      1
3   a      2      0
4   a      4      0
5   a      5      0
In general, the order of column output in pandas.merge operations is unspecified.
If the column names are different in each object, you can specify them separately:
In [44]: df3 = pd.DataFrame({"lkey": ["b", "b", "a", "c", "a", "a", "b"],
   ....:                     "data1": pd.Series(range(7), dtype="Int64")})

In [45]: df4 = pd.DataFrame({"rkey": ["a", "b", "d"],
   ....:                     "data2": pd.Series(range(3), dtype="Int64")})

In [46]: pd.merge(df3, df4, left_on="lkey", right_on="rkey")
Out[46]:
  lkey  data1 rkey  data2
0    b      0    b      1
1    b      1    b      1
2    b      6    b      1
3    a      2    a      0
4    a      4    a      0
5    a      5    a      0
You may notice that the "c" and "d" values and associated data are missing from the result. By default, pandas.merge does an "inner" join; the keys in the result are the intersection, or the common set found in both tables. Other possible options are "left", "right", and "outer". The outer join takes the union of the keys, combining the effect of applying both left and right joins:
In [47]: pd.merge(df1, df2, how="outer")
Out[47]:
  key  data1  data2
0   b      0      1
1   b      1      1
2   b      6      1
3   a      2      0
4   a      4      0
5   a      5      0
6   c      3   <NA>
7   d   <NA>      2

In [48]: pd.merge(df3, df4, left_on="lkey", right_on="rkey", how="outer")
Out[48]:
  lkey  data1 rkey  data2
0    b      0    b      1
1    b      1    b      1
2    b      6    b      1
3    a      2    a      0
4    a      4    a      0
5    a      5    a      0
6    c      3  NaN   <NA>
7  NaN   <NA>    d      2
In an outer join, rows from the left or right DataFrame objects that do not match on keys in the other DataFrame will appear with NA values in the other DataFrame's columns for the nonmatching rows.
See Table 8.1 for a summary of the options for how.
Option | Behavior |
---|---|
how="inner" | Use only the key combinations observed in both tables |
how="left" | Use all key combinations found in the left table |
how="right" | Use all key combinations found in the right table |
how="outer" | Use all key combinations observed in both tables together |
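To make the four options concrete, here is a minimal sketch that runs the same merge with each value of how and records which keys survive. The two small frames and the name keys_by_how are made up for this illustration.

```python
import pandas as pd

# Illustrative inputs: "c" appears only on the left, "d" only on the right
left = pd.DataFrame({"key": ["b", "a", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "rval": [4, 5, 6]})

keys_by_how = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(left, right, on="key", how=how)
    # sorted() so the recorded keys do not depend on the row order of the result
    keys_by_how[how] = sorted(merged["key"])
    print(how, keys_by_how[how])
```

The inner join keeps only the shared keys, the left and right joins each keep one side's keys in full, and the outer join keeps all four.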
Many-to-many merges form the Cartesian product of the matching keys. Here’s an example:
In [49]: df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
   ....:                     "data1": pd.Series(range(6), dtype="Int64")})

In [50]: df2 = pd.DataFrame({"key": ["a", "b", "a", "b", "d"],
   ....:                     "data2": pd.Series(range(5), dtype="Int64")})

In [51]: df1
Out[51]:
  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5

In [52]: df2
Out[52]:
  key  data2
0   a      0
1   b      1
2   a      2
3   b      3
4   d      4

In [53]: pd.merge(df1, df2, on="key", how="left")
Out[53]:
   key  data1  data2
0    b      0      1
1    b      0      3
2    b      1      1
3    b      1      3
4    a      2      0
5    a      2      2
6    c      3   <NA>
7    a      4      0
8    a      4      2
9    b      5      1
10   b      5      3
Since there were three "b" rows in the left DataFrame and two in the right one, there are six "b" rows in the result. The join method passed to the how keyword argument affects only the distinct key values appearing in the result:
In [54]: pd.merge(df1, df2, how="inner")
Out[54]:
  key  data1  data2
0   b      0      1
1   b      0      3
2   b      1      1
3   b      1      3
4   b      5      1
5   b      5      3
6   a      2      0
7   a      2      2
8   a      4      0
9   a      4      2
To merge with multiple keys, pass a list of column names:
In [55]: left = pd.DataFrame({"key1": ["foo", "foo", "bar"],
   ....:                      "key2": ["one", "two", "one"],
   ....:                      "lval": pd.Series([1, 2, 3], dtype='Int64')})

In [56]: right = pd.DataFrame({"key1": ["foo", "foo", "bar", "bar"],
   ....:                       "key2": ["one", "one", "one", "two"],
   ....:                       "rval": pd.Series([4, 5, 6, 7], dtype='Int64')})

In [57]: pd.merge(left, right, on=["key1", "key2"], how="outer")
Out[57]:
  key1 key2  lval  rval
0  foo  one     1     4
1  foo  one     1     5
2  foo  two     2  <NA>
3  bar  one     3     6
4  bar  two  <NA>     7
To determine which key combinations will appear in the result depending on the choice of merge method, think of the multiple keys as forming an array of tuples to be used as a single join key.
When you're joining columns on columns, the indexes on the passed DataFrame objects are discarded. If you need to preserve the index values, you can use reset_index to append the index to the columns.
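A minimal sketch of that reset_index approach follows. The frames and the row labels "r1" to "r3" are invented for illustration; because the index has no name, reset_index places the old labels in a column called "index".

```python
import pandas as pd

# Left frame with meaningful row labels that a column-on-column merge would drop
left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]},
                    index=["r1", "r2", "r3"])
right = pd.DataFrame({"key": ["a", "c"], "rval": [10, 30]})

# Plain merge: the result gets a fresh integer index
plain = pd.merge(left, right, on="key")
print(list(plain.index))

# Moving the index into a column first carries the labels through the merge
kept = pd.merge(left.reset_index(), right, on="key")
print(kept)
```

If the labels are needed as the index again afterward, set_index("index") on the result restores them.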
A last issue to consider in merge operations is the treatment of overlapping column names. For example:
In [58]: pd.merge(left, right, on="key1")
Out[58]:
  key1 key2_x  lval key2_y  rval
0  foo    one     1    one     4
1  foo    one     1    one     5
2  foo    two     2    one     4
3  foo    two     2    one     5
4  bar    one     3    one     6
5  bar    one     3    two     7
While you can address the overlap manually (see the section Ch 7.2.4: Renaming Axis Indexes for renaming axis labels), pandas.merge has a suffixes option for specifying strings to append to overlapping names in the left and right DataFrame objects:
In [59]: pd.merge(left, right, on="key1", suffixes=("_left", "_right"))
Out[59]:
  key1 key2_left  lval key2_right  rval
0  foo       one     1        one     4
1  foo       one     1        one     5
2  foo       two     2        one     4
3  foo       two     2        one     5
4  bar       one     3        one     6
5  bar       one     3        two     7
See Table 8.2 for an argument reference on pandas.merge. The next section covers joining using the DataFrame's row index.
Argument | Description |
---|---|
left | DataFrame to be merged on the left side. |
right | DataFrame to be merged on the right side. |
how | Type of join to apply: one of "inner", "outer", "left", or "right"; defaults to "inner". |
on | Column names to join on. Must be found in both DataFrame objects. If not specified and no other join keys given, will use the intersection of the column names in left and right as the join keys. |
left_on | Columns in left DataFrame to use as join keys. Can be a single column name or a list of column names. |
right_on | Analogous to left_on for right DataFrame. |
left_index | Use row index in left as its join key (or keys, if a MultiIndex). |
right_index | Analogous to left_index. |
sort | Sort merged data lexicographically by join keys; False by default. |
suffixes | Tuple of string values to append to column names in case of overlap; defaults to ("_x", "_y") (e.g., if "data" in both DataFrame objects, would appear as "data_x" and "data_y" in result). |
copy | If False, avoid copying data into resulting data structure in some exceptional cases; by default always copies. |
validate | Verifies if the merge is of the specified type, whether one-to-one, one-to-many, or many-to-many. See the docstring for full details on the options. |
indicator | Adds a special column _merge that indicates the source of each row; values will be "left_only", "right_only", or "both" based on the origin of the joined data in each row. |
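The validate and indicator arguments are easiest to understand by example. The sketch below uses two small made-up frames: the left one repeats the key "a", so the merge is many-to-one and a one_to_one validation is expected to fail.

```python
import pandas as pd

# Illustrative inputs: "a" is duplicated on the left, "c" exists only on the right
left = pd.DataFrame({"key": ["a", "a", "b"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "c"], "rval": [10, 20, 30]})

# indicator=True adds a _merge column recording where each row came from
merged = pd.merge(left, right, on="key", how="outer", indicator=True)
counts = merged["_merge"].value_counts().to_dict()
print(counts)

# validate raises MergeError when the keys do not have the declared shape
try:
    pd.merge(left, right, on="key", validate="one_to_one")
    one_to_one_ok = True
except pd.errors.MergeError:
    one_to_one_ok = False
print(one_to_one_ok)
```

Declaring the expected relationship with validate is a cheap guard against silently multiplying rows when a key that was assumed unique turns out not to be.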
Merging on Index
In some cases, the merge key(s) in a DataFrame will be found in its index (row labels). In this case, you can pass left_index=True or right_index=True (or both) to indicate that the index should be used as the merge key:
In [60]: left1 = pd.DataFrame({"key": ["a", "b", "a", "a", "b", "c"],
   ....:                       "value": pd.Series(range(6), dtype="Int64")})

In [61]: right1 = pd.DataFrame({"group_val": [3.5, 7]}, index=["a", "b"])

In [62]: left1
Out[62]:
  key  value
0   a      0
1   b      1
2   a      2
3   a      3
4   b      4
5   c      5

In [63]: right1
Out[63]:
   group_val
a        3.5
b        7.0

In [64]: pd.merge(left1, right1, left_on="key", right_index=True)
Out[64]:
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0
If you look carefully here, you will see that the index values for left1 have been preserved, whereas in other examples above, the indexes of the input DataFrame objects are dropped. Because the index of right1 is unique, this "many-to-one" merge (with the default how="inner" method) can preserve the index values from left1 that correspond to rows in the output.
Since the default merge method is to intersect the join keys, you can instead form the union of them with an outer join:
In [65]: pd.merge(left1, right1, left_on="key", right_index=True, how="outer")
Out[65]:
  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0
5   c      5        NaN
With hierarchically indexed data, things are more complicated, as joining on index is equivalent to a multiple-key merge:
In [66]: lefth = pd.DataFrame({"key1": ["Ohio", "Ohio", "Ohio",
   ....:                                "Nevada", "Nevada"],
   ....:                       "key2": [2000, 2001, 2002, 2001, 2002],
   ....:                       "data": pd.Series(range(5), dtype="Int64")})

In [67]: righth_index = pd.MultiIndex.from_arrays(
   ....:     [
   ....:         ["Nevada", "Nevada", "Ohio", "Ohio", "Ohio", "Ohio"],
   ....:         [2001, 2000, 2000, 2000, 2001, 2002]
   ....:     ]
   ....: )

In [68]: righth = pd.DataFrame({"event1": pd.Series([0, 2, 4, 6, 8, 10], dtype="Int64",
   ....:                                            index=righth_index),
   ....:                        "event2": pd.Series([1, 3, 5, 7, 9, 11], dtype="Int64",
   ....:                                            index=righth_index)})

In [69]: lefth
Out[69]:
     key1  key2  data
0    Ohio  2000     0
1    Ohio  2001     1
2    Ohio  2002     2
3  Nevada  2001     3
4  Nevada  2002     4

In [70]: righth
Out[70]:
             event1  event2
Nevada 2001       0       1
       2000       2       3
Ohio   2000       4       5
       2000       6       7
       2001       8       9
       2002      10      11
In this case, you have to indicate multiple columns to merge on as a list (note the handling of duplicate index values with how="outer"):
In [71]: pd.merge(lefth, righth, left_on=["key1", "key2"], right_index=True)
Out[71]:
     key1  key2  data  event1  event2
0    Ohio  2000     0       4       5
0    Ohio  2000     0       6       7
1    Ohio  2001     1       8       9
2    Ohio  2002     2      10      11
3  Nevada  2001     3       0       1

In [72]: pd.merge(lefth, righth, left_on=["key1", "key2"],
   ....:          right_index=True, how="outer")
Out[72]:
     key1  key2  data  event1  event2
0    Ohio  2000     0       4       5
0    Ohio  2000     0       6       7
1    Ohio  2001     1       8       9
2    Ohio  2002     2      10      11
3  Nevada  2001     3       0       1
4  Nevada  2002     4    <NA>    <NA>
4  Nevada  2000  <NA>       2       3
Using the indexes of both sides of the merge is also possible:
In [73]: left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
   ....:                      index=["a", "c", "e"],
   ....:                      columns=["Ohio", "Nevada"]).astype("Int64")

In [74]: right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
   ....:                       index=["b", "c", "d", "e"],
   ....:                       columns=["Missouri", "Alabama"]).astype("Int64")

In [75]: left2
Out[75]:
   Ohio  Nevada
a     1       2
c     3       4
e     5       6

In [76]: right2
Out[76]:
   Missouri  Alabama
b         7        8
c         9       10
d        11       12
e        13       14

In [77]: pd.merge(left2, right2, how="outer", left_index=True, right_index=True)
Out[77]:
   Ohio  Nevada  Missouri  Alabama
a     1       2      <NA>     <NA>
b  <NA>    <NA>         7        8
c     3       4         9       10
d  <NA>    <NA>        11       12
e     5       6        13       14
DataFrame has a join instance method to simplify merging by index. It can also be used to combine many DataFrame objects having the same or similar indexes but nonoverlapping columns. In the prior example, we could have written:
In [78]: left2.join(right2, how="outer")
Out[78]:
   Ohio  Nevada  Missouri  Alabama
a     1       2      <NA>     <NA>
b  <NA>    <NA>         7        8
c     3       4         9       10
d  <NA>    <NA>        11       12
e     5       6        13       14
Compared with pandas.merge, DataFrame’s join method performs a left join on the join keys by default. It also supports joining the index of the passed DataFrame on one of the columns of the calling DataFrame:
In [79]: left1.join(right1, on="key")
Out[79]:
  key  value  group_val
0   a      0        3.5
1   b      1        7.0
2   a      2        3.5
3   a      3        3.5
4   b      4        7.0
5   c      5        NaN
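To relate join back to pandas.merge, the sketch below rebuilds left1 and right1 (with a plain integer value column, a simplification of the example above) and checks that join with on="key" gives the same frame as an explicit left merge against the right-hand index.

```python
import pandas as pd

left1 = pd.DataFrame({"key": ["a", "b", "a", "a", "b", "c"],
                      "value": range(6)})
right1 = pd.DataFrame({"group_val": [3.5, 7.0]}, index=["a", "b"])

# join: left join of right1's index onto left1's "key" column
joined = left1.join(right1, on="key")

# The same operation spelled out with pandas.merge
merged = pd.merge(left1, right1, left_on="key", right_index=True, how="left")

# Compare after sorting by row label so row order cannot affect the check
print(joined.sort_index().equals(merged.sort_index()))
```

The join spelling is shorter, while the merge spelling makes the join type and the key sources explicit.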
You can think of this method as joining data "into" the object whose join method was called.
Lastly, for simple index-on-index merges, you can pass a list of DataFrames to join as an alternative to using the more general pandas.concat function described in the next section:
In [80]: another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]],
   ....:                        index=["a", "c", "e", "f"],
   ....:                        columns=["New York", "Oregon"])

In [81]: another
Out[81]:
   New York  Oregon
a       7.0     8.0
c       9.0    10.0
e      11.0    12.0
f      16.0    17.0

In [82]: left2.join([right2, another])
Out[82]:
   Ohio  Nevada  Missouri  Alabama  New York  Oregon
a     1       2      <NA>     <NA>       7.0     8.0
c     3       4         9       10       9.0    10.0
e     5       6        13       14      11.0    12.0

In [83]: left2.join([right2, another], how="outer")
Out[83]:
   Ohio  Nevada  Missouri  Alabama  New York  Oregon
a     1       2      <NA>     <NA>       7.0     8.0
c     3       4         9       10       9.0    10.0
e     5       6        13       14      11.0    12.0
b  <NA>    <NA>         7        8       NaN     NaN
d  <NA>    <NA>        11       12       NaN     NaN
f  <NA>    <NA>      <NA>     <NA>      16.0    17.0
Concatenating Along an Axis
Another kind of data combination operation is referred to interchangeably as concatenation or stacking. NumPy's concatenate function can do this with NumPy arrays:
In [84]: arr = np.arange(12).reshape((3, 4))

In [85]: arr
Out[85]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [86]: np.concatenate([arr, arr], axis=1)
Out[86]:
array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])
In the context of pandas objects such as Series and DataFrame, having labeled axes enables you to further generalize array concatenation. In particular, you have a number of additional concerns:
If the objects are indexed differently on the other axes, should we combine the distinct elements in these axes or use only the values in common?
Do the concatenated chunks of data need to be identifiable as such in the resulting object?
Does the "concatenation axis" contain data that needs to be preserved? In many cases, the default integer labels in a DataFrame are best discarded during concatenation.
The concat function in pandas provides a consistent way to address each of these questions. I’ll give a number of examples to illustrate how it works. Suppose we have three Series with no index overlap:
In [87]: s1 = pd.Series([0, 1], index=["a", "b"], dtype="Int64")

In [88]: s2 = pd.Series([2, 3, 4], index=["c", "d", "e"], dtype="Int64")

In [89]: s3 = pd.Series([5, 6], index=["f", "g"], dtype="Int64")
Calling pandas.concat with these objects in a list glues together the values and indexes:
In [90]: s1
Out[90]:
a    0
b    1
dtype: Int64

In [91]: s2
Out[91]:
c    2
d    3
e    4
dtype: Int64

In [92]: s3
Out[92]:
f    5
g    6
dtype: Int64

In [93]: pd.concat([s1, s2, s3])
Out[93]:
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: Int64
By default, pandas.concat works along axis="index", producing another Series. If you pass axis="columns", the result will instead be a DataFrame:
In [94]: pd.concat([s1, s2, s3], axis="columns")
Out[94]:
      0     1     2
a     0  <NA>  <NA>
b     1  <NA>  <NA>
c  <NA>     2  <NA>
d  <NA>     3  <NA>
e  <NA>     4  <NA>
f  <NA>  <NA>     5
g  <NA>  <NA>     6
In this case there is no overlap on the other axis, which as you can see is the union (the "outer" join) of the indexes. You can instead intersect them by passing join="inner":
In [95]: s4 = pd.concat([s1, s3])

In [96]: s4
Out[96]:
a    0
b    1
f    5
g    6
dtype: Int64

In [97]: pd.concat([s1, s4], axis="columns")
Out[97]:
      0  1
a     0  0
b     1  1
f  <NA>  5
g  <NA>  6

In [98]: pd.concat([s1, s4], axis="columns", join="inner")
Out[98]:
   0  1
a  0  0
b  1  1
In this last example, the "f" and "g" labels disappeared because of the join="inner" option.
A potential issue is that the concatenated pieces are not identifiable in the result. Suppose instead you wanted to create a hierarchical index on the concatenation axis. To do this, use the keys argument:
In [99]: result = pd.concat([s1, s1, s3], keys=["one", "two", "three"])

In [100]: result
Out[100]:
one    a    0
       b    1
two    a    0
       b    1
three  f    5
       g    6
dtype: Int64

In [101]: result.unstack()
Out[101]:
          a     b     f     g
one       0     1  <NA>  <NA>
two       0     1  <NA>  <NA>
three  <NA>  <NA>     5     6
In the case of combining Series along axis="columns", the keys become the DataFrame column headers:
In [102]: pd.concat([s1, s2, s3], axis="columns", keys=["one", "two", "three"])
Out[102]:
    one   two  three
a     0  <NA>   <NA>
b     1  <NA>   <NA>
c  <NA>     2   <NA>
d  <NA>     3   <NA>
e  <NA>     4   <NA>
f  <NA>  <NA>      5
g  <NA>  <NA>      6
The same logic extends to DataFrame objects:
In [103]: df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=["a", "b", "c"],
   .....:                    columns=["one", "two"])

In [104]: df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=["a", "c"],
   .....:                    columns=["three", "four"])

In [105]: df1
Out[105]:
   one  two
a    0    1
b    2    3
c    4    5

In [106]: df2
Out[106]:
   three  four
a      5     6
c      7     8

In [107]: pd.concat([df1, df2], axis="columns", keys=["level1", "level2"])
Out[107]:
  level1     level2
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0
Here the keys argument is used to create a hierarchical index where the first level can be used to identify each of the concatenated DataFrame objects.
If you pass a dictionary of objects instead of a list, the dictionary’s keys will be used for the keys option:
In [108]: pd.concat({"level1": df1, "level2": df2}, axis="columns")
Out[108]:
  level1     level2
     one two  three four
a      0   1    5.0  6.0
b      2   3    NaN  NaN
c      4   5    7.0  8.0
There are additional arguments governing how the hierarchical index is created (see Table 8.3). For example, we can name the created axis levels with the names argument:
In [109]: pd.concat([df1, df2], axis="columns", keys=["level1", "level2"],
   .....:           names=["upper", "lower"])
Out[109]:
upper level1     level2
lower    one two  three four
a          0   1    5.0  6.0
b          2   3    NaN  NaN
c          4   5    7.0  8.0
A last consideration concerns DataFrames in which the row index does not contain any relevant data:
In [110]: df1 = pd.DataFrame(np.random.standard_normal((3, 4)),
   .....:                    columns=["a", "b", "c", "d"])

In [111]: df2 = pd.DataFrame(np.random.standard_normal((2, 3)),
   .....:                    columns=["b", "d", "a"])

In [112]: df1
Out[112]:
          a         b         c         d
0  1.248804  0.774191 -0.319657 -0.624964
1  1.078814  0.544647  0.855588  1.343268
2 -0.267175  1.793095 -0.652929 -1.886837

In [113]: df2
Out[113]:
          b         d         a
0  1.059626  0.644448 -0.007799
1 -0.449204  2.448963  0.667226
In this case, you can pass ignore_index=True, which discards the indexes from each DataFrame and concatenates the data in the columns only, assigning a new default index:
In [114]: pd.concat([df1, df2], ignore_index=True)
Out[114]:
          a         b         c         d
0  1.248804  0.774191 -0.319657 -0.624964
1  1.078814  0.544647  0.855588  1.343268
2 -0.267175  1.793095 -0.652929 -1.886837
3 -0.007799  1.059626       NaN  0.644448
4  0.667226 -0.449204       NaN  2.448963
Table 8.3 describes the pandas.concat function arguments.
Argument | Description |
---|---|
objs | List or dictionary of pandas objects to be concatenated; this is the only required argument |
axis | Axis to concatenate along; defaults to concatenating along rows (axis="index") |
join | Either "inner" or "outer" ("outer" by default); whether to intersect (inner) or union (outer) indexes along the other axes |
keys | Values to associate with objects being concatenated, forming a hierarchical index along the concatenation axis; can be a list or array of arbitrary values, an array of tuples, or a list of arrays (if multiple-level arrays passed in levels) |
levels | Specific indexes to use as hierarchical index level or levels if keys passed |
names | Names for created hierarchical levels if keys and/or levels passed |
verify_integrity | Check new axis in concatenated object for duplicates and raise an exception if so; by default (False) allows duplicates |
ignore_index | Do not preserve indexes along concatenation axis, instead produce a new range(total_length) index |
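As a brief illustration of the verify_integrity and ignore_index arguments, the sketch below concatenates two small Series that share the label "b". The Series are made up for this example.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=["a", "b"])
s2 = pd.Series([2, 3], index=["b", "c"])   # label "b" overlaps with s1

# Default: duplicate labels on the concatenation axis are allowed
default_labels = pd.concat([s1, s2]).index.tolist()
print(default_labels)

# verify_integrity=True turns the duplicate label into an error
try:
    pd.concat([s1, s2], verify_integrity=True)
    raised = False
except ValueError:
    raised = True
print(raised)

# ignore_index=True sidesteps the issue by assigning a fresh integer index
fresh_labels = pd.concat([s1, s2], ignore_index=True).index.tolist()
print(fresh_labels)
```

Which option fits depends on whether the labels carry meaning: verify_integrity protects labels that must stay unique, while ignore_index discards labels that were only positional.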
Combining Data with Overlap
There is another data combination situation that can’t be expressed as either a merge or concatenation operation. You may have two datasets with indexes that overlap in full or in part. As a motivating example, consider NumPy’s where function, which performs the array-oriented equivalent of an if-else expression:
In [115]: a = pd.Series([np.nan, 2.5, 0.0, 3.5, 4.5, np.nan],
   .....:               index=["f", "e", "d", "c", "b", "a"])

In [116]: b = pd.Series([0., np.nan, 2., np.nan, np.nan, 5.],
   .....:               index=["a", "b", "c", "d", "e", "f"])

In [117]: a
Out[117]:
f    NaN
e    2.5
d    0.0
c    3.5
b    4.5
a    NaN
dtype: float64

In [118]: b
Out[118]:
a    0.0
b    NaN
c    2.0
d    NaN
e    NaN
f    5.0
dtype: float64

In [119]: np.where(pd.isna(a), b, a)
Out[119]: array([0. , 2.5, 0. , 3.5, 4.5, 5. ])
Here, whenever values in a are null, values from b are selected; otherwise, the non-null values from a are selected. numpy.where does not check whether the index labels are aligned (and does not even require the objects to be the same length), so if you want to line up values by index, use the Series combine_first method:
In [120]: a.combine_first(b)
Out[120]:
a    0.0
b    4.5
c    3.5
d    0.0
e    2.5
f    5.0
dtype: float64
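The difference between positional and label-aligned filling is easy to miss in the output above, so here is a short sketch that looks at a single label, "f", under both approaches. The names positional and aligned are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, 0.0, 3.5, 4.5, np.nan],
              index=["f", "e", "d", "c", "b", "a"])
b = pd.Series([0., np.nan, 2., np.nan, np.nan, 5.],
              index=["a", "b", "c", "d", "e", "f"])

# np.where works position by position, so a's first slot (label "f")
# is filled from b's first slot (label "a")
positional = pd.Series(np.where(pd.isna(a), b, a), index=a.index)

# combine_first matches on labels, so a["f"] is filled from b["f"]
aligned = a.combine_first(b)

print(positional["f"])
print(aligned["f"])
```

The two results disagree for "f" because the positional version silently paired values that belong to different labels.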
With DataFrames, combine_first does the same thing column by column, so you can think of it as “patching” missing data in the calling object with data from the object you pass:
In [121]: df1 = pd.DataFrame({"a": [1., np.nan, 5., np.nan],
   .....:                     "b": [np.nan, 2., np.nan, 6.],
   .....:                     "c": range(2, 18, 4)})

In [122]: df2 = pd.DataFrame({"a": [5., 4., np.nan, 3., 7.],
   .....:                     "b": [np.nan, 3., 4., 6., 8.]})

In [123]: df1
Out[123]:
     a    b   c
0  1.0  NaN   2
1  NaN  2.0   6
2  5.0  NaN  10
3  NaN  6.0  14

In [124]: df2
Out[124]:
     a    b
0  5.0  NaN
1  4.0  3.0
2  NaN  4.0
3  3.0  6.0
4  7.0  8.0

In [125]: df1.combine_first(df2)
Out[125]:
     a    b     c
0  1.0  NaN   2.0
1  4.0  2.0   6.0
2  5.0  4.0  10.0
3  3.0  6.0  14.0
4  7.0  8.0   NaN
The output of combine_first with DataFrame objects will have the union of all the column names.
8.3 Reshaping and Pivoting
There are a number of basic operations for rearranging tabular data. These are referred to as reshape or pivot operations.
Reshaping with Hierarchical Indexing
Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:
stack: This “rotates” or pivots from the columns in the data to the rows.
unstack: This pivots from the rows into the columns.
I’ll illustrate these operations through a series of examples. Consider a small DataFrame with string arrays as row and column indexes:
In [126]: data = pd.DataFrame(np.arange(6).reshape((2, 3)),
   .....:                     index=pd.Index(["Ohio", "Colorado"], name="state"),
   .....:                     columns=pd.Index(["one", "two", "three"],
   .....:                     name="number"))

In [127]: data
Out[127]:
number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5
Using the stack method on this data pivots the columns into the rows, producing a Series:
In [128]: result = data.stack()

In [129]: result
Out[129]:
state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int64
From a hierarchically indexed Series, you can rearrange the data back into a DataFrame with unstack:
In [130]: result.unstack()
Out[130]:
number    one  two  three
state
Ohio        0    1      2
Colorado    3    4      5
By default, the innermost level is unstacked (same with stack). You can unstack a different level by passing a level number or name:
In [131]: result.unstack(level=0)
Out[131]:
state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5

In [132]: result.unstack(level="state")
Out[132]:
state   Ohio  Colorado
number
one        0         3
two        1         4
three      2         5
Unstacking might introduce missing data if all of the values in the level aren’t found in each subgroup:
In [133]: s1 = pd.Series([0, 1, 2, 3], index=["a", "b", "c", "d"], dtype="Int64")

In [134]: s2 = pd.Series([4, 5, 6], index=["c", "d", "e"], dtype="Int64")

In [135]: data2 = pd.concat([s1, s2], keys=["one", "two"])

In [136]: data2
Out[136]:
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: Int64
Stacking filters out missing data by default, so the operation is more easily invertible:
In [137]: data2.unstack()
Out[137]:
        a     b  c  d     e
one     0     1  2  3  <NA>
two  <NA>  <NA>  4  5     6

In [138]: data2.unstack().stack()
Out[138]:
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: Int64

In [139]: data2.unstack().stack(dropna=False)
Out[139]:
one  a       0
     b       1
     c       2
     d       3
     e    <NA>
two  a    <NA>
     b    <NA>
     c       4
     d       5
     e       6
dtype: Int64
When you unstack in a DataFrame, the level unstacked becomes the lowest level in the result:
In [140]: df = pd.DataFrame({"left": result, "right": result + 5},
   .....:                   columns=pd.Index(["left", "right"], name="side"))

In [141]: df
Out[141]:
side             left  right
state    number
Ohio     one        0      5
         two        1      6
         three      2      7
Colorado one        3      8
         two        4      9
         three      5     10

In [142]: df.unstack(level="state")
Out[142]:
side   left          right
state  Ohio Colorado  Ohio Colorado
number
one       0        3     5        8
two       1        4     6        9
three     2        5     7       10
As with unstack, when calling stack we can indicate the name of the level to stack:
In [143]: df.unstack(level="state").stack(level="side")
Out[143]:
state         Colorado  Ohio
number side
one    left          3     0
       right         8     5
two    left          4     1
       right         9     6
three  left          5     2
       right        10     7
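The following sketch traces where the level names end up after each step. It rebuilds df as in the text; wide and tall are illustrative names. The order of the resulting columns may differ between pandas versions, so the example looks values up by label rather than by position.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(["Ohio", "Colorado"], name="state"),
    columns=pd.Index(["one", "two", "three"], name="number"),
)
result = data.stack()
df = pd.DataFrame({"left": result, "right": result + 5},
                  columns=pd.Index(["left", "right"], name="side"))

# "state" moves from the rows to the lowest column level
wide = df.unstack(level="state")
print(list(wide.columns.names))

# "side" moves from the columns to the innermost row level
tall = wide.stack(level="side")
print(list(tall.index.names), tall.columns.name)

# Label-based lookup does not depend on the column order
print(tall.loc[("one", "right"), "Colorado"])
```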
Pivoting “Long” to “Wide” Format
A common way to store multiple time series in databases and CSV files is what is sometimes called long or stacked format. In this format, individual values are represented by a single row in a table rather than multiple values per row.
Let's load some example data and do a small amount of time series wrangling and other data cleaning:
In [144]: data = pd.read_csv("examples/macrodata.csv")
In [145]: data = data.loc[:, ["year", "quarter", "realgdp", "infl", "unemp"]]
In [146]: data.head()
Out[146]:
   year  quarter   realgdp  infl  unemp
0  1959        1  2710.349  0.00    5.8
1  1959        2  2778.801  2.34    5.1
2  1959        3  2775.488  2.74    5.3
3  1959        4  2785.204  0.27    5.6
4  1960        1  2847.699  2.31    5.2
First, I use pandas.PeriodIndex (which represents time intervals rather than points in time), discussed in more detail in Ch 11: Time Series, to combine the year and quarter columns and set the index to consist of datetime values at the start of each quarter:
In [147]: periods = pd.PeriodIndex(year=data.pop("year"),
   .....:                          quarter=data.pop("quarter"),
   .....:                          name="date")
In [148]: periods
Out[148]:
PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3', '1960Q4', '1961Q1', '1961Q2',
             ...
             '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
             '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
            dtype='period[Q-DEC]', name='date', length=203)
In [149]: data.index = periods.to_timestamp("D")
In [150]: data.head()
Out[150]:
             realgdp  infl  unemp
date
1959-01-01  2710.349  0.00    5.8
1959-04-01  2778.801  2.34    5.1
1959-07-01  2775.488  2.74    5.3
1959-10-01  2785.204  0.27    5.6
1960-01-01  2847.699  2.31    5.2
Here I used the pop method on the DataFrame, which returns a column while deleting it from the DataFrame at the same time.
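The sketch below repeats these steps on a four-row stand-in for the macro data, so the effect of pop and to_timestamp can be inspected directly. Building the periods from string labels such as "1959Q1" is an assumption made for this illustration, not the construction used above. The names frame, labels, starts, and ends are likewise illustrative.

```python
import pandas as pd

# A tiny stand-in for the macro data (first four quarters only)
frame = pd.DataFrame({
    "year": [1959, 1959, 1959, 1959],
    "quarter": [1, 2, 3, 4],
    "unemp": [5.8, 5.1, 5.3, 5.6],
})

# pop returns the column and removes it from the DataFrame in one step
years = frame.pop("year")
quarters = frame.pop("quarter")

# Quarterly labels such as "1959Q1" (an illustrative way to build the periods)
labels = [f"{y}Q{q}" for y, q in zip(years, quarters)]
periods = pd.PeriodIndex(labels, freq="Q-DEC", name="date")

# The default is how="start": the first day of each quarter
starts = periods.to_timestamp("D")
# how="end" gives the last instant of each quarter; normalize keeps the date
ends = periods.to_timestamp("D", how="end").normalize()

frame.index = starts
print(frame)
print(ends)
```

If end-of-quarter dates are what an analysis needs, passing how="end" is one option; the example in the text uses the default start-of-period timestamps.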
Then, I select a subset of columns and give the columns index the name "item":
In [151]: data = data.reindex(columns=["realgdp", "infl", "unemp"])
In [152]: data.columns.name = "item"
In [153]: data.head()
Out[153]:
item         realgdp  infl  unemp
date
1959-01-01  2710.349  0.00    5.8
1959-04-01  2778.801  2.34    5.1
1959-07-01  2775.488  2.74    5.3
1959-10-01  2785.204  0.27    5.6
1960-01-01  2847.699  2.31    5.2
Lastly, I reshape with stack, turn the new index levels into columns with reset_index, and finally give the column containing the data values the name "value":
In [154]: long_data = (data.stack()
   .....:              .reset_index()
   .....:              .rename(columns={0: "value"}))
Now, long_data looks like:
In [155]: long_data[:10]
Out[155]:
        date     item     value
0 1959-01-01  realgdp  2710.349
1 1959-01-01     infl     0.000
2 1959-01-01    unemp     5.800
3 1959-04-01  realgdp  2778.801
4 1959-04-01     infl     2.340
5 1959-04-01    unemp     5.100
6 1959-07-01  realgdp  2775.488
7 1959-07-01     infl     2.740
8 1959-07-01    unemp     5.300
9 1959-10-01  realgdp  2785.204
In this so-called long format for multiple time series, each row in the table represents a single observation.
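The stack, reset_index, rename chain can be tried on a two-row table that reuses the first values shown above. The sketch also shows a possible shortcut, the name argument of Series.reset_index, which labels the value column in the same step. The names wide and alt are illustrative.

```python
import pandas as pd

# Two quarters of the wide table from the text
wide = pd.DataFrame(
    {"realgdp": [2710.349, 2778.801], "infl": [0.00, 2.34], "unemp": [5.8, 5.1]},
    index=pd.DatetimeIndex(["1959-01-01", "1959-04-01"], name="date"),
)
wide.columns.name = "item"

# Same chain as in the text: the unnamed values column is called 0 after
# reset_index, so it is renamed afterwards
long_data = wide.stack().reset_index().rename(columns={0: "value"})

# Possible shortcut: name the values column while resetting the index
alt = wide.stack().reset_index(name="value")

print(long_data)
print(list(alt.columns))
```

Each of the 2 x 3 cells of the wide table becomes one row in the long table, identified by its date and item.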
Data is frequently stored this way in relational SQL databases, as a fixed schema (column names and data types) allows the number of distinct values in the item column to change as data is added to the table. In the previous example, date and item would usually be the primary keys (in relational database parlance), offering both relational integrity and easier joins. In some cases, the data may be more difficult to work with in this format; you might prefer to have a DataFrame containing one column per distinct item value indexed by timestamps in the date column. DataFrame’s pivot method performs exactly this transformation:
In [156]: pivoted = long_data.pivot(index="date", columns="item",
   .....:                           values="value")
In [157]: pivoted.head()
Out[157]:
item        infl   realgdp  unemp
date
1959-01-01  0.00  2710.349    5.8
1959-04-01  2.34  2778.801    5.1
1959-07-01  2.74  2775.488    5.3
1959-10-01  0.27  2785.204    5.6
1960-01-01  2.31  2847.699    5.2
The first two values passed are the columns to be used, respectively, as the row and column index, then finally an optional value column to fill the DataFrame. Suppose you had two value columns that you wanted to reshape simultaneously:
In [159]: long_data["value2"] = np.random.standard_normal(len(long_data))
In [160]: long_data[:10]
Out[160]:
        date     item     value    value2
0 1959-01-01  realgdp  2710.349  0.802926
1 1959-01-01     infl     0.000  0.575721
2 1959-01-01    unemp     5.800  1.381918
3 1959-04-01  realgdp  2778.801  0.000992
4 1959-04-01     infl     2.340 -0.143492
5 1959-04-01    unemp     5.100 -0.206282
6 1959-07-01  realgdp  2775.488 -0.222392
7 1959-07-01     infl     2.740 -1.682403
8 1959-07-01    unemp     5.300  1.811659
9 1959-10-01  realgdp  2785.204 -0.351305
By omitting the last argument, you obtain a DataFrame with hierarchical columns:
In [161]: pivoted = long_data.pivot(index="date", columns="item")
In [162]: pivoted.head()
Out[162]:
           value                    value2
item        infl   realgdp unemp      infl   realgdp     unemp
date
1959-01-01  0.00  2710.349   5.8  0.575721  0.802926  1.381918
1959-04-01  2.34  2778.801   5.1 -0.143492  0.000992 -0.206282
1959-07-01  2.74  2775.488   5.3 -1.682403 -0.222392  1.811659
1959-10-01  0.27  2785.204   5.6  0.128317 -0.351305 -1.313554
1960-01-01  2.31  2847.699   5.2 -0.615939  0.498327  0.174072
In [163]: pivoted["value"].head()
Out[163]:
item        infl   realgdp  unemp
date
1959-01-01  0.00  2710.349    5.8
1959-04-01  2.34  2778.801    5.1
1959-07-01  2.74  2775.488    5.3
1959-10-01  0.27  2785.204    5.6
1960-01-01  2.31  2847.699    5.2
Note that pivot is equivalent to creating a hierarchical index using set_index followed by a call to unstack:
In [164]: unstacked = long_data.set_index(["date", "item"]).unstack(level="item")
In [165]: unstacked.head()
Out[165]:
           value                    value2
item        infl   realgdp unemp      infl   realgdp     unemp
date
1959-01-01  0.00  2710.349   5.8  0.575721  0.802926  1.381918
1959-04-01  2.34  2778.801   5.1 -0.143492  0.000992 -0.206282
1959-07-01  2.74  2775.488   5.3 -1.682403 -0.222392  1.811659
1959-10-01  0.27  2785.204   5.6  0.128317 -0.351305 -1.313554
1960-01-01  2.31  2847.699   5.2 -0.615939  0.498327  0.174072
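The equivalence can be checked directly on a small long-format table. In this sketch the value2 numbers are made up for illustration, and same and gdp are illustrative names.

```python
import pandas as pd

long_data = pd.DataFrame({
    "date": pd.to_datetime(["1959-01-01"] * 2 + ["1959-04-01"] * 2),
    "item": ["realgdp", "infl", "realgdp", "infl"],
    "value": [2710.349, 0.00, 2778.801, 2.34],
    "value2": [1.0, 2.0, 3.0, 4.0],  # made-up second measurement
})

# Omitting values pivots every remaining column, giving hierarchical columns
pivoted = long_data.pivot(index="date", columns="item")

# The same reshape expressed as set_index followed by unstack
unstacked = long_data.set_index(["date", "item"]).unstack(level="item")

same = pivoted.equals(unstacked)
print(same)

# A single column of the result is addressed with a (value column, item) pair
gdp = pivoted[("value", "realgdp")]
print(gdp)
```

The set_index and unstack spelling can be convenient when the hierarchical index is needed for other steps anyway; pivot is the more direct choice when it is not.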
Pivoting “Wide” to “Long” Format
An inverse operation to pivot for DataFrames is pandas.melt. Rather than transforming one column into many in a new DataFrame, it merges multiple columns into one, producing a DataFrame that is longer than the input. Let's look at an example:
In [167]: df = pd.DataFrame({"key": ["foo", "bar", "baz"],
   .....:                    "A": [1, 2, 3],
   .....:                    "B": [4, 5, 6],
   .....:                    "C": [7, 8, 9]})
In [168]: df
Out[168]:
   key  A  B  C
0  foo  1  4  7
1  bar  2  5  8
2  baz  3  6  9
The "key" column may be a group indicator, and the other columns are data values. When using pandas.melt, we must indicate which columns (if any) are group indicators. Let's use "key" as the only group indicator here:
In [169]: melted = pd.melt(df, id_vars="key")
In [170]: melted
Out[170]:
   key variable  value
0  foo        A      1
1  bar        A      2
2  baz        A      3
3  foo        B      4
4  bar        B      5
5  baz        B      6
6  foo        C      7
7  bar        C      8
8  baz        C      9
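The default column names "variable" and "value" can be replaced through the var_name and value_name arguments of pandas.melt. A short sketch follows; the labels "column" and "reading" are arbitrary choices for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"key": ["foo", "bar", "baz"],
                   "A": [1, 2, 3],
                   "B": [4, 5, 6],
                   "C": [7, 8, 9]})

# var_name labels the column holding the former column names,
# value_name labels the column holding the data values
melted = pd.melt(df, id_vars="key", var_name="column", value_name="reading")

print(melted.columns.tolist())
print(melted.shape)
```

Choosing descriptive names at this point can save a rename step later, for example before passing the long table to pivot.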
Using pivot, we can reshape back to the original layout:
In [171]: reshaped = melted.pivot(index="key", columns="variable",
   .....:                         values="value")
In [172]: reshaped
Out[172]:
variable  A  B  C
key
bar       2  5  8
baz       3  6  9
foo       1  4  7
Since the result of pivot creates an index from the column used as the row labels, we may want to use reset_index to move the data back into a column:
In [173]: reshaped.reset_index()
Out[173]:
variable  key  A  B  C
0         bar  2  5  8
1         baz  3  6  9
2         foo  1  4  7
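After the round trip, the columns index still carries the name "variable", as the output above shows. One way to drop it is rename_axis(columns=None). The sketch below does that and compares the result with the original data sorted by key; restored and expected are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({"key": ["foo", "bar", "baz"],
                   "A": [1, 2, 3],
                   "B": [4, 5, 6],
                   "C": [7, 8, 9]})

melted = pd.melt(df, id_vars="key")

restored = (
    melted.pivot(index="key", columns="variable", values="value")
    .reset_index()               # move "key" back into a column
    .rename_axis(columns=None)   # drop the leftover "variable" label
)

# pivot sorts the row labels, so compare against the original sorted by key
expected = df.sort_values("key").reset_index(drop=True)

print(restored)
print(restored.to_dict("list") == expected.to_dict("list"))
```

Note that the row order of the original is not recovered by the round trip: the pivoted rows come back sorted by key.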
You can also specify a subset of columns to use as value columns:
In [174]: pd.melt(df, id_vars="key", value_vars=["A", "B"])
Out[174]:
   key variable  value
0  foo        A      1
1  bar        A      2
2  baz        A      3
3  foo        B      4
4  bar        B      5
5  baz        B      6
pandas.melt can be used without any group identifiers, too:
In [175]: pd.melt(df, value_vars=["A", "B", "C"])
Out[175]:
  variable  value
0        A      1
1        A      2
2        A      3
3        B      4
4        B      5
5        B      6
6        C      7
7        C      8
8        C      9
In [176]: pd.melt(df, value_vars=["key", "A", "B"])
Out[176]:
  variable value
0      key   foo
1      key   bar
2      key   baz
3        A     1
4        A     2
5        A     3
6        B     4
7        B     5
8        B     6
8.4 Conclusion
Now that you have some pandas basics for data import, cleaning, and reorganization under your belt, we are ready to move on to data visualization with matplotlib. We will return to explore other areas of pandas later in the book when we discuss more advanced analytics.