11 Time Series
Time series data is an important form of structured data in many different fields, such as finance, economics, ecology, neuroscience, and physics. Anything that is recorded repeatedly at many points in time forms a time series. Many time series are fixed frequency, which is to say that data points occur at regular intervals according to some rule, such as every 15 seconds, every 5 minutes, or once per month. Time series can also be irregular without a fixed unit of time or offset between units. How you mark and refer to time series data depends on the application, and you may have one of the following:
Timestamps
Specific instants in time.

Fixed periods
Such as the whole month of January 2017, or the whole year 2020.

Intervals of time
Indicated by a start and end timestamp. Periods can be thought of as special cases of intervals.

Experiment or elapsed time
Each timestamp is a measure of time relative to a particular start time (e.g., the diameter of a cookie baking each second since being placed in the oven), starting from 0.
In this chapter, I am mainly concerned with time series in the first three categories, though many of the techniques can be applied to experimental time series where the index may be an integer or floating-point number indicating elapsed time from the start of the experiment. The simplest kind of time series is indexed by timestamp.
pandas also supports indexes based on timedeltas, which can be a useful way of representing experiment or elapsed time. We do not explore timedelta indexes in this book, but you can learn more in the pandas documentation.
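As a brief illustration of what such an index looks like (a minimal sketch, not developed further here), pandas.to_timedelta can build a TimedeltaIndex from elapsed-time values; the cookie-diameter measurements are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical elapsed-time measurements: cookie diameter sampled every 10 seconds
index = pd.to_timedelta(np.arange(4) * 10, unit="s")
diam = pd.Series([2.0, 2.6, 3.1, 3.4], index=index)
print(diam)
# The index dtype is timedelta64[ns] rather than datetime64[ns]
print(diam.index.dtype)
```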
pandas provides many built-in time series tools and algorithms. You can efficiently work with large time series, and slice and dice, aggregate, and resample irregular and fixed-frequency time series. Some of these tools are useful for financial and economics applications, but you could certainly use them to analyze server log data, too.
As with the rest of the chapters, we start by importing NumPy and pandas:
In [12]: import numpy as np

In [13]: import pandas as pd
11.1 Date and Time Data Types and Tools
The Python standard library includes data types for date and time data, as well as calendar-related functionality. The datetime, time, and calendar modules are the main places to start. The datetime.datetime type, or simply datetime, is widely used:
In [14]: from datetime import datetime

In [15]: now = datetime.now()

In [16]: now
Out[16]: datetime.datetime(2023, 4, 12, 13, 9, 16, 484533)

In [17]: now.year, now.month, now.day
Out[17]: (2023, 4, 12)
datetime stores both the date and time down to the microsecond. datetime.timedelta, or simply timedelta, represents the temporal difference between two datetime objects:
In [18]: delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)

In [19]: delta
Out[19]: datetime.timedelta(days=926, seconds=56700)

In [20]: delta.days
Out[20]: 926

In [21]: delta.seconds
Out[21]: 56700
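Note that days and seconds are components of a single duration, not independent totals. If you want the difference as one number, the standard library's total_seconds method does the conversion (a small sketch using the values above):

```python
from datetime import datetime

delta = datetime(2011, 1, 7) - datetime(2008, 6, 24, 8, 15)
# total_seconds combines the days and seconds components
total = delta.total_seconds()
print(total)  # 926 days * 86400 seconds/day + 56700 seconds
```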
You can add (or subtract) a timedelta or a multiple thereof to a datetime object to yield a new shifted object:
In [22]: from datetime import timedelta

In [23]: start = datetime(2011, 1, 7)

In [24]: start + timedelta(12)
Out[24]: datetime.datetime(2011, 1, 19, 0, 0)

In [25]: start - 2 * timedelta(12)
Out[25]: datetime.datetime(2010, 12, 14, 0, 0)
Table 11.1 summarizes the data types in the datetime module. While this chapter is mainly concerned with the data types in pandas and higher-level time series manipulation, you may encounter the datetime-based types in many other places in Python in the wild.
Type       Description
date       Store calendar date (year, month, day) using the Gregorian calendar
time       Store time of day as hours, minutes, seconds, and microseconds
datetime   Store both date and time
timedelta  The difference between two datetime values (as days, seconds, and microseconds)
tzinfo     Base type for storing time zone information
Converting Between String and Datetime
You can format datetime objects and pandas Timestamp objects, which I’ll introduce later, as strings using str or the strftime method, passing a format specification:
In [26]: stamp = datetime(2011, 1, 3)

In [27]: str(stamp)
Out[27]: '2011-01-03 00:00:00'

In [28]: stamp.strftime("%Y-%m-%d")
Out[28]: '2011-01-03'
See Table 11.2 for a complete list of the format codes.
Type  Description
%Y    Four-digit year
%y    Two-digit year
%m    Two-digit month [01, 12]
%d    Two-digit day [01, 31]
%H    Hour (24-hour clock) [00, 23]
%I    Hour (12-hour clock) [01, 12]
%M    Two-digit minute [00, 59]
%S    Second [00, 61] (seconds 60, 61 account for leap seconds)
%f    Microsecond as an integer, zero-padded (from 000000 to 999999)
%j    Day of the year as a zero-padded integer (from 001 to 366)
%w    Weekday as an integer [0 (Sunday), 6]
%u    Weekday as an integer starting from 1, where 1 is Monday
%U    Week number of the year [00, 53]; Sunday is considered the first day of the week, and days before the first Sunday of the year are “week 0”
%W    Week number of the year [00, 53]; Monday is considered the first day of the week, and days before the first Monday of the year are “week 0”
%z    UTC time zone offset as +HHMM or -HHMM; empty if time zone naive
%Z    Time zone name as a string, or empty string if no time zone
%F    Shortcut for %Y-%m-%d (e.g., 2012-4-18)
%D    Shortcut for %m/%d/%y (e.g., 04/18/12)
You can use many of the same format codes to convert strings to dates using datetime.strptime (but some codes, like %F, cannot be used):
In [29]: value = "2011-01-03"

In [30]: datetime.strptime(value, "%Y-%m-%d")
Out[30]: datetime.datetime(2011, 1, 3, 0, 0)

In [31]: datestrs = ["7/6/2011", "8/6/2011"]

In [32]: [datetime.strptime(x, "%m/%d/%Y") for x in datestrs]
Out[32]:
[datetime.datetime(2011, 7, 6, 0, 0),
 datetime.datetime(2011, 8, 6, 0, 0)]
datetime.strptime is one way to parse a date with a known format.
pandas is generally oriented toward working with arrays of dates, whether used as an axis index or a column in a DataFrame. The pandas.to_datetime method parses many different kinds of date representations. Standard date formats like ISO 8601 can be parsed quickly:
In [33]: datestrs = ["2011-07-06 12:00:00", "2011-08-06 00:00:00"]

In [34]: pd.to_datetime(datestrs)
Out[34]: DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00'], dtype='datetime64[ns]', freq=None)
It also handles values that should be considered missing (None, empty string, etc.):
In [35]: idx = pd.to_datetime(datestrs + [None])

In [36]: idx
Out[36]: DatetimeIndex(['2011-07-06 12:00:00', '2011-08-06 00:00:00', 'NaT'], dtype='datetime64[ns]', freq=None)

In [37]: idx[2]
Out[37]: NaT

In [38]: pd.isna(idx)
Out[38]: array([False, False,  True])
NaT (Not a Time) is pandas’s null value for timestamp data.
dateutil.parser is a useful but imperfect tool. Notably, it will recognize some strings as dates that you might prefer it didn’t; for example, "42" will be parsed as the year 2042 with today’s calendar date.
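To illustrate what dateutil.parser can do (a hedged sketch; python-dateutil ships as a pandas dependency, so it should already be installed), its parse function handles many common formats without a format specification:

```python
from datetime import datetime
from dateutil.parser import parse

print(parse("2011-01-03"))             # ISO-style date
print(parse("Jan 31, 1997 10:45 PM"))  # free-form text
# International locales often write the day before the month
print(parse("6/12/2011", dayfirst=True))
```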
datetime objects also have a number of locale-specific formatting options for systems in other countries or languages. For example, the abbreviated month names will be different on German or French systems compared with English systems. See Table 11.3 for a listing.
Type  Description
%a    Abbreviated weekday name
%A    Full weekday name
%b    Abbreviated month name
%B    Full month name
%c    Full date and time (e.g., ‘Tue 01 May 2012 04:20:57 PM’)
%p    Locale equivalent of AM or PM
%x    Locale-appropriate formatted date (e.g., in the United States, May 1, 2012 yields ‘05/01/2012’)
%X    Locale-appropriate time (e.g., ‘04:24:12 PM’)
11.2 Time Series Basics
A basic kind of time series object in pandas is a Series indexed by timestamps, which are often represented outside of pandas as Python strings or datetime objects:
In [39]: dates = [datetime(2011, 1, 2), datetime(2011, 1, 5),
   ....:          datetime(2011, 1, 7), datetime(2011, 1, 8),
   ....:          datetime(2011, 1, 10), datetime(2011, 1, 12)]

In [40]: ts = pd.Series(np.random.standard_normal(6), index=dates)

In [41]: ts
Out[41]:
2011-01-02   -0.204708
2011-01-05    0.478943
2011-01-07   -0.519439
2011-01-08   -0.555730
2011-01-10    1.965781
2011-01-12    1.393406
dtype: float64
Under the hood, these datetime objects have been put in a DatetimeIndex:
In [42]: ts.index
Out[42]:
DatetimeIndex(['2011-01-02', '2011-01-05', '2011-01-07', '2011-01-08',
               '2011-01-10', '2011-01-12'],
              dtype='datetime64[ns]', freq=None)
Like other Series, arithmetic operations between differently indexed time series automatically align on the dates:
In [43]: ts + ts[::2]
Out[43]:
2011-01-02   -0.409415
2011-01-05         NaN
2011-01-07   -1.038877
2011-01-08         NaN
2011-01-10    3.931561
2011-01-12         NaN
dtype: float64
Recall that ts[::2] selects every second element in ts.
pandas stores timestamps using NumPy’s datetime64 data type at the nanosecond resolution:
In [44]: ts.index.dtype
Out[44]: dtype('<M8[ns]')
Scalar values from a DatetimeIndex are pandas Timestamp objects:
In [45]: stamp = ts.index[0]

In [46]: stamp
Out[46]: Timestamp('2011-01-02 00:00:00')
A pandas.Timestamp can be substituted most places where you would use a datetime object. The reverse is not true, however, because pandas.Timestamp can store nanosecond precision data, while datetime stores only up to microseconds. Additionally, pandas.Timestamp can store frequency information (if any) and understands how to do time zone conversions and other kinds of manipulations. More on both of these things later in Time Zone Handling.
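For example (a small sketch of the precision difference), a Timestamp can carry sub-microsecond detail that a plain datetime cannot; converting to a datetime discards anything below microseconds:

```python
import pandas as pd

# Nine fractional digits: 123 nanoseconds past midnight
stamp = pd.Timestamp("2011-01-02 00:00:00.000000123")
print(stamp.nanosecond)       # nanosecond component beyond the microsecond field
print(stamp.to_pydatetime())  # datetime conversion drops the nanoseconds
```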
Indexing, Selection, Subsetting
Time series behaves like any other Series when you are indexing and selecting data based on the label:
In [47]: stamp = ts.index[2]

In [48]: ts[stamp]
Out[48]: -0.5194387150567381
As a convenience, you can also pass a string that is interpretable as a date:
In [49]: ts["2011-01-10"]
Out[49]: 1.9657805725027142
For longer time series, a year or only a year and month can be passed to easily select slices of data (pandas.date_range is discussed in more detail in Generating Date Ranges):
In [50]: longer_ts = pd.Series(np.random.standard_normal(1000),
   ....:                       index=pd.date_range("2000-01-01", periods=1000))

In [51]: longer_ts
Out[51]:
2000-01-01    0.092908
2000-01-02    0.281746
2000-01-03    0.769023
2000-01-04    1.246435
2000-01-05    1.007189
                ...
2002-09-22    0.930944
2002-09-23    0.811676
2002-09-24    1.830156
2002-09-25    0.138730
2002-09-26    0.334088
Freq: D, Length: 1000, dtype: float64
In [52]: longer_ts["2001"]
Out[52]:
2001-01-01    1.599534
2001-01-02    0.474071
2001-01-03    0.151326
2001-01-04    0.542173
2001-01-05    0.475496
                ...
2001-12-27    0.057874
2001-12-28    0.433739
2001-12-29    0.092698
2001-12-30    1.397820
2001-12-31    1.457823
Freq: D, Length: 365, dtype: float64
Here, the string "2001" is interpreted as a year and selects that time period. This also works if you specify the month:
In [53]: longer_ts["2001-05"]
Out[53]:
2001-05-01    0.622547
2001-05-02    0.936289
2001-05-03    0.750018
2001-05-04    0.056715
2001-05-05    2.300675
                ...
2001-05-27    0.235477
2001-05-28    0.111835
2001-05-29    1.251504
2001-05-30    2.949343
2001-05-31    0.634634
Freq: D, Length: 31, dtype: float64
Slicing with datetime objects works as well:
In [54]: ts[datetime(2011, 1, 7):]
Out[54]:
2011-01-07   -0.519439
2011-01-08   -0.555730
2011-01-10    1.965781
2011-01-12    1.393406
dtype: float64

In [55]: ts[datetime(2011, 1, 7):datetime(2011, 1, 10)]
Out[55]:
2011-01-07   -0.519439
2011-01-08   -0.555730
2011-01-10    1.965781
dtype: float64
Because most time series data is ordered chronologically, you can slice with timestamps not contained in a time series to perform a range query:
In [56]: ts
Out[56]:
2011-01-02   -0.204708
2011-01-05    0.478943
2011-01-07   -0.519439
2011-01-08   -0.555730
2011-01-10    1.965781
2011-01-12    1.393406
dtype: float64

In [57]: ts["2011-01-06":"2011-01-11"]
Out[57]:
2011-01-07   -0.519439
2011-01-08   -0.555730
2011-01-10    1.965781
dtype: float64
As before, you can pass a string date, datetime, or timestamp. Remember that slicing in this manner produces views on the source time series, like slicing NumPy arrays. This means that no data is copied, and modifications on the slice will be reflected in the original data.
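Whether a slice actually shares memory with its parent depends on your pandas version and the copy-on-write setting, so if you intend to modify a selection independently, the safest habit is an explicit copy (a defensive sketch with made-up data, not from the book):

```python
import numpy as np
import pandas as pd

ts = pd.Series(np.arange(6.0),
               index=pd.date_range("2011-01-02", periods=6))
# .copy() guarantees the selection is independent of ts
section = ts["2011-01-03":"2011-01-05"].copy()
section.iloc[0] = 999.0
print(ts["2011-01-03"])  # original value is untouched
```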
There is an equivalent instance method, truncate, that slices a Series between two dates:
In [58]: ts.truncate(after="2011-01-09")
Out[58]:
2011-01-02   -0.204708
2011-01-05    0.478943
2011-01-07   -0.519439
2011-01-08   -0.555730
dtype: float64
All of this holds true for DataFrame as well, indexing on its rows:
In [59]: dates = pd.date_range("2000-01-01", periods=100, freq="W-WED")

In [60]: long_df = pd.DataFrame(np.random.standard_normal((100, 4)),
   ....:                        index=dates,
   ....:                        columns=["Colorado", "Texas",
   ....:                                 "New York", "Ohio"])

In [61]: long_df.loc["2001-05"]
Out[61]:
            Colorado     Texas  New York      Ohio
2001-05-02  0.006045  0.490094  0.277186  0.707213
2001-05-09  0.560107  2.735527  0.927335  1.513906
2001-05-16  0.538600  1.273768  0.667876  0.969206
2001-05-23  1.676091  0.817649  0.050188  1.951312
2001-05-30  3.260383  0.963301  1.201206  1.852001
Time Series with Duplicate Indices
In some applications, there may be multiple data observations falling on a particular timestamp. Here is an example:
In [62]: dates = pd.DatetimeIndex(["2000-01-01", "2000-01-02", "2000-01-02",
   ....:                           "2000-01-02", "2000-01-03"])

In [63]: dup_ts = pd.Series(np.arange(5), index=dates)

In [64]: dup_ts
Out[64]:
2000-01-01    0
2000-01-02    1
2000-01-02    2
2000-01-02    3
2000-01-03    4
dtype: int64
We can tell that the index is not unique by checking its is_unique property:
In [65]: dup_ts.index.is_unique
Out[65]: False
Indexing into this time series will now either produce scalar values or slices, depending on whether a timestamp is duplicated:
In [66]: dup_ts["2000-01-03"]  # not duplicated
Out[66]: 4

In [67]: dup_ts["2000-01-02"]  # duplicated
Out[67]:
2000-01-02    1
2000-01-02    2
2000-01-02    3
dtype: int64
Suppose you wanted to aggregate the data having nonunique timestamps. One way to do this is to use groupby and pass level=0 (the one and only level):
In [68]: grouped = dup_ts.groupby(level=0)

In [69]: grouped.mean()
Out[69]:
2000-01-01    0.0
2000-01-02    2.0
2000-01-03    4.0
dtype: float64

In [70]: grouped.count()
Out[70]:
2000-01-01    1
2000-01-02    3
2000-01-03    1
dtype: int64
11.3 Date Ranges, Frequencies, and Shifting
Generic time series in pandas are assumed to be irregular; that is, they have no fixed frequency. For many applications this is sufficient. However, it’s often desirable to work relative to a fixed frequency, such as daily, monthly, or every 15 minutes, even if that means introducing missing values into a time series. Fortunately, pandas has a full suite of standard time series frequencies and tools for resampling (discussed in more detail later in Resampling and Frequency Conversion), inferring frequencies, and generating fixed-frequency date ranges. For example, you can convert the sample time series to fixed daily frequency by calling resample:
In [71]: ts
Out[71]:
2011-01-02   -0.204708
2011-01-05    0.478943
2011-01-07   -0.519439
2011-01-08   -0.555730
2011-01-10    1.965781
2011-01-12    1.393406
dtype: float64

In [72]: resampler = ts.resample("D")

In [73]: resampler
Out[73]: <pandas.core.resample.DatetimeIndexResampler object at 0x17b0e7bb0>
The string "D" is interpreted as daily frequency.
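A resampler by itself doesn't compute anything until you apply a method to it. For instance (a sketch of one option, using a small made-up series), asfreq materializes the fixed daily frequency, inserting missing values on dates absent from the original index:

```python
import pandas as pd

ts = pd.Series([1.0, 2.0],
               index=pd.to_datetime(["2011-01-02", "2011-01-05"]))
# asfreq fills out the regular daily grid with NaN on the missing days
daily = ts.resample("D").asfreq()
print(daily)
```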
Conversion between frequencies or resampling is a big enough topic to have its own section later (Resampling and Frequency Conversion). Here, I’ll show you how to use the base frequencies and multiples thereof.
Generating Date Ranges
While I used it previously without explanation, pandas.date_range is responsible for generating a DatetimeIndex with an indicated length according to a particular frequency:
In [74]: index = pd.date_range("2012-04-01", "2012-06-01")

In [75]: index
Out[75]:
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
               '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
               '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20',
               '2012-04-21', '2012-04-22', '2012-04-23', '2012-04-24',
               '2012-04-25', '2012-04-26', '2012-04-27', '2012-04-28',
               '2012-04-29', '2012-04-30', '2012-05-01', '2012-05-02',
               '2012-05-03', '2012-05-04', '2012-05-05', '2012-05-06',
               '2012-05-07', '2012-05-08', '2012-05-09', '2012-05-10',
               '2012-05-11', '2012-05-12', '2012-05-13', '2012-05-14',
               '2012-05-15', '2012-05-16', '2012-05-17', '2012-05-18',
               '2012-05-19', '2012-05-20', '2012-05-21', '2012-05-22',
               '2012-05-23', '2012-05-24', '2012-05-25', '2012-05-26',
               '2012-05-27', '2012-05-28', '2012-05-29', '2012-05-30',
               '2012-05-31', '2012-06-01'],
              dtype='datetime64[ns]', freq='D')
By default, pandas.date_range generates daily timestamps. If you pass only a start or end date, you must pass a number of periods to generate:
In [76]: pd.date_range(start="2012-04-01", periods=20)
Out[76]:
DatetimeIndex(['2012-04-01', '2012-04-02', '2012-04-03', '2012-04-04',
               '2012-04-05', '2012-04-06', '2012-04-07', '2012-04-08',
               '2012-04-09', '2012-04-10', '2012-04-11', '2012-04-12',
               '2012-04-13', '2012-04-14', '2012-04-15', '2012-04-16',
               '2012-04-17', '2012-04-18', '2012-04-19', '2012-04-20'],
              dtype='datetime64[ns]', freq='D')

In [77]: pd.date_range(end="2012-06-01", periods=20)
Out[77]:
DatetimeIndex(['2012-05-13', '2012-05-14', '2012-05-15', '2012-05-16',
               '2012-05-17', '2012-05-18', '2012-05-19', '2012-05-20',
               '2012-05-21', '2012-05-22', '2012-05-23', '2012-05-24',
               '2012-05-25', '2012-05-26', '2012-05-27', '2012-05-28',
               '2012-05-29', '2012-05-30', '2012-05-31', '2012-06-01'],
              dtype='datetime64[ns]', freq='D')
The start and end dates define strict boundaries for the generated date index. For example, if you wanted a date index containing the last business day of each month, you would pass the "BM" frequency (business end of month; see a more complete listing of frequencies in Table 11.4), and only dates falling on or inside the date interval will be included:
In [78]: pd.date_range("2000-01-01", "2000-12-01", freq="BM")
Out[78]:
DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31', '2000-04-28',
               '2000-05-31', '2000-06-30', '2000-07-31', '2000-08-31',
               '2000-09-29', '2000-10-31', '2000-11-30'],
              dtype='datetime64[ns]', freq='BM')
Alias                  Offset type           Description
D                      Day                   Calendar daily
B                      BusinessDay           Business daily
H                      Hour                  Hourly
T or min               Minute                Once a minute
S                      Second                Once a second
L or ms                Milli                 Millisecond (1/1,000 of 1 second)
U                      Micro                 Microsecond (1/1,000,000 of 1 second)
M                      MonthEnd              Last calendar day of month
BM                     BusinessMonthEnd      Last business day (weekday) of month
MS                     MonthBegin            First calendar day of month
BMS                    BusinessMonthBegin    First weekday of month
W-MON, W-TUE, ...      Week                  Weekly on given day of week (MON, TUE, WED, THU, FRI, SAT, or SUN)
WOM-1MON, WOM-2MON, ...  WeekOfMonth         Generate weekly dates in the first, second, third, or fourth week of the month (e.g., WOM-3FRI for the third Friday of each month)
Q-JAN, Q-FEB, ...      QuarterEnd            Quarterly dates anchored on last calendar day of each month, for year ending in indicated month (JAN, FEB, MAR, APR, MAY, JUN, JUL, AUG, SEP, OCT, NOV, or DEC)
BQ-JAN, BQ-FEB, ...    BusinessQuarterEnd    Quarterly dates anchored on last weekday of each month, for year ending in indicated month
QS-JAN, QS-FEB, ...    QuarterBegin          Quarterly dates anchored on first calendar day of each month, for year ending in indicated month
BQS-JAN, BQS-FEB, ...  BusinessQuarterBegin  Quarterly dates anchored on first weekday of each month, for year ending in indicated month
A-JAN, A-FEB, ...      YearEnd               Annual dates anchored on last calendar day of given month (JAN, FEB, MAR, APR, MAY, JUN, JUL, AUG, SEP, OCT, NOV, or DEC)
BA-JAN, BA-FEB, ...    BusinessYearEnd       Annual dates anchored on last weekday of given month
AS-JAN, AS-FEB, ...    YearBegin             Annual dates anchored on first day of given month
BAS-JAN, BAS-FEB, ...  BusinessYearBegin     Annual dates anchored on first weekday of given month
pandas.date_range by default preserves the time (if any) of the start or end timestamp:
In [79]: pd.date_range("2012-05-02 12:56:31", periods=5)
Out[79]:
DatetimeIndex(['2012-05-02 12:56:31', '2012-05-03 12:56:31',
               '2012-05-04 12:56:31', '2012-05-05 12:56:31',
               '2012-05-06 12:56:31'],
              dtype='datetime64[ns]', freq='D')
Sometimes you will have start or end dates with time information but want to generate a set of timestamps normalized to midnight as a convention. To do this, there is a normalize option:
In [80]: pd.date_range("2012-05-02 12:56:31", periods=5, normalize=True)
Out[80]:
DatetimeIndex(['2012-05-02', '2012-05-03', '2012-05-04', '2012-05-05',
               '2012-05-06'],
              dtype='datetime64[ns]', freq='D')
Frequencies and Date Offsets
Frequencies in pandas are composed of a base frequency and a multiplier. Base frequencies are typically referred to by a string alias, like "M" for monthly or "H" for hourly. For each base frequency, there is an object referred to as a date offset. For example, hourly frequency can be represented with the Hour class:
In [81]: from pandas.tseries.offsets import Hour, Minute

In [82]: hour = Hour()

In [83]: hour
Out[83]: <Hour>
You can define a multiple of an offset by passing an integer:
In [84]: four_hours = Hour(4)

In [85]: four_hours
Out[85]: <4 * Hours>
In most applications, you would never need to explicitly create one of these objects; instead you'd use a string alias like "H" or "4H". Putting an integer before the base frequency creates a multiple:
In [86]: pd.date_range("2000-01-01", "2000-01-03 23:59", freq="4H")
Out[86]:
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 04:00:00',
               '2000-01-01 08:00:00', '2000-01-01 12:00:00',
               '2000-01-01 16:00:00', '2000-01-01 20:00:00',
               '2000-01-02 00:00:00', '2000-01-02 04:00:00',
               '2000-01-02 08:00:00', '2000-01-02 12:00:00',
               '2000-01-02 16:00:00', '2000-01-02 20:00:00',
               '2000-01-03 00:00:00', '2000-01-03 04:00:00',
               '2000-01-03 08:00:00', '2000-01-03 12:00:00',
               '2000-01-03 16:00:00', '2000-01-03 20:00:00'],
              dtype='datetime64[ns]', freq='4H')
Many offsets can be combined by addition:
In [87]: Hour(2) + Minute(30)
Out[87]: <150 * Minutes>
Similarly, you can pass frequency strings, like "1h30min", that will effectively be parsed to the same expression:
In [88]: pd.date_range("2000-01-01", periods=10, freq="1h30min")
Out[88]:
DatetimeIndex(['2000-01-01 00:00:00', '2000-01-01 01:30:00',
               '2000-01-01 03:00:00', '2000-01-01 04:30:00',
               '2000-01-01 06:00:00', '2000-01-01 07:30:00',
               '2000-01-01 09:00:00', '2000-01-01 10:30:00',
               '2000-01-01 12:00:00', '2000-01-01 13:30:00'],
              dtype='datetime64[ns]', freq='90T')
Some frequencies describe points in time that are not evenly spaced. For example, "M" (calendar month end) and "BM" (last business/weekday of month) depend on the number of days in a month and, in the latter case, whether the month ends on a weekend or not. We refer to these as anchored offsets.
Refer to Table 11.4 for a listing of frequency codes and date offset classes available in pandas.
Users can define their own custom frequency classes to provide date logic not available in pandas, though the full details of that are outside the scope of this book.
Week of month dates
One useful frequency class is “week of month,” starting with WOM. This enables you to get dates like the third Friday of each month:
In [89]: monthly_dates = pd.date_range("2012-01-01", "2012-09-01", freq="WOM-3FRI")

In [90]: list(monthly_dates)
Out[90]:
[Timestamp('2012-01-20 00:00:00'),
 Timestamp('2012-02-17 00:00:00'),
 Timestamp('2012-03-16 00:00:00'),
 Timestamp('2012-04-20 00:00:00'),
 Timestamp('2012-05-18 00:00:00'),
 Timestamp('2012-06-15 00:00:00'),
 Timestamp('2012-07-20 00:00:00'),
 Timestamp('2012-08-17 00:00:00')]
Shifting (Leading and Lagging) Data
Shifting refers to moving data backward and forward through time. Both Series and DataFrame have a shift method for doing naive shifts forward or backward, leaving the index unmodified:
In [91]: ts = pd.Series(np.random.standard_normal(4),
   ....:                index=pd.date_range("2000-01-01", periods=4, freq="M"))

In [92]: ts
Out[92]:
2000-01-31    0.066748
2000-02-29    0.838639
2000-03-31    0.117388
2000-04-30    0.517795
Freq: M, dtype: float64

In [93]: ts.shift(2)
Out[93]:
2000-01-31         NaN
2000-02-29         NaN
2000-03-31    0.066748
2000-04-30    0.838639
Freq: M, dtype: float64

In [94]: ts.shift(-2)
Out[94]:
2000-01-31    0.117388
2000-02-29    0.517795
2000-03-31         NaN
2000-04-30         NaN
Freq: M, dtype: float64
When we shift like this, missing data is introduced either at the start or the end of the time series.
A common use of shift is computing consecutive percent changes in a time series or multiple time series as DataFrame columns. This is expressed as:
ts / ts.shift(1) - 1
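The same computation is available as the built-in pct_change method; a quick sketch with made-up prices confirming the equivalence:

```python
import pandas as pd

ts = pd.Series([100.0, 110.0, 99.0],
               index=pd.to_datetime(["2000-01-31", "2000-02-29", "2000-03-31"]))
# Manual consecutive percent change via a naive shift
manual = ts / ts.shift(1) - 1
print(manual)
# pct_change performs the same computation
print(ts.pct_change())
```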
Because naive shifts leave the index unmodified, some data is discarded. Thus if the frequency is known, it can be passed to shift to advance the timestamps instead of simply the data:
In [95]: ts.shift(2, freq="M")
Out[95]:
2000-03-31    0.066748
2000-04-30    0.838639
2000-05-31    0.117388
2000-06-30    0.517795
Freq: M, dtype: float64
Other frequencies can be passed, too, giving you some flexibility in how to lead and lag the data:
In [96]: ts.shift(3, freq="D")
Out[96]:
2000-02-03    0.066748
2000-03-03    0.838639
2000-04-03    0.117388
2000-05-03    0.517795
dtype: float64

In [97]: ts.shift(1, freq="90T")
Out[97]:
2000-01-31 01:30:00    0.066748
2000-02-29 01:30:00    0.838639
2000-03-31 01:30:00    0.117388
2000-04-30 01:30:00    0.517795
dtype: float64
The T here stands for minutes. Note that the freq parameter here indicates the offset to apply to the timestamps, but it does not change the underlying frequency of the data, if any.
Shifting dates with offsets
The pandas date offsets can also be used with datetime or Timestamp objects:
In [98]: from pandas.tseries.offsets import Day, MonthEnd

In [99]: now = datetime(2011, 11, 17)

In [100]: now + 3 * Day()
Out[100]: Timestamp('2011-11-20 00:00:00')
If you add an anchored offset like MonthEnd, the first increment will "roll forward" a date to the next date according to the frequency rule:
In [101]: now + MonthEnd()
Out[101]: Timestamp('2011-11-30 00:00:00')

In [102]: now + MonthEnd(2)
Out[102]: Timestamp('2011-12-31 00:00:00')
Anchored offsets can explicitly “roll” dates forward or backward by simply using their rollforward and rollback methods, respectively:
In [103]: offset = MonthEnd()

In [104]: offset.rollforward(now)
Out[104]: Timestamp('2011-11-30 00:00:00')

In [105]: offset.rollback(now)
Out[105]: Timestamp('2011-10-31 00:00:00')
A creative use of date offsets is to use these methods with groupby:
In [106]: ts = pd.Series(np.random.standard_normal(20),
   .....:                index=pd.date_range("2000-01-15", periods=20, freq="4D"))

In [107]: ts
Out[107]:
2000-01-15    0.116696
2000-01-19    2.389645
2000-01-23    0.932454
2000-01-27    0.229331
2000-01-31    1.140330
2000-02-04    0.439920
2000-02-08    0.823758
2000-02-12    0.520930
2000-02-16    0.350282
2000-02-20    0.204395
2000-02-24    0.133445
2000-02-28    0.327905
2000-03-03    0.072153
2000-03-07    0.131678
2000-03-11    1.297459
2000-03-15    0.997747
2000-03-19    0.870955
2000-03-23    0.991253
2000-03-27    0.151699
2000-03-31    1.266151
Freq: 4D, dtype: float64

In [108]: ts.groupby(MonthEnd().rollforward).mean()
Out[108]:
2000-01-31    0.005833
2000-02-29    0.015894
2000-03-31    0.150209
dtype: float64
Of course, an easier and faster way to do this is with resample (we'll discuss this in much more depth in Resampling and Frequency Conversion):
In [109]: ts.resample("M").mean()
Out[109]:
2000-01-31    0.005833
2000-02-29    0.015894
2000-03-31    0.150209
Freq: M, dtype: float64
11.4 Time Zone Handling
Working with time zones can be one of the most unpleasant parts of time series manipulation. As a result, many time series users choose to work with time series in coordinated universal time (UTC), which is the geography-independent international standard. Time zones are expressed as offsets from UTC; for example, New York is four hours behind UTC during daylight saving time (DST) and five hours behind the rest of the year.
In Python, time zone information comes from the third-party pytz library (installable with pip or conda), which exposes the Olson database, a compilation of world time zone information. This is especially important for historical data because the DST transition dates (and even UTC offsets) have been changed numerous times depending on the regional laws. In the United States, the DST transition times have been changed many times since 1900!
For detailed information about the pytz library, you’ll need to look at that library’s documentation. As far as this book is concerned, pandas wraps pytz’s functionality so you can ignore its API outside of the time zone names. Since pandas has a hard dependency on pytz, it isn't necessary to install it separately. Time zone names can be found interactively and in the docs:
In [110]: import pytz

In [111]: pytz.common_timezones[-5:]
Out[111]: ['US/Eastern', 'US/Hawaii', 'US/Mountain', 'US/Pacific', 'UTC']
To get a time zone object from pytz, use pytz.timezone:
In [112]: tz = pytz.timezone("America/New_York")

In [113]: tz
Out[113]: <DstTzInfo 'America/New_York' LMT-1 day, 19:04:00 STD>
Methods in pandas will accept either time zone names or these objects.
Time Zone Localization and Conversion
By default, time series in pandas are time zone naive. For example, consider the following time series:
In [114]: dates = pd.date_range("2012-03-09 09:30", periods=6)

In [115]: ts = pd.Series(np.random.standard_normal(len(dates)), index=dates)

In [116]: ts
Out[116]:
2012-03-09 09:30:00    0.202469
2012-03-10 09:30:00    0.050718
2012-03-11 09:30:00    0.639869
2012-03-12 09:30:00    0.597594
2012-03-13 09:30:00    0.797246
2012-03-14 09:30:00    0.472879
Freq: D, dtype: float64
The index’s tz field is None:
In [117]: print(ts.index.tz)
None
Date ranges can be generated with a time zone set:
In [118]: pd.date_range("2012-03-09 09:30", periods=10, tz="UTC")
Out[118]:
DatetimeIndex(['2012-03-09 09:30:00+00:00', '2012-03-10 09:30:00+00:00',
               '2012-03-11 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00',
               '2012-03-15 09:30:00+00:00', '2012-03-16 09:30:00+00:00',
               '2012-03-17 09:30:00+00:00', '2012-03-18 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='D')
Conversion from naive to localized (reinterpreted as having been observed in a particular time zone) is handled by the tz_localize method:
In [119]: ts
Out[119]:
2012-03-09 09:30:00    0.202469
2012-03-10 09:30:00    0.050718
2012-03-11 09:30:00    0.639869
2012-03-12 09:30:00    0.597594
2012-03-13 09:30:00    0.797246
2012-03-14 09:30:00    0.472879
Freq: D, dtype: float64

In [120]: ts_utc = ts.tz_localize("UTC")

In [121]: ts_utc
Out[121]:
2012-03-09 09:30:00+00:00    0.202469
2012-03-10 09:30:00+00:00    0.050718
2012-03-11 09:30:00+00:00    0.639869
2012-03-12 09:30:00+00:00    0.597594
2012-03-13 09:30:00+00:00    0.797246
2012-03-14 09:30:00+00:00    0.472879
Freq: D, dtype: float64

In [122]: ts_utc.index
Out[122]:
DatetimeIndex(['2012-03-09 09:30:00+00:00', '2012-03-10 09:30:00+00:00',
               '2012-03-11 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq='D')
Once a time series has been localized to a particular time zone, it can be converted to another time zone with tz_convert:
In [123]: ts_utc.tz_convert("America/New_York")
Out[123]:
2012-03-09 04:30:00-05:00    0.202469
2012-03-10 04:30:00-05:00    0.050718
2012-03-11 05:30:00-04:00    0.639869
2012-03-12 05:30:00-04:00    0.597594
2012-03-13 05:30:00-04:00    0.797246
2012-03-14 05:30:00-04:00    0.472879
Freq: D, dtype: float64
In the case of the preceding time series, which straddles a DST transition in the America/New_York time zone, we could localize to US Eastern time and convert to, say, UTC or Berlin time:
In [124]: ts_eastern = ts.tz_localize("America/New_York")

In [125]: ts_eastern.tz_convert("UTC")
Out[125]:
2012-03-09 14:30:00+00:00    0.202469
2012-03-10 14:30:00+00:00    0.050718
2012-03-11 13:30:00+00:00    0.639869
2012-03-12 13:30:00+00:00    0.597594
2012-03-13 13:30:00+00:00    0.797246
2012-03-14 13:30:00+00:00    0.472879
dtype: float64

In [126]: ts_eastern.tz_convert("Europe/Berlin")
Out[126]:
2012-03-09 15:30:00+01:00    0.202469
2012-03-10 15:30:00+01:00    0.050718
2012-03-11 14:30:00+01:00    0.639869
2012-03-12 14:30:00+01:00    0.597594
2012-03-13 14:30:00+01:00    0.797246
2012-03-14 14:30:00+01:00    0.472879
dtype: float64
tz_localize and tz_convert are also instance methods on DatetimeIndex:
In [127]: ts.index.tz_localize("Asia/Shanghai")
Out[127]:
DatetimeIndex(['2012-03-09 09:30:00+08:00', '2012-03-10 09:30:00+08:00',
               '2012-03-11 09:30:00+08:00', '2012-03-12 09:30:00+08:00',
               '2012-03-13 09:30:00+08:00', '2012-03-14 09:30:00+08:00'],
              dtype='datetime64[ns, Asia/Shanghai]', freq=None)
Localizing naive timestamps also checks for ambiguous or nonexistent times around daylight saving time transitions.
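For example (a sketch using the ambiguous and nonexistent options of tz_localize, which I am assuming are available in your pandas version), you can tell pandas how to resolve problem wall-clock times instead of raising an error:

```python
import pandas as pd

# 2012-03-11 02:30 does not exist in US/Eastern (clocks jump from 02:00 to 03:00)
stamp = pd.Timestamp("2012-03-11 02:30")
shifted = stamp.tz_localize("US/Eastern", nonexistent="shift_forward")
print(shifted)  # moved forward to the first valid instant

# Fall-back times occur twice; ambiguous= selects the DST or standard-time reading
dup = pd.Timestamp("2012-11-04 01:30")
print(dup.tz_localize("US/Eastern", ambiguous=True))   # EDT (UTC-4) reading
print(dup.tz_localize("US/Eastern", ambiguous=False))  # EST (UTC-5) reading
```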
Operations with Time Zone-Aware Timestamp Objects
Similar to time series and date ranges, individual Timestamp objects can be localized from naive to time zone-aware and converted from one time zone to another:
In [128]: stamp = pd.Timestamp("2011-03-12 04:00")

In [129]: stamp_utc = stamp.tz_localize("utc")

In [130]: stamp_utc.tz_convert("America/New_York")
Out[130]: Timestamp('2011-03-11 23:00:00-0500', tz='America/New_York')
You can also pass a time zone when creating the Timestamp:
In [131]: stamp_moscow = pd.Timestamp("2011-03-12 04:00", tz="Europe/Moscow")

In [132]: stamp_moscow
Out[132]: Timestamp('2011-03-12 04:00:00+0300', tz='Europe/Moscow')
Time zone-aware Timestamp objects internally store a UTC timestamp value as nanoseconds since the Unix epoch (January 1, 1970), so changing the time zone does not alter the internal UTC value:
In [133]: stamp_utc.value
Out[133]: 1299902400000000000

In [134]: stamp_utc.tz_convert("America/New_York").value
Out[134]: 1299902400000000000
When performing time arithmetic using pandas’s DateOffset objects, pandas respects daylight saving time transitions where possible. Here we construct timestamps that occur right before DST transitions (forward and backward). First, 30 minutes before transitioning to DST:
In [135]: stamp = pd.Timestamp("2012-03-11 01:30", tz="US/Eastern")

In [136]: stamp
Out[136]: Timestamp('2012-03-11 01:30:00-0500', tz='US/Eastern')

In [137]: stamp + Hour()
Out[137]: Timestamp('2012-03-11 03:30:00-0400', tz='US/Eastern')
Then, 90 minutes before transitioning out of DST:
In [138]: stamp = pd.Timestamp("2012-11-04 00:30", tz="US/Eastern")

In [139]: stamp
Out[139]: Timestamp('2012-11-04 00:30:00-0400', tz='US/Eastern')

In [140]: stamp + 2 * Hour()
Out[140]: Timestamp('2012-11-04 01:30:00-0500', tz='US/Eastern')
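One way to see what the offsets are doing in the spring-forward case: the wall clock jumps two hours, but only one absolute hour elapses. A small check:

```python
import pandas as pd
from pandas.tseries.offsets import Hour

# the wall clock moves from 1:30 to 3:30 across the spring-forward
# transition, but only one absolute hour actually passes
stamp = pd.Timestamp("2012-03-11 01:30", tz="US/Eastern")
later = stamp + Hour()
print(later - stamp)  # Timedelta('0 days 01:00:00')
```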
Operations Between Different Time Zones
If two time series with different time zones are combined, the result will be UTC. Since the timestamps are stored under the hood in UTC, this is a straightforward operation and requires no conversion:
In [141]: dates = pd.date_range("2012-03-07 09:30", periods=10, freq="B")

In [142]: ts = pd.Series(np.random.standard_normal(len(dates)), index=dates)

In [143]: ts
Out[143]: 
2012-03-07 09:30:00    0.522356
2012-03-08 09:30:00    0.546348
2012-03-09 09:30:00    0.733537
2012-03-12 09:30:00    1.302736
2012-03-13 09:30:00    0.022199
2012-03-14 09:30:00    0.364287
2012-03-15 09:30:00    0.922839
2012-03-16 09:30:00    0.312656
2012-03-19 09:30:00    1.128497
2012-03-20 09:30:00    0.333488
Freq: B, dtype: float64
In [144]: ts1 = ts[:7].tz_localize("Europe/London")

In [145]: ts2 = ts1[2:].tz_convert("Europe/Moscow")

In [146]: result = ts1 + ts2

In [147]: result.index
Out[147]: 
DatetimeIndex(['2012-03-07 09:30:00+00:00', '2012-03-08 09:30:00+00:00',
               '2012-03-09 09:30:00+00:00', '2012-03-12 09:30:00+00:00',
               '2012-03-13 09:30:00+00:00', '2012-03-14 09:30:00+00:00',
               '2012-03-15 09:30:00+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)
Operations between time zonenaive and time zoneaware data are not supported and will raise an exception.
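A quick sketch of that failure mode (the exact exception type is what current pandas raises; treat it as an assumption on older versions):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2012-03-07 09:30", periods=3)
naive = pd.Series(np.arange(3), index=idx)
aware = naive.tz_localize("UTC")

# aligning a tz-naive index with a tz-aware one is refused outright
try:
    naive + aware
    raised = False
except TypeError:
    raised = True
print(raised)  # True
```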
11.5 Periods and Period Arithmetic
Periods represent time spans, like days, months, quarters, or years. The pandas.Period class represents this data type, requiring a string or integer and a supported frequency from Table 11.4:
In [148]: p = pd.Period("2011", freq="A-DEC")

In [149]: p
Out[149]: Period('2011', 'A-DEC')
In this case, the Period object represents the full time span from January 1, 2011, to December 31, 2011, inclusive. Conveniently, adding and subtracting integers from periods has the effect of shifting by their frequency:
In [150]: p + 5
Out[150]: Period('2016', 'A-DEC')

In [151]: p - 2
Out[151]: Period('2009', 'A-DEC')
If two periods have the same frequency, their difference is the number of units between them as a date offset:
In [152]: pd.Period("2014", freq="A-DEC") - p
Out[152]: <3 * YearEnds: month=12>
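The offset that comes back behaves like a count; its n attribute holds the number of units between the two periods. A small illustration with monthly periods (chosen here for simplicity; any shared frequency works the same way):

```python
import pandas as pd

# the difference of two same-frequency periods is a DateOffset
p1 = pd.Period("2014-05", freq="M")
p2 = pd.Period("2014-01", freq="M")
diff = p1 - p2
print(diff.n)  # 4
```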
Regular ranges of periods can be constructed with the period_range function:
In [153]: periods = pd.period_range("2000-01-01", "2000-06-30", freq="M")

In [154]: periods
Out[154]: PeriodIndex(['2000-01', '2000-02', '2000-03', '2000-04', '2000-05', '2000-06'], dtype='period[M]')
The PeriodIndex class stores a sequence of periods and can serve as an axis index in any pandas data structure:
In [155]: pd.Series(np.random.standard_normal(6), index=periods)
Out[155]: 
2000-01    0.514551
2000-02    0.559782
2000-03    0.783408
2000-04    1.797685
2000-05    0.172670
2000-06    0.680215
Freq: M, dtype: float64
If you have an array of strings, you can also use the PeriodIndex class, where all of its values are periods:
In [156]: values = ["2001Q3", "2002Q2", "2003Q1"]

In [157]: index = pd.PeriodIndex(values, freq="Q-DEC")

In [158]: index
Out[158]: PeriodIndex(['2001Q3', '2002Q2', '2003Q1'], dtype='period[Q-DEC]')
Period Frequency Conversion
Periods and PeriodIndex objects can be converted to another frequency with their asfreq method. As an example, suppose we had an annual period and wanted to convert it into a monthly period either at the start or end of the year. This can be done like so:
In [159]: p = pd.Period("2011", freq="A-DEC")

In [160]: p
Out[160]: Period('2011', 'A-DEC')

In [161]: p.asfreq("M", how="start")
Out[161]: Period('2011-01', 'M')

In [162]: p.asfreq("M", how="end")
Out[162]: Period('2011-12', 'M')

In [163]: p.asfreq("M")
Out[163]: Period('2011-12', 'M')
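The same start/end logic applies at any scale; for instance, a monthly period converted to daily frequency (the period chosen here is arbitrary):

```python
import pandas as pd

# how="start" and how="end" pick the first and last subperiod
p = pd.Period("2014-05", freq="M")
print(p.asfreq("D", how="start"))  # Period('2014-05-01', 'D')
print(p.asfreq("D", how="end"))    # Period('2014-05-31', 'D')
```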
You can think of Period("2011", "A-DEC") as being a sort of cursor pointing to a span of time, subdivided by monthly periods. See Figure 11.1 for an illustration of this. For a fiscal year ending on a month other than December, the corresponding monthly subperiods are different:
In [164]: p = pd.Period("2011", freq="A-JUN")

In [165]: p
Out[165]: Period('2011', 'A-JUN')

In [166]: p.asfreq("M", how="start")
Out[166]: Period('2010-07', 'M')

In [167]: p.asfreq("M", how="end")
Out[167]: Period('2011-06', 'M')
When you are converting from high to low frequency, pandas determines the superperiod depending on where the subperiod “belongs.” For example, in A-JUN frequency, the month Aug-2011 is actually part of the 2012 period:
In [168]: p = pd.Period("Aug-2011", "M")

In [169]: p.asfreq("A-JUN")
Out[169]: Period('2012', 'A-JUN')
Whole PeriodIndex objects or time series can be similarly converted with the same semantics:
In [170]: periods = pd.period_range("2006", "2009", freq="A-DEC")

In [171]: ts = pd.Series(np.random.standard_normal(len(periods)), index=periods)

In [172]: ts
Out[172]: 
2006    1.607578
2007    0.200381
2008    0.834068
2009    0.302988
Freq: A-DEC, dtype: float64

In [173]: ts.asfreq("M", how="start")
Out[173]: 
2006-01    1.607578
2007-01    0.200381
2008-01    0.834068
2009-01    0.302988
Freq: M, dtype: float64
Here, the annual periods are replaced with monthly periods corresponding to the first month falling within each annual period. If we instead wanted the last business day of each year, we can use the "B" frequency and indicate that we want the end of the period:
In [174]: ts.asfreq("B", how="end")
Out[174]: 
2006-12-29    1.607578
2007-12-31    0.200381
2008-12-31    0.834068
2009-12-31    0.302988
Freq: B, dtype: float64
Quarterly Period Frequencies
Quarterly data is standard in accounting, finance, and other fields. Much quarterly data is reported relative to a fiscal year end, typically the last calendar or business day of one of the 12 months of the year. Thus, the period 2012Q4 has a different meaning depending on fiscal year end. pandas supports all 12 possible quarterly frequencies as Q-JAN through Q-DEC:
In [175]: p = pd.Period("2012Q4", freq="Q-JAN")

In [176]: p
Out[176]: Period('2012Q4', 'Q-JAN')
In the case of a fiscal year ending in January, 2012Q4 runs from November 2011 through January 2012, which you can check by converting to daily frequency:
In [177]: p.asfreq("D", how="start")
Out[177]: Period('2011-11-01', 'D')

In [178]: p.asfreq("D", how="end")
Out[178]: Period('2012-01-31', 'D')
See Figure 11.2 for an illustration.
Thus, it’s possible to do convenient period arithmetic; for example, to get the timestamp at 4 P.M. on the second-to-last business day of the quarter, you could do:
In [179]: p4pm = (p.asfreq("B", how="end") - 1).asfreq("T", how="start") + 16 * 60

In [180]: p4pm
Out[180]: Period('2012-01-30 16:00', 'T')

In [181]: p4pm.to_timestamp()
Out[181]: Timestamp('2012-01-30 16:00:00')
The to_timestamp method returns the Timestamp at the start of the period by default.
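A short illustration of both endpoints (the period used is arbitrary):

```python
import pandas as pd

p = pd.Period("2014-05", freq="M")
print(p.to_timestamp())           # start of the period by default
print(p.to_timestamp(how="end"))  # last instant of the period
```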
You can generate quarterly ranges using pandas.period_range. The arithmetic is identical, too:
In [182]: periods = pd.period_range("2011Q3", "2012Q4", freq="Q-JAN")

In [183]: ts = pd.Series(np.arange(len(periods)), index=periods)

In [184]: ts
Out[184]: 
2011Q3    0
2011Q4    1
2012Q1    2
2012Q2    3
2012Q3    4
2012Q4    5
Freq: Q-JAN, dtype: int64
In [185]: new_periods = (periods.asfreq("B", "end") - 1).asfreq("H", "start") + 16

In [186]: ts.index = new_periods.to_timestamp()

In [187]: ts
Out[187]: 
2010-10-28 16:00:00    0
2011-01-28 16:00:00    1
2011-04-28 16:00:00    2
2011-07-28 16:00:00    3
2011-10-28 16:00:00    4
2012-01-30 16:00:00    5
dtype: int64
Converting Timestamps to Periods (and Back)
Series and DataFrame objects indexed by timestamps can be converted to periods with the to_period method:
In [188]: dates = pd.date_range("2000-01-01", periods=3, freq="M")

In [189]: ts = pd.Series(np.random.standard_normal(3), index=dates)

In [190]: ts
Out[190]: 
2000-01-31    1.663261
2000-02-29    0.996206
2000-03-31    1.521760
Freq: M, dtype: float64

In [191]: pts = ts.to_period()

In [192]: pts
Out[192]: 
2000-01    1.663261
2000-02    0.996206
2000-03    1.521760
Freq: M, dtype: float64
Since periods refer to non-overlapping time spans, a timestamp can only belong to a single period for a given frequency. While the frequency of the new PeriodIndex is inferred from the timestamps by default, you can specify any supported frequency (most of those listed in Table 11.4 are supported). There is also no problem with having duplicate periods in the result:
In [193]: dates = pd.date_range("2000-01-29", periods=6)

In [194]: ts2 = pd.Series(np.random.standard_normal(6), index=dates)

In [195]: ts2
Out[195]: 
2000-01-29    0.244175
2000-01-30    0.423331
2000-01-31    0.654040
2000-02-01    2.089154
2000-02-02    0.060220
2000-02-03    0.167933
Freq: D, dtype: float64

In [196]: ts2.to_period("M")
Out[196]: 
2000-01    0.244175
2000-01    0.423331
2000-01    0.654040
2000-02    2.089154
2000-02    0.060220
2000-02    0.167933
Freq: M, dtype: float64
To convert back to timestamps, use the to_timestamp method, which returns a DatetimeIndex:
In [197]: pts = ts2.to_period()

In [198]: pts
Out[198]: 
2000-01-29    0.244175
2000-01-30    0.423331
2000-01-31    0.654040
2000-02-01    2.089154
2000-02-02    0.060220
2000-02-03    0.167933
Freq: D, dtype: float64

In [199]: pts.to_timestamp(how="end")
Out[199]: 
2000-01-29 23:59:59.999999999    0.244175
2000-01-30 23:59:59.999999999    0.423331
2000-01-31 23:59:59.999999999    0.654040
2000-02-01 23:59:59.999999999    2.089154
2000-02-02 23:59:59.999999999    0.060220
2000-02-03 23:59:59.999999999    0.167933
Freq: D, dtype: float64
Creating a PeriodIndex from Arrays
Fixed frequency datasets are sometimes stored with time span information spread across multiple columns. For example, in this macroeconomic dataset, the year and quarter are in different columns:
In [200]: data = pd.read_csv("examples/macrodata.csv")

In [201]: data.head(5)
Out[201]: 
   year  quarter   realgdp  realcons  realinv  realgovt  realdpi    cpi  \
0  1959        1  2710.349    1707.4  286.898   470.045   1886.9  28.98   
1  1959        2  2778.801    1733.7  310.859   481.301   1919.7  29.15   
2  1959        3  2775.488    1751.8  289.226   491.260   1916.4  29.35   
3  1959        4  2785.204    1753.7  299.356   484.052   1931.3  29.37   
4  1960        1  2847.699    1770.5  331.722   462.199   1955.5  29.54   
      m1  tbilrate  unemp      pop  infl  realint  
0  139.7      2.82    5.8  177.146  0.00     0.00  
1  141.7      3.08    5.1  177.830  2.34     0.74  
2  140.5      3.82    5.3  178.657  2.74     1.09  
3  140.0      4.33    5.6  179.386  0.27     4.06  
4  139.6      3.50    5.2  180.007  2.31     1.19  
In [202]: data["year"]
Out[202]: 
0      1959
1      1959
2      1959
3      1959
4      1960
       ... 
198    2008
199    2008
200    2009
201    2009
202    2009
Name: year, Length: 203, dtype: int64

In [203]: data["quarter"]
Out[203]: 
0      1
1      2
2      3
3      4
4      1
      ..
198    3
199    4
200    1
201    2
202    3
Name: quarter, Length: 203, dtype: int64
By passing these arrays to PeriodIndex with a frequency, you can combine them to form an index for the DataFrame:
In [204]: index = pd.PeriodIndex(year=data["year"], quarter=data["quarter"],
   .....:                        freq="Q-DEC")

In [205]: index
Out[205]: 
PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3', '1960Q4', '1961Q1', '1961Q2',
             ...
             '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
             '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
            dtype='period[Q-DEC]', length=203)
In [206]: data.index = index

In [207]: data["infl"]
Out[207]: 
1959Q1    0.00
1959Q2    2.34
1959Q3    2.74
1959Q4    0.27
1960Q1    2.31
          ... 
2008Q3    3.16
2008Q4    8.79
2009Q1    0.94
2009Q2    3.37
2009Q3    3.56
Freq: Q-DEC, Name: infl, Length: 203, dtype: float64
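The same construction works without the CSV file; here is a self-contained sketch with made-up year and quarter arrays (newer pandas versions also offer PeriodIndex.from_fields for this, and may emit a deprecation warning for the keyword form):

```python
import pandas as pd

# a small stand-in for the year/quarter columns of macrodata.csv
years = [2019, 2019, 2020, 2020]
quarters = [3, 4, 1, 2]
idx = pd.PeriodIndex(year=years, quarter=quarters, freq="Q-DEC")
print(idx[0])  # 2019Q3
```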
11.6 Resampling and Frequency Conversion
Resampling refers to the process of converting a time series from one frequency to another. Aggregating higher frequency data to lower frequency is called downsampling, while converting lower frequency to higher frequency is called upsampling. Not all resampling falls into either of these categories; for example, converting W-WED (weekly on Wednesday) to W-FRI is neither upsampling nor downsampling.
pandas objects are equipped with a resample method, which is the workhorse function for all frequency conversion. resample has a similar API to groupby; you call resample to group the data, then call an aggregation function:
In [208]: dates = pd.date_range("2000-01-01", periods=100)

In [209]: ts = pd.Series(np.random.standard_normal(len(dates)), index=dates)

In [210]: ts
Out[210]: 
2000-01-01    0.631634
2000-01-02    1.594313
2000-01-03    1.519937
2000-01-04    1.108752
2000-01-05    1.255853
                ...   
2000-04-05    0.423776
2000-04-06    0.789740
2000-04-07    0.937568
2000-04-08    2.253294
2000-04-09    1.772919
Freq: D, Length: 100, dtype: float64

In [211]: ts.resample("M").mean()
Out[211]: 
2000-01-31    0.165893
2000-02-29    0.078606
2000-03-31    0.223811
2000-04-30    0.063643
Freq: M, dtype: float64

In [212]: ts.resample("M", kind="period").mean()
Out[212]: 
2000-01    0.165893
2000-02    0.078606
2000-03    0.223811
2000-04    0.063643
Freq: M, dtype: float64
resample is a flexible method that can be used to process large time series. The examples in the following sections illustrate its semantics and use. Table 11.5 summarizes some of its options.
Argument      Description
rule          String, DateOffset, or timedelta indicating desired resampled frequency (for example, "M", "5min", or Second(15))
axis          Axis to resample on; default axis=0
fill_method   How to interpolate when upsampling, as in "ffill" or "bfill"; by default does no interpolation
closed        In downsampling, which end of each interval is closed (inclusive), "right" or "left"
label         In downsampling, how to label the aggregated result, with the "right" or "left" bin edge (e.g., the 9:30 to 9:35 five-minute interval could be labeled 9:30 or 9:35)
limit         When forward or backward filling, the maximum number of periods to fill
kind          Aggregate to periods ("period") or timestamps ("timestamp"); defaults to the type of index the time series has
convention    When resampling periods, the convention ("start" or "end") for converting the low-frequency period to high frequency; defaults to "start"
origin        The "base" timestamp from which to determine the resampling bin edges; can also be one of "epoch", "start", "start_day", "end", or "end_day"; see the resample docstring for full details
offset        An offset timedelta added to the origin; defaults to None
Downsampling
Downsampling is aggregating data to a regular, lower frequency. The data you’re aggregating doesn’t need to be fixed frequency; the desired frequency defines bin edges that are used to slice the time series into pieces to aggregate. For example, to convert to monthly, "M" or "BM", you need to chop up the data into one-month intervals. Each interval is said to be half-open; a data point can belong to only one interval, and the union of the intervals must make up the whole time frame. There are a couple of things to think about when using resample to downsample data:
Which side of each interval is closed
How to label each aggregated bin, either with the start of the interval or the end
To illustrate, let’s look at some one-minute frequency data:
In [213]: dates = pd.date_range("2000-01-01", periods=12, freq="T")

In [214]: ts = pd.Series(np.arange(len(dates)), index=dates)

In [215]: ts
Out[215]: 
2000-01-01 00:00:00     0
2000-01-01 00:01:00     1
2000-01-01 00:02:00     2
2000-01-01 00:03:00     3
2000-01-01 00:04:00     4
2000-01-01 00:05:00     5
2000-01-01 00:06:00     6
2000-01-01 00:07:00     7
2000-01-01 00:08:00     8
2000-01-01 00:09:00     9
2000-01-01 00:10:00    10
2000-01-01 00:11:00    11
Freq: T, dtype: int64
Suppose you wanted to aggregate this data into five-minute chunks or bars by taking the sum of each group:
In [216]: ts.resample("5min").sum()
Out[216]: 
2000-01-01 00:00:00    10
2000-01-01 00:05:00    35
2000-01-01 00:10:00    21
Freq: 5T, dtype: int64
The frequency you pass defines bin edges in five-minute increments. For this frequency, by default the left bin edge is inclusive, so the 00:00 value is included in the 00:00 to 00:05 interval, and the 00:05 value is excluded from that interval.^{1} Passing closed="right" makes the right bin edge inclusive instead:
In [217]: ts.resample("5min", closed="right").sum()
Out[217]: 
1999-12-31 23:55:00     0
2000-01-01 00:00:00    15
2000-01-01 00:05:00    40
2000-01-01 00:10:00    11
Freq: 5T, dtype: int64
The resulting time series is labeled by the timestamps from the left side of each bin. By passing label="right" you can label them with the right bin edge:
In [218]: ts.resample("5min", closed="right", label="right").sum()
Out[218]: 
2000-01-01 00:00:00     0
2000-01-01 00:05:00    15
2000-01-01 00:10:00    40
2000-01-01 00:15:00    11
Freq: 5T, dtype: int64
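Putting the closed behavior side by side in one self-contained sketch:

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2000-01-01", periods=12, freq="min")
ts = pd.Series(np.arange(12), index=idx)

left = ts.resample("5min").sum()                   # left edge closed (default)
right = ts.resample("5min", closed="right").sum()  # right edge closed

# with the left edge closed, 00:00 through 00:04 land in the first bin
print(left.iloc[0])   # 10
# with the right edge closed, the first bin holds only the 00:00 value
print(right.iloc[0])  # 0
```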
See Figure 11.3 for an illustration of minute frequency data being resampled to five-minute frequency.
Lastly, you might want to shift the result index by some amount, say subtracting one second from the right edge to make it more clear which interval the timestamp refers to. To do this, add an offset to the resulting index:
In [219]: from pandas.tseries.frequencies import to_offset

In [220]: result = ts.resample("5min", closed="right", label="right").sum()

In [221]: result.index = result.index + to_offset("-1s")

In [222]: result
Out[222]: 
1999-12-31 23:59:59     0
2000-01-01 00:04:59    15
2000-01-01 00:09:59    40
2000-01-01 00:14:59    11
Freq: 5T, dtype: int64
Open-high-low-close (OHLC) resampling
In finance, a popular way to aggregate a time series is to compute four values for each bucket: the first (open), last (close), maximum (high), and minimum (low) values. By using the ohlc aggregate function, you will obtain a DataFrame whose columns contain these four aggregates, which are efficiently computed in a single function call:
In [223]: ts = pd.Series(np.random.permutation(np.arange(len(dates))), index=dates)

In [224]: ts.resample("5min").ohlc()
Out[224]: 
                     open  high  low  close
2000-01-01 00:00:00     8     8    1      5
2000-01-01 00:05:00     6    11    2      2
2000-01-01 00:10:00     0     7    0      7
Upsampling and Interpolation
Upsampling is converting from a lower frequency to a higher frequency, where no aggregation is needed. Let’s consider a DataFrame with some weekly data:
In [225]: frame = pd.DataFrame(np.random.standard_normal((2, 4)),
   .....:                      index=pd.date_range("2000-01-01", periods=2,
   .....:                                          freq="W-WED"),
   .....:                      columns=["Colorado", "Texas", "New York", "Ohio"])

In [226]: frame
Out[226]: 
            Colorado     Texas  New York      Ohio
2000-01-05  0.896431  0.927238  0.482284  0.867130
2000-01-12  0.493841  0.155434  1.397286  1.507055
When you are using an aggregation function with this data, there is only one value per group, and missing values result in the gaps. We use the asfreq method to convert to the higher frequency without any aggregation:
In [227]: df_daily = frame.resample("D").asfreq()

In [228]: df_daily
Out[228]: 
            Colorado     Texas  New York      Ohio
2000-01-05  0.896431  0.927238  0.482284  0.867130
2000-01-06       NaN       NaN       NaN       NaN
2000-01-07       NaN       NaN       NaN       NaN
2000-01-08       NaN       NaN       NaN       NaN
2000-01-09       NaN       NaN       NaN       NaN
2000-01-10       NaN       NaN       NaN       NaN
2000-01-11       NaN       NaN       NaN       NaN
2000-01-12  0.493841  0.155434  1.397286  1.507055
Suppose you wanted to fill forward each weekly value on the non-Wednesdays. The same filling or interpolation methods available in the fillna and reindex methods are available for resampling:
In [229]: frame.resample("D").ffill()
Out[229]: 
            Colorado     Texas  New York      Ohio
2000-01-05  0.896431  0.927238  0.482284  0.867130
2000-01-06  0.896431  0.927238  0.482284  0.867130
2000-01-07  0.896431  0.927238  0.482284  0.867130
2000-01-08  0.896431  0.927238  0.482284  0.867130
2000-01-09  0.896431  0.927238  0.482284  0.867130
2000-01-10  0.896431  0.927238  0.482284  0.867130
2000-01-11  0.896431  0.927238  0.482284  0.867130
2000-01-12  0.493841  0.155434  1.397286  1.507055
You can similarly choose to only fill a certain number of periods forward to limit how far to continue using an observed value:
In [230]: frame.resample("D").ffill(limit=2)
Out[230]: 
            Colorado     Texas  New York      Ohio
2000-01-05  0.896431  0.927238  0.482284  0.867130
2000-01-06  0.896431  0.927238  0.482284  0.867130
2000-01-07  0.896431  0.927238  0.482284  0.867130
2000-01-08       NaN       NaN       NaN       NaN
2000-01-09       NaN       NaN       NaN       NaN
2000-01-10       NaN       NaN       NaN       NaN
2000-01-11       NaN       NaN       NaN       NaN
2000-01-12  0.493841  0.155434  1.397286  1.507055
Notably, the new date index need not coincide with the old one at all:
In [231]: frame.resample("W-THU").ffill()
Out[231]: 
            Colorado     Texas  New York      Ohio
2000-01-06  0.896431  0.927238  0.482284  0.867130
2000-01-13  0.493841  0.155434  1.397286  1.507055
Resampling with Periods
Resampling data indexed by periods is similar to timestamps:
In [232]: frame = pd.DataFrame(np.random.standard_normal((24, 4)),
   .....:                      index=pd.period_range("1-2000", "12-2001",
   .....:                                            freq="M"),
   .....:                      columns=["Colorado", "Texas", "New York", "Ohio"])

In [233]: frame.head()
Out[233]: 
         Colorado     Texas  New York      Ohio
2000-01  1.179442  0.443171  1.395676  0.529658
2000-02  0.787358  0.248845  0.743239  1.267746
2000-03  1.302395  0.272154  0.051532  0.467740
2000-04  1.040816  0.426419  0.312945  1.115689
2000-05  1.234297  1.893094  1.661605  0.005477

In [234]: annual_frame = frame.resample("A-DEC").mean()

In [235]: annual_frame
Out[235]: 
      Colorado     Texas  New York      Ohio
2000  0.487329  0.104466  0.020495  0.273945
2001  0.203125  0.162429  0.056146  0.103794
Upsampling is more nuanced, as before resampling you must make a decision about which end of the time span in the new frequency to place the values. The convention argument defaults to "start" but can also be "end":
# Q-DEC: Quarterly, year ending in December
In [236]: annual_frame.resample("Q-DEC").ffill()
Out[236]: 
        Colorado     Texas  New York      Ohio
2000Q1  0.487329  0.104466  0.020495  0.273945
2000Q2  0.487329  0.104466  0.020495  0.273945
2000Q3  0.487329  0.104466  0.020495  0.273945
2000Q4  0.487329  0.104466  0.020495  0.273945
2001Q1  0.203125  0.162429  0.056146  0.103794
2001Q2  0.203125  0.162429  0.056146  0.103794
2001Q3  0.203125  0.162429  0.056146  0.103794
2001Q4  0.203125  0.162429  0.056146  0.103794

In [237]: annual_frame.resample("Q-DEC", convention="end").asfreq()
Out[237]: 
        Colorado     Texas  New York      Ohio
2000Q4  0.487329  0.104466  0.020495  0.273945
2001Q1       NaN       NaN       NaN       NaN
2001Q2       NaN       NaN       NaN       NaN
2001Q3       NaN       NaN       NaN       NaN
2001Q4  0.203125  0.162429  0.056146  0.103794
Since periods refer to time spans, the rules about upsampling and downsampling are more rigid:
In downsampling, the target frequency must be a subperiod of the source frequency.
In upsampling, the target frequency must be a superperiod of the source frequency.
If these rules are not satisfied, an exception will be raised. This mainly affects the quarterly, annual, and weekly frequencies; for example, the time spans defined by Q-MAR only line up with A-MAR, A-JUN, A-SEP, and A-DEC:
In [238]: annual_frame.resample("Q-MAR").ffill()
Out[238]: 
        Colorado     Texas  New York      Ohio
2000Q4  0.487329  0.104466  0.020495  0.273945
2001Q1  0.487329  0.104466  0.020495  0.273945
2001Q2  0.487329  0.104466  0.020495  0.273945
2001Q3  0.487329  0.104466  0.020495  0.273945
2001Q4  0.203125  0.162429  0.056146  0.103794
2002Q1  0.203125  0.162429  0.056146  0.103794
2002Q2  0.203125  0.162429  0.056146  0.103794
2002Q3  0.203125  0.162429  0.056146  0.103794
Grouped Time Resampling
For time series data, the resample method is semantically a group operation based on a time intervalization. Here's a small example table:
In [239]: N = 15

In [240]: times = pd.date_range("2017-05-20 00:00", freq="1min", periods=N)

In [241]: df = pd.DataFrame({"time": times,
   .....:                    "value": np.arange(N)})

In [242]: df
Out[242]: 
                  time  value
0  2017-05-20 00:00:00      0
1  2017-05-20 00:01:00      1
2  2017-05-20 00:02:00      2
3  2017-05-20 00:03:00      3
4  2017-05-20 00:04:00      4
5  2017-05-20 00:05:00      5
6  2017-05-20 00:06:00      6
7  2017-05-20 00:07:00      7
8  2017-05-20 00:08:00      8
9  2017-05-20 00:09:00      9
10 2017-05-20 00:10:00     10
11 2017-05-20 00:11:00     11
12 2017-05-20 00:12:00     12
13 2017-05-20 00:13:00     13
14 2017-05-20 00:14:00     14
Here, we can index by "time" and then resample:
In [243]: df.set_index("time").resample("5min").count()
Out[243]: 
                     value
time                      
2017-05-20 00:00:00      5
2017-05-20 00:05:00      5
2017-05-20 00:10:00      5
Suppose that a DataFrame contains multiple time series, marked by an additional group key column:
In [244]: df2 = pd.DataFrame({"time": times.repeat(3),
   .....:                     "key": np.tile(["a", "b", "c"], N),
   .....:                     "value": np.arange(N * 3.)})

In [245]: df2.head(7)
Out[245]: 
                 time key  value
0 2017-05-20 00:00:00   a    0.0
1 2017-05-20 00:00:00   b    1.0
2 2017-05-20 00:00:00   c    2.0
3 2017-05-20 00:01:00   a    3.0
4 2017-05-20 00:01:00   b    4.0
5 2017-05-20 00:01:00   c    5.0
6 2017-05-20 00:02:00   a    6.0
To do the same resampling for each value of "key", we introduce the pandas.Grouper object:
In [246]: time_key = pd.Grouper(freq="5min")
We can then set the time index, group by "key" and time_key, and aggregate:
In [247]: resampled = (df2.set_index("time")
   .....:              .groupby(["key", time_key])
   .....:              .sum())

In [248]: resampled
Out[248]: 
                         value
key time                      
a   2017-05-20 00:00:00   30.0
    2017-05-20 00:05:00  105.0
    2017-05-20 00:10:00  180.0
b   2017-05-20 00:00:00   35.0
    2017-05-20 00:05:00  110.0
    2017-05-20 00:10:00  185.0
c   2017-05-20 00:00:00   40.0
    2017-05-20 00:05:00  115.0
    2017-05-20 00:10:00  190.0
In [249]: resampled.reset_index()
Out[249]: 
  key                time  value
0   a 2017-05-20 00:00:00   30.0
1   a 2017-05-20 00:05:00  105.0
2   a 2017-05-20 00:10:00  180.0
3   b 2017-05-20 00:00:00   35.0
4   b 2017-05-20 00:05:00  110.0
5   b 2017-05-20 00:10:00  185.0
6   c 2017-05-20 00:00:00   40.0
7   c 2017-05-20 00:05:00  115.0
8   c 2017-05-20 00:10:00  190.0
One constraint with using pandas.Grouper is that the time must be the index of the Series or DataFrame.
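That said, in recent pandas versions pd.Grouper can also point at a column by name through its key argument, which avoids setting the index first; a small sketch (the column names are illustrative):

```python
import numpy as np
import pandas as pd

times = pd.date_range("2017-05-20 00:00", freq="min", periods=6)
df = pd.DataFrame({"time": times,
                   "key": ["a", "b"] * 3,
                   "value": np.arange(6.0)})

# key= points the Grouper at the "time" column instead of the index
out = df.groupby(["key", pd.Grouper(key="time", freq="5min")]).sum()
print(out.loc[("a", pd.Timestamp("2017-05-20 00:00")), "value"])  # 6.0
```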
11.7 Moving Window Functions
An important class of array transformations used for time series operations is statistics and other functions evaluated over a sliding window or with exponentially decaying weights. This can be useful for smoothing noisy or gappy data. I call these moving window functions, even though they include functions without a fixed-length window, like the exponentially weighted moving average. Like other statistical functions, these also automatically exclude missing data.
Before digging in, we can load up some time series data and resample it to business day frequency:
In [250]: close_px_all = pd.read_csv("examples/stock_px.csv",
   .....:                            parse_dates=True, index_col=0)

In [251]: close_px = close_px_all[["AAPL", "MSFT", "XOM"]]

In [252]: close_px = close_px.resample("B").ffill()
I now introduce the rolling operator, which behaves similarly to resample and groupby. It can be called on a Series or DataFrame along with a window (expressed as a number of periods; see Apple price with 250-day moving average for the plot created):
In [253]: close_px["AAPL"].plot()
Out[253]: <Axes: >

In [254]: close_px["AAPL"].rolling(250).mean().plot()
The expression rolling(250) is similar in behavior to groupby, but instead of grouping, it creates an object that enables grouping over a 250-day sliding window. So here we have the 250-day moving window average of Apple's stock price.
By default, rolling functions require all of the values in the window to be non-NA. This behavior can be changed to account for missing data and, in particular, the fact that you will have fewer than window periods of data at the beginning of the time series (see Apple 250-day daily return standard deviation):
In [255]: plt.figure()
Out[255]: <Figure size 1000x600 with 0 Axes>

In [256]: std250 = close_px["AAPL"].pct_change().rolling(250, min_periods=10).std()

In [257]: std250[5:12]
Out[257]: 
2003-01-09         NaN
2003-01-10         NaN
2003-01-13         NaN
2003-01-14         NaN
2003-01-15         NaN
2003-01-16    0.009628
2003-01-17    0.013818
Freq: B, Name: AAPL, dtype: float64

In [258]: std250.plot()
To compute an expanding window mean, use the expanding operator instead of rolling. The expanding mean starts the time window from the same point as the rolling window and increases the size of the window until it encompasses the whole series. An expanding window mean on the std250 time series looks like this:
In [259]: expanding_mean = std250.expanding().mean()
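On a tiny series, the expanding mean is simply the cumulative mean:

```python
import pandas as pd

# each entry averages all observations up to and including that point
s = pd.Series([1.0, 2.0, 3.0, 4.0])
print(s.expanding().mean().tolist())  # [1.0, 1.5, 2.0, 2.5]
```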
Calling a moving window function on a DataFrame applies the transformation to each column (see Stock prices 60-day moving average (log y-axis)):
In [261]: plt.style.use('grayscale')

In [262]: close_px.rolling(60).mean().plot(logy=True)
The rolling function also accepts a string indicating a fixed-size time offset rather than a set number of periods. Using this notation can be useful for irregular time series. These are the same strings that you can pass to resample. For example, we could compute a 20-day rolling mean like so:
In [263]: close_px.rolling("20D").mean()
Out[263]: 
                  AAPL       MSFT        XOM
2003-01-02    7.400000  21.110000  29.220000
2003-01-03    7.425000  21.125000  29.230000
2003-01-06    7.433333  21.256667  29.473333
2003-01-07    7.432500  21.425000  29.342500
2003-01-08    7.402000  21.402000  29.240000
...                ...        ...        ...
2011-10-10  389.351429  25.602143  72.527857
2011-10-11  388.505000  25.674286  72.835000
2011-10-12  388.531429  25.810000  73.400714
2011-10-13  388.826429  25.961429  73.905000
2011-10-14  391.038000  26.048667  74.185333
[2292 rows x 3 columns]
Exponentially Weighted Functions
An alternative to using a fixed window size with equally weighted observations is to specify a constant decay factor to give more weight to more recent observations. There are a couple of ways to specify the decay factor. A popular one is using a span, which makes the result comparable to a simple moving window function with window size equal to the span.
Since an exponentially weighted statistic places more weight on more recent observations, it “adapts” faster to changes compared with the equalweighted version.
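The relationship is alpha = 2 / (span + 1), so a span and its corresponding decay factor produce identical results; a quick check:

```python
import pandas as pd

# span and the decay factor alpha are two ways to specify the same thing
s = pd.Series([1.0, 2.0, 3.0, 4.0])
span = 3
by_span = s.ewm(span=span).mean()
by_alpha = s.ewm(alpha=2 / (span + 1)).mean()
print(by_span.equals(by_alpha))  # True
```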
pandas has the ewm operator (which stands for exponentially weighted moving) to go along with rolling and expanding. Here’s an example comparing a 30-day moving average of Apple’s stock price with an exponentially weighted (EW) moving average with span=30 (see Simple moving average versus exponentially weighted):
In [265]: aapl_px = close_px["AAPL"]["2006":"2007"]

In [266]: ma30 = aapl_px.rolling(30, min_periods=20).mean()

In [267]: ewma30 = aapl_px.ewm(span=30).mean()

In [268]: aapl_px.plot(style="k", label="Price")
Out[268]: <Axes: >

In [269]: ma30.plot(style="k", label="Simple Moving Avg")
Out[269]: <Axes: >

In [270]: ewma30.plot(style="k", label="EW MA")
Out[270]: <Axes: >

In [271]: plt.legend()
Binary Moving Window Functions
Some statistical operators, like correlation and covariance, need to operate on two time series. As an example, financial analysts are often interested in a stock’s correlation to a benchmark index like the S&P 500. To have a look at this, we first compute the percent change for all of our time series of interest:
In [273]: spx_px = close_px_all["SPX"]

In [274]: spx_rets = spx_px.pct_change()

In [275]: returns = close_px.pct_change()
After we call rolling, the corr aggregation function can then compute the rolling correlation with spx_rets (see Six-month AAPL return correlation to S&P 500 for the resulting plot):
In [276]: corr = returns["AAPL"].rolling(125, min_periods=100).corr(spx_rets)

In [277]: corr.plot()
Suppose you wanted to compute the rolling correlation of the S&P 500 index with many stocks at once. You could write a loop computing this for each stock like we did for Apple above, but if each stock is a column in a single DataFrame, we can compute all of the rolling correlations in one shot by calling rolling on the DataFrame and passing the spx_rets Series.
See Six-month return correlations to S&P 500 for the plot of the result:
In [279]: corr = returns.rolling(125, min_periods=100).corr(spx_rets)

In [280]: corr.plot()
User-Defined Moving Window Functions
The apply method on rolling and related methods provides a way to apply an array function of your own creation over a moving window. The only requirement is that the function produce a single value (a reduction) from each piece of the array. For example, while we can compute sample quantiles using rolling(...).quantile(q), we might be interested in the percentile rank of a particular value over the sample. The scipy.stats.percentileofscore function does just this (see Percentile rank of 2% AAPL return over one-year window for the resulting plot):
In [282]: from scipy.stats import percentileofscore

In [283]: def score_at_2percent(x):
   .....:     return percentileofscore(x, 0.02)

In [284]: result = returns["AAPL"].rolling(250).apply(score_at_2percent)

In [285]: result.plot()
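Any single-value reduction can be plugged in the same way; here is a self-contained sketch that needs no SciPy, computing the spread of each window (the function name is ours):

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10.0))

# a reduction that maps each 4-observation window to one number:
# the spread (max minus min) within the window
def window_range(x):
    return x.max() - x.min()

print(s.rolling(4).apply(window_range).iloc[-1])  # 3.0
```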
If you don't have SciPy installed already, you can install it with conda or pip:
conda install scipy
11.8 Conclusion
Time series data calls for different types of analysis and data transformation tools than the other types of data we have explored in previous chapters.
In the following chapter, we will show how to start using modeling libraries like statsmodels and scikitlearn.
The choice of the default values for closed and label might seem a bit odd to some users. The default is closed="left" for all but a specific set ("M", "A", "Q", "BM", "BQ", and "W") for which the default is closed="right". The defaults were chosen to make the results more intuitive, but it is worth knowing that the default is not always one or the other.↩︎