A O(n log n) NA-friendly time series “as of” using array operations

In time series data, it’s fairly common to need to compute the last known value “as of” a particular date. However, missing data is the norm, so it’s a touch more complicated than doing a simple binary search. Here is an implementation using array operations that takes these things into account:

Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
def asof_locs(stamps, where, mask):
"""
Parameters
----------
stamps : array of timestamps
where : array of timestamps
Values to determine the "as of" for
mask : array of booleans where data is not NA
Returns
-------
locs : array of ints
Locations of each "as of" value, -1 for NA
"""
locs = stamps[mask].searchsorted(where, side='right')
locs = np.where(locs > 0, locs - 1, -1)
result = np.arange(len(stamps))[mask].take(locs)
first = mask.argmax()
result[(locs == 0) & (where < stamps[first])] = -1
return result

Some algorithmic notes. First, let’s run this through line_profiler on a large time series to see where time is being taken up:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
In [9]: rng = date_range('1/1/2000', '6/1/2000', freq='10s')
In [10]: s = Series(np.random.randn(len(rng)), index=rng)
In [11]: rng
Out[11]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2000-01-01 00:00:00, ..., 2000-06-01 00:00:00]
Length: 1313281, Freq: 10S, Timezone: None
In [12]: s = s.resample('s')
In [13]: len(s)
Out[13]: 13132801
In [24]: mask = notnull(s)
In [25]: lprun -f asof_locs asof_locs(s.index, s.index[5::5], mask)
Timer unit: 1e-06 s
File: foo.py
Function: asof_locs at line 3
Total time: 0.793543 s
Line # % Time Line Contents
==============================================================
3 def asof_locs(stamps, where, mask):
4 """
5 Parameters
6 ----------
7 stamps : array of timestamps
8 where : array of timestamps
9 Values to determine the "as of" for
10 mask : array of booleans where data is NA
11
12 Returns
13 -------
14 locs : array of ints
15 Locations of each "as of" value, -1 for NA
16 """
17 63.3 locs = stamps[mask].searchsorted(where, side='right')
18 8.3 locs = np.where(locs > 0, locs - 1, -1)
19
20 23.3 result = np.arange(len(stamps))[mask].take(locs)
21
22 1.6 first = mask.argmax()
23 3.6 result[(locs == 0) & (where < stamps[first])] = -1
24
25 0.0 return result

The main trickiness here is this step:

1
result = np.arange(len(stamps))[mask].take(locs)

Since the indices returned by searchsorted are relative to the filtered values, you need to remap to what would have been the indices in the original array, and this does exactly that. Lastly, you might be interested in breaking up stamps[mask].searchsorted(where, side='right') to see how much time is spend in stamps[mask] versus searchsorted:

1
2
3
4
5
6
7
In [28]: timeit stamps[mask]
10 loops, best of 3: 127 ms per loop
In [29]: filt = stamps[mask]
In [31]: timeit filt.searchsorted(where)
1 loops, best of 3: 345 ms per loop

So, this could definitely be sped up by a factor or 2 or more by doing the whole operation in C or Cython, but it’s nice to be able to express it concisely and speedily with array operations.