In time series data, it’s fairly common to need to compute the last known value “as of” a particular date. However, missing data is the norm, so it’s a touch more complicated than doing a simple binary search. Here is an implementation using array operations that takes these things into account:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
def asof_locs(stamps, where, mask):
"""
Parameters
----------
stamps : array of timestamps
where : array of timestamps
Values to determine the "as of" for
mask : array of booleans where data is not NA
Returns
-------
locs : array of ints
Locations of each "as of" value, -1 for NA
"""
locs = stamps[mask].searchsorted(where, side='right')
locs = np.where(locs > 0, locs - 1, -1)
result = np.arange(len(stamps))[mask].take(locs)
first = mask.argmax()
result[(locs == 0) & (where < stamps[first])] = -1
return result |

Some algorithmic notes. First, let’s run this through `line_profiler` on a large time series to see where time is being taken up:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 |
In [9]: rng = date_range('1/1/2000', '6/1/2000', freq='10s')
In [10]: s = Series(np.random.randn(len(rng)), index=rng)
In [11]: rng
Out[11]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2000-01-01 00:00:00, ..., 2000-06-01 00:00:00]
Length: 1313281, Freq: 10S, Timezone: None
In [12]: s = s.resample('s')
In [13]: len(s)
Out[13]: 13132801
In [24]: mask = notnull(s)
In [25]: lprun -f asof_locs asof_locs(s.index, s.index[5::5], mask)
Timer unit: 1e-06 s
File: foo.py
Function: asof_locs at line 3
Total time: 0.793543 s
Line # % Time Line Contents
==============================================================
3 def asof_locs(stamps, where, mask):
4 """
5 Parameters
6 ----------
7 stamps : array of timestamps
8 where : array of timestamps
9 Values to determine the "as of" for
10 mask : array of booleans where data is NA
11
12 Returns
13 -------
14 locs : array of ints
15 Locations of each "as of" value, -1 for NA
16 """
17 63.3 locs = stamps[mask].searchsorted(where, side='right')
18 8.3 locs = np.where(locs > 0, locs - 1, -1)
19
20 23.3 result = np.arange(len(stamps))[mask].take(locs)
21
22 1.6 first = mask.argmax()
23 3.6 result[(locs == 0) & (where < stamps[first])] = -1
24
25 0.0 return result |

The main trickiness here is this step:

1 |
result = np.arange(len(stamps))[mask].take(locs) |

Since the indices returned by `searchsorted` are relative to the *filtered* values, you need to remap to what would have been the indices in the original array, and this does exactly that. Lastly, you might be interested in breaking up `stamps[mask].searchsorted(where, side='right')` to see how much time is spend in `stamps[mask]` versus `searchsorted`:

1 2 3 4 5 6 7 |
In [28]: timeit stamps[mask]
10 loops, best of 3: 127 ms per loop
In [29]: filt = stamps[mask]
In [31]: timeit filt.searchsorted(where)
1 loops, best of 3: 345 ms per loop |

So, this could definitely be sped up by a factor or 2 or more by doing the whole operation in C or Cython, but it’s nice to be able to express it concisely and speedily with array operations.