In time series data, it’s fairly common to need to compute the last known value “as of” a particular date. However, missing data is the norm, so it’s a touch more complicated than doing a simple binary search. Here is an implementation using array operations that takes these things into account:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
def asof_locs(stamps, where, mask): """ Parameters ---------- stamps : array of timestamps where : array of timestamps Values to determine the "as of" for mask : array of booleans where data is not NA Returns ------- locs : array of ints Locations of each "as of" value, -1 for NA """ locs = stamps[mask].searchsorted(where, side='right') locs = np.where(locs > 0, locs - 1, -1) result = np.arange(len(stamps))[mask].take(locs) first = mask.argmax() result[(locs == 0) & (where < stamps[first])] = -1 return result |

Some algorithmic notes. First, let’s run this through `line_profiler` on a large time series to see where time is being taken up:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 |
In [9]: rng = date_range('1/1/2000', '6/1/2000', freq='10s') In [10]: s = Series(np.random.randn(len(rng)), index=rng) In [11]: rng Out[11]: <class 'pandas.tseries.index.DatetimeIndex'> [2000-01-01 00:00:00, ..., 2000-06-01 00:00:00] Length: 1313281, Freq: 10S, Timezone: None In [12]: s = s.resample('s') In [13]: len(s) Out[13]: 13132801 In [24]: mask = notnull(s) In [25]: lprun -f asof_locs asof_locs(s.index, s.index[5::5], mask) Timer unit: 1e-06 s File: foo.py Function: asof_locs at line 3 Total time: 0.793543 s Line # % Time Line Contents ============================================================== 3 def asof_locs(stamps, where, mask): 4 """ 5 Parameters 6 ---------- 7 stamps : array of timestamps 8 where : array of timestamps 9 Values to determine the "as of" for 10 mask : array of booleans where data is NA 11 12 Returns 13 ------- 14 locs : array of ints 15 Locations of each "as of" value, -1 for NA 16 """ 17 63.3 locs = stamps[mask].searchsorted(where, side='right') 18 8.3 locs = np.where(locs > 0, locs - 1, -1) 19 20 23.3 result = np.arange(len(stamps))[mask].take(locs) 21 22 1.6 first = mask.argmax() 23 3.6 result[(locs == 0) & (where < stamps[first])] = -1 24 25 0.0 return result |

The main trickiness here is this step:

1 |
result = np.arange(len(stamps))[mask].take(locs) |

Since the indices returned by `searchsorted` are relative to the *filtered* values, you need to remap to what would have been the indices in the original array, and this does exactly that. Lastly, you might be interested in breaking up `stamps[mask].searchsorted(where, side='right')` to see how much time is spend in `stamps[mask]` versus `searchsorted`:

1 2 3 4 5 6 7 |
In [28]: timeit stamps[mask] 10 loops, best of 3: 127 ms per loop In [29]: filt = stamps[mask] In [31]: timeit filt.searchsorted(where) 1 loops, best of 3: 345 ms per loop |

So, this could definitely be sped up by a factor or 2 or more by doing the whole operation in C or Cython, but it’s nice to be able to express it concisely and speedily with array operations.