## A O(n log n) NA-friendly time series “as of” using array operations

In time series data, it’s fairly common to need to compute the last known value “as of” a particular date. However, missing data is the norm, so it’s a touch more complicated than doing a simple binary search. Here is an implementation using array operations that takes these things into account:

Python
 1234567891011121314151617181920212223 def asof_locs(stamps, where, mask): """ Parameters ---------- stamps : array of timestamps where : array of timestamps Values to determine the "as of" for mask : array of booleans where data is not NA Returns ------- locs : array of ints Locations of each "as of" value, -1 for NA """ locs = stamps[mask].searchsorted(where, side='right') locs = np.where(locs > 0, locs - 1, -1) result = np.arange(len(stamps))[mask].take(locs) first = mask.argmax() result[(locs == 0) & (where < stamps[first])] = -1 return result

Some algorithmic notes. First, let’s run this through line_profiler on a large time series to see where time is being taken up:

 12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849 In [9]: rng = date_range('1/1/2000', '6/1/2000', freq='10s') In [10]: s = Series(np.random.randn(len(rng)), index=rng) In [11]: rng Out[11]: [2000-01-01 00:00:00, ..., 2000-06-01 00:00:00] Length: 1313281, Freq: 10S, Timezone: None In [12]: s = s.resample('s') In [13]: len(s) Out[13]: 13132801 In [24]: mask = notnull(s) In [25]: lprun -f asof_locs asof_locs(s.index, s.index[5::5], mask) Timer unit: 1e-06 s File: foo.py Function: asof_locs at line 3 Total time: 0.793543 s Line # % Time Line Contents ============================================================== 3 def asof_locs(stamps, where, mask): 4 """ 5 Parameters 6 ---------- 7 stamps : array of timestamps 8 where : array of timestamps 9 Values to determine the "as of" for 10 mask : array of booleans where data is NA 11 12 Returns 13 ------- 14 locs : array of ints 15 Locations of each "as of" value, -1 for NA 16 """ 17 63.3 locs = stamps[mask].searchsorted(where, side='right') 18 8.3 locs = np.where(locs > 0, locs - 1, -1) 19 20 23.3 result = np.arange(len(stamps))[mask].take(locs) 21 22 1.6 first = mask.argmax() 23 3.6 result[(locs == 0) & (where < stamps[first])] = -1 24 25 0.0 return result

The main trickiness here is this step:

Since the indices returned by searchsorted are relative to the filtered values, you need to remap to what would have been the indices in the original array, and this does exactly that. Lastly, you might be interested in breaking up stamps[mask].searchsorted(where, side='right') to see how much time is spend in stamps[mask] versus searchsorted:

 1234567 In [28]: timeit stamps[mask] 10 loops, best of 3: 127 ms per loop In [29]: filt = stamps[mask] In [31]: timeit filt.searchsorted(where) 1 loops, best of 3: 345 ms per loop

So, this could definitely be sped up by a factor or 2 or more by doing the whole operation in C or Cython, but it’s nice to be able to express it concisely and speedily with array operations.

• Thomas

Hi Wes, isn’t one of the masks meant to be ‘~mask’ instead? This throws up an error when there aren’t any NaN values in the series.

Thanks,

Thomas