Faster time series alignment / joins for pandas, beating R’s xts package

In anticipation of integrating NumPy’s shiny new datetime64 dtype into pandas, I set about writing some faster alignment and merging functions for ordered time series data indexed by datetime64 timestamps. Many people have pointed me to the widely used R xts package as a baseline for highly optimized joining functions.
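For reference, the kind of index involved is just an array of datetime64 values. A minimal NumPy sketch (the dtype was brand new and still experimental at the time, so exact behavior varies by NumPy version):

import numpy as np

# Ten daily timestamps, stored internally as 64-bit integers
stamps = np.arange('2011-09-01', '2011-09-11', dtype='datetime64[D]')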

Anyway, long story short, with a little NumPy- and Cython-fu I think I’ve matched or beaten xts’s merge.xts function for almost all of its supported join types (left/outer/inner), by margins of up to 40%.

In a blog article earlier today I wrote about some of the performance problems I had to address to do this. The rest of the joining code is pretty straightforward Cython code, though it’ll probably be a few weeks before this new code gets incorporated into DataFrame.join. Until then you’ll just have to take my word for it that I’m doing an apples-to-apples comparison (or read the source yourself) =)

Python benchmarks

Here are the Python timings in milliseconds for joining two time series data sets. The column labels are the series lengths in scientific notation (1e2 = 100 through 1e6 = 1,000,000). Two sets of timings are shown: one for joining two univariate time series, and one for joining two collections of 5 time series each.

EDIT (9/24/2011): after corresponding with the xts author, Jeff Ryan, I reran the benchmarks with the code modified to ensure that garbage collection time isn’t being included in the runtime. The results after the change to the benchmark are less disparate than before. I also tweaked the Cython algos that determine the outer/inner join time index and re-ran the benchmarks. In the 1e6 outer join case the new algo trimmed off 8 ms, and 4-5 ms in the inner join case. Whenever I develop a strong desire to hack up a pile of spaghetti-like Cython code (combining the index union/intersection routines with the take / row-copying code) I can probably shave off another few millis…
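For the curious, here is roughly the computation such a routine performs, as a pure-Python sketch of the idea rather than the actual Cython implementation (the name outer_join_indexer is mine): given two sorted, unique int64 indexes, walk them in tandem to produce the union index plus, for each side, the row positions to copy from, with -1 marking timestamps missing on that side.

import numpy as np

def outer_join_indexer(left, right):
    # left, right: sorted, unique int64 arrays (e.g. datetime64 viewed as i8)
    i = j = 0
    union, lidx, ridx = [], [], []
    while i < len(left) or j < len(right):
        if j == len(right) or (i < len(left) and left[i] < right[j]):
            # timestamp only present on the left side
            union.append(left[i]); lidx.append(i); ridx.append(-1)
            i += 1
        elif i == len(left) or right[j] < left[i]:
            # timestamp only present on the right side
            union.append(right[j]); lidx.append(-1); ridx.append(j)
            j += 1
        else:
            # equal timestamps: row present on both sides
            union.append(left[i]); lidx.append(i); ridx.append(j)
            i += 1; j += 1
    return np.asarray(union), np.asarray(lidx), np.asarray(ridx)

The inner join variant emits positions only where both sides match, and the left join keeps the left index fixed and searches for each of its timestamps on the right.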

Python Benchmark on 9/25

Joining two univariate time series
       1e2      1e3      1e4     1e5    1e6    
outer  0.0605   0.0637   0.1966  1.898  26.26
left   0.0187   0.02282  0.1157  1.023  13.89
inner  0.04526  0.05052  0.1523  1.382  22.25

Joining two 5-variate time series
       1e2      1e3      1e4     1e5    1e6    
outer  0.07944  0.0638   0.3178  6.498  67.46
left   0.0255   0.03512  0.2467  4.711  51.88
inner  0.06176  0.05262  0.2283  5.267  56.46

EDIT (9/28/2011): I put in some work integrating the new merging routines throughout DataFrame and friends in pandas, and added a new Int64Index class to facilitate fast joining of time series data. Here are the updated benchmarks, which now show pandas a bit slower than xts for outer/inner joins in the univariate case but still significantly faster in the multivariate case:

Python Benchmark on 9/28 post integration

Joining two univariate time series
       1e2     1e3     1e4     1e5    1e6    
outer  0.4501  0.314   0.5362  3.162  30.78
left   0.2742  0.2879  0.408   2.025  19.84
inner  0.2715  0.2863  0.4306  2.504  26.64

Joining two 5-variate time series
       1e2      1e3    1e4     1e5    1e6    
outer  0.4507  0.3375  0.6158  7.028  71.34
left   0.2959  0.3184  0.4927  5.068  55.2
inner  0.2767  0.305   0.5368  5.782  59.65

As you can see, in the 1 million row case there is an additional 4-5 ms of overhead across the board, which largely has to do with handling types other than floating point. With some effort I could eliminate this overhead, but I’m going to leave it for now.
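To make that concrete, here is a hypothetical pure-NumPy version of the row-copying step (take_with_fill is my name, not a pandas internal): float columns can hold NaN in the missing positions directly, while other dtypes first have to be upcast, which is where those extra milliseconds go.

import numpy as np

def take_with_fill(values, indexer):
    # indexer comes from the join routines; -1 means "no row on this side".
    # Upcasting to float64 so the result can hold NaN is exactly the sort
    # of extra work that non-float columns incur.
    out = values.astype(np.float64).take(indexer.clip(0), axis=0)
    out[indexer == -1] = np.nan
    return out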

And the source code for the benchmark:

from pandas import *
import numpy as np
import gc
import time

def bench_python(pct_overlap=0.20, K=1):
    ns = [2, 3, 4, 5, 6]  # log10 of the series lengths: 1e2 through 1e6
    iterations = 50
    kinds = ['outer', 'left', 'inner']

    all_results = {}
    for logn in ns:
        n = 10 ** logn
        # two int64 indexes offset so they overlap by pct_overlap
        a = np.arange(n, dtype=np.int64)
        b = np.arange(n * pct_overlap, n * pct_overlap + n, dtype=np.int64)
        a_frame = DataFrame(np.random.randn(n, K), index=a,
                            columns=range(K))
        b_frame = DataFrame(np.random.randn(n, K), index=b,
                            columns=range(K, 2 * K))
        all_results[logn] = result = {}
        for kind in kinds:
            gc.disable()  # keep garbage collection out of the timings
            _s = time.clock()
            for _ in range(iterations):
                a_frame.join(b_frame, how=kind)
            elapsed = time.clock() - _s
            gc.enable()
            result[kind] = (elapsed / iterations) * 1000  # ms per join
    return DataFrame(all_results, index=kinds)
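The univariate and 5-variate tables above correspond to runs along these lines (timings will of course vary by machine). Note the indexes are int64 arrays, so the joins exercise the new Int64Index code path:

uni = bench_python(K=1)    # two univariate time series
multi = bench_python(K=5)  # two collections of 5 time series
print(uni)
print(multi)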

R/xts benchmarks

And here is the R benchmark using xts. The results for the smaller data sets are unreliable due to the low precision of system.time.

R Benchmark

Joining two univariate time series
       1e2  1e3  1e4  1e5   1e6
outer 0.30 0.26 0.48 3.12 28.58
left  0.22 0.24 0.36 2.78 24.18
inner 0.20 0.24 0.40 2.42 21.06

Joining two 5-variate time series
       1e2  1e3  1e4   1e5   1e6
outer 0.26 0.46 1.30 11.56 97.02
left  0.34 0.28 1.06 10.04 85.72
inner 0.30 0.28 0.94  8.02 67.22

The full Python code for the benchmark can be found here.

Here is the R code for the benchmark (GitHub link):

library(xts)

iterations <- 50

ns <- c(100, 1000, 10000, 100000, 1000000)
kinds <- c("outer", "left", "inner")

result <- matrix(0, nrow=length(kinds), ncol=length(ns))
pct.overlap <- 0.2
k <- 1

for (ni in 1:length(ns)) {
  n <- ns[ni]
  rng1 <- 1:n
  offset <- as.integer(n * pct.overlap)
  rng2 <- rng1 + offset  # 20% overlap with rng1
  x <- xts(matrix(rnorm(n * k), nrow=n, ncol=k),
           as.POSIXct(Sys.Date()) + rng1)
  y <- xts(matrix(rnorm(n * k), nrow=n, ncol=k),
           as.POSIXct(Sys.Date()) + rng2)
  timing <- numeric()
  for (i in 1:length(kinds)) {
    kind <- kinds[i]
    for (j in 1:iterations) {
      gc()  # collect first so GC time isn't counted in the timing
      timing[j] <- system.time(merge(x, y, join=kind))[3]
    }
    result[i, ni] <- mean(timing) * 1000  # ms per join
  }
}

rownames(result) <- kinds
colnames(result) <- log10(ns)