In anticipation of integrating NumPy’s shiny new `datetime64` dtype into pandas, I set about writing some faster alignment and merging functions for ordered time series data indexed by datetime64 timestamps. Many people have pointed me to the widely used R xts package as a baseline for highly optimized joining functions.

Anyway, long story short, with a little NumPy- and Cython-fu I think I’ve matched or beaten the performance of xts’s `merge.xts` function, by up to 40% for almost all of its supported join types (left/outer/inner).

In a blog article earlier today I wrote about some of the performance problems I had to address to do this. The rest of the joining code is pretty straightforward Cython. It’ll probably be a few weeks before this new code gets incorporated into `DataFrame.join`, so you’ll just have to take my word for it that I’m doing an apples-to-apples comparison (or read the source yourself) =)
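The core of the problem is computing the result index, and the row indexers into each input, for two monotonic `int64` indexes. As a rough illustration only (the actual routines are hand-written Cython, and the names below are mine), the same quantities can be sketched with plain NumPy:

```python
import numpy as np

# Two sorted, partially overlapping int64 "timestamp" indexes
left = np.array([1, 2, 4, 6], dtype=np.int64)
right = np.array([2, 3, 6, 7], dtype=np.int64)

# Result indexes for the three join types
outer = np.union1d(left, right)      # all labels, sorted
inner = np.intersect1d(left, right)  # labels present in both

# For a left join, find where each label of `left` would land in `right`,
# then check whether the label actually matches (misses become NaN rows)
pos = np.searchsorted(right, left).clip(0, len(right) - 1)
mask = right[pos] == left
```

The Cython versions do the same work in a single pass over the two sorted arrays rather than via set operations plus binary search, which is where the speedups come from.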

### Python benchmarks

Here are the Python timings in milliseconds for joining two time series data sets. The column labels are the lengths (in scientific notation, from 100 through 1,000,000). The two sets of timings are for joining two univariate time series and for joining two collections of 5 time series.

**EDIT (9/24/2011):** after corresponding with the xts author, Jeff Ryan, I reran the benchmarks with the code modified to ensure that garbage-collection time isn’t included in the runtime. The results after this change are less disparate than before. I also tweaked the Cython algorithms that determine the outer/inner join time index and re-ran the benchmarks: in the 1e6 case the new algorithm trimmed 8 ms off the outer join and 4-5 ms off the inner join. Whenever I develop a strong desire to hack up a pile of spaghetti-like Cython code (combining the index union/intersection routines with the take / row-copying code) I can probably shave off another few millis…

Joining two univariate time series:

|       | 1e2     | 1e3     | 1e4    | 1e5   | 1e6   |
|-------|---------|---------|--------|-------|-------|
| outer | 0.0605  | 0.0637  | 0.1966 | 1.898 | 26.26 |
| left  | 0.0187  | 0.02282 | 0.1157 | 1.023 | 13.89 |
| inner | 0.04526 | 0.05052 | 0.1523 | 1.382 | 22.25 |

Joining two 5-variate time series:

|       | 1e2     | 1e3     | 1e4    | 1e5   | 1e6   |
|-------|---------|---------|--------|-------|-------|
| outer | 0.07944 | 0.0638  | 0.3178 | 6.498 | 67.46 |
| left  | 0.0255  | 0.03512 | 0.2467 | 4.711 | 51.88 |
| inner | 0.06176 | 0.05262 | 0.2283 | 5.267 | 56.46 |

**EDIT 9/28:** I put in some work integrating the new merging routines throughout DataFrame and friends in pandas and added a new `Int64Index` class to facilitate fast joining of time series data. Here are the updated benchmarks, which now have pandas a bit slower than xts for outer/inner joins in the univariate case but still significantly faster in the multivariate case:
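To illustrate the workload these routines now back (a hedged sketch via the public API; the specialized integer index is constructed under the hood from the `int64` labels):

```python
import numpy as np
from pandas import DataFrame

# Two frames indexed by overlapping int64 labels, as in the benchmark
a = DataFrame(np.random.randn(5, 1),
              index=np.arange(5, dtype=np.int64), columns=['x'])
b = DataFrame(np.random.randn(5, 1),
              index=np.arange(2, 7, dtype=np.int64), columns=['y'])

outer = a.join(b, how='outer')  # union of the indexes: labels 0 through 6
inner = a.join(b, how='inner')  # intersection: labels 2, 3, 4
```

Labels present in only one frame show up as NaN rows in the other frame’s columns of the outer-joined result.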

Joining two univariate time series:

|       | 1e2    | 1e3    | 1e4    | 1e5   | 1e6   |
|-------|--------|--------|--------|-------|-------|
| outer | 0.4501 | 0.314  | 0.5362 | 3.162 | 30.78 |
| left  | 0.2742 | 0.2879 | 0.408  | 2.025 | 19.84 |
| inner | 0.2715 | 0.2863 | 0.4306 | 2.504 | 26.64 |

Joining two 5-variate time series:

|       | 1e2    | 1e3    | 1e4    | 1e5   | 1e6   |
|-------|--------|--------|--------|-------|-------|
| outer | 0.4507 | 0.3375 | 0.6158 | 7.028 | 71.34 |
| left  | 0.2959 | 0.3184 | 0.4927 | 5.068 | 55.2  |
| inner | 0.2767 | 0.305  | 0.5368 | 5.782 | 59.65 |

As you can see, in the 1-million-row case there is an additional 4-5 ms of overhead across the board, which largely comes from handling dtypes other than floating point. With some effort I could eliminate this overhead, but I’m going to leave it for now.

And the source code for the benchmark:

```python
import gc
import time

import numpy as np

from pandas import DataFrame

def bench_python(pct_overlap=0.20, K=1):
    ns = [2, 3, 4, 5, 6]
    iterations = 50
    kinds = ['outer', 'left', 'inner']

    all_results = {}
    for logn in ns:
        n = 10 ** logn

        # two int64 indexes offset by pct_overlap * n
        a = np.arange(n, dtype=np.int64)
        b = np.arange(n * pct_overlap, n * pct_overlap + n,
                      dtype=np.int64)

        a_frame = DataFrame(np.random.randn(n, K), index=a,
                            columns=range(K))
        b_frame = DataFrame(np.random.randn(n, K), index=b,
                            columns=range(K, 2 * K))

        all_results[logn] = result = {}
        for kind in kinds:
            gc.disable()
            _s = time.clock()

            # do the join
            for _ in range(iterations):
                a_frame.join(b_frame, how=kind)

            elapsed = time.clock() - _s
            gc.enable()

            result[kind] = (elapsed / iterations) * 1000

    return DataFrame(all_results, index=kinds)
```
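As an aside, on a modern Python 3 the same measurement pattern would use `time.perf_counter` (`time.clock` has since been removed from the standard library). A small helper sketching the pattern, with my own names:

```python
import gc
import time

def time_call(func, iterations=50):
    # Average per-call time in milliseconds, with the garbage
    # collector disabled for the duration of the timed loop
    gc.disable()
    start = time.perf_counter()
    for _ in range(iterations):
        func()
    elapsed = time.perf_counter() - start
    gc.enable()
    return (elapsed / iterations) * 1000

per_call_ms = time_call(lambda: sum(range(1000)))
```

Disabling GC while timing is the same trick the benchmark above uses to keep collection pauses out of the measured runtime.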

### R/xts benchmarks

And here are the R benchmarks using xts. The results for the smaller datasets are unreliable due to the low precision of `system.time`.

Joining two univariate time series:

|       | 1e2  | 1e3  | 1e4  | 1e5  | 1e6   |
|-------|------|------|------|------|-------|
| outer | 0.30 | 0.26 | 0.48 | 3.12 | 28.58 |
| left  | 0.22 | 0.24 | 0.36 | 2.78 | 24.18 |
| inner | 0.20 | 0.24 | 0.40 | 2.42 | 21.06 |

Joining two 5-variate time series:

|       | 1e2  | 1e3  | 1e4  | 1e5   | 1e6   |
|-------|------|------|------|-------|-------|
| outer | 0.26 | 0.46 | 1.30 | 11.56 | 97.02 |
| left  | 0.34 | 0.28 | 1.06 | 10.04 | 85.72 |
| inner | 0.30 | 0.28 | 0.94 | 8.02  | 67.22 |

The Python code for the benchmark is all found here.

Here is the R code for the benchmark (GitHub link):

```r
library(xts)

iterations <- 50
ns <- c(100, 1000, 10000, 100000, 1000000)
kinds <- c("outer", "left", "inner")

result <- matrix(0, nrow=3, ncol=length(ns))
pct.overlap <- 0.2
k <- 1

for (ni in 1:length(ns)) {
  n <- ns[ni]
  rng1 <- 1:n
  offset <- as.integer(n * pct.overlap)
  rng2 <- rng1 + offset

  x <- xts(matrix(rnorm(n * k), nrow=n, ncol=k),
           as.POSIXct(Sys.Date()) + rng1)
  y <- xts(matrix(rnorm(n * k), nrow=n, ncol=k),
           as.POSIXct(Sys.Date()) + rng2)

  timing <- numeric()
  for (i in 1:3) {
    kind <- kinds[i]
    for (j in 1:iterations) {
      gc()  # just to be sure GC doesn't run inside the timed call
      timing[j] <- system.time(merge(x, y, join=kind))[3]
    }
    result[i, ni] <- mean(timing) * 1000
  }
}

rownames(result) <- kinds
colnames(result) <- log10(ns)
```