September 29, 2015: This post is so old, I suggest you install latest versions of everything and try the benchmarks out yourself. -- Wes

In anticipation of integrating NumPy's shiny new `datetime64` dtype into pandas, I set about writing some faster alignment and merging functions for ordered time series data indexed by `datetime64` timestamps. Many people have pointed me to the widely used R xts package as a baseline for highly optimized joining functions.

Anyway, long story short, with a little NumPy- and Cython-fu I think I've matched or beaten xts's `merge.xts` function for almost all of its supported join types (left/outer/inner), by up to 40%.

In a blog article earlier today I wrote about some of the performance problems I had to address to do this. The rest of the joining code is pretty straightforward Cython. It'll probably be a few weeks before this new code gets incorporated into `DataFrame.join`, so you'll just have to take my word for it that I'm doing an apples-to-apples comparison (or read the source yourself) =)

### Python benchmarks

Here are the Python timings in milliseconds for joining two time series data sets. The column labels are the dataset lengths in scientific notation, from 1e2 (100) through 1e6 (1,000,000). Timings are shown for two univariate time series and for two collections of 5 time series.

**EDIT (9/24/2011):** after corresponding with the xts author, Jeff Ryan, I reran the benchmarks with the code modified to ensure that garbage collection time isn't being included in the runtime. The results after the change to the benchmark are less disparate than before. I also tweaked the Cython algos to determine the outer/inner join time index and re-ran the benchmarks. In the 1e6 outer join case the new algo trimmed 8 ms off, 4-5ms in the inner join case. Whenever I develop a strong desire to hack up a pile of spaghetti-like Cython code (combining the index union/intersection routines with the take / row-copying code) I can probably shave off another few millis…
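To illustrate the idea of fusing the index union computation with the row-copying bookkeeping, here is a pure-Python sketch of an ordered outer join over sorted, unique int64 indexes. The actual routines are Cython and the function name and return shape here are illustrative, not pandas internals:

```python
import numpy as np

def outer_join_index(left, right):
    """Given two sorted, unique int64 arrays, compute the sorted union
    plus 'indexer' arrays mapping each output position back to a
    position in left/right (-1 where that side is absent).  The
    indexers drive the subsequent take/row-copying step."""
    i = j = 0
    union, lidx, ridx = [], [], []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            union.append(left[i]); lidx.append(i); ridx.append(j)
            i += 1; j += 1
        elif left[i] < right[j]:
            union.append(left[i]); lidx.append(i); ridx.append(-1)
            i += 1
        else:
            union.append(right[j]); lidx.append(-1); ridx.append(j)
            j += 1
    # drain whichever side still has leftovers
    for k in range(i, len(left)):
        union.append(left[k]); lidx.append(k); ridx.append(-1)
    for k in range(j, len(right)):
        union.append(right[k]); lidx.append(-1); ridx.append(k)
    return (np.asarray(union, dtype=np.int64),
            np.asarray(lidx, dtype=np.int64),
            np.asarray(ridx, dtype=np.int64))
```

An inner join is the same loop keeping only the equal-keys branch; doing the union and the indexers in a single pass is what saves the extra milliseconds mentioned above.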

Python Benchmark on 9/25

```
Joining two univariate time series
         1e2      1e3      1e4     1e5    1e6
outer  0.0605   0.0637   0.1966  1.898  26.26
left   0.0187   0.02282  0.1157  1.023  13.89
inner  0.04526  0.05052  0.1523  1.382  22.25

Joining two 5-variate time series
         1e2      1e3      1e4     1e5    1e6
outer  0.07944  0.0638   0.3178  6.498  67.46
left   0.0255   0.03512  0.2467  4.711  51.88
inner  0.06176  0.05262  0.2283  5.267  56.46
```

**EDIT 9/28:** I put in some work integrating the new merging routines throughout DataFrame and friends in pandas and added a new `Int64Index` class to facilitate fast joining of time series data. Here are the updated benchmarks, which now have pandas a bit slower than xts for outer/inner joins in the univariate case but still significantly faster in the multivariate case:

Python Benchmark on 9/28, post integration

```
Joining two univariate time series
         1e2     1e3     1e4     1e5    1e6
outer  0.4501  0.314   0.5362  3.162  30.78
left   0.2742  0.2879  0.408   2.025  19.84
inner  0.2715  0.2863  0.4306  2.504  26.64

Joining two 5-variate time series
         1e2     1e3     1e4     1e5    1e6
outer  0.4507  0.3375  0.6158  7.028  71.34
left   0.2959  0.3184  0.4927  5.068  55.2
inner  0.2767  0.305   0.5368  5.782  59.65
```

As you can see, in the 1 million row case there is an additional 4-5 ms of overhead across the board, which largely has to do with handling types other than floating point. With some effort I could eliminate this overhead, but I'm going to leave it for now.
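To make concrete what each join type means on the offset integer indexes used throughout these benchmarks, here is a toy version of the setup (variable names are illustrative):

```python
import numpy as np
from pandas import DataFrame

n = 10
a = np.arange(n, dtype=np.int64)   # index 0..9
b = a + int(n * 0.2)               # shifted by 20% of n: index 2..11

left = DataFrame(np.random.randn(n, 1), index=a, columns=["x"])
right = DataFrame(np.random.randn(n, 1), index=b, columns=["y"])

outer = left.join(right, how="outer")  # union of the indexes: 0..11
inner = left.join(right, how="inner")  # intersection: 2..9
print(len(outer), len(inner))          # 12 8
```

The 20% shift means 80% of the rows line up, so the outer result grows by the two non-overlapping tails while the inner result shrinks to the overlap.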

And the source code for the benchmark:

```python
from pandas import *
import gc
import time
import numpy as np

def bench_python(pct_overlap=0.20, K=1):
    ns = [2, 3, 4, 5, 6]  # log10 of the dataset lengths
    iterations = 50
    kinds = ['outer', 'left', 'inner']
    all_results = {}
    for logn in ns:
        n = 10**logn
        a = np.arange(n, dtype=np.int64)
        b = np.arange(n * pct_overlap, n * pct_overlap + n, dtype=np.int64)
        a_frame = DataFrame(np.random.randn(n, K), index=a,
                            columns=range(K))
        b_frame = DataFrame(np.random.randn(n, K), index=b,
                            columns=range(K, 2 * K))
        all_results[logn] = result = {}
        for kind in kinds:
            gc.disable()
            _s = time.clock()
            # do the join
            for _ in range(iterations):
                a_frame.join(b_frame, how=kind)
            elapsed = time.clock() - _s
            gc.enable()
            result[kind] = (elapsed / iterations) * 1000
    return DataFrame(all_results, index=kinds)
```

### R/xts benchmarks

And the R benchmark using xts. The results for the smaller datasets are unreliable due to the low precision of `system.time`.

R Benchmark

```
Joining two univariate time series
        1e2   1e3   1e4    1e5    1e6
outer  0.30  0.26  0.48   3.12  28.58
left   0.22  0.24  0.36   2.78  24.18
inner  0.20  0.24  0.40   2.42  21.06

Joining two 5-variate time series
        1e2   1e3   1e4    1e5    1e6
outer  0.26  0.46  1.30  11.56  97.02
left   0.34  0.28  1.06  10.04  85.72
inner  0.30  0.28  0.94   8.02  67.22
```

The Python code for the benchmark is all found here.

Here is the R code for the benchmark (GitHub link):

```r
library(xts)

iterations <- 50
ns <- c(100, 1000, 10000, 100000, 1000000)
kinds <- c("outer", "left", "inner")
result <- matrix(0, nrow=3, ncol=length(ns))
pct.overlap <- 0.2
k <- 1

for (ni in 1:length(ns)) {
  n <- ns[ni]
  rng1 <- 1:n
  offset <- as.integer(n * pct.overlap)
  rng2 <- rng1 + offset
  x <- xts(matrix(rnorm(n * k), nrow=n, ncol=k),
           as.POSIXct(Sys.Date()) + rng1)
  y <- xts(matrix(rnorm(n * k), nrow=n, ncol=k),
           as.POSIXct(Sys.Date()) + rng2)
  timing <- numeric()
  for (i in 1:3) {
    kind <- kinds[i]
    for (j in 1:iterations) {
      gc()  # just to be sure
      timing[j] <- system.time(merge(x, y, join=kind))[3]
    }
    result[i, ni] <- mean(timing) * 1000
  }
}
rownames(result) <- kinds
colnames(result) <- log10(ns)
```