In anticipation of integrating NumPy’s shiny new datetime64 dtype into pandas, I set about writing some faster alignment and merging functions for ordered time series data indexed by datetime64 timestamps. Many people have pointed me to the widely used R xts package as a baseline for highly optimized joining functions.
Anyway, long story short, with a little NumPy- and Cython-fu I think I’ve matched or beaten xts for almost all of its supported join types by up to 40% (left/outer/inner) using the merge.xts function.
In a blog article earlier today I wrote about some of the performance problems I had to address to do this. The rest of the joining code is pretty straightforward Cython code. Though it’ll probably be a few weeks before this new code gets incorporated into DataFrame.join. You’ll just have to take my word for it that I’m doing an apples-to-apples comparison (or read the source yourself) =)
Python benchmarks
Here are the Python timings in milliseconds for joining two time series data sets. The column labels are the lengths (in scientific notation, from 100 through 1,000,000). The two timings are with two univariate time series and two collections of 5 time series.
EDIT (9/24/2011): after corresponding with the xts author, Jeff Ryan, I reran the benchmarks with the code modified to ensure that garbage collection time isn’t being included in the runtime. The results after the change to the benchmark are less disparate than before. I also tweaked the Cython algos to determine the outer/inner join time index and re-ran the benchmarks. In the 1e6 outer join case the new algo trimmed 8 ms off, 4-5ms in the inner join case. Whenever I develop a strong desire to hack up a pile of spaghetti-like Cython code (combining the index union/intersection routines with the take / row-copying code) I can probably shave off another few millis…
Python Benchmark on 9/25
Joining two univariate time series
1e2 1e3 1e4 1e5 1e6
outer 0.0605 0.0637 0.1966 1.898 26.26
left 0.0187 0.02282 0.1157 1.023 13.89
inner 0.04526 0.05052 0.1523 1.382 22.25
Joining two 5-variate time series
1e2 1e3 1e4 1e5 1e6
outer 0.07944 0.0638 0.3178 6.498 67.46
left 0.0255 0.03512 0.2467 4.711 51.88
inner 0.06176 0.05262 0.2283 5.267 56.46
EDIT 9/28: I put in some work integrating the new merging routines throughout DataFrame and friends in pandas and added a new Int64Index class to facilitate fast joining of time series data. Here are the updated benchmarks, which now have pandas a bit slower than xts for outer/inner joins in the univariate case but still significantly faster in the multivariate case:
Python Benchmark on 9/28 post integration
Joining two univariate time series
1e2 1e3 1e4 1e5 1e6
outer 0.4501 0.314 0.5362 3.162 30.78
left 0.2742 0.2879 0.408 2.025 19.84
inner 0.2715 0.2863 0.4306 2.504 26.64
Joining two 5-variate time series
1e2 1e3 1e4 1e5 1e6
outer 0.4507 0.3375 0.6158 7.028 71.34
left 0.2959 0.3184 0.4927 5.068 55.2
inner 0.2767 0.305 0.5368 5.782 59.65
As you can see in the 1 million row case there is an additional 4-5 ms of overhead across the board which largely has to do with handling types other than floating point. With some effort I could eliminate this overhead but I’m going to leave it for now.
And the source code for the benchmark:
from pandas import *
import gc
import numpy as np
def bench_python(pct_overlap=0.20, K=1):
ns = [2, 3, 4, 5, 6]
iterations = 50
pct_overlap = 0.2
kinds = ['outer', 'left', 'inner']
all_results = {}
for logn in ns:
n = 10**logn
a = np.arange(n, dtype=np.int64)
b = np.arange(n * pct_overlap, n * pct_overlap + n, dtype=np.int64)
a_frame = DataFrame(np.random.randn(n, K), index=a, columns=range(K))
b_frame = DataFrame(np.random.randn(n, K), index=b, columns=range(K, 2 * K))
all_results[logn] = result = {}
for kind in kinds:
gc.disable(); _s = time.clock()
# do the join
for _ in range(iterations):
a_frame.join(b_frame, how=kind)
elapsed = time.clock() - _s; gc.enable()
result[kind] = (elapsed / iterations) * 1000
return DataFrame(all_results, index=kinds)
R/xts benchmarks
And the R benchmark using xts. The results for the smaller datasets are unreliable due to the low precision of system.time.
R Benchmark
Joining two univariate time series
1e2 1e3 1e4 1e5 1e6
outer 0.30 0.26 0.48 3.12 28.58
left 0.22 0.24 0.36 2.78 24.18
inner 0.20 0.24 0.40 2.42 21.06
Joining two 5-variate time series
1e2 1e3 1e4 1e5 1e6
outer 0.26 0.46 1.30 11.56 97.02
left 0.34 0.28 1.06 10.04 85.72
inner 0.30 0.28 0.94 8.02 67.22
The Python code for the benchmark is all found here.
Here is the R code for the benchmark (GitHub link):
library(xts)
iterations <- 50
ns = c(100, 1000, 10000, 100000, 1000000)
kinds = c("outer", "left", "inner")
result = matrix(0, nrow=3, ncol=length(ns))
n <- 100000
pct.overlap <- 0.2
k <- 1
for (ni in 1:length(ns)){
n <- ns[ni]
rng1 <- 1:n
offset <- as.integer(n * pct.overlap)
rng2 <- rng1 + offset
x <- xts(matrix(rnorm(n * k), nrow=n, ncol=k),
as.POSIXct(Sys.Date()) + rng1)
y <- xts(matrix(rnorm(n * k), nrow=n, ncol=k),
as.POSIXct(Sys.Date()) + rng2)
timing <- numeric()
for (i in 1:3) {
kind = kinds[i]
for(j in 1:iterations) {
gc() # just to be sure
timing[j] <- system.time(merge(x,y,join=kind))[3]
}
#timing <- system.time(for (j in 1:iterations) merge.xts(x, y, join=kind),
# gcFirst=F)
#timing <- as.list(timing)
result[i, ni] <- mean(timing) * 1000
#result[i, ni] = (timing$elapsed / iterations) * 1000
}
}
rownames(result) <- kinds
colnames(result) <- log10(ns)