Over the last week I have completely retooled pandas's "database" join infrastructure and algorithms in order to support the full gamut of SQL-style many-to-many merges (pandas has had one-to-one and many-to-one joins for a long time). I was curious how the performance on reasonably large data sets compares with base::merge.data.frame, which I've used many times in R. So I ran a little benchmark: joining a 100,000-row DataFrame with a 10,000-row DataFrame on two keys, with about 10,000 unique key combinations overall, simulating a somewhat large SQL join.
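To make the many-to-many case concrete, here is a minimal sketch using the public pd.merge interface. The frame contents are made up for illustration; the point is that when a key is duplicated on both sides, the join emits every matching pair of rows:

```python
import pandas as pd

# Duplicate keys on both sides trigger the many-to-many case:
# each matching (left row, right row) pair becomes an output row.
left = pd.DataFrame({'key': ['a', 'a', 'b'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'a', 'b'], 'rval': [4, 5, 6]})

merged = pd.merge(left, right, on='key')
# 'a' matches 2 x 2 = 4 row pairs, 'b' matches 1 x 1 = 1, so 5 rows total.
print(len(merged))
```

This pairwise expansion is exactly what one-to-one and many-to-one joins never had to handle, which is why it required new infrastructure.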
Note this new functionality will be shipping in the upcoming 0.7.0 release (!).
There is one major factor affecting the performance of the algorithms in pandas: whether the result data needs to be sorted lexicographically by the join columns (which is the default behavior). R offers the same option, so the comparison is completely apples to apples. These are mean runtimes in seconds:
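For reference, the sorting behavior is controlled by a single keyword in pd.merge. A minimal sketch (frame contents are made up for illustration):

```python
import pandas as pd

left = pd.DataFrame({'key': ['b', 'a', 'c'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['c', 'b', 'a'], 'rval': [4, 5, 6]})

# sort=True orders the result lexicographically by the join keys;
# sort=False keeps whatever order the join algorithm produces.
sorted_result = pd.merge(left, right, on='key', sort=True)
print(list(sorted_result['key']))
```

With sort=True the result keys come back in lexicographic order ('a', 'b', 'c'); with sort=False that final ordering pass is skipped, which is where the timing difference below comes from.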


As you can see, the sorting time in pandas completely dominates the runtime of the join (sorting 10000 strings means a lot of string comparisons). This isn't really an indictment of R's merge algorithm, but rather speaks to the strength of the merge strategy I devised in pandas. After spending a few days working on this problem, I definitely think R could do a lot better. I haven't run benchmarks against SQLite or another SQL database yet; that will happen eventually.
I'm going to write a blog article in the next few days going into algorithmic detail about how I got merging pretty big data sets to be so fast—it was not easy!
Here is the R code:
N <- 10000
indices <- rep(NA, N)
for (i in 1:N)
  indices[i] <- paste(sample(letters, 10), collapse="")

left <- data.frame(key=rep(indices, 10),
                   key2=sample(rep(indices, 10)),
                   value=rnorm(100000))
right <- data.frame(key=indices,
                    key2=sample(indices),
                    value2=rnorm(10000))
timeit <- function(func, niter=10) {
  timing <- rep(NA, niter)
  for (i in 1:niter) {
    gc()  # collect garbage before timing so GC pauses don't pollute the numbers
    timing[i] <- system.time(func())[3]
  }
  mean(timing)
}
left.join <- function(sort=TRUE) {
  result <- merge(left, right, all.x=TRUE, sort=sort)
}
right.join <- function(sort=TRUE) {
  result <- merge(left, right, all.y=TRUE, sort=sort)
}
outer.join <- function(sort=TRUE) {
  result <- merge(left, right, all=TRUE, sort=sort)
}
inner.join <- function(sort=TRUE) {
  result <- merge(left, right, sort=sort)
}
sort.options <- c(FALSE, TRUE)
results <- matrix(nrow=4, ncol=2)
colnames(results) <- c("dont_sort", "sort")
rownames(results) <- c("inner", "outer", "left", "right")
join.functions <- c(inner.join, outer.join, left.join, right.join)
for (i in 1:4) {
  results[i, 1] <- timeit(function() join.functions[[i]](sort=sort.options[1]))
  results[i, 2] <- timeit(function() join.functions[[i]](sort=sort.options[2]))
}
And the Python code:
from collections import defaultdict
import gc
import random
import time

import numpy as np
from pandas import DataFrame, merge, read_table
from pandas.util.testing import rands

N = 10000
indices = np.array([rands(10) for _ in xrange(N)], dtype='O')
key = np.tile(indices, 10)
key2 = key.copy()
random.shuffle(key2)
indices2 = indices.copy()
random.shuffle(indices2)

left = DataFrame({'key': key, 'key2': key2,
                  'value': np.random.randn(100000)})
right = DataFrame({'key': indices, 'key2': indices2,
                   'value2': np.random.randn(10000)})
join_methods = ['inner', 'outer', 'left', 'right']
results = DataFrame(index=join_methods, columns=[False, True])
niter = 10
for sort in [False, True]:
    for join_method in join_methods:
        f = lambda: merge(left, right, how=join_method, sort=sort)
        gc.disable()  # keep GC pauses out of the timings
        start = time.time()
        for _ in xrange(niter):
            f()
        elapsed = (time.time() - start) / niter
        gc.enable()
        results[sort][join_method] = elapsed
results.columns = ['dont_sort', 'sort']
# R results
from StringIO import StringIO
r_results = read_table(StringIO("""dont_sort sort
inner 0.2297 0.2286
outer 1.1811 1.2843
left 0.6706 0.7766
right 0.2995 0.3371
"""), sep=r'\s+')

sort_results = DataFrame.from_items([('pandas', results['sort']),
                                     ('R', r_results['sort'])])
sort_results['Ratio'] = sort_results['R'] / sort_results['pandas']

nosort_results = DataFrame.from_items([('pandas', results['dont_sort']),
                                       ('R', r_results['dont_sort'])])
nosort_results['Ratio'] = nosort_results['R'] / nosort_results['pandas']