Over the last week I have completely retooled pandas's "database" join infrastructure and algorithms in order to support the full gamut of SQL-style many-to-many merges (pandas has had one-to-one and many-to-one joins for a long time). I was curious about the performance with reasonably large data sets as compared with base::merge.data.frame,
which I've used many times in R. So I ran a little benchmark: joining a 100,000-row DataFrame with a 10,000-row DataFrame on two keys, with about 10,000 unique key combinations overall, simulating a somewhat large SQL join.
Note this new functionality will be shipping in the upcoming 0.7.0 release (!).
There is one major factor affecting the performance of the algorithms in pandas: whether the result data needs to be sorted lexicographically by the join columns (which is the default behavior). R offers the same option, so the comparison is completely apples to apples. These are mean runtimes in seconds:


As you can see, the sorting time in pandas completely dominates the runtime of the join (sorting 10,000 unique strings means a lot of string comparisons). This isn't really an indictment of R's merge algorithm, but rather speaks to the strength of the merge strategy I devised in pandas. After spending a few days working on this problem, I definitely think R could do a lot better. I haven't run benchmarks against SQLite or another SQL database yet; that will happen eventually.
I'm going to write a blog article in the next few days going into algorithmic detail about how I got merging pretty big data sets to be so fast—it was not easy!
Here is the R code:

N <- 10000
indices <- rep(NA, N)
for (i in 1:N)
  indices[i] <- paste(sample(letters, 10), collapse="")

left <- data.frame(key=rep(indices, 10),
                   key2=sample(rep(indices, 10)),
                   value=rnorm(100000))
right <- data.frame(key=indices,
                    key2=sample(indices),
                    value2=rnorm(10000))

timeit <- function(func, niter=10) {
  timing <- rep(NA, niter)
  for (i in 1:niter) {
    gc()
    timing[i] <- system.time(func())[3]
  }
  mean(timing)
}

left.join <- function(sort=TRUE) {
  result <- merge(left, right, all.x=TRUE, sort=sort)
}
right.join <- function(sort=TRUE) {
  result <- merge(left, right, all.y=TRUE, sort=sort)
}
outer.join <- function(sort=TRUE) {
  result <- merge(left, right, all=TRUE, sort=sort)
}
inner.join <- function(sort=TRUE) {
  result <- merge(left, right, sort=sort)
}

sort.options <- c(FALSE, TRUE)

results <- matrix(nrow=4, ncol=2)
colnames(results) <- c("dont_sort", "sort")
rownames(results) <- c("inner", "outer", "left", "right")

join.functions <- c(inner.join, outer.join, left.join, right.join)
for (i in 1:4) {
  results[i, 1] <- timeit(function() join.functions[[i]](sort=sort.options[1]))
  results[i, 2] <- timeit(function() join.functions[[i]](sort=sort.options[2]))
}
And the Python code:

import gc
import random
import time

import numpy as np
from pandas import DataFrame, merge, read_table
from pandas.util.testing import rands

N = 10000
indices = np.array([rands(10) for _ in xrange(N)], dtype='O')

key = np.tile(indices, 10)
key2 = key.copy()
random.shuffle(key2)

indices2 = indices.copy()
random.shuffle(indices2)

left = DataFrame({'key': key, 'key2': key2,
                  'value': np.random.randn(100000)})
right = DataFrame({'key': indices, 'key2': indices2,
                   'value2': np.random.randn(10000)})

join_methods = ['inner', 'outer', 'left', 'right']
results = DataFrame(index=join_methods, columns=[False, True])
niter = 10
for sort in [False, True]:
    for join_method in join_methods:
        f = lambda: merge(left, right, how=join_method, sort=sort)
        gc.disable()
        start = time.time()
        for _ in xrange(niter):
            f()
        elapsed = (time.time() - start) / niter
        gc.enable()
        results[sort][join_method] = elapsed
results.columns = ['dont_sort', 'sort']

# R results
from StringIO import StringIO
r_results = read_table(StringIO("""dont_sort sort
inner 0.2297 0.2286
outer 1.1811 1.2843
left 0.6706 0.7766
right 0.2995 0.3371
"""), sep='\s+')

sort_results = DataFrame.from_items([('pandas', results['sort']),
                                     ('R', r_results['sort'])])
sort_results['Ratio'] = sort_results['R'] / sort_results['pandas']

nosort_results = DataFrame.from_items([('pandas', results['dont_sort']),
                                       ('R', r_results['dont_sort'])])
nosort_results['Ratio'] = nosort_results['R'] / nosort_results['pandas']