Over the last week I have completely retooled pandas's "database" join infrastructure and algorithms to support the full gamut of SQL-style many-to-many merges (pandas has had one-to-one and many-to-one joins for a long time). I was curious how the performance compares with base::merge.data.frame, which I've used many times in R, on reasonably large data sets. So I ran a little benchmark: joining a 100,000-row DataFrame with a 10,000-row DataFrame on two keys, with about 10,000 unique key combinations overall, simulating a somewhat large SQL join.

Note this new functionality will be shipping in the upcoming 0.7.0 release (!).
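To make the new behavior concrete, here is a minimal sketch (my own, not from the benchmark below) of what a many-to-many merge does: when a key appears m times in the left frame and n times in the right, the result contains m * n rows for that key, matching SQL join semantics.

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'a', 'b'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'a', 'b'], 'rval': [4, 5, 6]})

# 'a' occurs twice on each side, so the inner join emits 2 * 2 = 4 'a' rows,
# plus one row for 'b': 5 rows total.
result = pd.merge(left, right, on='key')
print(result)
```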

There is one major factor affecting the performance of the join algorithms in pandas: whether the result needs to be sorted lexicographically by the join columns (which is the default behavior). R offers the same option, so the comparison is completely apples to apples. These are mean runtimes in seconds:
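The flag in question is merge's sort argument. A small illustration of the difference (my own example, not part of the benchmark):

```python
import pandas as pd

left = pd.DataFrame({'key': ['c', 'a', 'b'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'a'], 'rval': [4, 5, 6]})

# sort=True orders the result lexicographically by the join key;
# sort=False skips that work, which is where the time savings come from.
sorted_keys = pd.merge(left, right, on='key', sort=True)['key'].tolist()
unsorted_keys = pd.merge(left, right, on='key', sort=False)['key'].tolist()
print(sorted_keys)
```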

Sort by key columns

         pandas       R   Ratio
inner   0.07819  0.2286   2.924
outer   0.09013   1.284   14.25
left     0.0853  0.7766   9.104
right   0.08112  0.3371   4.156

Don't sort by key columns

         pandas       R   Ratio
inner   0.02808  0.2297    8.18
outer   0.03879   1.181   30.45
left     0.0365  0.6706   18.37
right   0.03021  0.2995   9.912

As you can see, the sorting time in pandas completely dominates the runtime of the join (sorting 10,000 unique strings means a lot of string comparisons). This benchmark isn't really an indictment of R's merge algorithm, but rather a testament to the strength of the merge strategy I devised in pandas. That said, after spending a few days working on this problem, I definitely think R could do a lot better. I haven't run benchmarks against SQLite or another SQL database yet; that will happen eventually.

I'm going to write a blog article in the next few days going into algorithmic detail about how I got merging pretty big data sets to be so fast—it was not easy!
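I won't preempt that post here, but one standard ingredient in fast multi-key joins, which may or may not be exactly how pandas does it internally, is to factorize the string keys into dense integer codes up front, so that all subsequent work is cheap integer comparisons rather than string comparisons. A sketch using pandas' factorize:

```python
import numpy as np
import pandas as pd

left_keys = np.array(['a', 'b', 'a', 'c'], dtype=object)
right_keys = np.array(['b', 'c', 'c'], dtype=object)

# Factorize both sides against a shared vocabulary so that equal strings
# receive equal integer codes (codes are assigned in order of appearance).
codes, uniques = pd.factorize(np.concatenate([left_keys, right_keys]))
left_codes = codes[:len(left_keys)]
right_codes = codes[len(left_keys):]
print(left_codes, right_codes)
```

From here the join itself can operate entirely on small integers, which is also what makes the no-sort case so cheap.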

Here is the R code:

N <- 10000
indices <- rep(NA, N)
# Generate N random 10-character string keys
for (i in 1:N)
  indices[i] <- paste(sample(letters, 10), collapse="")

left <- data.frame(key=rep(indices, 10),
                   key2=sample(rep(indices, 10)),
                   value=rnorm(100000))
right <- data.frame(key=indices,
                    key2=sample(indices),
                    value2=rnorm(10000))
timeit <- function(func, niter=10) {
  timing <- rep(NA, niter)
  for (i in 1:niter) {
    gc()  # collect garbage before each run to reduce timing noise
    timing[i] <- system.time(func())[3]
  }
  mean(timing)
}

left.join <- function(sort=TRUE) {
  result <- merge(left, right, all.x=TRUE, sort=sort)
}
right.join <- function(sort=TRUE) {
  result <- merge(left, right, all.y=TRUE, sort=sort)
}
outer.join <- function(sort=TRUE) {
  result <- merge(left, right, all=TRUE, sort=sort)
}
inner.join <- function(sort=TRUE) {
  result <- merge(left, right, sort=sort)
}

sort.options <- c(FALSE, TRUE)

results <- matrix(nrow=4, ncol=2)
colnames(results) <- c("dont_sort", "sort")
rownames(results) <- c("inner", "outer", "left", "right")

join.functions <- c(inner.join, outer.join, left.join, right.join)
for (i in 1:4) {
  for (j in 1:2) {
    results[i, j] <- timeit(function() join.functions[[i]](sort=sort.options[j]))
  }
}

And the Python code:

import gc
import random
import time

import numpy as np
from pandas import DataFrame, merge, read_table
from pandas.util.testing import rands

N = 10000
indices = np.array([rands(10) for _ in xrange(N)], dtype='O')

key = np.tile(indices, 10)
key2 = key.copy()
random.shuffle(key2)
indices2 = indices.copy()
random.shuffle(indices2)
left = DataFrame({'key' : key, 'key2':key2,
                  'value' : np.random.randn(100000)})
right = DataFrame({'key': indices, 'key2':indices2,
                   'value2' : np.random.randn(10000)})
join_methods = ['inner', 'outer', 'left', 'right']
results = DataFrame(index=join_methods, columns=[False, True])
niter = 10
for sort in [False, True]:
    for join_method in join_methods:
        f = lambda: merge(left, right, how=join_method, sort=sort)
        gc.disable()
        start = time.time()
        for _ in xrange(niter):
            f()
        elapsed = (time.time() - start) / niter
        gc.enable()
        results[sort][join_method] = elapsed
results.columns = ['dont_sort', 'sort']

# R results
from StringIO import StringIO
r_results = read_table(StringIO("""dont_sort   sort
inner    0.2297 0.2286
outer    1.1811 1.2843
left     0.6706 0.7766
right    0.2995 0.3371
"""), sep='\s+')

sort_results = DataFrame.from_items([('pandas', results['sort']),
                                     ('R', r_results['sort'])])
sort_results['Ratio'] = sort_results['R'] / sort_results['pandas']

nosort_results = DataFrame.from_items([('pandas', results['dont_sort']),
                                       ('R', r_results['dont_sort'])])
nosort_results['Ratio'] = nosort_results['R'] / nosort_results['pandas']