Over the last week I have completely retooled pandas's "database" join infrastructure and algorithms to support the full gamut of SQL-style many-to-many merges (pandas has had one-to-one and many-to-one joins for a long time). I was curious how the performance on reasonably large data sets compares with base::merge.data.frame, which I've used many times in R. So I ran a little benchmark: joining a 100,000-row DataFrame with a 10,000-row DataFrame on two keys, with about 10,000 unique key combinations overall, simulating a somewhat large SQL join.
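For concreteness, here is a minimal sketch of how a benchmark like this might be set up. The frame sizes and the roughly 10,000 unique key combinations come from the setup above; the column names, key-generation scheme, and random seed are illustrative assumptions, not the original benchmark code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# 100 distinct values per key column gives about 10,000 unique
# (key1, key2) combinations across the two frames.
key1_pool = [f"key1_{i:03d}" for i in range(100)]
key2_pool = [f"key2_{i:03d}" for i in range(100)]

# 100,000-row left frame and 10,000-row right frame, as in the
# benchmark described above.
left = pd.DataFrame({
    "key1": rng.choice(key1_pool, size=100_000),
    "key2": rng.choice(key2_pool, size=100_000),
    "lvalue": rng.standard_normal(100_000),
})
right = pd.DataFrame({
    "key1": rng.choice(key1_pool, size=10_000),
    "key2": rng.choice(key2_pool, size=10_000),
    "rvalue": rng.standard_normal(10_000),
})

# Many-to-many SQL-style join on both key columns.
merged = pd.merge(left, right, on=["key1", "key2"], how="inner")
```

Timing the pd.merge call (for example with IPython's %timeit) is all the benchmark itself needs to do.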
Note that this new functionality will ship in the upcoming 0.7.0 release (!).
There is one major factor affecting the performance of the join algorithms in pandas: whether the result needs to be sorted lexicographically by the join columns (the default behavior). R offers the same option, so the comparison is completely apples to apples. These are mean runtimes in seconds:
Sort by key columns: [results table]
Don't sort by key columns: [results table]
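In pandas this choice is exposed through merge's sort flag (base::merge has a matching sort argument in R). A minimal sketch, reusing the hypothetical left and right frames from the snippet above:

```python
# Result sorted lexicographically by the join keys, the default
# behavior described in this post.
sorted_result = pd.merge(left, right, on=["key1", "key2"], sort=True)

# Skip the sort: rows come back in whatever order the join
# algorithm emits them, which is where pandas saves most of its time.
unsorted_result = pd.merge(left, right, on=["key1", "key2"], sort=False)
```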
As you can see, the sorting time in pandas completely dominates the runtime of the join (sorting 10,000 strings means a lot of string comparisons). This isn't really an indictment of R's merge algorithm, but rather speaks to the strength of the merge strategy I devised in pandas. After spending a few days working on this problem, I definitely think R could do a lot better. I haven't run benchmarks against SQLite or another SQL database yet; that will happen eventually.
I'm going to write a blog article in the next few days going into algorithmic detail about how I got merging pretty big data sets to be so fast—it was not easy!