I posted a brief article with some preliminary benchmarks for the new merge/join infrastructure that I’ve built in pandas. I compared the performance with base::merge in R which, as various folks in the R community have pointed out, is fairly slow. There are a few other more intelligently-implemented functions available in CRAN, in particular plyr::join in plyr and the merge implemented for data.table objects in the data.table package.
So anyway, here are the two benchmarks I’m interested in to get a sense of the large-ish data runtime of these algorithms:
Note: that plyr::join does not implement (or least I’ve been told to avoid) many-to-many joins so I only run the many-to-one benchmarks for that.
I’ve normalized the results by the minimum runtime (which is pandas in all cases):
UPDATE (1/6/2012): Jared Lander pointed out that data.table is capable of much faster left and right joins by using the syntax left[right] instead of merge(left, right, all.y=True). I updated the benchmarks below and added the right join results which shows data.table performing very admirably.
A bit of care needs to be taken with SQLite3 benchmarks because the time to fetch the table from the database cursor (even though this is an in-memory SQLite database) is very significant. The performance as you can imagine is also quite different with and without indexes.
Here is the basic code to insert the data into SQLite:
create_sql_indexes = False
conn = sqlite3.connect(':memory:')
conn.execute('create table left( key varchar(10), key2 varchar(10), value int);')
conn.execute('create table right( key varchar(10), key2 varchar(10), value2 int);')
conn.executemany('insert into left values (?, ?, ?)',
zip(key, key2, left['value']))
conn.executemany('insert into right values (?, ?, ?)',
zip(right['key'], right['key2'], right['value2']))
# Create Indices
conn.execute('create index left_ix on left(key, key2)')
conn.execute('create index right_ix on right(key, key2)')
With no indexes, here is a comparison of just the SQL execution time vs. the total pandas execution time for the many-to-one case described above: