I posted a brief article with some preliminary benchmarks for the new merge/join infrastructure that I've built in pandas. I compared the performance with
base::merge in R which, as various folks in the R community have pointed out, is fairly slow. There are a few other more intelligently-implemented functions available in CRAN, in particular
plyr::join in plyr and the merge implemented for
data.table objects in the data.table package.
So anyway, here are the two benchmarks I'm interested in to get a sense of the large-ish data runtime of these algorithms:
Many-to-one joins. In these benchmarks I have a 80,000 row table with 10 copies of 8,000 key pairs and an 8,000 row table with a single copy of another 8,000 key pairs, only 6,000 of which are found in the larger table. The result of a left join between these tables should have 80,000 rows, an inner join 60,000, and an outer join 82,000.
Many-to-many joins. To keep things simple I use the same tables as above except the right able is the table above stacked on itself. You can do the combinatorics here, but the outer join between these two tables has 144,000 rows. Note: that
plyr::joindoes not implement (or least I've been told to avoid) many-to-many joins so I only run the many-to-one benchmarks for that.
I've normalized the results by the minimum runtime (which is pandas in all cases):
UPDATE (1/6/2012): Jared Lander pointed out that
data.table is capable of much faster left and right joins by using the syntax
left[right] instead of
merge(left, right, all.y=True). I updated the benchmarks below and added the right join results which shows
data.table performing very admirably.
A bit of care needs to be taken with SQLite3 benchmarks because the time to fetch the table from the database cursor (even though this is an in-memory SQLite database) is very significant. The performance as you can imagine is also quite different with and without indexes.
Here is the basic code to insert the data into SQLite:
import sqlite3 create_sql_indexes = False conn = sqlite3.connect(':memory:') conn.execute('create table left( key varchar(10), key2 varchar(10), value int);') conn.execute('create table right( key varchar(10), key2 varchar(10), value2 int);') conn.executemany('insert into left values (?, ?, ?)', zip(key, key2, left['value'])) conn.executemany('insert into right values (?, ?, ?)', zip(right['key'], right['key2'], right['value2'])) # Create Indices if create_sql_indexes: conn.execute('create index left_ix on left(key, key2)') conn.execute('create index right_ix on right(key, key2)')
With no indexes, here is a comparison of just the SQL execution time vs. the total pandas execution time for the many-to-one case described above: