I posted a brief article with some preliminary benchmarks for the new merge/join infrastructure that I've built in pandas. I compared the performance with base::merge in R, which, as various folks in the R community have pointed out, is fairly slow. There are a few other more intelligently implemented functions available on CRAN, in particular plyr::join and the merge method implemented for data.table objects in the data.table package.

Lastly, Sean Taylor contributed a benchmark for SQLite3, by all accounts the most widely deployed SQL engine.

So anyway, here are the two benchmarks I'm interested in for getting a sense of the large-ish data runtime of these algorithms:

  • Many-to-one joins. In these benchmarks I have an 80,000 row table with 10 copies of 8,000 key pairs and an 8,000 row table with a single copy of another 8,000 key pairs, only 6,000 of which are found in the larger table. The result of a left join between these tables should have 80,000 rows, an inner join 60,000, and an outer join 82,000.

  • Many-to-many joins. To keep things simple I use the same tables as above, except that the right table is the table above stacked on itself. You can do the combinatorics here, but the outer join between these two tables has 144,000 rows. Note that plyr::join does not implement (or at least I've been told to avoid) many-to-many joins, so I only run the many-to-one benchmarks for that. (See the sketch after this list for how such tables can be constructed.)
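For concreteness, here's a sketch of how tables with these properties might be built in pandas (the integer keys are my own illustrative choice, not the original benchmark's key generation, which used strings):

import numpy as np
import pandas as pd

n = 8000

# Left table: 10 copies of 8,000 (key, key2) pairs -> 80,000 rows
left = pd.DataFrame({'key': np.tile(np.arange(n), 10),
                     'key2': np.tile(np.arange(n), 10),
                     'value': np.random.randn(10 * n)})

# Right table: one copy of 8,000 pairs; keys run 2000-9999, so only
# 6,000 of them appear in the left table
right = pd.DataFrame({'key': np.arange(2000, 2000 + n),
                      'key2': np.arange(2000, 2000 + n),
                      'value2': np.random.randn(n)})

# Many-to-one joins
pd.merge(left, right, on=['key', 'key2'], how='inner')  # 60,000 rows
pd.merge(left, right, on=['key', 'key2'], how='left')   # 80,000 rows
pd.merge(left, right, on=['key', 'key2'], how='outer')  # 82,000 rows

# Many-to-many: stack the right table on itself, then
# 6,000*10*2 + 2,000*10 + 2,000*2 = 144,000 rows in the outer join
right2 = pd.concat([right, right], ignore_index=True)
pd.merge(left, right2, on=['key', 'key2'], how='outer')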

I've normalized the results by the minimum runtime (which is pandas in all cases):

UPDATE (1/6/2012): Jared Lander pointed out that data.table is capable of much faster left and right joins by using the syntax left[right] instead of merge(left, right, all.y=TRUE). I updated the benchmarks below and added the right join results, which show data.table performing very admirably.

pandas vs R merge benchmarks

Many-to-one
         pandas   data.table   plyr    base::merge
inner    1        5.905        6.35    13.29
outer    1        10.05        9.209   20.25
left     1        2.849        5.918   14.93
right    1        2.05         2.923   16.91

Many-to-many
         pandas   data.table   plyr    base::merge
inner    1        5.194        5.223   18.87
outer    1        10           6.903   33.75
left     1        2.528        4.688   24.46
right    1        1.681        2.05    25.24

SQLite3 Benchmarks

A bit of care needs to be taken with the SQLite3 benchmarks because the time to fetch the table from the database cursor (even though this is an in-memory SQLite database) is very significant. The performance, as you can imagine, is also quite different with and without indexes.

Here is the basic code to insert the data into SQLite:

import sqlite3

create_sql_indexes = False

# In-memory database, so disk I/O doesn't pollute the timings
conn = sqlite3.connect(':memory:')
conn.execute('create table left (key varchar(10), key2 varchar(10), value int);')
conn.execute('create table right (key varchar(10), key2 varchar(10), value2 int);')

# Bulk-insert both tables from the pandas DataFrames
conn.executemany('insert into left values (?, ?, ?)',
                 zip(left['key'], left['key2'], left['value']))
conn.executemany('insert into right values (?, ?, ?)',
                 zip(right['key'], right['key2'], right['value2']))

# Optionally create indexes on the join keys
if create_sql_indexes:
    conn.execute('create index left_ix on left(key, key2)')
    conn.execute('create index right_ix on right(key, key2)')

With no indexes, here is a comparison of just the SQL execution time vs. the total pandas execution time for the many-to-one case described above:

         sqlite3   pandas
inner    0.02328   0.01799
outer    0.02324   0.01943
left     0.02324   0.01923

So pandas very slightly edges out SQLite with no indexes. Note that it takes ~300ms (on my machine) to fetch the data out of the database into a list of tuples, so if you consider that, pandas is really crushing it by doing everything in memory.

The results after adding indexes are pretty indistinguishable as far as I can tell, because the fetch time (which involves creating and populating a large number of Python tuples) appears to dwarf the join time. The execute + fetch time varies between 310-340 ms for all three join types, with and without indexes, in the many-to-one case. The many-to-many case varies between 420-490 ms, whereas pandas takes 22-25 ms!
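For reference, here is the sort of timing harness I mean (a sketch of my own, not the exact benchmark code); the cursor only materializes result rows, as Python tuples, when they are fetched, so fetchall() has to sit inside the timed region:

import time

def time_sqlite_join(conn, join_type='INNER'):
    # Time the SQL execution plus the fetch, since the fetch is where
    # the Python tuples actually get built
    start = time.time()
    cursor = conn.execute('select * from left %s join right '
                          'on left.key = right.key '
                          'and left.key2 = right.key2' % join_type)
    rows = cursor.fetchall()
    return time.time() - start, len(rows)

One caveat: SQLite does not support FULL OUTER JOIN directly (at least not in versions of this era), so the outer join case has to be emulated, for example with a UNION of two left joins.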

UPDATE: After some thought and discussions with people, I've concluded that these benchmarks are not fair to SQLite. A more appropriate benchmark would be to create the joined table inside the database using a statement like

CREATE TABLE test AS
SELECT *
  FROM left
  INNER JOIN right
    ON left.key = right.key
   AND left.key2 = right.key2;
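
Timing this from Python then measures only the SQL execution, since no result rows cross into Python (again a sketch of my own, not the exact benchmark code):

import time

def time_sql_create_join(conn, join_type='INNER'):
    # Materialize the join inside the database; nothing is fetched,
    # so only the join itself is being timed
    conn.execute('drop table if exists test')
    start = time.time()
    conn.execute('create table test as select * from left %s join right '
                 'on left.key = right.key '
                 'and left.key2 = right.key2' % join_type)
    return time.time() - start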

Here are new benchmarks using this SQL statement:

Many-to-one
         pandas    sqlite3
inner    0.01674   0.04187
outer    0.01882   0.04534
left     0.01851   0.04514

Many-to-many
         pandas    sqlite3
inner    0.02091   0.06733
outer    0.02266   0.06968
left     0.02203   0.06882

So pandas still significantly outperforms SQLite3 (even with SQL indexes, as in these benchmarks). But it's not totally apples-to-apples, as SQLite3 is able to perform joins on extremely large data sets stored on disk.

Conclusions

As far as I can tell, pandas now has one of the fastest in-memory database join operators out there. If you're using Python to do relational algebra, you'd be crazy to pick SQLite3 over pandas due to the high cost of reading and writing large data sets (in the form of Python tuples) to SQL format.

Code links