High performance database joins with pandas DataFrame, more benchmarks

I posted a brief article with some preliminary benchmarks for the new merge/join infrastructure that I’ve built in pandas. I compared the performance with base::merge in R, which, as various folks in the R community have pointed out, is fairly slow. There are a few more intelligently implemented functions available on CRAN, in particular plyr::join and the merge method provided for data.table objects in the data.table package.

Lastly, Sean Taylor contributed a benchmark for SQLite3, by many accounts the most widely deployed SQL engine.

So anyway, here are the two benchmarks I’m interested in to get a sense of the large-ish data runtime of these algorithms:

  • Many-to-one joins. In these benchmarks I have an 80,000-row table containing 10 copies of 8,000 key pairs, and an 8,000-row table containing a single copy of another 8,000 key pairs, only 6,000 of which are found in the larger table. The result of a left join between these tables should have 80,000 rows, an inner join 60,000, and an outer join 82,000. (See the sketch after this list for how such tables can be constructed.)
  • Many-to-many joins. To keep things simple I use the same tables as above, except the right table is the table above stacked on itself. You can do the combinatorics here, but the outer join between these two tables has 144,000 rows.
  • Note: plyr::join does not implement (or at least I’ve been told to avoid) many-to-many joins, so I only run the many-to-one benchmarks for it.
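For reference, here is a sketch of how tables with these characteristics can be built and joined in pandas. This is a hypothetical reconstruction, not the actual benchmark script (see the code links at the bottom for that); it simplifies things by using the same strings for both key columns, which preserves the row counts described above.

import numpy as np
import pandas as pd

N = 8000

def make_keys(prefix, n):
    # string keys like 'a0000000', 'a0000001', ...
    return np.array(['%s%07d' % (prefix, i) for i in range(n)], dtype=object)

# left: 10 copies of 8,000 (key, key2) pairs -> 80,000 rows
keys = make_keys('a', N)
left = pd.DataFrame({'key': np.tile(keys, 10),
                     'key2': np.tile(keys, 10),
                     'value': np.random.randn(10 * N)})

# right: one copy of 8,000 pairs, only 6,000 of which occur in left
right_keys = np.concatenate([keys[:6000], make_keys('b', 2000)])
right = pd.DataFrame({'key': right_keys,
                      'key2': right_keys,
                      'value2': np.random.randn(N)})

# many-to-one joins: 80,000 / 60,000 / 82,000 rows respectively
left_join = left.merge(right, on=['key', 'key2'], how='left')
inner_join = left.merge(right, on=['key', 'key2'], how='inner')
outer_join = left.merge(right, on=['key', 'key2'], how='outer')

# many-to-many: the right table stacked on itself
right_mm = pd.concat([right, right], ignore_index=True)
outer_mm = left.merge(right_mm, on=['key', 'key2'], how='outer')  # 144,000 rows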

I’ve normalized the results by the minimum runtime (which is pandas in all cases):

UPDATE (1/6/2012): Jared Lander pointed out that data.table is capable of much faster left and right joins by using the syntax left[right] instead of merge(left, right, all.y=TRUE). I updated the benchmarks below and added the right join results, which show data.table performing very admirably.

pandas vs R merge benchmarks

Many-to-one

        pandas   data.table   plyr    base::merge
inner   1        5.905        6.35    13.29
outer   1        10.05        9.209   20.25
left    1        2.849        5.918   14.93
right   1        2.05         2.923   16.91

Many-to-many

        pandas   data.table   plyr    base::merge
inner   1        5.194        5.223   18.87
outer   1        10           6.903   33.75
left    1        2.528        4.688   24.46
right   1        1.681        2.05    25.24

SQLite3 Benchmarks

A bit of care needs to be taken with the SQLite3 benchmarks because the time to fetch the table from the database cursor (even though this is an in-memory SQLite database) is very significant. As you can imagine, the performance is also quite different with and without indexes.

Here is the basic code to insert the data into SQLite:

import sqlite3

create_sql_indexes = False

conn = sqlite3.connect(':memory:')
conn.execute('create table left (key varchar(10), key2 varchar(10), value int);')
conn.execute('create table right (key varchar(10), key2 varchar(10), value2 int);')

# left and right are the pandas DataFrames described above
conn.executemany('insert into left values (?, ?, ?)',
                 zip(left['key'], left['key2'], left['value']))
conn.executemany('insert into right values (?, ?, ?)',
                 zip(right['key'], right['key2'], right['value2']))

# optionally create indexes on the join keys
if create_sql_indexes:
    conn.execute('create index left_ix on left (key, key2)')
    conn.execute('create index right_ix on right (key, key2)')

With no indexes, here is a comparison of just the SQL execution time vs. the total pandas execution time for the many-to-one case described above (times in seconds):

        sqlite3   pandas
inner   0.02328   0.01799
outer   0.02324   0.01943
left    0.02324   0.01923

So pandas very slightly edges out SQLite3 when there are no indexes. Note that it takes ~300 ms (on my machine) to fetch the data out of the database into a list of tuples, so if you consider that, pandas is really crushing it by doing everything in memory.
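To make the fetch overhead concrete, the round trip can be timed along these lines (a sketch assuming the conn and tables from the insert code above; the actual benchmark script is linked at the bottom):

import time

query = """
select * from left
  inner join right
    on left.key = right.key
   and left.key2 = right.key2"""

start = time.time()
cursor = conn.execute(query)
rows = cursor.fetchall()  # materializes one Python tuple per result row
elapsed = time.time() - start
print('execute + fetch took %.3f seconds (%d rows)' % (elapsed, len(rows)))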

The results after adding indexes are pretty much indistinguishable as far as I can tell, because the fetch time (which involves creating and populating a large number of Python tuples) appears to dwarf the join time. The execute + fetch time varies between 310 and 340 ms for all three join types, with and without indexes, for the many-to-one case. The many-to-many case varies between 420 and 490 ms, whereas pandas takes 22-25 ms!

UPDATE: After some thought and discussions with people, these benchmarks are not fair to SQLite. A more appropriate benchmark would be to create the joined table inside the database using a statement like:

CREATE TABLE test AS
SELECT *
  FROM left
 INNER JOIN right
    ON left.key = right.key
   AND left.key2 = right.key2;
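In Python, running and timing that statement against the in-memory database might look like the following sketch (it assumes the conn and tables from the insert code above; the test table is dropped first so the run can be repeated):

import time

conn.execute('drop table if exists test')

start = time.time()
conn.execute("""
create table test as
select *
  from left
 inner join right
    on left.key = right.key
   and left.key2 = right.key2""")
elapsed = time.time() - start
print('inner join via CREATE TABLE AS: %.4f seconds' % elapsed)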

Here are new benchmarks using this SQL statement (times in seconds):

Many-to-one

        pandas    sqlite3
inner   0.01674   0.04187
outer   0.01882   0.04534
left    0.01851   0.04514

Many-to-many

        pandas    sqlite3
inner   0.02091   0.06733
outer   0.02266   0.06968
left    0.02203   0.06882

So pandas still significantly outperforms SQLite3 (even with SQL indexes, as in these benchmarks). But it’s not totally apples-to-apples, as SQLite3 is able to perform joins on extremely large data sets on disk.

Conclusions

As far as I can tell, pandas now has one of the fastest in-memory database join operators out there. If you’re using Python to do relational algebra, you’d be crazy to pick SQLite3 over pandas due to the high cost of reading and writing large data sets (in the form of Python tuples) to SQL format.

Code links

  • R code
  • pandas code
  • SQLite3 (Python) code