pandas talk at the PyHPC 2011 workshop at SC11, thoughts on hash tables

Here are the slides from my talk at PyHPC2011. It’s not really my usual talk for data crowds: it’s a little more nuts and bolts, covering some of the indexing and GroupBy implementation details. Some people might be interested in the illustrative data alignment benchmarks, which show the relative weakness of Python’s dict implementation (in both speed and memory usage) for lookups and alignment. After these benchmarks I think it’s pretty much inevitable that I’m going to end up writing a custom hash table implementation in C for data alignment on primitive types. Now, if I wanted a thread-safe hash table that I could use OpenMP on, that would be a serious undertaking. Anyone want to help?

The basic problem is that Python dicts are not designed for my use case, namely very large dicts that I use to perform data alignment operations.
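To make the use case concrete, here is a minimal pure-Python sketch of dict-based alignment. It is an illustration only, not the code pandas actually runs: the function name `align_on_union`, its arguments, and the use of `None` as the missing-value marker are all made up for this example. Each set of labels gets a dict mapping label to position, and those dicts are then probed once per label in the union, which is where lookup speed and memory per entry start to matter when the inputs are very large.

```python
def align_on_union(left_keys, left_vals, right_keys, right_vals):
    """Align two labeled value lists on the sorted union of their labels.

    Hypothetical helper for illustration; labels are assumed unique
    within each input. Missing entries are filled with None.
    """
    # One dict per side: label -> position in the original list.
    left_pos = {key: i for i, key in enumerate(left_keys)}
    right_pos = {key: i for i, key in enumerate(right_keys)}

    # The combined index is the sorted union of both label sets.
    union = sorted(set(left_pos) | set(right_pos))

    # One dict lookup per label per side.
    left_out = [left_vals[left_pos[k]] if k in left_pos else None
                for k in union]
    right_out = [right_vals[right_pos[k]] if k in right_pos else None
                 for k in union]
    return union, left_out, right_out


index, left, right = align_on_union(
    [1, 2, 3], [10.0, 20.0, 30.0],
    [2, 3, 4], [200.0, 300.0, 400.0],
)
```

With a few million labels per side, the two position dicts and the per-label lookups dominate the cost, which is the situation the benchmarks in the slides are about.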

PyHPC2011

  • Vincent

    Looks cool. Can you do a split-apply-combine based on the values of any column, or do you need to first create an index from the variables you want to group by?

    Wes McKinney Reply:

    Absolutely, see for example: http://pandas.sourceforge.net/groupby.html#splitting-an-object-into-groups

  • Anonymous

    Instead of using Cython you could try PyPy 1.7. Also, they would probably be interested in a better dict if theirs is not good enough :)

    Wes McKinney Reply:

    I should benchmark the PyPy dict to see if it’s any faster. Running pandas on PyPy is probably a long way off, though, as I consume the NumPy C API. I should write pure Python implementations of some of these algorithms so I can monitor PyPy’s speed relative to the Cython versions.

    Anonymous Reply:

    They are doing some really interesting stuff, like having many implementations of dict and switching between them based on key types and whatnot. It would be interesting to have a preallocation hint for dict, or even some high-performance collections.
