pandas talk at PyHPC 2011 workshop in SC11, thoughts on hash tables

Here are the slides from my talk at PyHPC2011. Not really my usual talk for data crowds– a little more nuts and bolts about some of the indexing and GroupBy implementation details. Some people might be interested in the illustrative data alignment benchmarks which show the relative weakness of Python’s dict implementation (in both speed and memory usage) for lookups and alignment. After these benchmarks I think it’s pretty much inevitable that I’m going to end up writing a custom hash table implementation in C for the data alignment on primitive types. Now, if I wanted a threadsafe hash table that I could use OpenMP on, that would be a serious undertaking. Anyone want to help?

The basic problem is that Python dicts are not designed for my use case– namely very large dicts that I use to perform data alignment operations.