## Mastering high performance data algorithms I: Group By

I’m on my way back from R/Finance 2012. Those guys did a nice job of organizing the conference and was great to meet everyone there.

As part of pandas development, I have had to develop a suite of high performance data algorithms and implementation strategies which are the heart and soul of the library. I get asked a lot why pandas’s performance is much better than R and other data manipulation tools. The reasons are two-fold: careful implementation (in Cython and and C, so minimizing the computational friction) and carefully-thought-out algorithms.

Here are some of the more important tools and ideas that I make use of on a day-to-day basis:

• Hash tables I use klib. It’s awesome and easy to use. I have this for critical algorithms like unique, factorize (unique + integer label assignment)
• O(n) sort, known in pandas as groupsort, on integers with known range. This is a variant of counting sort; if you have N integers with known range from 0 to K – 1, this can be sorted in O(N) time. Combining this tool with factorize (hash table-based), you can categorize and sort a large data set in linear time. Failure to understand these two algorithms will force you to pay O(N log N), dominating the runtime of your algorithm.
• Vectorized data movement and subsetting routines: take, put, putmask, replace, etc.
• Let me give you a prime example from a commit yesterday of me applying these ideas to great effect. Suppose I had a time series (or DataFrame containing time series) that I want to group by year, month, and day:

 123456789101112131415161718192021222324252627282930313233343536 In [6]: rng = date_range('1/1/2000', periods=20, freq='4h') In [7]: ts = Series(np.random.randn(len(rng)), index=rng) In [8]: ts Out[8]: 2000-01-01 00:00:00 -0.891761 2000-01-01 04:00:00 0.204853 2000-01-01 08:00:00 0.690581 2000-01-01 12:00:00 0.454010 2000-01-01 16:00:00 -0.123102 2000-01-01 20:00:00 0.300111 2000-01-02 00:00:00 -1.052215 2000-01-02 04:00:00 0.094484 2000-01-02 08:00:00 0.318417 2000-01-02 12:00:00 0.779984 2000-01-02 16:00:00 -1.514042 2000-01-02 20:00:00 2.550011 2000-01-03 00:00:00 0.983423 2000-01-03 04:00:00 -0.710861 2000-01-03 08:00:00 -1.350554 2000-01-03 12:00:00 -0.464388 2000-01-03 16:00:00 0.817372 2000-01-03 20:00:00 1.057514 2000-01-04 00:00:00 0.743033 2000-01-04 04:00:00 0.925849 Freq: 4H In [9]: by = lambda x: lambda y: getattr(y, x) In [10]: ts.groupby([by('year'), by('month'), by('day')]).mean() Out[10]: 2000 1 1 0.105782 2 0.196106 3 0.055418 4 0.834441

This is a small dataset, but imagine you have millions of observations and thousands or even millions of groups. How does that look algorithmically? I guarantee if you take a naive approach, you will crash and burn when the data increases in size. I know, because I did just that (take a look at the vbenchmarks). Laying down the infrastructure for doing a better job is not simple. Here are the steps for efficiently aggregating data like this:

• Factorize the group keys, computing integer label arrays (O(N) per key)
• Compute a “cartesian product number” or group index for each observation (since you could theoretically have K_1 * … * K_p groups observed, where K_i is the number of unique values in key i). This is again O(n) work.
• If the maximum number of groups is large, “compress” the group index by using the factorize algorithm on it again. imagine you have 1000 uniques per key and 3 keys; most likely you do not actually observe 1 billion key combinations but rather some much smaller number. O(n) work again
• For simple aggregations, like mean, go ahead and aggregate the data in one linear sweep without moving anything around.
• Otherwise, for arbitrary user-defined aggregations, we have two options for chopping up the data set: sorting it (O(N) using groupsort!) or create index arrays indicating which observations belong to each group. The latter would be preferable in many cases to sorting a large data set (reordering the data, even though linear time, can be very costly).
• I worked on speeding up the latter part of this last bullet point yesterday. The resulting code looked like this:

 1234567891011121314151617 def _get_indices_dict(label_list, keys): # Accepts factorized labels and unique key values shape = [len(x) for x in keys] group_index = get_group_index(label_list, shape) # Compute group index sorter, _ = lib.groupsort_indexer(com._ensure_int64(group_index), np.prod(shape)) # Reorder key labels and group index sorted_labels = [lab.take(sorter) for lab in label_list] group_index = group_index.take(sorter) # Compute dict of {group tuple -> location NumPy array for group} index = np.arange(len(group_index)).take(sorter) return lib.indices_fast(index, group_index, keys, sorted_labels)

The details of lib.indices_fast aren’t that interesting; it chops up np.arange(len(group_index)).take(sorter), the sorted indirect indices, to produce the index dictionary. Running %lprun to get a line profiling on a large-ish data set:

 123456789101112131415161718192021222324252627 In [11]: rng = date_range('1/1/2000', '12/31/2005', freq='H') In [12]: year, month, day = rng.year, rng.month, rng.day In [13]: ts = Series(np.random.randn(len(rng)), index=rng) In [14]: lprun -f gp._get_indices_dict for i in xrange(100): ts.groupby([year, month, day]).indices Timer unit: 1e-06 s File: pandas/core/groupby.py Function: _get_indices_dict at line 1975 Total time: 0.628506 s Line # Hits Time Per Hit % Time Line Contents ============================================================== 1975 def _get_indices_dict(label_list, keys): 1976 400 695 1.7 0.1 shape = [len(x) for x in keys] 1977 100 114388 1143.9 18.2 group_index = get_group_index(label_list, shape) 1978 1979 100 320 3.2 0.1 sorter, _ = lib.groupsort_indexer(com._ensure_int64(group_index), 1980 100 62007 620.1 9.9 np.prod(shape)) 1981 1982 400 53530 133.8 8.5 sorted_labels = [lab.take(sorter) for lab in label_list] 1983 100 19516 195.2 3.1 group_index = group_index.take(sorter) 1984 100 20189 201.9 3.2 index = np.arange(len(group_index)).take(sorter) 1985 1986 100 357861 3578.6 56.9 return lib.indices_fast(index, group_index, keys, sorted_labels)

You might say, well, this seems like a lot of work and maybe we should just zip the keys (forming an array of Python tuples) and do a dumb algorithm? The speed difference ends up being something like an order of magnitude or more faster by being careful in this way and working with indirect integer index arrays.

Anyway, in conclusion, it’s these kinds of algorithms and ideas why pandas is perhaps the best-performing open-source data analysis toolkit for in memory data (I’m going to get to out-of-core data processing and “big data” eventually, just hang tight). It goes beyond language features and data structure internals (though this naturally also has a major impact, a lot of the things I do are easy to express in Python but would be very awkward or impossible to do in, say, R. Maybe I should write a whole article on this.); carefully thought-out algorithms are a major piece of the puzzle.

• Bob Hancock

Nice post, I look forwardt o the book.

The accuracy of typing on a tablet.

• Jorgen

I really enjoy following pandas development.

Is there a mailing list to discuss development of pandas?

• patfla

It would be helpful if in the following code block

In [11]: rng = date_range(’1/1/2000′, ’12/31/2005′, freq=’H')

In [12]: year, month, day = rng.year, rng.month, rng.day

In [13]: ts = Series(np.random.randn(len(rng)), index=rng)

In [14]: lprun -f gp._get_indices_dict for i in xrange(100): ts.groupby([year, month, day]).indices

Timer unit: 1e-06 s