Yesterday I found myself adding some additional Cython methods for doing fast grouped aggregations in pandas. To my disappointment, I found myself duplicating a lot of code and not having much alternative beyond cooking up some kind of ad hoc code generation framework. So, here’s the problem: array of data (possibly with NAs), array of labels, with the number of distinct labels known (range from 0 to N – 1). Aggregate data based on label using some function in {sum, min, max, product, mean, variance, …}. Here is a Cython function to add all the values in a group:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
def group_add(ndarray[float64_t] out, ndarray[int32_t] counts, ndarray[float64_t] values, ndarray[int32_t] labels): cdef: Py_ssize_t i, N, lab float64_t val, count ndarray[float64_t] sumx, nobs nobs = np.zeros_like(out) sumx = np.zeros_like(out) N = len(values) for i in range(N): lab = labels[i] if lab < 0: continue counts[lab] += 1 val = values[i] # not NA if val == val: nobs[lab] += 1 sumx[lab] += val for i in range(len(counts)): if nobs[i] == 0: out[i] = nan else: out[i] = sumx[i] |

Now suppose we want to take the product of the values:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 |
def group_prod(ndarray[float64_t] out, ndarray[int32_t] counts, ndarray[float64_t] values, ndarray[int32_t] labels): cdef: Py_ssize_t i, N, lab float64_t val, count ndarray[float64_t] prodx, nobs nobs = np.zeros_like(out) prodx = np.ones_like(out) N = len(values) for i in range(N): lab = labels[i] if lab < 0: continue counts[lab] += 1 val = values[i] # not nan if val == val: nobs[lab] += 1 prodx[lab] *= val for i in range(len(counts)): if nobs[i] == 0: out[i] = nan else: out[i] = prodx[i] |

However, the amount of code duplication here is kind of ridiculous; the only changes are: different reduction function, and different starting values. Ideally I would like to write code that looks like:

1 2 3 4 5 6 7 8 |
def group_reduce(how): def reducer(data, labels, out): filtered = notna(data) grouped = groupby(filtered, labels) return reduce(how, grouped, out=out) return reducer group_add = group_reduce(add) group_prod = group_reduce(multiply) |

IMHO there is plenty of information here to generate C or LLVM that is at least as fast as the above code. However, this isn’t completely extensible– suppose we wanted to compute group variance. In that case what you actually need is

1 2 3 4 |
nobs = reduce(count, grouped) X = reduce(add, grouped) XX = reduce(add, grouped * grouped) variance = (nobs * XX - X**2) / (nos * (nobs - 1)) |

The only thing you care about here is the variance result. Ideally, you want the generated code to:

Under this framework, something like mean becomes even more trivial:

1 2 3 4 |
grouped = groupby(notna(data), labels) nobs = reduce(count, grouped) X = reduce(add, grouped) mean = X / nobs |

Now, I can live with creating a C/Cython code generation framework, enabling me to write code like the above and get code that’s at least as good as something hand-written. But it would be best, of course, if the translation / machine code generation were dynamic to remove the compile step. Is there an existing toolchain here that can help with the machine code generation? Perhaps Numba is the place to do this with LLVM: Numba.

Perhaps the eventual result of something like this is an embedded JIT-ing APL-like language with Python syntax running inside CPython. You would like to make Numexpr (which is an awesome project!) effective obsolete while you’re at it. In the meantime, I’ll keep thinking about the best way forward.