Yesterday I found myself adding some additional Cython methods for doing fast grouped aggregations in pandas. To my disappointment, I ended up duplicating a lot of code, without much alternative beyond cooking up some kind of ad hoc code generation framework. So, here's the problem: you have an array of data (possibly with NAs) and an array of integer labels, with the number of distinct labels known (the labels range from 0 to N - 1). Aggregate the data by label using some function in {sum, min, max, product, mean, variance, …}. Here is a Cython function to add all the values in a group:

cimport numpy as cnp
from numpy cimport ndarray, float64_t, int32_t
import numpy as np

cnp.import_array()

nan = np.nan


def group_add(ndarray[float64_t] out, ndarray[int32_t] counts,
              ndarray[float64_t] values, ndarray[int32_t] labels):
    cdef:
        Py_ssize_t i, N, lab
        float64_t val, count
        ndarray[float64_t] sumx, nobs

    nobs = np.zeros_like(out)
    sumx = np.zeros_like(out)

    N = len(values)

    for i in range(N):
        lab = labels[i]
        if lab < 0:
            continue

        counts[lab] += 1
        val = values[i]

        # not NA (NaN != NaN, so this skips missing values)
        if val == val:
            nobs[lab] += 1
            sumx[lab] += val

    for i in range(len(counts)):
        if nobs[i] == 0:
            out[i] = nan
        else:
            out[i] = sumx[i]

Now suppose we want to take the product of the values:

def group_prod(ndarray[float64_t] out, ndarray[int32_t] counts,
               ndarray[float64_t] values, ndarray[int32_t] labels):
    cdef:
        Py_ssize_t i, N, lab
        float64_t val, count
        ndarray[float64_t] prodx, nobs

    nobs = np.zeros_like(out)
    prodx = np.ones_like(out)

    N = len(values)

    for i in range(N):
        lab = labels[i]
        if lab < 0:
            continue

        counts[lab] += 1
        val = values[i]

        # not NA
        if val == val:
            nobs[lab] += 1
            prodx[lab] *= val

    for i in range(len(counts)):
        if nobs[i] == 0:
            out[i] = nan
        else:
            out[i] = prodx[i]

However, the amount of code duplication here is kind of ridiculous; the only differences are the reduction function and the starting values. Ideally I would like to write code that looks like this:

def group_reduce(how):
    def reducer(data, labels, out):
        filtered = notna(data)
        grouped = groupby(filtered, labels)
        return reduce(how, grouped, out=out)
    return reducer
group_add = group_reduce(add)
group_prod = group_reduce(multiply)
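
Something shaped like this can almost be written today on top of NumPy ufuncs; here is a rough sketch (the group_reduce helper below is mine, not an existing API, and it allocates several temporaries, which is exactly what the generated code should avoid):

import numpy as np

def group_reduce(how):
    # Rough approximation of the factory above; `how` is a binary ufunc
    # such as np.add or np.multiply.
    def reducer(values, labels, ngroups):
        mask = ~np.isnan(values) & (labels >= 0)      # drop NaNs / unlabeled rows
        values, labels = values[mask], labels[mask]

        order = np.argsort(labels, kind='mergesort')  # gather each group together
        values, labels = values[order], labels[order]

        out = np.full(ngroups, np.nan)                # empty groups stay NaN
        present, starts = np.unique(labels, return_index=True)
        out[present] = how.reduceat(values, starts)
        return out
    return reducer

group_add = group_reduce(np.add)
group_prod = group_reduce(np.multiply)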

IMHO there is plenty of information here to generate C or LLVM IR that is at least as fast as the above code. However, this isn't completely extensible: suppose we wanted to compute the group variance. In that case, what you actually need is

nobs = reduce(count, grouped)
X = reduce(add, grouped)
XX = reduce(add, grouped * grouped)
variance = (nobs * XX - X**2) / (nobs * (nobs - 1))

The only thing you care about here is the variance result. Ideally, you want the generated code to:

  • Only make one pass over the data
  • Have no unnecessary temporary variables, except the X and XX arrays; the full reductions are necessary for computing the variance (a hand-written sketch of such a one-pass kernel follows the mean example below)

Under this framework, something like mean becomes even more trivial:
grouped = groupby(notna(data), labels)
nobs = reduce(count, grouped)
X = reduce(add, grouped)
mean = X / nobs

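To make the single-pass requirement concrete, here is roughly what the hand-written (or generated) variance kernel would have to look like, in the same style and with the same module-level imports as the kernels above; the name group_var and its exact signature simply mirror group_add:

def group_var(ndarray[float64_t] out, ndarray[int32_t] counts,
              ndarray[float64_t] values, ndarray[int32_t] labels):
    cdef:
        Py_ssize_t i, N, lab
        float64_t val, nobs_i
        ndarray[float64_t] sumx, sumxx, nobs

    nobs = np.zeros_like(out)
    sumx = np.zeros_like(out)    # X
    sumxx = np.zeros_like(out)   # XX

    N = len(values)

    # one pass: accumulate count, sum, and sum of squares per label
    for i in range(N):
        lab = labels[i]
        if lab < 0:
            continue

        counts[lab] += 1
        val = values[i]

        # not NA
        if val == val:
            nobs[lab] += 1
            sumx[lab] += val
            sumxx[lab] += val * val

    for i in range(len(counts)):
        nobs_i = nobs[i]
        if nobs_i < 2:
            # sample variance needs at least two observations
            out[i] = nan
        else:
            out[i] = (nobs_i * sumxx[i] - sumx[i] ** 2) / (nobs_i * (nobs_i - 1))
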
Now, I can live with creating a C/Cython code generation framework that lets me write high-level code like the pseudocode above and get generated code that's at least as good as something hand-written. But it would be best, of course, if the translation / machine code generation were dynamic, so that there is no separate compile step. Is there an existing toolchain here that can help with the machine code generation? Perhaps Numba is the place to do this with LLVM.
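
For what it's worth, here is roughly what the add kernel looks like under Numba's @njit decorator; this is a sketch assuming a reasonably recent Numba, and I haven't benchmarked it against the Cython version:

import numpy as np
from numba import njit

@njit
def group_add_numba(values, labels, ngroups):
    # Same one-pass structure as the Cython kernel, but compiled to machine
    # code by LLVM at call time instead of ahead of time.
    sumx = np.zeros(ngroups)
    nobs = np.zeros(ngroups)
    out = np.empty(ngroups)

    for i in range(len(values)):
        lab = labels[i]
        if lab < 0:
            continue
        val = values[i]
        if val == val:          # not NA
            nobs[lab] += 1
            sumx[lab] += val

    for i in range(ngroups):
        out[i] = sumx[i] if nobs[i] > 0 else np.nan
    return out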

Perhaps the eventual result of something like this is an embedded JIT-ing APL-like language with Python syntax running inside CPython. It would be nice to make Numexpr (which is an awesome project!) effectively obsolete along the way. In the meantime, I'll keep thinking about the best way forward.