Yesterday I found myself adding some additional Cython methods for doing fast grouped aggregations in pandas. To my disappointment, I found myself duplicating a lot of code and not having much alternative beyond cooking up some kind of ad hoc code generation framework. So, here's the problem: given an array of data (possibly with NAs) and an array of labels, with the number of distinct labels known (they range from 0 to N - 1), aggregate the data by label using some function in {sum, min, max, product, mean, variance, …}. Here is a Cython function to add all the values in a group:

def group_add(ndarray[float64_t] out, ndarray[int32_t] counts,
              ndarray[float64_t] values, ndarray[int32_t] labels):
    cdef:
        Py_ssize_t i, N, lab
        float64_t val, count
        ndarray[float64_t] sumx, nobs

    nobs = np.zeros_like(out)
    sumx = np.zeros_like(out)

    N = len(values)

    for i in range(N):
        lab = labels[i]
        if lab < 0:
            continue

        counts[lab] += 1
        val = values[i]

        # not NA
        if val == val:
            nobs[lab] += 1
            sumx[lab] += val

    for i in range(len(counts)):
        if nobs[i] == 0:
            out[i] = nan
        else:
            out[i] = sumx[i]
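As a point of reference (my addition, not from the original post), the same reduction can be sketched in pure NumPy, which is handy as a correctness check against the Cython version:

```python
import numpy as np

def group_add_numpy(values, labels, ngroups):
    # Sketch of the reduction above in plain NumPy: sum the non-NaN
    # values in each group, writing NaN for groups with no observations.
    out = np.zeros(ngroups)
    nobs = np.zeros(ngroups)
    # Drop unlabeled (< 0) entries and NaNs in one mask
    mask = (labels >= 0) & ~np.isnan(values)
    np.add.at(out, labels[mask], values[mask])
    np.add.at(nobs, labels[mask], 1)
    out[nobs == 0] = np.nan
    return out
```

This is slower than the Cython loop (fancy indexing and `np.add.at` carry overhead), but it makes the semantics explicit.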


Now suppose we want to take the product of the values:

def group_prod(ndarray[float64_t] out, ndarray[int32_t] counts,
               ndarray[float64_t] values, ndarray[int32_t] labels):
    cdef:
        Py_ssize_t i, N, lab
        float64_t val, count
        ndarray[float64_t] prodx, nobs

    nobs = np.zeros_like(out)
    prodx = np.ones_like(out)

    N = len(values)

    for i in range(N):
        lab = labels[i]
        if lab < 0:
            continue

        counts[lab] += 1
        val = values[i]

        # not NA
        if val == val:
            nobs[lab] += 1
            prodx[lab] *= val

    for i in range(len(counts)):
        if nobs[i] == 0:
            out[i] = nan
        else:
            out[i] = prodx[i]


However, the amount of code duplication here is kind of ridiculous; the only changes are a different reduction function and different starting values. Ideally I would like to write code that looks like:

def group_reduce(how):
    def reducer(data, labels, out):
        filtered = notna(data)
        grouped = groupby(filtered, labels)
        return reduce(how, grouped, out=out)
    return reducer

group_prod = group_reduce(multiply)
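To make the idea concrete, here is a rough, runnable approximation of such a factory in plain Python over NumPy ufuncs. The names and signature are my own illustration, not pandas' actual internals; `how` is assumed to be a binary ufunc and `identity` its starting value:

```python
import numpy as np

def group_reduce(how, identity):
    # Hypothetical sketch of the factory described above. `how` is a
    # NumPy binary ufunc (np.add, np.multiply, ...), `identity` the
    # starting value of the reduction (0 for add, 1 for multiply).
    def reducer(data, labels, ngroups):
        out = np.full(ngroups, identity, dtype=np.float64)
        nobs = np.zeros(ngroups)
        # "filter + groupby + reduce" fused into one masked scatter
        mask = (labels >= 0) & ~np.isnan(data)
        how.at(out, labels[mask], data[mask])
        np.add.at(nobs, labels[mask], 1)
        out[nobs == 0] = np.nan
        return out
    return reducer

group_add = group_reduce(np.add, 0.0)
group_prod = group_reduce(np.multiply, 1.0)
```

A real code generator would emit the tight Cython loop instead of relying on `ufunc.at`, but the factory captures exactly the two degrees of freedom that differ between `group_add` and `group_prod`.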


IMHO there is plenty of information here to generate C or LLVM IR that is at least as fast as the above code. However, this isn't completely extensible: suppose we wanted to compute the group variance. In that case what you actually need is

nobs = reduce(count, grouped)
X = reduce(add, grouped)
XX = reduce(add, grouped * grouped)
variance = (nobs * XX - X**2) / (nobs * (nobs - 1))
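As a sanity check of the algebra (my sketch, not from the post): a single pass accumulating `nobs`, `X` (sum), and `XX` (sum of squares) per group recovers the usual sample variance:

```python
import numpy as np

def group_var(values, labels, ngroups):
    # One pass over the data accumulating count, sum, and sum of squares
    # per group, then combining: (nobs*XX - X**2) / (nobs*(nobs - 1)).
    nobs = np.zeros(ngroups)
    X = np.zeros(ngroups)
    XX = np.zeros(ngroups)
    for val, lab in zip(values, labels):
        if lab < 0 or val != val:   # skip unlabeled entries and NaNs
            continue
        nobs[lab] += 1
        X[lab] += val
        XX[lab] += val * val
    out = np.full(ngroups, np.nan)
    ok = nobs > 1                   # variance undefined for n < 2
    out[ok] = (nobs[ok] * XX[ok] - X[ok] ** 2) / (nobs[ok] * (nobs[ok] - 1))
    return out
```

For the group {1, 2, 3} this gives (3·14 − 36) / (3·2) = 1.0, matching `np.var(..., ddof=1)`.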


The only thing you care about here is the variance result. Ideally, you want the generated code to:

• Only make one pass over the data
• Have no unnecessary temporary variables, except the X and XX arrays; the full reductions are necessary for computing the variance.

Under this framework, something like mean becomes even more trivial:

grouped = groupby(notna(data), labels)
nobs = reduce(count, grouped)
mean = reduce(add, grouped) / nobs
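Continuing the sketch in plain Python (again my own illustrative code, not a real API), the generated single-pass code for the mean would amount to:

```python
import numpy as np

def group_mean(values, labels, ngroups):
    # One pass: accumulate per-group sums and counts, divide at the end.
    sums = np.zeros(ngroups)
    nobs = np.zeros(ngroups)
    mask = (labels >= 0) & ~np.isnan(values)
    np.add.at(sums, labels[mask], values[mask])
    np.add.at(nobs, labels[mask], 1)
    with np.errstate(invalid="ignore"):
        out = sums / nobs           # 0/0 -> NaN for empty groups
    return out
```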