Many scientific Python users are surprised when I tell them that
ndarray.take is faster than __getitem__-based (a.k.a. "fancy" as I call it) indexing.
import numpy as np import random arr = np.random.randn(10000, 5) indexer = np.arange(10000) random.shuffle(indexer) In : timeit arr[indexer] 1000 loops, best of 3: 1.25 ms per loop In : timeit arr.take(indexer, axis=0) 10000 loops, best of 3: 127 us per loop
It's actually kind of unbelievable when you think about it. What's going on here that
take is almost 10x faster? I really should take a closer at the internals of what __getitem__ does because this has always struck me as pretty bad. Maybe I shouldn't be complaining? I mean, R 2.13's indexing falls somewhere in the middle:
mat <- matrix(rnorm(50000), nrow=10000, ncol=5) set.seed(12345) indexer <- sample(1:10000) > system.time(for (i in 1:1000) mat[indexer,]) user system elapsed 0.460 0.197 0.656
So 656 microseconds per iteration. (In an earlier version of this post I used rpy2 to do the benchmark and got 1.05 ms, but there was apparently some overhead from rpy2)
Another peculiarity that I noticed with
out = np.empty_like(arr) In : timeit np.take(arr, indexer, axis=0, out=out) 10000 loops, best of 3: 200 us per loop
EDIT: I've been informed that using
mode='wrap' makes this run as fast as without the out argument.
Weird! I was dissatisfied by this, so I got curious how fast a hand-coded little Cython function can do this:
@cython.wraparound(False) @cython.boundscheck(False) def take_axis0(ndarray[float64_t, ndim=2] values, ndarray[int32_t] indexer, out=None): cdef: Py_ssize_t i, j, k, n, idx ndarray[float64_t, ndim=2] outbuf if out is None: outbuf = np.empty_like(values) else: outbuf = out n = len(indexer) k = values.shape for i from 0 <= i < n: idx = indexer[i] if idx == -1: for j from 0 <= j < k: outbuf[i, j] = NaN else: for j from 0 <= j < k: outbuf[i, j] = values[idx, j]
Don't worry about the -1 thing— that's a specialization that I'm using inside pandas. Curiously, this function is a lot faster than
take using out but faster than the regular
take by a handful of microseconds.
In : timeit lib.take_axis0(arr, indexer) 10000 loops, best of 3: 115 us per loop In : timeit lib.take_axis0(arr, indexer, out) 10000 loops, best of 3: 109 us per loop
-based indexing to get best performance
- Cython is just as fast for my specific application and a lot faster if you're passing an out array (which I will be for the application that I needed this for)
matrixindexing performance is better than NumPy's fancy indexing, but about 5-6x slower than
ndarray.take. This can probably be improved.