Many scientific Python users are surprised when I tell them that ndarray.take is faster than __getitem__-based (a.k.a. “fancy” as I call it) indexing.
import random
arr = np.random.randn(10000, 5)
indexer = np.arange(10000)
random.shuffle(indexer)
In [26]: timeit arr[indexer]
1000 loops, best of 3: 1.25 ms per loop
In [27]: timeit arr.take(indexer, axis=0)
10000 loops, best of 3: 127 us per loop
It’s actually kind of unbelievable when you think about it. What’s going on here that take is almost 10x faster? I really should take a closer at the internals of what __getitem__ does because this has always struck me as pretty bad. Maybe I shouldn’t be complaining? I mean, R 2.13′s indexing falls somewhere in the middle:
set.seed(12345)
indexer <- sample(1:10000)
> system.time(for (i in 1:1000) mat[indexer,])
user system elapsed
0.460 0.197 0.656
So 656 microseconds per iteration. (In an earlier version of this post I used rpy2 to do the benchmark and got 1.05 ms, but there was apparently some overhead from rpy2)
Another peculiarity that I noticed with
In [50]: timeit np.take(arr, indexer, axis=0, out=out)
10000 loops, best of 3: 200 us per loop
EDIT: I’ve been informed that using mode='clip' or mode='wrap' makes this run as fast as without the out argument.
Weird! I was dissatisfied by this, so I got curious how fast a hand-coded little Cython function can do this:
@cython.boundscheck(False)
def take_axis0(ndarray[float64_t, ndim=2] values,
ndarray[int32_t] indexer,
out=None):
cdef:
Py_ssize_t i, j, k, n, idx
ndarray[float64_t, ndim=2] outbuf
if out is None:
outbuf = np.empty_like(values)
else:
outbuf = out
n = len(indexer)
k = values.shape[1]
for i from 0 <= i < n:
idx = indexer[i]
if idx == -1:
for j from 0 <= j < k:
outbuf[i, j] = NaN
else:
for j from 0 <= j < k:
outbuf[i, j] = values[idx, j]
Don’t worry about the -1 thing– that’s a specialization that I’m using inside pandas. Curiously, this function is a lot faster than take using out but faster than the regular take by a handful of microseconds.
10000 loops, best of 3: 115 us per loop
In [54]: timeit lib.take_axis0(arr, indexer, out)
10000 loops, best of 3: 109 us per loop
Very interesting.

Wes McKinney Reply:
September 23rd, 2011 at 11:24 pm
rpy2 has an embedded R interpreter instance so it is just executing code inside the R engine, I think? Happy to run a benchmark in pure R, what’s the preferred way of doing that? The system.time gives pretty imprecise results with a code snippet run inside a for loop
EDIT: system.time(for (i in 1:1000) mat[indexer,]) takes 655 millis so maybe rpy2 is adding some overhead. is there a better way to benchmark it?
[Reply]
Kieran Reply:
September 24th, 2011 at 4:31 am
Thanks Wes. I am no R (or Rpy2) user but there is the obvious overhead of the python function calls passing and returning arguments to the R interpreter. Probably negligible for most purposes but I thought it might have an impact on such a benchmark.
I had a bit of a dig and the Rpy2 performance page makes for an interesting read (http://rpy.sourceforge.net/rpy2/doc-2.1/html/performances.html), quote
“[In some cases] the use of R through rpy2 faster that using R from R. This was again expected, as the lower-level interface is closer to the C API for R.”
It might be interesting to compare the low-level R interface aswell if possible?
[Reply]
Kieran Reply:
September 24th, 2011 at 4:32 am
The link got a bit messed up, it should be http://rpy.sourceforge.net/rpy2/doc-2.1/html/performances.html