Many scientific Python users are surprised when I tell them that `ndarray.take` is faster than __getitem__-based (a.k.a. “fancy” as I call it) indexing.

import random

arr = np.random.randn(10000, 5)

indexer = np.arange(10000)

random.shuffle(indexer)

In [26]: timeit arr[indexer]

1000 loops, best of 3: 1.25 ms per loop

In [27]: timeit arr.take(indexer, axis=0)

10000 loops, best of 3: 127 us per loop

It’s actually kind of unbelievable when you think about it. What’s going on here that `take` is almost **10x faster**? I really should take a closer at the internals of what __getitem__ does because this has always struck me as pretty bad. Maybe I shouldn’t be complaining? I mean, R 2.13′s indexing falls somewhere in the middle:

set.seed(12345)

indexer <- sample(1:10000)

> system.time(for (i in 1:1000) mat[indexer,])

user system elapsed

0.460 0.197 0.656

So 656 microseconds per iteration. (In an earlier version of this post I used rpy2 to do the benchmark and got 1.05 ms, but there was apparently some overhead from rpy2)

Another peculiarity that I noticed with **out** argument, which tells the function to use an array you pass in to write out the result:

In [50]: timeit np.take(arr, indexer, axis=0, out=out)

10000 loops, best of 3: 200 us per loop

**EDIT:** I’ve been informed that using `mode='clip'` or `mode='wrap'` makes this run as fast as without the out argument.

Weird! I was dissatisfied by this, so I got curious how fast a hand-coded little Cython function can do this:

@cython.boundscheck(False)

def take_axis0(ndarray[float64_t, ndim=2] values,

ndarray[int32_t] indexer,

out=None):

cdef:

Py_ssize_t i, j, k, n, idx

ndarray[float64_t, ndim=2] outbuf

if out is None:

outbuf = np.empty_like(values)

else:

outbuf = out

n = len(indexer)

k = values.shape[1]

for i from 0 <= i < n:

idx = indexer[i]

if idx == -1:

for j from 0 <= j < k:

outbuf[i, j] = NaN

else:

for j from 0 <= j < k:

outbuf[i, j] = values[idx, j]

Don’t worry about the -1 thing– that’s a specialization that I’m using inside pandas. Curiously, this function is a lot faster than `take` using **out** but faster than the regular `take` by a handful of microseconds.

10000 loops, best of 3: 115 us per loop

In [54]: timeit lib.take_axis0(arr, indexer, out)

10000 loops, best of 3: 109 us per loop

Very interesting.

### TL;DR

`take`not

`[]`-based indexing to get best performance

**out**array (which I will be for the application that I needed this for)

`matrix`indexing performance is better than NumPy’s fancy indexing, but about 5-6x slower than

`ndarray.take`. This can probably be improved.