## NumPy indexing peculiarities

Many scientific Python users are surprised when I tell them that ndarray.take is faster than __getitem__-based (a.k.a. “fancy”) indexing.

import numpy as np
import random

arr = np.random.randn(10000, 5)
indexer = np.arange(10000)
random.shuffle(indexer)

In [26]: timeit arr[indexer]
1000 loops, best of 3: 1.25 ms per loop

In [27]: timeit arr.take(indexer, axis=0)
10000 loops, best of 3: 127 us per loop
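The same comparison can be reproduced with the stdlib timeit module. This is a sketch of the setup above (np.random.permutation stands in for the random.shuffle step); the exact ratio will vary by machine and NumPy version:

```python
import timeit

import numpy as np

arr = np.random.randn(10000, 5)
indexer = np.random.permutation(10000)  # shuffled row indices

n = 1000
fancy = timeit.timeit(lambda: arr[indexer], number=n) / n
take = timeit.timeit(lambda: arr.take(indexer, axis=0), number=n) / n
print(f"fancy: {fancy * 1e6:.0f} us/loop, take: {take * 1e6:.0f} us/loop")

# whichever is faster, the two spellings return identical arrays
assert np.array_equal(arr[indexer], arr.take(indexer, axis=0))
```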

It’s actually kind of unbelievable when you think about it. What’s going on here that makes take almost 10x faster? I really should take a closer look at the internals of what __getitem__ does, because this has always struck me as pretty bad. Maybe I shouldn’t be complaining? R 2.13’s indexing falls somewhere in the middle:

mat <- matrix(rnorm(50000), nrow=10000, ncol=5)
set.seed(12345)
indexer <- sample(1:10000)
system.time(for (i in 1:1000) mat[indexer,])
user  system elapsed
0.460   0.197   0.656

So about 656 microseconds per iteration. (In an earlier version of this post I used rpy2 to run the benchmark and got 1.05 ms, but there was apparently some overhead from rpy2.)

Another peculiarity that I noticed with take is that performance gets worse when you use the out argument, which tells the function to use an array you pass in to write out the result:

out = np.empty_like(arr)

In [50]: timeit np.take(arr, indexer, axis=0, out=out)
10000 loops, best of 3: 200 us per loop

EDIT: I’ve been informed that using mode='clip' or mode='wrap' makes this run as fast as without the out argument.
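A quick sketch of that tip: the spellings below return the same values for in-range indices, so mode='clip' is a free way to recover the fast path when passing out (timings omitted here; they depend on machine and NumPy version):

```python
import numpy as np

arr = np.random.randn(10000, 5)
indexer = np.random.permutation(10000)
expected = arr[indexer]

out = np.empty_like(arr)
np.take(arr, indexer, axis=0, out=out)               # slower path described above
assert np.array_equal(out, expected)

np.take(arr, indexer, axis=0, out=out, mode='clip')  # reportedly the fast path
assert np.array_equal(out, expected)
```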

Weird! I was dissatisfied by this, so I got curious how fast a small hand-coded Cython function could do this:

cimport cython
import numpy as np
from numpy cimport ndarray, float64_t, int32_t

cdef double NaN = np.nan

@cython.wraparound(False)
@cython.boundscheck(False)
def take_axis0(ndarray[float64_t, ndim=2] values,
               ndarray[int32_t] indexer,
               out=None):
    cdef:
        Py_ssize_t i, j, k, n, idx
        ndarray[float64_t, ndim=2] outbuf

    if out is None:
        outbuf = np.empty_like(values)
    else:
        outbuf = out

    n = len(indexer)
    k = values.shape[1]
    for i from 0 <= i < n:
        idx = indexer[i]

        if idx == -1:
            for j from 0 <= j < k:
                outbuf[i, j] = NaN
        else:
            for j from 0 <= j < k:
                outbuf[i, j] = values[idx, j]

Don’t worry about the -1 handling; that’s a missing-data specialization that I’m using inside pandas. Curiously, this function is a lot faster than take with the out argument, and faster than regular take by a handful of microseconds.

In [53]: timeit lib.take_axis0(arr, indexer)
10000 loops, best of 3: 115 us per loop

In [54]: timeit lib.take_axis0(arr, indexer, out)
10000 loops, best of 3: 109 us per loop

Very interesting.
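For readers without a Cython build step, here is a rough pure-NumPy sketch of the same semantics (the helper name and the trick of rerouting -1 entries through row 0 are mine, not pandas internals):

```python
import numpy as np

def take_axis0_numpy(values, indexer, out=None):
    # rows flagged with -1 get filled with NaN; everything else is a row take
    outbuf = np.empty((len(indexer), values.shape[1])) if out is None else out
    mask = indexer == -1
    # replace -1 with 0 so take stays in bounds, then overwrite those rows
    np.take(values, np.where(mask, 0, indexer), axis=0, out=outbuf)
    outbuf[mask] = np.nan
    return outbuf

values = np.arange(12.0).reshape(4, 3)
result = take_axis0_numpy(values, np.array([2, -1, 0]))
# row 0 <- values[2], row 1 <- all NaN, row 2 <- values[0]
```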

### TL;DR

• Use take, not []-based fancy indexing, for the best performance
• The Cython function is just as fast for my specific application, and a lot faster if you’re passing an out array (which I will be for the application I needed this for)
• R’s matrix indexing performance is better than NumPy’s fancy indexing, but about 5-6x slower than ndarray.take. This can probably be improved.
### Comments

• Gaël Varoquaux

Wow, I had never realized that. Thanks for posting!

• Kieran

Hi Wes, perhaps I am missing something but isn’t R wrapped in Python here which would add a lot of overhead?

rpy2 has an embedded R interpreter instance, so it is just executing code inside the R engine, I think? Happy to run a benchmark in pure R; what’s the preferred way of doing that? system.time gives pretty imprecise results for a code snippet run inside a for loop.

EDIT: system.time(for (i in 1:1000) mat[indexer,]) takes 655 millis so maybe rpy2 is adding some overhead. is there a better way to benchmark it?

Thanks Wes. I am no R (or Rpy2) user but there is the obvious overhead of the python function calls passing and returning arguments to the R interpreter. Probably negligible for most purposes but I thought it might have an impact on such a benchmark.

I had a bit of a dig and the Rpy2 performance page makes for an interesting read (http://rpy.sourceforge.net/rpy2/doc-2.1/html/performances.html), quote

“[In some cases] the use of R through rpy2 [is] faster than using R from R. This was again expected, as the lower-level interface is closer to the C API for R.”

It might be interesting to compare the low-level R interface as well, if possible?


• Anonymous

Here’s another possible plan:

1. Profile Numpy’s fancy indexing
2. Eliminate bottlenecks in Numpy itself
3. ???
4. Profit

I definitely think we should try to bring fancy indexing up to parity with ndarray.take, or at least introduce some special cases in the code that just call take. It’s too bizarre a difference to let stand.

• http://wesmckinney.com/blog/?p=232 Faster time series alignment / joins for pandas, beating R’s xts package | Quant Pythonista

[...] xts for all of its supported join types (left/outer/inner) using the merge.xts function. In a blog article earlier today I wrote about some of the performance problems I had to address to do this. The rest [...]

• Tom Aldcroft

Very useful! I did a few experiments and discovered that if “arr” is one-dimensional then arr.take(indexer) is only about 2.5 times faster. Still, that’s nothing to sneeze at and will definitely improve some functions that make heavy use of fancy indexing.
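Tom’s 1-D observation is easy to check yourself (the ~2.5x figure is his measurement; this sketch only verifies the two spellings agree and lets you time them on your own machine):

```python
import timeit

import numpy as np

arr1d = np.random.randn(100000)
idx = np.random.permutation(100000)

n = 100
fancy = timeit.timeit(lambda: arr1d[idx], number=n) / n
take = timeit.timeit(lambda: arr1d.take(idx), number=n) / n
print(f"1-D fancy: {fancy * 1e6:.0f} us/loop, take: {take * 1e6:.0f} us/loop")

assert np.array_equal(arr1d[idx], arr1d.take(idx))
```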

• http://nwerneck.sdf.org dividebyzero

I think “fancy indexing” is slower because of things like bounds checking and wraparound handling. That is why, when you are writing something in Cython, the first thing you usually do is turn those off, like you did. If that is really the main overhead, it wouldn’t be possible to make it as fast as take() and Cython-made procedures.

I don’t know much about take() or how it deals with bounds checking, etc., but it’s really strange that passing an out argument made it slower. Now _that_ is something we should try to understand and fix.

Minimizing memory accesses is the key to making faster code. The way to go for me is to begin with pure NumPy code, then figure out which piece of the code can be replaced with a specialized Cython procedure. When inline Cython coding becomes very easy (I understand it’s pretty much around the corner), that will become the best way to go, at least for the stuff I do.

ndarray.take does handle both bounds-checking and wraparound. The C code is generated from a template for all the NumPy dtypes I think. So the specialized Cython function I wrote is not a drop-in replacement. I wanted my own anyway because of the specific handling of missing data in pandas

• Anonymous

np.take(arr, indexer, axis=0, out=out, mode='wrap') or mode='clip' is as fast as np.take(arr, indexer). Maybe np.take skips error handling when it creates the out array internally?
See http://docs.scipy.org/doc/numpy/reference/generated/numpy.take.html

I assume that’s almost certainly the reason. Thanks for the tip!

Hey Wes,

Thanks for posting this. There is a lot of low-hanging fruit in NumPy to speed things up. Of course, it’s always true that generalized code will be slower than specialized code. You can narrow the gap significantly, of course, but the differential dispatch will take time to sort through. There are a lot of areas in NumPy where this is the case.

Now, I don’t think the gap needs to be that large either. I wish there were resources to spend on NumPy. Everybody with time seems to want to build their own thing rather than improve the common standard — really too bad.

I would certainly like to dig into NumPy and see if this could be fixed (I’m sure it can, but unclear how much effort will be required). In this case I just needed a small specialized function (relating to how pandas does data alignment) and it was good enough

It would help if there was a listing of the known low-hanging fruit. Personally, I’ve looked at the Numpy bug tracker before and only seen bugs that didn’t really apply to anything that I was doing. I posted some patches to some of them, but didn’t really feel like contributing as much since there wasn’t much guidance as far as what can be improved.

• http://bytefish.myopenid.com/ Philipp

Hi Wes,

Nice read! Funny that I found this post today, because I did some NumPy programming and wrote about NumPy’s performance on my page: http://www.bytefish.de/wiki/python/numpy/performance. My naive version of an algorithm took 185 seconds with NumPy and 0.06 seconds with C++. To be fair: I was able to get down to 0.27 seconds with NumPy. Still, there’s a lot of potential.

Best regards, Philipp.

• Maxim Imakaev

Thanks for posting! I’ve never realized that!

I did further research, and found that a[b] = c[b] is ~2 times slower than numpy.copyto(c, a, where=b). copyto is new in NumPy 1.7.0 beta, but numpy.putmask can be used in earlier versions.

I also found that boolean indexing c = a[b] can be sped up ~50% when the dimensions of c (i.e. b.sum()) are already known and an output array of that size is preallocated (I couldn’t find a NumPy function to do this and wrote it using weave.inline).
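Both of Maxim’s points can be sketched in a few lines (np.compress with out= stands in for his weave.inline helper; that substitution is mine, not his code):

```python
import numpy as np

a = np.zeros(10)
c = np.arange(10.0)
b = c > 4                        # boolean mask

a[b] = c[b]                      # plain masked assignment
a2 = np.zeros(10)
np.copyto(a2, c, where=b)        # same result, reportedly ~2x faster
assert np.array_equal(a, a2)

# boolean indexing into a preallocated output of known size b.sum()
out = np.empty(int(b.sum()))
np.compress(b, c, out=out)
assert np.array_equal(out, c[b])
```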

• dbv

Did this ever get resolved in newer releases of NumPy?