
I’m on my way back from R/Finance 2012. The organizers did a nice job putting on the conference, and it was great to meet everyone there.

As part of pandas development, I have had to develop a suite of high performance data algorithms and implementation strategies which are the heart and soul of the library. I get asked a lot why pandas’s performance is much better than R and other data manipulation tools. The reasons are two-fold: careful implementation (in Cython and C, minimizing the computational friction) and carefully-thought-out algorithms.

Here are some of the more important tools and ideas that I make use of on a day-to-day basis:

• Hash tables: I use klib. It’s awesome and easy to use. It powers critical algorithms like unique and factorize (unique + integer label assignment)
• O(N) sort, known in pandas as groupsort, on integers with a known range. This is a variant of counting sort; if you have N integers with known range from 0 to K - 1, they can be sorted in O(N) time. Combining this tool with factorize (hash table-based), you can categorize and sort a large data set in linear time. Failing to exploit these two algorithms forces you to pay O(N log N), which will dominate the runtime of your algorithm.
• Vectorized data movement and subsetting routines: take, put, putmask, replace, etc.
Let me give you a prime example, from a commit yesterday, of applying these ideas to great effect. Suppose I had a time series (or a DataFrame containing time series) that I wanted to group by year, month, and day:
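The snippet itself isn’t reproduced in this excerpt; a minimal stand-in (dates and values made up) for the kind of grouping described might look like:

```python
import numpy as np
import pandas as pd

# Made-up daily time series standing in for the post's elided example
rng = pd.date_range('2000-01-01', periods=100, freq='D')
ts = pd.Series(np.arange(100.0), index=rng)

# Group by (year, month, day) of the DatetimeIndex and aggregate
result = ts.groupby([ts.index.year, ts.index.month, ts.index.day]).mean()
```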

This is a small dataset, but imagine you have millions of observations and thousands or even millions of groups. How does that look algorithmically? I guarantee if you take a naive approach, you will crash and burn when the data increases in size. I know, because I did just that (take a look at the vbenchmarks). Laying down the infrastructure for doing a better job is not simple. Here are the steps for efficiently aggregating data like this:

• Factorize the group keys, computing integer label arrays (O(N) per key)
• Compute a “cartesian product number” or group index for each observation (since you could theoretically have K_1 * … * K_p groups observed, where K_i is the number of unique values in key i). This is again O(n) work.
• If the maximum number of groups is large, “compress” the group index by using the factorize algorithm on it again. imagine you have 1000 uniques per key and 3 keys; most likely you do not actually observe 1 billion key combinations but rather some much smaller number. O(n) work again
• For simple aggregations, like mean, go ahead and aggregate the data in one linear sweep without moving anything around.
• Otherwise, for arbitrary user-defined aggregations, we have two options for chopping up the data set: sorting it (O(N) using groupsort!) or create index arrays indicating which observations belong to each group. The latter would be preferable in many cases to sorting a large data set (reordering the data, even though linear time, can be very costly).
• I worked on speeding up the latter part of this last bullet point yesterday. The resulting code looked like this:
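The commit’s actual code isn’t shown in this excerpt. As a rough sketch of the ingredients described above (hypothetical names; np.unique stands in for the hash-table-based factorize, and the Python loop in groupsort_indexer would be Cython in practice):

```python
import numpy as np

def factorize(values):
    # Stand-in for the hash-table factorize: unique values -> integer labels
    uniques, labels = np.unique(values, return_inverse=True)
    return labels, len(uniques)

def get_group_index(label_list, shape):
    # "Cartesian product number" for each observation: O(N)
    group_index = np.zeros(len(label_list[0]), dtype=np.int64)
    stride = 1
    for labels, width in zip(reversed(label_list), reversed(shape)):
        group_index += labels * stride
        stride *= width
    return group_index

def groupsort_indexer(labels, ngroups):
    # Counting-sort ("groupsort") on integer labels: O(N + K), stable
    counts = np.bincount(labels, minlength=ngroups)
    starts = np.zeros(ngroups, dtype=np.int64)
    starts[1:] = np.cumsum(counts)[:-1]
    pos = starts.copy()
    indexer = np.empty(len(labels), dtype=np.int64)
    for i, lab in enumerate(labels):
        indexer[pos[lab]] = i
        pos[lab] += 1
    return indexer

# Factorize each key, form the group index, "compress" it, then groupsort
l1, k1 = factorize(np.array(['a', 'b', 'a', 'b'], dtype=object))
l2, k2 = factorize(np.array(['x', 'x', 'y', 'x'], dtype=object))
group_index = get_group_index([l1, l2], (k1, k2))
comp_labels, ngroups = factorize(group_index)
sorter = groupsort_indexer(comp_labels, ngroups)
```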

The details of lib.indices_fast aren’t that interesting; it chops up np.arange(len(group_index)).take(sorter), the sorted indirect indices, to produce the index dictionary. Running %lprun to get line profiling on a large-ish data set:

You might say: this seems like a lot of work, so why not just zip the keys (forming an array of Python tuples) and use a dumb algorithm? By being careful in this way and working with indirect integer index arrays, the result ends up something like an order of magnitude faster, or more.

Anyway, in conclusion, it’s these kinds of algorithms and ideas that make pandas perhaps the best-performing open-source data analysis toolkit for in-memory data (I’m going to get to out-of-core data processing and “big data” eventually, just hang tight). It goes beyond language features and data structure internals, though those naturally also have a major impact; a lot of the things I do are easy to express in Python but would be very awkward or impossible to do in, say, R (maybe I should write a whole article on this). Carefully thought-out algorithms are a major piece of the puzzle.

Someone recently asked me about creating cross-tabulations and contingency tables using pandas. I wrote a bit about this in October after implementing the pivot_table function for DataFrame. So I thought I would give a few more examples and show R code vs. the equivalent pandas code which will be helpful for those already used to the R way of doing things.

When calling help(xtabs) (or help(table)) in R, you get a number of examples for using table, xtabs, and ftable:

Here are the same examples using pandas:
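The example session isn’t reproduced in this excerpt; as a small stand-in (made-up data, using the modern pd.crosstab spelling), a contingency table like R’s table() can be built as:

```python
import pandas as pd

# Made-up data standing in for the R help() examples
df = pd.DataFrame({'sex': ['Female', 'Male', 'Female', 'Male', 'Female'],
                   'smoker': ['No', 'No', 'Yes', 'No', 'No']})

# Cross-tabulation of counts, analogous to R's table(sex, smoker)
counts = pd.crosstab(df['sex'], df['smoker'])
```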

One of the really great things about pandas is that the object produced is still a DataFrame (the R object is of a special class)! So we could do anything with it that we do with normal DataFrame objects:

Here is another fun example:

Here is the equivalent R code:

As soon as we add a formula framework to statsmodels I could see incorporating that into this kind of function in Python.

I had the privilege of speaking last night at the NYCPython meetup group. I’ve given tons of “use pandas!” talks so I thought I would take a slightly different angle and talk about some of the design and implementation work that I’ve done for getting good performance in critical data manipulations. I’ll turn some of this material into some blog articles in the near future.

Here’s some more video footage shot by my awesome friend Emily Paup with her HDSLR.

I did a little interactive demo (using the ever-amazing IPython HTML Notebook) on Ashley Williams’s Food Nutrient JSON Database:

If you want to run the code in the IPython notebook, you’ll have to download the food database file above.

The audience and I learned from this demo that if you’re after Tryptophan, “Sea lion, Steller, meat with fat (Alaska Native)” is where to get it (in the highest density).

Over the last week I have completely retooled pandas’s “database” join infrastructure / algorithms in order to support the full gamut of SQL-style many-to-many merges (pandas has had one-to-one and many-to-one joins for a long time). I was curious about the performance with reasonably large data sets as compared with base::merge.data.frame, which I’ve used many times in R. So I ran a little benchmark: joining a 100000-row DataFrame with a 10000-row DataFrame on two keys, with about 10000 unique key combinations overall, simulating a somewhat large SQL join.

Note this new functionality will be shipping in the upcoming 0.7.0 release (!).

There is a major factor affecting performance of the algorithms in pandas, namely whether the result data needs to be sorted lexicographically (which is the default behavior) by the join columns. R also offers the same option so it’s completely apples to apples. These are mean runtimes in seconds:

Sort by key columns

         pandas    R        Ratio
inner    0.07819   0.2286    2.924
outer    0.09013   1.284    14.25
left     0.0853    0.7766    9.104
right    0.08112   0.3371    4.156

Don’t sort by key columns

         pandas    R        Ratio
inner    0.02808   0.2297    8.18
outer    0.03879   1.181    30.45
left     0.0365    0.6706   18.37
right    0.03021   0.2995    9.912

As you can see, the sorting time in pandas completely dominates the runtime of the join (sorting 10000 strings is a lot of string comparisons). This isn’t really an indictment of R’s merge algorithm, but rather speaks to the strength of the merge strategy I devised in pandas. After spending a few days working on this problem I definitely think R could do a lot better. I haven’t run benchmarks against SQLite or another SQL database yet; that will happen eventually.

I’m going to write a blog article in the next few days going into algorithmic detail about how I got merging pretty big data sets to be so fast; it was not easy!

Here is the R code:

N <- 10000
indices <- rep(NA, N)
for (i in 1:N)
  indices[i] <- paste(sample(letters, 10), collapse="")

left <- data.frame(key=rep(indices, 10),
                   key2=sample(rep(indices, 10)),
                   value=rnorm(100000))
right <- data.frame(key=indices,
                    key2=sample(indices),
                    value2=rnorm(10000))

timeit <- function(func, niter=10) {
  timing <- rep(NA, niter)
  for (i in 1:niter) {
    gc()
    timing[i] <- system.time(func())[3]
  }
  mean(timing)
}

left.join <- function(sort=TRUE) {
  result <- merge(left, right, all.x=TRUE, sort=sort)
}
right.join <- function(sort=TRUE) {
  result <- merge(left, right, all.y=TRUE, sort=sort)
}
outer.join <- function(sort=TRUE) {
  result <- merge(left, right, all=TRUE, sort=sort)
}
inner.join <- function(sort=TRUE) {
  result <- merge(left, right, sort=sort)
}

sort.options <- c(FALSE, TRUE)

results <- matrix(nrow=4, ncol=2)
colnames(results) <- c("dont_sort", "sort")
rownames(results) <- c("inner", "outer", "left", "right")

join.functions <- c(inner.join, outer.join, left.join, right.join)
for (i in 1:4) {
  for (j in 1:2) {
    results[i, j] <- timeit(function() join.functions[[i]](sort=sort.options[j]))
  }
}

And the Python code:

from collections import defaultdict
import gc
import random
import time

import numpy as np
from pandas import *
from pandas.util.testing import rands

N = 10000
indices = np.array([rands(10) for _ in xrange(N)], dtype='O')

key = np.tile(indices, 10)
key2 = key.copy()
random.shuffle(key2)
indices2 = indices.copy()
random.shuffle(indices2)

left = DataFrame({'key' : key, 'key2' : key2,
                  'value' : np.random.randn(100000)})
right = DataFrame({'key' : indices, 'key2' : indices2,
                   'value2' : np.random.randn(10000)})

join_methods = ['inner', 'outer', 'left', 'right']
results = DataFrame(index=join_methods, columns=[False, True])
niter = 10
for sort in [False, True]:
    for join_method in join_methods:
        f = lambda: merge(left, right, how=join_method, sort=sort)
        gc.disable()
        start = time.time()
        for _ in xrange(niter):
            f()
        elapsed = (time.time() - start) / niter
        gc.enable()
        results[sort][join_method] = elapsed
results.columns = ['dont_sort', 'sort']

# R results
from StringIO import StringIO
r_results = read_table(StringIO("""
      dont_sort   sort
inner    0.2297 0.2286
outer    1.1811 1.2843
left     0.6706 0.7766
right    0.2995 0.3371
"""), sep='\s+')

sort_results = DataFrame.from_items([('pandas', results['sort']),
                                     ('R', r_results['sort'])])
sort_results['Ratio'] = sort_results['R'] / sort_results['pandas']

nosort_results = DataFrame.from_items([('pandas', results['dont_sort']),
                                       ('R', r_results['dont_sort'])])
nosort_results['Ratio'] = nosort_results['R'] / nosort_results['pandas']

Do you know how fast your code is? Is it faster than it was last week? Or a month ago? How do you know if you accidentally made a function slower by changes elsewhere? Unintentional performance regressions are extremely common in my experience: it’s hard to unit test the performance of your code. Over time I have gotten tired of playing the game of “performance whack-a-mole”. Thus, I started hacking together a little weekend project that I’m calling vbench. If someone thinks up a cleverer name, I’m all ears.

Link to pandas benchmarks page produced using vbench

## What is vbench?

vbench is a super-lightweight Python library for running a collection of performance benchmarks over the course of your source repository’s history. Since I’m a GitHub user, it only does git for now, but it could be generalized to support other VCSs. Basically, you define a benchmark:

common_setup = """
from pandas import *
import pandas.util.testing as tm
import random
import numpy as np
"""

setup = common_setup + """

N = 100000
ngroups = 100

def get_test_data(ngroups=100, n=N):
    unique_groups = range(ngroups)
    arr = np.asarray(np.tile(unique_groups, n / ngroups), dtype=object)

    if len(arr) < n:
        arr = np.asarray(list(arr) + unique_groups[:n - len(arr)],
                         dtype=object)

    random.shuffle(arr)
    return arr

df = DataFrame({'key1' : get_test_data(ngroups=ngroups),
                'key2' : get_test_data(ngroups=ngroups),
                'data' : np.random.randn(N)})

def f():
    df.groupby(['key1', 'key2']).agg(lambda x: x.values.sum())
"""

stmt2 = "df.groupby(['key1', 'key2']).sum()"
bm_groupby2 = Benchmark(stmt2, setup, name="GroupBy test 2",
                        start_date=datetime(2011, 7, 1))

Then you write down the information about your repository and how to build any relevant DLLs, etc., that vary from revision to revision:

REPO_PATH = '/home/wesm/code/pandas'
REPO_URL = 'git@github.com:wesm/pandas.git'
DB_PATH = '/home/wesm/code/pandas/gb_suite/benchmarks.db'
TMP_DIR = '/home/wesm/tmp/gb_pandas'
PREPARE = """
python setup.py clean
"""
BUILD = """
python setup.py build_ext --inplace
"""
START_DATE = datetime(2011, 3, 1)

Then you pass this info, plus a list of your benchmark objects, to the BenchmarkRunner class:

runner = BenchmarkRunner(benchmarks, REPO_PATH, REPO_URL,
                         BUILD, DB_PATH, TMP_DIR, PREPARE,
                         run_option='eod', start_date=START_DATE)
runner.run()

Now, the BenchmarkRunner makes a clone of your repo, then runs all of the benchmarks once for each revision in the repository (or some other rule; e.g., I’ve set run_option='eod' to only take the last snapshot on each day). It persists the results in a SQLite database so that you can rerun the process and it will skip benchmarks it’s already run (this is key: when you add new benchmarks, only the new ones will be run). Benchmarks are uniquely identified by the MD5 hash of their source code.
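The identification scheme is simple; a sketch (the stmt string is a stand-in benchmark body):

```python
import hashlib

# A benchmark is keyed by the MD5 hash of its source text; editing the
# source produces a new key, so the edited benchmark is rerun from scratch
stmt = "df.groupby(['key1', 'key2']).sum()"
checksum = hashlib.md5(stmt.encode('utf-8')).hexdigest()
```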

This is the resulting plot over time for the above GroupBy benchmark, related to some Cython code I worked on late last week (where I made a major performance improvement):

Here is a fully-formed vbench suite in the pandas git repository.

## Kind of like codespeed and speed.pypy.org?

Before starting to write a new project I looked briefly at codespeed and the excellent work that the PyPy guys have done with speed.pypy.org. But then I started thinking: you know, a Django web application, JSON requests to upload benchmark results? It seemed like far too much work for something relatively simple. The dealbreaker is that codespeed is just a web application. It doesn’t actually (to my knowledge, someone correct me if I’m wrong?) have any kind of framework for orchestrating the running of benchmarks throughout your code history. That is what this new project is for. I actually see a natural connection between vbench and codespeed: all you need to do is write a script to upload your vbench results to a codespeed web app!

At some point I’d like to build a simple web front end or wx/Qt viewer for the generated vbench database. I’ve never done any JavaScript, but it would be a good opportunity to learn. Knowing me, I might break down and hack out a stupid little wxPython app with an embedded matplotlib widget anyway.

Anyway, I’m really excited about this project. It’s very prototype-y at the moment but I tried to design it in a nice and extensible way. I also plan to put all my git repo analysis tools in there (like code churn graphs etc.) so it should become a nice little collection of tools.

## Other ideas for extending

• Dealing with API changes
• Multiple library versions (e.g. NumPy) or Python versions

I’ll be presenting this paper in a couple of weeks at SC11. It’s the most up-to-date white paper about pandas.

pandas: a Foundational Python Library for Data Analysis and Statistics

Over the last several months, I’ve invested a great deal in the GroupBy and indexing infrastructure of pandas. GroupBy in particular could still use more work to be made even higher performance, but even for quite large data sets, most users will find it more than satisfactory compared with other Python alternatives (which are mainly DIY approaches these days) or even with other data analysis packages, e.g. those in R, which I’ll talk a bit about in this article.

An easy application of GroupBy is creating pivot tables from a tabular data set. This involves aggregating a data set on some number of keys, then reshaping it to form a 2D table with the stratified aggregated values. I added a new function, pivot_table, which provides a high-level interface for creating pivot tables. Let’s consider the tips data set used in some of the examples from Hadley Wickham’s excellent R reshape2 package:

In [1]: import pandas.rpy.common as com

In [2]: tips = com.load_data('tips', package='reshape2')

In [3]: tips['tip_pct'] = tips['tip'] / tips['total_bill']

In [4]: tips[:5]
Out[4]:
   total_bill  tip   sex     smoker  day  time    size  tip_pct
1  16.99       1.01  Female  No      Sun  Dinner  2     0.05945
2  10.34       1.66  Male    No      Sun  Dinner  3     0.1605
3  21.01       3.5   Male    No      Sun  Dinner  3     0.1666
4  23.68       3.31  Male    No      Sun  Dinner  2     0.1398
5  24.59       3.61  Female  No      Sun  Dinner  4     0.1468

Since most readers won’t have rpy2, you can get the data file here and read it with:
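The reading snippet is elided here; a sketch, with an inline stand-in for the downloaded file (and the modern pd.read_csv spelling):

```python
from io import StringIO

import pandas as pd

# A couple of rows in the same format as the tips data file
csv_data = StringIO("""total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
""")
tips = pd.read_csv(csv_data)
tips['tip_pct'] = tips['tip'] / tips['total_bill']
```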

All of this new functionality is available now on GitHub and will be a part of the upcoming pandas 0.5.0 release. Note that I am only using rpy2 here to pull the R data set into my Python session; all of the code below is implemented in pure Python (with a little Cython magic to make things fast).

### Pivot table basics

To use pivot_table, pass the pandas DataFrame and specify the column you wish to aggregate and the columns to group on the rows and columns of the table. Note the default aggregation is mean, though another function can be specified:

In [9]: table = pivot_table(tips, 'tip_pct', rows=['time', 'sex'],
   ...:                     cols='smoker')

In [10]: table
Out[10]:
smoker         No      Yes
time   sex
Dinner Female  0.1568  0.1851
       Male    0.1594  0.1489
Lunch  Female  0.1571  0.1753
       Male    0.1657  0.1667

In [17]: pivot_table(tips, 'tip_pct', rows=['day', 'time'],
   ....:             cols='sex')
Out[17]:
sex          Female  Male
day  time
Fri  Dinner  0.1991  0.1302
     Lunch   0.1997  0.1741
Sat  Dinner  0.1565  0.1516
Sun  Dinner  0.1816  0.1623
Thur Dinner  0.1597  NaN
     Lunch   0.1575  0.1653

Note that the returned object is still a DataFrame, so you can use all of the standard indexing technology to select portions of the resulting table:

In [22]: table.ix['Dinner']
Out[22]:
smoker  No      Yes
sex
Female  0.1568  0.1851
Male    0.1594  0.1489

If you don’t specify a particular values column, it will try to aggregate all the other columns:

In [24]: pivot_table(tips, rows=['sex', 'smoker'])
Out[24]:
               size   tip    tip_pct  total_bill
sex    smoker
Female No      2.593  2.774  0.1569   18.11
       Yes     2.242  2.932  0.1822   17.98
Male   No      2.711  3.113  0.1607   19.79
       Yes     2.5    3.051  0.1528   22.28

Other aggregations can be performed, like using len to get group counts:

In [26]: pivot_table(tips, 'tip_pct', rows='sex', cols='smoker', aggfunc=len)
Out[26]:
smoker  No  Yes
sex
Female  54   33
Male    97   60

Once you’ve done the aggregation, you’re free to use the built-in reshaping capability to reshape the data in the table:

In [29]: table = pivot_table(tips, 'tip_pct', rows=['sex', 'day'],
   ....:                     cols='smoker', aggfunc=len)

In [30]: table
Out[30]:
smoker      No  Yes
sex    day
Female Fri   2    7
       Sat  13   15
       Sun  14    4
       Thur 25    7
Male   Fri   2    8
       Sat  32   27
       Sun  43   15
       Thur 20   10

In [31]: table.unstack('sex')
Out[31]:
smoker  No            Yes
sex     Female  Male  Female  Male
day
Fri       2       2     7       8
Sat       13      32    15      27
Sun       14      43    4       15
Thur      25      20    7       10

If you know a bit more about groupby, you can get clever and pass a dict of functions to perform multiple aggregations at once:

In [59]: pivot_table(tips, rows=['sex', 'smoker'],
   ....:             aggfunc={'tip_pct' : 'mean', 'size' : 'sum'})
Out[59]:
               size  tip_pct
sex    smoker
Female No      140   0.1569
       Yes      74   0.1822
Male   No      263   0.1607
       Yes     150   0.1528
I thought that was pretty cool.

### Comparison with reshape2

R’s reshape2 package relies on two functions, melt and cast, to do the reshaping. Its implementation is, at least in my opinion, a slightly more ad hoc approach than in pandas (the actual pivot_table function is only about 10 lines and basically falls out of the groupby and hierarchical-indexing-based reshaping). However, it can take advantage of R’s built-in formulas, which makes the syntax very intuitive. Here are some of the same examples using reshape2:

> tips$tip.pct <- tips$tip / tips$total_bill
> mtips = melt(tips, id=c("sex", "smoker", "day", "time"))
> acast(mtips, day ~ sex + smoker, length, subset=.(variable=="tip"))
     Female_No Female_Yes Male_No Male_Yes
Fri          2          7       2        8
Sat         13         15      32       27
Sun         14          4      43       15
Thur        25          7      20       10

> acast(mtips, sex + smoker ~ variable, mean)
           total_bill      tip     size   tip.pct
Female_No    18.10519 2.773519 2.592593 0.1569210
Female_Yes   17.97788 2.931515 2.242424 0.1821504
Male_No      19.79124 3.113402 2.711340 0.1606687
Male_Yes     22.28450 3.051167 2.500000 0.1527712

R lacks hierarchical indexing as far as I know, so reshape2 munges the group labels together into a row label.
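To illustrate the difference, here is a hand-built version of the aggregated tip_pct result above (numbers copied from the table): the pandas object carries a real hierarchical index that supports partial selection, where reshape2 gives flat "Female_No"-style labels.

```python
import pandas as pd

# Hand-built tip_pct means with a hierarchical (sex, smoker) index
idx = pd.MultiIndex.from_tuples([('Female', 'No'), ('Female', 'Yes'),
                                 ('Male', 'No'), ('Male', 'Yes')],
                                names=['sex', 'smoker'])
tip_pct = pd.Series([0.1569, 0.1822, 0.1607, 0.1528], index=idx)

# Partial indexing drops the outer level instead of string-munging labels
female = tip_pct['Female']
```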

In R there is also the xtabs function for doing cross-tabulation:

> ftable(xtabs(size ~ time + sex + smoker + day, data=tips))
                     day Fri Sat Sun Thur
time   sex    smoker
Dinner Female No          2  30  43    2
              Yes         8  33  10    0
       Male   No          4  85 124    0
              Yes        12  71  39    0
Lunch  Female No          3   0   0   60
              Yes         6   0   0   17
       Male   No          0   0   0   50
              Yes         5   0   0   23

ftable just creates a flat table which can be more easily viewed. This table could be created with pandas by:

In [54]: pivot_table(tips, 'size', rows=['time', 'sex', 'smoker'],
   ....:             cols='day', aggfunc=np.sum, fill_value=0)
Out[54]:
day                   Fri  Sat  Sun  Thur
time   sex    smoker
Dinner Female No      2    30   43   2
              Yes     8    33   10   0
Dinner Male   No      4    85   124  0
              Yes     12   71   39   0
Lunch  Female No      3    0    0    60
              Yes     6    0    0    17
Lunch  Male   No      0    0    0    50
              Yes     5    0    0    23

### Performance

pandas is very fast, as I’ve invested a great deal in optimizing the indexing infrastructure and other core algorithms related to things such as this. I don’t have a lot of points of comparison, but here is a simple benchmark of reshape2 versus pandas.pivot_table on a data set with 100000 entries and 25 groups. Here is the R code for the benchmark:

library(reshape2)
n <- 100000
a.size <- 5
b.size <- 5
data <- data.frame(a=sample(letters[1:a.size], n, replace=T),
                   b=sample(letters[1:b.size], n, replace=T),
                   c=rnorm(n),
                   d=rnorm(n))
timings <- numeric()

for (i in 1:10) {
  gc()
  tim <- system.time(acast(melt(data, id=c("a", "b")), a + b ~ variable, mean))
  timings[i] = tim[3]
}
mean(timings)

Here is what the actual table looks like:

> table <- acast(melt(data, id=c("a", "b")), a + b ~ variable, mean)
              c            d
a_a  0.002761963 -0.008793516
a_b -0.004618980 -0.026978744
a_c  0.022491767 -0.002163986
a_d  0.005027362 -0.002532782
a_e  0.006146438  0.031699238
b_a -0.025094899  0.003334301

Here’s the equivalent Python code (sans timings, which I’ll do in IPython):

from pandas import *
import numpy as np
import string

n = 100000
asize = 5
bsize = 5

letters = np.asarray(list(string.letters), dtype=object)

data = DataFrame(dict(foo=letters[:asize][np.random.randint(0, asize, n)],
                      bar=letters[:bsize][np.random.randint(0, bsize, n)],
                      baz=np.random.randn(n),
                      qux=np.random.randn(n)))

table = pivot_table(data, rows=['foo', 'bar'])

Similarly:

In [36]: table = pivot_table(data, rows=['foo', 'bar'])

In [37]: table[:10]
Out[37]:
         baz        qux
foo bar
a   a   -0.02162    0.02186
    b   -0.002155   0.01173
    c   -0.002331   0.006608
    d   -0.01734    0.01109
    e    0.02837    0.01093
b   a    0.01703   -0.005845
    b   -0.001043   0.01283
    c   -0.006015  -0.0005866
    d    0.009683   0.02211
    e    0.02347   -8.346e-05

And the R timing

> mean(timings)
[1] 0.42

And pandas:

In [38]: timeit table = pivot_table(data, rows=['foo', 'bar'])
10 loops, best of 3: 117 ms per loop

So it’s about 3-4x faster than reshape2. This obviously depends on the data set and how you structure the aggregation. Here are a couple of slightly different benchmarks on the same data:

> acast(melt(data, id=c("a", "b")), a~b, mean, subset=.(variable=="c"))
a            b            c           d            e
a  0.002761963 -0.004618980  0.022491767 0.005027362  0.006146438
b -0.025094899  0.015966167 -0.007663059 0.006795551  0.006648496
c  0.012273797  0.006139820 -0.019169968 0.009035374 -0.006079683
d -0.014174747 -0.009844566  0.022450738 0.025967638 -0.006572791
e -0.003623382  0.005924704  0.008156482 0.013288116  0.009839315

Timing
> mean(timings)
[1] 0.3036

and pandas:

In [41]: pivot_table(data, 'baz', rows='foo', cols='bar')
Out[41]:
bar  a         b          c         d         e
foo
a     -0.02162  -0.002155  -0.002331 -0.01734   0.02837
b      0.01703  -0.001043  -0.006015  0.009683  0.02347
c      0.01561   0.0003992  0.01451  -0.01046  -0.00368
d      0.003504 -0.01437   -0.03265   0.003471 -0.004638
e     -0.01656  -0.01276   -0.02481   0.008629  0.011

In [42]: timeit pivot_table(data, 'baz', rows='foo', cols='bar')
10 loops, best of 3: 55 ms per loop

I suspect one source of slowdown in the reshape/reshape2 approach is that the melt operation is very expensive, as is variable subsetting. However, it enables easier computation of group margins, something I have not yet addressed (but will, eventually) in pandas.

R is much more magical than Python. What do I mean by this? In R, things like this are a part of everyday life:

> a <- rnorm(10)
> b <- rnorm(10)
> cbind(a, b)
a          b
[1,]  0.8729978  0.5170078
[2,] -0.6885048 -0.4430447
[3,]  0.4017740  1.8985843
[4,]  2.1088905 -1.4121763
[5,]  0.9375273  0.4703302
[6,]  0.5558276 -0.5825152
[7,] -2.1606252  0.7379874
[8,] -0.7651046 -0.4534345
[9,] -4.2604901  0.9561077
[10,]  0.3940632 -0.8331285

If you’re a seasoned Python programmer, you might have the same sort of visceral negative reaction that I do to this. Seriously, just where in the hell did those variable names come from? So when I say magic here, I’m talking about abusing the language’s parser. There is nothing special about R that makes the above behavior possible; it simply reflects a fundamentally different design philosophy from, say, Python’s. As any Python programmer knows: explicit is better than implicit. I happen to agree. There is also a bit of a semantic difference between R and Python in that assignment in R typically copies data, whereas variables in Python are simply references (labels) for a particular object. So you could make the argument that the names a and b above are more strongly linked to the underlying data.

While building pandas over the last several years, I have occasionally grappled with issues like the above. Maybe I should just break from the Python ethos and embrace magic? I mean, how hard would it be to get the above behavior in Python? Python gives you stack frames and the ast module, after all. So I went down the rabbit hole and wrote this little code snippet:
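The snippet itself is not included in this excerpt. A sketch of the idea (hypothetical code, not the original: inspect the caller’s stack frame, parse the call line with ast, and label each argument by its source expression; it returns a plain dict here rather than a DataFrame):

```python
import ast
import inspect

def merge(*args):
    # Recover the source text of each argument at the call site
    caller = inspect.currentframe().f_back
    info = inspect.getframeinfo(caller)
    labels = None
    if info.code_context:
        try:
            source = info.code_context[0].strip()
            call = next(node for node in ast.walk(ast.parse(source))
                        if isinstance(node, ast.Call)
                        and getattr(node.func, 'id', None) == 'merge')
            labels = [ast.unparse(arg) for arg in call.args]
        except (SyntaxError, StopIteration):
            pass
    if labels is None:
        # No source available (e.g. in a REPL without history)
        labels = ['arg%d' % i for i in range(len(args))]
    return dict(zip(labels, args))

a = [1.0, 2.0]
b = [3.0, 4.0]
labeled = merge(a, b)  # run from a file, the keys come out as 'a' and 'b'
```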

While this is woefully unpythonic, it’s also kind of cool:

In [27]: merge(a, b)
Out[27]:
a         b
2000-01-03 -1.35      0.8398
2000-01-04  0.999    -1.617
2000-01-05  0.2537    1.433
2000-01-06  0.6273   -0.3959
2000-01-07  0.7963   -0.789
2000-01-10  0.004295 -1.446

This can even parse and format more complicated expressions (harder than it looks, because you have to walk the whole AST):

In [30]: merge(a, np.log(b))
Out[30]:
a        np.log(b)
2000-01-03  0.6243   0.7953
2000-01-04  0.3593  -1.199
2000-01-05  2.805   -1.059
2000-01-06  0.6369  -0.9067
2000-01-07 -0.2734   NaN
2000-01-10 -1.023    0.3326

Now, I am *not* suggesting we do this any time soon. I’m going to prefer the explicit approach (cf. the Zen of Python) any day of the week:

In [32]: DataFrame({'a' : a, 'log(b)' : np.log(b)})
Out[32]:
a        log(b)
2000-01-03  0.6243   0.7953
2000-01-04  0.3593  -1.199
2000-01-05  2.805   -1.059
2000-01-06  0.6369  -0.9067
2000-01-07 -0.2734   NaN
2000-01-10 -1.023    0.3326

Many scientific Python users are surprised when I tell them that ndarray.take is faster than __getitem__-based (a.k.a. “fancy” as I call it) indexing.

import numpy as np
import random

arr = np.random.randn(10000, 5)
indexer = np.arange(10000)
random.shuffle(indexer)

In [26]: timeit arr[indexer]
1000 loops, best of 3: 1.25 ms per loop

In [27]: timeit arr.take(indexer, axis=0)
10000 loops, best of 3: 127 us per loop

It’s actually kind of unbelievable when you think about it. What’s going on here that makes take almost 10x faster? I really should take a closer look at the internals of what __getitem__ does, because this has always struck me as pretty bad. Maybe I shouldn’t be complaining? I mean, R 2.13’s indexing falls somewhere in the middle:

mat <- matrix(rnorm(50000), nrow=10000, ncol=5)
set.seed(12345)
indexer <- sample(1:10000)
> system.time(for (i in 1:1000) mat[indexer,])
user  system elapsed
0.460   0.197   0.656

So 656 microseconds per iteration. (In an earlier version of this post I used rpy2 to do the benchmark and got 1.05 ms, but there was apparently some overhead from rpy2)

Another peculiarity that I noticed with take is that performance gets worse when you use the out argument, which tells the function to use an array you pass in to write out the result:

out = np.empty_like(arr)

In [50]: timeit np.take(arr, indexer, axis=0, out=out)
10000 loops, best of 3: 200 us per loop

EDIT: I’ve been informed that using mode='clip' or mode='wrap' makes this run as fast as without the out argument.

Weird! I was dissatisfied by this, so I got curious how fast a hand-coded little Cython function can do this:

cimport cython
from numpy cimport ndarray, float64_t, int32_t
import numpy as np

NaN = np.nan

@cython.wraparound(False)
@cython.boundscheck(False)
def take_axis0(ndarray[float64_t, ndim=2] values,
               ndarray[int32_t] indexer,
               out=None):
    cdef:
        Py_ssize_t i, j, k, n, idx
        ndarray[float64_t, ndim=2] outbuf

    if out is None:
        outbuf = np.empty_like(values)
    else:
        outbuf = out

    n = len(indexer)
    k = values.shape[1]
    for i from 0 <= i < n:
        idx = indexer[i]

        if idx == -1:
            for j from 0 <= j < k:
                outbuf[i, j] = NaN
        else:
            for j from 0 <= j < k:
                outbuf[i, j] = values[idx, j]

    return outbuf

Don’t worry about the -1 thing; that’s a specialization I’m using inside pandas. Curiously, this function is a lot faster than take using out, and faster than the regular take by a handful of microseconds.

In [53]: timeit lib.take_axis0(arr, indexer)
10000 loops, best of 3: 115 us per loop

In [54]: timeit lib.take_axis0(arr, indexer, out)
10000 loops, best of 3: 109 us per loop

Very interesting.

### TL;DR

• Use take, not []-based indexing, to get the best performance
• Cython is just as fast for my specific application and a lot faster if you’re passing an out array (which I will be for the application that I needed this for)
• R’s matrix indexing performance is better than NumPy’s fancy indexing, but about 5-6x slower than ndarray.take. This can probably be improved.

For all my Python zealotry, I never claimed it was perfect. In fact, there are a number of places where you can shoot yourself in the foot if you aren’t careful. But I was kind of bummed about this one:

import numpy as np

class MainScreen(object):

    def __init__(self, cats):
        self.cats = cats
        self._signal = None

    def turn_on(self):
        if self._signal is None:
            signal = Signal(self)
            self._signal = signal
        return self._signal

class Signal(object):

    def __init__(self, main_screen):
        # well we want to know which screen we came from!
        self.main_screen = main_screen

    def we_get(self):
        return True

for i in xrange(50):
    cats = np.empty((10000, 500), dtype=float)
    cats.fill(0)
    main_screen = MainScreen(cats)
    signal = main_screen.turn_on()

Try it.

EDIT: I’ve been informed by an astute reader that the GC is actually capable of dealing with the above case. Yang Zhang says “The collector actually works with cycles, including this case. (Actually, most objects are managed via reference-counting; the gc is specifically for cyclic references.) The problem is that it’s triggered by # allocations, not # bytes. In this case, each cats array is large enough that you need several GB of allocations before the threshold is triggered. Try adding gc.collect() or gc.get_count() in the loop (and inspect gc.get_threshold()).”
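His point is easy to verify: the cyclic collector does reclaim such cycles when it runs; it just may not run soon enough when each allocation is huge. A minimal illustration:

```python
import gc

class Node(object):
    def __init__(self):
        self.other = None

# Two objects referencing each other: refcounts never reach zero
a, b = Node(), Node()
a.other, b.other = b, a
del a, b

# The generational collector still finds and frees the cycle
found = gc.collect()
```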

All is not lost however. This pattern can be implemented using weak references:

import weakref

class MainScreen(object):

    def __init__(self, cats):
        self.cats = cats
        self._signal = None

    def turn_on(self):
        if self._signal is None:
            signal = Signal(self)
            self._signal = weakref.ref(signal)
        return self._signal()