# Category Archives: python

## Mastering high performance data algorithms I: Group By

I’m on my way back from R/Finance 2012. Those guys did a nice job of organizing the conference, and it was great to meet everyone there.

As part of pandas development, I have had to develop a suite of high performance data algorithms and implementation strategies which are the heart and soul of the library. I get asked a lot why pandas’s performance is much better than R and other data manipulation tools. The reasons are two-fold: careful implementation (in Cython and C, minimizing the computational friction) and carefully-thought-out algorithms.

Here are some of the more important tools and ideas that I make use of on a day-to-day basis:

• Hash tables: I use klib. It’s awesome and easy to use, and it powers critical algorithms like unique and factorize (unique + integer label assignment)
• O(N) sort, known in pandas as groupsort, on integers with known range. This is a variant of counting sort: if you have N integers with a known range from 0 to K – 1, they can be sorted in O(N) time. Combining this tool with factorize (hash table-based), you can categorize and sort a large data set in linear time. Failure to understand these two algorithms will force you to pay O(N log N), dominating the runtime of your algorithm.
• Vectorized data movement and subsetting routines: take, put, putmask, replace, etc.

Let me give you a prime example, from a commit yesterday, of applying these ideas to great effect. Suppose I had a time series (or a DataFrame containing time series) that I wanted to group by year, month, and day:
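The code from that commit isn’t reproduced here, so here is a stand-in sketch of the kind of grouping I mean, on made-up daily data:

```python
import numpy as np
import pandas as pd

# A small daily time series (synthetic stand-in for the post's example)
rng = pd.date_range('2005-01-01', periods=1000, freq='D')
ts = pd.Series(np.random.randn(len(rng)), index=rng)

# Group by the year, month, and day of each timestamp and aggregate
keys = [ts.index.year, ts.index.month, ts.index.day]
result = ts.groupby(keys).mean()
```

With daily data every (year, month, day) combination is its own group, so the result has one entry per observation; the interesting case is when many observations share a group.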

This is a small dataset, but imagine you have millions of observations and thousands or even millions of groups. How does that look algorithmically? I guarantee if you take a naive approach, you will crash and burn when the data increases in size. I know, because I did just that (take a look at the vbenchmarks). Laying down the infrastructure for doing a better job is not simple. Here are the steps for efficiently aggregating data like this:

• Factorize the group keys, computing integer label arrays (O(N) per key)
• Compute a “cartesian product number” or group index for each observation (since you could theoretically have K_1 * … * K_p groups observed, where K_i is the number of unique values in key i). This is again O(N) work.
• If the maximum number of groups is large, “compress” the group index by running the factorize algorithm on it again. Imagine you have 1000 uniques per key and 3 keys; most likely you do not actually observe 1 billion key combinations but rather some much smaller number. O(N) work again.
• For simple aggregations, like mean, go ahead and aggregate the data in one linear sweep without moving anything around.
• Otherwise, for arbitrary user-defined aggregations, we have two options for chopping up the data set: sorting it (O(N) using groupsort!) or creating index arrays indicating which observations belong to each group. The latter is preferable in many cases to sorting a large data set (reordering the data, even though linear time, can be very costly).

I worked on speeding up the latter option yesterday. The resulting code looked like this:
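The commit itself isn’t reproduced here, so here is a rough pure-Python/NumPy sketch of the whole pipeline above (factorize, group index, compression, then groupsort) — illustrative only, not the pandas internals:

```python
import numpy as np

def factorize(values):
    # Hash table pass: integer labels plus the number of uniques (O(n))
    table = {}
    labels = np.empty(len(values), dtype=np.int64)
    for i, v in enumerate(values):
        labels[i] = table.setdefault(v, len(table))
    return labels, len(table)

def groupsort_indexer(labels, ngroups):
    # Counting-sort variant: a stable O(n) argsort of integers in [0, ngroups)
    counts = np.bincount(labels, minlength=ngroups)
    where = np.zeros(ngroups, dtype=np.int64)
    where[1:] = np.cumsum(counts)[:-1]
    indexer = np.empty(len(labels), dtype=np.int64)
    for i, lab in enumerate(labels):
        indexer[where[lab]] = i
        where[lab] += 1
    return indexer

def group_indices(*keys):
    # 1) factorize each key and fold labels into a "cartesian product" index
    group_index = np.zeros(len(keys[0]), dtype=np.int64)
    stride = 1
    for key in reversed(keys):
        labels, nuniques = factorize(key)
        group_index += labels * stride
        stride *= nuniques
    # 2) "compress" the potentially sparse group index by factorizing again
    comp_labels, ngroups = factorize(group_index)
    # 3) groupsort, then chop up the sorted indirect indices per group
    sorter = groupsort_indexer(comp_labels, ngroups)
    sorted_labels = comp_labels[sorter]
    starts = np.searchsorted(sorted_labels, np.arange(1, ngroups))
    return np.split(sorter, starts)
```

Every step is a linear pass; the only indirection is the final `np.split` of the sorted row indices.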

The details of lib.indices_fast aren’t that interesting; it chops up np.arange(len(group_index)).take(sorter), the sorted indirect indices, to produce the index dictionary. Running %lprun to get a line profile on a large-ish data set:

You might say, well, this seems like a lot of work, and maybe we should just zip the keys (forming an array of Python tuples) and use a dumb algorithm? Being careful in this way and working with indirect integer index arrays ends up being something like an order of magnitude faster.
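For contrast, the “dumb” tuple-zipping approach can be sketched like this (correct, just slow at scale):

```python
from collections import defaultdict

def naive_group_indices(*keys):
    # Zip the keys into Python tuples and bucket row numbers. Correct,
    # but per-row tuple construction and hashing makes it roughly an
    # order of magnitude slower than label-based methods on large data
    indices = defaultdict(list)
    for i, key_tuple in enumerate(zip(*keys)):
        indices[key_tuple].append(i)
    return dict(indices)
```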

Anyway, in conclusion, it’s these kinds of algorithms and ideas that make pandas perhaps the best-performing open-source data analysis toolkit for in-memory data (I’m going to get to out-of-core data processing and “big data” eventually, just hang tight). It goes beyond language features and data structure internals (though those naturally also have a major impact; a lot of the things I do are easy to express in Python but would be very awkward or impossible to do in, say, R. Maybe I should write a whole article on this). Carefully thought-out algorithms are a major piece of the puzzle.

## vbench Lightning Talk Slides from PyCon 2012

I prepared this lightning talk for PyCon on vbench, my performance monitoring tool, but didn’t have a chance to give it. If I get energetic I might make a screencast of it, but here are the slides:

## Contingency tables and cross-tabulations in pandas

Someone recently asked me about creating cross-tabulations and contingency tables using pandas. I wrote a bit about this in October after implementing the pivot_table function for DataFrame. So I thought I would give a few more examples and show R code vs. the equivalent pandas code which will be helpful for those already used to the R way of doing things.

When calling help(xtabs) (or help(table)) in R, you get a number of examples for using table, xtabs, and ftable:

Here are the same examples using pandas:
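The session output from the post isn’t reproduced here, but the flavor of it can be sketched with today’s pandas on a small synthetic dataset (not the R datasets from the help pages):

```python
import pandas as pd

# Synthetic stand-in for the categorical data in R's help(xtabs) examples
df = pd.DataFrame({
    'gender': ['M', 'F', 'M', 'F', 'M', 'F', 'M', 'M'],
    'admitted': ['yes', 'no', 'no', 'yes', 'yes', 'yes', 'no', 'yes'],
})

# Contingency table of counts, analogous to R's table() / xtabs()
counts = pd.crosstab(df['gender'], df['admitted'])

# pivot_table produces the same table with an explicit aggregation
table = df.assign(n=1).pivot_table(values='n', index='gender',
                                   columns='admitted', aggfunc='sum',
                                   fill_value=0)
```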

One of the really great things about pandas is that the object produced is still a DataFrame (the R object is of a special class)! So we could do anything with it that we do with normal DataFrame objects:

Here is another fun example:

Here is the equivalent R code:

As soon as we add a formula framework to statsmodels I could see incorporating that into this kind of function in Python.

## NYCPython 1/10/2012: A look inside pandas design and development

I had the privilege of speaking last night at the NYCPython meetup group. I’ve given tons of “use pandas!” talks so I thought I would take a slightly different angle and talk about some of the design and implementation work that I’ve done for getting good performance in critical data manipulations. I’ll turn some of this material into some blog articles in the near future.

Here’s some more video footage shot by my awesome friend Emily Paup with her HDSLR.

I did a little interactive demo (using the ever-amazing IPython HTML Notebook) on Ashley Williams’s Food Nutrient JSON Database:

If you want to run the code in the IPython notebook, you’ll have to download the food database file above.

The audience and I learned from this demo that if you’re after Tryptophan, “Sea lion, Steller, meat with fat (Alaska Native)” is where to get it (in the highest density).
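A sketch of how such a lookup might go with pandas, using a made-up two-record miniature of the database (the real file’s schema and values may differ; the field names here are illustrative only):

```python
import pandas as pd

# Tiny stand-in for the food nutrient JSON database
foods = [
    {'description': 'Sea lion, Steller, meat with fat (Alaska Native)',
     'nutrients': [{'description': 'Tryptophan', 'value': 0.96}]},
    {'description': 'Cheddar cheese',
     'nutrients': [{'description': 'Tryptophan', 'value': 0.32}]},
]

# Flatten the nested records into a long-format DataFrame
rows = [(food['description'], n['description'], n['value'])
        for food in foods for n in food['nutrients']]
df = pd.DataFrame(rows, columns=['food', 'nutrient', 'value'])

# Which food has the highest tryptophan density?
trp = df[df['nutrient'] == 'Tryptophan']
winner = trp.loc[trp['value'].idxmax(), 'food']
```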

## Some pandas Database Join (merge) Benchmarks vs. R base::merge

Over the last week I have completely retooled pandas’s “database” join infrastructure / algorithms in order to support the full gamut of SQL-style many-to-many merges (pandas has had one-to-one and many-to-one joins for a long time). I was curious about the performance with reasonably large data sets as compared with base::merge.data.frame, which I’ve used many times in R. So I ran a little benchmark joining a 100000-row DataFrame with a 10000-row DataFrame on two keys, with about 10000 unique key combinations overall, simulating a somewhat large SQL join.

Note this new functionality will be shipping in the upcoming 0.7.0 release (!).

There is a major factor affecting performance of the algorithms in pandas, namely whether the result data needs to be sorted lexicographically (which is the default behavior) by the join columns. R also offers the same option so it’s completely apples to apples. These are mean runtimes in seconds:

Sort by key columns:

|       | pandas  | R      | Ratio |
|-------|---------|--------|-------|
| inner | 0.07819 | 0.2286 | 2.924 |
| outer | 0.09013 | 1.284  | 14.25 |
| left  | 0.0853  | 0.7766 | 9.104 |
| right | 0.08112 | 0.3371 | 4.156 |

Don’t sort by key columns:

|       | pandas  | R      | Ratio |
|-------|---------|--------|-------|
| inner | 0.02808 | 0.2297 | 8.18  |
| outer | 0.03879 | 1.181  | 30.45 |
| left  | 0.0365  | 0.6706 | 18.37 |
| right | 0.03021 | 0.2995 | 9.912 |

As you can see, the sorting time in pandas completely dominates the runtime of the join (sorting 10000 strings is a lot of string comparisons). This isn’t really an indictment of R’s merge algorithm, but rather speaks to the strength of the merge strategy I devised in pandas. After spending a few days working on this problem I definitely think R could do a lot better. I haven’t run benchmarks against SQLite or another SQL database yet; that will happen eventually.

I’m going to write a blog article in the next few days going into algorithmic detail about how I got merging pretty big data sets to be so fast–it was not easy!

Here is the R code:

```r
N <- 10000
indices <- rep(NA, N)
for (i in 1:N)
  indices[i] <- paste(sample(letters, 10), collapse="")

left <- data.frame(key=rep(indices, 10),
                   key2=sample(rep(indices, 10)),
                   value=rnorm(100000))
right <- data.frame(key=indices,
                    key2=sample(indices),
                    value2=rnorm(10000))

timeit <- function(func, niter=10) {
  timing <- rep(NA, niter)
  for (i in 1:niter) {
    gc()
    timing[i] <- system.time(func())[3]
  }
  mean(timing)
}

left.join <- function(sort=TRUE) {
  result <- merge(left, right, all.x=TRUE, sort=sort)
}
right.join <- function(sort=TRUE) {
  result <- merge(left, right, all.y=TRUE, sort=sort)
}
outer.join <- function(sort=TRUE) {
  result <- merge(left, right, all=TRUE, sort=sort)
}
inner.join <- function(sort=TRUE) {
  result <- merge(left, right, sort=sort)
}

sort.options <- c(FALSE, TRUE)

results <- matrix(nrow=4, ncol=2)
colnames(results) <- c("dont_sort", "sort")
rownames(results) <- c("inner", "outer", "left", "right")

join.functions <- c(inner.join, outer.join, left.join, right.join)
for (i in 1:4) {
  func <- join.functions[[i]]
  results[i, 1] <- timeit(function() func(sort=sort.options[1]))
  results[i, 2] <- timeit(function() func(sort=sort.options[2]))
}
```

And the Python code:

```python
import gc
import random
import time

import numpy as np
from pandas import DataFrame, merge, read_table
from pandas.util.testing import rands

N = 10000
indices = np.array([rands(10) for _ in xrange(N)], dtype='O')

key = np.tile(indices, 10)
key2 = key.copy()
random.shuffle(key2)
indices2 = indices.copy()
random.shuffle(indices2)
left = DataFrame({'key': key, 'key2': key2,
                  'value': np.random.randn(100000)})
right = DataFrame({'key': indices, 'key2': indices2,
                   'value2': np.random.randn(10000)})

join_methods = ['inner', 'outer', 'left', 'right']
results = DataFrame(index=join_methods, columns=[False, True])
niter = 10
for sort in [False, True]:
    for join_method in join_methods:
        f = lambda: merge(left, right, how=join_method, sort=sort)
        gc.disable()
        start = time.time()
        for _ in xrange(niter):
            f()
        elapsed = (time.time() - start) / niter
        gc.enable()
        results[sort][join_method] = elapsed
results.columns = ['dont_sort', 'sort']

# R results
from StringIO import StringIO
r_results = read_table(StringIO("""dont_sort  sort
inner    0.2297 0.2286
outer    1.1811 1.2843
left     0.6706 0.7766
right    0.2995 0.3371
"""), sep='\s+')

sort_results = DataFrame.from_items([('pandas', results['sort']),
                                     ('R', r_results['sort'])])
sort_results['Ratio'] = sort_results['R'] / sort_results['pandas']

nosort_results = DataFrame.from_items([('pandas', results['dont_sort']),
                                       ('R', r_results['dont_sort'])])
nosort_results['Ratio'] = nosort_results['R'] / nosort_results['pandas']
```

## Introducing vbench, new code performance analysis and monitoring tool

Do you know how fast your code is? Is it faster than it was last week? Or a month ago? How do you know if you accidentally made a function slower by changes elsewhere? Unintentional performance regressions are extremely common in my experience: it’s hard to unit test the performance of your code. Over time I have gotten tired of playing the game of “performance whack-a-mole”. Thus, I started hacking together a little weekend project that I’m calling vbench. If someone thinks up a cleverer name, I’m all ears.

Link to pandas benchmarks page produced using vbench

## What is vbench?

vbench is a super-lightweight Python library for running a collection of performance benchmarks over the course of your source repository’s history. Since I’m a GitHub user, it only does git for now, but it could be generalized to support other VCSs. Basically, you define a benchmark:

```python
from datetime import datetime
from vbench.api import Benchmark

common_setup = """
from pandas import *
import pandas.util.testing as tm
import random
import numpy as np
"""

setup = common_setup + """
N = 100000
ngroups = 100

def get_test_data(ngroups=100, n=N):
    unique_groups = range(ngroups)
    arr = np.asarray(np.tile(unique_groups, n / ngroups), dtype=object)

    if len(arr) < n:
        arr = np.asarray(list(arr) + unique_groups[:n - len(arr)],
                         dtype=object)

    random.shuffle(arr)
    return arr

df = DataFrame({'key1' : get_test_data(ngroups=ngroups),
                'key2' : get_test_data(ngroups=ngroups),
                'data' : np.random.randn(N)})

def f():
    df.groupby(['key1', 'key2']).agg(lambda x: x.values.sum())
"""

stmt2 = "df.groupby(['key1', 'key2']).sum()"
bm_groupby2 = Benchmark(stmt2, setup, name="GroupBy test 2",
                        start_date=datetime(2011, 7, 1))
```

Then you write down the information about your repository and how to build any relevant DLLs, etc., that vary from revision to revision:

```python
REPO_PATH = '/home/wesm/code/pandas'
REPO_URL = 'git@github.com:wesm/pandas.git'
DB_PATH = '/home/wesm/code/pandas/gb_suite/benchmarks.db'
TMP_DIR = '/home/wesm/tmp/gb_pandas'

PREPARE = """
python setup.py clean
"""
BUILD = """
python setup.py build_ext --inplace
"""
START_DATE = datetime(2011, 3, 1)
```

Then you pass this info, plus a list of your benchmark objects, to the BenchmarkRunner class:

```python
runner = BenchmarkRunner(benchmarks, REPO_PATH, REPO_URL,
                         BUILD, DB_PATH, TMP_DIR, PREPARE,
                         run_option='eod', start_date=START_DATE)
runner.run()
```

Now, the BenchmarkRunner makes a clone of your repo, then runs all of the benchmarks once for each revision in the repository (or by some other rule; e.g. I’ve set run_option='eod' to take only the last snapshot on each day). It persists the results in a SQLite database so that you can rerun the process and it will skip benchmarks it’s already run (this is key: when you add new benchmarks, only the new ones will be run). Benchmarks are uniquely identified by the MD5 hash of their source code.
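Conceptually, that identity scheme looks like this (a sketch; the exact fields vbench hashes may differ in detail):

```python
import hashlib

def benchmark_id(setup, stmt):
    # A benchmark's identity is a checksum of its source text, so editing
    # the benchmark code automatically invalidates its stored results
    return hashlib.md5((setup + stmt).encode('utf-8')).hexdigest()
```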

This is the resulting plot over time for the above GroupBy benchmark related to some Cython code that I worked on late last week (where I made a major performance improvement in this case):

Here is a fully-formed vbench suite in the pandas git repository.

## Kind of like codespeed and speed.pypy.org?

Before starting to write a new project I looked briefly at codespeed and the excellent work that the PyPy guys have done with speed.pypy.org. But then I started thinking, you know, Django web application, JSON requests to upload benchmark results? Seemed like far too much work to do something relatively simple. The dealbreaker is that codespeed is just a web application. It doesn’t actually (to my knowledge, someone correct me if I’m wrong?) have any kind of a framework for orchestrating the running of benchmarks throughout your code history. That is what this new project is for. I actually see a natural connection between vbench and codespeed, all you need to do is write a script to upload your vbench results to a codespeed web app!

At some point I’d like to build a simple web front end or wx/Qt viewer for the generated vbench database. I’ve never done any JavaScript, but it would be a good opportunity to learn. Knowing me, I might break down and hack out a stupid little wxPython app with an embedded matplotlib widget anyway.

Anyway, I’m really excited about this project. It’s very prototype-y at the moment but I tried to design it in a nice and extensible way. I also plan to put all my git repo analysis tools in there (like code churn graphs etc.) so it should become a nice little collection of tools.

## Other ideas for extending

• Dealing with API changes
• Multiple library versions (e.g. NumPy) or Python versions

## PyHPC 2011 Pre-print paper on pandas

I’ll be presenting this paper in a couple of weeks at SC11. It’s the most up-to-date white paper about pandas.

pandas: a Foundational Python Library for Data Analysis and Statistics

## Python for Financial Data Analysis with pandas

I taught a guest lecture tonight to the Baruch MFE program about using Python and pandas for financial data analysis. Many thanks to Alain Ledon and Norman Kabir for inviting me to teach the class. Here are the slides from the first 40 minutes:

I spent the remaining 90 minutes or so going through a fairly epic whirlwind tour of some of the most important nuts-and-bolts features of pandas for working with time series and other kinds of financial data. It’s been a while since I’ve done a talk like this; I usually am talking to a more general audience that doesn’t want to hear about quant finance! I did a number of fun applications:

• Computing rolling 3 year betas to the S&P 500
• Grouped regression of MSFT on AAPL returns by year and month
• Plotting heatmap of daily and monthly stock correlation matrix (this one got some “ooh”s and “ahh”s)
• Minimum tracking error portfolio optimization problem and lite backtest
Here is the full PDF of the IPython notebook, and a zip file of the notebook and supporting files themselves. Note that the notebook requires a recent GitHub snapshot of statsmodels. I also use the quadratic programming solver in CVXOPT for the portfolio optimization application; if you don’t have CVXOPT, you can change the code to use the brute-force BFGS optimizer (though it’s about 50x slower).
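To give a flavor of the first application above, a rolling beta is just a rolling covariance divided by a rolling market variance. This sketch uses today’s pandas API (`.rolling`, not the 2011-era `rolling_*` functions) and simulated returns rather than real S&P 500 data:

```python
import numpy as np
import pandas as pd

# Simulated daily returns: a "market" series and a stock with true beta 1.5
idx = pd.date_range('2000-01-03', periods=1000, freq='B')
rs = np.random.RandomState(42)
mkt = pd.Series(rs.randn(1000) * 0.01, index=idx)
stock = 1.5 * mkt + pd.Series(rs.randn(1000) * 0.005, index=idx)

window = 250  # roughly one trading year of business days
beta = stock.rolling(window).cov(mkt) / mkt.rolling(window).var()
```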

See my prior blog post with some more details about installing and getting set up with the IPython notebook.

## Python, R, and the allure of magic

R is much more magical than Python. What do I mean by this? In R, things like this are a part of everyday life:

```r
> a <- rnorm(10)
> b <- rnorm(10)
> cbind(a, b)
               a          b
 [1,]  0.8729978  0.5170078
 [2,] -0.6885048 -0.4430447
 [3,]  0.4017740  1.8985843
 [4,]  2.1088905 -1.4121763
 [5,]  0.9375273  0.4703302
 [6,]  0.5558276 -0.5825152
 [7,] -2.1606252  0.7379874
 [8,] -0.7651046 -0.4534345
 [9,] -4.2604901  0.9561077
[10,]  0.3940632 -0.8331285
```

If you’re a seasoned Python programmer, you might have the sort of visceral negative reaction that I do to this. Seriously, just where in the hell did those variable names come from? So when I say magic here I’m talking about abusing the language’s parser. There is nothing special about R that makes the above behavior possible, but rather taking a fundamentally different design philosophy to, say, Python. As any Python programmer knows: Explicit is better than implicit. I happen to agree. There is also a bit of a semantic difference in R versus Python in that assignment in R typically copies data, whereas variables in Python are simply references (labels) for a particular object. So you could make the argument that the names a and b above are more strongly linked to the underlying data.

While building pandas over the last several years, I occasionally grapple with issues like the above. Maybe I should just break from Python ethos and embrace magic? I mean, how hard would it be to get the above behavior in Python? Python gives you stack frames and the ast module after all. So I went down the rabbit hole and wrote this little code snippet:
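The snippet itself didn’t survive in this copy of the post, so here is a hypothetical sketch of the idea (not the original code): peek at the caller’s stack frame, grab the source line of the call, and recover the text of each argument expression from the AST. It only handles single-line calls and requires the source to be available.

```python
import ast
import inspect

def merge(*args):
    # Look one frame up for the source line that called merge(...)
    frame = inspect.currentframe().f_back
    context = inspect.getframeinfo(frame).code_context
    if context is None:  # e.g. in a REPL: fall back to positional names
        return {'X%d' % i: v for i, v in enumerate(args)}
    line = ''.join(context).strip()
    tree = ast.parse(line)
    # Find the merge(...) call node within the parsed line
    call = next(node for node in ast.walk(tree)
                if isinstance(node, ast.Call)
                and getattr(node.func, 'id', None) == 'merge')
    # Recover each argument's source text, e.g. 'a' or 'np.log(b)'
    names = [ast.get_source_segment(line, arg) for arg in call.args]
    return dict(zip(names, args))
```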

While this is woefully unpythonic, it’s also kind of cool:

```python
In [27]: merge(a, b)
Out[27]:
                   a         b
2000-01-03 -1.35      0.8398
2000-01-04  0.999    -1.617
2000-01-05  0.2537    1.433
2000-01-06  0.6273   -0.3959
2000-01-07  0.7963   -0.789
2000-01-10  0.004295 -1.446
```

This can even parse and format more complicated expressions (harder than it looks, because you have to walk the whole AST):

```python
In [30]: merge(a, np.log(b))
Out[30]:
                  a        np.log(b)
2000-01-03  0.6243   0.7953
2000-01-04  0.3593  -1.199
2000-01-05  2.805   -1.059
2000-01-06  0.6369  -0.9067
2000-01-07 -0.2734   NaN
2000-01-10 -1.023    0.3326
```

Now, I am *not* suggesting we do this any time soon. I’m going to prefer the explicit approach (cf. the Zen of Python) any day of the week:

```python
In [32]: DataFrame({'a' : a, 'log(b)' : np.log(b)})
Out[32]:
                  a        log(b)
2000-01-03  0.6243   0.7953
2000-01-04  0.3593  -1.199
2000-01-05  2.805   -1.059
2000-01-06  0.6369  -0.9067
2000-01-07 -0.2734   NaN
2000-01-10 -1.023    0.3326
```

## Faster time series alignment / joins for pandas, beating R’s xts package

In anticipation of integrating NumPy’s shiny new datetime64 dtype into pandas, I set about writing some faster alignment and merging functions for ordered time series data indexed by datetime64 timestamps. Many people have pointed me to the widely used R xts package as a baseline for highly optimized joining functions.

Anyway, long story short, with a little NumPy- and Cython-fu I think I’ve matched or beaten xts for almost all of its supported join types by up to 40% (left/outer/inner) using the merge.xts function.

In a blog article earlier today I wrote about some of the performance problems I had to address to do this. The rest of the joining code is pretty straightforward Cython code. Though it’ll probably be a few weeks before this new code gets incorporated into DataFrame.join. You’ll just have to take my word for it that I’m doing an apples-to-apples comparison (or read the source yourself) =)
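The flavor of the alignment work can be sketched in plain NumPy (a simplified stand-in for the Cython routines, assuming sorted, unique int64 indexes):

```python
import numpy as np

def outer_join_indexers(left, right):
    # For two sorted, unique int64 indexes: compute the union plus, for
    # each side, the position of every union element in that index
    # (-1 where absent). Taking rows with these indexers and filling the
    # -1 slots with NaN yields the outer join
    union = np.union1d(left, right)

    left_indexer = np.searchsorted(left, union)
    left_indexer[~np.isin(union, left)] = -1

    right_indexer = np.searchsorted(right, union)
    right_indexer[~np.isin(union, right)] = -1
    return union, left_indexer, right_indexer
```

The real routines fuse this with the row-copying step in Cython; doing it in one pass is exactly the spaghetti-avoidance tradeoff mentioned above.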

### Python benchmarks

Here are the Python timings in milliseconds for joining two time series data sets. The column labels are the lengths (in scientific notation, from 100 through 1,000,000). The two tables show timings for joining two univariate time series and for joining two collections of 5 time series.

EDIT (9/24/2011): after corresponding with the xts author, Jeff Ryan, I reran the benchmarks with the code modified to ensure that garbage collection time isn’t being included in the runtime. The results after the change to the benchmark are less disparate than before. I also tweaked the Cython algos to determine the outer/inner join time index and re-ran the benchmarks. In the 1e6 outer join case the new algo trimmed 8 ms off, 4-5ms in the inner join case. Whenever I develop a strong desire to hack up a pile of spaghetti-like Cython code (combining the index union/intersection routines with the take / row-copying code) I can probably shave off another few millis…

Python benchmark on 9/25 (mean runtimes in milliseconds):

Joining two univariate time series:

|       | 1e2     | 1e3     | 1e4    | 1e5   | 1e6   |
|-------|---------|---------|--------|-------|-------|
| outer | 0.0605  | 0.0637  | 0.1966 | 1.898 | 26.26 |
| left  | 0.0187  | 0.02282 | 0.1157 | 1.023 | 13.89 |
| inner | 0.04526 | 0.05052 | 0.1523 | 1.382 | 22.25 |

Joining two 5-variate time series:

|       | 1e2     | 1e3     | 1e4    | 1e5   | 1e6   |
|-------|---------|---------|--------|-------|-------|
| outer | 0.07944 | 0.0638  | 0.3178 | 6.498 | 67.46 |
| left  | 0.0255  | 0.03512 | 0.2467 | 4.711 | 51.88 |
| inner | 0.06176 | 0.05262 | 0.2283 | 5.267 | 56.46 |

EDIT 9/28: I put in some work integrating the new merging routines throughout DataFrame and friends in pandas and added a new Int64Index class to facilitate fast joining of time series data. Here are the updated benchmarks, which now have pandas a bit slower than xts for outer/inner joins in the univariate case but still significantly faster in the multivariate case:

Python benchmark on 9/28, post-integration:

Joining two univariate time series:

|       | 1e2    | 1e3    | 1e4    | 1e5   | 1e6   |
|-------|--------|--------|--------|-------|-------|
| outer | 0.4501 | 0.314  | 0.5362 | 3.162 | 30.78 |
| left  | 0.2742 | 0.2879 | 0.408  | 2.025 | 19.84 |
| inner | 0.2715 | 0.2863 | 0.4306 | 2.504 | 26.64 |

Joining two 5-variate time series:

|       | 1e2    | 1e3    | 1e4    | 1e5   | 1e6   |
|-------|--------|--------|--------|-------|-------|
| outer | 0.4507 | 0.3375 | 0.6158 | 7.028 | 71.34 |
| left  | 0.2959 | 0.3184 | 0.4927 | 5.068 | 55.2  |
| inner | 0.2767 | 0.305  | 0.5368 | 5.782 | 59.65 |

As you can see in the 1 million row case there is an additional 4-5 ms of overhead across the board which largely has to do with handling types other than floating point. With some effort I could eliminate this overhead but I’m going to leave it for now.

And the source code for the benchmark:

```python
import gc
import time

import numpy as np
from pandas import *

def bench_python(pct_overlap=0.20, K=1):
    ns = [2, 3, 4, 5, 6]
    iterations = 50
    kinds = ['outer', 'left', 'inner']

    all_results = {}
    for logn in ns:
        n = 10 ** logn
        a = np.arange(n, dtype=np.int64)
        b = np.arange(n * pct_overlap, n * pct_overlap + n, dtype=np.int64)
        a_frame = DataFrame(np.random.randn(n, K), index=a, columns=range(K))
        b_frame = DataFrame(np.random.randn(n, K), index=b,
                            columns=range(K, 2 * K))
        all_results[logn] = result = {}
        for kind in kinds:
            gc.disable()
            _s = time.clock()
            # do the join
            for _ in range(iterations):
                a_frame.join(b_frame, how=kind)
            elapsed = time.clock() - _s
            gc.enable()
            result[kind] = (elapsed / iterations) * 1000
    return DataFrame(all_results, index=kinds)
```

### R/xts benchmarks

And the R benchmark using xts. The results for the smaller datasets are unreliable due to the low precision of system.time.

R benchmark (mean runtimes in milliseconds):

Joining two univariate time series:

|       | 1e2  | 1e3  | 1e4  | 1e5  | 1e6   |
|-------|------|------|------|------|-------|
| outer | 0.30 | 0.26 | 0.48 | 3.12 | 28.58 |
| left  | 0.22 | 0.24 | 0.36 | 2.78 | 24.18 |
| inner | 0.20 | 0.24 | 0.40 | 2.42 | 21.06 |

Joining two 5-variate time series:

|       | 1e2  | 1e3  | 1e4  | 1e5   | 1e6   |
|-------|------|------|------|-------|-------|
| outer | 0.26 | 0.46 | 1.30 | 11.56 | 97.02 |
| left  | 0.34 | 0.28 | 1.06 | 10.04 | 85.72 |
| inner | 0.30 | 0.28 | 0.94 | 8.02  | 67.22 |

The Python code for the benchmark is all found here.

Here is the R code for the benchmark (GitHub link):

```r
library(xts)

iterations <- 50

ns <- c(100, 1000, 10000, 100000, 1000000)
kinds <- c("outer", "left", "inner")

result <- matrix(0, nrow=3, ncol=length(ns))
pct.overlap <- 0.2
k <- 1

for (ni in 1:length(ns)) {
  n <- ns[ni]
  rng1 <- 1:n
  offset <- as.integer(n * pct.overlap)
  rng2 <- rng1 + offset
  x <- xts(matrix(rnorm(n * k), nrow=n, ncol=k),
           as.POSIXct(Sys.Date()) + rng1)
  y <- xts(matrix(rnorm(n * k), nrow=n, ncol=k),
           as.POSIXct(Sys.Date()) + rng2)
  timing <- numeric()
  for (i in 1:3) {
    kind <- kinds[i]
    for (j in 1:iterations) {
      gc()  # just to be sure
      timing[j] <- system.time(merge(x, y, join=kind))[3]
    }
    result[i, ni] <- mean(timing) * 1000
  }
}

rownames(result) <- kinds
colnames(result) <- log10(ns)
```