Formatting DataFrame as HTML

Recently, a pandas user, Wouter Overmeire, contributed a to_html function to DataFrame in a pull request. It’s very handy:

In [31]: df
Out[31]:
                        ls    lsc   pop   ccode
year cname     agefrom                        
1950 Australia 15       64.3  15.4  558   AUS  
               40       38.9  20.1  555   AUS  
               65       24.7  10.7  273   AUS  
1955 Australia 25       48.4  26.2  705   AUS  
               50       34    16.6  478   AUS  
               75       26.3  11.7  240   AUS  
1960 Australia 35       47.9  24.7  755   AUS  
               60       29.6  13.5  389   AUS  
1965 Australia 20       57.7  31.1  832   AUS  
               45       41.3  20.3  665   AUS  
               70       30.2  12.9  273   AUS  
1970 Australia 30       57.7  31.1  789   AUS  
               55       42.1  19.4  619   AUS  
1975 Australia 15       91.1  23.1  1225  AUS  
               40       56.3  30.9  744   AUS  
               65       48.8  22.5  440   AUS  
1980 Australia 25       74    44.5  1205  AUS  
               50       60.2  32.8  774   AUS  
               75       62.5  30.8  500   AUS  
1985 Australia 35       72.7  42.8  1180  AUS  

In [32]: df.to_html()

which outputs:


ls lsc pop ccode
year cname agefrom
1950 Australia 15 64.3 15.4 558 AUS
1955 Australia 50 34 16.6 478 AUS
1965 Australia 20 57.7 31.1 832 AUS
1970 Australia 55 42.1 19.4 619 AUS
1980 Australia 25 74 44.5 1205 AUS
1985 Australia 60 59.1 30.8 685 AUS
1995 Australia 30 72 45.7 1430 AUS
2000 Australia 65 60 31.5 663 AUS
2010 Australia 35 59.9 39.1 1506 AUS
1950 Austria 70 2.8 1.9 218 AUT
1960 Austria 40 4.5 3 324 AUT
1965 Austria 75 13 8.6 322 AUT
1975 Austria 45 42.1 28 455 AUT
1985 Austria 15 22.4 4.2 626 AUT
1990 Austria 50 47.3 23.4 448 AUT
2000 Austria 20 73.2 63.2 465 AUT
2005 Austria 55 55.1 36.2 486 AUT

It also renders a tidy representation with hierarchical columns:

In [14]: table
Out[14]:
  sex     Female          Male          
  smoker  No      Yes     No      Yes  
day                                    
Fri       0.1653  0.2091  0.138   0.1447
Sat       0.148   0.1638  0.1621  0.1391
Sun       0.1657  0.2371  0.1583  0.174
Thur      0.156   0.1631  0.1657  0.1644


sex
smoker
Female
No
Female
Yes
Male
No
Male
Yes
day
Fri 0.1653 0.2091 0.138 0.1447
Sat 0.148 0.1638 0.1621 0.1391
Sun 0.1657 0.2371 0.1583 0.174
Thur 0.156 0.1631 0.1657 0.1644

Talk at Rice Stats on structured data analysis, pandas, 11/21/2011

I had the privilege of speaking today at Rice at Hadley Wickham’s (of R fame) invitation. I talked broadly about the problems faced in structured data manipulation problems and how I’ve worked to address them in pandas. I enjoyed speaking with Hadley (whom I’d never met before in person) about these problems as we’ve come up with nice solutions to many of them independently, and in different languages (R and Python, respectively). A lot of folks assume that parts of pandas are inspired by his plyr and reshape2 packages, but I’ve only started playing around with them recently. I think we all have a lot we can learn from each other on the road toward building even better high performance, easy-to-use, and expressive data manipulation tools.

pandas talk at PyHPC 2011 workshop in SC11, thoughts on hash tables

Here are the slides from my talk at PyHPC2011. Not really my usual talk for data crowds– a little more nuts and bolts about some of the indexing and GroupBy implementation details. Some people might be interested in the illustrative data alignment benchmarks which show the relative weakness of Python’s dict implementation (in both speed and memory usage) for lookups and alignment. After these benchmarks I think it’s pretty much inevitable that I’m going to end up writing a custom hash table implementation in C for the data alignment on primitive types. Now, if I wanted a threadsafe hash table that I could use OpenMP on, that would be a serious undertaking. Anyone want to help?

The basic problem is that Python dicts are not designed for my use case– namely very large dicts that I use to perform data alignment operations.

PyHPC2011

Filtering out duplicate pandas.DataFrame rows

Sean Taylor recently alerted me to the fact that there wasn’t an easy way to filter out duplicate rows in a pandas DataFrame. R has the duplicated function which serves this purpose quite nicely. The R method’s implementation is kind of kludgy in my opinion (from “The data frame method works by pasting together a character representation of the rows”), but in any case I set about writing a Python version from first principles.

So a drop_duplicates method should be able to either consider a subset of the columns or all of the columns for determining which are “duplicates”. It occurred to me that a reasonably fast and efficient way to do this was to use GroupBy. You have to know a bit about the internals, but it’s fairly straightforward otherwise:

grouped = df.groupby(cols_to_consider)
index = [gp_keys[0] for gp_keys in grouped.groups.values()]
unique_df = df.reindex(index)

For a faster and more direct implementation in Cython, I decided to use a list of tuples and a dict to keep track of whether a row has been “seen” or not:

def duplicated(list values, take_last=False):
    cdef:
        Py_ssize_t i, n
        dict seen = {}
        object row

    n = len(values)
    cdef ndarray[uint8_t] result = np.zeros(n, dtype=np.uint8)

    if take_last:
        for i from n > i >= 0:
            row = values[i]
            if row in seen:
                result[i] = 1
            else:
                seen[row] = None
                result[i] = 0
    else:
        for i from 0 <= i < n:
            row = values[i]
            if row in seen:
                result[i] = 1
            else:
                seen[row] = None
                result[i] = 0

    return result.view(np.bool_)

Aside: the uint8 and casting to np.bool_ thing is there because the buffer interface doesn’t work quite right with boolean arrays in Python 2.5. So as soon as I drop Python 2.5 compatibility (probably around the end of the year), I’ll go through and fix that stuff up. As a second aside, using a dict with dummy keys was coming out a bit faster than using a set in Cython, for reasons unknown to me.

So, on the Python side, the new DataFrame function just takes the boolean vector returned by that Cython function and uses it to select out the rows:

In [21]: df
Out[21]:
    A  B  C
0   a  c  0
1   b  c  1
2   c  b  2
3   d  a  0
4   e  c  1
5   a  a  2
6   b  b  0
7   c  b  1
8   d  b  2
9   e  b  0
10  a  a  1
11  b  a  2
12  c  c  0
13  d  a  1
14  e  c  2

In [22]: df.drop_duplicates('A')
Out[22]:
   A  B  C
0  a  c  0
1  b  c  1
2  c  b  2
3  d  a  0
4  e  c  1

In [23]: df.drop_duplicates(['A', 'B'])
Out[23]:
    A  B  C
0   a  c  0
1   b  c  1
2   c  b  2
3   d  a  0
4   e  c  1
5   a  a  2
6   b  b  0
8   d  b  2
9   e  b  0
11  b  a  2
12  c  c  0

Not passing any particular column or columns is the same as passing all of them.

I was impressed by the performance:

In [26]: from pandas.util.testing import rands

In [27]: k1 = [rands(1) for _ in xrange(100000)]

In [28]: k2 = [rands(1) for _ in xrange(100000)]

In [29]: df = DataFrame({'A' : k1, 'B' : k2, 'C' : np.tile(np.arange(100), 1000)})

In [30]: timeit df.drop_duplicates(['A', 'B'])
100 loops, best of 3: 18.2 ms per loop

In [31]: timeit df.drop_duplicates()
10 loops, best of 3: 49.8 ms per loop

# the groupby method
In [32]: timeit df.reindex([v[0] for v in df.groupby(['A', 'B']).groups.values()])
10 loops, best of 3: 56.5 ms per loop

For reference, it’s roughly twice as fast as the equivalent R code (though I’m still using R 2.11.1…time to upgrade to 2.14!), but perhaps not too shocking as the R function is converting each row into a string first.

N <- 100000

k1 = rep(NA, N)
k2 = rep(NA, N)
for (i in 1:N){
  k1[i] <- paste(sample(letters, 1), collapse="")
  k2[i] <- paste(sample(letters, 1), collapse="")
}
df <- data.frame(a=k1, b=k2, c=rep(1:100, N / 100))
df2 <- data.frame(a=k1, b=k2)

timings <- numeric()
timings2 <- numeric()
for (i in 1:50) {
  gc()
  timings[i] = system.time(deduped <- df[!duplicated(df),])[3]
  gc()
  timings2[i] = system.time(deduped <- df[!duplicated(df[,c("a", "b")]),])[3]
}

> mean(timings)
[1] 0.09312
> mean(timings2)
[1] 0.04312

PyHPC 2011 Pre-print paper on pandas

I’ll be presenting this paper in a couple of weeks at SC11. It’s the most up-to-date white paper about pandas.

PDF Download Link

pandas: a Foundational Python Library for Data Analysis and Statistics

Fast and easy pivot tables in pandas 0.5.0

Over the last several months, I’ve invested a great deal in the GroupBy and indexing infrastructure of pandas. GroupBy in particular could still use more work to be made even higher performance, but even for quite large data sets, most users will find it more than satisfactory compared with other Python alternatives (which are mainly DIY approaches these days) or even in other data analysis packages, e.g. those in R, which I’ll talk a bit about in this article.

An easy application of GroupBy is creating pivot tables from a tabular data set. This involves aggregating a data set on some number of keys, then reshaping it to form a 2D table with the stratified aggregated values. I added a new function “pivot_table“ which provides a high level interface for creating pivot tables. Let’s consider the tips data set used in some examples of the Hadley Wickham’s excellent R reshape2 package:

In [1]: import pandas.rpy.common as com

In [2]: tips = com.load_data('tips', package='reshape2')

In [3]: tips['tip_pct'] = tips['tip'] / tips['total_bill']

In [4]: tips.head()
Out[4]:
   total_bill  tip   sex     smoker  day  time    size  tip_pct
1  16.99       1.01  Female  No      Sun  Dinner  2     0.05945
2  10.34       1.66  Male    No      Sun  Dinner  3     0.1605
3  21.01       3.5   Male    No      Sun  Dinner  3     0.1666
4  23.68       3.31  Male    No      Sun  Dinner  2     0.1398
5  24.59       3.61  Female  No      Sun  Dinner  4     0.1468

Since most readers won’t have rpy2, you can get the data file here and read it with:

tips = read_csv('tips.csv')

All of this new functionality is available now on GitHub but will be a part of the upcoming pandas 0.5.0 release. Note that I am only using rpy2 here to pull the R data set into my Python sesson– all of the below code is implemented in pure Python (with a little Cython magic to make things fast).

Pivot table basics

To use pivot_table, pass the pandas DataFrame and specify the column you wish to aggregate and the columns to group on the rows and columns of the table. Note the default aggregation is mean, though this can be specified:

In [9]: table = pivot_table(tips, 'tip_pct', rows=['time', 'sex'],
                            cols='smoker')

In [10]: table
Out[10]:
  smoker       No      Yes  
time   sex                  
Dinner Female  0.1568  0.1851
       Male    0.1594  0.1489
Lunch  Female  0.1571  0.1753
       Male    0.1657  0.1667

In [17]: pivot_table(tips, 'tip_pct', rows=['day', 'time'],
                     cols='sex')
Out[17]:
  sex        Female  Male  
day  time                  
Fri  Dinner  0.1991  0.1302
     Lunch   0.1997  0.1741
Sat  Dinner  0.1565  0.1516
Sun  Dinner  0.1816  0.1623
Thur Dinner  0.1597  NaN  
     Lunch   0.1575  0.1653

Note that the returned object is still a DataFrame, so you can use all of the standard indexing technology to select portions of the resulting table:

In [22]: table.ix['Dinner']
Out[22]:
  smoker  No      Yes  
sex                    
Female    0.1568  0.1851
Male      0.1594  0.1489

If you don’t specify a particular values column, it will try to aggregate all the other columns:

In [24]: pivot_table(tips, rows=['sex', 'smoker'])
Out[24]:
               size   tip    tip_pct  total_bill
sex    smoker                                  
Female No      2.593  2.774  0.1569   18.11    
       Yes     2.242  2.932  0.1822   17.98    
Male   No      2.711  3.113  0.1607   19.79    
       Yes     2.5    3.051  0.1528   22.28

Other aggregations can be performed, like using len to get group counts:

In [26]: pivot_table(tips, 'tip_pct', rows='sex', cols='smoker', aggfunc=len)
Out[26]:
  smoker  No  Yes
sex              
Female    54  33
Male      97  60

Once you’ve done the aggregation, you’re free to use the built-in reshaping capability to reshape the data in the table:

In [29]: table = pivot_table(tips, 'tip_pct', rows=['sex', 'day'],
                             cols='smoker', aggfunc=len)

In [30]: table
Out[30]:
  smoker     No  Yes
sex    day          
Female Fri   2   7  
       Sat   13  15
       Sun   14  4  
       Thur  25  7  
Male   Fri   2   8  
       Sat   32  27
       Sun   43  15
       Thur  20  10

In [31]: table.unstack('sex')
Out[31]:
  smoker  No            Yes        
  sex     Female  Male  Female  Male
day                                
Fri       2       2     7       8  
Sat       13      32    15      27  
Sun       14      43    4       15  
Thur      25      20    7       10

If you know a bit more about groupby, you can get clever and pass a dict of functions to perform multiple aggregations at once:

In [59]: pivot_table(tips, rows=['sex', 'smoker'],
                     aggfunc={'tip_pct' : 'mean', 'size' : 'sum'})
Out[59]:
               size  tip_pct
sex    smoker              
Female No      140   0.1569
       Yes     74    0.1822
Male   No      263   0.1607
       Yes     150   0.1528

I thought that was pretty cool.

Comparison with reshape2

R’s reshape2 package relies on two functions, melt and cast, to do the reshaping. Its implementation is, at least in my opinion, a slightly more ad hoc approach than in pandas (the actual pivot_table function is only about 10 lines and basically falls out of the groupby and hierarchical-indexing based reshaping). However, it can take advantage of R’s built-in formulas which makes the syntax very intuitive. Here are some of the same examples using reshape2:

> tips$tip.pct <- tips$tip / tips$total_bill
> mtips = melt(tips, id=c("sex", "smoker", "day", "time"))
> acast(mtips, day ~ sex + smoker, length, subset=.(variable=="tip"))
     Female_No Female_Yes Male_No Male_Yes
Fri          2          7       2        8
Sat         13         15      32       27
Sun         14          4      43       15
Thur        25          7      20       10

> acast(mtips, sex + smoker ~ variable, mean)
           total_bill      tip     size   tip.pct
Female_No    18.10519 2.773519 2.592593 0.1569210
Female_Yes   17.97788 2.931515 2.242424 0.1821504
Male_No      19.79124 3.113402 2.711340 0.1606687
Male_Yes     22.28450 3.051167 2.500000 0.1527712

R lacks hierarchical indexing as far as I know, so reshape2 munges the group labels together into a row label.

In R there is also the xtabs function for doing cross-tabulation:

> ftable(xtabs(size ~ time + sex + smoker + day, data=tips))
                     day Fri Sat Sun Thur
time   sex    smoker                    
Dinner Female No           2  30  43    2
              Yes          8  33  10    0
       Male   No           4  85 124    0
              Yes         12  71  39    0
Lunch  Female No           3   0   0   60
              Yes          6   0   0   17
       Male   No           0   0   0   50
              Yes          5   0   0   23

ftable just creates a flat table which can be more easily viewed. This table could be created with pandas by:

In [54]: pivot_table(tips, 'size', rows=['time', 'sex', 'smoker'],
   ....:             cols='day', aggfunc=np.sum, fill_value=0)
Out[54]:
  day                 Fri  Sat  Sun  Thur
time   sex    smoker                    
Dinner Female No      2    30   43   2  
              Yes     8    33   10   0  
Dinner Male   No      4    85   124  0  
              Yes     12   71   39   0  
Lunch  Female No      3    0    0    60  
              Yes     6    0    0    17  
Lunch  Male   No      0    0    0    50  
              Yes     5    0    0    23

Performance

pandas is very fast as I’ve invested a great deal in optimizing the indexing infrastructure and other core algorithms related to things such as this. I don’t have a lot of points of comparison, but here is a simple benchmark of reshape2 versus pandas.pivot_table on a data set with 100000 entries and 25 groups. Here is the R code for the benchmark:

library(reshape2)
n <- 100000
a.size <- 5
b.size <- 5
data <- data.frame(a=sample(letters[1:a.size], n, replace=T),
                   b=sample(letters[1:b.size], n, replace=T),
                   c=rnorm(n),
                   d=rnorm(n))
timings <- numeric()

for (i in 1:10) {
  gc()
  tim <- system.time(acast(melt(data, id=c("a", "b")), a + b ~ variable, mean))
  timings[i] = tim[3]
}
mean(timings)

Here is what the actual table looks like:

> table <- acast(melt(data, id=c("a", "b")), a + b ~ variable, mean)
> head(table)
               c            d
a_a  0.002761963 -0.008793516
a_b -0.004618980 -0.026978744
a_c  0.022491767 -0.002163986
a_d  0.005027362 -0.002532782
a_e  0.006146438  0.031699238
b_a -0.025094899  0.003334301

Here’s the equivalent Python code (sans timings, which I’ll do in IPython):

from pandas import *
import string
n = 100000
asize = 5
bsize = 5

letters = np.asarray(list(string.letters), dtype=object)

data = DataFrame(dict(foo=letters[:asize][np.random.randint(0, asize, n)],
                      bar=letters[:bsize][np.random.randint(0, bsize, n)],
                      baz=np.random.randn(n),
                      qux=np.random.randn(n)))

table = pivot_table(data, rows=['foo', 'bar'])

Similarly:

In [36]: table = pivot_table(data, rows=['foo', 'bar'])

In [37]: table[:10]
Out[37]:
         baz       qux      
foo bar                    
a   a   -0.02162   0.02186  
    b   -0.002155  0.01173  
    c   -0.002331  0.006608
    d   -0.01734   0.01109  
    e    0.02837   0.01093  
b   a    0.01703  -0.005845
    b   -0.001043  0.01283  
    c   -0.006015 -0.0005866
    d    0.009683  0.02211  
    e    0.02347  -8.346e-05

And the R timing

> mean(timings)
[1] 0.42

And pandas:

In [38]: timeit table = pivot_table(data, rows=['foo', 'bar'])
10 loops, best of 3: 117 ms per loop

So it’s about 3-4x faster than reshape2. This obviously depends on the data set and how you structure the aggregation. Here’s a couple slightly different benchmarks on the same data:

> acast(melt(data, id=c("a", "b")), a~b, mean, subset=.(variable=="c"))
             a            b            c           d            e
a  0.002761963 -0.004618980  0.022491767 0.005027362  0.006146438
b -0.025094899  0.015966167 -0.007663059 0.006795551  0.006648496
c  0.012273797  0.006139820 -0.019169968 0.009035374 -0.006079683
d -0.014174747 -0.009844566  0.022450738 0.025967638 -0.006572791
e -0.003623382  0.005924704  0.008156482 0.013288116  0.009839315

Timing
> mean(timings)
[1] 0.3036

and pandas:

In [41]: pivot_table(data, 'baz', rows='foo', cols='bar')
Out[41]:
  bar  a         b          c         d         e      
foo                                                    
a     -0.02162  -0.002155  -0.002331 -0.01734   0.02837
b      0.01703  -0.001043  -0.006015  0.009683  0.02347
c      0.01561   0.0003992  0.01451  -0.01046  -0.00368
d      0.003504 -0.01437   -0.03265   0.003471 -0.004638
e     -0.01656  -0.01276   -0.02481   0.008629  0.011  

In [42]: timeit pivot_table(data, 'baz', rows='foo', cols='bar')
10 loops, best of 3: 55 ms per loop

I suspect one source of slowdown reshape/reshape2 approach is that the melt operation is very expensive, as is variable subsetting. However, it enables easier computation of group margins, something which I have not yet addressed (but will, eventually) in pandas.

Performance quirk: making a 1D object ndarray of tuples

Getting a 1-dimensional ndarray of object dtype containing Python tuples is, unless I’m missing something, rather difficult. Take this simple example:

In [1]: tuples = zip(range(100000), range(100000))
In [2]: arr = np.array(tuples, dtype=object, ndmin=1)
In [3]: arr
Out[3]:
array([[0, 0],
       [1, 1],
       [2, 2],
       ...,
       [99997, 99997],
       [99998, 99998],
       [99999, 99999]], dtype=object)

In [5]: arr.ndim
Out[5]: 2

OK, that didn’t work so well. The only way I’ve figured out how to get what I want is:

In [6]: arr = np.empty(len(tuples), dtype='O')
In [7]: arr[:] = tuples
In [8]: arr
Out[8]:
array([(0, 0), (1, 1), (2, 2), ..., (99997, 99997), (99998, 99998),
       (99999, 99999)], dtype=object)

Yahtzee. But the kids aren’t alright:

In [9]: timeit arr[:] = tuples
10 loops, best of 3: 133 ms per loop

Maybe it’s just me but that strikes me as being outrageously slow. Someday I’ll look at what’s going on under the hood, but a quickie Cython function comes to the rescue:

def list_to_object_array(list obj):
    '''
    Convert list to object ndarray.
    Seriously can't believe I had to write this function
    '''

    cdef:
        Py_ssize_t i, n
        ndarray[object] arr
    n = len(obj)
    arr = np.empty(n, dtype=object)
    for i from 0 <= i < n:
        arr[i] = obj[i]
    return arr

You would hope this is faster, and indeed it’s about 85x faster:

In [12]: timeit arr = lib.list_to_object_array(tuples)
1000 loops, best of 3: 1.56 ms per loop

Scratching my head here, but I’ll take it. I suspect there might be some object copying going on under the hood, anyone know?

Python for Financial Data Analysis with pandas

I taught a guest lecture tonight to the Baruch MFE program about using Python and pandas for financial data analysis. Many thanks to Alain Ledon and Norman Kabir for inviting me to teach the class. Here are the slides from the first 40 minutes:

I spent the remaining 90 minutes or so going through a fairly epic whirlwind tour of some of the most important nuts and bolts features of pandas for working with time series and other kinds of financial data. Been a while since I’ve done a talk like this– I usually am talking to a more general audience that don’t want to hear about quant finance! I did a number of fun applications:

  • Computing rolling 3 year betas to the S&P 500
  • Grouped regression of MSFT on AAPL returns by year and month
  • Plotting heatmap of daily and monthly stock correlation matrix (this one got some “ooh”s and “ahh”s)
  • Minimum tracking error portfolio optimization problem and lite backtest
  • Here is the full PDF of the IPython notebook and a zip file of the IPython notebook and supporting files themselves. Note that the notebook requires a recent GitHub snapshot of statsmodels. I also use the quadratic programming solver in CVXOPT for the portfolio optimization application: if you don’t have CVXOPT, you can change it to use the brute force BFGS optimizer (though it’s about 50x slower).

    See my prior blog post with some more details about installing and getting set up with the IPython notebook.

    IPython Notebook PDF Output
    IPython Notebook and supporting files

    Speeding up pandas’s file parsers with Cython

    I’ve tried to make the file parsing functions in pandas, read_csv and read_table, as robust (they do the right thing) and fast as possible. What do we really care about?

  • Good performance: can read a CSV file as fast as other statistical computing / data analysis languages, like R
  • Proper type handling: if a column of data has integers, the resulting column will be integer dtype. But you can’t be too aggressive, values like ’1.0′ and ’2.0′ should not get converted to integers.
  • Support for converting strings into datetime objects
  • Support for returning indexed DataFrame (with a simple or hierarchical index)
  • NA value handling. This is tricky as you want to recognize lots of common NA representations (“NA”, “NULL”, etc.) and also allow the user to specify custom NA value strings.
  • While hacking yesterdays with folks at the Data Without Borders event, I realized that boolean values (“True” and “False”) weren’t resulting in an boolean array in the result. Also, I was not satisfied with the performance of the pure Python code. Since I’ve had great results using Cython to create C extensions, it was the natural choice. The results were great: parsing the data set we were looking at at DWB went from about 8 seconds before to 800ms, a full 10x improvement. I also fixed a number of bugs / corner cases with type handling.

    TL;DR pandas.read_csv is a lot faster now

    Here was the speed of read_csv before these changes on a fairly big file (46738×54):

    In [1]: %time df = read_csv('data.csv')
    CPU times: user 7.99 s, sys: 0.09 s, total: 8.08 s
    Wall time: 8.11 s

    This obviously will not do. And here post-Cythonization:

    In [17]: timeit df = read_csv('data.csv')
    1 loops, best of 3: 804 ms per loop

    As a point of comparison, R is pretty speedy but about 2x slower.

    > system.time(df <- read.csv('data.csv', header=T))
       user  system elapsed
      1.660   0.000   1.667

    In fairness I am 100% sure that read.csv is “doing” a lot more, but it shows that I’m at least on the right track.

    I won’t rehash all the code, but there were a number of interesting things along the way.

    The basics: working with NumPy arrays in Cython

    One of the truly beautiful things about programming in Cython is that you can get the speed of working with a C array representing a multi-dimensional array (e.g. double *) without the headache of having to handle the striding information of the ndarray yourself. Also, you can work with non-contiguous arrays and arrays with dtype=object (which are just arrays of PyObject* underneath) with no code changes (!). Cython calls this the buffer interface:

    def sanitize_objects(ndarray[object] values):
        cdef:
            Py_ssize_t i, n
            object val, onan

        n = len(values)
        onan = np.nan

        for i from 0 <= i < n:
            val = values[i]
            if val == '':
                values[i] = onan

    For multi-dimensional arrays, you specify the number of dimensions in the buffer. and pass multiple indexes (Py_ssize_t is the proper C “index type” to use). I’ll demonstrate this in:

    Converting rows to columns faster than zip(*rows)

    A cool Python trick to convert rows to column is:

    In [5]: rows
    Out[5]:
    [(0.39455404791938709, 0.13120015514691319, -0.38366356835950594),
     (-0.744567101498121, -0.9189909692195557, 1.3558711696319314),
     (-0.20933216711571506, 0.36102965753837235, 0.94614438124063927),
     (0.49200559154161844, 0.099177280246717708, -1.2622899429921068),
     (-0.48238271158753454, -0.9414862514454051, -1.0257632509581869)]

    In [6]: zip(*rows)
    Out[6]:
    [(0.39455404791938709,
      -0.744567101498121,
      -0.20933216711571506,
      0.49200559154161844,
      -0.48238271158753454),
     (0.13120015514691319,
      -0.9189909692195557,
      0.36102965753837235,
      0.099177280246717708,
      -0.9414862514454051),
     (-0.38366356835950594,
      1.3558711696319314,
      0.94614438124063927,
      -1.2622899429921068,
      -1.0257632509581869)]

    While zip is very fast (a built-in Python function), the larger problem here is that our target data structure is NumPy arrays to begin with. So it would make sense to write out the rows directly to a 2-dimensional object array:

    def to_object_array(list rows):
        cdef:
            Py_ssize_t i, j, n, k, tmp
            ndarray[object, ndim=2] result
            list row

        n = len(rows)

        # get the maximum row length
        k = 0
        for i from 0 <= i < n:
            tmp = len(rows[i])
            if tmp > k:
                k = tmp

        result = np.empty((n, k), dtype=object)

        for i from 0 <= i < n:
            row = rows[i]

            for j from 0 <= j < len(row):
                result[i, j] = row[j]

        return result

    And lo and behold, this function is significantly faster than the zip trick:

    In [12]: data = map(list, np.random.randn(10000, 10))

    In [13]: timeit zip(*data)
    100 loops, best of 3: 3.66 ms per loop

    In [14]: timeit lib.to_object_array(data)
    1000 loops, best of 3: 1.47 ms per loop

    It’s even more of a big deal if you zip and convert to ndarray:

    In [15]: timeit [np.asarray(x, dtype=object) for x in zip(*data)]
    100 loops, best of 3: 6.72 ms per loop

    Numeric conversion: floats, ints, and NA’s, oh my

    When converting the Python strings to numeric data, you must:

  • Check that the value is not among the set of NA values
  • Handle data like ['1', '2', '3.5', '4'] where you may have “integer-like” data until you observe a floating point value
  • Unfortunately, code for this sort of thing ends up looking like a state machine 99% of the time, but at least it’s fairly tidy in Cython and runs super fast:

    def maybe_convert_numeric(ndarray[object] values, set na_values):
        cdef:
            Py_ssize_t i, n
            ndarray[float64_t] floats
            ndarray[int64_t] ints
            bint seen_float = 0
            object val
            float64_t fval

        n = len(values)

        floats = np.empty(n, dtype='f8')
        ints = np.empty(n, dtype='i8')

        for i from 0 <= i < n:
            val = values[i]

            if cpython.PyFloat_Check(val):
                floats[i] = val
                seen_float = 1
            elif val in na_values:
                floats[i] = nan
                seen_float = 1
            elif val is None:
                floats[i] = nan
                seen_float = 1
            elif len(val) == 0:
                floats[i] = nan
                seen_float = 1
            else:
                fval = float(val)
                floats[i] = fval
                if not seen_float:
                    if '.' in val:
                        seen_float = 1
                    else:
                        ints[i] = <int64_t> fval

        if seen_float:
            return floats
        else:
            return ints

    Adopting the Python philosophy that it’s “easier to ask forgiveness than permission” if float conversion ever fails, the exception will get raised and the code will just leave the column as dtype=object. And this function would obviously have problems with European decimal format– but I’m not willing to compromise performance in 99% cases for the sake of the 1% cases. It will make sense to write a slower function that also handles a broader variety of formatting issues.

    The pandas escaped the zoo: Python’s pandas vs. R’s zoo benchmarks

    generic pandas data alignment is about 10-15x faster than the #rstats zoo package in initial tests. interesting #python
    @wesmckinn
    Wes McKinney

    I tweeted that yesterday and figured it would be prudent to justify that with some code and real benchmarks. I’m really proud of pandas’s performance after investing years of development building a tool that is both easy-to-use and fast. So here we go.

    The test case

    The basic set-up is: you have two labeled vectors of different lengths and you add them together. The algorithm matches the labels and adds together the corresponding values. Simple, right?

    R/zoo benchmarks

    Here’s the R code:

    library(zoo)

    indices = rep(NA, 100000)
    for (i in 1:100000)
      indices[i] <- paste(sample(letters, 10), collapse="")

    timings <- numeric()

    x <- zoo(rnorm(100000), indices)
    y <- zoo(rnorm(90000), indices[sample(1:100000, 90000)])

    for (i in 1:10) {
      gc()
      timings[i] = system.time(x + y)[3]
    }

    In this benchmark, I get a timing of:

    > mean(timings)
    [1] 1.1518

    So, 1.15 seconds per iteration. There are a couple things to note here:

  • The zoo package pre-sorts the objects by the index/label. As you will see below this makes a big performance difference as you can write a faster algorithm for ordered data.
  • zoo returns an object whose index is the intersection of the indexes. I disagree with this design choice as I feel that it is discarding information. pandas returns the union (the “outer join”, if you will) by default.

    Python benchmark

    Here’s the code doing basically the same thing, except using objects that are not pre-sorted by label:

    from pandas import *
    from pandas.util.testing import rands

    n = 100000
    indices = Index([rands(10) for _ in xrange(n)])

    def sample(values, k):
        from random import shuffle
        sampler = np.arange(len(values))
        shuffle(sampler)
        return values.take(sampler[:k])

    subsample_size = 90000

    x = Series(np.random.randn(100000), indices)
    y = Series(np.random.randn(subsample_size),
               index=sample(indices, subsample_size))

    And the timing:

    In [11]: timeit x + y
    10 loops, best of 3: 110 ms per loop

    Now, if I first sort the objects by index, a more specialized algorithm will be used:

    In [12]: xs = x.sort_index()

    In [13]: ys = y.sort_index()

    In [14]: timeit xs + ys
    10 loops, best of 3: 44.1 ms per loop

    Note that I’m also the fastest (that I know of) among Python libraries. Here’s the above example using the labeled array package:

    In [12]: import la

    In [13]: lx = la.larry(x.values, [list(x.index)])

    In [14]: ly = la.larry(y.values, [list(y.index)])

    In [15]: timeit la.add(lx, ly, join="outer")
    1 loops, best of 3: 214 ms per loop

    In [16]: timeit la.add(lx, ly, join="inner")
    10 loops, best of 3: 176 ms per loop

    The verdict

    So in a apples-to-apples comparison, in this benchmark pandas is 26x faster than zoo. Even in the completely unordered case (which is not apples-to-apples), it’s 10x faster. I actually have a few tricks up my sleeve (as soon as I can find the time to implement them) to make the above operations even faster still =)