Monthly Archives: October 2011

PyHPC 2011 Pre-print paper on pandas

I’ll be presenting this paper in a couple of weeks at SC11. It’s the most up-to-date white paper about pandas.

PDF Download Link

pandas: a Foundational Python Library for Data Analysis and Statistics

Fast and easy pivot tables in pandas 0.5.0

Over the last several months, I've invested a great deal of effort in the GroupBy and indexing infrastructure of pandas. GroupBy in particular could still be made faster, but even for quite large data sets most users will find it more than satisfactory compared with other Python alternatives (which are mainly DIY approaches these days) or with the tools in other data analysis packages, e.g. in R, which I'll talk a bit about in this article.

An easy application of GroupBy is creating pivot tables from a tabular data set. This involves aggregating a data set on some number of keys, then reshaping it into a 2D table with the stratified, aggregated values. I added a new function pivot_table which provides a high-level interface for creating pivot tables. Let's consider the tips data set used in some of the examples in Hadley Wickham's excellent R reshape2 package:

In [1]: import pandas.rpy.common as com

In [2]: tips = com.load_data('tips', package='reshape2')

In [3]: tips['tip_pct'] = tips['tip'] / tips['total_bill']

In [4]: tips.head()
Out[4]:
   total_bill  tip   sex     smoker  day  time    size  tip_pct
1  16.99       1.01  Female  No      Sun  Dinner  2     0.05945
2  10.34       1.66  Male    No      Sun  Dinner  3     0.1605
3  21.01       3.5   Male    No      Sun  Dinner  3     0.1666
4  23.68       3.31  Male    No      Sun  Dinner  2     0.1398
5  24.59       3.61  Female  No      Sun  Dinner  4     0.1468

Since most readers won’t have rpy2, you can get the data file here and read it with:

tips = read_csv('tips.csv')

All of this new functionality is available now on GitHub and will be part of the upcoming pandas 0.5.0 release. Note that I am only using rpy2 here to pull the R data set into my Python session; all of the code below is implemented in pure Python (with a little Cython magic to make things fast).

Pivot table basics

To use pivot_table, pass the pandas DataFrame, the column you wish to aggregate, and the columns to group on for the rows and columns of the table. Note that the default aggregation function is the mean, though a different one can be specified:

In [9]: table = pivot_table(tips, 'tip_pct', rows=['time', 'sex'],
                            cols='smoker')

In [10]: table
Out[10]:
  smoker       No      Yes  
time   sex                  
Dinner Female  0.1568  0.1851
       Male    0.1594  0.1489
Lunch  Female  0.1571  0.1753
       Male    0.1657  0.1667

In [17]: pivot_table(tips, 'tip_pct', rows=['day', 'time'],
                     cols='sex')
Out[17]:
  sex        Female  Male  
day  time                  
Fri  Dinner  0.1991  0.1302
     Lunch   0.1997  0.1741
Sat  Dinner  0.1565  0.1516
Sun  Dinner  0.1816  0.1623
Thur Dinner  0.1597  NaN  
     Lunch   0.1575  0.1653

Note that the returned object is still a DataFrame, so you can use all of the standard indexing technology to select portions of the resulting table:

In [22]: table.ix['Dinner']
Out[22]:
  smoker  No      Yes  
sex                    
Female    0.1568  0.1851
Male      0.1594  0.1489

If you don’t specify a particular values column, it will try to aggregate all the other columns:

In [24]: pivot_table(tips, rows=['sex', 'smoker'])
Out[24]:
               size   tip    tip_pct  total_bill
sex    smoker                                  
Female No      2.593  2.774  0.1569   18.11    
       Yes     2.242  2.932  0.1822   17.98    
Male   No      2.711  3.113  0.1607   19.79    
       Yes     2.5    3.051  0.1528   22.28

Other aggregations can be performed, like using len to get group counts:

In [26]: pivot_table(tips, 'tip_pct', rows='sex', cols='smoker', aggfunc=len)
Out[26]:
  smoker  No  Yes
sex              
Female    54  33
Male      97  60

Once you’ve done the aggregation, you’re free to use the built-in reshaping capability to reshape the data in the table:

In [29]: table = pivot_table(tips, 'tip_pct', rows=['sex', 'day'],
                             cols='smoker', aggfunc=len)

In [30]: table
Out[30]:
  smoker     No  Yes
sex    day          
Female Fri   2   7  
       Sat   13  15
       Sun   14  4  
       Thur  25  7  
Male   Fri   2   8  
       Sat   32  27
       Sun   43  15
       Thur  20  10

In [31]: table.unstack('sex')
Out[31]:
  smoker  No            Yes        
  sex     Female  Male  Female  Male
day                                
Fri       2       2     7       8  
Sat       13      32    15      27  
Sun       14      43    4       15  
Thur      25      20    7       10

If you know a bit more about groupby, you can get clever and pass a dict of functions to perform multiple aggregations at once:

In [59]: pivot_table(tips, rows=['sex', 'smoker'],
                     aggfunc={'tip_pct' : 'mean', 'size' : 'sum'})
Out[59]:
               size  tip_pct
sex    smoker              
Female No      140   0.1569
       Yes     74    0.1822
Male   No      263   0.1607
       Yes     150   0.1528

I thought that was pretty cool.
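
Under the hood, pivot_table is little more than a groupby followed by an unstack of the resulting hierarchical index. Here is a rough sketch of the equivalence (not the actual implementation, which also handles things like multiple value columns and fill_value):

# group on the three keys, aggregate tip_pct, then pivot the innermost key into the columns
means = tips['tip_pct'].groupby([tips['time'], tips['sex'], tips['smoker']]).mean()

table = means.unstack()  # 'smoker', the last grouping key, becomes the column axis

This reproduces the first table shown above.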

Comparison with reshape2

R's reshape2 package relies on two functions, melt and cast, to do the reshaping. Its implementation is, at least in my opinion, slightly more ad hoc than the pandas approach (the actual pivot_table function is only about 10 lines and basically falls out of the groupby and hierarchical-indexing-based reshaping sketched above). However, reshape2 can take advantage of R's built-in formulas, which makes the syntax very intuitive. Here are some of the same examples using reshape2:

> tips$tip.pct <- tips$tip / tips$total_bill
> mtips = melt(tips, id=c("sex", "smoker", "day", "time"))
> acast(mtips, day ~ sex + smoker, length, subset=.(variable=="tip"))
     Female_No Female_Yes Male_No Male_Yes
Fri          2          7       2        8
Sat         13         15      32       27
Sun         14          4      43       15
Thur        25          7      20       10

> acast(mtips, sex + smoker ~ variable, mean)
           total_bill      tip     size   tip.pct
Female_No    18.10519 2.773519 2.592593 0.1569210
Female_Yes   17.97788 2.931515 2.242424 0.1821504
Male_No      19.79124 3.113402 2.711340 0.1606687
Male_Yes     22.28450 3.051167 2.500000 0.1527712

R lacks hierarchical indexing as far as I know, so reshape2 munges the group labels together into a row label.
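
If you really want the munged labels, though, they are easy enough to construct by hand from the hierarchical index; a quick sketch:

agged = pivot_table(tips, rows=['sex', 'smoker'])

# each entry of a hierarchical index iterates as a tuple of group labels
munged = ['_'.join(key) for key in agged.index]  # 'Female_No', 'Female_Yes', ...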

In R there is also the xtabs function for doing cross-tabulation:

> ftable(xtabs(size ~ time + sex + smoker + day, data=tips))
                     day Fri Sat Sun Thur
time   sex    smoker                    
Dinner Female No           2  30  43    2
              Yes          8  33  10    0
       Male   No           4  85 124    0
              Yes         12  71  39    0
Lunch  Female No           3   0   0   60
              Yes          6   0   0   17
       Male   No           0   0   0   50
              Yes          5   0   0   23

ftable just creates a flat table that is easier to view. The same table could be created with pandas by:

In [54]: pivot_table(tips, 'size', rows=['time', 'sex', 'smoker'],
   ....:             cols='day', aggfunc=np.sum, fill_value=0)
Out[54]:
  day                 Fri  Sat  Sun  Thur
time   sex    smoker                    
Dinner Female No      2    30   43   2  
              Yes     8    33   10   0  
Dinner Male   No      4    85   124  0  
              Yes     12   71   39   0  
Lunch  Female No      3    0    0    60  
              Yes     6    0    0    17  
Lunch  Male   No      0    0    0    50  
              Yes     5    0    0    23

Performance

pandas is very fast here, as I've invested a great deal in optimizing the indexing infrastructure and the other core algorithms underlying operations like this. I don't have many points of comparison, but here is a simple benchmark of reshape2 versus pandas.pivot_table on a data set with 100,000 entries and 25 groups. Here is the R code for the benchmark:

library(reshape2)
n <- 100000
a.size <- 5
b.size <- 5
data <- data.frame(a=sample(letters[1:a.size], n, replace=T),
                   b=sample(letters[1:b.size], n, replace=T),
                   c=rnorm(n),
                   d=rnorm(n))
timings <- numeric()

for (i in 1:10) {
  gc()
  tim <- system.time(acast(melt(data, id=c("a", "b")), a + b ~ variable, mean))
  timings[i] = tim[3]
}
mean(timings)

Here is what the actual table looks like:

> table <- acast(melt(data, id=c("a", "b")), a + b ~ variable, mean)
> head(table)
               c            d
a_a  0.002761963 -0.008793516
a_b -0.004618980 -0.026978744
a_c  0.022491767 -0.002163986
a_d  0.005027362 -0.002532782
a_e  0.006146438  0.031699238
b_a -0.025094899  0.003334301

Here’s the equivalent Python code (sans timings, which I’ll do in IPython):

from pandas import *
import string
n = 100000
asize = 5
bsize = 5

letters = np.asarray(list(string.letters), dtype=object)

data = DataFrame(dict(foo=letters[:asize][np.random.randint(0, asize, n)],
                      bar=letters[:bsize][np.random.randint(0, bsize, n)],
                      baz=np.random.randn(n),
                      qux=np.random.randn(n)))

table = pivot_table(data, rows=['foo', 'bar'])

Similarly:

In [36]: table = pivot_table(data, rows=['foo', 'bar'])

In [37]: table[:10]
Out[37]:
         baz       qux      
foo bar                    
a   a   -0.02162   0.02186  
    b   -0.002155  0.01173  
    c   -0.002331  0.006608
    d   -0.01734   0.01109  
    e    0.02837   0.01093  
b   a    0.01703  -0.005845
    b   -0.001043  0.01283  
    c   -0.006015 -0.0005866
    d    0.009683  0.02211  
    e    0.02347  -8.346e-05

And the R timing:

> mean(timings)
[1] 0.42

And pandas:

In [38]: timeit table = pivot_table(data, rows=['foo', 'bar'])
10 loops, best of 3: 117 ms per loop

So it's about 3-4x faster than reshape2. This obviously depends on the data set and how you structure the aggregation. Here are a couple of slightly different benchmarks on the same data:

> acast(melt(data, id=c("a", "b")), a~b, mean, subset=.(variable=="c"))
             a            b            c           d            e
a  0.002761963 -0.004618980  0.022491767 0.005027362  0.006146438
b -0.025094899  0.015966167 -0.007663059 0.006795551  0.006648496
c  0.012273797  0.006139820 -0.019169968 0.009035374 -0.006079683
d -0.014174747 -0.009844566  0.022450738 0.025967638 -0.006572791
e -0.003623382  0.005924704  0.008156482 0.013288116  0.009839315

Timing
> mean(timings)
[1] 0.3036

and pandas:

In [41]: pivot_table(data, 'baz', rows='foo', cols='bar')
Out[41]:
  bar  a         b          c         d         e      
foo                                                    
a     -0.02162  -0.002155  -0.002331 -0.01734   0.02837
b      0.01703  -0.001043  -0.006015  0.009683  0.02347
c      0.01561   0.0003992  0.01451  -0.01046  -0.00368
d      0.003504 -0.01437   -0.03265   0.003471 -0.004638
e     -0.01656  -0.01276   -0.02481   0.008629  0.011  

In [42]: timeit pivot_table(data, 'baz', rows='foo', cols='bar')
10 loops, best of 3: 55 ms per loop

I suspect one source of slowdown in the reshape/reshape2 approach is that the melt operation is very expensive, as is variable subsetting. However, the melt/cast approach does make it easier to compute group margins, something I have not yet addressed (but will, eventually) in pandas.
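
In the meantime, simple margins can be tacked on by hand. For example, something along these lines adds an overall mean column (ignoring smoker status) to one of the tables from earlier; this is just a sketch, not a built-in feature:

table = pivot_table(tips, 'tip_pct', rows='sex', cols='smoker')

# overall mean tip percentage per sex; the result aligns on the 'sex' index
table['All'] = tips['tip_pct'].groupby(tips['sex']).mean()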

Performance quirk: making a 1D object ndarray of tuples

Getting a 1-dimensional ndarray of object dtype containing Python tuples is, unless I’m missing something, rather difficult. Take this simple example:

In [1]: tuples = zip(range(100000), range(100000))
In [2]: arr = np.array(tuples, dtype=object, ndmin=1)
In [3]: arr
Out[3]:
array([[0, 0],
       [1, 1],
       [2, 2],
       ...,
       [99997, 99997],
       [99998, 99998],
       [99999, 99999]], dtype=object)

In [5]: arr.ndim
Out[5]: 2

OK, that didn’t work so well. The only way I’ve figured out how to get what I want is:

In [6]: arr = np.empty(len(tuples), dtype='O')
In [7]: arr[:] = tuples
In [8]: arr
Out[8]:
array([(0, 0), (1, 1), (2, 2), ..., (99997, 99997), (99998, 99998),
       (99999, 99999)], dtype=object)

Yahtzee. But the kids aren’t alright:

In [9]: timeit arr[:] = tuples
10 loops, best of 3: 133 ms per loop

Maybe it’s just me but that strikes me as being outrageously slow. Someday I’ll look at what’s going on under the hood, but a quickie Cython function comes to the rescue:

def list_to_object_array(list obj):
    '''
    Convert list to object ndarray.
    Seriously can't believe I had to write this function
    '''

    cdef:
        Py_ssize_t i, n
        ndarray[object] arr
    n = len(obj)
    arr = np.empty(n, dtype=object)
    for i from 0 <= i < n:
        arr[i] = obj[i]
    return arr

You would hope this is faster, and indeed it’s about 85x faster:

In [12]: timeit arr = lib.list_to_object_array(tuples)
1000 loops, best of 3: 1.56 ms per loop

Scratching my head here, but I'll take it. I suspect there might be some object copying going on under the hood. Anyone know?

Python for Financial Data Analysis with pandas

I gave a guest lecture tonight to the Baruch MFE program about using Python and pandas for financial data analysis. Many thanks to Alain Ledon and Norman Kabir for inviting me to teach the class. Here are the slides from the first 40 minutes:

I spent the remaining 90 minutes or so on a fairly epic whirlwind tour of the most important nuts-and-bolts features of pandas for working with time series and other kinds of financial data. It's been a while since I've done a talk like this; I'm usually talking to a more general audience that doesn't want to hear about quant finance! I worked through a number of fun applications:

  • Computing rolling 3-year betas to the S&P 500 (a rough sketch appears after the notebook links below)
  • Grouped regression of MSFT on AAPL returns by year and month
  • Plotting heatmap of daily and monthly stock correlation matrix (this one got some “ooh”s and “ahh”s)
  • Minimum tracking error portfolio optimization problem and lite backtest
Here is the full PDF of the IPython notebook and a zip file of the IPython notebook and supporting files themselves. Note that the notebook requires a recent GitHub snapshot of statsmodels. I also use the quadratic programming solver in CVXOPT for the portfolio optimization application; if you don't have CVXOPT, you can change it to use the brute-force BFGS optimizer (though it's about 50x slower).

See my prior blog post with some more details about installing and getting set up with the IPython notebook.

IPython Notebook PDF Output
IPython Notebook and supporting files
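
To give a flavor of the rolling beta computation mentioned in the list above, the heart of it is just a ratio of moving-window moments. Here is a minimal sketch, assuming rets and spx are aligned Series of daily returns, roughly 250 trading days per year, and that the rolling_cov and rolling_var moving-window functions are available in your pandas version:

from pandas import rolling_cov, rolling_var

# rolling 3-year (~750 trading day) beta of a stock's returns to the S&P 500
beta = rolling_cov(rets, spx, 750) / rolling_var(spx, 750)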

Speeding up pandas’s file parsers with Cython

I've tried to make the file parsing functions in pandas, read_csv and read_table, as robust (i.e. they do the right thing) and as fast as possible. What do we really care about?

  • Good performance: can read a CSV file as fast as other statistical computing / data analysis languages, like R
  • Proper type handling: if a column of data has integers, the resulting column will be of integer dtype. But you can't be too aggressive: values like '1.0' and '2.0' should not get converted to integers.
  • Support for converting strings into datetime objects
  • Support for returning indexed DataFrame (with a simple or hierarchical index)
  • NA value handling. This is tricky as you want to recognize lots of common NA representations (“NA”, “NULL”, etc.) and also allow the user to specify custom NA value strings.
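
For instance, a call exercising a few of these options might look something like this (a minimal sketch):

    from pandas import read_csv

    # first column becomes the row index; empty fields and 'NULL' are treated as NA,
    # and column dtypes (int, float, object) are inferred automatically
    df = read_csv('data.csv', index_col=0, na_values=['NULL'])
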
While hacking yesterday with folks at the Data Without Borders event, I realized that boolean values ("True" and "False") weren't resulting in a boolean array in the result. Also, I was not satisfied with the performance of the pure Python code. Since I've had great results using Cython to create C extensions, it was the natural choice. The results were great: parsing the data set we were looking at during the DWB event went from about 8 seconds to 800 ms, a full 10x improvement. I also fixed a number of bugs and corner cases with type handling.

TL;DR: pandas.read_csv is a lot faster now

Here was the speed of read_csv before these changes on a fairly big file (46738×54):

    In [1]: %time df = read_csv('data.csv')
    CPU times: user 7.99 s, sys: 0.09 s, total: 8.08 s
    Wall time: 8.11 s

This obviously will not do. And here it is post-Cythonization:

    In [17]: timeit df = read_csv('data.csv')
    1 loops, best of 3: 804 ms per loop

As a point of comparison, R is pretty speedy but about 2x slower.

    > system.time(df <- read.csv('data.csv', header=T))
       user  system elapsed
      1.660   0.000   1.667

In fairness I am 100% sure that read.csv is “doing” a lot more, but it shows that I’m at least on the right track.

I won’t rehash all the code, but there were a number of interesting things along the way.

The basics: working with NumPy arrays in Cython

One of the truly beautiful things about programming in Cython is that you can get the speed of working with a C array representing a multi-dimensional array (e.g. double *) without the headache of having to handle the striding information of the ndarray yourself. Also, you can work with non-contiguous arrays and arrays with dtype=object (which are just arrays of PyObject* underneath) with no code changes (!). Cython calls this the buffer interface:

    def sanitize_objects(ndarray[object] values):
        cdef:
            Py_ssize_t i, n
            object val, onan

        n = len(values)
        onan = np.nan

        for i from 0 <= i < n:
            val = values[i]
            if val == '':
                values[i] = onan

For multi-dimensional arrays, you specify the number of dimensions in the buffer declaration and pass multiple indexes (Py_ssize_t is the proper C “index type” to use). I'll demonstrate this in the next section.

Converting rows to columns faster than zip(*rows)

A cool Python trick to convert rows to columns is:

    In [5]: rows
    Out[5]:
    [(0.39455404791938709, 0.13120015514691319, -0.38366356835950594),
     (-0.744567101498121, -0.9189909692195557, 1.3558711696319314),
     (-0.20933216711571506, 0.36102965753837235, 0.94614438124063927),
     (0.49200559154161844, 0.099177280246717708, -1.2622899429921068),
     (-0.48238271158753454, -0.9414862514454051, -1.0257632509581869)]

    In [6]: zip(*rows)
    Out[6]:
    [(0.39455404791938709,
      -0.744567101498121,
      -0.20933216711571506,
      0.49200559154161844,
      -0.48238271158753454),
     (0.13120015514691319,
      -0.9189909692195557,
      0.36102965753837235,
      0.099177280246717708,
      -0.9414862514454051),
     (-0.38366356835950594,
      1.3558711696319314,
      0.94614438124063927,
      -1.2622899429921068,
      -1.0257632509581869)]

While zip is very fast (it's a built-in Python function), the larger problem here is that our target data structure is a NumPy array to begin with. So it would make sense to write out the rows directly to a 2-dimensional object array:

    def to_object_array(list rows):
        cdef:
            Py_ssize_t i, j, n, k, tmp
            ndarray[object, ndim=2] result
            list row

        n = len(rows)

        # get the maximum row length
        k = 0
        for i from 0 <= i < n:
            tmp = len(rows[i])
            if tmp > k:
                k = tmp

        result = np.empty((n, k), dtype=object)

        for i from 0 <= i < n:
            row = rows[i]

            for j from 0 <= j < len(row):
                result[i, j] = row[j]

        return result

And lo and behold, this function is significantly faster than the zip trick:

    In [12]: data = map(list, np.random.randn(10000, 10))

    In [13]: timeit zip(*data)
    100 loops, best of 3: 3.66 ms per loop

    In [14]: timeit lib.to_object_array(data)
    1000 loops, best of 3: 1.47 ms per loop

It’s even more of a big deal if you zip and convert to ndarray:

    In [15]: timeit [np.asarray(x, dtype=object) for x in zip(*data)]
    100 loops, best of 3: 6.72 ms per loop

Numeric conversion: floats, ints, and NA’s, oh my

When converting the Python strings to numeric data, you must:

  • Check that the value is not among the set of NA values
  • Handle data like ['1', '2', '3.5', '4'] where you may have “integer-like” data until you observe a floating point value
Unfortunately, code for this sort of thing ends up looking like a state machine 99% of the time, but at least it’s fairly tidy in Cython and runs super fast:

    def maybe_convert_numeric(ndarray[object] values, set na_values):
        cdef:
            Py_ssize_t i, n
            ndarray[float64_t] floats
            ndarray[int64_t] ints
            bint seen_float = 0
            object val
            float64_t fval

        n = len(values)

        floats = np.empty(n, dtype='f8')
        ints = np.empty(n, dtype='i8')

        for i from 0 <= i < n:
            val = values[i]

            if cpython.PyFloat_Check(val):
                floats[i] = val
                seen_float = 1
            elif val in na_values:
                floats[i] = nan
                seen_float = 1
            elif val is None:
                floats[i] = nan
                seen_float = 1
            elif len(val) == 0:
                floats[i] = nan
                seen_float = 1
            else:
                fval = float(val)
                floats[i] = fval
                if not seen_float:
                    if '.' in val:
                        seen_float = 1
                    else:
                        ints[i] = <int64_t> fval

        if seen_float:
            return floats
        else:
            return ints

This adopts the Python philosophy that it's "easier to ask forgiveness than permission": if float conversion ever fails, the exception gets raised and the calling code just leaves the column as dtype=object. This function would obviously have problems with European decimal formats, but I'm not willing to compromise performance in the 99% case for the sake of the 1% case. At some point it will make sense to write a slower function that also handles a broader variety of formatting issues.
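
The calling pattern is roughly the following; this is a minimal sketch of the EAFP fallback, not the actual pandas code, and the wrapper name is hypothetical:

    def _maybe_numeric_column(values, na_values):
        # EAFP: try the fast Cython conversion; if a float() call fails, keep the object array
        try:
            return maybe_convert_numeric(values, na_values)
        except ValueError:
            return values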