Monthly Archives: July 2011

GroupBy-fu: improvements in grouping and aggregating data in pandas

A couple of weeks ago in my inaugural blog post I wrote about the state of GroupBy in pandas and gave an example application. However, I was dissatisfied with its limited expressiveness (see the end of that article), so over the last two weeks I invested some serious time in beefing up pandas’ groupby functionality. So this article is part show-and-tell, part quick tutorial on the new features. Note that I haven’t added much of this to the official documentation yet.

GroupBy primer

GroupBy may be one of the least well-understood features in pandas. Let’s consider a DataFrame (basically a labeled 2D array):

In [18]: df
                      A         B        C         D
2000-03-31 00:00:00   0.6304   -1.493    0.2082    0.3378
2000-06-30 00:00:00  -0.1041    1.093    0.1227   -0.2362
2000-09-29 00:00:00  -1.822     0.7325  -0.4964   -1.707
2000-12-29 00:00:00   0.3325   -0.2213   0.5465    1.265
2001-03-30 00:00:00   0.8422   -0.1265   0.1234    0.7743
2001-06-29 00:00:00  -0.07774  -0.3906  -0.6328   -1.259
2001-09-28 00:00:00  -0.3417   -0.4209   1.724     0.3069
2001-12-31 00:00:00   0.493    -0.2815   1.314     1.004
2002-03-29 00:00:00   1.003    -0.4807   0.7232    1.079
2002-06-28 00:00:00   0.1415   -2.275    0.6178    0.8817
2002-09-30 00:00:00  -0.6855    0.3921  -1.468     1.522
2002-12-31 00:00:00  -1.635     0.3083   0.1209   -0.5606
2003-03-31 00:00:00   0.5707   -0.7125  -0.07677   0.5154
2003-06-30 00:00:00   0.3646   -0.8574  -0.562     0.2176
2003-09-30 00:00:00   1.024    -0.8113   0.6771   -1.261

Here, the index (row labels) contains dates and the columns are names for each time series. When performing a groupby operation, we may have different goals:

  • Perform an aggregation, like computing the sum or mean of each group. Functionally this means applying a function to each group and putting the aggregated results into a DataFrame
  • Slicing the DataFrame into chunks (groups) and then doing something with the slices
  • Performing a transformation, like standardizing each group (computing the zscore)

So there are two tasks: first, grouping the data; second, doing something with the grouped data. As far as grouping goes, in a totally abstract sense we have to define some kind of mapping that assigns the labels along one of the axes to group buckets. A mapping could be one of many things:

  • A Python function, to be called on each label
  • A dict, containing {label : group_name} mappings
  • An array the same length as the axis containing the group correspondences. This may often be one of the columns in a DataFrame, but need not be. Every other mapping type gets rendered to this form eventually.
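All three mapping types produce the same grouping. Here is a small sketch (on a made-up miniature frame) showing a function, a dict, and an array each grouping a date-indexed DataFrame by year:

```python
import numpy as np
import pandas as pd

# made-up frame with a datetime index, just for illustration
idx = pd.to_datetime(["2000-03-31", "2000-06-30", "2001-03-30", "2001-06-29"])
df = pd.DataFrame(np.arange(8.0).reshape(4, 2), index=idx, columns=["A", "B"])

# 1) a function called on each label
by_func = df.groupby(lambda ts: ts.year).sum()

# 2) a dict of {label : group_name} mappings
by_dict = df.groupby({ts: ts.year for ts in idx}).sum()

# 3) an array of group correspondences, same length as the axis
by_array = df.groupby(np.asarray(idx.year)).sum()
```

All three give identical results; internally the first two forms get rendered down to the array form.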

So in the above data, let’s say we wanted to group the data by year. Since the row labels are Python datetime objects, we can access the year attribute when calling groupby:

In [21]: grouped = df.groupby(lambda x: x.year)

This returns an object of type GroupBy. With this object, you can do a lot of things, including iteration, aggregation, and transformation. So we could iterate like so:

In [20]: for year, group in grouped:
   ....:     print year
   ....:     print group
                      A        B        C        D
2000-03-31 00:00:00   0.6304  -1.493    0.2082   0.3378
2000-06-30 00:00:00  -0.1041   1.093    0.1227  -0.2362
2000-09-29 00:00:00  -1.822    0.7325  -0.4964  -1.707
2000-12-29 00:00:00   0.3325  -0.2213   0.5465   1.265

                      A         B        C        D
2001-03-30 00:00:00   0.8422   -0.1265   0.1234   0.7743
2001-06-29 00:00:00  -0.07774  -0.3906  -0.6328  -1.259
2001-09-28 00:00:00  -0.3417   -0.4209   1.724    0.3069
2001-12-31 00:00:00   0.493    -0.2815   1.314    1.004

                      A        B        C        D
2002-03-29 00:00:00   1.003   -0.4807   0.7232   1.079
2002-06-28 00:00:00   0.1415  -2.275    0.6178   0.8817
2002-09-30 00:00:00  -0.6855   0.3921  -1.468    1.522
2002-12-31 00:00:00  -1.635    0.3083   0.1209  -0.5606

                      A        B        C         D
2003-03-31 00:00:00   0.5707  -0.7125  -0.07677   0.5154
2003-06-30 00:00:00   0.3646  -0.8574  -0.562     0.2176
2003-09-30 00:00:00   1.024   -0.8113   0.6771   -1.261

Or aggregate like so:

In [21]: grouped.aggregate(np.sum)
       A        B        C          D
2000  -0.9629   0.1108   0.3811    -0.3405
2001   0.9159  -1.219    2.529      0.8268
2002  -1.176   -2.055   -0.006253   2.921
2003   1.959   -2.381    0.03838   -0.528

Or transform groups like so (here I standardize each group):

In [22]: zscore = lambda x: (x - x.mean()) / x.std()

In [23]: transformed = grouped.transform(zscore)
                      A        B         C         D
2000-03-31 00:00:00   0.7946  -1.316     0.2598    0.3396
2000-06-30 00:00:00   0.1246   0.9216    0.06319  -0.1213
2000-09-29 00:00:00  -1.442    0.61     -1.361    -1.302
2000-12-29 00:00:00   0.5228  -0.2155    1.038     1.084
2001-03-30 00:00:00   1.142    1.34     -0.47      0.5569
2001-06-29 00:00:00  -0.571   -0.644    -1.168    -1.438
2001-09-28 00:00:00  -1.062   -0.8715    1.009     0.0983
2001-12-31 00:00:00   0.4917   0.1758    0.6299    0.7827
2002-03-29 00:00:00   1.149    0.02675   0.7159    0.3864
2002-06-28 00:00:00   0.3857  -1.422     0.6118    0.168
2002-09-30 00:00:00  -0.3469   0.7316   -1.449     0.878
2002-12-31 00:00:00  -1.188    0.664     0.1209   -1.432
2003-03-31 00:00:00  -0.2443   1.097    -0.1434    0.7267
2003-06-30 00:00:00  -0.8552  -0.8598   -0.9205    0.4138
2003-09-30 00:00:00   1.099   -0.2376    1.064    -1.14
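A quick way to sanity-check a transform like this: within each group, the standardized data should have mean ~0 and standard deviation ~1. A minimal sketch with made-up data:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
idx = pd.to_datetime(["2000-03-31", "2000-06-30", "2000-09-29",
                      "2001-03-30", "2001-06-29", "2001-09-28"])
df = pd.DataFrame(np.random.randn(6, 2), index=idx, columns=["A", "B"])

zscore = lambda x: (x - x.mean()) / x.std()
transformed = df.groupby(lambda ts: ts.year).transform(zscore)

# each year-group should now have mean ~0 and std ~1, column by column
group_means = transformed.groupby(lambda ts: ts.year).mean()
group_stds = transformed.groupby(lambda ts: ts.year).std()
```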

Some aggregations are so common that they’re provided as instance methods on the GroupBy object:

In [24]: grouped.sum()
       A        B        C          D
2000  -0.9629   0.1108   0.3811    -0.3405
2001   0.9159  -1.219    2.529      0.8268
2002  -1.176   -2.055   -0.006253   2.921
2003   1.959   -2.381    0.03838   -0.528

In [25]: grouped.mean()
       A        B         C          D
2000  -0.2407   0.02769   0.09526   -0.08512
2001   0.229   -0.3049    0.6324     0.2067
2002  -0.2939  -0.5138   -0.001563   0.7303
2003   0.6531  -0.7938    0.01279   -0.176

For those with SQL experience, a DataFrame can be treated like an SQL table. As such, grouping “by columns” is quite natural:

In [27]: df2
    A     B       C        D
0   foo   one    -0.7883   0.7743
1   bar   one    -0.5866   0.06009
2   foo   two     0.9312   1.2
3   bar   three  -0.6417   0.3444
4   foo   two    -0.8841  -0.08126
5   bar   two     1.194   -0.7933
6   foo   one    -1.624   -0.1403
7   foo   three   0.5046   0.5833

In [28]: for name, group in df2.groupby('A'):
   ....:     print name
   ....:     print group
    A     B       C        D
1   bar   one    -0.5866   0.06009
3   bar   three  -0.6417   0.3444
5   bar   two     1.194   -0.7933

    A     B       C        D
0   foo   one    -0.7883   0.7743
2   foo   two     0.9312   1.2
4   foo   two    -0.8841  -0.08126
6   foo   one    -1.624   -0.1403
7   foo   three   0.5046   0.5833

Note the presence of the “nuisance” B column when we group by A. I want this to work:

In [30]: df2.groupby('A').mean()
      C         D
bar  -0.01137  -0.1296
foo  -0.3722    0.4671

More on dealing with this below.
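To make the SQL analogy concrete, here is a sketch (using an in-memory SQLite table and made-up data) showing that groupby('A') corresponds to GROUP BY A:

```python
import sqlite3
import pandas as pd

df2 = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                    "C": [1.0, 2.0, 3.0, 4.0]})

# the pandas way
pandas_means = df2.groupby("A")["C"].mean()

# the SQL way, on the same data
con = sqlite3.connect(":memory:")
df2.to_sql("df2", con, index=False)
sql_means = dict(con.execute("SELECT A, AVG(C) FROM df2 GROUP BY A"))
```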

New: Column selection

In my last post I mentioned wanting to be able to do a groupby operation on a single column of a DataFrame, using another column or set of columns as the groupers. So in the above example, what I want is:

grouped = df2['C'].groupby(df2['A'])

# or
grouped = df2['C'].groupby([df2['A'], df2['B']])

This is too verbose / awkward for my tastes. So now you can do:

In [111]: df2.groupby('A')['C'].mean()
bar    -0.447919901908
foo    -0.158016186685

Behind the scenes, this simply passes the C column to a Series GroupBy object along with the already-computed grouping(s).
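In other words, the two spellings should agree. A quick sketch with made-up data:

```python
import pandas as pd

df2 = pd.DataFrame({"A": ["foo", "bar", "foo", "bar", "foo"],
                    "C": [0.1, 0.2, 0.3, 0.4, 0.5]})

# the new sugar
short_form = df2.groupby("A")["C"].mean()

# the verbose equivalent: group one Series by another
long_form = df2["C"].groupby(df2["A"]).mean()
```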

New: Group by multiple columns / key functions

The ability to group by multiple criteria (just like SQL) has been one of my most desired GroupBy features for a long time. It’s also very hard to implement efficiently. So I bit the bullet and carefully crafted Cython code to speedily do aggregations on data grouped by multiple columns. So now you can do:

In [113]: df2.groupby(['A', 'B']).mean()
    A     B       C        D
0   foo   one     0.3558  -0.4818
1   foo   two    -0.2758  -1.271
2   foo   three  -0.95     1.709
3   bar   one     1.916   -0.5093
4   bar   two    -2.315    0.1177
5   bar   three  -0.9447   0.2519

Iteration with multiple grouping keys flattens the key tuples:

In [114]: for k1, k2, group in df2.groupby(['A', 'B']):
   .....:     print k1, k2
   .....:     print group
foo one
    A     B     C        D
0   foo   one   0.8248   0.3856
6   foo   one  -0.1133  -1.349

foo two
    A     B     C        D
2   foo   two  -1.175   -1.036
4   foo   two   0.6239  -1.506

foo three
    A     B       C      D
7   foo   three  -0.95   1.709

bar one
    A     B     C       D
1   bar   one   1.916  -0.5093

bar two
    A     B     C       D
5   bar   two  -2.315   0.1177

bar three
    A     B       C        D
3   bar   three  -0.9447   0.2519

As a less trivial example, I loaded some data from the Bureau of Labor Statistics, which I’ll write more about another time:

In [126]: df = read_table('',
                          sep='\s+', index_col=None)

In [127]: df
<class 'pandas.core.frame.DataFrame'>
Index: 32806 entries, 0 to 32805
Data columns:
series_id    32806  non-null values
year         32806  non-null values
period       32806  non-null values
value        32806  non-null values
dtypes: int64(1), float64(1), object(2)

In [128]: df[:15]
     series_id       year   period   value
0    CEU5500000034   1990   M01      83
1    CEU5500000034   1990   M02      83.6
2    CEU5500000034   1990   M03      83.4
3    CEU5500000034   1990   M04      84.9
4    CEU5500000034   1990   M05      84.1
5    CEU5500000034   1990   M06      85.4
6    CEU5500000034   1990   M07      86.5
7    CEU5500000034   1990   M08      86
8    CEU5500000034   1990   M09      85.5
9    CEU5500000034   1990   M10      83.6
10   CEU5500000034   1990   M11      83.6
11   CEU5500000034   1990   M12      84.7
12   CEU5500000034   1990   M13      84.5
13   CEU5500000034   1991   M01      82.9
14   CEU5500000034   1991   M02      83.1

This data set has 118 unique series_ids and spans 22 years (the period column gives the month of observation), so we’re talking about 2596 unique groups. Say we want to compute the mean value for each series in each year; that’s pretty simple with the improved groupby:

In [129]: df.groupby(['series_id', 'year']).mean()
<class 'pandas.core.frame.DataFrame'>
Index: 2596 entries, 0 to 2595
Data columns:
series_id    2596  non-null values
value        2596  non-null values
year         2596  non-null values
dtypes: int64(1), float64(1), object(1)

I was also pretty impressed with how fast this operation is (the power of Cython!):

In [130]: timeit df.groupby(['series_id', 'year']).mean()
10 loops, best of 3: 22.9 ms per loop

I haven’t compared this with an SQL engine (SQLite, MySQL, Postgres, …) but I hope that it’s competitive. If you call a non-Cythonized function, the speed goes down considerably:

In [132]: timeit df.groupby(['series_id', 'year'])['value'].std()
1 loops, best of 3: 232 ms per loop

Edit: It’s also possible to pass multiple functions instead of column names or to mix and match column names with functions:

In [22]: df.groupby([lambda x: x.year, lambda x: x.month]).mean()
     A          B         C          D         key_0   key_1
0    0.09433    0.06146   0.1891     0.2079    2000    1    
1    0.02674   -0.09411   0.1557    -0.07926   2000    2    
2    0.07089    0.3938   -0.08026   -0.416     2000    3    
3    0.133     -0.04532   0.03341   -0.2866    2000    4    
4    0.1301     0.1803   -0.4241     0.1482    2000    5    
5   -0.3804    -0.08935   0.04791    0.08268   2000    6    
6    0.2211     0.2483   -0.3559    -0.06205   2000    7    
7   -0.2908     0.07606  -0.2001    -0.05225   2000    8    
8   -0.07888   -0.3915    0.06442    0.5109    2000    9    
9    0.3809     0.4037   -0.2431    -0.2892    2000    10  
10  -0.2765     0.01461  -0.2653     0.1335    2000    11  
11   0.6983    -0.1167    0.3964     0.03776   2000    12  
12  -0.2347     0.2333   -0.06508    0.3       2001    1    
13   0.1751     0.05545  -0.004038   0.06194   2001    2    
14   0.08792    0.03517   0.05583   -0.4356    2001    3    
15  -0.08408    0.05567  -0.1705     0.1914    2001    4    
16   0.008912   0.2317    0.08875    0.08637   2001    5    
17   0.3304     0.3635    0.005158   0.3588    2001    6    
18  -0.1324     0.3419   -0.2527    -0.2249    2001    7    
19   0.0472     0.2552   -0.07589   -0.07992   2001    8    
20   0.07582   -0.2295    0.1002     0.1077    2001    9    
21   0.2368     0.02214  -0.3866    -0.2671    2001    10  
22   0.2116    -0.07872  -0.06399    0.2309    2001    11
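Mixing a column name with a key function looks like this. This is a sketch on a made-up miniature frame; the function receives each row label (here, positions 0 through 3):

```python
import pandas as pd

df = pd.DataFrame({"series_id": ["X", "X", "Y", "Y"],
                   "year": [1990, 1991, 1990, 1991],
                   "value": [1.0, 2.0, 3.0, 4.0]})

# group by the series_id column together with a function of the row labels
# (here, an arbitrary derived key: the parity of the year)
keyed = df.groupby(["series_id", lambda i: df["year"].iloc[i] % 2]).mean()
```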

New: Omitting “nuisance” columns

It’s pretty common to group by a column and ignore other columns containing non-floating point data. It used to be that you’d get an error, forcing you to first drop the “nuisance” columns, e.g. suppose we only want to group by the A column here:

In [133]: df2
    A     B       C        D
0   foo   one     0.8248   0.3856
1   bar   one     1.916   -0.5093
2   foo   two    -1.175   -1.036
3   bar   three  -0.9447   0.2519
4   foo   two     0.6239  -1.506
5   bar   two    -2.315    0.1177
6   foo   one    -0.1133  -1.349
7   foo   three  -0.95     1.709

If you try to compute the mean of the B column, things won’t quite work out:

In [134]: grouped = df2.groupby('A')['B'].mean()
TypeError: unsupported operand type(s) for /: 'str' and 'float'

To cope with this, so-called “nuisance” columns are now automatically dropped. While I don’t like discarding information, I think “practicality beats purity” in this case:

In [135]: df2.groupby('A').mean()
      C        D
bar  -0.4479  -0.0466
foo  -0.158   -0.3593
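The automatic dropping is equivalent to excluding the non-numeric columns by hand before grouping. A sketch on made-up data:

```python
import pandas as pd

df2 = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                    "B": ["one", "one", "two", "two"],
                    "C": [0.8, 1.9, -1.2, -0.9],
                    "D": [0.4, -0.5, -1.0, 0.3]})

# dropping the string-typed "nuisance" column explicitly gives the same answer
result = df2.drop(columns="B").groupby("A").mean()
```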

New: “auto-dispatching” to DataFrame / Series

Since very few functions in DataFrame and Series/TimeSeries are implemented on the GroupBy classes, you may wish to simply invoke a Series instance method on each group. For instance:

In [140]: df.groupby('series_id')['value'].agg(Series.skew)
CEU5500000034    -0.253887792525
CEU5500000035    0.0938252241898
CEU5552200034    0.387563548587
CEU5552200035    0.0868489117242
CEU5552210034    0.74506367536
CEU5552210035    0.551290226183

But since we’ve got Python at our disposal, it’s relatively straightforward to override __getattribute__ to wrap and dispatch calls to Series/DataFrame instance methods so you can do this:

In [141]: df.groupby('series_id')['value'].skew()
CEU5500000034    -0.253887792525
CEU5500000035    0.0938252241898
CEU5552200034    0.387563548587
CEU5552200035    0.0868489117242
CEU5552210034    0.74506367536

Even the more complicated methods, like the super-useful describe method, can in fact be used this way.

For those interested, the internal code that does this dispatching is short and simple, by the way.
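The original snippet isn’t reproduced here, but the general idea can be sketched as a toy wrapper (not pandas’ actual internals) that forwards unknown attribute lookups to the underlying grouped object:

```python
import pandas as pd

class DispatchingGroupBy:
    """Toy sketch of attribute dispatch; NOT pandas' real implementation."""

    def __init__(self, obj, grouper):
        self.obj = obj
        self.grouper = grouper

    def __getattr__(self, name):
        # only reached when normal attribute lookup fails
        if hasattr(self.obj, name):
            def wrapper(*args, **kwargs):
                # apply the Series method group by group
                return self.obj.groupby(self.grouper).apply(
                    lambda g: getattr(g, name)(*args, **kwargs))
            return wrapper
        raise AttributeError(name)

s = pd.Series([1.0, 2.0, 3.0, 4.0])
g = DispatchingGroupBy(s, [0, 0, 1, 1])
sums = g.sum()   # dispatches Series.sum to each group
```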

Old but not well known: apply multiple functions

This has always been around but is not well known. Suppose you want to apply multiple functions to a group and collect all the results. You can pass a dict of functions:

In [144]: grouped.agg({'mean' : np.mean, 'variance' : np.var})
                mean    variance
CEU5500000034   96.54   68.92  
CEU5500000035   93.88   832.3  
CEU5552200034   97.95   46.6    
CEU5552200035   93.58   683.1  
CEU5552210034   100.7   15.54  
CEU5552210035   98.84   567.1  
CEU5552211034   100.9   13.73  
CEU5552211035   99.34   622    
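The same pattern can also be written with named aggregation, where each keyword names an output column. A sketch on made-up data standing in for the BLS series (note pandas’ var uses ddof=1, unlike np.var):

```python
import pandas as pd

values = pd.Series([96.0, 97.0, 98.0, 90.0, 100.0, 110.0],
                   index=["s1", "s1", "s1", "s2", "s2", "s2"])

# one output column per (name, function) pair
stats = values.groupby(level=0).agg(mean="mean", variance="var")
```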

I have a mind to expand on this idea for DataFrame objects to get us closer to the full flexibility of SQL with Pythonic syntax. If anyone has ideas or suggestions please let me know.


GroupBy is certainly not done. If you use these tools and find them useful, please let me know. If you can think of ways to make them better, that would be nice information too. I think it would be great to implement a full SQL engine on top of pandas (similar to the SAS “proc sql”), and this new GroupBy functionality gets us closer to that goal.

A Roadmap for Rich Scientific Data Structures in Python

Discussion thread on Hacker News

So, this post is a bit of a brain dump on rich data structures in Python and what needs to happen in the very near future. I care about them for statistical computing (I want to build a statistical computing environment that trounces R) and financial data analysis (all evidence leads me to believe that Python is the best all-around tool for the finance space). Other people in the scientific Python community want them for numerous other applications: geophysics, neuroscience, etc. It’s really hard to make everyone happy with a single solution. But the current state of affairs has me rather anxious, and I’d like to explain why. For a really quick summary of some of the work I’ve been doing, here are my SciPy 2010 talk slides.

Data structures with metadata, the backstory

In the wake of SciPy 2011 I’ve been thinking a lot about the way forward from here in terms of building rich Pythonic data structures for statistics and many, many other fields. By rich I mean: not just a NumPy ndarray, but a structure that contains metadata (however we define metadata), has operations which depend on that metadata, and in general does far more than structured arrays currently do for you. This touches on a great many topics and features that people want (partial list):

  • Manipulating heterogeneously-typed (what I loosely call “mixed type”) data
  • Size mutability: can easily add “columns” or otherwise N-1-dimensional hyperslices without necessarily copying data
  • Metadata about what each axis represents: axis=3 is less meaningful than axis=’temperature’
  • Metadata about axis labels (ticks)
  • Label / tick-based data alignment / reshaping, either automatic or explicit
  • Label / tick-based (fancy) indexing, both setting and getting
  • Hierarchical columns
  • Robust handling of missing / NA data (à la R)
  • Tons of operations needing heterogeneous data and metadata: group by, filtering, sorting, selection / querying, reindexing (reshaping to conform to a new set of labels), axis selection based on names, etc. etc. 
The list goes on and on. I could write a 50-page manuscript on the exact specification of what exact functionality is desired on each of the above bullet points. What I do know is that after using a rich data structure like the ones in pandas, it’s very, very hard to go back to using vanilla ndarrays. R users coming to Python have a similar experience: the lack of data.frame and all the functions which operate on data.frames is a gigantic loss of functionality. When I work with MATLAB and R users (especially in the finance industry) and get them up and running with pandas, I get a lot of “where was this tool all my life?” It’s just that much better. In fact, users can get by with only a very rudimentary understanding of NumPy if the data structures are good enough; I think this is highly desirable. Even for purely interactive data analysis (forget operations which actually utilize the metadata), isn’t this much better:

In [4]: data.corr()
       AAPL     GOOG     MSFT     YHOO  
AAPL   1        0.5724   0.4714   0.3447
GOOG   0.5724   1        0.5231   0.3409
MSFT   0.4714   0.5231   1        0.3012
YHOO   0.3447   0.3409   0.3012   1

than this:

In [11]: np.corrcoef(data.T)
array([[ 1.    ,  0.5724,  0.4714,  0.3447],
       [ 0.5724,  1.    ,  0.5231,  0.3409],
       [ 0.4714,  0.5231,  1.    ,  0.3012],
       [ 0.3447,  0.3409,  0.3012,  1.    ]])

Of course if data were a structured ndarray you would be completely up a creek (most NumPy functions do not play well with structured arrays). But that’s another topic.

But anyway, to the point of why I’m writing: we have a ton of really talented people with real problems to solve, and lots of great ideas about how to solve them. Last year at SciPy 2010 in Austin we had a Birds of a Feather session led by the venerable Fernando Pérez and myself to talk about the datarray project, pandas, tabular, larry, and other various ideas about data structures that people have kicked around. The topic is important enough that Enthought hosted a gathering this past May in Austin, the DataArray Summit, to talk about these issues and figure out where to go from here. It was a great meeting and we hashed out in gory detail many of the problem areas that we’d like to solve with a richer data structure. So that’s a bit of the backstory.

But even given all these great discussions and work being done, we have a really fundamental problem:

Fragmentation is killing us

There, I said it =) All of us NumPy ninjas can navigate the fragmented, incohesive collection of data structures and tools, but it’s confusing as hell for new and existing users. NumPy alone is not good enough for a lot of people (statisticians, financial data analysts, etc.), but they’re left with a confusing choice between pandas, larry, datarray, or something else. Also, these tools have largely not been integrated with any other tools because of the community’s collective commitment anxiety. We talk, hem and haw, and wring our hands. And still no integration. I don’t mean to complain: I just deeply care about making the scientific Python stack the most powerful data analysis stack to ever exist. Seriously. And I can’t do it alone. And I don’t want to make unilateral decisions and shove anything down anyone’s throat. We’ve started working on integration of pandas in statsmodels (which is already going to make a really huge difference), but we need to collectively get our proverbial sh*t together. And soon.

My work on pandas lately and why it matters

On my end, in the 2 months since the DataArray summit, I decided to put my PhD on hold and focus more seriously on Python development, statistics / statsmodels, pandas, and other projects I care deeply about. So I’ve easily invested more time in pandas in the last 2 months than in the previous 2 years. This included heavily redesigning the internals (with which I’m extremely pleased) and tackling various other thorny internal issues which had long been an annoyance for me and an external source of criticism from end users (like, “why are there 2 2-dimensional data structures, DataFrame and DataMatrix?” The answer is that the internal implementations were different, with different performance characteristics depending on use). I’m also trying to clear out my extensive backlog of wishlist features (lots of blog articles to come on these). What’s happened is that I started prototyping a new data structure which I’m calling NDFrame, which I think is really going to be a big deal. Basically, the NDFrame is a by-product of the redesigned internal data structure (currently called BlockManager) backing DataFrame and friends. I can and should write an entire post about BlockManager and exactly what it needs to accomplish, but the short story is that BlockManager:

  • Stores an arbitrary collection of homogeneously-typed N-dimensional NumPy ndarray objects (I call these Block objects or simple blocks)
  • Has axis labels and is capable of reshaping the “blocks” to a new set of labels
  • Is capable of consolidating blocks (gluing them together) having the same dtype
  • Knows how to appropriately introduce missing data, and upcast dtypes (int, bool) when the missing data marker (NaN, currently. Crossing my fingers for a good NA implementation!) needs to be introduced
  • Can deal with single “item” type casts (ha, try to do that with structured arrays!)
  • Can accept new blocks without copying data
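The missing-data upcasting described above is visible through the public API, for example when a reindex introduces a label with no data:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3], index=["a", "b", "c"])   # integer dtype

# reindexing against a label with no data forces NaN in,
# so the integers get upcast to floats
r = s.reindex(["a", "b", "c", "d"])
```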
Maybe this is too meta (heh) for some people. As an illustration, here’s literally what’s going on inside DataFrame now after all my latest hacking:
In [40]: df
    b         a         c       e   f
0   1.213     1.507     True    1   a
1  -0.6765    0.06237   True    1   b
2   0.3126   -0.2575    False   1   c
3   0.1505    0.2242    True    1   d
4  -0.7952    0.2909    True    1   e
5   1.341    -0.9712    False   2   f
6   0.01121   1.654     True    2   g
7  -0.173    -1.385     False   2   h
8   0.1637   -0.898     False   2   i
9   0.5979   -1.035     False   2   j

In [41]: df._data
Items: [b a c e f]
Axis 1: [0 1 2 3 4 5 6 7 8 9]
FloatBlock: [b a], 2 x 10, dtype float64
BoolBlock: [c], 1 x 10, dtype bool
IntBlock: [e], 1 x 10, dtype int64
ObjectBlock: [f], 1 x 10, dtype object

The user is of course never intended to look at this; it’s purely internal. But, for example, things like this are OK to do:

In [42]: df['e'] = df['e'].astype(float)
In [43]: df._data
Items: [b a c e f]
Axis 1: [0 1 2 3 4 5 6 7 8 9]
FloatBlock: [b a], 2 x 10, dtype float64
BoolBlock: [c], 1 x 10, dtype bool
ObjectBlock: [f], 1 x 10, dtype object
FloatBlock: [e], 1 x 10, dtype float64

Since in that case there are now multiple float blocks, they can be explicitly consolidated, but if you use DataFrame many operations will cause it to happen automatically (which is highly desirable, especially when you have only one dtype, for doing row-oriented operations):

In [44]: df._data.consolidate()
Items: [b a c e f]
Axis 1: [0 1 2 3 4 5 6 7 8 9]
BoolBlock: [c], 1 x 10, dtype bool
FloatBlock: [b a e], 3 x 10, dtype float64
ObjectBlock: [f], 1 x 10, dtype object
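The same dtype bookkeeping can be observed without touching internals (the _data attribute is private and subject to change): the public dtypes attribute reflects which block a column lives in, and an astype moves it. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({"b": [1.2, -0.7, 0.3],
                   "c": [True, True, False],
                   "e": [1, 2, 3],
                   "f": ["x", "y", "z"]})

before = str(df.dtypes["e"])        # lives in the integer block
df["e"] = df["e"].astype(float)     # "e" migrates to float storage
after = str(df.dtypes["e"])
```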

Now, this is a pretty abstract business. Here’s my point: when I started thinking about NDFrame, a user-facing n-dimensional data structure backed by BlockManager, I realized that what I am going to build is a nearly strict superset of the functionality provided by every other rich data structure I know of. I made a picture of the feature overlap, and note that arrows loosely mean: “can be used to implement”:

Data Structure Features Venn Diagram

For example, I need to write generic fancy indexing on the NDFrame, a task largely tackled by DataArray. So rather than reinvent the wheel, I should just co-opt that code (love that BSD license), but then I’ve effectively created a fork (nooooo!). I think having all these different libraries (and leaving users the confusing choice between them) is kind of nuts. Ideally DataArray (homogeneous) should just be a part of pandas (and I’m not opposed to changing the name, though it has stronger branding and far more users than datarray or larry). But once we’ve gone down that route, larry is just a DataArray (homogeneous) with automatic data alignment. We’re all doing the exact same kinds of things. So why not have one library?

Remember: the real world is heterogeneously-typed

Some other casual (and potentially controversial) observations

Here’s just a dumping ground of various thoughts I have on this and related topics:

  • The NumPy type hierarchy (int8, int16, int32, int64, uint8, uint16, …) isn’t that important to me. R and MATLAB don’t really have a type hierarchy and it doesn’t seem to pose a problem. So I haven’t gone down the road of creating Block objects mapping onto the NumPy type hierarchy. If someone wants to do it without complicating the end-user pandas experience, be my guest
  • Tying yourself to the ndarray is too restrictive. This is a major problem with DataArray and larry; they don’t do mixed-type data. So if you want to build something that competes with R you have failed before you have even begun by using a homogeneous-only data structure. Remember, DataFrame can be homogeneous whenever it wants to and getting the underlying ndarray is just a few keystrokes.
  • Structured arrays are probably not the answer. Size mutability (ability to add columns) and the ability to change dtypes are actually a big deal. As is the ability to do row-oriented computations, broadcasting, and all the things that structured arrays are (currently) unable to do. They are a darned convenient way of serializing / deserializing data, though. But memory layout and all that is far, far less important to me than usability / user interface / experience

SciPy 2011 Conference Highlights

SciPy 2011 was a blast! Intense and fun but of course tiring due to burning the candle at both ends. I was delighted to see a lot of familiar faces after my first SciPy conference last year: the Enthought crew (Travis Oliphant, Eric Jones, et al.), Peter Wang (chaco), Fernando Pérez and MinRK / Min Ragan-Kelley (of IPython fame), my statsmodels collaborator Skipper Seabold, Stefan van der Walt, Josh Hemann, and too many others to mention. It was also great to finally meet a lot of other well-known SciPythonistas for the first time: Chris Fonnesbeck (PyMC), John D. Cook, Gaël Varoquaux (scikit-learn, Mayavi), and many others. A 2-day slate of talks seems too little to do justice to all the awesome stuff people are doing!

Here’s a partial list of my favorite parts of the conference and general thoughts (in no particular order). Since there were two talk tracks running in parallel I unfortunately could only see half the talks; fortunately videos should be posted soon. I’ll return to this post and add video links once they’re available. I’ll write a separate post about our statsmodels sprint!

Fernando Pérez’s IPython talk: Slides

I always tell people that IPython is one of scientific Python’s killer apps. For proof, watch the video of Fernando’s talk once it’s up. Chris Fonnesbeck wrote a great blog article on the new features in IPython. In essence they’ve taken an already innovative and hugely productivity-enhancing tool and rearchitected it to be a) much more easily embedded (see the rich Qt console and the web interface) and b) an extremely powerful parallel computing environment. Truly inspiring work from MinRK, Fernando, Brian Granger, and crew. My hat’s off to you guys.

On my end I started using the rich Qt console (ipython-qtconsole) for interactive work and demos a few months ago. Having inline matplotlib plots while doing tech demos is a huge, huge win!

IPython Rich Qt Console

Peter Wang’s talk on Metagraph: Slides

Fortunately Peter’s slides have notes on them which helps to understand the slides in more detail. He’s working to solve some really major problems here at the core of how we do computation on arrays. The notes do far better justice to the talk, but the nutshell is that he’s building a loop-fusing (a.k.a. “stream fusion”) compiler for expressing array computations and processing streaming data. This is extremely exciting as for the longest time this has been a watershed between NumPy-based tools and work being done in the APL/J/K family and functional languages like Haskell. Especially for big data / streaming data applications, a “fusing” compiler / VM will eliminate loops and enable the program to make only single passes over the data as opposed to multiple passes as commonly happens now. This is related to Python projects like numexpr and Theano (which are also worth checking out).

The extent of what’s possible with these kinds of ideas is a bit hard to grok but I’m very much looking forward to seeing how this project develops over the coming 6 months. As he says “The goal is to make scientific computing Pythonic”. Bringing the full power of array-based and functional languages at the core computational level with a very Pythonic interface could have very serious impact on the direction of scientific computing.

Gaël Varoquaux’s Neuroscience / scikit-learn Talk: Slides

If you’ve ever tried making a presentation in LaTeX, it should be clear that Gaël is at one with the beamer gods. That is one swanky looking deck of slides; it should come as no shock that it’s had nearly 30k views on SlideShare! Aesthetics aside, I’m excited to see how much progress the scikit-learn folks have made on building a really excellent machine learning library. Now if only we can get that much muscle poured into statsmodels! I think generally machine learning is an optimal application area for the scientific Python stack and this talk shows why: solid algos, great data visualization, and excellent task / workflow / big data management tools (e.g. joblib).

As an aside, Gaël’s point that you “cannot develop science and software separately” is highly relevant to all data-driven fields of academic research. Far too often in academia, software development is viewed as a distant second to innovations in methodology. Faculty very rarely get tenure based on their contributions to research software, no matter how impactful. As a result, graduate students and faculty alike are left re-inventing the wheel more often than not, leading to a wasteland of essentially throwaway MATLAB or R code. I am hopeful that this pattern will someday change at a grand scale.

Matthew Goodman’s Lightning Talk: Slides

Here’s another talk where having the video makes a big difference. Matthew gives a great list of indispensable scientific Python tools. I especially enjoyed the bit about IPython: “If you are not using this tool, you are doing it wrong!”

Hilary Mason’s keynote talk

Hilary gave a really fun keynote presentation about scientific computing and the work going on at bit.ly. There’s a lot more happening in URL-shortening-land than I thought! When asked about other technologies like R she said, “There’s no way we [bit.ly] will run R in production.” Having experienced the misery associated with trying to run R in production in the quant finance world, I must say that I definitely agree with that sentiment.

Python in Finance Panel

Travis Oliphant, Henry Ward (CEO of Second Sight), and I participated in a panel discussion moderated by Peter Wang. We were asked about our experiences using Python for financial applications as well as the institutional and technical challenges associated with using Python to build research and production systems within larger financial enterprises.

It’s a topic for another blog post, but I spoke at length about the role that Python plays in solving the so-called “research-production gap”. That is, often financial firms do research in one language (R and MATLAB are quite popular) and do production implementations in another (Java/C++). I made the argument that, outside of low-latency high frequency trading, there are huge organizational benefits to building a one-language platform in Python. Also, Python may be the only programming language that is high-productivity and suitable for interactive research as well as building production-worthy, robust, and maintainable systems. More and more hedge funds, banks, and other financial firms are realizing this as time goes on.

Travis spoke about many of the challenges of using Python within larger organizations like investment banks (something I’ve no experience with, but Enthought has done significant consulting work in that space). He had a number of other valuable perspectives, many of which are escaping me (will need to go back and watch the video!).

Henry Ward, who is leading a startup effort to bring quantitative investment tools (such as portfolio optimization) to retail investors, spoke about his experiences building scalable backend systems in Python which also need to carry out extensive statistical computations and data manipulations, for example using pandas. Since Python has great web application tools like Django, rich RPC protocols (e.g. Thrift), and scientific computing libraries, it’s possible to build a pure Python architecture which can scale very naturally on the cloud. Very cool stuff. He also mentioned starting out the project in Ruby and concluded that “Friends don’t let friends do science in Ruby.” Well put, sir.

Other links of interest

I’ll add stuff here when I think of it (and when I get a chance to see videos for the talks I missed).

SciPy 2011 Lightning Talk on pandas (with video!)

Video Link: my talk starts at 18:25 approximately.

SciPy 2011 talk on Time Series Analysis

Adventures in Aggregating Data (Group By)

I’m a big fan of Interactive Brokers. Cheap trading, great APIs, lightning fast execution with no frills, designed for the professional trader. Perfect for people like me. Here’s the background for this article: I started writing short-dated out-of-the-money call options on my stock and ETF positions last year as a way to hedge my downside risk (while limiting upside) and maybe even make a little extra income on the side in this volatile and frequently sideways market. Recently I was looking over my brokerage statements, trying to get a grasp on my realized and mark-to-market PnL (profit-and-loss) for this year and last: equity positions plus options that expired in each month since the beginning of the year. After poring over the various forms of reports I can get from IB, I concluded the quickest-and-dirtiest (not to mention the most fun) way was to write a little Python program to do it for me.

I’ll spare you the munging bit of getting the data out of the HTML activity statement using BeautifulSoup and skip to where I’ve scraped the relevant data into a DataFrame. Here’s some faux data for illustration purposes:

In [65]: data
     Symbol                 Underlying   Kind     MTM     Realized
0    AAPL                   AAPL         Stock    120.5   108.5
1    CTXS                   CTXS         Stock    101.5   91.39
2    NFLX                   NFLX         Stock   -193.5  -174.1
3    SPY                    SPY          Stock    67.77   60.99
4    USO                    USO          Stock   -3.355  -3.019
5    VMW                    VMW          Stock    39.13   35.21
6    AAPL 18JUN11 350.0 C   AAPL         Option  -52.76  -47.48
7    AAPL 18JUN11 355.0 C   AAPL         Option   211.2   190.1
8    CTXS 18JUN11 87.5 C    CTXS         Option   65.17   58.65
9    CTXS 18JUN11 90.0 C    CTXS         Option  -33.16  -29.84
10   NFLX 18JUN11 270.0 C   NFLX         Option   80.38   72.34
11   NFLX 18JUN11 280.0 C   NFLX         Option   76.82   69.14
12   SPY 16APR11 132.0 C    SPY          Option   54.25   48.83
13   USO 16APR11 44.0 C     USO          Option  -122.4  -110.1
14   USO 16APR11 45.0 C     USO          Option   73.06   65.76
15   USO 18JUN11 40.0 C     USO          Option   50.96   45.86
16   VMW 18JUN11 100.0 C    VMW          Option   25.59   23.03
17   VMW 18JUN11 105.0 C    VMW          Option   91.11   82
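
If you want to follow along, a few rows of the table can be reconstructed by hand (the column layout is what matters here; the values are just the ones from the printout above):

```python
from pandas import DataFrame

# A hypothetical reconstruction of a slice of the scraped statement
data = DataFrame({
    'Symbol': ['AAPL', 'AAPL 18JUN11 350.0 C', 'AAPL 18JUN11 355.0 C'],
    'Underlying': ['AAPL', 'AAPL', 'AAPL'],
    'Kind': ['Stock', 'Option', 'Option'],
    'MTM': [120.5, -52.76, 211.2],
    'Realized': [108.5, -47.48, 190.1],
})
```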


So, I want to aggregate the Realized and MTM columns by Underlying and by whether it was an equity or option instrument. Were this data in a SQL table, this operation could be expressed extremely concisely:

SELECT Underlying, Kind,
       SUM(Realized) AS Realized, SUM(MTM) AS MTM
FROM positions  -- assuming the scraped data lives in a table named "positions"
GROUP BY Underlying, Kind
ORDER BY Underlying

But since I like doing interactive data analysis in Python (and loathe writing lots of SQL queries), I’d rather do this with pandas and also produce some easier-to-read console output. Here’s what I ended up with:

from pandas import DataFrame

mtm = {}
realized = {}
for kind, group in data.groupby('Kind'):
    mtm_by_underlying = group['MTM'].groupby(group['Underlying'])
    realized_by_underlying = group['Realized'].groupby(group['Underlying'])
    mtm[kind] = mtm_by_underlying.sum()
    realized[kind] = realized_by_underlying.sum()

# Convert to DataFrame and fill NA values with 0
mtm = DataFrame(mtm).fillna(0)
realized = DataFrame(realized).fillna(0)

And it outputs exactly what I’m looking for:

In [77]: mtm
       Option   Stock
AAPL   158.4    120.5
CTXS   32.01    101.5
NFLX   157.2   -193.5
SPY    54.25    67.77
USO    1.668   -3.355
VMW    116.7    39.13

In [79]: mtm.sum(1)
AAPL    278.941528842
CTXS    133.551244568
NFLX    -36.2669453776
SPY     122.022253941
USO     -1.68704136356
VMW     155.829013419

It occurred to me after writing this code that, with some changes to the GroupBy functionality in pandas, this could be made *much easier*. Ideally I’d like to write:

grouped = data.groupby('Underlying', 'Kind')
mtm = grouped['MTM'].sum()
realized = grouped['Realized'].sum()

I’ll get to work on this in the pandas repo.
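
For anyone reading this later: pandas did grow multi-key grouping, spelled with a list of keys, and unstack pivots one of the group levels into columns. A sketch with toy values in place of the real statement data:

```python
from pandas import DataFrame

# Toy stand-in for the scraped statement data
data = DataFrame({
    'Underlying': ['AAPL', 'AAPL', 'AAPL', 'USO'],
    'Kind': ['Stock', 'Option', 'Option', 'Stock'],
    'MTM': [120.5, -52.76, 211.2, -3.355],
})

# Group by both keys at once, then pivot Kind into the columns,
# filling in 0 for missing (Underlying, Kind) combinations
grouped = data.groupby(['Underlying', 'Kind'])
mtm = grouped['MTM'].sum().unstack('Kind').fillna(0)
```

This reproduces the Underlying-by-Kind table from above in three lines instead of the dict-building loop.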

Here’s the Gist of the full script