Category Archives: statistics

Contingency tables and cross-tabulations in pandas

Someone recently asked me about creating cross-tabulations and contingency tables using pandas. I wrote a bit about this in October after implementing the pivot_table function for DataFrame. So I thought I would give a few more examples and show R code next to the equivalent pandas code, which will be helpful for those already used to the R way of doing things.

When calling help(xtabs) (or help(table)) in R, you get a number of examples for using table, xtabs, and ftable:

> head(esoph)
  agegp     alcgp    tobgp ncases ncontrols
1 25-34 0-39g/day 0-9g/day      0        40
2 25-34 0-39g/day    10-19      0        10
3 25-34 0-39g/day    20-29      0         6
4 25-34 0-39g/day      30+      0         5
5 25-34     40-79 0-9g/day      0        27
6 25-34     40-79    10-19      0         7
> ftable(xtabs(cbind(ncases, ncontrols) ~ agegp + tobgp, data = esoph))
                ncases ncontrols
agegp tobgp
25-34 0-9g/day       0        70
      10-19          1        19
      20-29          0        11
      30+            0        16
35-44 0-9g/day       2       109
      10-19          4        46
      20-29          3        27
      30+            0        17
45-54 0-9g/day      14       104
      10-19         13        57
      20-29          8        33
      30+           11        19
55-64 0-9g/day      25       117
      10-19         23        65
      20-29         12        38
      30+           16        22
65-74 0-9g/day      31        99
      10-19         12        38
      20-29         10        20
      30+            2         4
75+   0-9g/day       6        26
      10-19          5        11
      20-29          0         3
      30+            2         4
> ftable(xtabs(cbind(ncases, ncontrols) ~ agegp, data = esoph))
      ncases ncontrols
agegp
25-34      1       116
35-44      9       199
45-54     46       213
55-64     76       242
65-74     55       161
75+       13        44

Here are the same examples using pandas:

In [19]: table = esoph.pivot_table(['ncases', 'ncontrols'],
                                   rows=['agegp', 'tobgp'],
                                   aggfunc=np.sum)

In [20]: table
Out[20]:
                 ncases  ncontrols
agegp  tobgp
25-34  0-9g/day  0.      70.
       10-19     1.      19.
       20-29     0.      11.
       30+       0.      16.
35-44  0-9g/day  2.      109
       10-19     4.      46.
       20-29     3.      27.
       30+       0.      17.
45-54  0-9g/day  14      104
       10-19     13      57.
       20-29     8.      33.
       30+       11      19.
55-64  0-9g/day  25      117
       10-19     23      65.
       20-29     12      38.
       30+       16      22.
65-74  0-9g/day  31      99.
       10-19     12      38.
       20-29     10      20.
       30+       2.      4.0
75+    0-9g/day  6.      26.
       10-19     5.      11.
       20-29     0.      3.0
       30+       2.      4.0

In [22]: table2 = esoph.pivot_table(['ncases', 'ncontrols'],
                                    rows='agegp', aggfunc=np.sum)

In [23]: table2
Out[23]:
       ncases  ncontrols
agegp
25-34  1.      116
35-44  9.      199
45-54  46      213
55-64  76      242
65-74  55      161
75+    13      44.

One of the really great things about pandas is that the object produced is still a DataFrame (the R object is of a special class)! So we can do anything with it that we would do with normal DataFrame objects:

In [26]: table['ncases'].unstack('agegp')
Out[26]:
agegp 25-34 35-44 45-54 55-64 65-74 75+
tobgp
0-9g/day 0 2 14 25 31 6
10-19 1 4 13 23 12 5
20-29 0 3 8. 12 10 0
30+ 0 0 11 16 2. 2

Here is another fun example:

In [31]: import pandas.rpy.common as com
In [32]: wp = com.load_data('warpbreaks')
In [33]: wp['replicate'] = np.tile(range(1, 10), 6)
In [34]: wp.pivot_table('breaks', rows=['wool', 'tension'],
                        cols='replicate', aggfunc=np.sum)
Out[34]:
replicate      1   2   3   4   5   6   7   8   9
wool tension
A    H        36  21  24  18  10  43  28  15  26
     L        26  30  54  25  70  52  51  26  67
     M        18  21  29  17  12  18  35  30  36
B    H        20  21  24  17  13  15  15  16  28
     L        27  14  29  19  29  31  41  20  44
     M        42  26  19  16  39  28  21  39  29

Here is the equivalent R code:

> warpbreaks$replicate <- rep(1:9, len = 54)
> ftable(xtabs(breaks ~ wool + tension + replicate, data = warpbreaks))
             replicate  1  2  3  4  5  6  7  8  9
wool tension
A    L                 26 30 54 25 70 52 51 26 67
     M                 18 21 29 17 12 18 35 30 36
     H                 36 21 24 18 10 43 28 15 26
B    L                 27 14 29 19 29 31 41 20 44
     M                 42 26 19 16 39 28 21 39 29
     H                 20 21 24 17 13 15 15 16 28

As soon as we add a formula framework to statsmodels I could see incorporating that into this kind of function in Python.
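Purely as a hypothetical sketch of what that might look like (there is no such formula interface in pandas today; the xtab name and the formula syntax below are invented for illustration), a thin R-style wrapper over pivot_table could be as simple as:

import numpy as np

def xtab(formula, data, aggfunc=np.sum):
    # hypothetical helper: turn "ncases + ncontrols ~ agegp + tobgp"
    # into the pivot_table call shown above
    lhs, rhs = formula.split('~')
    values = [s.strip() for s in lhs.split('+')]
    rows = [s.strip() for s in rhs.split('+')]
    return data.pivot_table(values, rows=rows, aggfunc=aggfunc)

# e.g. xtab('ncases + ncontrols ~ agegp + tobgp', esoph)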

PyHPC 2011 Pre-print paper on pandas

I’ll be presenting this paper in a couple of weeks at SC11. It’s the most up-to-date white paper about pandas.

PDF Download Link

pandas: a Foundational Python Library for Data Analysis and Statistics

R’s factor function is slow

I’m not making this up:

> arr <- rep(letters, 1000)
> system.time(for (i in 1:1000) foo <- factor(arr))
user system elapsed
1.809 0.023 1.821

> arr <- 1:10000
> system.time(for (i in 1:1000) foo <- factor(arr))
user system elapsed
7.782 0.088 8.261

Here’s the equivalent code using the pandas Factor class (git HEAD):

In [35]: import string
In [36]: arr = np.tile(list(string.letters[:26]), 1000).astype(object)
In [37]: Factor(arr)
Out[37]:
Factor:
array([a, b, c, ..., x, y, z], dtype=object)
Levels (26): [a b c d e f g h i j k l m n o p q r s t u v w x y z]
In [38]: timeit Factor(arr)
1000 loops, best of 3: 869 us per loop

In [39]: arr = np.arange(10000, dtype=object)
In [40]: timeit Factor(arr)
1000 loops, best of 3: 1.91 ms per loop

I just edited this post since R 2.11.1 is a lot slower than 2.13.1 (the latest release), so it wasn’t quite fair to base the comparisons on that. So there are two cases here:

  • Few unique values (the 26 letters of the alphabet). R does pretty well here, about 2x slower. This is down from 12x slower in R 2.11.1, so I can only assume someone on the R core team noticed this in the last year or so.
  • All unique values. R is about 4x slower now.

It’s possible I’m using a different algorithm (and so is R between 2.11 and 2.13!). Maybe Python’s much lauded dict implementation is responsible? I’m more curious than anything. Maybe an R guru could let me know (I’m afraid to look at the GPL source code).

For the record, here’s the Python and Cython code involved. First, the Python code:

def unique_with_labels(values):
    uniques = Index(_tseries.fast_unique(values))
    labels = _tseries.get_unique_labels(values, uniques.indexMap)
    return uniques, labels

And the two Cython functions

@cython.wraparound(False)
@cython.boundscheck(False)
def fast_unique(ndarray[object] values):
    cdef:
        Py_ssize_t i, n = len(values)
        list uniques = []
        dict table = {}
        object val, stub = 0

    for i from 0 <= i < n:
        val = values[i]
        if val not in table:
            table[val] = stub
            uniques.append(val)
    try:
        uniques.sort()
    except Exception:
        pass

    return np.asarray(uniques, dtype=object)

@cython.wraparound(False)
@cython.boundscheck(False)
def get_unique_labels(ndarray[object] values, dict idMap):
    cdef int i, length
    cdef object idx
    cdef ndarray[int32_t] labels
    length = len(values)
    labels = np.empty(length, dtype=np.int32)
    for i from 0 <= i < length:
        idx = values[i]
        labels[i] = idMap[idx]

    return labels

GroupBy-fu: improvements in grouping and aggregating data in pandas

A couple of weeks ago in my inaugural blog post I wrote about the state of GroupBy in pandas and gave an example application. However, I was dissatisfied with the limited expressiveness (see the end of the article), so over the last two weeks I decided to invest some serious time in beefing up what you can do with the groupby functionality in pandas. So this article is part show-and-tell, part quick tutorial on the new features. Note that I haven’t added a lot of this to the official documentation yet.

GroupBy primer

GroupBy may be one of the least well-understood features in pandas. Let’s consider a DataFrame (basically a labeled 2D array):

In [18]: df
Out[18]:
                      A         B        C         D
2000-03-31 00:00:00   0.6304   -1.493    0.2082    0.3378
2000-06-30 00:00:00  -0.1041    1.093    0.1227   -0.2362
2000-09-29 00:00:00  -1.822     0.7325  -0.4964   -1.707
2000-12-29 00:00:00   0.3325   -0.2213   0.5465    1.265
2001-03-30 00:00:00   0.8422   -0.1265   0.1234    0.7743
2001-06-29 00:00:00  -0.07774  -0.3906  -0.6328   -1.259
2001-09-28 00:00:00  -0.3417   -0.4209   1.724     0.3069
2001-12-31 00:00:00   0.493    -0.2815   1.314     1.004
2002-03-29 00:00:00   1.003    -0.4807   0.7232    1.079
2002-06-28 00:00:00   0.1415   -2.275    0.6178    0.8817
2002-09-30 00:00:00  -0.6855    0.3921  -1.468     1.522
2002-12-31 00:00:00  -1.635     0.3083   0.1209   -0.5606
2003-03-31 00:00:00   0.5707   -0.7125  -0.07677   0.5154
2003-06-30 00:00:00   0.3646   -0.8574  -0.562     0.2176
2003-09-30 00:00:00   1.024    -0.8113   0.6771   -1.261

Here, the index (row labels) contains dates and the columns are names for each time series. When performing a groupby operation, we may have different goals:

  • Perform an aggregation, like computing the sum or mean of each group. Functionally this means applying a function to each group and putting the aggregated results into a DataFrame
  • Slicing the DataFrame into chunks (groups) and then doing something with the slices
  • Performing a transformation, like standardizing each group (computing the zscore)

So there are two tasks: first, grouping the data; second, doing something with the grouped data. As far as grouping, in a totally abstract sense we have to define some kind of mapping that assigns labels (one of the axes) into group buckets. A mapping could be one of many things:

  • A Python function, to be called on each label
  • A dict, containing {label : group_name} mappings
  • An array the same length as the axis containing the group correspondences. This may often be one of the columns in a DataFrame, but need not be. Every other mapping type gets rendered to this form eventually.
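The function form is what the example below uses; for concreteness, here is a minimal sketch of the dict and array forms, both producing the same year-based grouping as the lambda used next:

import numpy as np

# a dict mapping each row label to a group name
by_dict = df.groupby(dict((ts, ts.year) for ts in df.index))

# an array of group labels, one per row
by_array = df.groupby(np.array([ts.year for ts in df.index]))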

So in the above data, let’s say we wanted to group the data by year. Since the row labels are Python datetime objects, we can access the year attribute when calling groupby

In [21]: grouped = df.groupby(lambda x: x.year)

This returns an object of type GroupBy. With this object, you can do a lot of things including: iteration, aggregation, or transformation. So we could iterate like so:

In [20]: for year, group in grouped:
   ....:     print year
   ....:     print group
   ....:
2000
                      A        B        C        D
2000-03-31 00:00:00   0.6304  -1.493    0.2082   0.3378
2000-06-30 00:00:00  -0.1041   1.093    0.1227  -0.2362
2000-09-29 00:00:00  -1.822    0.7325  -0.4964  -1.707
2000-12-29 00:00:00   0.3325  -0.2213   0.5465   1.265

2001
                      A         B        C        D
2001-03-30 00:00:00   0.8422   -0.1265   0.1234   0.7743
2001-06-29 00:00:00  -0.07774  -0.3906  -0.6328  -1.259
2001-09-28 00:00:00  -0.3417   -0.4209   1.724    0.3069
2001-12-31 00:00:00   0.493    -0.2815   1.314    1.004

2002
                      A        B        C        D
2002-03-29 00:00:00   1.003   -0.4807   0.7232   1.079
2002-06-28 00:00:00   0.1415  -2.275    0.6178   0.8817
2002-09-30 00:00:00  -0.6855   0.3921  -1.468    1.522
2002-12-31 00:00:00  -1.635    0.3083   0.1209  -0.5606

2003
                      A        B        C         D
2003-03-31 00:00:00   0.5707  -0.7125  -0.07677   0.5154
2003-06-30 00:00:00   0.3646  -0.8574  -0.562     0.2176
2003-09-30 00:00:00   1.024   -0.8113   0.6771   -1.261

Or aggregate like so:

In [21]: grouped.aggregate(np.sum)
Out[21]:
       A        B        C          D
2000  -0.9629   0.1108   0.3811    -0.3405
2001   0.9159  -1.219    2.529      0.8268
2002  -1.176   -2.055   -0.006253   2.921
2003   1.959   -2.381    0.03838   -0.528

Or transform groups like so (here I standardize each group):

In [22]: zscore = lambda x: (x - x.mean()) / x.std()

In [23]: transformed = grouped.transform(zscore)
Out[23]:
                      A        B         C         D
2000-03-31 00:00:00   0.7946  -1.316     0.2598    0.3396
2000-06-30 00:00:00   0.1246   0.9216    0.06319  -0.1213
2000-09-29 00:00:00  -1.442    0.61     -1.361    -1.302
2000-12-29 00:00:00   0.5228  -0.2155    1.038     1.084
2001-03-30 00:00:00   1.142    1.34     -0.47      0.5569
2001-06-29 00:00:00  -0.571   -0.644    -1.168    -1.438
2001-09-28 00:00:00  -1.062   -0.8715    1.009     0.0983
2001-12-31 00:00:00   0.4917   0.1758    0.6299    0.7827
2002-03-29 00:00:00   1.149    0.02675   0.7159    0.3864
2002-06-28 00:00:00   0.3857  -1.422     0.6118    0.168
2002-09-30 00:00:00  -0.3469   0.7316   -1.449     0.878
2002-12-31 00:00:00  -1.188    0.664     0.1209   -1.432
2003-03-31 00:00:00  -0.2443   1.097    -0.1434    0.7267
2003-06-30 00:00:00  -0.8552  -0.8598   -0.9205    0.4138
2003-09-30 00:00:00   1.099   -0.2376    1.064    -1.14

Some aggregations are so common that they’re provided as instance methods on the GroupBy object:

In [24]: grouped.sum()
Out[24]:
       A        B        C          D
2000  -0.9629   0.1108   0.3811    -0.3405
2001   0.9159  -1.219    2.529      0.8268
2002  -1.176   -2.055   -0.006253   2.921
2003   1.959   -2.381    0.03838   -0.528

In [25]: grouped.mean()
Out[25]:
       A        B         C          D
2000  -0.2407   0.02769   0.09526   -0.08512
2001   0.229   -0.3049    0.6324     0.2067
2002  -0.2939  -0.5138   -0.001563   0.7303
2003   0.6531  -0.7938    0.01279   -0.176

For those with SQL experience, a DataFrame can be treated like an SQL table. As such, grouping “by columns” is quite natural:

In [27]: df2
Out[27]:
    A     B       C        D
0   foo   one    -0.7883   0.7743
1   bar   one    -0.5866   0.06009
2   foo   two     0.9312   1.2
3   bar   three  -0.6417   0.3444
4   foo   two    -0.8841  -0.08126
5   bar   two     1.194   -0.7933
6   foo   one    -1.624   -0.1403
7   foo   three   0.5046   0.5833

In [28]: for name, group in df2.groupby('A'):
   ....:     print name
   ....:     print group
   ....:
bar
    A     B       C        D
1   bar   one    -0.5866   0.06009
3   bar   three  -0.6417   0.3444
5   bar   two     1.194   -0.7933

foo
    A     B       C        D
0   foo   one    -0.7883   0.7743
2   foo   two     0.9312   1.2
4   foo   two    -0.8841  -0.08126
6   foo   one    -1.624   -0.1403
7   foo   three   0.5046   0.5833

Note the presence of the “nuisance” B column when we group by A. I want this to work:

In [30]: df2.groupby('A').mean()
Out[30]:
      C         D
bar  -0.01137  -0.1296
foo  -0.3722    0.4671

More on dealing with this below.

New: Column selection

In my last post I mentioned wanting to be able to do a groupby-operation on a single column of a DataFrame using another column or set of columns as the groupers. So in the above example, what I want is:

grouped = df['C'].groupby(df['A'])

# or
grouped = df['C'].groupby([df['A'], df['B']])

This is too verbose / awkward for my tastes. So now you can do:

In [111]: df2.groupby('A')['C'].mean()
Out[111]:
bar    -0.447919901908
foo    -0.158016186685

Behind the scenes, this simply passes the C column to a Series GroupBy object along with the already-computed grouping(s).
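In other words, the short spelling is just sugar for the verbose one; a quick sketch of the equivalence using df2:

# what df2.groupby('A')['C'].mean() does under the hood, roughly
grouped = df2['C'].groupby(df2['A'])
grouped.mean()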

New: Group by multiple columns / key functions

The ability to group by multiple criteria (just like SQL) has been one of my most desired GroupBy features for a long time. It’s also very hard to implement efficiently. So I bit the bullet and carefully crafted Cython code to speedily do aggregations on data grouped by multiple columns. So now you can do:

In [113]: df2.groupby(['A', 'B']).mean()
Out[113]:
    A     B       C        D
0   foo   one     0.3558  -0.4818
1   foo   two    -0.2758  -1.271
2   foo   three  -0.95     1.709
3   bar   one     1.916   -0.5093
4   bar   two    -2.315    0.1177
5   bar   three  -0.9447   0.2519

Iteration with multiple groups flattens the key pairs:

In [114]: for k1, k2, group in df2.groupby(['A', 'B']):
   .....:     print k1, k2
   .....:     print group
   .....:
foo one
    A     B     C        D
0   foo   one   0.8248   0.3856
6   foo   one  -0.1133  -1.349

foo two
    A     B     C        D
2   foo   two  -1.175   -1.036
4   foo   two   0.6239  -1.506

foo three
    A     B       C      D
7   foo   three  -0.95   1.709

bar one
    A     B     C       D
1   bar   one   1.916  -0.5093

bar two
    A     B     C       D
5   bar   two  -2.315   0.1177

bar three
    A     B       C        D
3   bar   three  -0.9447   0.2519

As a less trivial example, I loaded some data from the Bureau of Labor Statistics, which I’ll write more about another time:

In [126]: df = read_table('ce.data.44.FinActOtherCurr.txt',
                          sep='\s+', index_col=None)

In [127]: df
Out[127]:
<class 'pandas.core.frame.DataFrame'>
Index: 32806 entries, 0 to 32805
Data columns:
series_id    32806  non-null values
year         32806  non-null values
period       32806  non-null values
value        32806  non-null values
dtypes: int64(1), float64(1), object(2)

In [128]: df[:15]
Out[128]:
     series_id       year   period   value
0    CEU5500000034   1990   M01      83
1    CEU5500000034   1990   M02      83.6
2    CEU5500000034   1990   M03      83.4
3    CEU5500000034   1990   M04      84.9
4    CEU5500000034   1990   M05      84.1
5    CEU5500000034   1990   M06      85.4
6    CEU5500000034   1990   M07      86.5
7    CEU5500000034   1990   M08      86
8    CEU5500000034   1990   M09      85.5
9    CEU5500000034   1990   M10      83.6
10   CEU5500000034   1990   M11      83.6
11   CEU5500000034   1990   M12      84.7
12   CEU5500000034   1990   M13      84.5
13   CEU5500000034   1991   M01      82.9
14   CEU5500000034   1991   M02      83.1

This data set has 118 unique series_id’s and spans 22 years (the period column gives the month of observation), so we’re talking about 2596 unique groups. Let’s say we want to compute the mean value for each series in each year; it’s pretty simple with the improved groupby:

In [129]: df.groupby(['series_id', 'year']).mean()
Out[129]:
<class 'pandas.core.frame.DataFrame'>
Index: 2596 entries, 0 to 2595
Data columns:
series_id    2596  non-null values
value        2596  non-null values
year         2596  non-null values
dtypes: int64(1), float64(1), object(1)

I was also pretty impressed with how fast this operation is (the power of Cython!):

In [130]: timeit df.groupby(['series_id', 'year']).mean()
10 loops, best of 3: 22.9 ms per loop

I haven’t compared this with an SQL engine (SQLite, MySQL, Postgres, …) but I hope that it’s competitive. If you call a non-Cythonized function, the speed goes down considerably:

In [132]: timeit df.groupby(['series_id', 'year'])['value'].std()
1 loops, best of 3: 232 ms per loop

Edit: It’s also possible to pass multiple functions instead of column names or to mix and match column names with functions:

In [22]: df.groupby([lambda x: x.year, lambda x: x.month]).mean()
Out[22]:
     A          B         C          D         key_0   key_1
0    0.09433    0.06146   0.1891     0.2079    2000    1    
1    0.02674   -0.09411   0.1557    -0.07926   2000    2    
2    0.07089    0.3938   -0.08026   -0.416     2000    3    
3    0.133     -0.04532   0.03341   -0.2866    2000    4    
4    0.1301     0.1803   -0.4241     0.1482    2000    5    
5   -0.3804    -0.08935   0.04791    0.08268   2000    6    
6    0.2211     0.2483   -0.3559    -0.06205   2000    7    
7   -0.2908     0.07606  -0.2001    -0.05225   2000    8    
8   -0.07888   -0.3915    0.06442    0.5109    2000    9    
9    0.3809     0.4037   -0.2431    -0.2892    2000    10  
10  -0.2765     0.01461  -0.2653     0.1335    2000    11  
11   0.6983    -0.1167    0.3964     0.03776   2000    12  
12  -0.2347     0.2333   -0.06508    0.3       2001    1    
13   0.1751     0.05545  -0.004038   0.06194   2001    2    
14   0.08792    0.03517   0.05583   -0.4356    2001    3    
15  -0.08408    0.05567  -0.1705     0.1914    2001    4    
16   0.008912   0.2317    0.08875    0.08637   2001    5    
17   0.3304     0.3635    0.005158   0.3588    2001    6    
18  -0.1324     0.3419   -0.2527    -0.2249    2001    7    
19   0.0472     0.2552   -0.07589   -0.07992   2001    8    
20   0.07582   -0.2295    0.1002     0.1077    2001    9    
21   0.2368     0.02214  -0.3866    -0.2671    2001    10  
22   0.2116    -0.07872  -0.06399    0.2309    2001    11

New: Omitting “nuisance” columns

It’s pretty common to group by a column and ignore other columns containing non-floating point data. It used to be that you’d get an error, forcing you to first drop the “nuisance” columns, e.g. suppose we only want to group by the A column here:

In [133]: df2
Out[133]:
    A     B       C        D
0   foo   one     0.8248   0.3856
1   bar   one     1.916   -0.5093
2   foo   two    -1.175   -1.036
3   bar   three  -0.9447   0.2519
4   foo   two     0.6239  -1.506
5   bar   two    -2.315    0.1177
6   foo   one    -0.1133  -1.349
7   foo   three  -0.95     1.709

If you try to compute the mean of the B column, things won’t quite work out:

In [134]: grouped = df2.groupby('A')['B'].mean()
TypeError: unsupported operand type(s) for /: 'str' and 'float'

To cope with this, so-called “nuisance” columns are now automatically dropped. While I don’t like discarding information, I think “practicality beats purity” in this case:

In [135]: df2.groupby('A').mean()
Out[135]:
      C        D
bar  -0.4479  -0.0466
foo  -0.158   -0.3593

New: “auto-dispatching” to DataFrame / Series

Since very few functions in DataFrame and Series/TimeSeries are implemented on the GroupBy classes, you may wish to simply invoke a Series instance method on each group, let’s say:

In [140]: df.groupby('series_id')['value'].agg(Series.skew)
Out[140]:
CEU5500000034    -0.253887792525
CEU5500000035    0.0938252241898
CEU5552200034    0.387563548587
CEU5552200035    0.0868489117242
CEU5552210034    0.74506367536
CEU5552210035    0.551290226183
<snip>

But since we’ve got Python at our disposal, it’s relatively straightforward to override __getattribute__ to wrap and dispatch calls to Series/DataFrame instance methods so you can do this:

In [141]: df.groupby('series_id')['value'].skew()
Out[141]:
CEU5500000034    -0.253887792525
CEU5500000035    0.0938252241898
CEU5552200034    0.387563548587
CEU5552200035    0.0868489117242
CEU5552210034    0.74506367536
<snip>

In fact, even some of the more complicated methods, like the super-useful describe method, can be used:

In [142]: df.groupby('series_id')['value'].describe()
Out[142]:
               count   mean   std     min   10%     50%     90%     max
CEU5500000034    279  96.54   8.317   81.7  84      98.6    107.2   112
CEU5500000035    279  93.88   28.9    50.4  55.28   91.9    134.5   140.2
CEU5552200034    278  97.95   6.839   86.9  89.47   97.15   108.2   113.9
CEU5552200035    278  93.58   26.18   58    60.34   89.2    127.4   137.3
CEU5552210034    278  100.7   3.95    94.1  95.87   99.8    106     113.1
CEU5552210035    278  98.84   23.86   71.6  73.77   92.6    137.2   145.8
...

For those interested, the internal code to do this dispatching is short and simple, by the way:

def __getattribute__(self, attr):
    try:
        return object.__getattribute__(self, attr)
    except AttributeError:
        if hasattr(self.obj, attr):
            return self._make_wrapper(attr)
        raise

def _make_wrapper(self, name):
    f = getattr(self.obj, name)
    if not isinstance(f, types.MethodType):
        return self.aggregate(lambda self: getattr(self, name))

    # return the class reference
    f = getattr(type(self.obj), name)

    def wrapper(*args, **kwargs):
        curried = lambda self: f(self, *args, **kwargs)
        return self.aggregate(curried)
    return wrapper

Old but not well known: apply multiple functions

This has always been around but not well-known by many people. Suppose you want to apply multiple functions to a group and collect all the results. You can pass a dict of functions:

In [144]: grouped.agg({'mean' : np.mean, 'variance' : np.var})
Out[144]:
                mean    variance
CEU5500000034   96.54   68.92  
CEU5500000035   93.88   832.3  
CEU5552200034   97.95   46.6    
CEU5552200035   93.58   683.1  
CEU5552210034   100.7   15.54  
CEU5552210035   98.84   567.1  
CEU5552211034   100.9   13.73  
CEU5552211035   99.34   622    
<snip>

I have a mind to expand on this idea for DataFrame objects to get us closer to the full flexibility of SQL with Pythonic syntax. If anyone has ideas or suggestions please let me know.
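In the meantime, here is one way the idea could be emulated with the pieces shown above, purely as a sketch (the multi_agg helper and its spec format are invented for illustration, not a pandas API): each selected column gets the dict-of-functions treatment, and the results are glued back into a single DataFrame.

import numpy as np
from pandas import DataFrame

def multi_agg(grouped, spec):
    # spec: {column name: {result name: aggregation function}}
    pieces = {}
    for col, funcs in spec.items():
        agged = grouped[col].agg(funcs)   # one result column per function
        for name in agged.columns:
            pieces['%s_%s' % (col, name)] = agged[name]
    return DataFrame(pieces)

# e.g. multi_agg(df.groupby('series_id'),
#                {'value': {'mean': np.mean, 'variance': np.var}})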

Conclusions

GroupBy is certainly not done. If you use these tools and find them useful, please let me know. If you can think of ways to make them better, that would be nice information too. I think it would be great to implement a full SQL engine on top of pandas (similar to the SAS “proc sql”), and this new GroupBy functionality gets us closer to that goal.

A Roadmap for Rich Scientific Data Structures in Python

Discussion thread on Hacker News

So, this post is a bit of a brain dump on rich data structures in Python and what needs to happen in the very near future. I care about them for statistical computing (I want to build a statistical computing environment that trounces R) and financial data analysis (all evidence leads me to believe that Python is the best all-around tool for the finance space). Other people in the scientific Python community want them for numerous other applications: geophysics, neuroscience, etc. It’s really hard to make everyone happy with a single solution. But the current state of affairs has me rather anxious. And I’d like to explain why. For a really quick summary on some of the work I’ve been doing, here are my SciPy 2010 talk slides:

Data structures with metadata, the backstory

In the wake of SciPy 2011 I’ve been thinking a lot about the way forward from here in terms of building rich Pythonic data structures for statistics and many, many other fields. By rich I mean: it is not just a NumPy ndarray; it contains metadata (however we define metadata) and has operations which depend on the metadata; and in general it does far more than structured arrays currently do for you. This touches on a great many topics and features that people want (partial list):

  • Manipulating heterogeneously-typed (what I loosely call “mixed type”) data
  • Size mutability: can easily add “columns” or other N-1-dimensional hyperslices without necessarily copying data
  • Metadata about what each axis represents: axis=3 is less meaningful than axis=’temperature’
  • Metadata about axis labels (ticks)
  • Label / tick-based data alignment / reshaping, either automatic or explicit
  • Label / tick-based (fancy) indexing, both setting and getting
  • Hierarchical columns
  • Robust handling of missing / NA data (à la R)
  • Tons of operations needing heterogeneous data and metadata: group by, filtering, sorting, selection / querying, reindexing (reshaping to conform to a new set of labels), axis selection based on names, etc. etc. 

The list goes on and on. I could write a 50-page manuscript on the exact specification of what functionality is desired for each of the above bullet points. What I do know is that after using a rich data structure like the ones in pandas, it’s very, very hard to go back to using vanilla ndarrays. To wit, R users coming to Python have a similar experience: the lack of data.frame and all the functions which operate on data.frames is a gigantic loss of functionality. When I work with MATLAB and R users (especially in the finance industry) and get them up and running with pandas, I get a lot of “where was this tool all my life?”. It’s just that much better. In fact, users can get by with only a very rudimentary understanding of NumPy if the data structures are good enough; I think this is highly desirable. Even for purely interactive data analysis (forget operations which actually utilize the metadata), isn’t this much better:

In [4]: data.corr()
Out[4]:
       AAPL     GOOG     MSFT     YHOO  
AAPL   1        0.5724   0.4714   0.3447
GOOG   0.5724   1        0.5231   0.3409
MSFT   0.4714   0.5231   1        0.3012
YHOO   0.3447   0.3409   0.3012   1

than this:

In [11]: np.corrcoef(data.T)
Out[11]:
array([[ 1.    ,  0.5724,  0.4714,  0.3447],
       [ 0.5724,  1.    ,  0.5231,  0.3409],
       [ 0.4714,  0.5231,  1.    ,  0.3012],
       [ 0.3447,  0.3409,  0.3012,  1.    ]])

Of course if data were a structured ndarray you would be completely up a creek (most NumPy functions do not play well with structured arrays). But that’s another topic.

But anyway, to the point of why I’m writing: we have a ton of really talented people with real problems to solve, and lots of great ideas about how to solve them. Last year at SciPy 2010 in Austin we had a Birds of a Feather session led by the venerable Fernando Pérez and myself to talk about the datarray project, pandas, tabular, larry, and other various ideas about data structures that people have kicked around. The topic is important enough that Enthought hosted a gathering this past May in Austin, the DataArray Summit, to talk about these issues and figure out where to go from here. It was a great meeting and we hashed out in gory detail many of the problem areas that we’d like to solve with a richer data structure. So that’s a bit of the backstory.

But even given all these great discussions and work being done, we have a really fundamental problem:

Fragmentation is killing us

There, I said it =) All of us NumPy ninjas can navigate the fragmented, incohesive collection of data structures and tools, but it’s confusing as hell for new and existing users. NumPy alone is not good enough for a lot of people (statisticians, financial data analysts, etc.), but they’re left with a confusing choice between pandas, larry, datarray, or something else. Also, these tools have largely not been integrated with any other tools because of the community’s collective commitment anxiety. We talk, hem and haw, and wring our hands. And still no integration. I don’t mean to complain: I just deeply care about making the scientific Python stack the most powerful data analysis stack to ever exist. Seriously. And I can’t do it alone. And I don’t want to make unilateral decisions and shove anything down anyone’s throat. We’ve started working on integration of pandas in statsmodels (which is already going to make a really huge difference), but we need to collectively get our proverbial sh*t together. And soon.

My work on pandas lately and why it matters

On my end, in the 2 months since the DataArray summit, I decided to put my PhD on hold and focus more seriously on Python development, statistics / statsmodels, pandas, and other projects I care deeply about. So I’ve easily invested more time in pandas in the last 2 months than in the previous 2 years. This included heavily redesigned internals (which I’m extremely pleased with) and tackling various other thorny internal issues which had long been an annoyance for me and an external source of criticism by end users (like, “why are there 2 2-dimensional data structures, DataFrame and DataMatrix?” The answer is that the internal implementation was different, with different performance characteristics depending on use). I’m also trying to clear out my extensive backlog of wishlist features (lots of blog articles to come on these). What’s happened is that I started prototyping a new data structure which I’m calling NDFrame which I think is really going to be a big deal. Basically, the NDFrame is a by-product of the redesigned internal data structure (currently called BlockManager) backing DataFrame and friends. I can and should write an entire post about BlockManager and exactly what it needs to accomplish, but the short story is that BlockManager:

  • Stores an arbitrary collection of homogeneously-typed N-dimensional NumPy ndarray objects (I call these Block objects, or simply blocks)
  • Has axis labels and is capable of reshaping the “blocks” to a new set of labels
  • Is capable of  consolidating blocks (gluing them together) having the same dtype
  • Knows how to appropriately introduce missing data, and upcast dtypes (int, bool) when the missing data marker (NaN, currently. Crossing my fingers for a good NA implementation!) needs to be introduced
  • Can deal with single “item” type casts (ha, try to do that with structured arrays!)
  • Can accept new blocks without copying data

Maybe this is too meta (heh) for some people. As an illustration, here’s literally what’s going on inside DataFrame now after all my latest hacking:

In [40]: df
Out[40]:
    b         a         c       e   f
0   1.213     1.507     True    1   a
1  -0.6765    0.06237   True    1   b
2   0.3126   -0.2575    False   1   c
3   0.1505    0.2242    True    1   d
4  -0.7952    0.2909    True    1   e
5   1.341    -0.9712    False   2   f
6   0.01121   1.654     True    2   g
7  -0.173    -1.385     False   2   h
8   0.1637   -0.898     False   2   i
9   0.5979   -1.035     False   2   j

In [41]: df._data
Out[41]:
BlockManager
Items: [b a c e f]
Axis 1: [0 1 2 3 4 5 6 7 8 9]
FloatBlock: [b a], 2 x 10, dtype float64
BoolBlock: [c], 1 x 10, dtype bool
IntBlock: [e], 1 x 10, dtype int64
ObjectBlock: [f], 1 x 10, dtype object

The user is of course never intended to look at this; it’s purely internal. But, for example, things like this are OK to do:

In [42]: df['e'] = df['e'].astype(float)
In [43]: df._data
Out[43]:
BlockManager
Items: [b a c e f]
Axis 1: [0 1 2 3 4 5 6 7 8 9]
FloatBlock: [b a], 2 x 10, dtype float64
BoolBlock: [c], 1 x 10, dtype bool
ObjectBlock: [f], 1 x 10, dtype object
FloatBlock: [e], 1 x 10, dtype float64

Since in that case there are now multiple float blocks, they can be explicitly consolidated, but if you use DataFrame many operations will cause it to happen automatically (which is highly desirable, especially when you have only one dtype, for doing row-oriented operations):

In [44]: df._data.consolidate()
Out[44]:
BlockManager
Items: [b a c e f]
Axis 1: [0 1 2 3 4 5 6 7 8 9]
BoolBlock: [c], 1 x 10, dtype bool
FloatBlock: [b a e], 3 x 10, dtype float64
ObjectBlock: [f], 1 x 10, dtype object

Now, this is a pretty abstract business. Here’s my point: when I started thinking about NDFrame, a user-facing n-dimensional data structure backed by BlockManager, I realized that what I am going to build is a nearly strict superset of the functionality provided by every other rich data structure I know of. I made a picture of the feature overlap, and note that arrows loosely mean: “can be used to implement”:

Data Structure Features Venn Diagram

For example, I need to write generic fancy indexing on the NDFrame, a task largely tackled by DataArray. So rather than reinvent the wheel, I should just co-opt that code (love that BSD license), but then I’ve effectively created a fork (nooooo!). I think having all these different libraries (and leaving users the confusing choice between them) is kind of nuts. Ideally DataArray (homogeneous) should just be a part of pandas (and I’m not opposed to changing the name, though it has stronger branding and far more users than datarray or larry). But once we’ve gone down that route, larry is just a DataArray (homogeneous) with automatic data alignment. We’re all doing the exact same kinds of things. So why not have one library?

Remember: the real world is heterogeneously-typed

Some other casual (and potentially controversial) observations

Here’s just a dumping ground of various thoughts I have on this and related topics:

  • The NumPy type hierarchy (int8, int16, int32, int64, uint8, uint16, …) isn’t that important to me. R and MATLAB don’t really have a type hierarchy and it doesn’t seem to pose a problem. So I haven’t gone down the road of creating Block objects mapping onto the NumPy type hierarchy. If someone wants to do it without complicating the end-user pandas experience, be my guest
  • Tying yourself to the ndarray is too restrictive. This is a major problem with DataArray and larry; they don’t do mixed-type data. So if you want to build something that competes with R you have failed before you have even begun by using a homogeneous-only data structure. Remember, DataFrame can be homogeneous whenever it wants to and getting the underlying ndarray is just a few keystrokes.
  • Structured arrays are probably not the answer. Size mutability (ability to add columns) and the ability to change dtypes are actually a big deal. As is the ability to do row-oriented computations, broadcasting, and all the things that structured arrays are (currently) unable to do. They are a darned convenient way of serializing / deserializing data, though. But memory layout and all that is far, far less important to me than usability / user interface / experience

SciPy 2011 talk on Time Series Analysis

Adventures in Aggregating Data (Group By)

I’m a big fan of Interactive Brokers. Cheap trading, great APIs, lightning fast execution with no frills, designed for the professional trader. Perfect for people like me. Here’s the background for this article: I started writing short-dated out-of-the-money call options on my stock and ETF positions last year as a way to hedge my downside risk (while limiting upside) and maybe even make a little extra income on the side in this volatile and frequently sideways market. Recently I was looking over my brokerage statements and trying to get a grasp on my realized and mark-to-market PnL (profit-and-loss) this year and last since we’re talking about equity positions plus options that expired in each month since the beginning of the year. After poring over the various forms of reports I can get from IB I concluded the quickest-and-dirtiest (not to mention the most fun) way was to write a little Python program to do it for me.

I’ll spare you the munging bit of getting the data out of the HTML activity statement using BeautifulSoup and skip to where I’ve scraped the relevant data into a DataFrame. Here’s some faux data for illustration purposes:

In [65]: data
Out[65]:
     Symbol                 Underlying   Kind     MTM     Realized
0    AAPL                   AAPL         Stock    120.5   108.5
1    CTXS                   CTXS         Stock    101.5   91.39
2    NFLX                   NFLX         Stock   -193.5  -174.1
3    SPY                    SPY          Stock    67.77   60.99
4    USO                    USO          Stock   -3.355  -3.019
5    VMW                    VMW          Stock    39.13   35.21
6    AAPL 18JUN11 350.0 C   AAPL         Option  -52.76  -47.48
7    AAPL 18JUN11 355.0 C   AAPL         Option   211.2   190.1
8    CTXS 18JUN11 87.5 C    CTXS         Option   65.17   58.65
9    CTXS 18JUN11 90.0 C    CTXS         Option  -33.16  -29.84
10   NFLX 18JUN11 270.0 C   NFLX         Option   80.38   72.34
11   NFLX 18JUN11 280.0 C   NFLX         Option   76.82   69.14
12   SPY 16APR11 132.0 C    SPY          Option   54.25   48.83
13   USO 16APR11 44.0 C     USO          Option  -122.4  -110.1
14   USO 16APR11 45.0 C     USO          Option   73.06   65.76
15   USO 18JUN11 40.0 C     USO          Option   50.96   45.86
16   VMW 18JUN11 100.0 C    VMW          Option   25.59   23.03
17   VMW 18JUN11 105.0 C    VMW          Option   91.11   82
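
(For the curious, the munging step I skipped is roughly of this shape; this is only a hedged sketch with a made-up 5-column layout, not IB's actual statement format:)

import numpy as np
from BeautifulSoup import BeautifulSoup  # 2011-era BeautifulSoup 3
from pandas import DataFrame

def parse_statement(path):
    soup = BeautifulSoup(open(path).read())
    rows = []
    for tr in soup.findAll('tr'):
        cells = [''.join(td.findAll(text=True)).strip()
                 for td in tr.findAll('td')]
        if len(cells) == 5:  # Symbol, Underlying, Kind, MTM, Realized
            rows.append(cells)
    df = DataFrame(np.array(rows, dtype=object),
                   columns=['Symbol', 'Underlying', 'Kind', 'MTM', 'Realized'])
    df['MTM'] = df['MTM'].astype(float)
    df['Realized'] = df['Realized'].astype(float)
    return df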

 

So, I want to aggregate the Realized and MTM columns by Underlying and by whether it was an equity or option instrument. Were this data in a SQL table, this operation could be expressed extremely concisely:

SELECT Underlying, Kind,
SUM(Realized) AS Realized, SUM(MTM) AS MTM
FROM MTMReport
GROUP BY Underlying, Kind
ORDER BY Underlying

But since I like doing interactive data analysis in Python (and loathe writing lots of SQL queries), I’d rather do this with pandas and also produce some easier-to-read console output. Here’s what I ended up with:

mtm = {}
realized = {}
for kind, group in data.groupby('Kind'):
    mtm_by_underlying = group['MTM'].groupby(group['Underlying'])
    realized_by_underlying = group['Realized'].groupby(group['Underlying'])
    mtm[kind] = mtm_by_underlying.sum()
    realized[kind] = realized_by_underlying.sum()

# Convert to DataFrame and fill NA values with 0
mtm = DataFrame(mtm).fillna(0)
realized = DataFrame(realized).fillna(0)

And it outputs exactly what I’m looking for:

In [77]: mtm
Out[77]:
       Option   Stock
AAPL   158.4    120.5
CTXS   32.01    101.5
NFLX   157.2   -193.5
SPY    54.25    67.77
USO    1.668   -3.355
VMW    116.7    39.13

In [79]: mtm.sum(1)
Out[79]:
AAPL    278.941528842
CTXS    133.551244568
NFLX    -36.2669453776
SPY     122.022253941
USO     -1.68704136356
VMW     155.829013419

It occurred to me after writing this code that, with some changes to the GroupBy functionality in pandas, this could be made *much easier*. Ideally I’d like to write:

grouped = frame.groupby('Underlying', 'Kind')
mtm = grouped['MTM'].sum()
realized = grouped['Realized'].sum()

I’ll get to work on this in the pandas repo.
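
(As it turns out, the multi-key groupby and column selection described in the GroupBy-fu post above cover exactly this case; a sketch of that spelling:)

grouped = data.groupby(['Underlying', 'Kind'])
mtm = grouped['MTM'].sum()
realized = grouped['Realized'].sum()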

Here’s the Gist of the full script