Tag Archives: pandas

Mastering high performance data algorithms I: Group By

I’m on my way back from R/Finance 2012. Those guys did a nice job of organizing the conference, and it was great to meet everyone there.

As part of pandas development, I have had to develop a suite of high performance data algorithms and implementation strategies which are the heart and soul of the library. I get asked a lot why pandas’s performance is much better than R and other data manipulation tools. The reasons are two-fold: careful implementation (in Cython and C, minimizing computational friction) and carefully-thought-out algorithms.

Here are some of the more important tools and ideas that I make use of on a day-to-day basis:

  • Hash tables: I use klib. It’s awesome and easy to use. I rely on it for critical algorithms like unique and factorize (unique + integer label assignment)
  • O(N) sort, known in pandas as groupsort, on integers with known range. This is a variant of counting sort; if you have N integers with a known range from 0 to K – 1, they can be sorted in O(N) time. Combining this tool with factorize (hash table-based), you can categorize and sort a large data set in linear time. Failure to understand these two algorithms will force you to pay O(N log N), dominating the runtime of your algorithm. (A small end-to-end sketch of these ideas appears near the end of this post.)
  • Vectorized data movement and subsetting routines: take, put, putmask, replace, etc.
  • Let me give you a prime example, from a commit yesterday, of applying these ideas to great effect. Suppose I had a time series (or DataFrame containing time series) that I want to group by year, month, and day:

    In [6]: rng = date_range('1/1/2000', periods=20, freq='4h')
    In [7]: ts = Series(np.random.randn(len(rng)), index=rng)
    In [8]: ts
    Out[8]:
    2000-01-01 00:00:00 -0.891761
    2000-01-01 04:00:00 0.204853
    2000-01-01 08:00:00 0.690581
    2000-01-01 12:00:00 0.454010
    2000-01-01 16:00:00 -0.123102
    2000-01-01 20:00:00 0.300111
    2000-01-02 00:00:00 -1.052215
    2000-01-02 04:00:00 0.094484
    2000-01-02 08:00:00 0.318417
    2000-01-02 12:00:00 0.779984
    2000-01-02 16:00:00 -1.514042
    2000-01-02 20:00:00 2.550011
    2000-01-03 00:00:00 0.983423
    2000-01-03 04:00:00 -0.710861
    2000-01-03 08:00:00 -1.350554
    2000-01-03 12:00:00 -0.464388
    2000-01-03 16:00:00 0.817372
    2000-01-03 20:00:00 1.057514
    2000-01-04 00:00:00 0.743033
    2000-01-04 04:00:00 0.925849
    Freq: 4H
    In [9]: by = lambda x: lambda y: getattr(y, x)
    In [10]: ts.groupby([by('year'), by('month'), by('day')]).mean()
    Out[10]:
    2000  1  1    0.105782
             2    0.196106
             3    0.055418
             4    0.834441

    This is a small dataset, but imagine you have millions of observations and thousands or even millions of groups. How does that look algorithmically? I guarantee if you take a naive approach, you will crash and burn when the data increases in size. I know, because I did just that (take a look at the vbenchmarks). Laying down the infrastructure for doing a better job is not simple. Here are the steps for efficiently aggregating data like this:

  • Factorize the group keys, computing integer label arrays (O(N) per key)
  • Compute a “cartesian product number” or group index for each observation (since you could theoretically have K_1 * … * K_p observed groups, where K_i is the number of unique values in key i). This is again O(N) work.
  • If the maximum number of groups is large, “compress” the group index by running the factorize algorithm on it again. Imagine you have 1000 uniques per key and 3 keys; most likely you do not actually observe 1 billion key combinations but rather some much smaller number. This is O(N) work again.
  • For simple aggregations, like mean, go ahead and aggregate the data in one linear sweep without moving anything around.
  • Otherwise, for arbitrary user-defined aggregations, we have two options for chopping up the data set: sorting it (O(N) using groupsort!) or creating index arrays indicating which observations belong to each group. The latter is preferable in many cases to sorting a large data set (reordering the data, even though linear time, can be very costly).
  • I worked on speeding up the latter part of this last bullet point yesterday. The resulting code looked like this:

    def _get_indices_dict(label_list, keys):
        # Accepts factorized labels and unique key values
        shape = [len(x) for x in keys]
        group_index = get_group_index(label_list, shape)  # Compute group index
        sorter, _ = lib.groupsort_indexer(com._ensure_int64(group_index),
                                          np.prod(shape))

        # Reorder key labels and group index
        sorted_labels = [lab.take(sorter) for lab in label_list]
        group_index = group_index.take(sorter)

        # Compute dict of {group tuple -> location NumPy array for group}
        index = np.arange(len(group_index)).take(sorter)
        return lib.indices_fast(index, group_index, keys, sorted_labels)

    The details of lib.indices_fast aren’t that interesting; it chops up np.arange(len(group_index)).take(sorter), the sorted indirect indices, to produce the index dictionary. Running %lprun to get a line profile on a large-ish data set:

    In [11]: rng = date_range('1/1/2000', '12/31/2005', freq='H')
    In [12]: year, month, day = rng.year, rng.month, rng.day
    In [13]: ts = Series(np.random.randn(len(rng)), index=rng)
    In [14]: lprun -f gp._get_indices_dict for i in xrange(100): ts.groupby([year, month, day]).indices
    Timer unit: 1e-06 s
    File: pandas/core/groupby.py
    Function: _get_indices_dict at line 1975
    Total time: 0.628506 s
    Line # Hits Time Per Hit % Time Line Contents
    ==============================================================
    1975 def _get_indices_dict(label_list, keys):
    1976 400 695 1.7 0.1 shape = [len(x) for x in keys]
    1977 100 114388 1143.9 18.2 group_index = get_group_index(label_list, shape)
    1978
    1979 100 320 3.2 0.1 sorter, _ = lib.groupsort_indexer(com._ensure_int64(group_index),
    1980 100 62007 620.1 9.9 np.prod(shape))
    1981
    1982 400 53530 133.8 8.5 sorted_labels = [lab.take(sorter) for lab in label_list]
    1983 100 19516 195.2 3.1 group_index = group_index.take(sorter)
    1984 100 20189 201.9 3.2 index = np.arange(len(group_index)).take(sorter)
    1985
    1986 100 357861 3578.6 56.9 return lib.indices_fast(index, group_index, keys, sorted_labels)
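
    For intuition, here is roughly what that final, dominant step is doing, written out in plain Python. This is only a sketch of the idea behind lib.indices_fast, under my own naming, not the actual implementation:

    def indices_dict_sketch(index, group_index, keys, sorted_labels):
        # index: original row positions, already reordered by the sorter
        # group_index: sorted group ids, so equal ids are now contiguous
        # keys / sorted_labels: unique key values and reordered label arrays
        result = {}
        start = 0
        for i in range(1, len(group_index) + 1):
            # a group ends where the group id changes (or at the end of the data)
            if i == len(group_index) or group_index[i] != group_index[i - 1]:
                # recover the key tuple for this group from the sorted label arrays
                group_key = tuple(keys[k][sorted_labels[k][start]]
                                  for k in range(len(keys)))
                result[group_key] = index[start:i]
                start = i
        return result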

    You might say, well, this seems like a lot of work; maybe we should just zip the keys (forming an array of Python tuples) and use a dumb algorithm? Being careful in this way and working with indirect integer index arrays ends up being an order of magnitude or more faster.
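
    To make the factorize / group index / groupsort ideas above concrete, here is a small end-to-end sketch in plain NumPy and Python. It illustrates the technique only; the function names are mine and this is not the pandas internals:

    import numpy as np

    def factorize_sketch(values):
        # hash-table factorize: unique values plus integer labels, O(N)
        table, uniques = {}, []
        labels = np.empty(len(values), dtype=np.int64)
        for i, val in enumerate(values):
            if val not in table:
                table[val] = len(uniques)
                uniques.append(val)
            labels[i] = table[val]
        return np.asarray(uniques, dtype=object), labels

    def groupsort_indexer_sketch(group_index, ngroups):
        # counting sort: O(N + ngroups), stable; returns row positions ordered by group
        counts = np.bincount(group_index, minlength=ngroups)
        where = np.zeros(ngroups, dtype=np.int64)
        where[1:] = np.cumsum(counts)[:-1]        # starting offset of each group
        indexer = np.empty(len(group_index), dtype=np.int64)
        for i, g in enumerate(group_index):
            indexer[where[g]] = i
            where[g] += 1
        return indexer, counts

    # two string keys and a value column
    key1 = np.array(['a', 'b', 'a', 'b', 'a'], dtype=object)
    key2 = np.array(['x', 'x', 'y', 'x', 'x'], dtype=object)
    values = np.arange(5.0)

    uniq1, lab1 = factorize_sketch(key1)
    uniq2, lab2 = factorize_sketch(key2)

    # "cartesian product number": one integer id per possible key combination
    group_index = lab1 * len(uniq2) + lab2

    # compress: factorize the group index itself so the ids are dense 0..ngroups-1
    _, comp_ids = factorize_sketch(group_index)
    ngroups = comp_ids.max() + 1

    # simple aggregations (like group means) need no reordering at all
    group_means = np.bincount(comp_ids, weights=values) / np.bincount(comp_ids)

    # for everything else, groupsort makes each group's rows contiguous
    sorter, counts = groupsort_indexer_sketch(comp_ids, ngroups)
    sorted_values = values.take(sorter)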

    Anyway, in conclusion, it’s these kinds of algorithms and ideas that make pandas perhaps the best-performing open-source data analysis toolkit for in-memory data (I’m going to get to out-of-core data processing and “big data” eventually, just hang tight). It goes beyond language features and data structure internals (though these naturally also have a major impact; a lot of the things I do are easy to express in Python but would be very awkward or impossible to do in, say, R, which maybe deserves a whole article of its own); carefully thought-out algorithms are a major piece of the puzzle.

    Contingency tables and cross-tabulations in pandas

    Someone recently asked me about creating cross-tabulations and contingency tables using pandas. I wrote a bit about this in October after implementing the pivot_table function for DataFrame, so I thought I would give a few more examples and show R code alongside the equivalent pandas code, which will be helpful for those already used to the R way of doing things.

    When calling help(xtabs) (or help(table)) in R, you get a number of examples for using table, xtabs, and ftable:

    > head(esoph)
    agegp alcgp tobgp ncases ncontrols
    1 25-34 0-39g/day 0-9g/day 0 40
    2 25-34 0-39g/day 10-19 0 10
    3 25-34 0-39g/day 20-29 0 6
    4 25-34 0-39g/day 30+ 0 5
    5 25-34 40-79 0-9g/day 0 27
    6 25-34 40-79 10-19 0 7
    > ftable(xtabs(cbind(ncases, ncontrols) ~ agegp + tobgp, data = esoph))
    ncases ncontrols
    agegp tobgp
    25-34 0-9g/day 0 70
    10-19 1 19
    20-29 0 11
    30+ 0 16
    35-44 0-9g/day 2 109
    10-19 4 46
    20-29 3 27
    30+ 0 17
    45-54 0-9g/day 14 104
    10-19 13 57
    20-29 8 33
    30+ 11 19
    55-64 0-9g/day 25 117
    10-19 23 65
    20-29 12 38
    30+ 16 22
    65-74 0-9g/day 31 99
    10-19 12 38
    20-29 10 20
    30+ 2 4
    75+ 0-9g/day 6 26
    10-19 5 11
    20-29 0 3
    30+ 2 4
    > ftable(xtabs(cbind(ncases, ncontrols) ~ agegp, data = esoph))
    ncases ncontrols
    agegp
    25-34 1 116
    35-44 9 199
    45-54 46 213
    55-64 76 242
    65-74 55 161
    75+ 13 44

    Here are the same examples using pandas:

    In [19]: table = esoph.pivot_table(['ncases', 'ncontrols'],
       ....:                           rows=['agegp', 'tobgp'],
       ....:                           aggfunc=np.sum)
    In [20]: table
    Out[20]:
    ncases ncontrols
    agegp tobgp
    25-34 0-9g/day 0. 70.
    10-19 1. 19.
    20-29 0. 11.
    30+ 0. 16.
    35-44 0-9g/day 2. 109
    10-19 4. 46.
    20-29 3. 27.
    30+ 0. 17.
    45-54 0-9g/day 14 104
    10-19 13 57.
    20-29 8. 33.
    30+ 11 19.
    55-64 0-9g/day 25 117
    10-19 23 65.
    20-29 12 38.
    30+ 16 22.
    65-74 0-9g/day 31 99.
    10-19 12 38.
    20-29 10 20.
    30+ 2. 4.0
    75+ 0-9g/day 6. 26.
    10-19 5. 11.
    20-29 0. 3.0
    30+ 2. 4.0
    In [22]: table2 = esoph.pivot_table(['ncases', 'ncontrols'],
       ....:                            rows='agegp', aggfunc=np.sum)
    In [23]: table2
    Out[23]:
    ncases ncontrols
    agegp
    25-34 1. 116
    35-44 9. 199
    45-54 46 213
    55-64 76 242
    65-74 55 161
    75+ 13 44.

    One of the really great things about pandas is that the object produced is still a DataFrame (the R object is of a special class)! So we can do anything with it that we do with normal DataFrame objects:

    In [26]: table['ncases'].unstack('agegp')
    Out[26]:
    agegp 25-34 35-44 45-54 55-64 65-74 75+
    tobgp
    0-9g/day 0 2 14 25 31 6
    10-19 1 4 13 23 12 5
    20-29 0 3 8. 12 10 0
    30+ 0 0 11 16 2. 2

    Here is another fun example:

    In [31]: import pandas.rpy.common as com
    In [32]: wp = com.load_data('warpbreaks')
    In [33]: wp['replicate'] = np.tile(range(1, 10), 6)
    In [34]: wp.pivot_table('breaks', rows=['wool', 'tension'],
       ....:                cols='replicate', aggfunc=np.sum)
    Out[34]:
    replicate 1 2 3 4 5 6 7 8 9
    wool tension
    A H 36 21 24 18 10 43 28 15 26
    L 26 30 54 25 70 52 51 26 67
    M 18 21 29 17 12 18 35 30 36
    B H 20 21 24 17 13 15 15 16 28
    L 27 14 29 19 29 31 41 20 44
    M 42 26 19 16 39 28 21 39 29

    Here is the equivalent R code:

    > warpbreaks$replicate <- rep(1:9, len = 54)
    > ftable(xtabs(breaks ~ wool + tension + replicate, data = warpbreaks))
    replicate 1 2 3 4 5 6 7 8 9
    wool tension
    A L 26 30 54 25 70 52 51 26 67
    M 18 21 29 17 12 18 35 30 36
    H 36 21 24 18 10 43 28 15 26
    B L 27 14 29 19 29 31 41 20 44
    M 42 26 19 16 39 28 21 39 29
    H 20 21 24 17 13 15 15 16 28

    As soon as we add a formula framework to statsmodels I could see incorporating that into this kind of function in Python.

    NYCPython 1/10/2012: A look inside pandas design and development

    I had the privilege of speaking last night at the NYCPython meetup group. I’ve given tons of “use pandas!” talks so I thought I would take a slightly different angle and talk about some of the design and implementation work that I’ve done for getting good performance in critical data manipulations. I’ll turn some of this material into some blog articles in the near future.

    Wes McKinney: pandas design and development from Adam Klein on Vimeo.

    Here’s some more video footage shot by my awesome friend Emily Paup with her HDSLR.

    I did a little interactive demo (using the ever-amazing IPython HTML Notebook) on Ashley Williams’s Food Nutrient JSON Database:

    Link to PDF output of Demo

    Link to IPython Notebook file

    If you want to run the code in the IPython notebook, you’ll have to download the food database file above.

    The audience and I learned from this demo that if you’re after Tryptophan, “Sea lion, Steller, meat with fat (Alaska Native)” is where to get it (in the highest density).

    Some pandas Database Join (merge) Benchmarks vs. R base::merge

    Over the last week I have completely retooled pandas’s “database” join infrastructure / algorithms in order to support the full gamut of SQL-style many-to-many merges (pandas has had one-to-one and many-to-one joins for a long time). I was curious about the performance with reasonably large data sets as compared with base::merge.data.frame, which I’ve used many times in R. So I just ran a little benchmark joining a 100000-row DataFrame with a 10000-row DataFrame on two keys, with about 10000 unique key combinations overall, simulating a somewhat large SQL join.

    Note this new functionality will be shipping in the upcoming 0.7.0 release (!).
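
    To illustrate what many-to-many means here, a toy example (not part of the benchmark below): when a key value appears multiple times on both sides, the result contains the cartesian product of the matching rows, just as in SQL:

    from pandas import DataFrame, merge

    left = DataFrame({'key': ['a', 'a', 'b'], 'lval': [1, 2, 3]})
    right = DataFrame({'key': ['a', 'a', 'b'], 'rval': [10, 20, 30]})

    # 'a' matches 2 x 2 rows and 'b' matches 1 x 1, so the inner join has 5 rows
    result = merge(left, right, on='key', how='inner')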

    There is a major factor affecting the performance of the algorithms in pandas, namely whether the result data needs to be sorted lexicographically (which is the default behavior) by the join columns. R offers the same option, so the comparison is completely apples to apples. These are mean runtimes in seconds:


    Sort by key columns
            pandas       R    Ratio
    inner  0.07819  0.2286    2.924
    outer  0.09013  1.284    14.25
    left   0.0853   0.7766    9.104
    right  0.08112  0.3371    4.156

    Don’t sort by key columns
            pandas       R    Ratio
    inner  0.02808  0.2297    8.18
    outer  0.03879  1.181    30.45
    left   0.0365   0.6706   18.37
    right  0.03021  0.2995    9.912

    As you can see, the sorting time in pandas completely dominates the runtime of the join (sorting 10000 strings is a lot of string comparisons). This isn’t really an indictment of R’s merge algorithm, but rather speaks to the strength of the merge strategy I devised in pandas. After spending a few days working on this problem I definitely think R could do a lot better. I haven’t run benchmarks against SQLite or another SQL database yet; that will happen eventually.

    I’m going to write a blog article in the next few days going into algorithmic detail about how I got merging pretty big data sets to be so fast–it was not easy!

    Here is the R code.

    N <- 10000
    indices = rep(NA, N)
    for (i in 1:N)
      indices[i] <- paste(sample(letters, 10), collapse="")

    left <- data.frame(key=rep(indices, 10),
                       key2=sample(rep(indices, 10)),
                       value=rnorm(100000))
    right <- data.frame(key=indices,
                        key2=sample(indices),
                        value2=rnorm(10000))
    timeit <- function(func, niter=10) {
      timing = rep(NA, niter)
      for (i in 1:niter) {
        gc()
        timing[i] <- system.time(func())[3]
      }
      mean(timing)
    }

    left.join <- function(sort=TRUE) {
      result <- merge(left, right, all.x=TRUE, sort=sort)
    }
    right.join <- function(sort=TRUE) {
      result <- merge(left, right, all.y=TRUE, sort=sort)
    }
    outer.join <- function(sort=TRUE) {
      result <- merge(left, right, all=TRUE, sort=sort)
    }
    inner.join <- function(sort=TRUE) {
      result <- merge(left, right, sort=sort)
    }

    sort.options <- c(FALSE, TRUE)

    results <- matrix(nrow=4, ncol=2)
    colnames(results) <- c("dont_sort", "sort")
    rownames(results) <- c("inner", "outer", "left", "right")

    join.functions <- c(inner.join, outer.join, left.join, right.join)
    for (i in 1:4) {
      results[i, 1] <- timeit(function() join.functions[[i]](sort=sort.options[1]))
      results[i, 2] <- timeit(function() join.functions[[i]](sort=sort.options[2]))
    }

    And the Python code

    from collections import defaultdict
    import gc
    import random
    import time

    import numpy as np
    from pandas import DataFrame, merge, read_table
    from pandas.util.testing import rands

    N = 10000
    indices = np.array([rands(10) for _ in xrange(N)], dtype='O')

    key = np.tile(indices, 10)
    key2 = key.copy()
    random.shuffle(key2)
    indices2 = indices.copy()
    random.shuffle(indices2)
    left = DataFrame({'key' : key, 'key2':key2,
                      'value' : np.random.randn(100000)})
    right = DataFrame({'key': indices, 'key2':indices2,
                       'value2' : np.random.randn(10000)})
    join_methods = ['inner', 'outer', 'left', 'right']
    results = DataFrame(index=join_methods, columns=[False, True])
    niter = 10
    for sort in [False, True]:
        for join_method in join_methods:
            f = lambda: merge(left, right, how=join_method, sort=sort)
            gc.disable()
            start = time.time()
            for _ in xrange(niter):
                f()
            elapsed = (time.time() - start) / niter
            gc.enable()
            results[sort][join_method] = elapsed
    results.columns = ['dont_sort', 'sort']

    # R results
    from StringIO import StringIO
    r_results = read_table(StringIO("""dont_sort   sort
    inner    0.2297 0.2286
    outer    1.1811 1.2843
    left     0.6706 0.7766
    right    0.2995 0.3371
    """
    ), sep='\s+')

    sort_results = DataFrame.from_items([('pandas', results['sort']),
                                         ('R', r_results['sort'])])
    sort_results['Ratio'] = sort_results['R'] / sort_results['pandas']

    nosort_results = DataFrame.from_items([('pandas', results['dont_sort']),
                                           ('R', r_results['dont_sort'])])
    nosort_results['Ratio'] = nosort_results['R'] / nosort_results['pandas']

    PyHPC 2011 Pre-print paper on pandas

    I’ll be presenting this paper in a couple of weeks at SC11. It’s the most up-to-date white paper about pandas.

    PDF Download Link

    pandas: a Foundational Python Library for Data Analysis and Statistics

    Fast and easy pivot tables in pandas 0.5.0

    Over the last several months, I’ve invested a great deal in the GroupBy and indexing infrastructure of pandas. GroupBy in particular could still use more work to be made even higher performance, but even for quite large data sets, most users will find it more than satisfactory compared with other Python alternatives (which are mainly DIY approaches these days) or even in other data analysis packages, e.g. those in R, which I’ll talk a bit about in this article.

    An easy application of GroupBy is creating pivot tables from a tabular data set. This involves aggregating a data set on some number of keys, then reshaping it to form a 2D table with the stratified aggregated values. I added a new function, pivot_table, which provides a high level interface for creating pivot tables. Let’s consider the tips data set used in some examples in Hadley Wickham’s excellent R reshape2 package:

    In [1]: import pandas.rpy.common as com

    In [2]: tips = com.load_data('tips', package='reshape2')

    In [3]: tips['tip_pct'] = tips['tip'] / tips['total_bill']

    In [4]: tips.head()
    Out[4]:
       total_bill  tip   sex     smoker  day  time    size  tip_pct
    1  16.99       1.01  Female  No      Sun  Dinner  2     0.05945
    2  10.34       1.66  Male    No      Sun  Dinner  3     0.1605
    3  21.01       3.5   Male    No      Sun  Dinner  3     0.1666
    4  23.68       3.31  Male    No      Sun  Dinner  2     0.1398
    5  24.59       3.61  Female  No      Sun  Dinner  4     0.1468

    Since most readers won’t have rpy2, you can get the data file here and read it with:

    tips = read_csv('tips.csv')

    All of this new functionality is available now on GitHub but will be a part of the upcoming pandas 0.5.0 release. Note that I am only using rpy2 here to pull the R data set into my Python session; all of the below code is implemented in pure Python (with a little Cython magic to make things fast).

    Pivot table basics

    To use pivot_table, pass the pandas DataFrame and specify the column you wish to aggregate and the columns to group on the rows and columns of the table. Note the default aggregation is mean, though this can be specified:

    In [9]: table = pivot_table(tips, 'tip_pct', rows=['time', 'sex'],
                                cols='smoker')

    In [10]: table
    Out[10]:
      smoker       No      Yes  
    time   sex                  
    Dinner Female  0.1568  0.1851
           Male    0.1594  0.1489
    Lunch  Female  0.1571  0.1753
           Male    0.1657  0.1667

    In [17]: pivot_table(tips, 'tip_pct', rows=['day', 'time'],
                         cols='sex')
    Out[17]:
      sex        Female  Male  
    day  time                  
    Fri  Dinner  0.1991  0.1302
         Lunch   0.1997  0.1741
    Sat  Dinner  0.1565  0.1516
    Sun  Dinner  0.1816  0.1623
    Thur Dinner  0.1597  NaN  
         Lunch   0.1575  0.1653

    Note that the returned object is still a DataFrame, so you can use all of the standard indexing technology to select portions of the resulting table:

    In [22]: table.ix['Dinner']
    Out[22]:
      smoker  No      Yes  
    sex                    
    Female    0.1568  0.1851
    Male      0.1594  0.1489

    If you don’t specify a particular values column, it will try to aggregate all the other columns:

    In [24]: pivot_table(tips, rows=['sex', 'smoker'])
    Out[24]:
                   size   tip    tip_pct  total_bill
    sex    smoker                                  
    Female No      2.593  2.774  0.1569   18.11    
           Yes     2.242  2.932  0.1822   17.98    
    Male   No      2.711  3.113  0.1607   19.79    
           Yes     2.5    3.051  0.1528   22.28

    Other aggregations can be performed, like using len to get group counts:

    In [26]: pivot_table(tips, 'tip_pct', rows='sex', cols='smoker', aggfunc=len)
    Out[26]:
      smoker  No  Yes
    sex              
    Female    54  33
    Male      97  60

    Once you’ve done the aggregation, you’re free to use the built-in reshaping capability to reshape the data in the table:

    In [29]: table = pivot_table(tips, 'tip_pct', rows=['sex', 'day'],
                                 cols='smoker', aggfunc=len)

    In [30]: table
    Out[30]:
      smoker     No  Yes
    sex    day          
    Female Fri   2   7  
           Sat   13  15
           Sun   14  4  
           Thur  25  7  
    Male   Fri   2   8  
           Sat   32  27
           Sun   43  15
           Thur  20  10

    In [31]: table.unstack('sex')
    Out[31]:
      smoker  No            Yes        
      sex     Female  Male  Female  Male
    day                                
    Fri       2       2     7       8  
    Sat       13      32    15      27  
    Sun       14      43    4       15  
    Thur      25      20    7       10

    If you know a bit more about groupby, you can get clever and pass a dict of functions to perform multiple aggregations at once:

    In [59]: pivot_table(tips, rows=['sex', 'smoker'],
                         aggfunc={'tip_pct' : 'mean', 'size' : 'sum'})
    Out[59]:
                   size  tip_pct
    sex    smoker              
    Female No      140   0.1569
           Yes     74    0.1822
    Male   No      263   0.1607
           Yes     150   0.1528

    I thought that was pretty cool.

    Comparison with reshape2

    R’s reshape2 package relies on two functions, melt and cast, to do the reshaping. Its implementation is, at least in my opinion, a slightly more ad hoc approach than in pandas (the actual pivot_table function is only about 10 lines and basically falls out of the groupby and hierarchical-indexing based reshaping). However, it can take advantage of R’s built-in formulas which makes the syntax very intuitive. Here are some of the same examples using reshape2:

    > tips$tip.pct <- tips$tip / tips$total_bill
    > mtips = melt(tips, id=c("sex", "smoker", "day", "time"))
    > acast(mtips, day ~ sex + smoker, length, subset=.(variable=="tip"))
         Female_No Female_Yes Male_No Male_Yes
    Fri          2          7       2        8
    Sat         13         15      32       27
    Sun         14          4      43       15
    Thur        25          7      20       10

    > acast(mtips, sex + smoker ~ variable, mean)
               total_bill      tip     size   tip.pct
    Female_No    18.10519 2.773519 2.592593 0.1569210
    Female_Yes   17.97788 2.931515 2.242424 0.1821504
    Male_No      19.79124 3.113402 2.711340 0.1606687
    Male_Yes     22.28450 3.051167 2.500000 0.1527712

    R lacks hierarchical indexing as far as I know, so reshape2 munges the group labels together into a row label.
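
    As an aside, here is a rough sketch of how a pivot table falls out of groupby plus hierarchical-index reshaping, as mentioned above. This illustrates the idea only; pivot_table_sketch is my name, not the actual pandas implementation:

    import numpy as np
    from pandas import DataFrame

    def pivot_table_sketch(df, values, rows, cols, aggfunc=np.mean):
        # aggregate on all of the keys, then pivot the column key into the columns
        grouped = df.groupby(rows + [cols])[values].agg(aggfunc)
        return grouped.unstack(cols)

    frame = DataFrame({'sex': ['Female', 'Female', 'Male', 'Male'],
                       'smoker': ['No', 'Yes', 'No', 'Yes'],
                       'tip_pct': [0.16, 0.18, 0.16, 0.15]})
    table = pivot_table_sketch(frame, 'tip_pct', rows=['sex'], cols='smoker')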

    In R there is also the xtabs function for doing cross-tabulation:

    > ftable(xtabs(size ~ time + sex + smoker + day, data=tips))
                         day Fri Sat Sun Thur
    time   sex    smoker                    
    Dinner Female No           2  30  43    2
                  Yes          8  33  10    0
           Male   No           4  85 124    0
                  Yes         12  71  39    0
    Lunch  Female No           3   0   0   60
                  Yes          6   0   0   17
           Male   No           0   0   0   50
                  Yes          5   0   0   23

    ftable just creates a flat table which can be more easily viewed. This table could be created with pandas by:

    In [54]: pivot_table(tips, 'size', rows=['time', 'sex', 'smoker'],
       ....:             cols='day', aggfunc=np.sum, fill_value=0)
    Out[54]:
      day                 Fri  Sat  Sun  Thur
    time   sex    smoker                    
    Dinner Female No      2    30   43   2  
                  Yes     8    33   10   0  
    Dinner Male   No      4    85   124  0  
                  Yes     12   71   39   0  
    Lunch  Female No      3    0    0    60  
                  Yes     6    0    0    17  
    Lunch  Male   No      0    0    0    50  
                  Yes     5    0    0    23

    Performance

    pandas is very fast, as I’ve invested a great deal of effort in optimizing the indexing infrastructure and other core algorithms related to operations such as this. I don’t have a lot of points of comparison, but here is a simple benchmark of reshape2 versus pandas.pivot_table on a data set with 100000 entries and 25 groups. Here is the R code for the benchmark:

    library(reshape2)
    n <- 100000
    a.size <- 5
    b.size <- 5
    data <- data.frame(a=sample(letters[1:a.size], n, replace=T),
                       b=sample(letters[1:b.size], n, replace=T),
                       c=rnorm(n),
                       d=rnorm(n))
    timings <- numeric()

    for (i in 1:10) {
      gc()
      tim <- system.time(acast(melt(data, id=c("a", "b")), a + b ~ variable, mean))
      timings[i] = tim[3]
    }
    mean(timings)

    Here is what the actual table looks like:

    > table <- acast(melt(data, id=c("a", "b")), a + b ~ variable, mean)
    > head(table)
                   c            d
    a_a  0.002761963 -0.008793516
    a_b -0.004618980 -0.026978744
    a_c  0.022491767 -0.002163986
    a_d  0.005027362 -0.002532782
    a_e  0.006146438  0.031699238
    b_a -0.025094899  0.003334301

    Here’s the equivalent Python code (sans timings, which I’ll do in IPython):

    from pandas import *
    import string
    n = 100000
    asize = 5
    bsize = 5

    letters = np.asarray(list(string.letters), dtype=object)

    data = DataFrame(dict(foo=letters[:asize][np.random.randint(0, asize, n)],
                          bar=letters[:bsize][np.random.randint(0, bsize, n)],
                          baz=np.random.randn(n),
                          qux=np.random.randn(n)))

    table = pivot_table(data, rows=['foo', 'bar'])

    Similarly:

    In [36]: table = pivot_table(data, rows=['foo', 'bar'])

    In [37]: table[:10]
    Out[37]:
             baz       qux      
    foo bar                    
    a   a   -0.02162   0.02186  
        b   -0.002155  0.01173  
        c   -0.002331  0.006608
        d   -0.01734   0.01109  
        e    0.02837   0.01093  
    b   a    0.01703  -0.005845
        b   -0.001043  0.01283  
        c   -0.006015 -0.0005866
        d    0.009683  0.02211  
        e    0.02347  -8.346e-05

    And the R timing

    > mean(timings)
    [1] 0.42

    And pandas:

    In [38]: timeit table = pivot_table(data, rows=['foo', 'bar'])
    10 loops, best of 3: 117 ms per loop

    So it’s about 3-4x faster than reshape2. This obviously depends on the data set and how you structure the aggregation. Here are a couple of slightly different benchmarks on the same data:

    > acast(melt(data, id=c("a", "b")), a~b, mean, subset=.(variable=="c"))
                 a            b            c           d            e
    a  0.002761963 -0.004618980  0.022491767 0.005027362  0.006146438
    b -0.025094899  0.015966167 -0.007663059 0.006795551  0.006648496
    c  0.012273797  0.006139820 -0.019169968 0.009035374 -0.006079683
    d -0.014174747 -0.009844566  0.022450738 0.025967638 -0.006572791
    e -0.003623382  0.005924704  0.008156482 0.013288116  0.009839315

    Timing
    > mean(timings)
    [1] 0.3036

    and pandas:

    In [41]: pivot_table(data, 'baz', rows='foo', cols='bar')
    Out[41]:
      bar  a         b          c         d         e      
    foo                                                    
    a     -0.02162  -0.002155  -0.002331 -0.01734   0.02837
    b      0.01703  -0.001043  -0.006015  0.009683  0.02347
    c      0.01561   0.0003992  0.01451  -0.01046  -0.00368
    d      0.003504 -0.01437   -0.03265   0.003471 -0.004638
    e     -0.01656  -0.01276   -0.02481   0.008629  0.011  

    In [42]: timeit pivot_table(data, 'baz', rows='foo', cols='bar')
    10 loops, best of 3: 55 ms per loop

    I suspect one source of slowdown in the reshape/reshape2 approach is that the melt operation is very expensive, as is variable subsetting. However, it enables easier computation of group margins, something which I have not yet addressed (but will, eventually) in pandas.

    R’s factor function is slow

    I’m not making this up:

    > arr <- rep(letters, 1000)
    > system.time(for (i in 1:1000) foo <- factor(arr))
    user system elapsed
    1.809 0.023 1.821

    > arr <- 1:10000
    > system.time(for (i in 1:1000) foo <- factor(arr))
    user system elapsed
    7.782 0.088 8.261

    Here’s the equivalent code using the pandas Factor class (git HEAD):

    In [35]: import string
    In [36]: arr = np.tile(list(string.letters[:26]), 1000).astype(object)
    In [37]: Factor(arr)
    Out[37]:
    Factor:
    array([a, b, c, ..., x, y, z], dtype=object)
    Levels (26): [a b c d e f g h i j k l m n o p q r s t u v w x y z]
    In [38]: timeit Factor(arr)
    1000 loops, best of 3: 869 us per loop

    In [39]: arr = np.arange(10000, dtype=object)
    In [40]: timeit Factor(arr)
    1000 loops, best of 3: 1.91 ms per loop

    I just edited this post, since R 2.11.1 is a lot slower than 2.13.1 (the latest release) and it wasn’t quite fair to base comparisons on the older version. So there are two cases here:

    • Few unique values (the 26 letters of the alphabet). R does pretty well here, about 2x slower. This is down from 12x slower in R 2.11.1, so I can only assume someone on the R core team noticed this in the last year or so.
    • All unique values. R is about 4x slower now.

    It’s possible I’m using a different algorithm (and so is R between 2.11 and 2.13!). Maybe Python’s much lauded dict implementation is responsible? I’m more curious than anything. Maybe an R guru could let me know (I’m afraid to look at the GPL source code).

    For the record, here’s the Python and Cython code involved. First, the Python code:

    def unique_with_labels(values):
        uniques = Index(_tseries.fast_unique(values))
        labels = _tseries.get_unique_labels(values, uniques.indexMap)
        return uniques, labels

    And the two Cython functions

    @cython.wraparound(False)
    @cython.boundscheck(False)
    def fast_unique(ndarray[object] values):
        cdef:
            Py_ssize_t i, n = len(values)
            list uniques = []
            dict table = {}
            object val, stub = 0

        for i from 0 <= i < n:
            val = values[i]
            if val not in table:
                table[val] = stub
                uniques.append(val)
        try:
            uniques.sort()
        except Exception:
            pass

        return np.asarray(uniques, dtype=object)

    @cython.wraparound(False)
    @cython.boundscheck(False)
    def get_unique_labels(ndarray[object] values, dict idMap):
        cdef int i, length
        cdef object idx
        cdef ndarray[int32_t] labels
        length = len(values)
        labels = np.empty(length, dtype=np.int32)
        for i from 0 <= i < length:
            idx = values[i]
            labels[i] = idMap[idx]

        return labels

    Slide deck: What’s new in pandas / SciPy stack for financial users

    GroupBy-fu: improvements in grouping and aggregating data in pandas

    A couple of weeks ago in my inaugural blog post I wrote about the state of GroupBy in pandas and gave an example application. However, I was dissatisfied with the limited expressiveness (see the end of the article), so over the last two weeks I invested some serious time in beefing up what you can do with the groupby functionality in pandas. So this article is part show-and-tell, part quick tutorial on the new features. Note that I haven’t added a lot of this to the official documentation yet.

    GroupBy primer

    GroupBy may be one of the least well-understood features in pandas. Let’s consider a DataFrame (basically a labeled 2D array):

    In [18]: df
    Out[18]:
                          A         B        C         D
    2000-03-31 00:00:00   0.6304   -1.493    0.2082    0.3378
    2000-06-30 00:00:00  -0.1041    1.093    0.1227   -0.2362
    2000-09-29 00:00:00  -1.822     0.7325  -0.4964   -1.707
    2000-12-29 00:00:00   0.3325   -0.2213   0.5465    1.265
    2001-03-30 00:00:00   0.8422   -0.1265   0.1234    0.7743
    2001-06-29 00:00:00  -0.07774  -0.3906  -0.6328   -1.259
    2001-09-28 00:00:00  -0.3417   -0.4209   1.724     0.3069
    2001-12-31 00:00:00   0.493    -0.2815   1.314     1.004
    2002-03-29 00:00:00   1.003    -0.4807   0.7232    1.079
    2002-06-28 00:00:00   0.1415   -2.275    0.6178    0.8817
    2002-09-30 00:00:00  -0.6855    0.3921  -1.468     1.522
    2002-12-31 00:00:00  -1.635     0.3083   0.1209   -0.5606
    2003-03-31 00:00:00   0.5707   -0.7125  -0.07677   0.5154
    2003-06-30 00:00:00   0.3646   -0.8574  -0.562     0.2176
    2003-09-30 00:00:00   1.024    -0.8113   0.6771   -1.261

    Here, the index (row labels) contains dates and the columns are names for each time series. When performing a groupby operation, we may have different goals:

    • Perform an aggregation, like computing the sum or mean of each group. Functionally this means applying a function to each group and putting the aggregated results into a DataFrame
    • Slice the DataFrame into chunks (groups) and then do something with the slices
    • Perform a transformation, like standardizing each group (computing the z-score)

    So there are two tasks: first, grouping the data; second, doing something with the grouped data. As far as grouping goes, in a totally abstract sense we have to define some kind of mapping that assigns labels (along one of the axes) to group buckets. A mapping could be one of many things:

    • A Python function, to be called on each label
    • A dict, containing {label : group_name} mappings
    • An array the same length as the axis containing the group correspondences. This may often be one of the columns in a DataFrame, but need not be. Every other mapping type gets rendered to this form eventually. (All three forms are sketched just below.)
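
    Here is a minimal sketch of the three mapping forms on a tiny, made-up date-indexed Series (the data and group names are only for illustration):

    import numpy as np
    from pandas import Series, date_range

    s = Series(np.random.randn(6), index=date_range('1/1/2000', periods=6))

    # 1) a Python function, called on each index label
    s.groupby(lambda ts: ts.dayofweek).mean()

    # 2) a dict of {label : group_name}
    mapping = dict((ts, 'weekend' if ts.dayofweek >= 5 else 'weekday')
                   for ts in s.index)
    s.groupby(mapping).mean()

    # 3) an array the same length as the axis giving the group correspondences
    s.groupby(np.array(['a', 'a', 'b', 'b', 'a', 'b'])).mean()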

    So in the above data, let’s say we wanted to group the data by year. Since the row labels are Python datetime objects, we can access the year attribute when calling groupby

    In [21]: grouped = df.groupby(lambda x: x.year)

    This returns an object of type GroupBy. With this object, you can do a lot of things, including iteration, aggregation, and transformation. So we could iterate like so:

    In [20]: for year, group in grouped:
       ....:     print year
       ....:     print group
       ....:
    2000
                          A        B        C        D
    2000-03-31 00:00:00   0.6304  -1.493    0.2082   0.3378
    2000-06-30 00:00:00  -0.1041   1.093    0.1227  -0.2362
    2000-09-29 00:00:00  -1.822    0.7325  -0.4964  -1.707
    2000-12-29 00:00:00   0.3325  -0.2213   0.5465   1.265

    2001
                          A         B        C        D
    2001-03-30 00:00:00   0.8422   -0.1265   0.1234   0.7743
    2001-06-29 00:00:00  -0.07774  -0.3906  -0.6328  -1.259
    2001-09-28 00:00:00  -0.3417   -0.4209   1.724    0.3069
    2001-12-31 00:00:00   0.493    -0.2815   1.314    1.004

    2002
                          A        B        C        D
    2002-03-29 00:00:00   1.003   -0.4807   0.7232   1.079
    2002-06-28 00:00:00   0.1415  -2.275    0.6178   0.8817
    2002-09-30 00:00:00  -0.6855   0.3921  -1.468    1.522
    2002-12-31 00:00:00  -1.635    0.3083   0.1209  -0.5606

    2003
                          A        B        C         D
    2003-03-31 00:00:00   0.5707  -0.7125  -0.07677   0.5154
    2003-06-30 00:00:00   0.3646  -0.8574  -0.562     0.2176
    2003-09-30 00:00:00   1.024   -0.8113   0.6771   -1.261

    Or aggregate like so:

    In [21]: grouped.aggregate(np.sum)
    Out[21]:
           A        B        C          D
    2000  -0.9629   0.1108   0.3811    -0.3405
    2001   0.9159  -1.219    2.529      0.8268
    2002  -1.176   -2.055   -0.006253   2.921
    2003   1.959   -2.381    0.03838   -0.528

    Or transform groups like so (here I standardize each group):

    In [22]: zscore = lambda x: (x - x.mean()) / x.std()

    In [23]: transformed = grouped.transform(zscore)
    Out[23]:
                          A        B         C         D
    2000-03-31 00:00:00   0.7946  -1.316     0.2598    0.3396
    2000-06-30 00:00:00   0.1246   0.9216    0.06319  -0.1213
    2000-09-29 00:00:00  -1.442    0.61     -1.361    -1.302
    2000-12-29 00:00:00   0.5228  -0.2155    1.038     1.084
    2001-03-30 00:00:00   1.142    1.34     -0.47      0.5569
    2001-06-29 00:00:00  -0.571   -0.644    -1.168    -1.438
    2001-09-28 00:00:00  -1.062   -0.8715    1.009     0.0983
    2001-12-31 00:00:00   0.4917   0.1758    0.6299    0.7827
    2002-03-29 00:00:00   1.149    0.02675   0.7159    0.3864
    2002-06-28 00:00:00   0.3857  -1.422     0.6118    0.168
    2002-09-30 00:00:00  -0.3469   0.7316   -1.449     0.878
    2002-12-31 00:00:00  -1.188    0.664     0.1209   -1.432
    2003-03-31 00:00:00  -0.2443   1.097    -0.1434    0.7267
    2003-06-30 00:00:00  -0.8552  -0.8598   -0.9205    0.4138
    2003-09-30 00:00:00   1.099   -0.2376    1.064    -1.14

    Some aggregations are so common that they’re provided as instance methods on the GroupBy object:

    In [24]: grouped.sum()
    Out[24]:
           A        B        C          D
    2000  -0.9629   0.1108   0.3811    -0.3405
    2001   0.9159  -1.219    2.529      0.8268
    2002  -1.176   -2.055   -0.006253   2.921
    2003   1.959   -2.381    0.03838   -0.528

    In [25]: grouped.mean()
    Out[25]:
           A        B         C          D
    2000  -0.2407   0.02769   0.09526   -0.08512
    2001   0.229   -0.3049    0.6324     0.2067
    2002  -0.2939  -0.5138   -0.001563   0.7303
    2003   0.6531  -0.7938    0.01279   -0.176

    For those with SQL experience, a DataFrame can be treated like an SQL table. As such, grouping “by columns” is quite natural:

    In [27]: df2
    Out[27]:
        A     B       C        D
    0   foo   one    -0.7883   0.7743
    1   bar   one    -0.5866   0.06009
    2   foo   two     0.9312   1.2
    3   bar   three  -0.6417   0.3444
    4   foo   two    -0.8841  -0.08126
    5   bar   two     1.194   -0.7933
    6   foo   one    -1.624   -0.1403
    7   foo   three   0.5046   0.5833

    In [28]: for name, group in df2.groupby('A'):
       ....:     print name
       ....:     print group
       ....:
    bar
        A     B       C        D
    1   bar   one    -0.5866   0.06009
    3   bar   three  -0.6417   0.3444
    5   bar   two     1.194   -0.7933

    foo
        A     B       C        D
    0   foo   one    -0.7883   0.7743
    2   foo   two     0.9312   1.2
    4   foo   two    -0.8841  -0.08126
    6   foo   one    -1.624   -0.1403
    7   foo   three   0.5046   0.5833

    Note the presence of the “nuisance” B column when we group by A. I want this to work:

    In [30]: df2.groupby('A').mean()
    Out[30]:
          C         D
    bar  -0.01137  -0.1296
    foo  -0.3722    0.4671

    More on dealing with this below.

    New: Column selection

    In my last post I mentioned wanting to be able to do a groupby-operation on a single column of a DataFrame using another column or set of columns as the groupers. So in the above example, what I want is:

    grouped = df2['C'].groupby(df2['A'])

    # or
    grouped = df2['C'].groupby([df2['A'], df2['B']])

    This is too verbose / awkward for my tastes. So now you can do:

    In [111]: df2.groupby('A')['C'].mean()
    Out[111]:
    bar    -0.447919901908
    foo    -0.158016186685

    Behind the scenes, this simply passes the C column to a Series GroupBy object along with the already-computed grouping(s).
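
    In other words, the shorthand above is roughly equivalent to the more verbose form (a sketch of the equivalence, not the internal code path):

    # what df2.groupby('A')['C'].mean() effectively does
    df2['C'].groupby(df2['A']).mean()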

    New: Group by multiple columns / key functions

    The ability to group by multiple criteria (just like SQL) has been one of my most desired GroupBy features for a long time. It’s also very hard to implement efficiently. So I bit the bullet and carefully crafted Cython code to speedily do aggregations on data grouped by multiple columns. So now you can do:

    In [113]: df2.groupby(['A', 'B']).mean()
    Out[113]:
        A     B       C        D
    0   foo   one     0.3558  -0.4818
    1   foo   two    -0.2758  -1.271
    2   foo   three  -0.95     1.709
    3   bar   one     1.916   -0.5093
    4   bar   two    -2.315    0.1177
    5   bar   three  -0.9447   0.2519

    Iteration with multiple groups flattens the key pairs:

    In [114]: for k1, k2, group in df2.groupby(['A', 'B']):
       .....:     print k1, k2
       .....:     print group
       .....:
    foo one
        A     B     C        D
    0   foo   one   0.8248   0.3856
    6   foo   one  -0.1133  -1.349

    foo two
        A     B     C        D
    2   foo   two  -1.175   -1.036
    4   foo   two   0.6239  -1.506

    foo three
        A     B       C      D
    7   foo   three  -0.95   1.709

    bar one
        A     B     C       D
    1   bar   one   1.916  -0.5093

    bar two
        A     B     C       D
    5   bar   two  -2.315   0.1177

    bar three
        A     B       C        D
    3   bar   three  -0.9447   0.2519

    As a less trivial example, I loaded some data from the Bureau of Labor Statistics, which I’ll write more about another time:

    In [126]: df = read_table('ce.data.44.FinActOtherCurr.txt',
                              sep='\s+', index_col=None)

    In [127]: df
    Out[127]:
    <class 'pandas.core.frame.DataFrame'>
    Index: 32806 entries, 0 to 32805
    Data columns:
    series_id    32806  non-null values
    year         32806  non-null values
    period       32806  non-null values
    value        32806  non-null values
    dtypes: int64(1), float64(1), object(2)

    In [128]: df[:15]
    Out[128]:
         series_id       year   period   value
    0    CEU5500000034   1990   M01      83
    1    CEU5500000034   1990   M02      83.6
    2    CEU5500000034   1990   M03      83.4
    3    CEU5500000034   1990   M04      84.9
    4    CEU5500000034   1990   M05      84.1
    5    CEU5500000034   1990   M06      85.4
    6    CEU5500000034   1990   M07      86.5
    7    CEU5500000034   1990   M08      86
    8    CEU5500000034   1990   M09      85.5
    9    CEU5500000034   1990   M10      83.6
    10   CEU5500000034   1990   M11      83.6
    11   CEU5500000034   1990   M12      84.7
    12   CEU5500000034   1990   M13      84.5
    13   CEU5500000034   1991   M01      82.9
    14   CEU5500000034   1991   M02      83.1

    This data set has 118 unique series_id’s and spans 22 years (the period column gives the month of observation), so we’re talking about 2596 unique groups. Let’s say we wanted to compute the mean value for each year; it’s pretty simple with the improved groupby:

    In [129]: df.groupby(['series_id', 'year']).mean()
    Out[129]:
    <class 'pandas.core.frame.DataFrame'>
    Index: 2596 entries, 0 to 2595
    Data columns:
    series_id    2596  non-null values
    value        2596  non-null values
    year         2596  non-null values
    dtypes: int64(1), float64(1), object(1)

    I was also pretty impressed with how fast this operation is (the power of Cython!):

    In [130]: timeit df.groupby(['series_id', 'year']).mean()
    10 loops, best of 3: 22.9 ms per loop

    I haven’t compared this with an SQL engine (SQLite, MySQL, Postgres, …) but I hope that it’s competitive. If you call a non-Cythonized function, the speed goes down considerably:

    In [132]: timeit df.groupby(['series_id', 'year'])['value'].std()
    1 loops, best of 3: 232 ms per loop

    Edit: It’s also possible to pass multiple functions instead of column names or to mix and match column names with functions:

    In [22]: df.groupby([lambda x: x.year, lambda x: x.month]).mean()
    Out[22]:
         A          B         C          D         key_0   key_1
    0    0.09433    0.06146   0.1891     0.2079    2000    1    
    1    0.02674   -0.09411   0.1557    -0.07926   2000    2    
    2    0.07089    0.3938   -0.08026   -0.416     2000    3    
    3    0.133     -0.04532   0.03341   -0.2866    2000    4    
    4    0.1301     0.1803   -0.4241     0.1482    2000    5    
    5   -0.3804    -0.08935   0.04791    0.08268   2000    6    
    6    0.2211     0.2483   -0.3559    -0.06205   2000    7    
    7   -0.2908     0.07606  -0.2001    -0.05225   2000    8    
    8   -0.07888   -0.3915    0.06442    0.5109    2000    9    
    9    0.3809     0.4037   -0.2431    -0.2892    2000    10  
    10  -0.2765     0.01461  -0.2653     0.1335    2000    11  
    11   0.6983    -0.1167    0.3964     0.03776   2000    12  
    12  -0.2347     0.2333   -0.06508    0.3       2001    1    
    13   0.1751     0.05545  -0.004038   0.06194   2001    2    
    14   0.08792    0.03517   0.05583   -0.4356    2001    3    
    15  -0.08408    0.05567  -0.1705     0.1914    2001    4    
    16   0.008912   0.2317    0.08875    0.08637   2001    5    
    17   0.3304     0.3635    0.005158   0.3588    2001    6    
    18  -0.1324     0.3419   -0.2527    -0.2249    2001    7    
    19   0.0472     0.2552   -0.07589   -0.07992   2001    8    
    20   0.07582   -0.2295    0.1002     0.1077    2001    9    
    21   0.2368     0.02214  -0.3866    -0.2671    2001    10  
    22   0.2116    -0.07872  -0.06399    0.2309    2001    11

    New: Omitting “nuisance” columns

    It’s pretty common to group by a column and ignore other columns containing non-floating point data. It used to be that you’d get an error, forcing you to first drop the “nuisance” columns, e.g. suppose we only want to group by the A column here:

    In [133]: df2
    Out[133]:
        A     B       C        D
    0   foo   one     0.8248   0.3856
    1   bar   one     1.916   -0.5093
    2   foo   two    -1.175   -1.036
    3   bar   three  -0.9447   0.2519
    4   foo   two     0.6239  -1.506
    5   bar   two    -2.315    0.1177
    6   foo   one    -0.1133  -1.349
    7   foo   three  -0.95     1.709

    If you try to compute the mean of the B column, things won’t quite work out:

    In [134]: grouped = df2.groupby('A')['B'].mean()
    TypeError: unsupported operand type(s) for /: 'str' and 'float'

    To cope with this, so-called “nuisance” columns are now automatically dropped. While I don’t like discarding information, I think “practicality beats purity” in this case:

    In [135]: df2.groupby('A').mean()
    Out[135]:
          C        D
    bar  -0.4479  -0.0466
    foo  -0.158   -0.3593

    New: “auto-dispatching” to DataFrame / Series

    Since very few functions in DataFrame and Series/TimeSeries are implemented on the GroupBy classes, you may wish to simply invoke a Series instance method on each group, let’s say:

    In [140]: df.groupby('series_id')['value'].agg(Series.skew)
    Out[140]:
    CEU5500000034    -0.253887792525
    CEU5500000035    0.0938252241898
    CEU5552200034    0.387563548587
    CEU5552200035    0.0868489117242
    CEU5552210034    0.74506367536
    CEU5552210035    0.551290226183
    <snip>

    But since we’ve got Python at our disposal, it’s relatively straightforward to override __getattribute__ to wrap and dispatch calls to Series/DataFrame instance methods so you can do this:

    In [141]: df.groupby('series_id')['value'].skew()
    Out[141]:
    CEU5500000034    -0.253887792525
    CEU5500000035    0.0938252241898
    CEU5552200034    0.387563548587
    CEU5552200035    0.0868489117242
    CEU5552210034    0.74506367536
    <snip>

    Some of the more complicated methods, like the super-useful describe method, can in fact be used:

    In [142]: df.groupby('series_id')['value'].describe()
    Out[142]:
    count mean std min 10% 50% 90% max
    CEU5500000034 279 96.54 8.317 81.7 84 98.6 107.2 112
    CEU5500000035 279 93.88 28.9 50.4 55.28 91.9 134.5 140.2
    CEU5552200034 278 97.95 6.839 86.9 89.47 97.15 108.2 113.9
    CEU5552200035 278 93.58 26.18 58 60.34 89.2 127.4 137.3
    CEU5552210034 278 100.7 3.95 94.1 95.87 99.8 106 113.1
    CEU5552210035 278 98.84 23.86 71.6 73.77 92.6 137.2 145.8
    ...

    For those interested, the internal code to do this dispatching is short and simple, by the way:

    def __getattribute__(self, attr):
        try:
            return object.__getattribute__(self, attr)
        except AttributeError:
            if hasattr(self.obj, attr):
                return self._make_wrapper(attr)
            raise

    def _make_wrapper(self, name):
        f = getattr(self.obj, name)
        if not isinstance(f, types.MethodType):
            return self.aggregate(lambda self: getattr(self, name))

        # return the class reference
        f = getattr(type(self.obj), name)

        def wrapper(*args, **kwargs):
            curried = lambda self: f(self, *args, **kwargs)
            return self.aggregate(curried)
        return wrapper

    Old but not well known: apply multiple functions

    This has always been around but not well-known by many people. Suppose you want to apply multiple functions to a group and collect all the results. You can pass a dict of functions:

    In [144]: grouped.agg({'mean' : np.mean, 'variance' : np.var})
    Out[144]:
                    mean    variance
    CEU5500000034   96.54   68.92  
    CEU5500000035   93.88   832.3  
    CEU5552200034   97.95   46.6    
    CEU5552200035   93.58   683.1  
    CEU5552210034   100.7   15.54  
    CEU5552210035   98.84   567.1  
    CEU5552211034   100.9   13.73  
    CEU5552211035   99.34   622    
    <snip>

    I have a mind to expand on this idea for DataFrame objects to get us closer to the full flexibility of SQL with Pythonic syntax. If anyone has ideas or suggestions please let me know.

    Conclusions

    GroupBy is certainly not done. If you use these tools and find them useful, please let me know. If you can think of ways to make them better, that would be nice information too. I think it would be great to implement a full SQL engine on top of pandas (similar to the SAS “proc sql”), and this new GroupBy functionality gets us closer to that goal.

    A Roadmap for Rich Scientific Data Structures in Python

    Discussion thread on Hacker News

    So, this post is a bit of a brain dump on rich data structures in Python and what needs to happen in the very near future. I care about them for statistical computing (I want to build a statistical computing environment that trounces R) and financial data analysis (all evidence leads me to believe that Python is the best all-around tool for the finance space). Other people in the scientific Python community want them for numerous other applications: geophysics, neuroscience, etc. It’s really hard to make everyone happy with a single solution. But the current state of affairs has me rather anxious. And I’d like to explain why. For a really quick summary of some of the work I’ve been doing, here are my SciPy 2010 talk slides:

    Data structures with metadata, the backstory

    In the wake of SciPy 2011 I’ve been thinking a lot about the way forward from here in terms of building rich Pythonic data structures for statistics and many, many other fields. By rich I mean: it is not just a NumPy ndarray; it contains metadata (however we define metadata), has operations which depend on the metadata, and in general does far more than structured arrays currently do for you. This touches on a great many topics and features that people want (partial list):

    • Manipulating heterogeneously-typed (what I loosely call “mixed type”) data
    • Size mutability: can easily add “columns” or otherwise N-1-dimensional hyperslices without necessarily copying data
    • Metadata about what each axis represents: axis=3 is less meaningful than axis=’temperature’
    • Metadata about axis labels (ticks)
    • Label / tick-based data alignment / reshaping, either automatic or explicit
    • Label / tick-based (fancy) indexing, both setting and getting
    • Hierarchical columns
    • Robust handling of missing / NA data (à la R)
    • Tons of operations needing heterogeneous data and metadata: group by, filtering, sorting, selection / querying, reindexing (reshaping to conform to a new set of labels), axis selection based on names, etc. etc. 
    The list goes on and on. I could write a 50-page manuscript on the exact specification of what functionality is desired for each of the above bullet points. What I do know is that after using a rich data structure like the ones in pandas, it’s very, very hard to go back to using vanilla ndarrays. To wit, R users coming to Python have a similar experience: the lack of data.frame and all the functions which operate on data.frames is a gigantic loss of functionality. When I work with MATLAB and R users (especially in the finance industry) and get them up and running with pandas, I get a lot of “where was this tool all my life?”. It’s just that much better. In fact, users can get by with only a very rudimentary understanding of NumPy if the data structures are good enough; I think this is highly desirable. Even for purely interactive data analysis (forget operations which actually utilize the metadata), isn’t this much better:
    In [4]: data.corr()
    Out[4]:
           AAPL     GOOG     MSFT     YHOO  
    AAPL   1        0.5724   0.4714   0.3447
    GOOG   0.5724   1        0.5231   0.3409
    MSFT   0.4714   0.5231   1        0.3012
    YHOO   0.3447   0.3409   0.3012   1

    than this:

    In [11]: np.corrcoef(data.T)
    Out[11]:
    array([[ 1.    ,  0.5724,  0.4714,  0.3447],
           [ 0.5724,  1.    ,  0.5231,  0.3409],
           [ 0.4714,  0.5231,  1.    ,  0.3012],
           [ 0.3447,  0.3409,  0.3012,  1.    ]])

    Of course if data were a structured ndarray you would be completely up a creek (most NumPy functions do not play well with structured arrays). But that’s another topic.

    But anyway, to the point of why I’m writing: we have a ton of really talented people with real problems to solve, and lots of great ideas about how to solve them. Last year at SciPy 2010 in Austin we had a Birds of a Feather session led by the venerable Fernando Pérez and myself to talk about the datarray project, pandas, tabular, larry, and other various ideas about data structures that people have kicked around. The topic is important enough that Enthought hosted a gathering this past May in Austin, the DataArray Summit, to talk about these issues and figure out where to go from here. It was a great meeting and we hashed out in gory detail many of the problem areas that we’d like to solve with a richer data structure. So that’s a bit of the backstory.

    But even given all these great discussions and work being done, we have a really fundamental problem:

    Fragmentation is killing us

    There, I said it =) All of us NumPy ninjas can navigate the fragmented, incohesive collection of data structures and tools, but it’s confusing as hell for new and existing users. NumPy alone is not good enough for a lot of people (statisticians, financial data analysts, etc.), but they’re left with a confusing choice between pandas, larry, datarray, or something else. Also, these tools have largely not been integrated with any other tools because of the community’s collective commitment anxiety. We talk, hem and haw, and wring our hands. And still no integration. I don’t mean to complain: I just deeply care about making the scientific Python stack the most powerful data analysis stack to ever exist. Seriously. And I can’t do it alone. And I don’t want to make unilateral decisions and shove anything down anyone’s throat. We’ve started working on integration of pandas in statsmodels (which is already going to make a really huge difference), but we need to collectively get our proverbial sh*t together. And soon.

    My work on pandas lately and why it matters

    On my end, in the 2 months since the DataArray summit, I decided to put my PhD on hold and focus more seriously on Python development, statistics / statsmodels, pandas, and other projects I care deeply about. So I’ve easily invested more time in pandas in the last 2 months than in the previous 2 years. This has included heavily redesigning the internals (which I’m extremely pleased with) and tackling various other thorny internal issues which had long been an annoyance for me and an external source of criticism by end users (like, “why are there two 2-dimensional data structures, DataFrame and DataMatrix?” The answer is that the internal implementation was different, with different performance characteristics depending on use). I’m also trying to clear out my extensive backlog of wishlist features (lots of blog articles to come on these). What’s happened is that I started prototyping a new data structure which I’m calling NDFrame, which I think is really going to be a big deal. Basically, the NDFrame is a by-product of the redesigned internal data structure (currently called BlockManager) backing DataFrame and friends. I can and should write an entire post about BlockManager and exactly what it needs to accomplish, but the short story is that BlockManager:

    • Stores an arbitrary collection of homogeneously-typed N-dimensional NumPy ndarray objects (I call these Block objects, or simply blocks)
    • Has axis labels and is capable of reshaping the “blocks” to a new set of labels
    • Is capable of consolidating blocks (gluing them together) having the same dtype
    • Knows how to appropriately introduce missing data and upcast dtypes (int, bool) when the missing-data marker (currently NaN; crossing my fingers for a good NA implementation!) needs to be introduced (a brief sketch of the observable behavior follows this list)
    • Can deal with single “item” type casts (ha, try to do that with structured arrays!)
    • Can accept new blocks without copying data
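    To make the missing-data / upcasting bullet concrete, here's a small sketch of the observable behavior (not the internal BlockManager code, just what the user sees):

    from pandas import Series

    s = Series([1, 2, 3], index=['a', 'b', 'c'])
    s.dtype                      # int64

    # Reindexing to a label that doesn't exist forces the missing-data
    # marker (NaN) in, so the integer block gets upcast to float64
    # behind the scenes
    s2 = s.reindex(['a', 'b', 'c', 'd'])
    s2.dtype                     # float64; s2['d'] is NaN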
    Maybe this is too meta (heh) for some people. As an illustration, here’s literally what’s going on inside DataFrame now after all my latest hacking:
    In [40]: df
    Out[40]:
        b         a         c       e   f
    0   1.213     1.507     True    1   a
    1  -0.6765    0.06237   True    1   b
    2   0.3126   -0.2575    False   1   c
    3   0.1505    0.2242    True    1   d
    4  -0.7952    0.2909    True    1   e
    5   1.341    -0.9712    False   2   f
    6   0.01121   1.654     True    2   g
    7  -0.173    -1.385     False   2   h
    8   0.1637   -0.898     False   2   i
    9   0.5979   -1.035     False   2   j

    In [41]: df._data
    Out[41]:
    BlockManager
    Items: [b a c e f]
    Axis 1: [0 1 2 3 4 5 6 7 8 9]
    FloatBlock: [b a], 2 x 10, dtype float64
    BoolBlock: [c], 1 x 10, dtype bool
    IntBlock: [e], 1 x 10, dtype int64
    ObjectBlock: [f], 1 x 10, dtype object

    The user, of course, is never intended to look at this; it’s purely internal. But, for example, things like this are OK to do:

    In [42]: df['e'] = df['e'].astype(float)
    In [43]: df._data
    Out[43]:
    BlockManager
    Items: [b a c e f]
    Axis 1: [0 1 2 3 4 5 6 7 8 9]
    FloatBlock: [b a], 2 x 10, dtype float64
    BoolBlock: [c], 1 x 10, dtype bool
    ObjectBlock: [f], 1 x 10, dtype object
    FloatBlock: [e], 1 x 10, dtype float64

    Since there are now multiple float blocks in that case, they can be explicitly consolidated; if you use DataFrame, many operations will cause consolidation to happen automatically, which is highly desirable (especially when you have only one dtype) for doing row-oriented operations:

    In [44]: df._data.consolidate()
    Out[44]:
    BlockManager
    Items: [b a c e f]
    Axis 1: [0 1 2 3 4 5 6 7 8 9]
    BoolBlock: [c], 1 x 10, dtype bool
    FloatBlock: [b a e], 3 x 10, dtype float64
    ObjectBlock: [f], 1 x 10, dtype object

    Now, this is a pretty abstract business. Here’s my point: when I started thinking about NDFrame, a user-facing n-dimensional data structure backed by BlockManager, I realized that what I am going to build is a nearly strict superset of the functionality provided by every other rich data structure I know of. I made a picture of the feature overlap, and note that arrows loosely mean: “can be used to implement”:

    Data Structure Features Venn Diagram

    For example, I need to write generic fancy indexing on the NDFrame, a task largely tackled by DataArray. So rather than reinvent the wheel, I should just co-opt that code (love that BSD license), but then I’ve effectively created a fork (nooooo!). I think having all these different libraries (and leaving users the confusing choice between them) is kind of nuts. Ideally DataArray (homogeneous) should just be a part of pandas (and I’m not opposed to changing the name, though it has stronger branding and far more users than datarray or larry). But once we’ve gone down that route, larry is just a DataArray (homogeneous) with automatic data alignment. We’re all doing the exact same kinds of things. So why not have one library?

    Remember: the real world is heterogeneously-typed

    Some other casual (and potentially controversial) observations

    Here’s just a dumping ground of various thoughts I have on this and related topics:

    • The NumPy type hierarchy (int8, int16, int32, int64, uint8, uint16, …) isn’t that important to me. R and MATLAB don’t really have a type hierarchy and it doesn’t seem to pose a problem. So I haven’t gone down the road of creating Block objects mapping onto the NumPy type hierarchy. If someone wants to do it without complicating the end-user pandas experience, be my guest
    • Tying yourself to the ndarray is too restrictive. This is a major problem with DataArray and larry; they don’t do mixed-type data. So if you want to build something that competes with R, you have failed before you have even begun by using a homogeneous-only data structure. Remember, DataFrame can be homogeneous whenever it wants to, and getting the underlying ndarray is just a few keystrokes away (see the short sketch after this list).
    • Structured arrays are probably not the answer. Size mutability (the ability to add columns) and the ability to change dtypes are actually a big deal, as is the ability to do row-oriented computations, broadcasting, and all the things that structured arrays are (currently) unable to do. They are a darned convenient way of serializing / deserializing data, though. But memory layout and all that is far, far less important to me than usability / user interface / experience.
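    To ground those last two bullets, here's a quick sketch (made-up column names) of what I mean by size mutability and the ndarray being only a few keystrokes away when the data happen to be homogeneous:

    import numpy as np
    from pandas import DataFrame

    df = DataFrame({'x': [1.0, 2.0, 3.0],
                    'y': [4.0, 5.0, 6.0]})

    # Size mutability: adding a derived column is a one-liner and does
    # not require rebuilding the whole structure (try that with a
    # structured array)
    df['z'] = df['x'] * df['y']

    # When everything is one dtype, the underlying ndarray is right
    # there for plain NumPy, row-oriented work
    arr = df.values          # 3 x 3 float64 ndarray
    arr.mean(axis=1)         # row means, vanilla NumPy from here on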