Category Archives: python

NumPy indexing peculiarities

Many scientific Python users are surprised when I tell them that ndarray.take is faster than __getitem__-based (a.k.a. “fancy”) indexing.

import numpy as np
import random

arr = np.random.randn(10000, 5)
indexer = np.arange(10000)
random.shuffle(indexer)

In [26]: timeit arr[indexer]
1000 loops, best of 3: 1.25 ms per loop

In [27]: timeit arr.take(indexer, axis=0)
10000 loops, best of 3: 127 us per loop

It’s actually kind of unbelievable when you think about it. What’s going on here that makes take almost 10x faster? I really should take a closer look at the internals of what __getitem__ does, because this has always struck me as pretty bad. Maybe I shouldn’t be complaining? I mean, R 2.13’s indexing falls somewhere in the middle:

mat <- matrix(rnorm(50000), nrow=10000, ncol=5)
set.seed(12345)
indexer <- sample(1:10000)
> system.time(for (i in 1:1000) mat[indexer,])
   user  system elapsed
  0.460   0.197   0.656

So 656 microseconds per iteration. (In an earlier version of this post I used rpy2 to do the benchmark and got 1.05 ms, but there was apparently some overhead from rpy2.)

Another peculiarity I noticed with take is that performance gets worse when you use the out argument, which tells the function to write the result into an array you pass in:

out = np.empty_like(arr)

In [50]: timeit np.take(arr, indexer, axis=0, out=out)
10000 loops, best of 3: 200 us per loop

EDIT: I’ve been informed that using mode='clip' or mode='wrap' makes this run as fast as without the out argument.
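
For reference, the clip-mode call looks like this (a sketch; I haven’t reproduced the timing output here):

# mode='clip' (or mode='wrap') avoids the slow path that take hits when out=
# is combined with the default mode='raise', per the note above
np.take(arr, indexer, axis=0, out=out, mode='clip')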

Weird! I was dissatisfied by this, so I got curious how fast a little hand-coded Cython function could do this:

cimport cython
import numpy as np
from numpy cimport ndarray, float64_t, int32_t

cdef double NaN = np.nan

@cython.wraparound(False)
@cython.boundscheck(False)
def take_axis0(ndarray[float64_t, ndim=2] values,
               ndarray[int32_t] indexer,
               out=None):
    cdef:
        Py_ssize_t i, j, k, n, idx
        ndarray[float64_t, ndim=2] outbuf

    if out is None:
        outbuf = np.empty_like(values)
    else:
        outbuf = out

    n = len(indexer)
    k = values.shape[1]
    for i from 0 <= i < n:
        idx = indexer[i]

        if idx == -1:
            for j from 0 <= j < k:
                outbuf[i, j] = NaN
        else:
            for j from 0 <= j < k:
                outbuf[i, j] = values[idx, j]

    return outbuf

Don’t worry about the -1 thing; that’s a specialization I’m using inside pandas (sketched just below). Curiously, this function is a lot faster than take with the out argument, but only a handful of microseconds faster than the regular take.
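
To illustrate the -1 convention (a sketch, assuming take_axis0 is compiled into the lib module used in the timings below):

idx = np.array([0, -1, 2], dtype=np.int32)
out3 = np.empty((3, arr.shape[1]))
lib.take_axis0(arr, idx, out=out3)   # out3[1] is filled with NaN; rows 0 and 2 are copied from arr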

In [53]: timeit lib.take_axis0(arr, indexer)
10000 loops, best of 3: 115 us per loop

In [54]: timeit lib.take_axis0(arr, indexer, out)
10000 loops, best of 3: 109 us per loop

Very interesting.

TL;DR

  • Use take not []-based indexing to get best performance
  • Cython is just as fast for my specific application and a lot faster if you’re passing an out array (which I will be for the application that I needed this for)
  • R’s matrix indexing performance is better than NumPy’s fancy indexing, but about 5-6x slower than ndarray.take. This can probably be improved.

Leaking memory like a pro in Python

    For all my Python zealotry, I never claimed it was perfect. In fact, there are a number of places you can shoot yourself in the foot if you aren’t careful. But I was kind of bummed about this one:

    import numpy as np

    class MainScreen(object):

        def __init__(self, cats):
            self.cats = cats
            self._signal = None

        def turn_on(self):
            if self._signal is None:
                signal = Signal(self)
                self._signal = signal
            return self._signal

    class Signal(object):

        def __init__(self, main_screen):
            # well we want to know which screen we came from!
            self.main_screen = main_screen

        def we_get(self):
            return True

    for i in xrange(50):
        cats = np.empty((10000, 500), dtype=float)
        cats.fill(0)
        main_screen = MainScreen(cats)
        signal = main_screen.turn_on()
        # all your base...

    Try it.

    EDIT: I’ve been informed by an astute reader that the GC is actually capable of dealing with the above case. Yang Zhang says “The collector actually works with cycles, including this case. (Actually, most objects are managed via reference-counting; the gc is specifically for cyclic references.) The problem is that it’s triggered by # allocations, not # bytes. In this case, each cats array is large enough that you need several GB of allocations before the threshold is triggered. Try adding gc.collect() or gc.get_count() in the loop (and inspect gc.get_threshold()).”
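
    A sketch of that suggestion applied to the loop above (hypothetical; GC thresholds vary by interpreter):

    import gc

    for i in xrange(50):
        cats = np.empty((10000, 500), dtype=float)
        cats.fill(0)
        main_screen = MainScreen(cats)
        signal = main_screen.turn_on()
        print gc.get_count()   # allocation counters, compared against gc.get_threshold()
        gc.collect()           # reclaim the previous iteration's MainScreen <-> Signal cycle now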

    All is not lost, however. This pattern can also be implemented using weak references, which avoids creating the reference cycle in the first place:

    import weakref

    class MainScreen(object):

        def __init__(self, cats):
            self.cats = cats
            self._signal = None

        def turn_on(self):
            if self._signal is None:
                signal = Signal(self)
                self._signal = weakref.ref(signal)
            return self._signal()
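
    A quick check of the behavior (a sketch; it relies on CPython’s reference counting for prompt collection):

    screen = MainScreen(np.zeros((10000, 500)))
    sig = screen.turn_on()            # the caller holds the only strong reference to the Signal
    print screen.turn_on() is sig     # True: the weak reference is still alive
    del sig
    print screen.turn_on() is None    # True: with no cycle, the Signal was freed immediately

    The trade-off is visible on that last line: once every strong reference to the Signal goes away, turn_on hands back None rather than reviving it.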

    R’s factor function is slow

    I’m not making this up:

    > arr <- rep(letters, 1000)
    > system.time(for (i in 1:1000) foo <- factor(arr))
    user system elapsed
    1.809 0.023 1.821

    > arr <- 1:10000
    > system.time(for (i in 1:1000) foo <- factor(arr))
    user system elapsed
    7.782 0.088 8.261

    Here’s the equivalent code using the pandas Factor class (git HEAD):

    In [35]: import string
    In [36]: arr = np.tile(list(string.letters[:26]), 1000).astype(object)
    In [37]: Factor(arr)
    Out[37]:
    Factor:
    array([a, b, c, ..., x, y, z], dtype=object)
    Levels (26): [a b c d e f g h i j k l m n o p q r s t u v w x y z]
    In [38]: timeit Factor(arr)
    1000 loops, best of 3: 869 us per loop

    In [39]: arr = np.arange(10000, dtype=object)
    In [40]: timeit Factor(arr)
    1000 loops, best of 3: 1.91 ms per loop

    I just updated this post, since R 2.11.1 is a lot slower than 2.13.1 (the latest release) and it wasn’t quite fair to base the comparison on that. So there are two cases here:

    • Few unique values (the 26 letters of the alphabet). R does pretty well here, about 2x slower. This is down from 12x slower in R 2.11.1, so I can only assume someone on the R core team noticed this in the last year or so.
    • All unique values. R is about 4x slower now.

    It’s possible I’m using a different algorithm (and so is R between 2.11 and 2.13!). Maybe Python’s much lauded dict implementation is responsible? I’m more curious than anything. Maybe an R guru could let me know (I’m afraid to look at the GPL source code).

    For the record, here’s the Python and Cython code involved. First, the Python code:

    def unique_with_labels(values):
        uniques = Index(_tseries.fast_unique(values))
        labels = _tseries.get_unique_labels(values, uniques.indexMap)
        return uniques, labels

    And the two Cython functions:

    cimport cython
    import numpy as np
    from numpy cimport ndarray, int32_t

    @cython.wraparound(False)
    @cython.boundscheck(False)
    def fast_unique(ndarray[object] values):
        cdef:
            Py_ssize_t i, n = len(values)
            list uniques = []
            dict table = {}
            object val, stub = 0

        for i from 0 <= i < n:
            val = values[i]
            if val not in table:
                table[val] = stub
                uniques.append(val)
        try:
            uniques.sort()
        except Exception:
            pass

        return np.asarray(uniques, dtype=object)

    @cython.wraparound(False)
    @cython.boundscheck(False)
    def get_unique_labels(ndarray[object] values, dict idMap):
        cdef int i, length
        cdef object idx
        cdef ndarray[int32_t] labels
        length = len(values)
        labels = np.empty(length, dtype=np.int32)
        for i from 0 <= i < length:
            idx = values[i]
            labels[i] = idMap[idx]

        return labels
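
    For example (a sketch, assuming Index.indexMap maps each unique value to its position):

    values = np.array(['b', 'a', 'b', 'c'], dtype=object)
    uniques, labels = unique_with_labels(values)
    # uniques -> Index(['a', 'b', 'c']), labels -> array([1, 0, 1, 2], dtype=int32)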

    GroupBy-fu: improvements in grouping and aggregating data in pandas

    A couple of weeks ago, in my inaugural blog post, I wrote about the state of GroupBy in pandas and gave an example application. However, I was dissatisfied with the limited expressiveness (see the end of that article), so over the last two weeks I’ve invested some serious time in beefing up what you can do with the groupby functionality in pandas. So this article is part show-and-tell, part quick tutorial on the new features. Note that I haven’t added a lot of this to the official documentation yet.

    GroupBy primer

    GroupBy may be one of the least well-understood features in pandas. Let’s consider a DataFrame (basically a labeled 2D array):

    In [18]: df
    Out[18]:
                          A         B        C         D
    2000-03-31 00:00:00   0.6304   -1.493    0.2082    0.3378
    2000-06-30 00:00:00  -0.1041    1.093    0.1227   -0.2362
    2000-09-29 00:00:00  -1.822     0.7325  -0.4964   -1.707
    2000-12-29 00:00:00   0.3325   -0.2213   0.5465    1.265
    2001-03-30 00:00:00   0.8422   -0.1265   0.1234    0.7743
    2001-06-29 00:00:00  -0.07774  -0.3906  -0.6328   -1.259
    2001-09-28 00:00:00  -0.3417   -0.4209   1.724     0.3069
    2001-12-31 00:00:00   0.493    -0.2815   1.314     1.004
    2002-03-29 00:00:00   1.003    -0.4807   0.7232    1.079
    2002-06-28 00:00:00   0.1415   -2.275    0.6178    0.8817
    2002-09-30 00:00:00  -0.6855    0.3921  -1.468     1.522
    2002-12-31 00:00:00  -1.635     0.3083   0.1209   -0.5606
    2003-03-31 00:00:00   0.5707   -0.7125  -0.07677   0.5154
    2003-06-30 00:00:00   0.3646   -0.8574  -0.562     0.2176
    2003-09-30 00:00:00   1.024    -0.8113   0.6771   -1.261

    Here, the index (row labels) contains dates and the columns are names for each time series. When performing a groupby operation, we may have different goals:

    • Perform an aggregation, like computing the sum or mean of each group. Functionally this means applying a function to each group and putting the aggregated results into a DataFrame
    • Slicing the DataFrame into chunks (groups) and then doing something with the slices
    • Performing a transformation, like standardizing each group (computing the zscore)

    So there are two tasks: first, grouping the data; second, doing something with the grouped data. As far as grouping, in a totally abstract sense we have to define some kind of mapping that assigns labels (one of the axes) into group buckets. A mapping could be one of many things:

    • A Python function, to be called on each label
    • A dict, containing {label : group_name} mappings
    • An array the same length as the axis containing the group correspondences. This may often be one of the columns in a DataFrame, but need not be. Every other mapping type gets rendered to this form eventually.
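
    The function form is what’s demonstrated next; for the record, the other two forms would look something like this (a sketch):

    df.groupby(dict((d, d.year) for d in df.index))      # dict of {label : group_name}
    df.groupby(np.array([d.year for d in df.index]))     # array of group correspondences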

    So in the above data, let’s say we wanted to group the data by year. Since the row labels are Python datetime objects, we can access the year attribute when calling groupby:

    In [21]: grouped = df.groupby(lambda x: x.year)

    This returns an object of type GroupBy. With this object, you can do a lot of things including: iteration, aggregation, or transformation. So we could iterate like so:

    In [20]: for year, group in grouped:
       ....:     print year
       ....:     print group
       ....:
    2000
                          A        B        C        D
    2000-03-31 00:00:00   0.6304  -1.493    0.2082   0.3378
    2000-06-30 00:00:00  -0.1041   1.093    0.1227  -0.2362
    2000-09-29 00:00:00  -1.822    0.7325  -0.4964  -1.707
    2000-12-29 00:00:00   0.3325  -0.2213   0.5465   1.265

    2001
                          A         B        C        D
    2001-03-30 00:00:00   0.8422   -0.1265   0.1234   0.7743
    2001-06-29 00:00:00  -0.07774  -0.3906  -0.6328  -1.259
    2001-09-28 00:00:00  -0.3417   -0.4209   1.724    0.3069
    2001-12-31 00:00:00   0.493    -0.2815   1.314    1.004

    2002
                          A        B        C        D
    2002-03-29 00:00:00   1.003   -0.4807   0.7232   1.079
    2002-06-28 00:00:00   0.1415  -2.275    0.6178   0.8817
    2002-09-30 00:00:00  -0.6855   0.3921  -1.468    1.522
    2002-12-31 00:00:00  -1.635    0.3083   0.1209  -0.5606

    2003
                          A        B        C         D
    2003-03-31 00:00:00   0.5707  -0.7125  -0.07677   0.5154
    2003-06-30 00:00:00   0.3646  -0.8574  -0.562     0.2176
    2003-09-30 00:00:00   1.024   -0.8113   0.6771   -1.261

    Or aggregate like so:

    In [21]: grouped.aggregate(np.sum)
    Out[21]:
           A        B        C          D
    2000  -0.9629   0.1108   0.3811    -0.3405
    2001   0.9159  -1.219    2.529      0.8268
    2002  -1.176   -2.055   -0.006253   2.921
    2003   1.959   -2.381    0.03838   -0.528

    Or transform groups like so (here I standardize each group):

    In [22]: zscore = lambda x: (x - x.mean()) / x.std()

    In [23]: transformed = grouped.transform(zscore)
    Out[23]:
                          A        B         C         D
    2000-03-31 00:00:00   0.7946  -1.316     0.2598    0.3396
    2000-06-30 00:00:00   0.1246   0.9216    0.06319  -0.1213
    2000-09-29 00:00:00  -1.442    0.61     -1.361    -1.302
    2000-12-29 00:00:00   0.5228  -0.2155    1.038     1.084
    2001-03-30 00:00:00   1.142    1.34     -0.47      0.5569
    2001-06-29 00:00:00  -0.571   -0.644    -1.168    -1.438
    2001-09-28 00:00:00  -1.062   -0.8715    1.009     0.0983
    2001-12-31 00:00:00   0.4917   0.1758    0.6299    0.7827
    2002-03-29 00:00:00   1.149    0.02675   0.7159    0.3864
    2002-06-28 00:00:00   0.3857  -1.422     0.6118    0.168
    2002-09-30 00:00:00  -0.3469   0.7316   -1.449     0.878
    2002-12-31 00:00:00  -1.188    0.664     0.1209   -1.432
    2003-03-31 00:00:00  -0.2443   1.097    -0.1434    0.7267
    2003-06-30 00:00:00  -0.8552  -0.8598   -0.9205    0.4138
    2003-09-30 00:00:00   1.099   -0.2376    1.064    -1.14

    Some aggregations are so common that they’re provided as instance methods on the GroupBy object:

    In [24]: grouped.sum()
    Out[24]:
           A        B        C          D
    2000  -0.9629   0.1108   0.3811    -0.3405
    2001   0.9159  -1.219    2.529      0.8268
    2002  -1.176   -2.055   -0.006253   2.921
    2003   1.959   -2.381    0.03838   -0.528

    In [25]: grouped.mean()
    Out[25]:
           A        B         C          D
    2000  -0.2407   0.02769   0.09526   -0.08512
    2001   0.229   -0.3049    0.6324     0.2067
    2002  -0.2939  -0.5138   -0.001563   0.7303
    2003   0.6531  -0.7938    0.01279   -0.176

    For those with SQL experience, a DataFrame can be treated like an SQL table. As such, grouping “by columns” is quite natural:

    In [27]: df2
    Out[27]:
        A     B       C        D
    0   foo   one    -0.7883   0.7743
    1   bar   one    -0.5866   0.06009
    2   foo   two     0.9312   1.2
    3   bar   three  -0.6417   0.3444
    4   foo   two    -0.8841  -0.08126
    5   bar   two     1.194   -0.7933
    6   foo   one    -1.624   -0.1403
    7   foo   three   0.5046   0.5833

    In [28]: for name, group in df2.groupby('A'):
       ....:     print name
       ....:     print group
       ....:
    bar
        A     B       C        D
    1   bar   one    -0.5866   0.06009
    3   bar   three  -0.6417   0.3444
    5   bar   two     1.194   -0.7933

    foo
        A     B       C        D
    0   foo   one    -0.7883   0.7743
    2   foo   two     0.9312   1.2
    4   foo   two    -0.8841  -0.08126
    6   foo   one    -1.624   -0.1403
    7   foo   three   0.5046   0.5833

    Note the presence of the “nuisance” B column when we group by A. I want this to work:

    In [30]: df2.groupby('A').mean()
    Out[30]:
          C         D
    bar  -0.01137  -0.1296
    foo  -0.3722    0.4671

    More on dealing with this below.

    New: Column selection

    In my last post I mentioned wanting to be able to do a groupby-operation on a single column of a DataFrame using another column or set of columns as the groupers. So in the above example, what I want is:

    grouped = df2['C'].groupby(df2['A'])

    # or
    grouped = df2['C'].groupby([df2['A'], df2['B']])

    This is too verbose / awkward for my tastes. So now you can do:

    In [111]: df2.groupby('A')['C'].mean()
    Out[111]:
    bar    -0.447919901908
    foo    -0.158016186685

    Behind the scenes, this simply passes the C column to a Series GroupBy object along with the already-computed grouping(s).
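
    In other words, the two spellings below should be equivalent (a sketch):

    df2.groupby('A')['C'].mean()         # column selection on the GroupBy object
    df2['C'].groupby(df2['A']).mean()    # passing the grouping column explicitly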

    New: Group by multiple columns / key functions

    The ability to group by multiple criteria (just like SQL) has been one of my most desired GroupBy features for a long time. It’s also very hard to implement efficiently. So I bit the bullet and carefully crafted Cython code to speedily do aggregations on data grouped by multiple columns. So now you can do:

    In [113]: df2.groupby(['A', 'B']).mean()
    Out[113]:
        A     B       C        D
    0   foo   one     0.3558  -0.4818
    1   foo   two    -0.2758  -1.271
    2   foo   three  -0.95     1.709
    3   bar   one     1.916   -0.5093
    4   bar   two    -2.315    0.1177
    5   bar   three  -0.9447   0.2519

    Iteration with multiple groups flattens the key pairs:

    In [114]: for k1, k2, group in df2.groupby(['A', 'B']):
       .....:     print k1, k2
       .....:     print group
       .....:
    foo one
        A     B     C        D
    0   foo   one   0.8248   0.3856
    6   foo   one  -0.1133  -1.349

    foo two
        A     B     C        D
    2   foo   two  -1.175   -1.036
    4   foo   two   0.6239  -1.506

    foo three
        A     B       C      D
    7   foo   three  -0.95   1.709

    bar one
        A     B     C       D
    1   bar   one   1.916  -0.5093

    bar two
        A     B     C       D
    5   bar   two  -2.315   0.1177

    bar three
        A     B       C        D
    3   bar   three  -0.9447   0.2519

    As a less trivial example, I loaded some data from the Bureau of Labor Statistics, which I’ll write more about another time:

    In [126]: df = read_table('ce.data.44.FinActOtherCurr.txt',
                              sep='\s+', index_col=None)

    In [127]: df
    Out[127]:
    <class 'pandas.core.frame.DataFrame'>
    Index: 32806 entries, 0 to 32805
    Data columns:
    series_id    32806  non-null values
    year         32806  non-null values
    period       32806  non-null values
    value        32806  non-null values
    dtypes: int64(1), float64(1), object(2)

    In [128]: df[:15]
    Out[128]:
         series_id       year   period   value
    0    CEU5500000034   1990   M01      83
    1    CEU5500000034   1990   M02      83.6
    2    CEU5500000034   1990   M03      83.4
    3    CEU5500000034   1990   M04      84.9
    4    CEU5500000034   1990   M05      84.1
    5    CEU5500000034   1990   M06      85.4
    6    CEU5500000034   1990   M07      86.5
    7    CEU5500000034   1990   M08      86
    8    CEU5500000034   1990   M09      85.5
    9    CEU5500000034   1990   M10      83.6
    10   CEU5500000034   1990   M11      83.6
    11   CEU5500000034   1990   M12      84.7
    12   CEU5500000034   1990   M13      84.5
    13   CEU5500000034   1991   M01      82.9
    14   CEU5500000034   1991   M02      83.1

    This data set has 118 unique series_ids and spans 22 years (the period column gives the month of observation), so we’re talking about 2596 unique groups. If we wanted to compute the mean value for each series and year, it’s pretty simple with the improved groupby:

    In [129]: df.groupby(['series_id', 'year']).mean()
    Out[129]:
    <class 'pandas.core.frame.DataFrame'>
    Index: 2596 entries, 0 to 2595
    Data columns:
    series_id    2596  non-null values
    value        2596  non-null values
    year         2596  non-null values
    dtypes: int64(1), float64(1), object(1)

    I was also pretty impressed with how fast this operation is (the power of Cython!):

    In [130]: timeit df.groupby(['series_id', 'year']).mean()
    10 loops, best of 3: 22.9 ms per loop

    I haven’t compared this with an SQL engine (SQLite, MySQL, Postgres, …) but I hope that it’s competitive. If you call a non-Cythonized function, the speed goes down considerably:

    In [132]: timeit df.groupby(['series_id', 'year'])['value'].std()
    1 loops, best of 3: 232 ms per loop

    Edit: It’s also possible to pass multiple functions instead of column names or to mix and match column names with functions:

    In [22]: df.groupby([lambda x: x.year, lambda x: x.month]).mean()
    Out[22]:
         A          B         C          D         key_0   key_1
    0    0.09433    0.06146   0.1891     0.2079    2000    1    
    1    0.02674   -0.09411   0.1557    -0.07926   2000    2    
    2    0.07089    0.3938   -0.08026   -0.416     2000    3    
    3    0.133     -0.04532   0.03341   -0.2866    2000    4    
    4    0.1301     0.1803   -0.4241     0.1482    2000    5    
    5   -0.3804    -0.08935   0.04791    0.08268   2000    6    
    6    0.2211     0.2483   -0.3559    -0.06205   2000    7    
    7   -0.2908     0.07606  -0.2001    -0.05225   2000    8    
    8   -0.07888   -0.3915    0.06442    0.5109    2000    9    
    9    0.3809     0.4037   -0.2431    -0.2892    2000    10  
    10  -0.2765     0.01461  -0.2653     0.1335    2000    11  
    11   0.6983    -0.1167    0.3964     0.03776   2000    12  
    12  -0.2347     0.2333   -0.06508    0.3       2001    1    
    13   0.1751     0.05545  -0.004038   0.06194   2001    2    
    14   0.08792    0.03517   0.05583   -0.4356    2001    3    
    15  -0.08408    0.05567  -0.1705     0.1914    2001    4    
    16   0.008912   0.2317    0.08875    0.08637   2001    5    
    17   0.3304     0.3635    0.005158   0.3588    2001    6    
    18  -0.1324     0.3419   -0.2527    -0.2249    2001    7    
    19   0.0472     0.2552   -0.07589   -0.07992   2001    8    
    20   0.07582   -0.2295    0.1002     0.1077    2001    9    
    21   0.2368     0.02214  -0.3866    -0.2671    2001    10  
    22   0.2116    -0.07872  -0.06399    0.2309    2001    11
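
    Mixing a column name with a key function would look something like this (a sketch, not run against the data above):

    df2.groupby(['A', lambda i: i % 2]).mean()   # group by column A and by row-label parity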

    New: Omitting “nuisance” columns

    It’s pretty common to group by a column and ignore other columns containing non-floating point data. It used to be that you’d get an error, forcing you to first drop the “nuisance” columns. For example, suppose we only want to group by the A column here:

    In [133]: df2
    Out[133]:
        A     B       C        D
    0   foo   one     0.8248   0.3856
    1   bar   one     1.916   -0.5093
    2   foo   two    -1.175   -1.036
    3   bar   three  -0.9447   0.2519
    4   foo   two     0.6239  -1.506
    5   bar   two    -2.315    0.1177
    6   foo   one    -0.1133  -1.349
    7   foo   three  -0.95     1.709

    If you try to compute the mean of the B column, things won’t quite work out:

    In [134]: grouped = df2.groupby('A')['B'].mean()
    TypeError: unsupported operand type(s) for /: 'str' and 'float'

    To cope with this, so-called “nuisance” columns are now automatically dropped. While I don’t like discarding information, I think “practicality beats purity” in this case:

    In [135]: df2.groupby('A').mean()
    Out[135]:
          C        D
    bar  -0.4479  -0.0466
    foo  -0.158   -0.3593

    New: “auto-dispatching” to DataFrame / Series

    Since very few functions in DataFrame and Series/TimeSeries are implemented on the GroupBy classes, you may wish to simply invoke a Series instance method on each group, let’s say:

    In [140]: df.groupby('series_id')['value'].agg(Series.skew)
    Out[140]:
    CEU5500000034    -0.253887792525
    CEU5500000035    0.0938252241898
    CEU5552200034    0.387563548587
    CEU5552200035    0.0868489117242
    CEU5552210034    0.74506367536
    CEU5552210035    0.551290226183
    <snip>

    But since we’ve got Python at our disposal, it’s relatively straightforward to override __getattribute__ to wrap and dispatch calls to Series/DataFrame instance methods so you can do this:

    In [141]: df.groupby('series_id')['value'].skew()
    Out[141]:
    CEU5500000034    -0.253887792525
    CEU5500000035    0.0938252241898
    CEU5552200034    0.387563548587
    CEU5552200035    0.0868489117242
    CEU5552210034    0.74506367536
    <snip>

    In fact, even some of the more complicated methods, like the super-useful describe method, can be used:

    In [142]: df.groupby('series_id')['value'].describe()
    Out[142]:
                         count   mean    std     min    10%     50%     90%     max
    CEU5500000034        279     96.54   8.317   81.7   84      98.6    107.2   112
    CEU5500000035        279     93.88   28.9    50.4   55.28   91.9    134.5   140.2
    CEU5552200034        278     97.95   6.839   86.9   89.47   97.15   108.2   113.9
    CEU5552200035        278     93.58   26.18   58     60.34   89.2    127.4   137.3
    CEU5552210034        278     100.7   3.95    94.1   95.87   99.8    106     113.1
    CEU5552210035        278     98.84   23.86   71.6   73.77   92.6    137.2   145.8
    ...

    For those interested, the internal code to do this dispatching is short and simple, by the way:

    def __getattribute__(self, attr):
        try:
            return object.__getattribute__(self, attr)
        except AttributeError:
            if hasattr(self.obj, attr):
                return self._make_wrapper(attr)
            raise

    def _make_wrapper(self, name):
        f = getattr(self.obj, name)
        if not isinstance(f, types.MethodType):
            return self.aggregate(lambda self: getattr(self, name))

        # return the class reference
        f = getattr(type(self.obj), name)

        def wrapper(*args, **kwargs):
            curried = lambda self: f(self, *args, **kwargs)
            return self.aggregate(curried)

        return wrapper

    Old but not well known: apply multiple functions

    This has always been around but is not well known by many people. Suppose you want to apply multiple functions to a group and collect all the results. You can pass a dict of functions (here, grouped is the df.groupby('series_id')['value'] object from the BLS example above):

    In [144]: grouped.agg({'mean' : np.mean, 'variance' : np.var})
    Out[144]:
                    mean    variance
    CEU5500000034   96.54   68.92  
    CEU5500000035   93.88   832.3  
    CEU5552200034   97.95   46.6    
    CEU5552200035   93.58   683.1  
    CEU5552210034   100.7   15.54  
    CEU5552210035   98.84   567.1  
    CEU5552211034   100.9   13.73  
    CEU5552211035   99.34   622    
    <snip>
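
    For readers who think in SQL, that dict-of-functions call corresponds roughly to the following (illustrative only; the table name is made up):

    # SELECT series_id, AVG(value) AS mean, VARIANCE(value) AS variance
    # FROM bls_data GROUP BY series_id;
    df.groupby('series_id')['value'].agg({'mean' : np.mean, 'variance' : np.var})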

    I have a mind to expand on this idea for DataFrame objects to get us closer to the full flexibility of SQL with Pythonic syntax. If anyone has ideas or suggestions please let me know.

    Conclusions

    GroupBy is certainly not done. If you use these tools and find them useful, please let me know. If you can think of ways to make them better, that would be nice information too. I think it would be great to implement a full SQL engine on top of pandas (similar to the SAS “proc sql”), and this new GroupBy functionality gets us closer to that goal.

    A Roadmap for Rich Scientific Data Structures in Python

    Discussion thread on Hacker News

    So, this post is a bit of a brain dump on rich data structures in Python and what needs to happen in the very near future. I care about them for statistical computing (I want to build a statistical computing environment that trounces R) and financial data analysis (all evidence leads me to believe that Python is the best all-around tool for the finance space). Other people in the scientific Python community want them for numerous other applications: geophysics, neuroscience, etc. It’s really hard to make everyone happy with a single solution. But the current state of affairs has me rather anxious. And I’d like to explain why. For a really quick summary of some of the work I’ve been doing, see my SciPy 2010 talk slides.

    Data structures with metadata, the backstory

    In the wake of SciPy 2011 I’ve been thinking a lot about the way forward from here in terms of building rich Pythonic data structures for statistics and many, many other fields. By rich I mean: it is not just a NumPy ndarray; it contains metadata (however we define metadata) and has operations which depend on that metadata; and in general it does far more than structured arrays currently do for you. This touches on a great many topics and features that people want (partial list):

    • Manipulating heterogeneously-typed (what I loosely call “mixed type”) data
    • Size mutability: can easily add “columns” or otherwise N-1-dimensional hyperslices without necessarily copying data
    • Metadata about what each axis represents: axis=3 is less meaningful than axis=’temperature’
    • Metadata about axis labels (ticks)
    • Label / tick-based data alignment / reshaping, either automatic or explicit
    • Label / tick-based (fancy) indexing, both setting and getting
    • Hierarchical columns
    • Robust handling of missing / NA data (à la R)
    • Tons of operations needing heterogeneous data and metadata: group by, filtering, sorting, selection / querying, reindexing (reshaping to conform to a new set of labels), axis selection based on names, etc. etc. 

    The list goes on and on. I could write a 50-page manuscript on the exact specification of what functionality is desired for each of the above bullet points. What I do know is that after using a rich data structure like the ones in pandas, it’s very, very hard to go back to using vanilla ndarrays. R users coming to Python have a similar experience: the lack of data.frame and all the functions which operate on data.frames is a gigantic loss of functionality. When I work with MATLAB and R users (especially in the finance industry) and get them up and running with pandas, I get a lot of “where was this tool all my life?” It’s just that much better. In fact, users can get by with only a very rudimentary understanding of NumPy if the data structures are good enough; I think this is highly desirable. Even for purely interactive data analysis (forget operations which actually utilize the metadata), isn’t this much better:

    In [4]: data.corr()
    Out[4]:
           AAPL     GOOG     MSFT     YHOO  
    AAPL   1        0.5724   0.4714   0.3447
    GOOG   0.5724   1        0.5231   0.3409
    MSFT   0.4714   0.5231   1        0.3012
    YHOO   0.3447   0.3409   0.3012   1

    than this:

    In [11]: np.corrcoef(data.T)
    Out[11]:
    array([[ 1.    ,  0.5724,  0.4714,  0.3447],
           [ 0.5724,  1.    ,  0.5231,  0.3409],
           [ 0.4714,  0.5231,  1.    ,  0.3012],
           [ 0.3447,  0.3409,  0.3012,  1.    ]])

    Of course if data were a structured ndarray you would be completely up a creek (most NumPy functions do not play well with structured arrays). But that’s another topic.

    But anyway, to the point of why I’m writing: we have a ton of really talented people with real problems to solve, and lots of great ideas about how to solve them. Last year at SciPy 2010 in Austin we had a Birds of a Feather session led by the venerable Fernando Pérez and myself to talk about the datarray project, pandas, tabular, larry, and other various ideas about data structures that people have kicked around. The topic is important enough that Enthought hosted a gathering this past May in Austin, the DataArray Summit, to talk about these issues and figure out where to go from here. It was a great meeting and we hashed out in gory detail many of the problem areas that we’d like to solve with a richer data structure. So that’s a bit of the backstory.

    But even given all these great discussions and work being done, we have a really fundamental problem:

    Fragmentation is killing us

    There, I said it =) All us NumPy ninjas can navigate the fragmented, incohesive collection of data structures and tools, but it’s confusing as hell for new and existing users. NumPy alone is not good enough for a lot of people (statisticians, financial data analysts, etc.), but they’re left with a confusing choice between pandas, larry, datarray, or something else. Also, these tools have largely not been integrated with any other tools because of the community’s collective commitment anxiety. We talk, hem and haw, and wring our hands. And still no integration. I don’t mean to complain: I just deeply care about making the scientific Python stack the most powerful data analysis stack to ever exist. Seriously. And I can’t do it alone. And I don’t want to make unilateral decisions and shove anything down anyone’s throat. We’ve started working on integration of pandas in statsmodels (which is already going to make a really huge difference), but we need to collectively get our proverbial sh*t together. And soon.

    My work on pandas lately and why it matters

    On my end, in the 2 months since the DataArray summit, I decided to put my PhD on hold and focus more seriously on Python development, statistics / statsmodels, pandas, and other projects I care deeply about. So I’ve easily invested more time in pandas in the last 2 months than in the previous 2 years. This included heavily redesigned internals (which I’m extremely pleased with) and tackling various other thorny internal issues which had long been an annoyance for me and an external source of criticism by end users (like, “why are there 2 2-dimensional data structures, DataFrame and DataMatrix?” The answer is that the internal implementation was different, with different performance characteristics depending on use). I’m also trying to clear out my extensive backlog of wishlist features (lots of blog articles to come on these). What’s happened is that I started prototyping a new data structure which I’m calling NDFrame which I think is really going to be a big deal. Basically, the NDFrame is a by-product of the redesigned internal data structure (currently called BlockManager) backing DataFrame and friends. I can and should write an entire post about BlockManager and exactly what it needs to accomplish, but the short story is that BlockManager:

    • Stores an arbitrary collection of homogeneously-typed N-dimensional NumPy ndarray objects (I call these Block objects or simple blocks)
    • Has axis labels and is capable of reshaping the “blocks” to a new set of labels
    • Is capable of consolidating blocks (gluing them together) having the same dtype
    • Knows how to appropriately introduce missing data, and upcast dtypes (int, bool) when the missing data marker (NaN, currently. Crossing my fingers for a good NA implementation!) needs to be introduced
    • Can deal with single “item” type casts (ha, try to do that with structured arrays!)
    • Can accept new blocks without copying data

    Maybe this is too meta (heh) for some people. As illustration, here’s literally what’s going on inside DataFrame now after all my latest hacking:

    In [40]: df
    Out[40]:
        b         a         c       e   f
    0   1.213     1.507     True    1   a
    1  -0.6765    0.06237   True    1   b
    2   0.3126   -0.2575    False   1   c
    3   0.1505    0.2242    True    1   d
    4  -0.7952    0.2909    True    1   e
    5   1.341    -0.9712    False   2   f
    6   0.01121   1.654     True    2   g
    7  -0.173    -1.385     False   2   h
    8   0.1637   -0.898     False   2   i
    9   0.5979   -1.035     False   2   j

    In [41]: df._data
    Out[41]:
    BlockManager
    Items: [b a c e f]
    Axis 1: [0 1 2 3 4 5 6 7 8 9]
    FloatBlock: [b a], 2 x 10, dtype float64
    BoolBlock: [c], 1 x 10, dtype bool
    IntBlock: [e], 1 x 10, dtype int64
    ObjectBlock: [f], 1 x 10, dtype object

    The user is of course never intended to look at this; it’s purely internal. But, for example, things like this are OK to do:

    In [42]: df['e'] = df['e'].astype(float)
    In [43]: df._data
    Out[43]:
    BlockManager
    Items: [b a c e f]
    Axis 1: [0 1 2 3 4 5 6 7 8 9]
    FloatBlock: [b a], 2 x 10, dtype float64
    BoolBlock: [c], 1 x 10, dtype bool
    ObjectBlock: [f], 1 x 10, dtype object
    FloatBlock: [e], 1 x 10, dtype float64

    Since in that case there are now multiple float blocks, they can be explicitly consolidated, but if you use DataFrame many operations will cause it to happen automatically (which is highly desirable, especially when you have only one dtype, for doing row-oriented operations):

    In [44]: df._data.consolidate()
    Out[44]:
    BlockManager
    Items: [b a c e f]
    Axis 1: [0 1 2 3 4 5 6 7 8 9]
    BoolBlock: [c], 1 x 10, dtype bool
    FloatBlock: [b a e], 3 x 10, dtype float64
    ObjectBlock: [f], 1 x 10, dtype object

    Now, this is a pretty abstract business. Here’s my point: when I started thinking about NDFrame, a user-facing n-dimensional data structure backed by BlockManager, I realized that what I am going to build is a nearly strict superset of the functionality provided by every other rich data structure I know of. I made a picture of the feature overlap, and note that arrows loosely mean: “can be used to implement”:

    [Figure: Data Structure Features Venn Diagram]

    For example, I need to write generic fancy indexing on the NDFrame, a task largely tackled by DataArray. So rather than reinvent the wheel, I should just co-opt that code (love that BSD license), but then I’ve effectively created a fork (nooooo!). I think having all these different libraries (and leaving users the confusing choice between them) is kind of nuts. Ideally DataArray (homogeneous) should just be a part of pandas (and I’m not opposed to changing the name, though it has stronger branding and far more users than datarray or larry). But once we’ve gone down that route, larry is just a DataArray (homogeneous) with automatic data alignment. We’re all doing the exact same kinds of things. So why not have one library?

    Remember: the real world is heterogeneously-typed

    Some other casual (and potentially controversial) observations

    Here’s just a dumping ground of various thoughts I have on this and related topics:

    • The NumPy type hierarchy (int8, int16, int32, int64, uint8, uint16, …) isn’t that important to me. R and MATLAB don’t really have a type hierarchy and it doesn’t seem to pose a problem. So I haven’t gone down the road of creating Block objects mapping onto the NumPy type hierarchy. If someone wants to do it without complicating the end-user pandas experience, be my guest
    • Tying yourself to the ndarray is too restrictive. This is a major problem with DataArray and larry; they don’t do mixed-type data. So if you want to build something that competes with R you have failed before you have even begun by using a homogeneous-only data structure. Remember, DataFrame can be homogeneous whenever it wants to and getting the underlying ndarray is just a few keystrokes.
    • Structured arrays are probably not the answer. Size mutability (ability to add columns) and the ability to change dtypes are actually a big deal. As is the ability to do row-oriented computations, broadcasting, and all the things that structured arrays are (currently) unable to do. They are a darned convenient way of serializing / deserializing data, though. But memory layout and all that is far, far less important to me than usability / user interface / experience