Mastering high performance data algorithms I: Group By

I’m on my way back from R/Finance 2012. Those guys did a nice job of organizing the conference, and it was great to meet everyone there.

As part of pandas development, I have had to develop a suite of high performance data algorithms and implementation strategies which are the heart and soul of the library. I get asked a lot why pandas’s performance is much better than R and other data manipulation tools. The reasons are two-fold: careful implementation (in Cython and C, minimizing the computational friction) and carefully-thought-out algorithms.

Here are some of the more important tools and ideas that I make use of on a day-to-day basis:

  • Hash tables: I use klib. It’s awesome and easy to use. I rely on it for critical algorithms like unique and factorize (unique + integer label assignment)
  • O(n) sort, known in pandas as groupsort, on integers with known range. This is a variant of counting sort; if you have N integers with a known range from 0 to K – 1, they can be sorted in O(N) time. Combining this tool with factorize (hash table-based), you can categorize and sort a large data set in linear time. Failure to take advantage of these two algorithms will force you to pay O(N log N), dominating the runtime of your algorithm.
  • Vectorized data movement and subsetting routines: take, put, putmask, replace, etc.
  • Let me give you a prime example from a commit yesterday of me applying these ideas to great effect. Suppose I had a time series (or DataFrame containing time series) that I want to group by year, month, and day:

    In [6]: rng = date_range('1/1/2000', periods=20, freq='4h')
    In [7]: ts = Series(np.random.randn(len(rng)), index=rng)
    In [8]: ts
    Out[8]:
    2000-01-01 00:00:00 -0.891761
    2000-01-01 04:00:00 0.204853
    2000-01-01 08:00:00 0.690581
    2000-01-01 12:00:00 0.454010
    2000-01-01 16:00:00 -0.123102
    2000-01-01 20:00:00 0.300111
    2000-01-02 00:00:00 -1.052215
    2000-01-02 04:00:00 0.094484
    2000-01-02 08:00:00 0.318417
    2000-01-02 12:00:00 0.779984
    2000-01-02 16:00:00 -1.514042
    2000-01-02 20:00:00 2.550011
    2000-01-03 00:00:00 0.983423
    2000-01-03 04:00:00 -0.710861
    2000-01-03 08:00:00 -1.350554
    2000-01-03 12:00:00 -0.464388
    2000-01-03 16:00:00 0.817372
    2000-01-03 20:00:00 1.057514
    2000-01-04 00:00:00 0.743033
    2000-01-04 04:00:00 0.925849
    Freq: 4H
    In [9]: by = lambda x: lambda y: getattr(y, x)
    In [10]: ts.groupby([by('year'), by('month'), by('day')]).mean()
    Out[10]:
    2000  1  1    0.105782
             2    0.196106
             3    0.055418
             4    0.834441

    This is a small dataset, but imagine you have millions of observations and thousands or even millions of groups. How does that look algorithmically? I guarantee if you take a naive approach, you will crash and burn when the data increases in size. I know, because I did just that (take a look at the vbenchmarks). Laying down the infrastructure for doing a better job is not simple. Here are the steps for efficiently aggregating data like this:

  • Factorize the group keys, computing integer label arrays (O(N) per key)
  • Compute a “cartesian product number” or group index for each observation (since you could theoretically have K_1 * … * K_p groups observed, where K_i is the number of unique values in key i). This is again O(n) work.
  • If the maximum number of groups is large, “compress” the group index by running the factorize algorithm on it again. Imagine you have 1000 uniques per key and 3 keys; most likely you do not actually observe 1 billion key combinations but rather some much smaller number. O(n) work again
  • For simple aggregations, like mean, go ahead and aggregate the data in one linear sweep without moving anything around.
  • Otherwise, for arbitrary user-defined aggregations, we have two options for chopping up the data set: sorting it (O(N) using groupsort!) or create index arrays indicating which observations belong to each group. The latter would be preferable in many cases to sorting a large data set (reordering the data, even though linear time, can be very costly).
  • I worked on speeding up the latter part of this last bullet point yesterday. The resulting code looked like this:

    def _get_indices_dict(label_list, keys):
        # Accepts factorized labels and unique key values
        shape = [len(x) for x in keys]

        # Compute group index
        group_index = get_group_index(label_list, shape)
        sorter, _ = lib.groupsort_indexer(com._ensure_int64(group_index),
                                          np.prod(shape))

        # Reorder key labels and group index
        sorted_labels = [lab.take(sorter) for lab in label_list]
        group_index = group_index.take(sorter)

        # Compute dict of {group tuple -> location NumPy array for group}
        index = np.arange(len(group_index)).take(sorter)
        return lib.indices_fast(index, group_index, keys, sorted_labels)

    The details of lib.indices_fast aren’t that interesting; it chops up np.arange(len(group_index)).take(sorter), the sorted indirect indices, to produce the index dictionary. Running %lprun to get a line profiling on a large-ish data set:

    In [11]: rng = date_range('1/1/2000', '12/31/2005', freq='H')
    In [12]: year, month, day = rng.year, rng.month, rng.day
    In [13]: ts = Series(np.random.randn(len(rng)), index=rng)
    In [14]: lprun -f gp._get_indices_dict for i in xrange(100): ts.groupby([year, month, day]).indices
    Timer unit: 1e-06 s
    File: pandas/core/groupby.py
    Function: _get_indices_dict at line 1975
    Total time: 0.628506 s
    Line # Hits Time Per Hit % Time Line Contents
    ==============================================================
    1975 def _get_indices_dict(label_list, keys):
    1976 400 695 1.7 0.1 shape = [len(x) for x in keys]
    1977 100 114388 1143.9 18.2 group_index = get_group_index(label_list, shape)
    1978
    1979 100 320 3.2 0.1 sorter, _ = lib.groupsort_indexer(com._ensure_int64(group_index),
    1980 100 62007 620.1 9.9 np.prod(shape))
    1981
    1982 400 53530 133.8 8.5 sorted_labels = [lab.take(sorter) for lab in label_list]
    1983 100 19516 195.2 3.1 group_index = group_index.take(sorter)
    1984 100 20189 201.9 3.2 index = np.arange(len(group_index)).take(sorter)
    1985
    1986 100 357861 3578.6 56.9 return lib.indices_fast(index, group_index, keys, sorted_labels)
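
The first few steps in the list above (factorize each key, compute the cartesian-product group index, compress it) can be sketched in plain Python/NumPy. This is an illustrative toy reimplementation, not the actual pandas internals; the factorize and get_group_index here are simplified stand-ins for the real (Cython/hash-table) versions:

```python
import numpy as np

def factorize(values):
    # Dict-based (hash table) factorize: O(n) unique + integer label assignment
    table = {}
    labels = np.empty(len(values), dtype=np.int64)
    uniques = []
    for i, v in enumerate(values):
        if v not in table:
            table[v] = len(uniques)
            uniques.append(v)
        labels[i] = table[v]
    return labels, uniques

def get_group_index(label_list, shape):
    # Fuse p label arrays into one integer id per observation, as if
    # enumerating the cartesian product of the keys' unique values
    group_index = np.zeros(len(label_list[0]), dtype=np.int64)
    stride = 1
    for labels, size in zip(reversed(label_list), reversed(shape)):
        group_index += labels * stride
        stride *= size
    return group_index

# Two keys with 2 and 3 unique values -> up to 2 * 3 = 6 possible groups
labels1, uniques1 = factorize(['a', 'a', 'b', 'b'])
labels2, uniques2 = factorize(['x', 'y', 'x', 'z'])
group_index = get_group_index([labels1, labels2],
                              [len(uniques1), len(uniques2)])
# group_index -> [0, 1, 3, 5]; only 4 of the 6 combinations are observed,
# so factorize the group index itself to compress to dense group ids
comp_ids, obs_group_ids = factorize(group_index)
# comp_ids -> [0, 1, 2, 3]
```

Each pass here is a single O(n) sweep, which is the whole point: no sorting, no tuple construction, just integer label arrays.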

    You might say: well, this seems like a lot of work, so why not just zip the keys (forming an array of Python tuples) and use a naive algorithm? Being careful in this way and working with indirect integer index arrays ends up being an order of magnitude or more faster.
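
The groupsort trick mentioned in the bullet points, a counting-sort variant that produces a sorted indexer over integer labels with known range, can be sketched like this. The real lib.groupsort_indexer is Cython; this toy version assumes non-negative labels (it ignores the NA label handling):

```python
import numpy as np

def groupsort_indexer(labels, ngroups):
    # Counting sort: O(n + ngroups). Returns an indexer that sorts the
    # observations by group label, plus the per-group counts.
    counts = np.zeros(ngroups + 1, dtype=np.int64)
    for lab in labels:
        counts[lab + 1] += 1
    # where[g] is the next write position for group g
    where = np.cumsum(counts[:-1])
    indexer = np.empty(len(labels), dtype=np.int64)
    for i, lab in enumerate(labels):
        indexer[where[lab]] = i
        where[lab] += 1
    return indexer, counts[1:]

labels = np.array([2, 0, 1, 2, 0], dtype=np.int64)
indexer, counts = groupsort_indexer(labels, 3)
# labels.take(indexer) -> [0, 0, 1, 2, 2]
```

With the indexer in hand, taking the labels (or the data itself) through it lays each group out contiguously in one linear pass, with no O(N log N) comparison sort anywhere.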

    Anyway, in conclusion, it’s these kinds of algorithms and ideas that make pandas perhaps the best-performing open-source data analysis toolkit for in-memory data (I’m going to get to out-of-core data processing and “big data” eventually, just hang tight). It goes beyond language features and data structure internals (though these naturally also have a major impact; a lot of the things I do are easy to express in Python but would be very awkward or impossible to do in, say, R. Maybe I should write a whole article on this.); carefully thought-out algorithms are a major piece of the puzzle.

    An O(n log n) NA-friendly time series “as of” using array operations

    In time series data, it’s fairly common to need to compute the last known value “as of” a particular date. However, missing data is the norm, so it’s a touch more complicated than doing a simple binary search. Here is an implementation using array operations that takes these things into account:

    def asof_locs(stamps, where, mask):
        """
        Parameters
        ----------
        stamps : array of timestamps
        where : array of timestamps
            Values to determine the "as of" for
        mask : array of booleans where data is not NA

        Returns
        -------
        locs : array of ints
            Locations of each "as of" value, -1 for NA
        """
        locs = stamps[mask].searchsorted(where, side='right')
        locs = np.where(locs > 0, locs - 1, 0)
        result = np.arange(len(stamps))[mask].take(locs)
        first = mask.argmax()
        result[(locs == 0) & (where < stamps[first])] = -1
        return result
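
As a quick sanity check, here is the routine restated as a self-contained snippet and run on a tiny hypothetical example with plain datetime64 arrays (note the pre-first-timestamp case maps to 0 so that the final correction line can flag it as -1):

```python
import numpy as np

def asof_locs(stamps, where, mask):
    # Last non-NA location at or before each `where`; -1 if none exists
    locs = stamps[mask].searchsorted(where, side='right')
    locs = np.where(locs > 0, locs - 1, 0)
    result = np.arange(len(stamps))[mask].take(locs)
    first = mask.argmax()
    result[(locs == 0) & (where < stamps[first])] = -1
    return result

stamps = np.array(['2000-01-01', '2000-01-02', '2000-01-03',
                   '2000-01-04', '2000-01-05'], dtype='datetime64[D]')
mask = np.array([True, False, True, True, False])  # positions 1, 4 are NA
where = np.array(['1999-12-31', '2000-01-02', '2000-01-04'],
                 dtype='datetime64[D]')

asof_locs(stamps, where, mask)  # -> array([-1, 0, 3])
```

Since position 1 is masked out as NA, the “as of” for 2000-01-02 falls back to location 0, and a query before the first timestamp returns -1.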

    Some algorithmic notes. First, let’s run this through line_profiler on a large time series to see where time is being taken up:

    In [9]: rng = date_range('1/1/2000', '6/1/2000', freq='10s')
    In [10]: s = Series(np.random.randn(len(rng)), index=rng)
    In [11]: rng
    Out[11]:
    <class 'pandas.tseries.index.DatetimeIndex'>
    [2000-01-01 00:00:00, ..., 2000-06-01 00:00:00]
    Length: 1313281, Freq: 10S, Timezone: None
    In [12]: s = s.resample('s')
    In [13]: len(s)
    Out[13]: 13132801
    In [24]: mask = notnull(s)
    In [25]: lprun -f asof_locs asof_locs(s.index, s.index[5::5], mask)
    Timer unit: 1e-06 s
    File: foo.py
    Function: asof_locs at line 3
    Total time: 0.793543 s
    Line # % Time Line Contents
    ==============================================================
    3 def asof_locs(stamps, where, mask):
    4 """
    5 Parameters
    6 ----------
    7 stamps : array of timestamps
    8 where : array of timestamps
    9 Values to determine the "as of" for
    10 mask : array of booleans where data is NA
    11
    12 Returns
    13 -------
    14 locs : array of ints
    15 Locations of each "as of" value, -1 for NA
    16 """
    17 63.3 locs = stamps[mask].searchsorted(where, side='right')
    18 8.3 locs = np.where(locs > 0, locs - 1, -1)
    19
    20 23.3 result = np.arange(len(stamps))[mask].take(locs)
    21
    22 1.6 first = mask.argmax()
    23 3.6 result[(locs == 0) & (where < stamps[first])] = -1
    24
    25 0.0 return result

    The main trickiness here is this step:

    result = np.arange(len(stamps))[mask].take(locs)

    Since the indices returned by searchsorted are relative to the filtered values, you need to remap them to what would have been the indices in the original array, and this does exactly that. Lastly, you might be interested in breaking up stamps[mask].searchsorted(where, side='right') to see how much time is spent in stamps[mask] versus searchsorted:

    In [28]: timeit stamps[mask]
    10 loops, best of 3: 127 ms per loop
    In [29]: filt = stamps[mask]
    In [31]: timeit filt.searchsorted(where)
    1 loops, best of 3: 345 ms per loop

    So, this could definitely be sped up by a factor of 2 or more by doing the whole operation in C or Cython, but it’s nice to be able to express it concisely and speedily with array operations.

    Why I’m not on the Julia bandwagon (yet)

    EDIT (2013-07-11): The following article may no longer be accurate, since it’s been more than a year. I recommend running your own benchmarks on array expressions with the latest Julia builds and drawing your own conclusions.

    I apologize in advance for writing what might be the first neutral to negative article about Julia that you’ve read. For those of you who don’t know, Julia is a new dynamic language with an LLVM-based JIT compiler designed for technical computing. Its intent is to try to bridge the gap between high level languages and low level ones: giving you the power, flexibility, and general-purpose usefulness of a Lisp or Python-like language, but with low-level performance. It has MATLAB-like syntax but also a lot of the features of Lisp (macros, metaprogramming) and Python that we’ve all come to love. Before anything else, I’d like to say that I’m very impressed with the project, the team, and overall their technology choices.

    Many people are very excited about Julia, and one of the main reasons is a number of very impressive benchmarks. This is great, but the benchmarks are in my opinion fairly misleading and don’t represent the reality of a lot of technical computing.

    I compiled Julia in advance of Stefan Karpinski’s (one of the creators) talk in New York this week. I ran a few benchmarks and then proceeded to raise hell on the Julia mailing list.

    TL;DR: if you’re an R, MATLAB, or scientific Python user, it’s probably not time to start programming in Julia if your goal is the highest-performance code for minimum effort. That may change in a couple of years as the JIT compiler improves. However, Julia may soon be a reasonable free computational alternative to MATLAB. I say this even in a perfect world where Julia had all the libraries that Python and R have; as a sole language for building large scientific production applications in industry right now, I think Python + Cython is pretty much the only game in town.

    Why am I complaining about the benchmarks? Well, one of the first things I did was look at the performance of array operations. In scientific and technical computing, we work a lot with arrays, right? In fact, conventional wisdom is “don’t write loops”. Yet many people have told me that in Julia I should be unrolling array operations into loops, and that this is the way to get the best performance. Keep in mind that I am a performance-obsessed tool maker, and in libraries like pandas I am in control of the vast majority of the areas of people’s applications that would otherwise be bottlenecks.

    Here’s a simple benchmark: compute the inner product of two arrays, in other words, the sum of the element-wise product. Julia code (I disabled the garbage collector to remove its effect from the equation):

    x = randn(10000000)
    y = randn(10000000)

    function inner(x, y)
        result = 0.0
        for i=1:length(x)
            result += x[i] * y[i]
        end
        result
    end

    gc_disable()

    tic()
    for i=1:50 result = sum(x .* y); end
    timing = toc();
    println("Array operations took ", (timing / 50) * 1000, " ms")

    tic()
    timing = @time for i=1:50 result = inner(x, y); end
    timing = toc();
    println("inner took ", (timing / 50) * 1000, " ms")

    And the result:

    11:28 ~/code/repos/julia (master)$ ./julia bench.jl
    Array operations took 104.71534252166748 ms
    inner took 36.00101947784424ms

    Lastly, you might recognize that an array inner product can be expressed as a matrix operation and carried out by BLAS in Fortran:

    x = randn(1, 10000000)
    y = randn(10000000, 1)
    tic()
    timing = @time for i=1:50 result = x * y; end
    timing = toc();
    println("BLAS took ", (timing / 50) * 1000, "ms")
    => BLAS took 50.72967529296875ms

    Most of you know me as a Python hacker, so I’ll run the same array benchmark with NumPy:

    In [9]: x = randn(10000000)
    In [10]: y = randn(10000000)
    In [11]: timeit (x * y).sum()
    10 loops, best of 3: 57.8 ms per loop

    OK, so NumPy is faster than Julia for array operations, temporary array and all. But how much faster can you go by eliminating temporaries like Julia’s inner function? Here is Cython code:

    def inner(ndarray[float64_t] x, ndarray[float64_t] y):
        cdef Py_ssize_t i, n = len(x)
        cdef float64_t result = 0
        for i in range(n):
            result += x[i] * y[i]
        return result

    And the timing:

    In [1]: x = randn(10000000)
    In [2]: y = randn(10000000)
    In [3]: from pandas._sandbox import inner
    In [4]: inner(x, y)
    Out[4]: -3117.5608515640943
    In [5]: timeit inner(x, y)
    100 loops, best of 3: 14.5 ms per loop

    So Julia is about 2.5 times slower than Cython in this trivial benchmark (straight C wouldn’t be that much faster). You can do even better by using the Intel MKL in NumPy:

    In [7]: timeit np.dot(x, y)
    100 loops, best of 3: 9.43 ms per loop

    So anyway, Julia is not a magic solution. It _is_ pretty good, though, especially if you never want to have to compile code. Having to unroll array expressions into loops is still not good, and I don’t think it’s a habit that should be encouraged. And the JIT will get better over time. The Julia devs are, I believe, spending their time more fruitfully on language design, libraries and features, and “big science” topics like distributed computing and parallel processing, which I think will be very impactful. But if you’re looking for a high performance alternative to R for implementing scientific algorithms with low effort, the sweet spot (at least in my opinion) between performance and fast development time is Cython. But if a perfect marriage of Python, Julia, and Cython is made, perhaps we can all have our cake and eat it too.