Category Archives: pandas

Easy, high performance time zone handling in pandas 0.8.0

Making time zone handling palatable is surprisingly difficult to get right. The generally agreed-upon “best practice” for storing timestamps is to use UTC. Otherwise, you have to worry about daylight saving time ambiguities or non-existent times. The misery of time zone handling is well documented, and it was summarized nicely last year by Armin Ronacher. When you work in UTC, most of your troubles go away; converting a single timestamp or array of timestamps between time zones becomes in essence a “free” operation, since the time zone is simply metadata for the box containing the invariant UTC timestamp.

But it’s not all fun and games. The Python datetime API in this area is generally considered to be severely lacking. It’s so bad that 77-line modules with half a dozen convenience functions can get 245 watchers on GitHub. I often write that much code before I finish my first cup of coffee in the morning :) But for most applications you can suffer through the API and use pytz, which is an adequate solution in most cases. pytz notably ships the Olson time zone database, which is the key piece of information powering time zone conversions.

But what about pandas? Among other things, pandas is really good for time series data, including very large time series data in the millions of observations. I don’t want to make pandas users suffer because of Python’s datetime API, so I’m happy to provide a better one (a bit more on this later). The biggest issue is: as with many non-scientific Python libraries, pytz and other tools have a fatal illness known as TMP, a.k.a. Too Much (pure) Python. Let me explain:
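
Here is roughly the sort of measurement I am talking about, timing pytz’s tz.localize on a single naive datetime (exact numbers will of course vary by machine):

import timeit
from datetime import datetime

import pytz

tz = pytz.timezone('US/Eastern')
stamp = datetime(2012, 6, 1, 12, 30)

# time many calls and report the per-call cost in microseconds
n = 100000
elapsed = timeit.timeit(lambda: tz.localize(stamp), number=n)
print('%.1f microseconds per tz.localize call' % (elapsed / n * 1e6))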

So, localizing a single datetime.datetime value takes 33 microseconds, or ~33 seconds per million timestamps. tz.localize serves a couple of important but annoying functions: checking for ambiguities (“fall back”) and non-existent times (“spring forward”) at DST transition times.

Now, one major problem that I found while examining pytz code is how many temporary datetime.datetime objects are created during a single call to tz.localize. How many do you think?

15. 15 temporary datetime.datetime objects

Don’t believe me? Look for yourself. Just following what is going on inside the function is enough to make your head hurt. The code is vastly complicated by the fact that tz-aware datetimes are not comparable with tz-naive datetimes.

Obviously, there must be a better and faster way. Some might argue that I should improve pytz, but the problem is that the implementation of time zone logic is dependent on the representation of the timestamps. Over the last few months I have stopped using datetime.datetime in pandas in favor of 64-bit integer timestamps via NumPy’s datetime64 data type. Storing large arrays of datetime.datetime values is disastrously inefficient in terms of memory and performance in all time series operations. Obviously I can’t force this design decision on most Python programmers who are not engaged in high-performance data analysis work.

Making time zone handling fast and easy

So, here are my requirements for pandas’s time zone capabilities:

  • All operations must be vectorized and be as fast as possible on large arrays of irregular, not necessarily ordered 64-bit timestamps
  • API must be as simple and non-crappy as possible without sacrificing functionality.

    pandas 0.8.0 has a new Timestamp data type, a subclass of datetime.datetime that provides nanosecond resolution support and, in my opinion, a strictly superior interface for working with dates and times:

    Timestamps can be created as local or converted to local using tz_localize. Conversions from one time zone to another use tz_convert:
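
    For example, a minimal sketch:

    from pandas import Timestamp

    stamp = Timestamp('2012-06-01 09:30')          # naive
    eastern = stamp.tz_localize('US/Eastern')      # interpret as New York wall time
    london = eastern.tz_convert('Europe/London')   # same instant, London wall time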

    Wonder what time it is right now in London (it’s 8:50 PM in New York as I type this)?
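
    With a tz-aware Timestamp it’s one conversion away (the New York time below is the one quoted in the text; the date is just for illustration):

    from pandas import Timestamp

    now_nyc = Timestamp('2012-06-15 20:50', tz='US/Eastern')
    now_nyc.tz_convert('Europe/London')   # 2012-06-16 01:50 BST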

    So that’s nice. Compared with datetime.datetime, Timestamp doesn’t get in your way as much. Timestamps are equal if and only if their UTC timestamps are equal:
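
    A minimal illustration:

    from pandas import Timestamp

    a = Timestamp('2012-06-01 12:00', tz='US/Eastern')     # 16:00 UTC
    b = Timestamp('2012-06-01 17:00', tz='Europe/London')  # also 16:00 UTC
    a == b   # True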

    This makes sense, because they refer to the same moment in time. Also, adding timedeltas will do the right thing around DST transitions:
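
    For example, crossing the 2012 “fall back” transition in New York (a sketch; note the UTC offset flips from -04:00 to -05:00):

    from datetime import timedelta
    from pandas import Timestamp

    stamp = Timestamp('2012-11-04 00:30', tz='US/Eastern')   # EDT, UTC-04:00
    stamp + timedelta(hours=3)   # three absolute hours later: 02:30 EST, UTC-05:00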

    OK, great. Scalar operations. I could have done all this with pytz. I’m really interested in vector operations on large time series.

    Localizing all of 1.8 million timestamps (without taking advantage of the fact that this range is regular and lacks any DST transitions, which you cannot assume in the general, irregular case) would have taken about 1 full minute if we were working with pytz and datetime.datetime objects. Here it takes about 390 ms using a vectorized Cython routine of my devising:
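
    Something along these lines reproduces the setup (1.8 million naive stamps over a summer span with no DST transition; the 390 ms figure is from 2012 hardware):

    from pandas import date_range

    rng = date_range('2012-06-01', periods=1800000, freq='s')   # naive timestamps
    rng_eastern = rng.tz_localize('US/Eastern')                 # vectorized localization
    # in IPython: %timeit rng.tz_localize('US/Eastern')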

    What’s nice about working in UTC is that time zone conversions are now nearly free and do not copy any data (the DatetimeIndex is immutable):
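
    Continuing with rng_eastern from the sketch above:

    rng_utc = rng_eastern.tz_convert('UTC')               # no data copied
    rng_berlin = rng_eastern.tz_convert('Europe/Berlin')  # same underlying UTC values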

    Scalar values are converted to Timestamp objects with the right hour, minute, second:
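
    Indexing into the converted indexes from the sketch above:

    rng_eastern[0]   # Timestamp('2012-06-01 00:00:00-0400', tz='US/Eastern')
    rng_berlin[0]    # the same instant: Timestamp('2012-06-01 06:00:00+0200', tz='Europe/Berlin')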

    Anyway, this is just a flavor of some of the things you can do in the almost-released version of pandas. Lots more easy-to-use and high-performance data analysis tooling to come.

    Mastering high performance data algorithms I: Group By

    I’m on my way back from R/Finance 2012. Those guys did a nice job of organizing the conference, and it was great to meet everyone there.

    As part of pandas development, I have had to develop a suite of high performance data algorithms and implementation strategies which are the heart and soul of the library. I get asked a lot why pandas’s performance is much better than R and other data manipulation tools. The reasons are two-fold: careful implementation (in Cython and C, minimizing computational friction) and carefully thought-out algorithms.

    Here are some of the more important tools and ideas that I make use of on a day-to-day basis:

  • Hash tables: I use klib. It’s awesome and easy to use. It powers critical algorithms like unique and factorize (unique + integer label assignment)
  • O(n) sort, known in pandas as groupsort, on integers with known range. This is a variant of counting sort; if you have N integers with known range from 0 to K – 1, this can be sorted in O(N) time. Combining this tool with factorize (hash table-based), you can categorize and sort a large data set in linear time. Failure to understand these two algorithms will force you to pay O(N log N), dominating the runtime of your algorithm.
  • Vectorized data movement and subsetting routines: take, put, putmask, replace, etc.

    Let me give you a prime example from a commit yesterday of me applying these ideas to great effect. Suppose I had a time series (or DataFrame containing time series) that I want to group by year, month, and day:
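
    For concreteness, a sketch with made-up hourly data:

    import numpy as np
    from pandas import Series, date_range

    rng = date_range('2005-01-01', periods=5000, freq='h')
    ts = Series(np.random.randn(len(rng)), index=rng)

    # group by the (year, month, day) of each timestamp and aggregate
    result = ts.groupby([ts.index.year, ts.index.month, ts.index.day]).mean()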

    This is a small dataset, but imagine you have millions of observations and thousands or even millions of groups. How does that look algorithmically? I guarantee if you take a naive approach, you will crash and burn when the data increases in size. I know, because I did just that (take a look at the vbenchmarks). Laying down the infrastructure for doing a better job is not simple. Here are the steps for efficiently aggregating data like this:

  • Factorize the group keys, computing integer label arrays (O(N) per key)
  • Compute a “cartesian product number” or group index for each observation (since you could theoretically have K_1 * … * K_p groups observed, where K_i is the number of unique values in key i). This is again O(N) work.
  • If the maximum number of groups is large, “compress” the group index by running the factorize algorithm on it again. Imagine you have 1000 uniques per key and 3 keys; most likely you do not actually observe 1 billion key combinations, but rather some much smaller number. This is O(N) work again.
  • For simple aggregations, like mean, go ahead and aggregate the data in one linear sweep without moving anything around.
  • Otherwise, for arbitrary user-defined aggregations, we have two options for chopping up the data set: sorting it (O(N) using groupsort!) or creating index arrays indicating which observations belong to each group. The latter is preferable in many cases to sorting a large data set (reordering the data, even though linear time, can be very costly).

    I worked on speeding up the latter option (building the per-group index arrays) yesterday. The resulting code looked something like this:
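
    The committed version lives in Cython, but here is a rough pure-NumPy rendering of the strategy in the steps above (pd.factorize standing in for the hash-table factorize, np.argsort for groupsort):

    import numpy as np
    import pandas as pd

    def group_indices(*keys):
        # 1. factorize each key: integer labels plus the number of uniques (O(N) per key)
        labels, shape = [], []
        for key in keys:
            codes, uniques = pd.factorize(key)
            labels.append(codes)
            shape.append(len(uniques))

        # 2. "cartesian product number": a single integer group id per observation
        group_index = np.zeros(len(labels[0]), dtype=np.int64)
        stride = 1
        for codes, k in zip(reversed(labels), reversed(shape)):
            group_index += codes * stride
            stride *= k

        # 3. compress the ids, since most of the K_1 * ... * K_p combinations
        #    are usually never observed
        comp_ids, obs_ids = pd.factorize(group_index)
        ngroups = len(obs_ids)

        # 4. counting-sort-style pass: bucket row numbers by group id (O(N + K))
        counts = np.bincount(comp_ids, minlength=ngroups)
        sorter = np.argsort(comp_ids, kind='stable')
        bounds = np.concatenate(([0], counts.cumsum()))
        return {obs_ids[i]: sorter[bounds[i]:bounds[i + 1]]
                for i in range(ngroups)}

    The returned dict maps each observed composite group id to the integer locations of its rows.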

    The details of lib.indices_fast aren’t that interesting; it chops up np.arange(len(group_index)).take(sorter), the sorted indirect indices, to produce the index dictionary. I ran %lprun to get a line-by-line profile of this on a large-ish data set.

    You might say, well, this seems like a lot of work; maybe we should just zip the keys (forming an array of Python tuples) and use a dumb algorithm? It turns out that being careful in this way and working with indirect integer index arrays is an order of magnitude or more faster.

    Anyway, in conclusion, it’s these kinds of algorithms and ideas that make pandas perhaps the best-performing open-source data analysis toolkit for in-memory data (I’m going to get to out-of-core data processing and “big data” eventually, just hang tight). It goes beyond language features and data structure internals, though those naturally have a major impact; a lot of the things I do are easy to express in Python but would be very awkward or impossible to do in, say, R (maybe I should write a whole article on this). Carefully thought-out algorithms are a major piece of the puzzle.

    vbench Lightning Talk Slides from PyCon 2012

    I prepared this lightning talk for PyCon on vbench, my performance monitoring tool, but didn’t have a chance to give it. If I get energetic I might make a screencast of it, but here are the slides:

    Contingency tables and cross-tabulations in pandas

    Someone recently asked me about creating cross-tabulations and contingency tables using pandas. I wrote a bit about this in October after implementing the pivot_table function for DataFrame. So I thought I would give a few more examples and show R code vs. the equivalent pandas code which will be helpful for those already used to the R way of doing things.

    When calling help(xtabs) (or help(table)) in R, you get a number of examples for using table, xtabs, and ftable:

    Here are the same examples using pandas:
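
    The datasets from R’s help pages aren’t reproduced here, but the basic pattern with pandas looks like this (made-up data):

    from pandas import DataFrame, crosstab

    df = DataFrame({'gender': ['male', 'female', 'male', 'female', 'male'],
                    'handedness': ['right', 'right', 'left', 'right', 'left']})

    table = crosstab(df['gender'], df['handedness'], margins=True)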

    One of the really great things about pandas is that the object produced is still a DataFrame (the R object is of a special class)! So we can do anything with it that we can do with a normal DataFrame:
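
    For example, turning the counts into row proportions (continuing the sketch above):

    counts = crosstab(df['gender'], df['handedness'])
    counts.div(counts.sum(axis=1), axis=0)   # each row now sums to 1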

    Here is another fun example:
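
    Along similar lines, pivot_table produces multi-way tables with an arbitrary aggregation function (again with made-up data):

    import numpy as np
    from pandas import DataFrame, pivot_table

    df2 = DataFrame({'a': np.random.choice(['x', 'y'], 100),
                     'b': np.random.choice(['p', 'q', 'r'], 100),
                     'values': np.random.randn(100)})

    pivot_table(df2, values='values', index='a', columns='b', aggfunc='mean')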

    Here is the equivalent R code:

    As soon as we add a formula framework to statsmodels I could see incorporating that into this kind of function in Python.

    NYCPython 1/10/2012: A look inside pandas design and development

    I had the privilege of speaking last night at the NYCPython meetup group. I’ve given tons of “use pandas!” talks so I thought I would take a slightly different angle and talk about some of the design and implementation work that I’ve done for getting good performance in critical data manipulations. I’ll turn some of this material into some blog articles in the near future.

    Wes McKinney: pandas design and development from Adam Klein on Vimeo.

    Here’s some more video footage shot by my awesome friend Emily Paup with her HDSLR.

    I did a little interactive demo (using the ever-amazing IPython HTML Notebook) on Ashley Williams’s Food Nutrient JSON Database:

    Link to PDF output of Demo

    Link to IPython Notebook file

    If you want to run the code in the IPython notebook, you’ll have to download the food database file above.

    The audience and I learned from this demo that if you’re after Tryptophan, “Sea lion, Steller, meat with fat (Alaska Native)” is where to get it (in the highest density).

    Some pandas Database Join (merge) Benchmarks vs. R base::merge

    Over the last week I have completely retooled pandas’s “database” join infrastructure / algorithms in order to support the full gamut of SQL-style many-to-many merges (pandas has had one-to-one and many-to-one joins for a long time). I was curious about the performance with reasonably large data sets as compared with base::merge.data.frame, which I’ve used many times in R. So I just ran a little benchmark joining a 100000-row DataFrame with a 10000-row DataFrame on two keys, with about 10000 unique key combinations overall, simulating a somewhat large SQL join.

    Note this new functionality will be shipping in the upcoming 0.7.0 release (!).

    There is a major factor affecting performance of the algorithms in pandas, namely whether the result data needs to be sorted lexicographically (which is the default behavior) by the join columns. R also offers the same option so it’s completely apples to apples. These are mean runtimes in seconds:


    Sort by key columns

             pandas       R     Ratio
    inner   0.07819  0.2286     2.924
    outer   0.09013  1.284     14.25
    left    0.0853   0.7766     9.104
    right   0.08112  0.3371     4.156


    Don’t sort by key columns

             pandas       R     Ratio
    inner   0.02808  0.2297     8.18
    outer   0.03879  1.181     30.45
    left    0.0365   0.6706    18.37
    right   0.03021  0.2995     9.912

    As you can see, the sorting time in pandas completely dominates the runtime of the join (sorting 10000 strings is a lot of string comparisons). This isn’t really an indictment of R’s merge algorithm, but rather speaks to the strength of the merge strategy I devised in pandas. After spending a few days working on this problem I definitely think R could do a lot better. I haven’t run benchmarks against SQLite or another SQL database yet; that will happen eventually.

    I’m going to write a blog article in the next few days going into algorithmic detail about how I got merging pretty big data sets to be so fast–it was not easy!

    Here is the R code.

    N <- 10000
    indices = rep(NA, N)
    for (i in 1:N)
      indices[i] <- paste(sample(letters, 10), collapse="")

    left <- data.frame(key=rep(indices, 10),
                       key2=sample(rep(indices, 10)),
                       value=rnorm(100000))
    right <- data.frame(key=indices,
                        key2=sample(indices),
                        value2=rnorm(10000))
    timeit <- function(func, niter=10) {
      timing = rep(NA, niter)
      for (i in 1:niter) {
        gc()
        timing[i] <- system.time(func())[3]
      }
      mean(timing)
    }

    left.join <- function(sort=TRUE) {
      result <- merge(left, right, all.x=TRUE, sort=sort)
    }
    right.join <- function(sort=TRUE) {
      result <- merge(left, right, all.y=TRUE, sort=sort)
    }
    outer.join <- function(sort=TRUE) {
      result <- merge(left, right, all=TRUE, sort=sort)
    }
    inner.join <- function(sort=TRUE) {
      result <- merge(left, right, sort=sort)
    }

    sort.options <- c(FALSE, TRUE)

    results <- matrix(nrow=4, ncol=2)
    colnames(results) <- c("dont_sort", "sort")
    rownames(results) <- c("inner", "outer", "left", "right")

    join.functions <- c(inner.join, outer.join, left.join, right.join)
    for (i in 1:4) {
      results[i, 1] <- timeit(function() join.functions[[i]](sort=sort.options[1]))
      results[i, 2] <- timeit(function() join.functions[[i]](sort=sort.options[2]))
    }

    And the Python code

    import gc
    import random
    import time

    import numpy as np

    from pandas import DataFrame, merge, read_table
    from pandas.util.testing import rands

    N = 10000
    indices = np.array([rands(10) for _ in xrange(N)], dtype='O')

    key = np.tile(indices, 10)
    key2 = key.copy()
    random.shuffle(key2)
    indices2 = indices.copy()
    random.shuffle(indices2)
    left = DataFrame({'key' : key, 'key2':key2,
                      'value' : np.random.randn(100000)})
    right = DataFrame({'key': indices, 'key2':indices2,
                       'value2' : np.random.randn(10000)})
    join_methods = ['inner', 'outer', 'left', 'right']
    results = DataFrame(index=join_methods, columns=[False, True])
    niter = 10
    for sort in [False, True]:
        for join_method in join_methods:
            f = lambda: merge(left, right, how=join_method, sort=sort)
            gc.disable()
            start = time.time()
            for _ in xrange(niter):
                f()
            elapsed = (time.time() - start) / niter
            gc.enable()
            results[sort][join_method] = elapsed
    results.columns = ['dont_sort', 'sort']

    # R results
    from StringIO import StringIO
    r_results = read_table(StringIO("""dont_sort   sort
    inner    0.2297 0.2286
    outer    1.1811 1.2843
    left     0.6706 0.7766
    right    0.2995 0.3371
    """
    ), sep='\s+')

    sort_results = DataFrame.from_items([('pandas', results['sort']),
                                         ('R', r_results['sort'])])
    sort_results['Ratio'] = sort_results['R'] / sort_results['pandas']

    nosort_results = DataFrame.from_items([('pandas', results['dont_sort']),
                                           ('R', r_results['dont_sort'])])
    nosort_results['Ratio'] = nosort_results['R'] / nosort_results['pandas']

    Introducing vbench, new code performance analysis and monitoring tool

    Do you know how fast your code is? Is it faster than it was last week? Or a month ago? How do you know if you accidentally made a function slower by changes elsewhere? Unintentional performance regressions are extremely common in my experience: it’s hard to unit test the performance of your code. Over time I have gotten tired of playing the game of “performance whack-a-mole”. Thus, I started hacking together a little weekend project that I’m calling vbench. If someone thinks up a cleverer name, I’m all ears.

    Link to pandas benchmarks page produced using vbench

    What is vbench?

    vbench is a super-lightweight Python library for running a collection of performance benchmarks over the course of your source repository’s history. Since I’m a GitHub user, it only does git for now, but it could be generalized to support other VCSs. Basically, you define a benchmark:

    common_setup = """
    from pandas import *
    import pandas.util.testing as tm
    import random
    import numpy as np
    """


    setup = common_setup + """

    N = 100000
    ngroups = 100

    def get_test_data(ngroups=100, n=N):
        unique_groups = range(ngroups)
        arr = np.asarray(np.tile(unique_groups, n / ngroups), dtype=object)

        if len(arr) < n:
            arr = np.asarray(list(arr) + unique_groups[:n - len(arr)],
                             dtype=object)

        random.shuffle(arr)
        return arr

    df = DataFrame({'key1' : get_test_data(ngroups=ngroups),
                    'key2' : get_test_data(ngroups=ngroups),
                    'data' : np.random.randn(N)})
    def f():
        df.groupby(['key1', 'key2']).agg(lambda x: x.values.sum())
    """

    from datetime import datetime

    # assuming vbench exposes Benchmark through its api module
    from vbench.api import Benchmark

    stmt2 = "df.groupby(['key1', 'key2']).sum()"
    bm_groupby2 = Benchmark(stmt2, setup, name="GroupBy test 2",
                            start_date=datetime(2011, 7, 1))

    Then you write down the information about your repository and how to build any relevant DLLs, etc., that vary from revision to revision:

    from datetime import datetime

    REPO_PATH = '/home/wesm/code/pandas'
    REPO_URL = 'git@github.com:wesm/pandas.git'
    DB_PATH = '/home/wesm/code/pandas/gb_suite/benchmarks.db'
    TMP_DIR = '/home/wesm/tmp/gb_pandas'
    PREPARE = """
    python setup.py clean
    """
    BUILD = """
    python setup.py build_ext --inplace
    """
    START_DATE = datetime(2011, 3, 1)

    Then you pass this info, plus a list of your benchmark objects, to the BenchmarkRunner class:

    # again assuming the api module also exposes BenchmarkRunner
    from vbench.api import BenchmarkRunner

    runner = BenchmarkRunner(benchmarks, REPO_PATH, REPO_URL,
                             BUILD, DB_PATH, TMP_DIR, PREPARE,
                             run_option='eod', start_date=START_DATE)
    runner.run()

    Now, the BenchmarkRunner makes a clone of your repo, then runs all of the benchmarks once for each revision in the repository (or some other rule; e.g., I’ve set run_option='eod' to only take the last snapshot on each day). It persists the results in a SQLite database so that you can rerun the process and it will skip benchmarks it’s already run (this is key: when you add new benchmarks, only the new ones will be run). Benchmarks are uniquely identified by the MD5 hash of their source code.

    This is the resulting plot over time for the above GroupBy benchmark related to some Cython code that I worked on late last week (where I made a major performance improvement in this case):

    Here is a fully-formed vbench suite in the pandas git repository.

    Kind of like codespeed and speed.pypy.org?

    Before starting to write a new project I looked briefly at codespeed and the excellent work that the PyPy guys have done with speed.pypy.org. But then I started thinking, you know, Django web application, JSON requests to upload benchmark results? Seemed like far too much work to do something relatively simple. The dealbreaker is that codespeed is just a web application. It doesn’t actually (to my knowledge, someone correct me if I’m wrong?) have any kind of a framework for orchestrating the running of benchmarks throughout your code history. That is what this new project is for. I actually see a natural connection between vbench and codespeed, all you need to do is write a script to upload your vbench results to a codespeed web app!

    At some point I’d like to build a simple web front end or wx/Qt viewer for the generated vbench database. I’ve never done any JavaScript, but it would be a good opportunity to learn. Knowing me, I might break down and hack out a stupid little wxPython app with an embedded matplotlib widget anyway.

    Anyway, I’m really excited about this project. It’s very prototype-y at the moment but I tried to design it in a nice and extensible way. I also plan to put all my git repo analysis tools in there (like code churn graphs etc.) so it should become a nice little collection of tools.

    Other ideas for extending

  • Dealing with API changes
  • Multiple library versions (e.g. NumPy) or Python versions

    PyHPC 2011 Pre-print paper on pandas

    I’ll be presenting this paper in a couple of weeks at SC11. It’s the most up-to-date white paper about pandas.

    PDF Download Link

    pandas: a Foundational Python Library for Data Analysis and Statistics

    Python for Financial Data Analysis with pandas

    I taught a guest lecture tonight to the Baruch MFE program about using Python and pandas for financial data analysis. Many thanks to Alain Ledon and Norman Kabir for inviting me to teach the class. Here are the slides from the first 40 minutes:

    I spent the remaining 90 minutes or so going through a fairly epic whirlwind tour of some of the most important nuts-and-bolts features of pandas for working with time series and other kinds of financial data. It’s been a while since I’ve done a talk like this; I’m usually talking to a more general audience that doesn’t want to hear about quant finance! I did a number of fun applications:

  • Computing rolling 3 year betas to the S&P 500
  • Grouped regression of MSFT on AAPL returns by year and month
  • Plotting heatmap of daily and monthly stock correlation matrix (this one got some “ooh”s and “ahh”s)
  • Minimum tracking error portfolio optimization problem and lite backtest

    Here is the full PDF of the IPython notebook and a zip file of the IPython notebook and supporting files themselves. Note that the notebook requires a recent GitHub snapshot of statsmodels. I also use the quadratic programming solver in CVXOPT for the portfolio optimization application: if you don’t have CVXOPT, you can change it to use the brute force BFGS optimizer (though it’s about 50x slower).

    See my prior blog post with some more details about installing and getting set up with the IPython notebook.

    IPython Notebook PDF Output
    IPython Notebook and supporting files

    Python, R, and the allure of magic

    R is much more magical than Python. What do I mean by this? In R, things like this are a part of everyday life:

    > a <- rnorm(10)
    > b <- rnorm(10)
    > cbind(a, b)
                   a          b
     [1,]  0.8729978  0.5170078
     [2,] -0.6885048 -0.4430447
     [3,]  0.4017740  1.8985843
     [4,]  2.1088905 -1.4121763
     [5,]  0.9375273  0.4703302
     [6,]  0.5558276 -0.5825152
     [7,] -2.1606252  0.7379874
     [8,] -0.7651046 -0.4534345
     [9,] -4.2604901  0.9561077
    [10,]  0.3940632 -0.8331285

    If you’re a seasoned Python programmer, you might have the same visceral negative reaction that I do to this. Seriously, just where in the hell did those variable names come from? So when I say magic here, I’m talking about abusing the language’s parser. There is nothing special about R that makes the above behavior possible; it’s just a fundamentally different design philosophy from, say, Python’s. As any Python programmer knows: Explicit is better than implicit. I happen to agree. There is also a bit of a semantic difference between R and Python in that assignment in R typically copies data, whereas variables in Python are simply references (labels) for a particular object. So you could make the argument that the names a and b above are more strongly linked to the underlying data.

    While building pandas over the last several years, I have occasionally grappled with issues like the above. Maybe I should just break from the Python ethos and embrace magic? I mean, how hard would it be to get the above behavior in Python? Python gives you stack frames and the ast module, after all. So I went down the rabbit hole and wrote this little code snippet:
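
    What follows is a condensed sketch of the idea rather than the exact snippet: grab the calling line with the inspect module, parse it with ast, and use the argument expressions themselves as column names (ast.unparse needs Python 3.9+; hand-rolling that AST-to-source walk is the harder alternative alluded to below):

    import ast
    import inspect

    from pandas import DataFrame

    def merge(*args):
        # look one frame up the stack and grab the source of the call site
        caller = inspect.stack()[1]
        line = caller.code_context[0].strip() if caller.code_context else ''

        # parse the line; the argument expressions become the column names
        names = []
        try:
            node = ast.parse(line, mode='eval').body
            if isinstance(node, ast.Call):
                names = [ast.unparse(arg) for arg in node.args]
        except SyntaxError:
            pass
        if len(names) != len(args):
            names = ['arg%d' % i for i in range(len(args))]   # fallback

        # args are assumed to be Series (as in the output below)
        return DataFrame(dict(zip(names, args)), columns=names)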

    While this is woefully unpythonic, it’s also kind of cool:

    In [27]: merge(a, b)
    Out[27]:
                a         b      
    2000-01-03 -1.35      0.8398
    2000-01-04  0.999    -1.617  
    2000-01-05  0.2537    1.433  
    2000-01-06  0.6273   -0.3959
    2000-01-07  0.7963   -0.789  
    2000-01-10  0.004295 -1.446

    This can even parse and format more complicated expressions (harder than it looks, because you have to walk the whole AST):

    In [30]: merge(a, np.log(b))
    Out[30]:
                a        np.log(b)
    2000-01-03  0.6243   0.7953  
    2000-01-04  0.3593  -1.199    
    2000-01-05  2.805   -1.059    
    2000-01-06  0.6369  -0.9067  
    2000-01-07 -0.2734   NaN      
    2000-01-10 -1.023    0.3326

    Now, I am *not* suggesting we do this any time soon. I’m going to prefer the explicit approach (cf. the Zen of Python) any day of the week:

    In [32]: DataFrame({'a' : a, 'log(b)' : np.log(b)})
    Out[32]:
                a        log(b)
    2000-01-03  0.6243   0.7953
    2000-01-04  0.3593  -1.199  
    2000-01-05  2.805   -1.059  
    2000-01-06  0.6369  -0.9067
    2000-01-07 -0.2734   NaN    
    2000-01-10 -1.023    0.3326