A Roadmap for Rich Scientific Data Structures in Python

Discussion thread on Hacker News

So, this post is a bit of a brain dump on rich data structures in Python and what needs to happen in the very near future. I care about them for statistical computing (I want to build a statistical computing environment that trounces R) and financial data analysis (all evidence leads me to believe that Python is the best all-around tool for the finance space). Other people in the scientific Python community want them for numerous other applications: geophysics, neuroscience, etc. It’s really hard to make everyone happy with a single solution. But the current state of affairs has me rather anxious. And I’d like to explain why. For a really quick summary on some of the work I’ve been doing, here’s my SciPy 2010 talk slides:

Data structures with metadata, the backstory

In the wake of SciPy 2011 I’ve been thinking a lot about the way forward from here in terms of building rich Pythonic data structures for statistics and many, many other fields. By rich I mean: is not just a NumPy ndarray, contains metadata (however we define metadata) and has operations which depend on the metadata, and in general does far more than structured arrays currently do for you. This touches on a great many topics and features that people want (partial list):

  • Manipulating heterogeneously-typed (what I loosely call “mixed type”) data
  • Size mutability: can easily add “columns” or otherwise N-1-dimensional hyperslices without necessarily copying data
  • Metadata about what each axis represents: axis=3 is less meaningful than axis=’temperature’
  • Metadata about axis labels (ticks)
  • Label / tick-based data alignment / reshaping, either automatic or explicit
  • Label / tick-based (fancy) indexing, both setting and getting
  • Hierarchical columns
  • Robust handling of missing / NA data (à la R)
  • Tons of operations needing heterogeneous data and metadata: group by, filtering, sorting, selection / querying, reindexing (reshaping to conform to a new set of labels), axis selection based on names, etc. etc. 
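A few of the items above (label-based alignment, reindexing, NA handling, group by on mixed-type data) can be illustrated with a minimal sketch. It assumes a present-day pandas install, and the series, labels, and table below are synthetic values made up for illustration, not data from this post.

```python
import numpy as np
import pandas as pd

# Two series with partially overlapping labels (synthetic data).
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Arithmetic aligns on labels; a label missing from either side yields NaN.
total = s1 + s2

# Reindexing conforms the data to a new set of labels, introducing NaN
# for labels that were not present.
conformed = s1.reindex(["c", "a", "z"])

# Group by on a heterogeneously-typed table (strings, ints, floats).
df = pd.DataFrame({
    "ticker": ["AAPL", "GOOG", "AAPL", "GOOG"],
    "volume": [100, 200, 300, 400],
    "ret": [0.01, -0.02, 0.03, np.nan],
})
# Missing values are skipped when averaging within each group.
mean_by_ticker = df.groupby("ticker")["ret"].mean()
```

None of these operations needs the caller to track integer positions; the labels carry the alignment.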
The list goes on and on. I could write a 50-page manuscript specifying exactly what functionality is desired for each of the above bullet points. What I do know is that after using a rich data structure like the ones in pandas, it’s very, very hard to go back to using vanilla ndarrays. To wit, R users coming to Python have a similar experience: the lack of data.frame and all the functions which operate on data.frames is a gigantic loss of functionality. When I work with MATLAB and R users (especially in the finance industry) and get them up and running with pandas, I get a lot of “where was this tool all my life?”. It’s just that much better. In fact, users can get by with only a very rudimentary understanding of NumPy if the data structures are good enough; I think this is highly desirable. Even for purely interactive data analysis (forget operations which actually utilize the metadata), isn’t this much better:
In [4]: data.corr()
Out[4]:
       AAPL     GOOG     MSFT     YHOO  
AAPL   1        0.5724   0.4714   0.3447
GOOG   0.5724   1        0.5231   0.3409
MSFT   0.4714   0.5231   1        0.3012
YHOO   0.3447   0.3409   0.3012   1

than this:

In [11]: np.corrcoef(data.T)
Out[11]:
array([[ 1.    ,  0.5724,  0.4714,  0.3447],
       [ 0.5724,  1.    ,  0.5231,  0.3409],
       [ 0.4714,  0.5231,  1.    ,  0.3012],
       [ 0.3447,  0.3409,  0.3012,  1.    ]])

Of course if data were a structured ndarray you would be completely up a creek (most NumPy functions do not play well with structured arrays). But that’s another topic.
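For readers who want to reproduce the comparison, here is a minimal sketch. It assumes a present-day pandas and NumPy; the random returns are synthetic stand-ins, so the numbers will not match the output shown above.

```python
import numpy as np
import pandas as pd

# Synthetic "returns" for four tickers (made-up data, seeded for repeatability).
rng = np.random.default_rng(0)
data = pd.DataFrame(
    rng.standard_normal((250, 4)),
    columns=["AAPL", "GOOG", "MSFT", "YHOO"],
)

# Labeled result: rows and columns carry the ticker names.
labeled = data.corr()

# Unlabeled result: same numbers, but the caller must remember the column order.
unlabeled = np.corrcoef(data.to_numpy().T)
```

Both calls compute the same Pearson correlations; the difference is only in whether the labels survive.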

But anyway, to the point of why I’m writing: we have a ton of really talented people with real problems to solve, and lots of great ideas about how to solve them. Last year at SciPy 2010 in Austin we had a Birds of a Feather session led by the venerable Fernando Pérez and myself to talk about the datarray project, pandas, tabular, larry, and other various ideas about data structures that people have kicked around. The topic is important enough that Enthought hosted a gathering this past May in Austin, the DataArray Summit, to talk about these issues and figure out where to go from here. It was a great meeting and we hashed out in gory detail many of the problem areas that we’d like to solve with a richer data structure. So that’s a bit of the backstory.

But even given all these great discussions and work being done, we have a really fundamental problem:

Fragmentation is killing us

There, I said it =) All us NumPy ninjas can navigate the fragmented, incohesive collection of data structures and tools, but it’s confusing as hell for new and existing users. NumPy alone is not good enough for a lot of people (statisticians, financial data analysts, etc.), but they’re left with a confusing choice between pandas, larry, datarray, or something else. Also, these tools have largely not been integrated with any other tools because of the community’s collective commitment anxiety. We talk, hem and haw, and wring our hands. And still no integration. I don’t mean to complain: I just deeply care about making the scientific Python stack the most powerful data analysis stack to ever exist. Seriously. And I can’t do it alone. And I don’t want to make unilateral decisions and shove anything down anyone’s throat. We’ve started working on integration of pandas in statsmodels (which is already going to make a really huge difference), but we need to collectively get our proverbial sh*t together. And soon.

My work on pandas lately and why it matters

On my end, in the 2 months since the DataArray summit, I decided to put my PhD on hold and focus more seriously on Python development, statistics / statsmodels, pandas, and other projects I care deeply about. So I’ve easily invested more time in pandas in the last 2 months than in the previous 2 years. This included heavily redesigned internals (which I’m extremely pleased with) and tackling various other thorny internal issues which had long been an annoyance for me and an external source of criticism by end users (like, “why are there 2 2-dimensional data structures, DataFrame and DataMatrix?” The answer is that the internal implementation was different, with different performance characteristics depending on use). I’m also trying to clear out my extensive backlog of wishlist features (lots of blog articles to come on these). What’s happened is that I started prototyping a new data structure which I’m calling NDFrame which I think is really going to be a big deal. Basically, the NDFrame is a by-product of the redesigned internal data structure (currently called BlockManager) backing DataFrame and friends. I can and should write an entire post about BlockManager and exactly what it needs to accomplish, but the short story is that BlockManager:

  • Stores an arbitrary collection of homogeneously-typed N-dimensional NumPy ndarray objects (I call these Block objects or simply blocks)
  • Has axis labels and is capable of reshaping the “blocks” to a new set of labels
  • Is capable of consolidating blocks (gluing them together) having the same dtype
  • Knows how to appropriately introduce missing data, and upcast dtypes (int, bool) when the missing data marker (NaN, currently. Crossing my fingers for a good NA implementation!) needs to be introduced
  • Can deal with single “item” type casts (ha, try to do that with structured arrays!)
  • Can accept new blocks without copying data
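The upcasting point can be seen from the outside with a minimal sketch. It assumes a present-day pandas with default (NumPy-backed) dtypes, where NaN is the missing data marker; the series and labels are made up for illustration.

```python
import pandas as pd

# Integer and boolean data with three labels (synthetic values).
ints = pd.Series([1, 2, 3], index=["a", "b", "c"])
bools = pd.Series([True, False, True], index=["a", "b", "c"])

# Conforming to a label set that includes an unseen label "d"
# forces a missing value to be introduced.
new_labels = ["a", "b", "c", "d"]
ints_re = ints.reindex(new_labels)    # int64 cannot hold NaN, so it upcasts to float64
bools_re = bools.reindex(new_labels)  # bool cannot hold NaN, so it upcasts to object
```

The original values are preserved; only the dtype changes to make room for the NaN marker.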
Maybe this is too meta (heh) for some people. As illustration, here’s literally what’s going on inside DataFrame now after all my latest hacking:
In [40]: df
Out[40]:
    b         a         c       e   f
0   1.213     1.507     True    1   a
1  -0.6765    0.06237   True    1   b
2   0.3126   -0.2575    False   1   c
3   0.1505    0.2242    True    1   d
4  -0.7952    0.2909    True    1   e
5   1.341    -0.9712    False   2   f
6   0.01121   1.654     True    2   g
7  -0.173    -1.385     False   2   h
8   0.1637   -0.898     False   2   i
9   0.5979   -1.035     False   2   j

In [41]: df._data
Out[41]:
BlockManager
Items: [b a c e f]
Axis 1: [0 1 2 3 4 5 6 7 8 9]
FloatBlock: [b a], 2 x 10, dtype float64
BoolBlock: [c], 1 x 10, dtype bool
IntBlock: [e], 1 x 10, dtype int64
ObjectBlock: [f], 1 x 10, dtype object

The user is of course never intended to look at this; it’s purely internal. But things like this, for example, are OK to do:

In [42]: df['e'] = df['e'].astype(float)
In [43]: df._data
Out[43]:
BlockManager
Items: [b a c e f]
Axis 1: [0 1 2 3 4 5 6 7 8 9]
FloatBlock: [b a], 2 x 10, dtype float64
BoolBlock: [c], 1 x 10, dtype bool
ObjectBlock: [f], 1 x 10, dtype object
FloatBlock: [e], 1 x 10, dtype float64

Since in that case there are now multiple float blocks, they can be explicitly consolidated, but if you use DataFrame many operations will cause it to happen automatically (which is highly desirable, especially when you have only one dtype, for doing row-oriented operations):

In [44]: df._data.consolidate()
Out[44]:
BlockManager
Items: [b a c e f]
Axis 1: [0 1 2 3 4 5 6 7 8 9]
BoolBlock: [c], 1 x 10, dtype bool
FloatBlock: [b a e], 3 x 10, dtype float64
ObjectBlock: [f], 1 x 10, dtype object
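To make the consolidation idea concrete, here is a toy sketch. `ToyBlockManager` is a hypothetical class written for this illustration; it is not the pandas BlockManager and ignores axis labels, missing data, and everything else in the list above. It only shows columns stored as 2-D blocks and same-dtype blocks being glued together.

```python
from collections import defaultdict

import numpy as np


class ToyBlockManager:
    """Hypothetical toy: stores columns as (items, 2-D array) blocks."""

    def __init__(self, columns):
        # Start with one block per column, each of shape (1, n).
        self.blocks = [
            ([name], np.asarray(values)[np.newaxis, :])
            for name, values in columns.items()
        ]

    def consolidate(self):
        # Group blocks by dtype, then stack each group into a single block.
        grouped = defaultdict(list)
        for items, values in self.blocks:
            grouped[values.dtype].append((items, values))
        merged = []
        for parts in grouped.values():
            items = [name for part_items, _ in parts for name in part_items]
            values = np.vstack([block for _, block in parts])
            merged.append((items, values))
        self.blocks = merged
        return self

    def summary(self):
        # (item names, block shape, dtype name) for each block.
        return [(items, values.shape, str(values.dtype))
                for items, values in self.blocks]


mgr = ToyBlockManager({
    "b": np.array([1.2, -0.7, 0.3]),
    "a": np.array([1.5, 0.1, -0.3]),
    "c": np.array([True, True, False]),
    "e": np.array([1.0, 1.0, 2.0]),
})
before = mgr.summary()
after = mgr.consolidate().summary()
```

After consolidation the three float columns share one 3 x 3 block, which is what makes row-oriented operations on same-dtype data cheap.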

Now, this is a pretty abstract business. Here’s my point: when I started thinking about NDFrame, a user-facing n-dimensional data structure backed by BlockManager, I realized that what I am going to build is a nearly strict superset of the functionality provided by every other rich data structure I know of. I made a picture of the feature overlap, and note that arrows loosely mean: “can be used to implement”:

[Figure: Data Structure Features Venn Diagram]

For example, I need to write generic fancy indexing on the NDFrame, a task largely tackled by DataArray. So rather than reinvent the wheel, I should just co-opt that code (love that BSD license), but then I’ve effectively created a fork (nooooo!). I think having all these different libraries (and leaving users the confusing choice between them) is kind of nuts. Ideally DataArray (homogeneous) should just be a part of pandas (and I’m not opposed to changing the name, though it has stronger branding and far more users than datarray or larry). But once we’ve gone down that route, larry is just a DataArray (homogeneous) with automatic data alignment. We’re all doing the exact same kinds of things. So why not have one library?

Remember: the real world is heterogeneously-typed

Some other casual (and potentially controversial) observations

Here’s just a dumping ground of various thoughts I have on this and related topics:

  • The NumPy type hierarchy (int8, int16, int32, int64, uint8, uint16, …) isn’t that important to me. R and MATLAB don’t really have a type hierarchy and it doesn’t seem to pose a problem. So I haven’t gone down the road of creating Block objects mapping onto the NumPy type hierarchy. If someone wants to do it without complicating the end-user pandas experience, be my guest
  • Tying yourself to the ndarray is too restrictive. This is a major problem with DataArray and larry; they don’t do mixed-type data. So if you want to build something that competes with R you have failed before you have even begun by using a homogeneous-only data structure. Remember, DataFrame can be homogeneous whenever it wants to and getting the underlying ndarray is just a few keystrokes.
  • Structured arrays are probably not the answer. Size mutability (ability to add columns) and the ability to change dtypes are actually a big deal. As is the ability to do row-oriented computations, broadcasting, and all the things that structured arrays are (currently) unable to do. They are a darned convenient way of serializing / deserializing data, though. But memory layout and all that is far, far less important to me than usability / user interface / experience
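The points about size mutability, changing dtypes, and getting at the underlying ndarray can be sketched as follows. This assumes a present-day pandas (where `to_numpy()` is available); the column names and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# A small table with float and integer columns (synthetic values).
df = pd.DataFrame({
    "a": [1.5, 0.1, -0.3],
    "b": [1.2, -0.7, 0.3],
    "e": [1, 1, 2],
})

# Size mutability: add a column to the existing object.
df["g"] = df["a"] + df["b"]

# Change the dtype of a single column.
df["e"] = df["e"].astype(float)

# With only one dtype left, the homogeneous 2-D ndarray is a call away,
# and row-oriented computations work as usual.
arr = df.to_numpy()
row_sums = arr.sum(axis=1)
```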
  • John Marino

    Impressive thinking. You’ll get the benefit of a SQL table with all the math-y functions of numpy. (The paltry list of aggregate functions in SQL has always forced me to offload my data to another program, like SAS, to do the computation.)

    FYI: your comment “[The] incohesive collection of data structures and tools, [... are] confusing as hell for new and existing users.” It’s true of this new user as well. I just lucked out when I stumbled upon your SciPy2010 talk a few months ago and decided “OK, go with pandas because Wes has had many of the same problems I have.” (FWIW, I almost didn’t go with pandas because it’s dependent on so many other modules and I was [am still] afraid of upgrade hell, bleeding edges, etc.)

    Just for fun: http://xkcd.com/927/

  • Gaël Varoquaux

    Why do DataArrays exist? As far as I am concerned, because DataArrays are just a subclass of ndarrays, and thus do not require me to deal with an additional layer of abstraction in my algorithms.

    Wes McKinney Reply:

    It’s a fair point. But if the data structure responds to np.asarray (which is easy to accomplish)…? Indeed a lot of the discussion in May down in Austin was around preserving the ndarray interface while also offering rich label-based indexing semantics. I’m not opposed to having ndarray subclass equipped with all the nice metadata stuff–my point was not “kill DataArray” but rather “let’s put all the code in one place because label-based indexing is a completely generic concept”
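A minimal sketch of what "responds to np.asarray" can mean, assuming NumPy's `__array__` conversion hook. `LabeledArray` is a hypothetical class invented for this illustration; it is not DataArray, larry, or any pandas object.

```python
import numpy as np


class LabeledArray:
    """Hypothetical container: values plus labels, not an ndarray subclass."""

    def __init__(self, values, labels):
        self.values = np.asarray(values)
        self.labels = list(labels)

    def __array__(self, dtype=None, copy=None):
        # NumPy calls this hook when it needs a plain ndarray.
        if dtype is None:
            return self.values
        return self.values.astype(dtype)

    def get(self, label):
        # Label-based lookup, kept separate from the array interface.
        return self.values[self.labels.index(label)]


la = LabeledArray([1.0, 2.0, 3.0], ["x", "y", "z"])
plain = np.asarray(la)        # algorithms expecting an ndarray can take this
doubled = np.multiply(la, 2)  # ufuncs convert the input through the same hook
```

The result of such operations is a plain ndarray, so the labels are dropped at the boundary; preserving them through computations is the harder part of the design.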

  • Anonymous

    re: heterogeneous data and structured arrays. these are simply (nd)arrays of structures? does numpy implement a discriminated union of sorts, which might underlie a variant data type?

    Wes McKinney Reply:

    yeah, “structured arrays” are one-dimensional ndarrays where each element is basically a packed C struct. but you can do cool things like have hierarchical columns. The whole NumPy mentality is that ndarrays are just views on a void buffer– so you’re free to interpret the bytes however you like. Striding information, etc., is part of the underlying C struct

    I haven’t looked under the covers at the C code for how it works, but there are a lot of applications where structured arrays are like a hot knife through butter. Especially given that you can use np.memmap to access structured arrays that far exceed what you can load into memory (if I’m not mistaken)

    David W-F Reply:

    Just a small correction: NumPy structured arrays can be arbitrarily dimensioned. It just gets weird to think about them since you then have this extra “dimension” that is the dtype.
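A minimal sketch of both points, assuming only NumPy: each element of a structured array is a packed record, and the array itself can have any number of dimensions. The field names `x` and `y` are made up for illustration.

```python
import numpy as np

# Each element is a record with a float64 field and an int32 field.
point = np.dtype([("x", "f8"), ("y", "i4")])

# One-dimensional structured array; a field name selects a plain column.
flat = np.zeros(4, dtype=point)
flat["x"] = [0.5, 1.5, 2.5, 3.5]

# Two-dimensional structured array: the dtype acts like an extra "dimension".
grid = np.zeros((2, 3), dtype=point)
grid["y"] = np.arange(6).reshape(2, 3)

record_size = point.itemsize  # 8 + 4 bytes, packed without padding by default
corner = grid[1, 2]["y"]      # index by position, then by field
```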

  • Jacob Frelinger

    When you get back into town, I should pick your brain about this. This is a huge issue for flow data in python, and a significant bulk of the code in fcm is approximating this kind of work (and a fair amount of it is pretty hairy code too), esp with regards to intelligently generating subsets of the data.

    Wes McKinney Reply:

    It definitely may be that you can just pick up and use one of these tools. If you can show me concrete use cases and example data I can tell you pretty quickly. And if pandas, for example, doesn’t currently do what you need that is valuable information for me, too. The feature scope is by no means closed off.

  • Carlo Hamalainen

    Just out of curiosity, have you worked with NetCDF? It addresses some of the meta-data issues that you have (e.g. this axis is longitude with these units and this range), but not some of the others. The http://www.scidb.org/ project might also be of interest.

    Wes McKinney Reply:

    I have not. I have worked with PyTables/HDF5 quite extensively which bears some similarities for storing array-oriented scientific data + metadata. In some sense databases are slightly orthogonal issues as pandas and friends are currently restricted to in-memory data structures and are concerned with munging data in memory, doing ad hoc analysis and data visualization, passing things off into algorithms expecting NumPy arrays and the like. Basically you need a tool that scales well for larger data sets but is nimble and lightweight for the small tasks too. Depends on the application obviously

  • Anonymous

    Wes,

    Awesome post and I couldn’t agree with you more. I’m very excited to see where pandas and the python scientific community as a whole are going. I really hope others can see your logic about integrating our current tools for the better part of the community. Great stuff!

  • Tim

    Wes,
    You made a good point. But speaking of the gap between python and R, I think the other thing Python needs to catch up on is graphics: R can generate beautiful graphics using ggplot2 etc., and I think matplotlib is far behind.
    what do you think?
    -Tim

    Wes McKinney Reply:

    I definitely agree with you there. ggplot2 is awesome (!). You *can* make attractive graphics with matplotlib but it definitely requires a lot of tweaking / customization. I’m hopeful that a kind soul will put some work into implementing the Grammar of Graphics for Python (ggpy anyone?). We shall see

    ChrisJS Reply:

    I haven’t seen any clear demonstrations of what ggplot2 can do that the base R graphics package can’t do passably well. And I’m not sure there’s that large a gap between matplotlib and the base R graphics package.

    Could you give any examples of ways that ggplot2 > base R graphics and/or base R graphics > matplotlib?

    Wes McKinney Reply:

    ggplot2 is mainly about 2 things: a) making plotting easier and b) making attractive plots. Making plots in matplotlib isn’t that bad but making attractive plots is rather difficult. here’s a list of blog posts someone did showing off some ggplot2 stuff (comparing with the lattice CRAN package, too, which can also make some nice looking plots that I wish were easier to do in matplotlib):

    http://gettinggeneticsdone.blogspot.com/2009/07/ggplot2-more-wicked-cool-plots-in-r.html

    Hadley Wickham Reply:

    The big advantage of ggplot2 compared to base graphics is that it has a deep underlying theory. There are two advantages to this – (1) it’s consistent, so you don’t have to remember so many different options and (2) it’s composable, so it’s much easier to create new types of graphics.

  • Istvan Albert

    Just a quick comment, pandas looks impressive. One thing that worries me are the object methods that provide data analysis – Why do these exist at all? For example data.corr() seems to provide little benefit to invoking it as corr(data)

    The function corr() can now be documented separately, it can detect the various input data and respond accordingly etc. there are many benefits to this approach. I would resist the urge to add methods – IMHO object-oriented approaches are not good for data analysis – functional approaches, transformations are much better ones.

    Edit: same with data.astype(float) could be simply: float(data) and I could go on. Reducing the cognitive overhead and keeping things maximally simple would attract more users than any other approach.

    Great work I will start using pandas immediately.

    Wes McKinney Reply:

    The general consensus of what’s “Pythonic” is to have instance methods for all the fundamental heavily used stuff. For example, numpy.ndarray has various reductions as instance methods: sum, std, mean, var, so typing arr.sum(0) versus sum(arr, 0) jibes better with the rest of the scientific Python ecosystem– and you don’t have to hunt around for the right sum function. In R for example, most “data analysis” methods (as you put it) are actually instance methods on various data structures. So when you type plot(obj) it’s dispatching to plot.data.frame, which is how R does instance methods on classes. So it’s purely syntactic sugar.

    In general I have avoided adding tons of methods: there are actually relatively few functions for computing descriptive statistics compared with data manipulations inherent to the data structures. For example, all of the moving window functions (pandas.rolling_*) are functions which can operate on many kinds of input. But indeed you have to have a place in the code where it says:

    def some_function(obj):
        if isinstance(obj, Series):
            ...
        elif isinstance(obj, DataFrame):
            ...
        elif isinstance(obj, np.ndarray):
            ...

    I’m all for separating analytics from data, but in this case the Zen of Python “practicality beats purity” applies =)
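As an aside, the isinstance chain above can also be written with the standard library's `functools.singledispatch`. This is a swapped-in technique for illustration, not how pandas is implemented; `column_means` is a hypothetical function name, and the sketch assumes a present-day pandas and NumPy.

```python
from functools import singledispatch

import numpy as np
import pandas as pd


@singledispatch
def column_means(obj):
    # Fallback for unsupported inputs: a clear TypeError naming the type.
    raise TypeError(f"column_means does not support {type(obj).__name__}")


@column_means.register(np.ndarray)
def _(obj):
    return obj.mean(axis=0)


@column_means.register(pd.Series)
def _(obj):
    return obj.mean()


@column_means.register(pd.DataFrame)
def _(obj):
    return obj.mean()


from_array = column_means(np.array([[1.0, 2.0], [3.0, 4.0]]))
from_series = column_means(pd.Series([1.0, 3.0]))
from_frame = column_means(pd.DataFrame({"a": [1.0, 3.0], "b": [2.0, 6.0]}))

# An unsupported type produces the explicit error from the fallback.
try:
    column_means("not an array")
except TypeError as exc:
    message = str(exc)
```

The dispatch still has to live somewhere; this only moves it out of a single function body.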

    Also: astype is an ndarray method as Series (1D) is a subclass of numpy.ndarray

    Istvan Albert Reply:

    But should you be using the numpy framework as guidance when trying to attract the audience that numpy does not seem to be able to properly reach?

    The main problem with the methods is that people try to use them and get an attribute error when the data is the wrong type, rather than a type error that the function could catch to tell them what is wrong with the invocation/operation.

    I do feel the pain of delegating by object type and there are various solutions – none of which are ideal. I will say though that the “Zen of Python” IMHO needs to apply to the end result, how the library/package operates with respect to the end users, and not the internal details. I looked at the internals of various so called ‘pythonic’ tools – it ain’t pretty.

    In your descriptive statistics package you will soon hit the problem that more advanced functions have to be standalone – this now leads to confusion – is the function that I want a method or standalone function – now I have to look up the docs etc. –

    usability is a notoriously difficult concept to properly deal with in an open source project – the main developer is usually so close and deeply involved with the project that they are unable to properly evaluate it – I am not saying that I am right – do a field test and ask a few people what they think, interestingly it was shown that 3 to 5 people would suffice – especially those that you are trying to reach, then act accordingly.

    Wes McKinney Reply:

    “usability is a notoriously difficult concept to properly deal with in an open source project”

    Indeed :) I’m lucky to have a rather large established user base who give me lots of feedback. I actually began writing pandas while I was working for a major quant hedge fund (AQR) so it has undergone heavy, heavy dogfooding. That certainly continues to this day. If I were operating in a bubble with no consideration to the end user experience it would be a different matter– but in short usability is essentially my primary focus.

    In short, I agree with your sentiments so don’t expect very many additional analytical methods to start appearing in DataFrame. But {sum, mean, std, skew, quantile, corr, corrwith, cumsum, diff, max, min} are there to stay unless there’s an extremely compelling reason to take any of them out.

    Istvan Albert Reply:

    ok – good luck

  • http://openid.drewfrank.com/ Drew Frank

    Hi Wes,

    Great post! You’re absolutely right that fragmentation has been a serious problem to date, and your ideas here are very promising. This seems like a nice step toward accomplishing the goal of building a statistical computing environment that trounces R :) .

    I’ve recently abandoned MATLAB for Python, but I find myself occasionally pulled toward R instead. I’m more familiar with Python than R, but in certain areas — rich scientific data structures, plotting, and libraries for statistics and machine learning — I find that R generally has stronger offerings. Can you comment on why you settled on Python as the language in which to build your ideal statistical computing environment? Why not start with R, which already has a leg up in these areas, and work to make it even better? This is not a suggestion, I’m just interested in your perspective. Thanks!

    Wes McKinney Reply:

    I can and should write a whole article on my relationship with Python and R and how I got to where I am. Short story is that I actually started out doing statistical modeling in finance-land in R. I found myself in the situation where I needed to be able to nimbly do new research (which R is generally good at, but some things kind of stink, like data alignment), but simultaneously build robust, maintainable production code that could run every day and not cause a lot of problems. (Now, if you need good networking and other “systems” libraries to integrate with a scientific app and you’re using R: good luck)

    It was this latter task that drove me away from R and to Python: Python really is the sweet spot for being an excellent language for building robust production systems while also having a great set of interactive research and scientific computing tools. In some sense I was lucky to settle on the “right” set of tools, but in retrospect there was no other obvious choice. Python is still the best option, in my (admittedly biased) opinion. But Python still has a long way to go with regard to being competitive as a statistical computing environment. I’m working very hard at the moment to do something about that. statsmodels, scikit-learn, other projects are helping a lot.

  • Matt Hollingsworth

    Hey Wes,

    Good to see that someone else is dragging out their PhD for this reason :)

    I do research in high energy physics (which is basically just statistical modeling at its core), and typically use ROOT ( http://root.cern.ch ) for my day-to-day analysis. It is really great in many ways, but it has many problems, namely
    a) the overall architecture is pretty terrible and has the steepest learning curve of any analysis toolkit I’ve used before
    b) it’s rather unstable–it seg faults all the time, stuff just doesn’t work as advertised, etc
    c) It’s in C++ which requires way too much typing to do simple things
    d) is overall very frustrating to use since as much time is spent fixing ROOT problems as doing actual data analysis

    I would be using SciPy+python (which I like a lot more), but the problem is that it has one feature that it’s impossible for me to live without, which that it has a data model (TTree) that allows one to make plots in a single command with arbitrary selections on data, etc. For example, if I have some data which is made up of some 1000 entries of x,y,and z, I can create a “TTree”, fill it with the data, and then do things like this:

    TTree* t = new TTree("T","T");
    Double_t x,y,z;
    t->Branch("x",&x,"x/D");
    t->Branch("y",&y,"y/D");
    t->Branch("z",&z,"z/D");
    for(int i = 0; i < 1000; i++){
        x = gRandom->Gaus(0,10); y = gRandom->Gaus(-10,4); z = gRandom->Gaus(100,3);
        t->Fill();
    }
    t->Draw("x");// Draws a 1D histogram
    t->Draw("y:x","","COLZ");// Draws a 2D histogram, y vs. x
    t->Draw("y:Entry$","Entry$ [...] Draw("z:y:TMath::Exp(x)","x>1 || y [...] 1 or y [...] Fit("gaus") or whatever. You can even fill trees with arbitrary data types, collections, and all kinds of things.

    So, ROOT is great for that, but when it comes to trying to do some “real” analysis which involves getting the raw data out of the tree, doing some sort of operations on them, and then re-storing the data for example, it takes ages because of all of the random seg faults, things not working as advertised, etc. Not to mention everything else about ROOT is very unwieldy.

    I mention this to you because I would really like to work on a toolkit that does this sort of expression-based drawing/fitting/etc using python and scipy but I need a powerful enough data model to base it upon (to replace TTree). I just want to know what you think; do you think this is doable on top of pandas? And do you know anyone else who may be interested in doing something like this, or if something like this already exists?

    -Matt

    Wes McKinney Reply:

    Interesting. It’s possible that pandas could provide a data model serving all or part of the needs of what you’re talking about. It sounds like TTree is a lot more general, and hence more complicated and difficult to use, than pandas. The pandas data model is pretty simple as it abstracts only a bit from the NumPy data model, which is working with contiguous chunks of memory. Essentially DataFrame for example is just a named collection of potentially heterogeneously-typed same-length vectors with row labeling information. So in your case, you could easily do:

    from pandas import DataFrame
    from scipy.stats import norm
    df = DataFrame({'x' : norm(0, 10).rvs(1000),
                    'y' : norm(-10, 4).rvs(1000),
                    'z' : norm(100, 3).rvs(1000)})

    Now of course you have a DataFrame with 3 columns:

    In [34]: df
    Out[34]:

    Index: 1000 entries, 0 to 999
    Data columns:
    x 1000 non-null values
    y 1000 non-null values
    z 1000 non-null values
    dtypes: float64(3)

    it would be up to you to create some kind of “formula parser” to take data contained in a DataFrame and a description of a plot like you have and execute the appropriate matplotlib commands. As far as interactivity w.r.t. making changes to the plot– there is the default interaction that matplotlib provides but beyond that you would have to craft your own GUI widgets to suit your needs. It’s largely a user interface question.
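A very rough sketch of the data-handling half of such a "formula parser", assuming a present-day pandas. `evaluate_draw_spec` is a hypothetical helper invented here, the colon-separated spec only loosely imitates the ROOT strings above, and the selection string is a made-up example. It returns arrays that could then be handed to a matplotlib call such as `hist` or `scatter`.

```python
import numpy as np
import pandas as pd


def evaluate_draw_spec(df, spec, selection=None):
    """Hypothetical helper: turn a spec like "y:x" into a list of arrays.

    `selection` is an optional boolean expression over the columns.
    """
    subset = df.query(selection) if selection else df
    arrays = []
    for expr in spec.split(":"):
        # Plain column names are looked up directly; anything else is
        # evaluated as an expression over the columns.
        values = subset[expr] if expr in subset.columns else subset.eval(expr)
        arrays.append(values.to_numpy())
    return arrays


# Synthetic data with the same distributions as the example above.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.normal(0, 10, 1000),
    "y": rng.normal(-10, 4, 1000),
    "z": rng.normal(100, 3, 1000),
})

y_vals, x_vals = evaluate_draw_spec(df, "y:x", "x > 1 or y < -10")

# The binning a 1-D histogram plot would display.
counts, edges = np.histogram(x_vals, bins=20)
```

Splitting on ":" is a simplification: an expression containing "::" would need a real parser.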

    Now, if you store object arrays in DataFrame (or any other pandas object), you can have collections of arbitrary data structures, giving you more flexibility. It’s quite likely that a problem that ROOT can handle fairly well may just need to be structured in a different way to be approachable with Python, pandas and other tools. But from what you’re saying about ROOT’s usability this might not be a bad thing anyway. Generally speaking, the more abstract / generic you make something, the harder it is to use.

    Matt Hollingsworth Reply:

    Thanks for the info! And for the example. It sounds like the majority of the functionality needed in the data container is implemented by pandas.

    One last question. Is there any way that some sort of dynamic loading from file could be implemented with pandas objects? This is relevant if my data is N gigabytes big and I can’t load the whole thing into memory (ours is a ~35 TB per day! Okay, we’re not making plots out of all of it at once, but the data we do run over is at least a terabyte most of the time). TTree is really fast and efficient because it can load only what it wants right that second; for example, if I wanted to plot “x” in the above example, the Draw command doesn’t read in y and z, and it also only reads one x at a time, so it seeks in the file to the next place it should read each time it goes to the next event. Do you think this sort of thing could be added with some reasonable amount of difficulty to pandas structures? Or does everything have to be in memory before the DataFrame can exist? I guess my question can be generalized to “Can pandas gracefully handle huge datasets?”

    Wes McKinney Reply:

    I think unfortunately the answer to that (the big data question) at the moment is “no, it doesn’t handle big data gracefully”. However, you could imagine objects with a similar interface / API which point to huge datasets stored on disk. It’s mainly that I haven’t had to work with such huge datasets =) Also some work is being done in Python to improve its ability to easily work with big data. NumPy has a memory-map interface to process large datasets on disk, but what’s really needed is so-called “deferred arrays”– so I could select a column of a DataFrame backed by a large memory-mapped file but that doesn’t actually pull in data until a function I pass it to (e.g. a plotting function) needs it. Peter Wang (see his talk on Metagraph from the recent SciPy 2011 conference) and some others are working on this.
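A minimal sketch of the NumPy memory-map interface mentioned here, using a small temporary file as a stand-in for a large on-disk dataset. The record layout and file name are made up for illustration, and this is not a "deferred array": it only shows that a structured array can be mapped from disk and one field selected without first reading the file into memory.

```python
import os
import tempfile

import numpy as np

# Record layout for one event (hypothetical field names).
record = np.dtype([("x", "f8"), ("y", "f8"), ("z", "f8")])
n = 1000

# Write synthetic events to a raw binary file.
rng = np.random.default_rng(0)
events = np.zeros(n, dtype=record)
events["x"] = rng.normal(0, 10, n)
events["y"] = rng.normal(-10, 4, n)
events["z"] = rng.normal(100, 3, n)
path = os.path.join(tempfile.mkdtemp(), "events.dat")
events.tofile(path)

# Map the file read-only; data is paged in by the OS as it is accessed.
mapped = np.memmap(path, dtype=record, mode="r", shape=(n,))

# Selecting a field gives a view onto the mapped buffer, not a copy.
x_column = mapped["x"]
x_mean = float(np.mean(x_column))
```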

    Matt Hollingsworth Reply:

    Sounds like this is something that could be added later then, probably best to focus on one thing at a time. I think I’ll start looking at implementing the core functionality first and then worry about the scalability of it later–it seems to me that the pandas structures should work quite well for it.

    Thanks again for all the info!

  • http://twitter.com/MySchizoBuddy MySchizo Buddy

    people who use phrases like this “most powerful data analysis stack to ever exist” shouldn’t be taken seriously

    Wes McKinney Reply:

    So you’re saying it’s wrong to strive to improve the status quo and thus I shouldn’t be taken seriously (your initial comment was “people who use phrases like this “most powerful data analysis stack to ever exist” shouldn’t be taken seriously”) ? I think if you look at what I’ve done (hacked, shipped, written) in the 18 months since writing this blog post you’ll see that I’ve more than delivered on my goal to make Python a compelling choice for data analysis.

    sfermigier Reply:

    Don’t bother with this bozo, Wes. What you’ve already achieved over the last 18 months is just fantastic. Keep up the good work !
