NYCPython 1/10/2012: A look inside pandas design and development

I had the privilege of speaking last night at the NYCPython meetup group. I’ve given tons of “use pandas!” talks so I thought I would take a slightly different angle and talk about some of the design and implementation work that I’ve done for getting good performance in critical data manipulations. I’ll turn some of this material into some blog articles in the near future.

Wes McKinney: pandas design and development from Adam Klein on Vimeo.

Here’s some more video footage shot by my awesome friend Emily Paup with her HDSLR.

I did a little interactive demo (using the ever-amazing IPython HTML Notebook) on Ashley Williams’s Food Nutrient JSON Database:

Link to PDF output of Demo

Link to IPython Notebook file

If you want to run the code in the IPython notebook, you’ll have to download the food database file above.

The audience and I learned from this demo that if you’re after Tryptophan, “Sea lion, Steller, meat with fat (Alaska Native)” is where to get it (in the highest density).

  • Christopher Lee-Messer

    More great stuff. Thanks Wes.
    I noticed I needed to run from git development (0.7dev) and install bottleneck separately in order to follow along.

    Wes McKinney Reply:

    I fixed the bottleneck issue (it’s now a soft dependency), only added that in yesterday. I hope to have the 0.7.0 final release out within a week.

  • Anonymous

    Can you pretty please link to the .ipynb file? I found them tremendously useful! Thanks!

    Wes McKinney Reply:

    Posted!

  • vgoklani

    can you please publish the iPython file

  • Charles Harris

    And now for the important question ;) How do you like the Kinesis Advantage keyboard?

    Wes McKinney Reply:

    Hard to imagine using Emacs without it. I am a big fan.

  • Mutatedmonkeygenes

    Hi, Is there a simple way of defining column headers from nested JSON structures? For example:

    'description': 'Cheese, caraway',
    'group': 'Dairy and Egg Products',
    'id': 1008,
    'manufacturer': '',
    'nutrients': [{'description': 'Protein',
                   'group': 'Composition',
                   'units': 'g',
                   'value': 25.18},
                  {'description': 'Total lipid (fat)',
                   'group': 'Composition',
                   'units': 'g',
                   'value': 29.2},

    how would I include the 'description' key from 'nutrients' in the id_keys definition:
    id_keys = ['description', 'group', 'id', 'manufacturer'] => id_keys = ['description', 'group', 'id', 'manufacturer', 'nutrients.[0].description'].

    Do I have to first build a separate inner-DataFrame, and then do a join?

    Wes McKinney Reply:

    Yes, for now. I would like to build out the capability to convert a JSON structure into a DataFrame or series of DataFrames
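
The "separate inner DataFrame, then join" approach confirmed above might look roughly like the following. This is a minimal sketch, not code from the talk: the single record is copied from the excerpt in the question (truncated to two nutrients), and the variable names `db`, `info`, `nutrients`, and `merged` as well as the `_nutrient` / `_food` suffixes are assumptions chosen for illustration.

```python
import pandas as pd

# One record copied from the excerpt in the question (only two nutrients shown there).
db = [
    {
        "description": "Cheese, caraway",
        "group": "Dairy and Egg Products",
        "id": 1008,
        "manufacturer": "",
        "nutrients": [
            {"description": "Protein", "group": "Composition",
             "units": "g", "value": 25.18},
            {"description": "Total lipid (fat)", "group": "Composition",
             "units": "g", "value": 29.2},
        ],
    }
]

id_keys = ["description", "group", "id", "manufacturer"]

# Outer table: one row per food, restricted to the top-level keys.
info = pd.DataFrame(db, columns=id_keys)

# Inner table: one row per (food, nutrient), tagged with the owning food's id.
pieces = []
for rec in db:
    nuts = pd.DataFrame(rec["nutrients"])
    nuts["id"] = rec["id"]
    pieces.append(nuts)
nutrients = pd.concat(pieces, ignore_index=True)

# Join on 'id'. Both tables have 'description' and 'group' columns,
# so suffixes are used to keep them apart (suffix names are illustrative).
merged = pd.merge(nutrients, info, on="id", suffixes=("_nutrient", "_food"))
print(merged)
```

Each row of `merged` then carries the nutrient's description next to the food-level fields, which is the effect the hypothetical `'nutrients.[0].description'` key was asking for.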

  • elypma

    Hi Wes,

    Many thanks for all your efforts. I just discovered pandas so I have a lot to learn. First question:

    Within our company we use a column based datafile format where the first row contains the variable names, the second row contains the units of the variables (as [m] for meters) and the third row down contains the data (separated by spaces).
    Before pandas I used a simple python class which gives me access to the variable names, their units and their data. And now the question:
    What is the most logical way in pandas to include the units in the data structure? For now I combine each variable name with its units and use that as a Series name but this lacks elegance.

    Wes McKinney Reply:

    This is a very common request. I would like to have a way to include “column metadata” (like units or other variable descriptions) in DataFrame. Unfortunately it’s not very relevant to most of the problems I work on (which is why it hasn’t been implemented yet), so until it is, I’m going to have to wait for someone who needs it to get more involved with the project’s development.

    elypma Reply:

    Hm, I get your point.
    I can see what I can do, but promises are difficult.
    Do you already have some requirements, implementation ideas etc?
    For a start, I think not only column metadata might be an idea, but also table & panel metadata.

    kdebrab Reply:

    To me, the MultiIndex seems to be very appropriate for handling column metadata.

    An example:

    import pandas as pd
    from numpy.random import randn
    dates = pd.date_range('1/1/2012', periods=5, name='date')
    parameters_with_units = pd.MultiIndex.from_tuples([('distance', '[m]', 'How far it is'), ('speed', '[m/s]', 'How fast it is')], names=['parameter', 'unit', 'description'])
    df = pd.DataFrame(randn(5, 2), index=dates, columns=parameters_with_units)
    df

    returns:

    parameter        distance          speed
    unit                  [m]          [m/s]
    description How far it is How fast it is
    date
    2012-01-01      -0.135680       1.467479
    2012-01-02      -1.920341      -0.257415
    2012-01-03      -0.489439      -0.499131
    2012-01-04      -0.473644      -0.137563
    2012-01-05      -0.852378      -1.360605

    One can include any custom defined object in the MultiIndex and thus include whatever column metadata one needs.

    So, maybe development could be limited to adding a few lines of extra documentation…

    elypma Reply:

    It’s been a while since I looked at this. I tried the MultiIndex (as above) and at first it seems to work fine, but only until I try to filter the resulting table like:

    df[df["speed"]>0]

    Any suggestions?

    elypma Reply:

    When creating a DataFrame with MultiIndex columns it seems not possible to return a single column with a MultiIndex. Instead, an object with an Index is returned:

    import numpy as np
    import pandas as pd
    from numpy.random import randn

    dates = np.asarray(pd.date_range('1/1/2000', periods=8))
    _metaInfo = pd.MultiIndex.from_tuples([('AA', '[m]'), ('BB', '[m]'), ('CC', '[s]'), ('DD', '[s]')], names=['parameter', 'unit'])

    df = pd.DataFrame(randn(8, 4), index=dates, columns=_metaInfo)
    df.get('AA').columns

    Index([[m]], dtype=object)

    where the ‘parameter’ info is missing.

    This problem might also screw up operations like:

    df[df["AA"]>0]

    Or did I miss something?
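
One possible way around this, sketched under the assumption of a reasonably recent pandas release (the behavior in the version discussed in this thread may differ): selecting a column by its full tuple key returns a Series that can serve as a row mask, and `DataFrame.xs` with `drop_level=False` selects on a single level while keeping the whole MultiIndex. The names `meta`, `aa`, `filtered`, and `aa_frame` are made up for this example, and a seeded generator stands in for `randn`.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("1/1/2000", periods=8)
meta = pd.MultiIndex.from_tuples(
    [("AA", "[m]"), ("BB", "[m]"), ("CC", "[s]"), ("DD", "[s]")],
    names=["parameter", "unit"],
)

# Seeded random data so the example is repeatable.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((8, 4)), index=dates, columns=meta)

# Giving every level of the column key returns a Series, not a sub-frame,
# so the comparison yields a one-dimensional boolean row mask.
aa = df[("AA", "[m]")]
filtered = df[aa > 0]

# Selecting on one level with drop_level=False keeps the full column
# MultiIndex, so the 'parameter' level is not lost.
aa_frame = df.xs("AA", axis=1, level="parameter", drop_level=False)

print(filtered)
print(aa_frame.columns)
```

The trade-off is that the mask has to name the unit explicitly (`("AA", "[m]")`), which is arguably the price of storing metadata in the column labels themselves.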
