Someone recently asked me about creating cross-tabulations and contingency tables using pandas. I wrote a bit about this in October after implementing the pivot_table function for DataFrame. So I thought I would give a few more examples and show R code vs. the equivalent pandas code which will be helpful for those already used to the R way of doing things.
When calling help(xtabs) (or help(table)) in R, you get a number of examples for using table, xtabs, and ftable:
One of the really great things about pandas is that the object produced is still a DataFrame (the R object is of a special class)! So we could do anything with it that we do with normal DataFrame objects:
> arr <- rep(letters, 1000)
> system.time(for (i in 1:1000) foo <- factor(arr))
user system elapsed
1.809 0.023 1.821
> arr <- 1:10000
> system.time(for (i in 1:1000) foo <- factor(arr))
user system elapsed
7.782 0.088 8.261
Here’s the equivalent code using the pandas Factor class (git HEAD):
In [35]: import string
In [36]: arr = np.tile(list(string.letters[:26]), 1000).astype(object)
In [37]: Factor(arr)
Out[37]:
Factor:
array([a, b, c, ..., x, y, z], dtype=object)
Levels (26): [a b c d e f g h i j k l m n o p q r s t u v w x y z]
In [38]: timeit Factor(arr)
1000 loops, best of 3: 869 us per loop
In [39]: arr = np.arange(10000, dtype=object)
In [40]: timeit Factor(arr)
1000 loops, best of 3: 1.91 ms per loop
I just edited this up since R 2.11.1 is a lot slower than 2.13.1 (the latest release) so it wasn’t quite fair to perform comparisons off of that. So there are two cases here:
Few unique values (the 26 letters of the alphabet). R does pretty well here, about 2x slower. This is down from 12x slower in R 2.11.1, so I can only assume someone on the R core team noticed this in the last year or so.
All unique values. R is about 4x slower now.
It’s possible I’m using a different algorithm (and so is R between 2.11 and 2.13!). Maybe Python’s much lauded dict implementation is responsible? I’m more curious than anything. Maybe an R guru could let me know (I’m afraid to look at the GPL source code).
For the record, here’s the Python and Cython code involved. First, the Python code:
A couple of weeks ago in my inaugural blog post I wrote about the state of GroupBy in pandas and gave an example application. However, I was dissatisfied with the limited expressiveness (see the end of the article), so I decided to invest some serious time in the groupby functionality in pandas over the last 2 weeks in beefing up what you can do. So this article is a part show-and-tell, part quick tutorial on the new features. Note that I haven’t added a lot of this to the official documentation yet.
GroupBy primer
GroupBy may be one of the least well-understood features in pandas. Let’s consider a DataFrame (basically a labeled 2D array):
Here, the index (row labels) contains dates and the columns are names for each time series. When performing a groupby operation, we may different goals:
Perform an aggregation, like computing the sum of mean of each group. Functionally this means applying a function to each group and putting the aggregated results into a DataFrame
Slicing the DataFrame into chunks (groups) and then doing something with the slices
Performing a transformation, like standardizing each group (computing the zscore)
So there are two tasks: first, grouping the data; second, doing something with the grouped data. As far as grouping, in a totally abstract sense we have to define some kind of mapping that assigns labels (one of the axes) into group buckets. A mapping could be one of many things:
A Python function, to be called on each label
A dict, containing {label : group_name} mappings
An array the same length as the axis containing the group correspondences. This may often be one of the columns in a DataFrame, but need not be. Every other mapping type gets rendered to this form eventually.
So in the above data, let’s say we wanted to group the data by year. Since the row labels are Python datetime objects, we can access the year attribute when calling groupby
In [21]: grouped = df.groupby(lambda x: x.year)
This returns an object of type GroupBy. With this object, you can do a lot of things including: iteration, aggregation, or transformation. So we could iterate like so:
In [20]: for year, group in grouped:
....: print year
....: print group
....:
2000
A B C D
2000-03-31 00:00:00 0.6304 -1.493 0.2082 0.3378
2000-06-30 00:00:00 -0.1041 1.093 0.1227 -0.2362
2000-09-29 00:00:00 -1.822 0.7325 -0.4964 -1.707
2000-12-29 00:00:00 0.3325 -0.2213 0.5465 1.265
2001
A B C D
2001-03-30 00:00:00 0.8422 -0.1265 0.1234 0.7743
2001-06-29 00:00:00 -0.07774 -0.3906 -0.6328 -1.259
2001-09-28 00:00:00 -0.3417 -0.4209 1.724 0.3069
2001-12-31 00:00:00 0.493 -0.2815 1.314 1.004
2002
A B C D
2002-03-29 00:00:00 1.003 -0.4807 0.7232 1.079
2002-06-28 00:00:00 0.1415 -2.275 0.6178 0.8817
2002-09-30 00:00:00 -0.6855 0.3921 -1.468 1.522
2002-12-31 00:00:00 -1.635 0.3083 0.1209 -0.5606
2003
A B C D
2003-03-31 00:00:00 0.5707 -0.7125 -0.07677 0.5154
2003-06-30 00:00:00 0.3646 -0.8574 -0.562 0.2176
2003-09-30 00:00:00 1.024 -0.8113 0.6771 -1.261
Or aggregate like so:
In [21]: grouped.aggregate(np.sum)
Out[21]:
A B C D
2000 -0.9629 0.1108 0.3811 -0.3405
2001 0.9159 -1.219 2.529 0.8268
2002 -1.176 -2.055 -0.006253 2.921
2003 1.959 -2.381 0.03838 -0.528
Or transform groups like so (here I standardize each group):
Some aggregations are so common that they’re provided as instance methods on the GroupBy object:
In [24]: grouped.sum()
Out[24]:
A B C D
2000 -0.9629 0.1108 0.3811 -0.3405
2001 0.9159 -1.219 2.529 0.8268
2002 -1.176 -2.055 -0.006253 2.921
2003 1.959 -2.381 0.03838 -0.528
In [25]: grouped.mean()
Out[25]:
A B C D
2000 -0.2407 0.02769 0.09526 -0.08512
2001 0.229 -0.3049 0.6324 0.2067
2002 -0.2939 -0.5138 -0.001563 0.7303
2003 0.6531 -0.7938 0.01279 -0.176
For those with SQL experience, a DataFrame can be treated like an SQL table. As such, grouping “by columns” is quite natural:
In [27]: df2
Out[27]:
A B C D
0 foo one -0.7883 0.7743
1 bar one -0.5866 0.06009
2 foo two 0.9312 1.2
3 bar three -0.6417 0.3444
4 foo two -0.8841 -0.08126
5 bar two 1.194 -0.7933
6 foo one -1.624 -0.1403
7 foo three 0.5046 0.5833
In [28]: for name, group in df2.groupby('A'):
....: print name
....: print group
....:
bar
A B C D
1 bar one -0.5866 0.06009
3 bar three -0.6417 0.3444
5 bar two 1.194 -0.7933
foo
A B C D
0 foo one -0.7883 0.7743
2 foo two 0.9312 1.2
4 foo two -0.8841 -0.08126
6 foo one -1.624 -0.1403
7 foo three 0.5046 0.5833
Note the presence of the “nuisance” B column when we group by A. I want this to work:
In [30]: df2.groupby('A').mean()
Out[30]:
C D
bar -0.01137 -0.1296
foo -0.3722 0.4671
More on dealing with this below.
New: Column selection
In my last post I mentioned wanting to be able to do a groupby-operation on a single column of a DataFrame using another column or set of columns as the groupers. So in the above example, what I want is:
grouped = df['C'].groupby(df['A'])
# or
grouped = df['C'].groupby([df['A'], df['B']])
This is too verbose / awkward for my tastes. So now you can do:
In [111]: df2.groupby('A')['C'].mean()
Out[111]:
bar -0.447919901908
foo -0.158016186685
Behind the scenes, this simply passes the C column to a Series GroupBy object along with the already-computed grouping(s).
New: Group by multiple columns / key functions
The ability to group by multiple criteria (just like SQL) has been one of my most desired GroupBy features for a long time. It’s also very hard to implement efficiently. So I bit the bullet and carefully crafted Cython code to speedily do aggregations on data grouped by multiple columns. So now you can do:
In [113]: df2.groupby(['A', 'B']).mean()
Out[113]:
A B C D
0 foo one 0.3558 -0.4818
1 foo two -0.2758 -1.271
2 foo three -0.95 1.709
3 bar one 1.916 -0.5093
4 bar two -2.315 0.1177
5 bar three -0.9447 0.2519
Iteration with multiple groups flattens the key pairs:
In [114]: for k1, k2, group in df2.groupby(['A', 'B']):
.....: print k1, k2
.....: print group
.....:
foo one
A B C D
0 foo one 0.8248 0.3856
6 foo one -0.1133 -1.349
foo two
A B C D
2 foo two -1.175 -1.036
4 foo two 0.6239 -1.506
foo three
A B C D
7 foo three -0.95 1.709
bar one
A B C D
1 bar one 1.916 -0.5093
bar two
A B C D
5 bar two -2.315 0.1177
bar three
A B C D
3 bar three -0.9447 0.2519
As a less trivial example, I loaded some data from the Bureau of Labor Statistics, which I’ll write more about another time:
In [126]: df = read_table('ce.data.44.FinActOtherCurr.txt',
sep='\s+', index_col=None)
In [127]: df
Out[127]:
<class 'pandas.core.frame.DataFrame'>
Index: 32806 entries, 0 to 32805
Data columns:
series_id 32806 non-null values
year 32806 non-null values
period 32806 non-null values
value 32806 non-null values
dtypes: int64(1), float64(1), object(2)
This data set has 118 unique series_id’s and spans 22 years (the period column gives the month of observation), so we’re talking about 2596 unique groups. Let’s say we wanted to compute the mean value for each year, it’s pretty simple with the improved groupby:
In [129]: df.groupby(['series_id', 'year']).mean()
Out[129]:
<class 'pandas.core.frame.DataFrame'>
Index: 2596 entries, 0 to 2595
Data columns:
series_id 2596 non-null values
value 2596 non-null values
year 2596 non-null values
dtypes: int64(1), float64(1), object(1)
I was also pretty impressed with how fast this operation is (the power of Cython!):
In [130]: timeit df.groupby(['series_id', 'year']).mean()
10 loops, best of 3: 22.9 ms per loop
I haven’t compared this with an SQL engine (SQLite, MySQL, Postgres, …) but I hope that it’s competitive. If you call a non-Cythonized function, the speed goes down considerably:
In [132]: timeit df.groupby(['series_id', 'year'])['value'].std()
1 loops, best of 3: 232 ms per loop
Edit: It’s also possible to pass multiple functions instead of column names or to mix and match column names with functions:
It’s pretty common to group by a column and ignore other columns containing non-floating point data. It used to be that you’d get an error, forcing you to first drop the “nuisance” columns, e.g. suppose we only want to group by the A column here:
In [133]: df2
Out[133]:
A B C D
0 foo one 0.8248 0.3856
1 bar one 1.916 -0.5093
2 foo two -1.175 -1.036
3 bar three -0.9447 0.2519
4 foo two 0.6239 -1.506
5 bar two -2.315 0.1177
6 foo one -0.1133 -1.349
7 foo three -0.95 1.709
If you try to compute the mean of the B column, things won’t quite work out:
In [134]: grouped = df2.groupby('A')['B'].mean()
TypeError: unsupported operand type(s) for /: 'str' and 'float'
To cope with this, so-called “nuisance” columns are now automatically dropped. While I don’t like discarding information, I think “practicality beats purity” in this case:
In [135]: df2.groupby('A').mean()
Out[135]:
C D
bar -0.4479 -0.0466
foo -0.158 -0.3593
New: “auto-dispatching” to DataFrame / Series
Since very few functions in DataFrame and Series/TimeSeries are implemented on the GroupBy classes, you may wish to simply invoke a Series instance method on each group, let’s say:
But since we’re got Python at our disposal, it’s relatively straightforward to override __getattribute__ to wrap and dispatch calls to Series/DataFrame instance methods so you can do this:
This has always been around but not well-known by many people. Suppose you want to apply multiple functions to a group and collect all the results. You can pass a dict of functions:
I have a mind to expand on this idea for DataFrame objects to get us closer to the full flexibility of SQL with Pythonic syntax. If anyone has ideas or suggestions please let me know.
Conclusions
GroupBy is certainly not done. If you use these tools and find them useful, please let me know. If you can think of ways to make them better, that would be nice information too. I think it would be great to implement a full SQL engine on top of pandas (similar to the SAS “proc sql”), and this new GroupBy functionality gets us closer to that goal.
So, this post is a bit of a brain dump on rich data structures in Python and what needs to happen in the very near future. I care about them for statistical computing (I want to build a statistical computing environment that trounces R) and financial data analysis (all evidence leads me to believe that Python is the best all-around tool for the finance space). Other people in the scientific Python community want them for numerous other applications: geophysics, neuroscience, etc. It’s really hard to make everyone happy with a single solution. But the current state of affairs has me rather anxious. And I’d like to explain why. For a really quick summary on some of the work I’ve been doing, here’s my SciPy 2010 talk slides:
In the wake of SciPy 2011 I’ve been thinking a lot about the way forward from here in terms of building rich Pythonic data structures for statistics and many, many other fields. By rich I mean: is not just a NumPy ndarray, contains metadata (however we define metadata) and has operations which depend on the metadata, and in general does far more than structured arrays currently do for you. This touches on a great many topics and features that people want (partial list):
Manipulating heterogeneously-typed (what I loosely call “mixed type”) data
Size mutability: can add easily add “columns” or otherwise N-1-dimensional hyperslices without necessarily copying data
Metadata about what each axis represents: axis=3 is less meaningful than axis=’temperature’
Metadata about axis labels (ticks)
Label / tick-based data alignment / reshaping, either automatic or explicit
Label / tick-based (fancy) indexing, both setting and getting
Hierarchical columns
Robust handling of missing / NA data (a là R)
Tons of operations needing heterogeneous data and metadata: group by, filtering, sorting, selection / querying, reindexing (reshaping to conform to a new set of labels), axis selection based on names, etc. etc.
The list goes on and on. I could write a 50-page manuscript on the exact specification of what exact functionality is desired on each of the above bullet points. What I do know is that after using a rich data structure like the ones in pandas, it’s very, very hard to go back to using vanilla ndarrays. To wit, R users coming to Python they have a similar experience: the lack of data.frame and all the functions which operate on data.frames is a gigantic loss of functionality. When I work with MATLAB and R users (especially in the finance industry) and get them up and running with pandas, I get a lot of “where was this tool all my life?”. It’s just that much better. In fact, users can get by with only a very rudimentary understanding of NumPy if the data structures are good enough; I think this is highly desirable. Even for purely interactive data analysis (forget operations which actually utilize the metadata), isn’t this much better:
Of course if data were a structured ndarray you would be completely up a creek (most NumPy functions do not play well with structured arrays). But that’s another topic.
But anyway, to the point of why I’m writing: we have a ton of really talented people with real problems to solve, and lots of great ideas about how to solve them. Last year at SciPy 2010 in Austin we had a Birds of a Feather session led by the venerable Fernando Pérez and myself to talk about the datarray project, pandas, tabular, larry, and other various ideas about data structures that people have kicked around. The topic is important enough that Enthought hosted a gathering this past May in Austin, the DataArray Summit, to talk about these issues and figure out where to go from here. It was a great meeting and we hashed out in gory detail many of the problem areas that we’d like to solve with a richer data structure. So that’s a bit of the backstory.
But even given all these great discussions and work being done, we have a really fundamental problem:
Fragmentation is killing us
There, I said it =) All us NumPy ninjas can navigated the fragmented, incohesive collection of data structures and tools, but it’s confusing as hell for new and existing users. NumPy alone is not good enough for a lot of people (statisticians, financial data analysts, etc.), but they’re left with a confusing choice between pandas, larry, datarray, or something else. Also, these tools have largely not been integrated with any other tools because of the community’s collective commitment anxiety. We talk, hem and haw, and wring our hands. And still no integration. I don’t mean to complain: I just deeply care about making the scientific Python stack the most powerful data analysis stack to ever exist. Seriously. And I can’t do it alone. And I don’t want to make unilateral decisions and shove anything down anyone’s throat. We’ve started working on integration of pandas in statsmodels (which is already going to make a really huge difference), but we need to collectively get our proverbial sh*t together. And soon.
My work on pandas lately and why it matters
On my end, in the 2 months since the DataArray summit, I decided to put my PhD on hold and focus more seriously on Python development, statistics / statsmodels, pandas, and other projects I care deeply about. So I’ve easily invested more time in pandas in the last 2 months than in the previous 2 years. This included heavily redesigned internals (which I’ve extremely pleased with) and tackling various other thorny internal issues which had long been an annoyance for me and an external source of criticism by end users (like, “why are there 2 2-dimensional data structures, DataFrame and DataMatrix?” The answer is that the internal implementation was different, with different performance characteristics depending on use). I’m also trying to clear out my extensive backlog of wishlist features (lots of blog articles to come on these). What’s happened is that I started prototyping a new data structure which I’m calling NDFrame which I think is really going to be a big deal. Basically, the NDFrame is a by-product of the redesigned internal data structure (currently called BlockManager) backing DataFrame and friends. I can and should write an entire post about BlockManager and exactly what it needs to accomplish, but the short story is that BlockManager:
Stores an arbitrary collection of homogeneously-typed N-dimensional NumPy ndarray objects (I call these Block objects or simple blocks)
Has axis labels and is capable of reshaping the “blocks” to a new set of labels
Is capable of consolidating blocks (gluing them together) having the same dtype
Knows how to appropriately introduce missing data, and upcast dtypes (int, bool) when the missing data marker (NaN, currently. Crossing my fingers for a good NA implementation!) needs to be introduced
Can deal with single “item” type casts (ha, try to do that with structured arrays!)
Can accept new blocks without copying data
Maybe this is too meta (heh) for some people. As illustration, here’s literally what’s going on inside DataFrame now after all my latest hacking:
In [40]: df
Out[40]:
b a c e f
0 1.213 1.507 True 1 a
1 -0.6765 0.06237 True 1 b
2 0.3126 -0.2575 False 1 c
3 0.1505 0.2242 True 1 d
4 -0.7952 0.2909 True 1 e
5 1.341 -0.9712 False 2 f
6 0.01121 1.654 True 2 g
7 -0.173 -1.385 False 2 h
8 0.1637 -0.898 False 2 i
9 0.5979 -1.035 False 2 j
In [41]: df._data
Out[41]:
BlockManager
Items: [b a c e f]
Axis 1: [0 1 2 3 4 5 6 7 8 9]
FloatBlock: [b a], 2 x 10, dtype float64
BoolBlock: [c], 1 x 10, dtype bool
IntBlock: [e], 1 x 10, dtype int64
ObjectBlock: [f], 1 x 10, dtype object
The user would of course never be intended to look at this, it’s purely internal. But for example things like this are OK to do:
In [42]: df['e'] = df['e'].astype(float)
In [43]: df._data
Out[43]:
BlockManager
Items: [b a c e f]
Axis 1: [0 1 2 3 4 5 6 7 8 9]
FloatBlock: [b a], 2 x 10, dtype float64
BoolBlock: [c], 1 x 10, dtype bool
ObjectBlock: [f], 1 x 10, dtype object
FloatBlock: [e], 1 x 10, dtype float64
Since in that case there are now multiple float blocks, they can be explicitly consolidated, but if you use DataFrame many operations will cause it to happen automatically (which is highly desirable, especially when you have only one dtype, for doing row-oriented operations):
In [44]: df._data.consolidate()
Out[44]:
BlockManager
Items: [b a c e f]
Axis 1: [0 1 2 3 4 5 6 7 8 9]
BoolBlock: [c], 1 x 10, dtype bool
FloatBlock: [b a e], 3 x 10, dtype float64
ObjectBlock: [f], 1 x 10, dtype object
Now, this is a pretty abstract business. Here’s my point: when I started thinking about NDFrame, a user-facing n-dimensional data structure backed by BlockManager, I realized that what I am going to build is a nearly strict superset of the functionality provided by every other rich data structure I know of. I made a picture of the feature overlap, and note that arrows loosely mean: “can be used to implement”:
For example, I need to write generic fancy indexing on the NDFrame, a task largely tackled by DataArray. So rather than reinvent the wheel, I should just co-opt that code (love that BSD license), but then I’ve effectively created a fork (nooooo!). I think having all these different libraries (and leaving users the confusing choice between them) is kind of nuts. Ideally DataArray (homogeneous) should just be a part of pandas (and I’m not opposed to changing the name, though it has stronger branding and far more users than datarray or larry). But once we’ve gone down that route, larry is just a DataArray (homogeneous) with automatic data alignment. We’re all doing the exact same kinds of things. So why not have one library?
Remember: the real world is heterogeneously-typed
Some other casual (and potentially controversial) observations
Here’s just a dumping ground of various thoughts I have on this and related topics:
The NumPy type hierarchy (int8, int16, int32, int64, uint8, uint16, …) isn’t that important to me. R and MATLAB don’t really have a type hierarchy and it doesn’t seem to pose a problem. So I haven’t gone down the road of creating Block objects mapping onto the NumPy type hierarchy. If someone wants to do it without complicating the end-user pandas experience, be my guest
Tying yourself to the ndarray is too restrictive. This is a major problem with DataArray and larry; they don’t do mixed-type data. So if you want to build something that competes with R you have failed before you have even begun by using a homogeneous-only data structure. Remember, DataFrame can be homogeneous whenever it wants to and getting the underlying ndarray is just a few keystrokes.
Structured arrays are probably not the answer. Size mutability (ability to add columns) and the ability to change dtypes are actually a big deal. As is the ability to do row-oriented computations, broadcasting, and all the things that structured arrays are (currently) unable to do. They are a darned convenient way of serializing / deserializing data, though. But memory layout and all that is far, far less important to me than usability / user interface / experience
I’m a big fan of Interactive Brokers. Cheap trading, great APIs, lightning fast execution with no frills, designed for the professional trader. Perfect for people like me. Here’s the background for this article: I started writing short-dated out-of-the-money call options on my stock and ETF positions last year as a way to hedge my downside risk (while limiting upside) and maybe even make a little extra income on the side in this volatile and frequently sideways market. Recently I was looking over my brokerage statements and trying to get a grasp on my realized and mark-to-market PnL (profit-and-loss) this year and last since we’re talking about equity positions plus options that expired in each month since the beginning of the year. After poring over the various forms of reports I can get from IB I concluded the quickest-and-dirtiest (not to mention the most fun) way was to write a little Python program to do it for me.
I’ll spare you the munging bit of getting the data out of the HTML activity statement using BeautifulSoup and skip to where I’ve scraped the relevant data into a DataFrame. Here’s some faux data for illustration purposes:
In [65]: data
Out[65]:
Symbol Underlying Kind MTM Realized
0 AAPL AAPL Stock 120.5 108.5
1 CTXS CTXS Stock 101.5 91.39
2 NFLX NFLX Stock -193.5 -174.1
3 SPY SPY Stock 67.77 60.99
4 USO USO Stock -3.355 -3.019
5 VMW VMW Stock 39.13 35.21
6 AAPL 18JUN11 350.0 C AAPL Option -52.76 -47.48
7 AAPL 18JUN11 355.0 C AAPL Option 211.2 190.1
8 CTXS 18JUN11 87.5 C CTXS Option 65.17 58.65
9 CTXS 18JUN11 90.0 C CTXS Option -33.16 -29.84
10 NFLX 18JUN11 270.0 C NFLX Option 80.38 72.34
11 NFLX 18JUN11 280.0 C NFLX Option 76.82 69.14
12 SPY 16APR11 132.0 C SPY Option 54.25 48.83
13 USO 16APR11 44.0 C USO Option -122.4 -110.1
14 USO 16APR11 45.0 C USO Option 73.06 65.76
15 USO 18JUN11 40.0 C USO Option 50.96 45.86
16 VMW 18JUN11 100.0 C VMW Option 25.59 23.03
17 VMW 18JUN11 105.0 C VMW Option 91.11 82
So, I want to aggregate the Realized and MTM columns by Underlying and by whether it was an equity or option instrument. Were this data in a SQL table, this operation could be expressed extremely concisely:
But since I like doing interactive data analysis in Python (and loath writing lots of SQL queries), I’d rather do this with pandas and also produce some easier-to-read console output. Here’s what I ended up with:
mtm ={}
realized ={} for kind, group in data.groupby('Kind'):
mtm_by_underlying = group['MTM'].groupby(group['Underlying'])
realized_by_underlying = group['Realized'].groupby(group['Underlying'])
mtm[kind]= mtm_by_underlying.sum()
realized[kind]= realized_by_underlying.sum()
# Convert to DataFrame and fill NA values with 0
mtm = DataFrame(mtm).fillna(0)
realized = DataFrame(realized).fillna(0)
In [79]: mtm.sum(1)
Out[79]:
AAPL 278.941528842
CTXS 133.551244568
NFLX -36.2669453776
SPY 122.022253941
USO -1.68704136356
VMW 155.829013419
It occurred to me after writing this code that, with some changes to the GroupBy functionality in pandas, this could be made *much easier*. Ideally I’d like to write:
I'm a NYC-based Python hacker, statistician, entrepreneur, and technology consultant. I love building powerful, intuitive research- and production-worthy tools for data-centric applications, especially in statistics and quantitative finance. I'm also interested in high performance and distributed computing, databases, GPU computing, and data visualization.
I am working on a new project to be unveiled in 2013 sometime. Stay tuned.