Adventures in Aggregating Data (Group By)

I’m a big fan of Interactive Brokers. Cheap trading, great APIs, lightning fast execution with no frills, designed for the professional trader. Perfect for people like me. Here’s the background for this article: I started writing short-dated out-of-the-money call options on my stock and ETF positions last year as a way to hedge my downside risk (while limiting upside) and maybe even make a little extra income on the side in this volatile and frequently sideways market. Recently I was looking over my brokerage statements and trying to get a grasp on my realized and mark-to-market PnL (profit-and-loss) this year and last since we’re talking about equity positions plus options that expired in each month since the beginning of the year. After poring over the various forms of reports I can get from IB I concluded the quickest-and-dirtiest (not to mention the most fun) way was to write a little Python program to do it for me.

I’ll spare you the munging bit of getting the data out of the HTML activity statement using BeautifulSoup and skip to where I’ve scraped the relevant data into a DataFrame. Here’s some faux data for illustration purposes:

In [65]: data
Out[65]:
     Symbol                 Underlying   Kind     MTM     Realized
0    AAPL                   AAPL         Stock    120.5   108.5
1    CTXS                   CTXS         Stock    101.5   91.39
2    NFLX                   NFLX         Stock   -193.5  -174.1
3    SPY                    SPY          Stock    67.77   60.99
4    USO                    USO          Stock   -3.355  -3.019
5    VMW                    VMW          Stock    39.13   35.21
6    AAPL 18JUN11 350.0 C   AAPL         Option  -52.76  -47.48
7    AAPL 18JUN11 355.0 C   AAPL         Option   211.2   190.1
8    CTXS 18JUN11 87.5 C    CTXS         Option   65.17   58.65
9    CTXS 18JUN11 90.0 C    CTXS         Option  -33.16  -29.84
10   NFLX 18JUN11 270.0 C   NFLX         Option   80.38   72.34
11   NFLX 18JUN11 280.0 C   NFLX         Option   76.82   69.14
12   SPY 16APR11 132.0 C    SPY          Option   54.25   48.83
13   USO 16APR11 44.0 C     USO          Option  -122.4  -110.1
14   USO 16APR11 45.0 C     USO          Option   73.06   65.76
15   USO 18JUN11 40.0 C     USO          Option   50.96   45.86
16   VMW 18JUN11 100.0 C    VMW          Option   25.59   23.03
17   VMW 18JUN11 105.0 C    VMW          Option   91.11   82

 

So, I want to aggregate the Realized and MTM columns by Underlying and by whether it was an equity or option instrument. Were this data in a SQL table, this operation could be expressed extremely concisely:

SELECT Underlying, Kind,
SUM(Realized) AS Realized, SUM(MTM) AS MTM
FROM MTMReport
GROUP BY Underlying, Kind
ORDER BY Underlying

But since I like doing interactive data analysis in Python (and loath writing lots of SQL queries), I’d rather do this with pandas and also produce some easier-to-read console output. Here’s what I ended up with:

mtm = {}
realized = {}
for kind, group in data.groupby('Kind'):
    mtm_by_underlying = group['MTM'].groupby(group['Underlying'])
    realized_by_underlying = group['Realized'].groupby(group['Underlying'])
    mtm[kind] = mtm_by_underlying.sum()
    realized[kind] = realized_by_underlying.sum()

# Convert to DataFrame and fill NA values with 0
mtm = DataFrame(mtm).fillna(0)
realized = DataFrame(realized).fillna(0)

And it outputs exactly what I’m looking for:

In [77]: mtm
Out[77]:
       Option   Stock
AAPL   158.4    120.5
CTXS   32.01    101.5
NFLX   157.2   -193.5
SPY    54.25    67.77
USO    1.668   -3.355
VMW    116.7    39.13

In [79]: mtm.sum(1)
Out[79]:
AAPL    278.941528842
CTXS    133.551244568
NFLX    -36.2669453776
SPY     122.022253941
USO     -1.68704136356
VMW     155.829013419

It occurred to me after writing this code that, with some changes to the GroupBy functionality in pandas, this could be made *much easier*. Ideally I’d like to write:

grouped = frame.groupby('Underlying', 'Kind')
mtm = grouped['MTM'].sum()
realized = grouped['Realized'].sum()

I’ll get to work on this in the pandas repo.

Here’s the Gist of the full script