Monthly Archives: November 2011

Formatting DataFrame as HTML

Recently, a pandas user, Wouter Overmeire, contributed a to_html function to DataFrame in a pull request. It’s very handy:

In [31]: df
Out[31]:
                        ls    lsc   pop   ccode
year cname     agefrom                        
1950 Australia 15       64.3  15.4  558   AUS  
               40       38.9  20.1  555   AUS  
               65       24.7  10.7  273   AUS  
1955 Australia 25       48.4  26.2  705   AUS  
               50       34    16.6  478   AUS  
               75       26.3  11.7  240   AUS  
1960 Australia 35       47.9  24.7  755   AUS  
               60       29.6  13.5  389   AUS  
1965 Australia 20       57.7  31.1  832   AUS  
               45       41.3  20.3  665   AUS  
               70       30.2  12.9  273   AUS  
1970 Australia 30       57.7  31.1  789   AUS  
               55       42.1  19.4  619   AUS  
1975 Australia 15       91.1  23.1  1225  AUS  
               40       56.3  30.9  744   AUS  
               65       48.8  22.5  440   AUS  
1980 Australia 25       74    44.5  1205  AUS  
               50       60.2  32.8  774   AUS  
               75       62.5  30.8  500   AUS  
1985 Australia 35       72.7  42.8  1180  AUS  

In [32]: df.to_html()

which outputs an HTML table that renders like this:


                        ls    lsc   pop   ccode
year cname     agefrom
1950 Australia 15       64.3  15.4  558   AUS
1955 Australia 50       34    16.6  478   AUS
1965 Australia 20       57.7  31.1  832   AUS
1970 Australia 55       42.1  19.4  619   AUS
1980 Australia 25       74    44.5  1205  AUS
1985 Australia 60       59.1  30.8  685   AUS
1995 Australia 30       72    45.7  1430  AUS
2000 Australia 65       60    31.5  663   AUS
2010 Australia 35       59.9  39.1  1506  AUS
1950 Austria   70       2.8   1.9   218   AUT
1960 Austria   40       4.5   3     324   AUT
1965 Austria   75       13    8.6   322   AUT
1975 Austria   45       42.1  28    455   AUT
1985 Austria   15       22.4  4.2   626   AUT
1990 Austria   50       47.3  23.4  448   AUT
2000 Austria   20       73.2  63.2  465   AUT
2005 Austria   55       55.1  36.2  486   AUT
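
To drop the markup into a standalone file for viewing in a browser, here is a minimal sketch. It assumes a to_html that returns the markup as a string when given no buffer, which is how later pandas versions behave; the version shown above prints it instead:

# write the generated HTML table to a file (assumes to_html returns a string)
html = df.to_html()
with open('frame.html', 'w') as f:
    f.write(html)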

It also renders a tidy representation with hierarchical columns:

In [14]: table
Out[14]:
  sex     Female          Male          
  smoker  No      Yes     No      Yes  
day                                    
Fri       0.1653  0.2091  0.138   0.1447
Sat       0.148   0.1638  0.1621  0.1391
Sun       0.1657  0.2371  0.1583  0.174
Thur      0.156   0.1631  0.1657  0.1644


The HTML output renders the same table, preserving the two-level sex/smoker column hierarchy in the header.
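
For context, a table like this one can be produced from the tips dataset with pivot_table. A minimal sketch, assuming a tips DataFrame with day, sex, smoker, and tip_pct columns (the keyword names below are from recent pandas; versions of this era spelled them rows= and cols=):

# average tip percentage by day (rows), broken out by sex and smoker;
# the two column groupers become a hierarchical column index
table = tips.pivot_table('tip_pct', index='day',
                         columns=['sex', 'smoker'], aggfunc='mean')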

Talk at Rice Stats on structured data analysis, pandas, 11/21/2011

I had the privilege of speaking today at Rice at Hadley Wickham’s (of R fame) invitation. I talked broadly about the problems that come up in structured data manipulation and how I’ve worked to address them in pandas. I enjoyed speaking with Hadley (whom I’d never met in person before) about these problems, as we’ve independently arrived at nice solutions to many of them in different languages (R and Python, respectively). A lot of folks assume that parts of pandas are inspired by his plyr and reshape2 packages, but I’ve only started playing around with them recently. I think we all have a lot to learn from each other on the road toward building even better high performance, easy-to-use, and expressive data manipulation tools.

pandas talk at PyHPC 2011 workshop at SC11, thoughts on hash tables

Here are the slides from my talk at PyHPC 2011. It’s not really my usual talk for data crowds: a bit more nuts and bolts about some of the indexing and GroupBy implementation details. Some people might be interested in the illustrative data alignment benchmarks, which show the relative weakness of Python’s dict implementation (in both speed and memory usage) for lookups and alignment. After these benchmarks, I think it’s pretty much inevitable that I’ll end up writing a custom hash table implementation in C for data alignment on primitive types. Now, a threadsafe hash table that I could use OpenMP on would be a serious undertaking. Anyone want to help?

The basic problem is that Python dicts are not designed for my use case: very large dicts used to perform data alignment operations.
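
To make the use case concrete, here is a minimal sketch (not the actual pandas internals) of the dict-driven inner loop behind alignment: build a table mapping labels to positions, then look up every label of the other index. With millions of labels, the hash table’s speed and memory footprint dominate:

import numpy as np

def align_positions(old_labels, new_labels):
    # hash table from label -> integer position in the old index
    mapping = dict(zip(old_labels, range(len(old_labels))))
    # -1 marks labels absent from the old index (later filled with NaN)
    return np.array([mapping.get(lab, -1) for lab in new_labels])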


Filtering out duplicate pandas.DataFrame rows

Sean Taylor recently alerted me to the fact that there wasn’t an easy way to filter out duplicate rows in a pandas DataFrame. R has the duplicated function, which serves this purpose quite nicely. The R implementation strikes me as kludgy (from the docs: “The data frame method works by pasting together a character representation of the rows”), but in any case I set about writing a Python version from first principles.

A drop_duplicates method should be able to consider either a subset of the columns or all of them when determining which rows are “duplicates”. It occurred to me that a reasonably fast and efficient way to do this is with GroupBy. You have to know a bit about the internals, but it’s otherwise fairly straightforward:

# group on the columns that determine uniqueness; grouped.groups maps
# each group key to the list of row labels in that group
grouped = df.groupby(cols_to_consider)
index = [gp_keys[0] for gp_keys in grouped.groups.values()]
unique_df = df.reindex(index)
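
One caveat: grouped.groups is a dict, so the first-occurrence labels come back in no particular order. If the original row order matters, a hedged tweak (assuming sortable row labels, such as the default integer index) is to sort before reindexing:

# pick one label per group, then restore the original row order
index = sorted(gp_keys[0] for gp_keys in grouped.groups.values())
unique_df = df.reindex(index)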

For a faster and more direct implementation in Cython, I decided to use a list of tuples and a dict to keep track of whether a row has been “seen” or not:

import numpy as np

cimport numpy as np
from numpy cimport ndarray, uint8_t

def duplicated(list values, take_last=False):
    cdef:
        Py_ssize_t i, n
        dict seen = {}
        object row

    n = len(values)
    cdef ndarray[uint8_t] result = np.zeros(n, dtype=np.uint8)

    if take_last:
        # walk backward so the *last* occurrence of a row is kept as unique
        for i from n > i >= 0:
            row = values[i]
            if row in seen:
                result[i] = 1
            else:
                seen[row] = None
                result[i] = 0
    else:
        # walk forward so the *first* occurrence of a row is kept as unique
        for i from 0 <= i < n:
            row = values[i]
            if row in seen:
                result[i] = 1
            else:
                seen[row] = None
                result[i] = 0

    return result.view(np.bool_)

Aside: the uint8 array and the cast to np.bool_ are there because the buffer interface doesn’t work quite right with boolean arrays in Python 2.5. As soon as I drop Python 2.5 compatibility (probably around the end of the year), I’ll go through and fix that up. As a second aside, a dict with dummy values was coming out a bit faster than a set in Cython, for reasons unknown to me.
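
For readers without a Cython toolchain, a plain-Python sketch of the same algorithm (same dict/set bookkeeping, minus the typed loop) might look like:

import numpy as np

def duplicated_py(values, take_last=False):
    seen = set()
    n = len(values)
    result = np.zeros(n, dtype=bool)
    # scan backward to keep last occurrences, forward to keep first
    order = range(n - 1, -1, -1) if take_last else range(n)
    for i in order:
        if values[i] in seen:
            result[i] = True
        else:
            seen.add(values[i])
    return result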

So, on the Python side, the new DataFrame method just takes the boolean vector returned by that Cython function and uses it to select the rows:

In [21]: df
Out[21]:
    A  B  C
0   a  c  0
1   b  c  1
2   c  b  2
3   d  a  0
4   e  c  1
5   a  a  2
6   b  b  0
7   c  b  1
8   d  b  2
9   e  b  0
10  a  a  1
11  b  a  2
12  c  c  0
13  d  a  1
14  e  c  2

In [22]: df.drop_duplicates('A')
Out[22]:
   A  B  C
0  a  c  0
1  b  c  1
2  c  b  2
3  d  a  0
4  e  c  1

In [23]: df.drop_duplicates(['A', 'B'])
Out[23]:
    A  B  C
0   a  c  0
1   b  c  1
2   c  b  2
3   d  a  0
4   e  c  1
5   a  a  2
6   b  b  0
8   d  b  2
9   e  b  0
11  b  a  2
12  c  c  0

Not passing any particular column or columns is the same as passing all of them.
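
So, with the frame above, these two calls select the same rows:

df.drop_duplicates()                 # consider all columns
df.drop_duplicates(['A', 'B', 'C'])  # equivalent: pass every column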

I was impressed by the performance:

In [26]: from pandas.util.testing import rands

In [27]: k1 = [rands(1) for _ in xrange(100000)]

In [28]: k2 = [rands(1) for _ in xrange(100000)]

In [29]: df = DataFrame({'A' : k1, 'B' : k2, 'C' : np.tile(np.arange(100), 1000)})

In [30]: timeit df.drop_duplicates(['A', 'B'])
100 loops, best of 3: 18.2 ms per loop

In [31]: timeit df.drop_duplicates()
10 loops, best of 3: 49.8 ms per loop

# the groupby method
In [32]: timeit df.reindex([v[0] for v in df.groupby(['A', 'B']).groups.values()])
10 loops, best of 3: 56.5 ms per loop

For reference, it’s roughly twice as fast as the equivalent R code below (though I’m still using R 2.11.1…time to upgrade to 2.14!). Perhaps that’s not too shocking, since the R function converts each row into a string first.

N <- 100000

# random single-letter keys, comparable to the Python benchmark above
k1 <- rep(NA, N)
k2 <- rep(NA, N)
for (i in 1:N){
  k1[i] <- paste(sample(letters, 1), collapse="")
  k2[i] <- paste(sample(letters, 1), collapse="")
}
df <- data.frame(a=k1, b=k2, c=rep(1:100, N / 100))
df2 <- data.frame(a=k1, b=k2)

timings <- numeric()
timings2 <- numeric()
for (i in 1:50) {
  gc()
  # time full-row deduplication, then deduplication on columns a and b only
  timings[i] <- system.time(deduped <- df[!duplicated(df),])[3]
  gc()
  timings2[i] <- system.time(deduped <- df[!duplicated(df[,c("a", "b")]),])[3]
}

> mean(timings)
[1] 0.09312
> mean(timings2)
[1] 0.04312
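
In milliseconds, that’s about 93 ms versus pandas’ 49.8 ms for full-row deduplication, and about 43 ms versus 18.2 ms for the two-column case: roughly a 2x speedup in both.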