Filtering out duplicate pandas.DataFrame rows

Sean Taylor recently alerted me to the fact that there wasn’t an easy way to filter out duplicate rows in a pandas DataFrame. R has the duplicated function which serves this purpose quite nicely. The R method’s implementation is kind of kludgy in my opinion (from “The data frame method works by pasting together a character representation of the rows”), but in any case I set about writing a Python version from first principles.

So a drop_duplicates method should be able to either consider a subset of the columns or all of the columns for determining which are “duplicates”. It occurred to me that a reasonably fast and efficient way to do this was to use GroupBy. You have to know a bit about the internals, but it’s fairly straightforward otherwise:

grouped = df.groupby(cols_to_consider)
index = [gp_keys[0] for gp_keys in grouped.groups.values()]
unique_df = df.reindex(index)

For a faster and more direct implementation in Cython, I decided to use a list of tuples and a dict to keep track of whether a row has been “seen” or not:

def duplicated(list values, take_last=False):
    cdef:
        Py_ssize_t i, n
        dict seen = {}
        object row

    n = len(values)
    cdef ndarray[uint8_t] result = np.zeros(n, dtype=np.uint8)

    if take_last:
        for i from n > i >= 0:
            row = values[i]
            if row in seen:
                result[i] = 1
            else:
                seen[row] = None
                result[i] = 0
    else:
        for i from 0 <= i < n:
            row = values[i]
            if row in seen:
                result[i] = 1
            else:
                seen[row] = None
                result[i] = 0

    return result.view(np.bool_)

Aside: the uint8 and casting to np.bool_ thing is there because the buffer interface doesn’t work quite right with boolean arrays in Python 2.5. So as soon as I drop Python 2.5 compatibility (probably around the end of the year), I’ll go through and fix that stuff up. As a second aside, using a dict with dummy keys was coming out a bit faster than using a set in Cython, for reasons unknown to me.

So, on the Python side, the new DataFrame function just takes the boolean vector returned by that Cython function and uses it to select out the rows:

In [21]: df
Out[21]:
    A  B  C
0   a  c  0
1   b  c  1
2   c  b  2
3   d  a  0
4   e  c  1
5   a  a  2
6   b  b  0
7   c  b  1
8   d  b  2
9   e  b  0
10  a  a  1
11  b  a  2
12  c  c  0
13  d  a  1
14  e  c  2

In [22]: df.drop_duplicates('A')
Out[22]:
   A  B  C
0  a  c  0
1  b  c  1
2  c  b  2
3  d  a  0
4  e  c  1

In [23]: df.drop_duplicates(['A', 'B'])
Out[23]:
    A  B  C
0   a  c  0
1   b  c  1
2   c  b  2
3   d  a  0
4   e  c  1
5   a  a  2
6   b  b  0
8   d  b  2
9   e  b  0
11  b  a  2
12  c  c  0

Not passing any particular column or columns is the same as passing all of them.

I was impressed by the performance:

In [26]: from pandas.util.testing import rands

In [27]: k1 = [rands(1) for _ in xrange(100000)]

In [28]: k2 = [rands(1) for _ in xrange(100000)]

In [29]: df = DataFrame({'A' : k1, 'B' : k2, 'C' : np.tile(np.arange(100), 1000)})

In [30]: timeit df.drop_duplicates(['A', 'B'])
100 loops, best of 3: 18.2 ms per loop

In [31]: timeit df.drop_duplicates()
10 loops, best of 3: 49.8 ms per loop

# the groupby method
In [32]: timeit df.reindex([v[0] for v in df.groupby(['A', 'B']).groups.values()])
10 loops, best of 3: 56.5 ms per loop

For reference, it’s roughly twice as fast as the equivalent R code (though I’m still using R 2.11.1…time to upgrade to 2.14!), but perhaps not too shocking as the R function is converting each row into a string first.

N <- 100000

k1 = rep(NA, N)
k2 = rep(NA, N)
for (i in 1:N){
  k1[i] <- paste(sample(letters, 1), collapse="")
  k2[i] <- paste(sample(letters, 1), collapse="")
}
df <- data.frame(a=k1, b=k2, c=rep(1:100, N / 100))
df2 <- data.frame(a=k1, b=k2)

timings <- numeric()
timings2 <- numeric()
for (i in 1:50) {
  gc()
  timings[i] = system.time(deduped <- df[!duplicated(df),])[3]
  gc()
  timings2[i] = system.time(deduped <- df[!duplicated(df[,c("a", "b")]),])[3]
}

> mean(timings)
[1] 0.09312
> mean(timings2)
[1] 0.04312