I’m not making this up:

```r
> arr <- rep(letters, 1000)
> system.time(for (i in 1:1000) foo <- factor(arr))
   user  system elapsed
  1.809   0.023   1.821
> arr <- 1:10000
> system.time(for (i in 1:1000) foo <- factor(arr))
   user  system elapsed
  7.782   0.088   8.261
```


Here’s the equivalent code using the pandas Factor class (git HEAD):

```python
In [35]: import string

In [36]: arr = np.tile(list(string.letters[:26]), 1000).astype(object)

In [37]: Factor(arr)
Out[37]:
Factor:
array([a, b, c, ..., x, y, z], dtype=object)
Levels (26): [a b c d e f g h i j k l m n o p q r s t u v w x y z]

In [38]: timeit Factor(arr)
1000 loops, best of 3: 869 us per loop

In [39]: arr = np.arange(10000, dtype=object)

In [40]: timeit Factor(arr)
1000 loops, best of 3: 1.91 ms per loop
```


I just edited this post: R 2.11.1 is a lot slower than 2.13.1 (the latest release), so it wasn’t quite fair to run the comparison against the older version. So there are two cases here:

- Few unique values (the 26 letters of the alphabet). R does pretty well here, about 2x slower. This is down from 12x slower in R 2.11.1, so I can only assume someone on the R core team noticed this in the last year or so.
- All unique values. R is about 4x slower now.

It’s possible I’m using a different algorithm (and so is R, between 2.11 and 2.13!). Maybe Python’s much-lauded dict implementation is responsible? I’m more curious than anything. Maybe an R guru could let me know (I’m afraid to look at the GPL source code).
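For intuition, here’s a minimal pure-Python sketch of the dict-based approach (my own illustration, not the pandas code that follows): a single pass over the values builds a value-to-integer mapping, and each value’s label is its position in the list of uniques. Average-case dict lookups are O(1), so the whole pass is O(n).

```python
def factorize(values):
    # Hypothetical illustration of dict-based factorization,
    # not the actual pandas implementation.
    table = {}    # value -> integer label
    uniques = []  # distinct values in first-seen order
    labels = []
    for val in values:
        if val not in table:
            table[val] = len(uniques)
            uniques.append(val)
        labels.append(table[val])
    return uniques, labels

uniques, labels = factorize(["a", "b", "a", "c", "b"])
# uniques == ["a", "b", "c"], labels == [0, 1, 0, 2, 1]
```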

For the record, here’s the Python and Cython code involved. First, the Python code:

```python
def unique_with_labels(values):
    uniques = Index(_tseries.fast_unique(values))
    labels = _tseries.get_unique_labels(values, uniques.indexMap)
    return uniques, labels
```


And the two Cython functions:

```cython
@cython.wraparound(False)
@cython.boundscheck(False)
def fast_unique(ndarray[object] values):
    cdef:
        Py_ssize_t i, n = len(values)
        list uniques = []
        dict table = {}
        object val, stub = 0

    for i from 0 <= i < n:
        val = values[i]
        if val not in table:
            table[val] = stub
            uniques.append(val)

    try:
        uniques.sort()
    except Exception:
        pass

    return np.asarray(uniques, dtype=object)

@cython.wraparound(False)
@cython.boundscheck(False)
def get_unique_labels(ndarray[object] values, dict idMap):
    cdef int i, length
    cdef object idx
    cdef ndarray[int32_t] labels

    length = len(values)
    labels = np.empty(length, dtype=np.int32)
    for i from 0 <= i < length:
        idx = values[i]
        labels[i] = idMap[idx]

    return labels
```
