I’m not making this up:
> arr <- rep(letters, 1000)
> system.time(for (i in 1:1000) foo <- factor(arr))
user system elapsed
1.809 0.023 1.821
> arr <- 1:10000
> system.time(for (i in 1:1000) foo <- factor(arr))
user system elapsed
7.782 0.088 8.261
> system.time(for (i in 1:1000) foo <- factor(arr))
user system elapsed
1.809 0.023 1.821
> arr <- 1:10000
> system.time(for (i in 1:1000) foo <- factor(arr))
user system elapsed
7.782 0.088 8.261
Here’s the equivalent code using the pandas Factor class (git HEAD):
In [35]: import string
In [36]: arr = np.tile(list(string.letters[:26]), 1000).astype(object)
In [37]: Factor(arr)
Out[37]:
Factor:
array([a, b, c, ..., x, y, z], dtype=object)
Levels (26): [a b c d e f g h i j k l m n o p q r s t u v w x y z]
In [38]: timeit Factor(arr)
1000 loops, best of 3: 869 us per loop
In [39]: arr = np.arange(10000, dtype=object)
In [40]: timeit Factor(arr)
1000 loops, best of 3: 1.91 ms per loop
In [36]: arr = np.tile(list(string.letters[:26]), 1000).astype(object)
In [37]: Factor(arr)
Out[37]:
Factor:
array([a, b, c, ..., x, y, z], dtype=object)
Levels (26): [a b c d e f g h i j k l m n o p q r s t u v w x y z]
In [38]: timeit Factor(arr)
1000 loops, best of 3: 869 us per loop
In [39]: arr = np.arange(10000, dtype=object)
In [40]: timeit Factor(arr)
1000 loops, best of 3: 1.91 ms per loop
I just edited this up since R 2.11.1 is a lot slower than 2.13.1 (the latest release) so it wasn’t quite fair to perform comparisons off of that. So there are two cases here:
- Few unique values (the 26 letters of the alphabet). R does pretty well here, about 2x slower. This is down from 12x slower in R 2.11.1, so I can only assume someone on the R core team noticed this in the last year or so.
- All unique values. R is about 4x slower now.
It’s possible I’m using a different algorithm (and so is R between 2.11 and 2.13!). Maybe Python’s much lauded dict implementation is responsible? I’m more curious than anything. Maybe an R guru could let me know (I’m afraid to look at the GPL source code).
For the record, here’s the Python and Cython code involved. First, the Python code:
def unique_with_labels(values):
uniques = Index(_tseries.fast_unique(values))
labels = _tseries.get_unique_labels(values, uniques.indexMap)
return uniques, labels
uniques = Index(_tseries.fast_unique(values))
labels = _tseries.get_unique_labels(values, uniques.indexMap)
return uniques, labels
And the two Cython functions
@cython.wraparound(False)
@cython.boundscheck(False)
def fast_unique(ndarray[object] values):
cdef:
Py_ssize_t i, n = len(values)
list uniques = []
dict table = {}
object val, stub = 0
for i from 0 <= i < n:
val = values[i]
if val not in table:
table[val] = stub
uniques.append(val)
try:
uniques.sort()
except Exception:
pass
return np.asarray(uniques, dtype=object)
@cython.wraparound(False)
@cython.boundscheck(False)
def get_unique_labels(ndarray[object] values, dict idMap):
cdef int i, length
cdef object idx
cdef ndarray[int32_t] labels
length = len(values)
labels = np.empty(length, dtype=np.int32)
for i from 0 <= i < length:
idx = values[i]
labels[i] = idMap[idx]
return labels
@cython.boundscheck(False)
def fast_unique(ndarray[object] values):
cdef:
Py_ssize_t i, n = len(values)
list uniques = []
dict table = {}
object val, stub = 0
for i from 0 <= i < n:
val = values[i]
if val not in table:
table[val] = stub
uniques.append(val)
try:
uniques.sort()
except Exception:
pass
return np.asarray(uniques, dtype=object)
@cython.wraparound(False)
@cython.boundscheck(False)
def get_unique_labels(ndarray[object] values, dict idMap):
cdef int i, length
cdef object idx
cdef ndarray[int32_t] labels
length = len(values)
labels = np.empty(length, dtype=np.int32)
for i from 0 <= i < length:
idx = values[i]
labels[i] = idMap[idx]
return labels
