R’s factor function is slow

I’m not making this up:

> arr <- rep(letters, 1000)
> system.time(for (i in 1:1000) foo <- factor(arr))
user system elapsed
1.809 0.023 1.821

> arr <- 1:10000
> system.time(for (i in 1:1000) foo <- factor(arr))
user system elapsed
7.782 0.088 8.261

Here’s the equivalent code using the pandas Factor class (git HEAD):

In [35]: import string
In [36]: arr = np.tile(list(string.letters[:26]), 1000).astype(object)
In [37]: Factor(arr)
array([a, b, c, ..., x, y, z], dtype=object)
Levels (26): [a b c d e f g h i j k l m n o p q r s t u v w x y z]
In [38]: timeit Factor(arr)
1000 loops, best of 3: 869 us per loop

In [39]: arr = np.arange(10000, dtype=object)
In [40]: timeit Factor(arr)
1000 loops, best of 3: 1.91 ms per loop

I just updated this post, since R 2.11.1 is a lot slower than 2.13.1 (the latest release) and it wasn't quite fair to base the comparison on that. So there are two cases here:

  • Few unique values (the 26 letters of the alphabet). R does pretty well here, about 2x slower. This is down from 12x slower in R 2.11.1, so I can only assume someone on the R core team noticed this in the last year or so.
  • All unique values. R is about 4x slower now.

It’s possible I’m using a different algorithm (and so is R between 2.11 and 2.13!). Maybe Python’s much lauded dict implementation is responsible? I’m more curious than anything. Maybe an R guru could let me know (I’m afraid to look at the GPL source code).
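To make the dict-based approach concrete, here's a minimal pure-Python sketch of the technique (not the actual pandas implementation, just the idea it's built on): one hash-table pass that assigns each value an integer label in order of first appearance.

```python
import numpy as np

def factorize(values):
    """One-pass dict-based factorization: returns (uniques, labels),
    where labels[i] is the position of values[i] in uniques."""
    table = {}      # value -> integer label, in order of first appearance
    uniques = []
    labels = np.empty(len(values), dtype=np.int32)
    for i, val in enumerate(values):
        if val not in table:
            table[val] = len(uniques)
            uniques.append(val)
        labels[i] = table[val]
    return np.asarray(uniques, dtype=object), labels

uniques, labels = factorize(['a', 'b', 'a', 'c', 'b'])
# uniques -> array(['a', 'b', 'c'], dtype=object)
# labels  -> array([0, 1, 0, 2, 1], dtype=int32)
```

The whole thing is O(n) with amortized O(1) dict lookups, which is why the number of unique values matters so little.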

For the record, here’s the Python and Cython code involved. First, the Python code:

def unique_with_labels(values):
    uniques = Index(_tseries.fast_unique(values))
    labels = _tseries.get_unique_labels(values, uniques.indexMap)
    return uniques, labels

And here are the two Cython functions:

def fast_unique(ndarray[object] values):
    cdef:
        Py_ssize_t i, n = len(values)
        list uniques = []
        dict table = {}
        object val, stub = 0

    for i from 0 <= i < n:
        val = values[i]
        # first time we see a value, record it in order of appearance
        if val not in table:
            table[val] = stub
            uniques.append(val)

    return np.asarray(uniques, dtype=object)

def get_unique_labels(ndarray[object] values, dict idMap):
    cdef int i, length
    cdef object idx
    cdef ndarray[int32_t] labels
    length = len(values)
    labels = np.empty(length, dtype=np.int32)
    for i from 0 <= i < length:
        # look up each value's integer code in the uniques -> position map
        idx = values[i]
        labels[i] = idMap[idx]

    return labels
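As a sanity check on what uniques and labels mean, here's the invariant in plain NumPy (a toy example, not part of the pandas code): fancy-indexing the uniques array with the labels reconstructs the original values exactly.

```python
import numpy as np

# Toy uniques/labels pair for the array ['a', 'b', 'a', 'c']
values = np.array(['a', 'b', 'a', 'c'], dtype=object)
uniques = np.array(['a', 'b', 'c'], dtype=object)
labels = np.array([0, 1, 0, 2], dtype=np.int32)

# Fancy indexing inverts the factorization
reconstructed = uniques[labels]
assert (reconstructed == values).all()
```

This is the same invariant an R factor maintains with its integer codes and levels attribute.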