Wes on 2015-10-02: The performance issue I found in NumPy has been fixed, but the pandas workaround is still faster by 2x or more.
Getting a 1-dimensional ndarray
of object dtype containing Python tuples is, unless I'm missing something, rather difficult. Take this simple example:
In [1]: tuples = list(zip(range(100000), range(100000)))
In [2]: arr = np.array(tuples, dtype=object, ndmin=1)
In [3]: arr
Out[3]:
array([[0, 0],
       [1, 1],
       [2, 2],
       ...,
       [99997, 99997],
       [99998, 99998],
       [99999, 99999]], dtype=object)
In [5]: arr.ndim
Out[5]: 2
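As far as I can tell, this is np.array's shape inference at work: a list of equal-length tuples is indistinguishable from a nested sequence of rows, so it gets parsed into a 2-d array, and ndmin=1 only sets a floor on the number of dimensions, not a ceiling. A minimal sketch (my own toy example, not part of the session above):

# Equal-length tuples look like rows of a 2-d array to np.array,
# even when you ask for object dtype.
np.array([(1, 2), (3, 4)]).shape                 # (2, 2)
np.array([(1, 2), (3, 4)], dtype=object).shape   # still (2, 2), not (2,)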
OK, that didn't work so well. The only way I've figured out how to get what I want is:
In [6]: arr = np.empty(len(tuples), dtype='O')
In [7]: arr[:] = tuples
In [8]: arr
Out[8]:
array([(0, 0), (1, 1), (2, 2), ..., (99997, 99997), (99998, 99998),
       (99999, 99999)], dtype=object)
Yahtzee. But the kids aren't alright:
In [9]: timeit arr[:] = tuples
10 loops, best of 3: 133 ms per loop
Maybe it's just me, but that strikes me as outrageously slow. Someday I'll look at what's going on under the hood, but for now a quickie Cython function comes to the rescue:
import numpy as np
from numpy cimport ndarray

def list_to_object_array(list obj):
    '''
    Convert list to object ndarray.
    Seriously can't believe I had to write this function
    '''
    cdef:
        Py_ssize_t i, n
        ndarray[object] arr
    n = len(obj)
    arr = np.empty(n, dtype=object)
    for i from 0 <= i < n:  # old-style Cython loop, equivalent to range(n)
        arr[i] = obj[i]
    return arr
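For anyone following along at home, here's one way to get that compiled: a minimal sketch assuming the function above is saved as lib.pyx (the module name matches the lib used below, but the build incantation itself is my assumption, not from the original post):

# pyximport compiles lib.pyx on first import; the NumPy include path is
# needed because the function uses the ndarray[object] buffer syntax.
import numpy as np
import pyximport
pyximport.install(setup_args={'include_dirs': np.get_include()})

import lib
arr = lib.list_to_object_array(tuples)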
You would hope this is faster, and indeed it's about 85x faster:
In [12]: timeit arr = lib.list_to_object_array(tuples)
1000 loops, best of 3: 1.56 ms per loop
Scratching my head here, but I'll take it. I suspect there might be some object copying going on under the hood. Anyone know?
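Postscript: if I'm reading the release notes right, newer NumPy (1.23 or so) taught np.fromiter to accept object dtype, which would make this a one-liner. A sketch under that assumption:

# Assumes NumPy >= 1.23, where fromiter gained dtype=object support.
arr = np.fromiter(tuples, dtype=object, count=len(tuples))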