I’ve tried to make the file parsing functions in pandas, read_csv and read_table, as robust (they do the right thing) and fast as possible. What do we really care about?
While hacking yesterday with folks at the Data Without Borders event, I realized that boolean values (“True” and “False”) weren’t resulting in a boolean array in the result. Also, I was not satisfied with the performance of the pure Python code. Since I’ve had great results using Cython to create C extensions, it was the natural choice. The results were great: parsing the data set we were looking at at DWB went from about 8 seconds to about 800 ms, a full 10x improvement. I also fixed a number of bugs and corner cases with type handling.
TL;DR: pandas.read_csv is a lot faster now.
Here was the speed of read_csv before these changes on a fairly big file (46738×54):
CPU times: user 7.99 s, sys: 0.09 s, total: 8.08 s
Wall time: 8.11 s
This obviously will not do. And here post-Cythonization:
1 loops, best of 3: 804 ms per loop
As a point of comparison, R is pretty speedy but about 2x slower.
   user  system elapsed
  1.660   0.000   1.667
In fairness I am 100% sure that read.csv is “doing” a lot more, but it shows that I’m at least on the right track.
I won’t rehash all the code, but there were a number of interesting things along the way.
The basics: working with NumPy arrays in Cython
One of the truly beautiful things about programming in Cython is that you can get the speed of working with a C array representing a multi-dimensional array (e.g. double *) without the headache of having to handle the striding information of the ndarray yourself. Also, you can work with non-contiguous arrays and arrays with dtype=object (which are just arrays of PyObject* underneath) with no code changes (!). Cython calls this the buffer interface:
cdef:
    Py_ssize_t i, n
    object val, onan

n = len(values)
onan = np.nan

for i from 0 <= i < n:
    val = values[i]
    if val == '':
        values[i] = onan
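For readers without a Cython toolchain at hand, here is a rough plain-Python rendering of the same loop. It is only an illustration of the logic, not the pandas code: the name `blank_to_nan` is made up for this sketch, and it runs at interpreter speed rather than C speed. It does show the point about non-contiguous arrays, since the same function works unchanged on a strided view.

```python
import numpy as np

def blank_to_nan(values):
    """Replace empty strings with NaN, in place, in a 1-D object array.

    Hypothetical helper: same logic as the Cython loop, minus the typing.
    """
    for i in range(len(values)):
        if values[i] == '':
            values[i] = np.nan
    return values

# A contiguous object array
arr = np.array(['1', '', 'foo', ''], dtype=object)
blank_to_nan(arr)

# A non-contiguous view: writes go through to the underlying array
strided = np.array(['', 'a', '', 'b'], dtype=object)
blank_to_nan(strided[::2])
```

After these calls, the blank entries of `arr` and the even-indexed entries of `strided` hold NaN, and everything else is untouched.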
For multi-dimensional arrays, you specify the number of dimensions in the buffer and pass multiple indexes (Py_ssize_t is the proper C “index type” to use). I’ll demonstrate this in:
Converting rows to columns faster than zip(*rows)
A cool Python trick to convert rows to columns is:
In [5]: rows
Out[5]:
[(0.39455404791938709, 0.13120015514691319, -0.38366356835950594),
(-0.744567101498121, -0.9189909692195557, 1.3558711696319314),
(-0.20933216711571506, 0.36102965753837235, 0.94614438124063927),
(0.49200559154161844, 0.099177280246717708, -1.2622899429921068),
(-0.48238271158753454, -0.9414862514454051, -1.0257632509581869)]
In [6]: zip(*rows)
Out[6]:
[(0.39455404791938709,
-0.744567101498121,
-0.20933216711571506,
0.49200559154161844,
-0.48238271158753454),
(0.13120015514691319,
-0.9189909692195557,
0.36102965753837235,
0.099177280246717708,
-0.9414862514454051),
(-0.38366356835950594,
1.3558711696319314,
0.94614438124063927,
-1.2622899429921068,
-1.0257632509581869)]
While zip is very fast (a built-in Python function), the larger problem here is that our target data structure is NumPy arrays to begin with. So it would make sense to write out the rows directly to a 2-dimensional object array:
cdef:
    Py_ssize_t i, j, n, k, tmp
    ndarray[object, ndim=2] result
    list row

n = len(rows)

# get the maximum row length
k = 0
for i from 0 <= i < n:
    tmp = len(rows[i])
    if tmp > k:
        k = tmp

result = np.empty((n, k), dtype=object)

for i from 0 <= i < n:
    row = rows[i]

    for j from 0 <= j < len(row):
        result[i, j] = row[j]

return result
And lo and behold, this function is significantly faster than the zip trick:
In [13]: timeit zip(*data)
100 loops, best of 3: 3.66 ms per loop
In [14]: timeit lib.to_object_array(data)
1000 loops, best of 3: 1.47 ms per loop
It’s even more of a big deal if you zip and convert to ndarray:
100 loops, best of 3: 6.72 ms per loop
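Speed aside, writing into a preallocated array behaves differently from zip when rows have unequal lengths. The sketch below is a pure-Python stand-in for the Cython function (the name `to_object_array_py` and the sample rows are invented for illustration). It sizes the array to the longest row, so short rows are padded with None, whereas zip(*rows) silently stops at the shortest row.

```python
import numpy as np

def to_object_array_py(rows):
    """Copy a list of row lists into an (n, k) object array, k = longest row.

    Hypothetical pure-Python stand-in for the Cython function above.
    """
    n = len(rows)
    k = max((len(row) for row in rows), default=0)
    result = np.empty((n, k), dtype=object)  # object cells start out as None
    for i, row in enumerate(rows):
        for j, val in enumerate(row):
            result[i, j] = val
    return result

rows = [[1, 'a', 2.5], [2, 'b'], [3, 'c', 4.5]]  # the second row is short

table = to_object_array_py(rows)
columns = [table[:, j] for j in range(table.shape[1])]  # three columns

zipped = list(zip(*rows))  # on Python 3 zip is lazy, hence list(); two columns
```

Here `columns[2]` keeps the third field with a None in the short row, while `zipped` has dropped that field entirely.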
Numeric conversion: floats, ints, and NA’s, oh my
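When converting the Python strings to numeric data, you must handle NA values, which force a float result, and columns that look integer-like until a floating-point value turns up.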
Unfortunately, code for this sort of thing ends up looking like a state machine 99% of the time, but at least it’s fairly tidy in Cython and runs super fast:
cdef:
    Py_ssize_t i, n
    ndarray[float64_t] floats
    ndarray[int64_t] ints
    bint seen_float = 0
    object val
    float64_t fval

n = len(values)

# fill both candidate outputs in parallel; seen_float picks one at the end
floats = np.empty(n, dtype='f8')
ints = np.empty(n, dtype='i8')

for i from 0 <= i < n:
    val = values[i]

    if cpython.PyFloat_Check(val):
        floats[i] = val
        seen_float = 1
    elif val in na_values:
        # NA markers become NaN, which forces a float result
        floats[i] = nan
        seen_float = 1
    elif val is None:
        floats[i] = nan
        seen_float = 1
    elif len(val) == 0:
        floats[i] = nan
        seen_float = 1
    else:
        # raises if val is not numeric text
        fval = float(val)
        floats[i] = fval
        if not seen_float:
            if '.' in val:
                seen_float = 1
            else:
                ints[i] = <int64_t> fval

if seen_float:
    return floats
else:
    return ints
Adopting the Python philosophy that it’s “easier to ask forgiveness than permission”, if float conversion ever fails, the exception will get raised and the code will just leave the column as dtype=object. And this function would obviously have problems with European decimal format, but I’m not willing to compromise performance in 99% of cases for the sake of the other 1%. It will make sense to write a slower function that also handles a broader variety of formatting issues.
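To make the ask-forgiveness flow concrete, here is a plain-Python sketch of the inference rules plus the fallback. It is a simplified illustration rather than the pandas implementation: the names `infer_numeric` and `convert_column` and the default NA markers are assumptions made for this example.

```python
import numpy as np

def infer_numeric(values, na_values):
    """Return an int64 array if every value is integer-like, else float64.

    Hypothetical reference version of the Cython loop above.
    """
    n = len(values)
    floats = np.empty(n, dtype='f8')
    ints = np.empty(n, dtype='i8')
    seen_float = False
    for i, val in enumerate(values):
        if isinstance(val, float):
            floats[i] = val
            seen_float = True
        elif val is None or val in na_values or len(val) == 0:
            floats[i] = np.nan  # NA forces a float result
            seen_float = True
        else:
            fval = float(val)  # raises ValueError on non-numeric text
            floats[i] = fval
            if not seen_float:
                if '.' in val:
                    seen_float = True
                else:
                    ints[i] = int(fval)
    return floats if seen_float else ints

def convert_column(values, na_values=frozenset(['NA', 'NaN'])):
    """Try numeric conversion; on failure keep the column as dtype=object."""
    try:
        return infer_numeric(values, na_values)
    except ValueError:
        return np.asarray(values, dtype=object)

int_col = convert_column(['1', '2', '3'])      # integer-like throughout
float_col = convert_column(['1', '', '3.5'])   # blank becomes NaN
euro_col = convert_column(['1,5', '2,0'])      # float() fails, stays object
```

The European-style values fall through to the object branch, which is exactly the 1% case a slower, more forgiving parser would have to handle.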
