I’ve tried to make the file parsing functions in pandas, `read_csv` and `read_table`, as robust (they do the right thing) and fast as possible. What do we really care about?

While hacking yesterdays with folks at the Data Without Borders event, I realized that boolean values (“True” and “False”) weren’t resulting in an boolean array in the result. Also, I was not satisfied with the performance of the pure Python code. Since I’ve had great results using Cython to create C extensions, it was the natural choice. The results were great: parsing the data set we were looking at at DWB went from about 8 seconds before to 800ms, a full 10x improvement. I also fixed a number of bugs / corner cases with type handling.

### TL;DR pandas.read_csv is a lot faster now

Here was the speed of `read_csv` before these changes on a fairly big file (46738×54):

CPU times: user 7.99 s, sys: 0.09 s, total: 8.08 s

Wall time: 8.11 s

This obviously will not do. And here post-Cythonization:

1 loops, best of 3: 804 ms per loop

As a point of comparison, R is pretty speedy but about 2x slower.

user system elapsed

1.660 0.000 1.667

In fairness I am 100% sure that `read.csv` is “doing” a lot more, but it shows that I’m at least on the right track.

I won’t rehash all the code, but there were a number of interesting things along the way.

### The basics: working with NumPy arrays in Cython

One of the truly beautiful things about programming in Cython is that you can get the speed of working with a C array representing a multi-dimensional array (e.g. `double *`) without the headache of having to handle the striding information of the ndarray yourself. Also, you can work with non-contiguous arrays and arrays with dtype=object (which are just arrays of `PyObject*` underneath) with no code changes (!). Cython calls this the buffer interface:

cdef:

Py_ssize_t i, n

object val, onan

n = len(values)

onan = np.nan

for i from 0 <= i < n:

val = values[i]

if val == '':

values[i] = onan

For multi-dimensional arrays, you specify the number of dimensions in the buffer. and pass multiple indexes (`Py_ssize_t` is the proper C “index type” to use). I’ll demonstrate this in:

### Converting rows to columns faster than zip(*rows)

A cool Python trick to convert rows to column is:

Out[5]:

[(0.39455404791938709, 0.13120015514691319, -0.38366356835950594),

(-0.744567101498121, -0.9189909692195557, 1.3558711696319314),

(-0.20933216711571506, 0.36102965753837235, 0.94614438124063927),

(0.49200559154161844, 0.099177280246717708, -1.2622899429921068),

(-0.48238271158753454, -0.9414862514454051, -1.0257632509581869)]

In [6]: zip(*rows)

Out[6]:

[(0.39455404791938709,

-0.744567101498121,

-0.20933216711571506,

0.49200559154161844,

-0.48238271158753454),

(0.13120015514691319,

-0.9189909692195557,

0.36102965753837235,

0.099177280246717708,

-0.9414862514454051),

(-0.38366356835950594,

1.3558711696319314,

0.94614438124063927,

-1.2622899429921068,

-1.0257632509581869)]

While `zip` is very fast (a built-in Python function), the larger problem here is that our target data structure is NumPy arrays to begin with. So it would make sense to write out the rows directly to a 2-dimensional object array:

cdef:

Py_ssize_t i, j, n, k, tmp

ndarray[object, ndim=2] result

list row

n = len(rows)

# get the maximum row length

k = 0

for i from 0 <= i < n:

tmp = len(rows[i])

if tmp > k:

k = tmp

result = np.empty((n, k), dtype=object)

for i from 0 <= i < n:

row = rows[i]

for j from 0 <= j < len(row):

result[i, j] = row[j]

return result

And lo and behold, this function is significantly faster than the zip trick:

In [13]: timeit zip(*data)

100 loops, best of 3: 3.66 ms per loop

In [14]: timeit lib.to_object_array(data)

1000 loops, best of 3: 1.47 ms per loop

It’s even more of a big deal if you zip **and** convert to ndarray:

100 loops, best of 3: 6.72 ms per loop

### Numeric conversion: floats, ints, and NA’s, oh my

When converting the Python strings to numeric data, you must:

`['1', '2', '3.5', '4']`where you may have “integer-like” data until you observe a floating point value

Unfortunately, code for this sort of thing ends up looking like a state machine 99% of the time, but at least it’s fairly tidy in Cython and runs super fast:

cdef:

Py_ssize_t i, n

ndarray[float64_t] floats

ndarray[int64_t] ints

bint seen_float = 0

object val

float64_t fval

n = len(values)

floats = np.empty(n, dtype='f8')

ints = np.empty(n, dtype='i8')

for i from 0 <= i < n:

val = values[i]

if cpython.PyFloat_Check(val):

floats[i] = val

seen_float = 1

elif val in na_values:

floats[i] = nan

seen_float = 1

elif val is None:

floats[i] = nan

seen_float = 1

elif len(val) == 0:

floats[i] = nan

seen_float = 1

else:

fval = float(val)

floats[i] = fval

if not seen_float:

if '.' in val:

seen_float = 1

else:

ints[i] = <int64_t> fval

if seen_float:

return floats

else:

return ints

Adopting the Python philosophy that it’s “easier to ask forgiveness than permission” if float conversion ever fails, the exception will get raised and the code will just leave the column as dtype=object. And this function would obviously have problems with European decimal format– but I’m not willing to compromise performance in 99% cases for the sake of the 1% cases. It will make sense to write a slower function that also handles a broader variety of formatting issues.