I've tried to make the file parsing functions in pandas,
read_table, as robust (they do the right thing) and fast as possible. What do we really care about?
- Good performance: can read a CSV file as fast as other statistical computing / data analysis languages, like R
- Proper type handling: if a column of data has integers, the resulting column will be integer dtype. But you can't be too aggressive, values like '1.0' and '2.0' should not get converted to integers.
- Support for converting strings into datetime objects
- Support for returning indexed DataFrame (with a simple or hierarchical index)
- NA value handling. This is tricky as you want to recognize lots of common NA representations ("NA", "NULL", etc.) and also allow the user to specify custom NA value strings.
While hacking yesterdays with folks at the Data Without Borders event, I realized that boolean values ("True" and "False") weren't resulting in an boolean array in the result. Also, I was not satisfied with the performance of the pure Python code. Since I've had great results using Cython to create C extensions, it was the natural choice. The results were great: parsing the data set we were looking at at DWB went from about 8 seconds before to 800ms, a full 10x improvement. I also fixed a number of bugs / corner cases with type handling.
TL;DR pandas.read_csv is a lot faster now
Here was the speed of
read_csv before these changes on a fairly big file (46738×54):
In : %time df = read_csv('data.csv') CPU times: user 7.99 s, sys: 0.09 s, total: 8.08 s Wall time: 8.11 s
This obviously will not do. And here post-Cythonization:
In : timeit df = read_csv('data.csv') 1 loops, best of 3: 804 ms per loop
As a point of comparison, R is pretty speedy but about 2x slower.
> system.time(df <- read.csv('data.csv', header=T)) user system elapsed 1.660 0.000 1.667
In fairness I am 100% sure that
read.csv is "doing" a lot more, but it shows that I'm at least on the right track.
I won't rehash all the code, but there were a number of interesting things along the way.
The basics: working with NumPy arrays in Cython
One of the truly beautiful things about programming in Cython is that you can get the speed of working with a C array representing a multi-dimensional array (e.g.
double *) without the headache of having to handle the striding information of the ndarray yourself. Also, you can work with non-contiguous arrays and arrays with dtype=object (which are just arrays of
PyObject* underneath) with no code changes (!). Cython calls this the buffer interface:
def sanitize_objects(ndarray[object] values): cdef: Py_ssize_t i, n object val, onan n = len(values) onan = np.nan for i from 0 <= i < n: val = values[i] if val == '': values[i] = onan
For multi-dimensional arrays, you specify the number of dimensions in the buffer. and pass multiple indexes (
Py_ssize_t is the proper C "index type" to use). I'll demonstrate this in:
Converting rows to columns faster than zip(*rows)
A cool Python trick to convert rows to column is:
In : rows Out: [(0.39455404791938709, 0.13120015514691319, -0.38366356835950594), (-0.744567101498121, -0.9189909692195557, 1.3558711696319314), (-0.20933216711571506, 0.36102965753837235, 0.94614438124063927), (0.49200559154161844, 0.099177280246717708, -1.2622899429921068), (-0.48238271158753454, -0.9414862514454051, -1.0257632509581869)] In : zip(*rows) Out: [(0.39455404791938709, -0.744567101498121, -0.20933216711571506, 0.49200559154161844, -0.48238271158753454), (0.13120015514691319, -0.9189909692195557, 0.36102965753837235, 0.099177280246717708, -0.9414862514454051), (-0.38366356835950594, 1.3558711696319314, 0.94614438124063927, -1.2622899429921068, -1.0257632509581869)]
zip is very fast (a built-in Python function), the larger problem here is that our target data structure is NumPy arrays to begin with. So it would make sense to write out the rows directly to a 2-dimensional object array:
def to_object_array(list rows): cdef: Py_ssize_t i, j, n, k, tmp ndarray[object, ndim=2] result list row n = len(rows) # get the maximum row length k = 0 for i from 0 <= i < n: tmp = len(rows[i]) if tmp > k: k = tmp result = np.empty((n, k), dtype=object) for i from 0 <= i < n: row = rows[i] for j from 0 <= j < len(row): result[i, j] = row[j] return result
And lo and behold, this function is significantly faster than the zip trick:
In : data = map(list, np.random.randn(10000, 10)) In : timeit zip(*data) 100 loops, best of 3: 3.66 ms per loop In : timeit lib.to_object_array(data) 1000 loops, best of 3: 1.47 ms per loop
It's even more of a big deal if you zip and convert to ndarray:
In : timeit [np.asarray(x, dtype=object) for x in zip(*data)] 100 loops, best of 3: 6.72 ms per loop
Numeric conversion: floats, ints, and NA's, oh my
When converting the Python strings to numeric data, you must:
- Check that the value is not among the set of NA values
- Handle data like
['1', '2', '3.5', '4']where you may have "integer-like" data until you observe a floating point value
Unfortunately, code for this sort of thing ends up looking like a state machine 99% of the time, but at least it's fairly tidy in Cython and runs super fast:
def maybe_convert_numeric(ndarray[object] values, set na_values): cdef: Py_ssize_t i, n ndarray[float64_t] floats ndarray[int64_t] ints bint seen_float = 0 object val float64_t fval n = len(values) floats = np.empty(n, dtype='f8') ints = np.empty(n, dtype='i8') for i from 0 <= i < n: val = values[i] if cpython.PyFloat_Check(val): floats[i] = val seen_float = 1 elif val in na_values: floats[i] = nan seen_float = 1 elif val is None: floats[i] = nan seen_float = 1 elif len(val) == 0: floats[i] = nan seen_float = 1 else: fval = float(val) floats[i] = fval if not seen_float: if '.' in val: seen_float = 1 else: ints[i] = <int64_t> fval if seen_float: return floats else: return ints
Adopting the Python philosophy that it's "easier to ask forgiveness than permission" if float conversion ever fails, the exception will get raised and the code will just leave the column as dtype=object. And this function would obviously have problems with European decimal format— but I'm not willing to compromise performance in 99% cases for the sake of the 1% cases. It will make sense to write a slower function that also handles a broader variety of formatting issues.