Feather: it’s about metadata

Apache Arrow

Published April 26, 2016

Summary: Feather’s good performance is a side effect of its design, but the primary goal of the project is to have a common memory layout (Apache Arrow) and metadata (type information) for use in multiple programming languages.

Feather performance

Several people have asked me about Matt Dowle’s blog post on fast CSV writing. I say: bravo!

The dirty secret of Feather’s performance is that neither Hadley nor I spent much effort on performance optimization. Throughout the project’s entire git history, you’ll be hard-pressed to find anything related to performance tuning.

Because of Feather’s design (Arrow’s columnar memory layout plus simple metadata described with Google’s Flatbuffers library), there was simply no way for it to be slow, barring some gross programming error. I was pleased, but not too surprised, to find that the first feather.read_dataframe call in Python nearly saturated my laptop’s IO bandwidth.
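
For reference, the basic Python round trip looks roughly like this. feather.read_dataframe is the call mentioned above; feather.write_dataframe is assumed here as its writing counterpart.

    import feather
    import pandas as pd

    df = pd.DataFrame({"ints": list(range(1000)), "strs": ["x"] * 1000})

    # Write the data frame to a Feather file and read it back. On disk this
    # is the Arrow columnar layout plus Flatbuffers metadata, so the read
    # path does no parsing and no type inference.
    feather.write_dataframe(df, "example.feather")
    df2 = feather.read_dataframe("example.feather")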

Feather does perform a limited amount of conversion between the Arrow memory layout and R or Python data frames. As a result, several performance optimization opportunities are available:

  • Multi-threaded conversion to/from Arrow (convert multiple columns simultaneously); see the sketch below
  • Pipelined reads and writes: perform disk IO concurrently with conversion to or from data frames.

We haven’t spent any energy on this, because the performance in the project’s first draft was good enough. Patches welcome, of course.
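
To make the first of those ideas concrete, here is a rough, hypothetical sketch of multi-threaded column conversion in Python; convert_column and convert_columns_parallel are invented names standing in for the real conversion routines, which live in Feather’s C++ code:

    from concurrent.futures import ThreadPoolExecutor

    import numpy as np

    def convert_column(arr):
        # Stand-in for converting one column between the Arrow layout and a
        # data frame column; the real work happens in C++ and isn't shown.
        return np.asarray(arr).copy()

    def convert_columns_parallel(columns, max_workers=4):
        # Each column converts independently, so the work parallelizes
        # naturally across a thread pool.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(convert_column, columns))

    columns = [np.arange(1000000), np.random.randn(1000000)]
    converted = convert_columns_parallel(columns)

For real gains, the native conversion code would need to release the GIL so the threads can actually run in parallel, but the shape of the idea is the same.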

Metadata and metadata-free file formats

One of my personal goals in working on Feather was to start a broader discussion about what I call metadata-free file formats, with CSV being the most popular one. By “metadata-free”, I mean that in storing data in these formats, you lose type information (for example: factor levels) that may be impossible to recover. You may also lose numerical precision. When you read the files, you have to perform expensive type inference to “guess” the data types of columns. As you can imagine, there are a large number of esoteric edge cases for any type inference engine for CSVs.
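
A small pandas example (illustrative only) shows the kind of information that evaporates in a metadata-free round trip:

    import io

    import pandas as pd

    # A column of identifiers with leading zeros and a categorical column.
    df = pd.DataFrame({
        "zip": ["02139", "10001"],
        "grade": pd.Categorical(["a", "b"], categories=["a", "b", "c"]),
    })

    # Round-tripping through CSV discards the metadata: type inference turns
    # "zip" into integers (dropping the leading zeros), and "grade" comes
    # back as plain strings with no record of the unused level "c".
    roundtrip = pd.read_csv(io.StringIO(df.to_csv(index=False)))
    print(roundtrip.dtypes)    # zip: int64, grade: object
    print(roundtrip["zip"])    # 2139, 10001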

That being said, metadata-free formats like CSV are still, unfortunately, the lowest common denominator for data exchange in many systems. In both the Python and R communities, we’ve spent an extraordinary amount of time writing fast code for parsing and doing type inference on all of the bizarre delimited text files generated in the real world. In my opinion, this has been time well spent.

I often tell people that one of the things that made pandas successful early on was that pandas.read_csv usually just worked and was fairly fast, which wasn’t true of any other Python CSV reader at the time.

Code sharing

Another goal of Feather was to share a common C++ library between the Python and R implementations. Since Python and R libraries are nowadays often wrappers around C, C++, and Fortran code, it has bummed me out that so little of that native code is reusable across languages. I’d like to see more sharing of user-invisible compiled C or C++ code.

Code sharing is hard, though, because much of it relates to “proprietary” data structures and memory layouts. This is why Apache Arrow is so important: it gives the community a common table / data frame memory layout on which we can collaborate more easily.
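
As a toy illustration of what a shared columnar layout means (these classes are invented for this post and are not Arrow’s actual API), a column is just a contiguous buffer of values plus a validity bitmap, and a table is a collection of named columns:

    import numpy as np

    class Column:
        # Toy column: a contiguous value buffer plus a null/validity bitmap.
        def __init__(self, values, valid):
            self.values = np.asarray(values)
            self.valid = np.asarray(valid, dtype=bool)

    # A "table" is a collection of named columns.
    table = {
        "ints": Column([1, 2, 0], [True, True, False]),   # third value is null
        "floats": Column([1.5, 2.5, 3.5], [True, True, True]),
    }

If every library agreed on this kind of layout down to the byte level, the conversion and compute code written against it could be shared rather than reimplemented in each project, which is exactly what Arrow standardizes.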

Summary

Feather’s priorities, in order, have been:

  1. Interoperable metadata and a shared memory layout
  2. Shared code
  3. Performance

I look forward to more interoperability and more code sharing in the Python and R communities. If we can also make things fast, of course, let’s do that, too.