Summary: I explain the relationship between Feather and Apache Arrow in more technical detail.
Memory representation and file formats
I was recently asked to explain the difference between Apache Arrow (providing a standard in-memory columnar memory representation) and Feather (a file format using Apache Arrow).
Before going deeper into Feather and Arrow, let's look at how memory representations (these are also probably more commonly called data structures) can lead to file formats.
Let's take a simplified array data structure in C:
typedef struct {
/* the number of values in the array */
int64_t length;
/* the enum number indicating the data type of the array */
type_t type;
/* a contiguous block of allocated memory with size at least equal to length
* sizeof(type)
*/
void* data;
} array_t;
This array_t
is an in-memory data structure. I haven't told you how to put
it in a file yet.
Now suppose we have multiple arrays:
array_t my_data[20];
/* populate my_data */
Now you want to write the data to a file. You have a memory representation for this data, but no file format. So now you have a bunch of decisions to make:
- How do you lay out the bytes contained in the
array_t
structs in the file? - How to you encode the metadata (a description that allows code to reconstruct
my_data
)? - Where do you put the metadata in the file?
- How do you verify that you have a valid file?
- What happens when the metadata needs to evolve (file versioning)?
So you have a bunch of work to do if you want to go from data structures to a full blown file format.
Feather: the devil is in the metadata
From Feather's specification document, the structure of a file currently looks like this:
<4-byte magic number "FEA1">
<ARRAY 0>
<ARRAY 1>
...
<ARRAY n - 1>
<METADATA>
<uint32: metadata size>
<4-byte magic number "FEA1">
The ARRAY
blobs are data that originated from in-memory Arrow data
structures. Here is what the data structure representing a single primitive
array looks like. There's some C++ memory ownership stuff you can ignore for
this blog post.
struct PrimitiveArray {
PrimitiveType::type type;
int64_t length;
int64_t null_count;
// For ownership of any memory attached to this array
std::vector<std::shared_ptr<Buffer> > buffers;
// If null_count == 0, treated as nullptr
const uint8_t* nulls;
const uint8_t* values;
// For UTF8 and BINARY, not used otherwise
const int32_t* offsets;
};
We made an arbitrary decision of how to arrange the raw data (the null and
values buffers), but the bigger project was coming up with the metadata. We
used an open source project from Google called Flatbuffers to do this. In
Flatbuffers, the information about PrimitiveArray
looks like this:
table PrimitiveArray {
type: Type;
encoding: Encoding = PLAIN;
/// Relative memory offset of the start of the array data excluding the size
/// of the metadata
offset: long;
/// The number of logical values in the array
length: long;
/// The number of observed nulls
null_count: long;
/// The total size of the actual data in the file
total_bytes: long;
}
When it comes down to it, most of the effort of creating Feather was in defining a metadata specification that works for both Python and R, and implementing code for reading and writing the metadata from in-memory data structures. Copying bytes into R or Python in-memory arrays is one of the simplest parts.