Some comments to Daniel Abadi’s blog about Apache Arrow

apache arrow

databases

Author

Wes McKinney

Published

November 1, 2017

Well-known database systems researcher Daniel Abadi published a blog post yesterday asking Apache Arrow vs. Parquet and ORC: Do we really need a third Apache project for columnar data representation?.

Despite the somewhat confrontational title, based on his analysis, the answer is “Yes, we do”, but I have a number of issues to discuss, including in part the premise of the article.

Storage and runtime formats, in context

Arrow is not competing with Parquet and ORC. The article’s implied premise is that Arrow is a third technology for columnar storage, in the same category of projects as Parquet and ORC. We have made abundantly clear since the early days of our work that Arrow is a complementary, companion technology to disk-oriented columnar storage formats like Parquet and ORC. Indeed, many Apache Arrow committers are also Apache Parquet committers. It would not be in our interest to be competing with ourselves.

You can certainly utilize Arrow as the data representation for a column store in a database, but that is not the primary objective of the project.

To make this more clear for lay-people:

Parquet and ORC are file formats designed for persistent, long-term storage on disk that require non-trivial decoding and decompression to be able to perform computations on their contents in-memory
Arrow is an ephemeral, runtime in-memory representation that can be processed without any overhead on a per array-cell basis (unless you compress or encode in-memory, more on this later. It is not by design a file format (though you can use it to create file formats) nor is it intended for long-term storage

To give you a small example of what this means in practice. Let’s consider Parquet format, and the semantic values:

[0, 1, null, 3, null, null, 6, 7, null]

Now suppose that you want to write a function to sum a batch of integers. Your native result is:

int64_t sum_int64(const int64_t* values, size_t length) {
  int64_t result = 0;
  for (size_t i = 0; i < length; ++i) {
    result += values[i];
  }
  return result;
}

But wait, this doesn’t handle nulls. So suppose we use some special sentinel value (like INT64_MIN) to encode nulls:

int64_t sum_int64(const int64_t* values, size_t length) {
  int64_t result = 0;
  for (size_t i = 0; i < length; ++i) {
    if (values[i] != NULL_SENTINEL_INT64) {
      result += values[i];
    }
  }
  return result;
}

Now we’ve already decided on a quite opinionated columnar memory format for representing integers with nulls.

But, if the data is in Parquet, we can’t use this function as is. Omitting Parquet-specific details like dictionary encoding and compression, we may have the data represented using the Google Dremel definition level encoding:

definition levels: 1 1 0 1 0 0 1 1 0
values: 0 1 3 6 7

If you want to use your sum_int64 function, you have to allocate memory then decode this data into the appropriate data structure:

void parquet_decode_int64(const int16_t* def_levels, const int64_t* values,
                          size_t num_levels, int64_t* out) {
  for (size_t i = 0; i < num_levels; ++i) {
    if (def_levels[i] == 1) {
      *out++ = *values++;
    } else {
      *out++ = NULL_SENTINEL_INT64;
    }
  }
}

In practice, what happens is:

Systems define their own way to represent in-memory data, including nulls, nested types, strings, and all the complex data types out there
Algorithms written for a particular system are not portable to other environments

What Arrow provides a Parquet or ORC columnar storage user is a standardized in-memory data structure to place decoded data. This provides the additional benefits that any algorithms that understand Arrow memory can process decoded data at no cost. If you don’t use something standardized, in many cases you will have to roll your own “proprietary”, non-portable data structures.

It’s true that some analytic database systems have their own highly specialized query engines with code generation and many kinds of hardware optimizations. The developers of these systems may be unmotivated by the benefits of a standardized in-memory data representation, and that’s fine.

Zero-copy data access and non-sequential reads

One of the most game-changing aspects of Apache Arrow, is its role in data access and data movement. I discussed this extensively as it relates to data science in my September 21 blog post.

Whether Arrow data is placed in RAM, in POSIX shared memory, non-volatile memory, or memory-mapped files on disk, the cost of accessing a single value in a single column is not dependent on the size of the dataset. This means that whether you have 1 megabyte of data or 1 terabyte, the cost to access any portion of the dataset (outside of the cost of performing general IO on RAM or disk) is effectively zero (there is a tiny, fixed overhead relating to inspecting a dataset’s metadata).

Having the option to abstract away the difference between data on-disk and data in-memory from a coding perspective is hugely liberating from an architectural standpoint, particularly when there are multiple processes or multiple runtime environments involved. Two processes can access the same dataset on disk or in shared-memory without any additional overhead. A dataset can be streamed through a high-bandwidth socket and immediately analyzed on receipt without any additional processing overhead.

We have already seen huge benefits from this serialization-free, zero-copy data access:

In fact, one of the reasons I’ve spent less time on things like compression and encodings is that zero-copy data access and transport has more meaningful immediate benefits to downstream use cases.

Some other points

Prof. Abadi writes:

Now we can understand some of the key differences between Apache Parquet/ORC and Apache Arrow. Parquet and ORC, since they are designed for disk-resident data, support high-ratio compression algorithms such as snappy (both), gzip (Parquet), and zlib (ORC) all of which typically require decompression before data processing (and the associated CPU costs). Meanwhile, Arrow, which is designed for memory-resident-data, does not support these algorithms. The only compression currently supported by Arrow is dictionary compression, a scheme that usually does not require decompression before data processing.

See ARROW-300. Support for other kinds of compression has been discussed since the early days of the project, and the main reason it is not yet supported is limited developer bandwidth. In the C++ library, we already ship libraries with built-in codec support for gzip, zlib, Snappy, LZ4, and ZSTD. We could certainly add RLE, bit-packing, or other compression strategies for integers or other highly-compressible data types. We need to add metadata so that Arrow streams or IPC payloads can be transported with compressed vectors.

So far our time has been consumed with fine details of data representation (especially logical types like timestamps and decimals) so that work can proceed with building libraries of algorithms that process fully-decompressed Arrow data natively.

He writes further:

I assume that the Arrow developers will eventually read my 2006 paper on compression in column-stores and expand their compression options to include other schemes which can be operated on directly (such as run-length-encoding and bit-vector compression). I also expect that they will read the X100 compression paper which includes schemes which can be decompressed using vectorized processing. Thus, I expect that Arrow will eventually support an expanded set of compression options beyond just dictionary compression. But it is far less likely that we will see heavier-weight schemes like gzip and snappy in the Apache Arrow library any time soon.

I’m familiar with these papers as are many of the other Arrow PMC members. That we have not implemented these compression schemes is not a function of our ignorance, but rather our available time and engineering resources. We must walk before we can run.

As far as compression libraries like Snappy and GZIP, I see no reason not to support them as there may be use cases for Arrow where network transfer bandwidth is more of a concern, and using Arrow with buffer compression as an alternative to producing the more expensive Parquet format might be an attractive and simple tradeoff for ephemeral data RPC.

Prof. Abadi later writes:

But for main memory, it takes less than ten sequential reads to amortize the cost of the original random read. This enables the batch size of Apache Arrow data to be much smaller than batch sizes of disk-oriented storage formats. Apache Arrow actually fixes batches to be no more 64K records.

It is true that a system may choose to use small batches, but it is not correct that Arrow constrains batch sizes to be no larger than 64K records. A batch can contain 1 record or a billion. The metadata provides for arbitrarily long Arrow vectors, and we leverage this in the C++ implementation.

Conclusions

I’m excited to engage more with luminaries in the analytic database world on Arrow. One of my objectives with helping start the project was to create the environment for cross-pollination between the analytic database community and the data science community. Compared with the types of runtime systems in use by Python and R programmers, analytic databases are, generally speaking, much more sophisticated in performance and scalability.

We certainly have a lot of work to do, and Prof. Abadi points out areas where more work and community investment is needed. As more pieces of Arrow fall into place, and more integrations in other projects get built, things are going to get very exciting for all involved.