This blog post discusses the design and performance implications of using
bitmaps to mark null values instead of sentinel values (or special values like
`NaN`).

# How Apache Arrow's columnar format handles null values

Depending on who you talk to, one controversial aspect of the Apache Arrow
columnar format is the fact that it uses a **validity bitmap** to mark values as
null or not null. Here's the basic idea:

- The number of nulls is a property of an array, generally computed once then stored
- If an array has any null values at all, then a bitmap must be allocated
- If the value at bit *i* is 0, then the value is null; otherwise it is not null
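Checking bit *i* amounts to a shift and a mask. Here is a minimal sketch of such a helper (Arrow ships its own version in `arrow/util/bit-util.h`):

```cpp
#include <cstdint>

// Returns true if bit i of the validity bitmap is set (value is non-null).
// Bits are packed least-significant-bit first within each byte.
inline bool GetBit(const uint8_t* bitmap, int64_t i) {
  return (bitmap[i / 8] >> (i % 8)) & 1;
}
```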

There are some nice benefits of the bitmap approach:

- If an array has no nulls, then the bitmap can be ignored or simply not allocated at all
- Nulls can be propagated in algorithms by using word-wise AND or OR operations, processing 32 or 64 values at a time
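As a sketch of the second point: for a binary operation, the output validity bitmap is just the AND of the two input bitmaps, which can be computed a 64-bit word (64 values) at a time. The helper below is an illustration, not Arrow's actual implementation; `memcpy` is used to keep the word-sized loads well-defined:

```cpp
#include <cstdint>
#include <cstring>

// Propagate nulls for a binary operation by AND-ing two validity bitmaps
// one 64-bit word at a time. num_bytes is the bitmap length in bytes; a
// scalar loop cleans up any trailing bytes.
void AndValidityBitmaps(const uint8_t* left, const uint8_t* right,
                        int64_t num_bytes, uint8_t* out) {
  int64_t i = 0;
  for (; i + 8 <= num_bytes; i += 8) {
    uint64_t l, r;
    std::memcpy(&l, left + i, 8);
    std::memcpy(&r, right + i, 8);
    const uint64_t both = l & r;  // valid only where both inputs are valid
    std::memcpy(out + i, &both, 8);
  }
  for (; i < num_bytes; ++i) {
    out[i] = left[i] & right[i];
  }
}
```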

Still, for systems like R which use special values to mark nulls, bitmaps can
be controversial. The argument sometimes centers on the performance
implications of having to check bits rather than comparing a value with
`INT32_MIN` or checking if it is `NaN`. It is also noted that `NaN` values are
propagated automatically by the CPU in arithmetic operations, so propagating
nulls in the bitmap case will necessarily be slower.
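The flip side of that automatic propagation is that `NaN` poisons arithmetic: a blind sum over data that uses `NaN` as a null sentinel returns `NaN`, which is why the sentinel-based sum below has to branch on every value. A small illustration:

```cpp
#include <cmath>
#include <limits>

// A sum with no null handling at all. If any element is the NaN sentinel,
// the CPU propagates it and the total comes back as NaN.
double BlindSum(const double* values, int n) {
  double total = 0;
  for (int i = 0; i < n; ++i) {
    total += values[i];
  }
  return total;
}
```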

# The database perspective

From the perspective of databases and data warehousing, reserving certain
values to mark a null (or NA) is widely considered unacceptable. `NaN` is valid
data, as is `INT32_MIN` and other common values used as sentinels.

Most SQL systems store the null-ness of a value using an extra bit or byte. File formats like Apache Avro and Parquet have a separate null/not-null encoding.
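To make the byte-per-value variant concrete, here is a hypothetical column layout (the struct and its field names are my own, not from any particular engine), keeping one indicator byte alongside each value:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch of a nullable column with one indicator byte per
// value, similar in spirit to how many SQL engines track null-ness.
struct NullableInt32Column {
  std::vector<int32_t> values;
  std::vector<uint8_t> is_valid;  // 1 = non-null, 0 = null

  int64_t null_count() const {
    int64_t nulls = 0;
    for (uint8_t v : is_valid) {
      nulls += (v == 0);
    }
    return nulls;
  }
};
```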

# Are sentinel values faster than bitmaps (for analytics)?

To show that the performance argument against bitmaps is not persuasive, I
wrote some in-memory performance benchmarks (I would like to do some
experiments with large memory-mapped datasets). To keep things simple, let's
consider the `sum` operation, which must aggregate all of the non-null values in
an array. We also will track the non-null count.

We are interested in a few typical cases:

- Data with no nulls
- Data with a small percentage of nulls, say 10%
- Data with a high percentage of nulls, say 50%

To keep track of the results of a sum, I use a simple struct, templated on the type of the values:

```cpp
template <typename T>
struct SumState {
  SumState() : total(0), valid_count(0) {}

  T total;
  int64_t valid_count;
};
```

The naive sum looks like this:

```cpp
template <typename T>
struct SumNoNulls {
  static void Sum(const T* values, int64_t length, SumState<T>* state) {
    for (int64_t i = 0; i < length; ++i) {
      state->total += *values++;
    }
    state->valid_count += length;
  }
};
```

Let's consider double precision floating point values, where `NaN` is a common
sentinel value. So a sum function for `double` is:

```cpp
template <typename T>
struct SumWithNaN {
  static void Sum(const T* values, int64_t length, SumState<T>* state) {
    for (int64_t i = 0; i < length; ++i) {
      if (*values == *values) {  // NaN is not equal to itself
        state->total += *values;
        ++state->valid_count;
      }
      ++values;
    }
  }
};
```

If we represent nulls with bits, we could write the sum naively like so:

```cpp
#include "arrow/util/bit-util.h"

template <typename T>
struct SumBitmapNaive {
  static void Sum(const T* values, const uint8_t* valid_bitmap,
                  int64_t length, SumState<T>* state) {
    for (int64_t i = 0; i < length; ++i) {
      if (BitUtil::GetBit(valid_bitmap, i)) {
        state->total += *values;
        ++state->valid_count;
      }
      ++values;
    }
  }
};
```

# Summing much faster: eliminate branching, unroll loops, pre-compute popcounts

It's possible to go much faster using bitmaps with a few optimization techniques:

- Eliminate the "if" statements by using floating point operations to conditionally add the non-null values
- "Unroll" part of the for loop to sum 8 values at a time
- Skip null checking for groups of 8 values with no nulls. This is made faster still by generating a pre-populated table of bit popcounts for `uint8_t` values.
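That popcount table can be built once up front. A sketch of the construction (the benchmark code's actual `kBytePopcount` table may be generated differently):

```cpp
#include <array>
#include <cstdint>

// Build a table mapping each possible byte value to its popcount (number
// of set bits), so the vectorized sum can update valid_count for 8 values
// with a single table lookup.
std::array<int, 256> MakeBytePopcounts() {
  std::array<int, 256> table{};
  for (int byte = 0; byte < 256; ++byte) {
    int count = 0;
    for (int bit = 0; bit < 8; ++bit) {
      count += (byte >> bit) & 1;
    }
    table[byte] = count;
  }
  return table;
}
```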

Here's the meat of this new vectorized sum algorithm:

```cpp
const int64_t whole_bytes = length / 8;
for (int64_t i = 0; i < whole_bytes; ++i) {
  const uint8_t valid_byte = valid_bitmap[i];
  if (valid_byte < 0xFF) {
    // Some nulls
    state->total += (values[0] * (valid_byte & 1)) +
                    (values[1] * ((valid_byte >> 1) & 1)) +
                    (values[2] * ((valid_byte >> 2) & 1)) +
                    (values[3] * ((valid_byte >> 3) & 1)) +
                    (values[4] * ((valid_byte >> 4) & 1)) +
                    (values[5] * ((valid_byte >> 5) & 1)) +
                    (values[6] * ((valid_byte >> 6) & 1)) +
                    (values[7] * ((valid_byte >> 7) & 1));
    state->valid_count += kBytePopcount[valid_byte];
  } else {
    // No nulls
    state->total += values[0] + values[1] + values[2] + values[3] +
                    values[4] + values[5] + values[6] + values[7];
    state->valid_count += 8;
  }
  values += 8;
}
```
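Note that this main loop only covers the `length / 8` whole bytes of the bitmap; when `length` is not a multiple of 8, the last few values need a scalar tail. One way to sketch that cleanup (the helper name and signature here are my own):

```cpp
#include <cstdint>

// Handle the trailing length % 8 values that the unrolled loop over whole
// bitmap bytes does not cover, checking one bit at a time.
template <typename T>
void SumTail(const T* values, const uint8_t* valid_bitmap, int64_t length,
             T* total, int64_t* valid_count) {
  for (int64_t i = (length / 8) * 8; i < length; ++i) {
    if ((valid_bitmap[i / 8] >> (i % 8)) & 1) {
      *total += values[i];
      ++*valid_count;
    }
  }
}
```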

To give a fair comparison with a sum without nulls, I also wrote similar
versions of this that sum 8 values at a time in the non-nullable case and the
`NaN` sentinel value case. I also added the same benchmark with `int64` values
using `INT64_MIN` as the sentinel value, in case integer operations perform
differently from floating point operations.

**Complete benchmarking code is here**. Here are the performance results
for the 3 cases where there are 0%, 10%, and 50% nulls, respectively. The size
of the array being summed is 10 million values. The machine is my Xeon E3-1505M
Linux laptop.

It's definitely possible that I'm doing something suboptimal in my implementations. I would love to hear from algorithms experts!

Some observations on these benchmark results:

- The "no nulls" sum functions, both vectorized and not, perform the same for all 3 cases, since they do no null checking at all
- When there are no nulls, there is no performance penalty with the optimized bitmap sum
- When there are 10% nulls, the bitmap sum is about 30% faster than the sentinel-based sum. The performance gap widens when the percentage of nulls is higher.
- The vectorized bitmap sum is faster when there are 50% nulls than when there are 10% nulls. I am not sure why this is; I would speculate that because on average many of the terms in the sum are 0 that the processor is able to avoid some floating point operations. It might be possible to further optimize by reducing the number of possible FP operations using a "binary tree" sum.
- Vectorization / batching brings little benefit to the sentinel-based algorithm
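On the "binary tree" sum idea: the intent would be to pair up the additions so the dependency chain for 8 values is 3 additions deep instead of 7, possibly letting the CPU overlap more floating point operations. A hedged sketch of what that might look like for one byte of the bitmap (I have not benchmarked this variant):

```cpp
#include <cstdint>

// "Binary tree" variant of the masked 8-value sum: pair terms so the
// floating point add chain is log2(8) = 3 deep rather than a serial chain
// of 7 adds.
double TreeSum8(const double* v, uint8_t valid_byte) {
  const double t0 = v[0] * (valid_byte & 1)        + v[1] * ((valid_byte >> 1) & 1);
  const double t1 = v[2] * ((valid_byte >> 2) & 1) + v[3] * ((valid_byte >> 3) & 1);
  const double t2 = v[4] * ((valid_byte >> 4) & 1) + v[5] * ((valid_byte >> 5) & 1);
  const double t3 = v[6] * ((valid_byte >> 6) & 1) + v[7] * ((valid_byte >> 7) & 1);
  return (t0 + t1) + (t2 + t3);
}
```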

# Conclusions

Using a bit- or byte-based masked null representation is a requirement for a common memory format to be accepted by database systems as well as data science libraries, so using sentinel values at all is not really an option in a project like Apache Arrow. But, in many cases, using bitmaps allows faster algorithms to be developed that are not possible with the sentinel value-based null representation used in pandas and R.

I'm not a perfect programmer, so I'll be interested to see what other kinds of optimizations others can cook up. Take a look at the benchmark code!