In this post, I show how Parquet can encode very large datasets in a small file footprint, and how we can achieve data throughput significantly exceeding disk IO bandwidth by exploiting parallelism (multithreading).

## Apache Parquet: Top performer on low-entropy data

As you can read in the Apache Parquet format specification, the format features multiple layers of encoding to achieve small file size, among them:

• Dictionary encoding (similar to how pandas.Categorical represents data, though the two concepts are not equivalent)
• Data page compression (Snappy, Gzip, LZO, or Brotli)
• Run-length encoding (for null indicators and dictionary indices) and integer bit-packing

To give you an idea of how this works, let's consider the dataset:

['banana', 'banana', 'banana', 'banana', 'banana', 'banana',
'banana', 'banana', 'apple', 'apple', 'apple']


Almost all Parquet implementations dictionary-encode by default, so the first encoding pass produces:

dictionary: ['banana', 'apple']
indices: [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1]


The dictionary indices are further run-length encoded:

dictionary: ['banana', 'apple']
indices (RLE): [(8, 0), (3, 1)]


Working backwards, you can easily reconstruct the original dense array of strings.
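The two passes above can be sketched in a few lines of pure Python. This is only an illustration of the idea, not how parquet-cpp actually implements it, and the function names are mine, not part of any Parquet library:

```python
# Illustrative sketch of dictionary encoding followed by run-length
# encoding; function names are hypothetical, not a Parquet API.

def dictionary_encode(values):
    # Map each distinct value to an integer index in order of first appearance.
    dictionary, seen, indices = [], {}, []
    for v in values:
        if v not in seen:
            seen[v] = len(dictionary)
            dictionary.append(v)
        indices.append(seen[v])
    return dictionary, indices

def run_length_encode(indices):
    # Collapse consecutive repeats into (count, index) pairs.
    runs = []
    for i in indices:
        if runs and runs[-1][1] == i:
            runs[-1] = (runs[-1][0] + 1, i)
        else:
            runs.append((1, i))
    return runs

def decode(dictionary, runs):
    # Work backwards: expand the runs, then look each index up in the dictionary.
    return [dictionary[i] for count, i in runs for _ in range(count)]

data = ['banana'] * 8 + ['apple'] * 3
dictionary, indices = dictionary_encode(data)
runs = run_length_encode(indices)
# dictionary is ['banana', 'apple'], runs is [(8, 0), (3, 1)]
assert decode(dictionary, runs) == data
```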

In my prior blog post, I created a dataset that compresses very well with this style of encoding. When writing with pyarrow, we can turn on and off dictionary encoding (which is on by default) to see how it impacts file size:

import pyarrow.parquet as pq

pq.write_table(dataset, out_path, use_dictionary=True,
               compression='snappy')


With Snappy compression and dictionary encoding, a dataset that occupies 1 gigabyte (1024 MB) in a pandas.DataFrame shrinks to an amazing 1.436 MB, small enough to fit on an old-school floppy disk. Without dictionary encoding, it occupies 44.4 MB.

## Parallel reads in parquet-cpp via PyArrow

In parquet-cpp, the C++ implementation of Apache Parquet (which we've made available to Python through PyArrow), we recently added parallel column reads.

To try this out, install PyArrow from conda-forge:

conda install pyarrow -c conda-forge


Now, when reading a Parquet file, use the nthreads argument:

import pyarrow.parquet as pq

table = pq.read_table(out_path, nthreads=4)

Since all of the underlying machinery here is implemented in C++, other languages (such as R) can build interfaces to Apache Arrow (the common columnar data structures) and parquet-cpp. The Python bindings are a lightweight wrapper on top of the underlying libarrow and libparquet C++ libraries.