Over the last year, I have been working with the Apache Parquet community to build out parquet-cpp, a first class C++ Parquet file reader/writer implementation suitable for use in Python and other data applications. Uwe Korn and I have built the Python interface and integration with pandas within the Python codebase (pyarrow) in Apache Arrow.

This blog is a follow up to my 2017 Roadmap post.

Design: High performance columnar data in Python

The Apache Arrow and Parquet C++ libraries are complementary technologies that we've been engineering to work well together.

  • Arrow C++ libraries provide memory management, efficient IO (files, memory maps, HDFS), in-memory columnar array containers, and extremely fast messaging (IPC / RPC). I will write more about Arrow's messaging layer in another blog post.

  • The Parquet C++ libraries are responsible for encoding and decoding the Parquet file format. We have implemented a libparquet_arrow library that handles transport between in-memory Arrow data and the low-level Parquet reader/writer tools.

  • PyArrow provides a Python interface to all of this, and handles fast conversions to pandas.DataFrame.

One of the primary goals of Apache Arrow is to be an efficient, interoperable columnar memory transport layer.

You can read about the Parquet user API in the PyArrow codebase. The libraries are available from conda-forge at:

conda install pyarrow arrow-cpp parquet-cpp -c conda-forge

Performance Benchmarks: PyArrow and fastparquet

To get an idea of PyArrow's performance, I generated a 512 megabyte dataset of numerical data that exhibits different Parquet use cases. I generated two variants of the dataset:

  • High entropy: all of the data values in the file (with the exception of null values) are distinct. This dataset occupies 469 MB on disk.

  • Low entropy: the data exhibits a high degree of repetition. This data encodes and compresses to a very small size: only 23 MB with Snappy compression. If you write the file with dictionary encoding, it is even smaller. Because decoding such files becomes more CPU bound than IO bound, you can typically expect higher data throughput from low entropy data files.
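The two variants can be sketched with NumPy (a toy illustration, not the exact benchmark generator; the sizes here are made up):

```python
import numpy as np

n = 100_000

# High entropy: every value drawn independently, essentially all distinct.
high_entropy = np.random.randn(n)

# Low entropy: draw far fewer unique values and tile them, so Parquet's
# dictionary encoding and general-purpose compression both work well.
low_entropy = np.random.randn(n // 1000).repeat(1000)

print(high_entropy.size, low_entropy.size)  # -> 100000 100000
print(np.unique(low_entropy).size)          # at most 100 distinct values
```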

I wrote these files with the three main compression styles in use: uncompressed, Snappy, and gzip. I then computed the wall clock time to obtain a pandas DataFrame from disk.

fastparquet is a newer Parquet file reader/writer implementation for Python users, created for use in the Dask project. It is implemented in Python and uses the Numba Python-to-LLVM compiler to accelerate the Parquet decoding routines. I installed it as well to compare against an alternative implementation.

The code to read a file as a pandas.DataFrame is similar:

# PyArrow
import pyarrow.parquet as pq
df1 = pq.read_table(path).to_pandas()

# fastparquet
import fastparquet
df2 = fastparquet.ParquetFile(path).to_pandas()

The green bars are the PyArrow timings: longer bars indicate faster performance / higher data throughput. Hardware is a Xeon E3-1505 laptop.

I just updated these benchmarks on February 1, 2017 against the latest codebases.

Parquet Python performance

Development status

We are in need of help on Windows builds and packaging. Also, keeping the conda-forge packages up to date is very time consuming. Of course, we're looking for both C++ and Python developers to contribute to the codebases in general.

So far, we have focused on having a production-quality implementation of the file format with strong performance reading and writing flat datasets. We are starting to move onto handling nested JSON-like data natively in parquet-cpp using Arrow as the container for the nested columnar data.

Uwe Korn recently implemented initial support for the Arrow List type in conversions to pandas:

In [9]: arr = pa.from_pylist([[1,2,3], None, [1, 2], [], [4]])

In [10]: arr
Out[10]: <pyarrow.array.ListArray object at 0x7f562d551818>

In [11]: arr.type
Out[11]: DataType(list<item: int64>)

In [12]: t = pa.Table.from_arrays([arr], ['col'])

In [13]: t.to_pandas()
Out[13]:
0  [1, 2, 3]
1       None
2     [1, 2]
3         []
4        [4]

Benchmarking code

Here is the code:

import os
import time

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import fastparquet as fp

def generate_floats(n, pct_null, repeats=1):
    nunique = int(n / repeats)
    unique_values = np.random.randn(nunique)

    num_nulls = int(nunique * pct_null)
    null_indices = np.random.choice(nunique, size=num_nulls, replace=False)
    unique_values[null_indices] = np.nan

    return unique_values.repeat(repeats)

DATA_GENERATORS = {
    'float64': generate_floats
}

def generate_data(total_size, ncols, pct_null=0.1, repeats=1, dtype='float64'):
    type_ = np.dtype(dtype)
    nrows = total_size / ncols / type_.itemsize

    datagen_func = DATA_GENERATORS[dtype]

    data = {
        'c' + str(i): datagen_func(nrows, pct_null, repeats)
        for i in range(ncols)
    }
    return pd.DataFrame(data)

def write_to_parquet(df, out_path, compression='SNAPPY'):
    arrow_table = pa.Table.from_pandas(df)
    if compression == 'UNCOMPRESSED':
        compression = None
    pq.write_table(arrow_table, out_path, use_dictionary=False,
                   compression=compression)

def read_fastparquet(path):
    return fp.ParquetFile(path).to_pandas()

def read_pyarrow(path, nthreads=1):
    return pq.read_table(path, nthreads=nthreads).to_pandas()

MEGABYTE = 1 << 20
DATA_SIZE = 512 * MEGABYTE
NCOLS = 16
NITER = 5  # number of timing iterations per file

cases = {
    'high_entropy': {
        'pct_null': 0.1,
        'repeats': 1
    },
    'low_entropy': {
        'pct_null': 0.1,
        'repeats': 1000
    }
}

def get_timing(f, path, niter):
    start = time.clock_gettime(time.CLOCK_MONOTONIC)
    for i in range(niter):
        f(path)
    elapsed = time.clock_gettime(time.CLOCK_MONOTONIC) - start
    return elapsed


results = []

readers = [
    ('fastparquet', lambda path: read_fastparquet(path)),
    ('pyarrow', lambda path: read_pyarrow(path)),
]

case_files = {}

for case, params in cases.items():
    for compression in ['UNCOMPRESSED', 'SNAPPY', 'GZIP']:
        path = '{0}_{1}.parquet'.format(case, compression)
        df = generate_data(DATA_SIZE, NCOLS, **params)
        write_to_parquet(df, path, compression=compression)
        df = None
        case_files[case, compression] = path

for case, params in cases.items():
    for compression in ['UNCOMPRESSED', 'SNAPPY', 'GZIP']:
        path = case_files[case, compression]

        # prime the file cache
        read_pyarrow(path)

        for reader_name, f in readers:
            elapsed = get_timing(f, path, NITER) / NITER
            result = case, compression, reader_name, elapsed
            print(result)
            results.append(result)