Our last report was for March 2019, so I have rolled up two months into one. There have been a number of exciting developments in the project in the last couple of months. See the end of the report for the team's full changelog of patches contributed to Apache Arrow.

Ursa Labs Team growing

Since the last report, Neal Richardson has joined the team, bringing expertise in growing distributed engineering teams and R programming to boot. Joris Van den Bossche from the pandas core team has also come on board on a part-time basis to scale our general Python and pandas-related development efforts in Apache Arrow.

Major C++ Development Initiatives

Generally speaking, the Ursa Labs team is focusing on 4 major projects going forward:

  • Datasets: a common API/framework for reading and writing large, partitioned datasets in a variety of file formats (Memory-mapped Arrow files, Parquet, CSV, JSON, Orc, Avro, etc.) in many different storage systems (local files, HDFS, and cloud storage). This is an essential interface to tie together our file format and filesystem interfaces. A design discussion document is being discussed on the mailing list.
  • Flight RPC: high performance Arrow-based dataset transfer in distributed systems. The purpose of this system is to move Arrow data from one computer to another as fast as possible.
  • DataFrames: a higher-level C++ add-on library for Apache Arrow for manipulating in-memory and larger-than-memory tables resident to a single machine. This interface is to be composed from the lower level RecordBatch and Table data structures in the Arrow C++ library. See our recent discussion document from the mailing list
  • In-memory Query Engine: A database/SQL-style analytical query engine that operates natively on streams of Arrow columnar data. This system will utilize common analytical components from the DataFrames library. See discussion document from the mailing list

Up until this point in the project, our development work has tilted more toward the first two projects but we will be beginning to do some DataFrames work as 2019 advances.

Development Highlights in April and May

Some highlights from our work in Apache Arrow:

  • Buildbot cluster: We are continuing to set up our physical build and test cluster which we'll use to run integration tests, GPU-enabled builds, benchmark comparisons, nightly builds, and other automated tests to help with Arrow development.
  • Mult-threaded JSON file reader: an initial version of a line-delimited JSON reader with type inference is available in C++ and Python for user testing and feedback. We perform parsing and conversion in multiple threads for better performance.
  • Benchmarking utilities: we have begun developing archery, a command-line tool to assist with cross-language benchmark automation and data collection. An initial deliverable is to do comparisons of C++ library performance between codebase revisions on an on-demand basis.
  • Filesystem C++ API: we have begun developing a common high-level API for file-based storage systems including local, network, and cloud storage. This is a necessary part of building the Datasets framework. The local filesystem interface for Windows and POSIX systems has been developed, and we will look into supporting Amazon S3, Google Cloud, HDFS, and other remote filesystems later thi syear
  • Improved Dictionary (Categorical) C++ Support: we addressed some long-standing limitations in the implementation of dictionary-encoded data (which can arise from pandas Categorical types and other places) to permit dictionaries to change as a streaming dataset evolves. This will help with improving Parquet read performance and other future features.
  • Parquet Improvements: the 0.14 release will feature faster file writing (see details in PARQUET-1523). We have also migrated the Parquet C++ library to use common IO and file interfaces used by the rest of the Arrow codebase, which will help us make more performance improvements down the road.

Arrow Flight progress

We've been working since last fall with Two Sigma and Dremio on the new Flight messaging and RPC framework built on top of Google's gRPC library. Flight is an Arrow-native data messaging layer designed for creating high-performance clients and servers that send Arrow-based datasets to each other. By bypassing unnecessary serialization steps, we are able to achieve excellent end-to-end throughput.

One use case for Flight is as an alternative to ODBC or JDBC in database systems.

In the last couple of months we have completed cross-platform support for Flight (i.e. including Windows) and we expect to be able to include Flight support in Python packages in the next release. Flight now also features URI-based locations which can permit different modes of underlying data. transport

Progress on R packaging and CRAN

The Arrow R package is still not available on CRAN but we are investigating various strategies to enable most normal R users to obtain both the Arrow C++ libraries and the Rcpp-based Arrow R library that is easy-to-install and CRAN compatible. We hope to have this sorted out in time for the 0.14 release sometime this summer.

Sponsor Acknowledgements

We are grateful to the support of our sponsors:

  • RStudio
  • NVIDIA AI Labs
  • ODSC Conference
  • Two Sigma Investments

If you or your company would be interested in sponsoring the work of Ursa Labs, please contact us at info@ursalabs.org.

Team Changelog

The team had 103 commits merged into Apache Arrow in April and May. You can click on the ASF JIRA links to learn more about the discussion on a particular issue or the commit hash to see each patch.