Ursa Labs February 2019 Report

ursa labs
work
Author: Wes McKinney

Published: March 6, 2019

The team had a busy 28 days this February. The Apache Arrow community is discussing a 0.13 release toward the end of March, so we spent February helping the project toward the next release milestone. We have been pushing projects on multiple fronts and discuss some of those here.

The Apache Arrow project just had its 3rd birthday, and we are pleased to report that the community is thriving and growing fast after only a short time as a top-level project in The Apache Software Foundation. We’re really looking forward to what the next few years will bring as the Arrow columnar format and cross-language development platform becomes even more widely adopted.

Open Roles in the Team

We have open roles in the team for an Engineering Manager and Senior Software Engineer. Ursa Labs is a rare opportunity to spend 100% of your time on an ambitious and wide-reaching open source systems project. These are full-time remote roles, so if you live in NYC or the Bay Area and would like to move somewhere else (I think Nashville is pretty great) this could be your chance.

Development Highlights

One significant project not represented in the Apache Arrow open source project is setting up a physical build and test cluster for Ursa Labs. NVIDIA has provided us with two DGX stations and a Jetson TX2 (an AArch64-based computer). To this we have added a 2018 Mac Mini, and we will continue to add machines as needed. This build cluster will be used for nightly tests and packaging builds as well as performance benchmarking. The Arrow community has been discussing public daily performance benchmarking, and there is a new SQL schema for a proposed benchmark database.

In Apache Arrow, we have been working in several areas:

  • Line-delimited JSON reader: an initial C++ implementation of reading JSON files into the Arrow columnar format. We have more work to do here, but this work will form the basis for using directories of JSON files as a data source for in-memory query processing.
  • Arrow Flight, a new RPC / messaging system: we have been collaborating with Two Sigma, one of our gracious sponsors, and Dremio on the development of this new gRPC-based Arrow-native messaging framework. We believe this will form the backbone of future distributed systems powered by Apache Arrow.
  • C++ Arrow Dataset Framework: we have proposed a general-purpose C++ framework for interacting with large datasets stored in a number of different formats. This is an essential component for general-purpose in-memory query processing. This work will replace and generalize some of the pure Python code we already have for pyarrow.parquet.ParquetDataset.
  • Computational Kernels: to lay the foundations for an Arrow-native in-memory query engine, we have been implementing aggregation functions to enable parallel aggregation of Arrow datasets.
  • Gandiva testing and packaging support: we are working diligently to make it possible to ship Gandiva, our LLVM-based expression compiler (for projections and filters), in various package artifacts including Python wheels and conda packages.
  • User-defined Extension Types in C++: we have proposed an initial C++ API for defining custom data types in C++ (eventually for Python, too) that are backed by one of the built-in Arrow columnar data types.
  • LLVM 7 migration: we have upgraded the project, including Gandiva, to use the stable LLVM 7 version.
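To make the line-delimited JSON item above more concrete, here is a minimal sketch of the general idea of turning row-oriented JSON records into a column-oriented layout. It uses only the Python standard library and is not the Arrow C++ reader or its API; the function name `read_json_lines_to_columns` and the sample data are hypothetical, chosen for illustration.

```python
import io
import json


def read_json_lines_to_columns(lines):
    """Parse line-delimited JSON into a dict of column name -> list of values.

    Illustrative only: a real columnar reader would also infer types and
    store values in typed, contiguous buffers rather than Python lists.
    """
    columns = {}
    num_rows = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # A column seen for the first time is backfilled with nulls
        # for all rows that came before it.
        for key in record:
            if key not in columns:
                columns[key] = [None] * num_rows
        # Append this row's value (or None when the key is missing)
        # to every known column so all columns stay the same length.
        for key, values in columns.items():
            values.append(record.get(key))
        num_rows += 1
    return columns


# Hypothetical sample input: two records with different sets of keys.
sample = io.StringIO(
    '{"id": 1, "name": "a"}\n'
    '{"id": 2, "score": 0.5}\n'
)
table = read_json_lines_to_columns(sample)
print(table)
```

Each key becomes one column of equal length, with missing values represented as nulls, which is the shape a columnar query engine expects to consume.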

Upcoming Focus Areas

In March, one of our main priorities will be working with the Arrow community to get the 0.13 release out the door. We will be focusing on several areas to follow on from the work above:

  • Getting our build cluster up and running, to help make Arrow developers more productive, and helping set up automated daily performance benchmarks with the Arrow community
  • Working toward getting initial Arrow Flight support into our packages (like conda packages and Python wheels)
  • Continuing to develop and improve our computational kernels
  • New data type additions to the Arrow format: the community is discussing a new-and-improved timedelta or interval type, as well as a “packed C struct” data type. We are interested in helping implement these new data types. See the Arrow developer mailing list for more details.
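As a rough illustration of what parallel aggregation in the computational kernels involves, the sketch below computes a mean over a chunked column by aggregating each chunk independently and then merging the partial results. This is a standard-library Python sketch of the general technique under our own assumptions, not the Arrow C++ kernel API; `partial_sum_count` and `parallel_mean` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum_count(chunk):
    """Aggregate one chunk into a (sum, count) pair, ignoring nulls."""
    valid = [v for v in chunk if v is not None]
    return sum(valid), len(valid)


def parallel_mean(chunks, max_workers=4):
    """Compute the mean of a chunked column from per-chunk partial results."""
    # Each chunk is aggregated independently, so the work can be spread
    # across worker threads.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(partial_sum_count, chunks))
    # Merge step: partial sums and counts combine by simple addition.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


# Hypothetical chunked column with one null value.
chunks = [[1.0, 2.0, None], [3.0], [4.0, 5.0]]
result = parallel_mean(chunks)
print(result)
```

The mean is carried as a (sum, count) pair because averages of chunks cannot be merged directly, while sums and counts can.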

Team Changelog

The team had 68 commits merged into Apache Arrow in February 2019. You can click on the ASF JIRA links to learn more about the discussion on a particular issue or the commit hash to see each patch.

(Note: some of the patches from early in the month have a “February 8” commit date due to a rebase.)