Ursa Labs January 2019 Report

Author: Wes McKinney

Published: February 5, 2019

Ursa Labs had a busy January that went by too quickly. After three high-intensity months of development, we helped release Apache Arrow 0.12 on January 20th. A good chunk of our time was spent fighting fires in packaging and builds, a consequence of the project's continued expansion in recent months.

The 0.12 release contains a new merged documentation site where you can expect more project-level documentation to appear this year.

Upcoming Focus Areas

The team is working in a number of areas in the near future:

  • Building out the gRPC-based Flight RPC system
  • Computational kernels in C++ in support of a future Arrow-native in-memory query engine
  • Parquet file performance and memory use improvements. We also plan to improve reading and writing of nested types, which currently have only limited support
  • Reading line-delimited JSON datasets into the Arrow format
  • Automation and bot-based job triggering in our physical build cluster. We’re hoping to have GitHub bots that we can ask to run various builds in Arrow pull requests
  • Packaging Gandiva (LLVM expression compiler) and gRPC in conda and wheel Python packages
  • Work toward getting the R Arrow package on CRAN
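
To make the line-delimited JSON item above more concrete, here is a minimal sketch of what turning row-oriented JSON lines into a columnar layout involves. It uses only the Python standard library and is not the Arrow reader itself; the function name `jsonl_to_columns` and the sample data are hypothetical, and plain Python lists stand in for Arrow arrays.

```python
import json


def jsonl_to_columns(lines):
    """Convert line-delimited JSON records into a dict of column lists.

    Hypothetical helper for illustration only: a real columnar reader
    would also infer types and build typed arrays instead of lists.
    """
    columns = {}
    n_rows = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        # A key first seen in a later row gets nulls for all earlier rows.
        for key in record:
            if key not in columns:
                columns[key] = [None] * n_rows
        # Append this row's value to every column, using None when missing.
        for key, values in columns.items():
            values.append(record.get(key))
        n_rows += 1
    return columns


data = '{"a": 1, "b": "x"}\n{"a": 2}\n{"a": 3, "b": "z", "c": true}\n'
columns = jsonl_to_columns(data.splitlines())
print(columns)
```

Each key becomes one column, and missing values are filled with nulls so that every column has the same length. Keeping columns aligned like this is the property a columnar format depends on.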

C++ highlights

We made many improvements to our build system and developer tools. Beyond those more esoteric details, highlights include:

  • Improved columnar array builder performance
  • Gandiva LLVM compiler support on Windows
  • Refactoring in Parquet C++ to eventually permit direct-to-categorical reads for pandas users
  • Toolchain improvements to support the gRPC-based Flight initiative
  • Alpine Linux support

Hardening the new Flight RPC system for production and making it available to C++, Python, and Java developers is a major area of upcoming development interest.

Python highlights

In Python we are working with the Ray, TensorFlow, and PyTorch communities to resolve some packaging issues related to the manylinux1 standard for wheel binary packages. The outcome there is as yet uncertain.

Some other highlights include:

  • Bindings for buffered input and output stream classes for better performance with high-latency file systems like S3 and Google Cloud
  • Support for pandas 0.24.x
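
As a rough illustration of why buffered streams help on high-latency file systems, the sketch below counts how many reads reach the underlying source with and without a buffer. It uses Python's standard `io` module rather than the Arrow bindings, and `CountingRawStream` is a hypothetical stand-in for a remote file where every raw read would cost a network round trip.

```python
import io


class CountingRawStream(io.RawIOBase):
    """Hypothetical stand-in for a high-latency file; counts raw reads."""

    def __init__(self, payload):
        super().__init__()
        self._source = io.BytesIO(payload)
        self.read_calls = 0

    def readable(self):
        return True

    def readinto(self, buffer):
        # Each call here represents one round trip to the remote store.
        self.read_calls += 1
        return self._source.readinto(buffer)


def consume(stream, chunk_size, total):
    """Read `total` bytes from `stream` in small `chunk_size` pieces."""
    consumed = 0
    while consumed < total:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        consumed += len(chunk)
    return consumed


payload = b"x" * 4096

# Unbuffered: every 16-byte read goes straight to the raw stream.
raw = CountingRawStream(payload)
consume(raw, 16, len(payload))
unbuffered_calls = raw.read_calls

# Buffered: small reads are served from a 1024-byte in-memory buffer.
raw = CountingRawStream(payload)
consume(io.BufferedReader(raw, buffer_size=1024), 16, len(payload))
buffered_calls = raw.read_calls

print(unbuffered_calls, buffered_calls)
```

The buffered reader fetches larger blocks from the source and serves the small reads from memory, so far fewer requests reach the slow layer.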

R highlights

After a very busy fall, January was a lighter month for R:

  • Bindings for the Arrow C++ CSV parser
  • Multithreaded conversions from arrow::Table to R data.frame
  • Bindings for compressed input and output streams, which can be used in many contexts

We are working on a plan to get Arrow into CRAN to make it easier for R users to install the software. There are some hurdles, including getting the Arrow C++ libraries into Debian, Fedora, and Homebrew. If you would like to help with packaging, we would appreciate the assistance.

Ursa Labs Development Infrastructure

Thanks to generous donations of hardware from NVIDIA, Ursa Labs now has two DGX Station machines hosted in Nashville, Tennessee, for the team to develop on. Each has a 20-core Xeon processor, 4 GPUs, and 256 GB of RAM. NVIDIA has also donated a Jetson TX2 dev kit for development and testing on AArch64.

Our build cluster is growing and we intend to use this hardware to make the Apache Arrow community more productive.

Conference Talks, Blog Posts, and other reading

Wes spoke at two conferences.

We also published a blog post about some work in Arrow 0.12.

Apache Arrow community notes

As Apache Arrow approaches its third birthday as a top-level Apache project, we have surpassed 3000 stars on GitHub with over 240 unique contributors.

There is an ongoing discussion about building a benchmark database to test the different Arrow libraries on many kinds of hardware, including different CPU and GPU architectures.

We are in the process of receiving the donation of DataFusion, a Rust-based in-memory query engine.

Team Changelog

The team had 86 commits merged into Apache Arrow in January. You can click on the ASF JIRA links to learn more about the discussion on a particular issue or the commit hash to see each patch.