Ursa Labs January 2019 Report

Author: Wes McKinney

Published: February 5, 2019

Ursa Labs had a busy January that went by too quickly. After three high-intensity months of development, we helped release Apache Arrow 0.12 on January 20th. A good chunk of our time went to fighting fires in packaging and builds, a consequence of the project's continued expansion in recent months.

The 0.12 release contains a new merged documentation site where you can expect more project-level documentation to appear this year.

Upcoming Focus Areas

The team plans to work in a number of areas in the near future:

Building out the gRPC-based Flight RPC system
Computational kernels in C++ in support of a future Arrow-native in-memory query engine
Parquet file performance and memory use improvements. We also plan to improve reading and writing of nested types, which currently have only limited support
Reading line-delimited JSON datasets to the Arrow format
Automation and bot-based job triggering in our physical build cluster. We’re hoping to have GitHub bots that we can ask to run various builds in Arrow pull requests
Packaging Gandiva (LLVM expression compiler) and gRPC in conda and wheel Python packages
Work toward getting the R Arrow package on CRAN
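
As a rough illustration of what reading line-delimited JSON into a columnar layout involves, here is a minimal sketch using only Python's standard library. It is not the Arrow C++ reader; `read_json_lines_to_columns` is a hypothetical helper, and a real reader would also infer types and build Arrow arrays rather than Python lists.

```python
import io
import json


def read_json_lines_to_columns(lines):
    """Parse line-delimited JSON records into a dict of column lists.

    Hypothetical helper for illustration only. Keys missing from a
    record are stored as None (a null) in that row.
    """
    columns = {}
    num_rows = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for name in record:
            if name not in columns:
                # A column first seen now is back-filled with nulls
                # for all rows read so far.
                columns[name] = [None] * num_rows
        for name, values in columns.items():
            values.append(record.get(name))
        num_rows += 1
    return columns


sample = io.StringIO('{"a": 1, "b": "x"}\n{"a": 2}\n{"b": "z", "c": true}\n')
columns = read_json_lines_to_columns(sample)
```

Each key becomes one column whose length equals the number of records, which is the shape a columnar format needs.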

C++ highlights

We made many improvements to our build system and developer tools. Beyond those more esoteric details, highlights include:

Improved columnar array builder performance
Gandiva LLVM compiler support on Windows
Refactoring in Parquet C++ to eventually permit direct-to-categorical reads for pandas users
Toolchain improvements to support the gRPC-based Flight initiative
Alpine Linux support
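
For readers unfamiliar with the idea behind direct-to-categorical reads: a categorical column stores each distinct value once, plus an integer index per row, instead of a full copy of every string. The sketch below shows that encoding in plain Python. It is an illustration of the general technique, not the Parquet C++ implementation, and `dictionary_encode` is a hypothetical helper.

```python
def dictionary_encode(values):
    """Split values into (dictionary, indices).

    Hypothetical helper for illustration only. The dictionary holds
    each distinct value once, in order of first appearance; indices
    holds one position per row, with None marking a null row.
    """
    dictionary = []
    positions = {}  # value -> its position in the dictionary
    indices = []
    for value in values:
        if value is None:
            indices.append(None)
            continue
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return dictionary, indices


dictionary, indices = dictionary_encode(
    ["apple", "pear", "apple", None, "pear", "apple"]
)
```

Reading data that is already stored this way straight into a categorical avoids materializing every repeated string first.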

Hardening the new Flight RPC system for production and making it available to C++, Python, and Java developers is a major area of upcoming development interest.

Python highlights

In Python we are working with the Ray, TensorFlow, and PyTorch communities to resolve some packaging issues related to the manylinux1 standard for wheel binary packages. The outcome there is as yet uncertain.

Some other highlights include:

Bindings for buffered input and output stream classes for better performance with high-latency file systems like S3 and Google Cloud
Support for pandas 0.24.x
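
To illustrate why buffering helps on high-latency storage, here is a minimal sketch using only Python's standard library rather than the pyarrow bindings themselves. `CountingRawStream` and `count_raw_reads` are hypothetical names introduced for this illustration. Each raw read stands in for one round trip to a remote store.

```python
import io


class CountingRawStream(io.RawIOBase):
    """Hypothetical stand-in for a high-latency source.

    Every call to readinto() represents one round trip and is counted.
    """

    def __init__(self, data):
        self._inner = io.BytesIO(data)
        self.raw_reads = 0

    def readable(self):
        return True

    def readinto(self, b):
        self.raw_reads += 1
        return self._inner.readinto(b)


def count_raw_reads(data, chunk_size, buffer_size=None):
    """Read data in small chunks and return how many raw reads occurred."""
    raw = CountingRawStream(data)
    if buffer_size is None:
        stream = raw  # every small read goes straight to the source
    else:
        stream = io.BufferedReader(raw, buffer_size=buffer_size)
    while stream.read(chunk_size):
        pass
    return raw.raw_reads


data = b"x" * (64 * 1024)
unbuffered = count_raw_reads(data, chunk_size=64)
buffered = count_raw_reads(data, chunk_size=64, buffer_size=16 * 1024)
print(unbuffered, buffered)
```

With a buffer, many small reads are served from memory and only occasional large reads reach the underlying source, so far fewer round trips are needed.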

R highlights

After a very busy fall, January was a lighter month for R:

Bindings for the Arrow C++ CSV parser
Multithreaded conversions from arrow::Table to R data.frame
Bindings for compressed input and output streams, which can be used in many contexts

We are working on a plan to get Arrow into CRAN to make it easier for R users to install the software. There are some hurdles, including getting the Arrow C++ libraries into Debian, Fedora, and Homebrew. If you would like to help with packaging, we would appreciate the assistance.

Ursa Labs Development Infrastructure

Thanks to generous donations of hardware from NVIDIA, Ursa Labs now has two DGX Station machines hosted in Nashville, Tennessee, for the team to develop on. Each has a 20-core Xeon processor, 4 GPUs, and 256 GB of RAM. NVIDIA has also donated a Jetson TX2 dev kit for development and testing on AArch64.

Our build cluster is growing and we intend to use this hardware to make the Apache Arrow community more productive.

Conference Talks, Blog Posts, and other reading

Wes spoke at two conferences:

Ursa Labs and Apache Arrow 2019 at PyData Miami 2019
The same talk, tailored for the R community, at RStudio Conference 2019

We also published a blog post about some work in Arrow 0.12:

Reducing Python String Memory Use in Apache Arrow 0.12

Apache Arrow community notes

As Apache Arrow approaches its third birthday as a top-level Apache project, we have surpassed 3000 stars on GitHub with over 240 unique contributors.

There is a discussion happening about building a benchmark database to test the different Arrow libraries on many different kinds of hardware, including different CPU and GPU architectures.

We are in the process of receiving the donation of DataFusion, a Rust-based in-memory query engine.

Team Changelog

The team had 86 commits merged into Apache Arrow in January. You can click on the ASF JIRA links to learn more about the discussion on a particular issue or the commit hash to see each patch.

2019-01-01: ARROW-3910: [Python] Set date_as_objects=True as default in to_pandas methods (9376d8 by wesm)
2019-01-03: ARROW-4148: [CI/Python] Disable ORC on nightly Alpine builds (6ca8fc by kszucs)
2019-01-03: ARROW-4009: [CI] Run Valgrind and C++ code coverage in different builds (7f1fbf by pitrou)
2019-01-04: ARROW-3760: [R] Support Arrow CSV reader (fba4f3 by romainfrancois)
2019-01-04: PARQUET-690: [C++] Reuse Thrift resources when serializing metadata structures (4057b5 by wesm)
2019-01-04: ARROW-4150: [C++] Ensure allocated buffers have non-null data pointer (1ff797 by pitrou)
2019-01-04: ARROW-4157: [C++] Fix clang documentation warnings on Ubuntu 18.04 (161d00 by wesm)
2019-01-04: ARROW-4149: [CI/C++] Parquet test misses ZSTD compression codec in CMake 3.2 nightly builds (1e9a23 by kszucs)
2019-01-04: ARROW-4158: Allow committers to set ARROW_GITHUB_API_TOKEN for merge script, better debugging output (c322ae by wesm)
2019-01-07: ARROW-4179: [Python] Use more public API to determine whether a test has a pytest mark or not (1aecb9 by wesm)
2019-01-07: ARROW-4125: [Python] Don’t fail ASV if Plasma extension is not built (e.g. on Windows) (b92b1f by wesm)
2019-01-08: ARROW-4178: [C++] Fix TSan and UBSan errors (4f2f53 by pitrou)
2019-01-08: ARROW-4200: [C++/Python] Enable conda_env_python.yml to work on Windows, simplify python/development.rst (090a8c by wesm)
2019-01-08: ARROW-4186: [C++] BitmapWriter shouldn’t clobber data when length == 0 (326015 by pitrou)
2019-01-09: ARROW-3233: [Python] Add prose documentation for CUDA support (bcfaca by pitrou)
2019-01-09: ARROW-4118: [Python] Fix benchmark setup for "asv run" (3330d6 by pitrou)
2019-01-09: ARROW-3997: [Documentation] Clarify dictionary index type (6b496f by pitrou)
2019-01-09: ARROW-4138: [Python] Fix setuptools_scm version customization on Windows (84b221 by wesm)
2019-01-09: ARROW-4177: [C++] Add ThreadPool and TaskGroup microbenchmarks (b29ecd by pitrou)
2019-01-09: ARROW-2968: [R] Multi-threaded conversion from Arrow table to R data.frame (3b6134 by romainfrancois)
2019-01-09: ARROW-3126: [Python] Make Buffered* IO classes available to Python, incorporate into input_stream, output_stream factory functions (7fcad2 by kszucs)
2019-01-10: [Release/Java] Disable Flight test case (76618f by kszucs)
2019-01-10: ARROW-4216: [Python] Add CUDA API docs (5a502d by pitrou)
2019-01-10: ARROW-4210: [Python] Mention boost-cpp directly in the conda meta.yaml for pyarrow (fc7b41 by kszucs)
2019-01-10: ARROW-3819: [Packaging] Update conda variant files to conform with feedstock after compiler migration (9d342e by kszucs)
2019-01-11: ARROW-4229: [Packaging] Set crossbow target explicitly to enable building arbitrary arrow repo (d7a683 by kszucs)
2019-01-12: ARROW-4238: [Packaging] Fix RC version conflict between crossbow and rake (38a628 by kszucs)
2019-01-12: ARROW-4237: [Packaging] Fix CMAKE_INSTALL_LIBDIR in release verification script (06de47 by kszucs)
2019-01-12: ARROW-4241: [Packaging] Disable crossbow conda OSX clang builds (9178ad by kszucs)
2019-01-12: ARROW-4243: [Python] Fix test failures with pandas 0.24.0rc1 (3e97ca by kszucs)
2019-01-14: ARROW-4256: [Release] Fix Windows verification script for 0.12 release (cf047f by wesm)
2019-01-15: [CI] Temporary fix for conda-forge migration (#3406) (143558 by kszucs)
2019-01-15: ARROW-4258: [Python] Safe cast fails from numpy float64 array with nans to integer (18c0e8 by kszucs)
2019-01-15: ARROW-4260: [Python] NumPy buffer protocol failure (09d349 by kszucs)
2019-01-15: ARROW-4246: [Plasma][Python][Follow-up] Ensure plasma::ObjectTableEntry always has the same size regardless of whether built with CUDA support (87ac6f by wesm)
2019-01-15: ARROW-4266: [Python][CI] Disable ORC tests in dask integration test (5a7507 by kszucs)
2019-01-16: ARROW-4270: [Packaging][Conda] Update xcode version and remove toolchain builds (a1a922 by kszucs)
2019-01-16: [Release] Update CHANGELOG.md for 0.12.0 (6c8c0c by kszucs)
2019-01-16: [Release] Update .deb/.rpm changelogs for 0.12.0 (db508e by kszucs)
2019-01-16: [Release] Update versions for 0.12.0 (6fcd91 by kszucs)
2019-01-16: [maven-release-plugin] prepare release apache-arrow-0.12.0 (8ca413 by kszucs)
2019-01-19: ARROW-4273: [Release] Fix verification script to use cf201901 conda-forge label (7a918b by kszucs)
2019-01-19: [Release] Build C++ unit tests in verify-release-candidate.bat (87abfe by wesm)
2019-01-19: ARROW-4254: [C++][Gandiva] Build with Boost from Ubuntu Trusty apt (2e8d38 by wesm)
2019-01-19: [Release] Update versions for 0.13.0-SNAPSHOT (e52c8f by kszucs)
2019-01-19: [Release] Update .deb package names for 0.13.0 (808178 by kszucs)
2019-01-19: [CI] Manually patch version in java/gandiva/pom.xml pending fix for ARROW-4301 (7489d3 by wesm)
2019-01-19: [maven-release-plugin] prepare for next development iteration (a486db by kszucs)
2019-01-20: ARROW-4123: [C++] Enable linting tools to be run on Windows (a665b8 by bkietz)
2019-01-20: ARROW-4252: [C++] Fix missing Status code and newline (9855e9 by fsaintjacques)
2019-01-20: [Docs] Minor fixes to documentation build instructions (349a95 by wesm)
2019-01-21: ARROW-4306: [Release] Update website, write blog post for 0.12.0 release (02864f by wesm)
2019-01-21: [Website] Add link to top-level documentation to nav dropdown (304224 by wesm)
2019-01-22: ARROW-4312: [C++] Only run 2 * os.cpu_count() clang-format instances at once (4d6d7d by bkietz)
2019-01-22: ARROW-4321: [CI] Setup conda-forge channel globally in docker containers (5a75bb by kszucs)
2019-01-22: ARROW-4307: [C++] Fix Doxygen warnings (688f1e by pitrou)
2019-01-23: ARROW-4281: [CI] Use Ubuntu Xenial VMs on Travis-CI (1b8a7b by pitrou)
2019-01-23: ARROW-4234: [C++] Improve memory bandwidth test (d5fe8e by fsaintjacques)
2019-01-23: ARROW-4323: [Packaging] Fix failing OSX clang conda forge builds (372137 by kszucs)
2019-01-23: ARROW-4031: [C++] Refactor bitmap building (d15cb4 by bkietz)
2019-01-24: ARROW-4346: [C++] Fix class-memaccess warning on gcc 8.x (9460bb by wesm)
2019-01-24: ARROW-4349: [C++] Add static linking option for benchmarks, fix Windows benchmark build failures (75c835 by wesm)
2019-01-25: ARROW-4262: [Website] Preview to Spark with Arrow and R improvements (ae4ed3 by javierluraschi)
2019-01-25: ARROW-4373: [Packaging] Travis fails to deploy conda packages on OSX (9a6480 by kszucs)
2019-01-26: ARROW-4364: [C++] Fix CHECKIN warnings (90eeb4 by fsaintjacques)
2019-01-26: ARROW-4381: [CI] Update linter container build instructions (5043d1 by wesm)
2019-01-26: ARROW-4351: [C++] Fix CMake errors when neither building shared libraries nor tests (59c69a by wesm)
2019-01-26: ARROW-4375: [CI] Sphinx dependencies were removed from docs conda environment (bcc100 by kszucs)
2019-01-26: ARROW-3367: [INTEGRATION] Port Spark integration test to the docker-compose setup (23475e by kszucs)
2019-01-27: ARROW-4330: [C++] More robust discovery of pthreads (b42786 by wesm)
2019-01-28: ARROW-4368: [Docs] Fix install document for Ubuntu 16.04 or earlier (3bb244 by kszucs)
2019-01-28: ARROW-4399: [C++] Do not use extern template class with NumericArray and NumericTensor (823dd4 by wesm)
2019-01-28: ARROW-4408: [CPP/Doc] Remove outdated Parquet documentation (93d101 by kszucs)
2019-01-28: ARROW-4401: [Python] Alpine dockerfile fails to build because pandas requires numpy as build dependency (1ba029 by kszucs)
2019-01-28: ARROW-4336: [C++] Change default build type to RELEASE (0ce39c by fsaintjacques)
2019-01-28: ARROW-3761: [R] Bindings for CompressedInputStream, CompressedOutputStream (f576c3 by romainfrancois)
2019-01-29: PARQUET-1508: [C++] Read ByteArray data directly into arrow::BinaryBuilder and BinaryDictionaryBuilder. Refactor encoders/decoders to use cleaner virtual interfaces (3d435e by wesm)
2019-01-29: ARROW-4417: [C++] Fix doxygen build (c4c108 by pitrou)
2019-01-30: ARROW-4414: [C++] Stop using cmake COMMAND_EXPAND_LISTS because it breaks package builds for older distros (6626c0 by kszucs)
2019-01-30: PARQUET-1519: [C++] Hide TypedColumnReader implementation behind virtual interfaces, remove use of "extern template class" (27f60b by wesm)
2019-01-30: ARROW-4407: [C++] Cache compiler for CMake external projects (ea6323 by fsaintjacques)
2019-01-30: ARROW-4268: [C++] Native C type TypeTraits (012f77 by fsaintjacques)
2019-01-31: ARROW-4198: [Gandiva] Added support to cast timestamp (641c69 by pitrou)
2019-01-31: ARROW-4431: [C++] Fixes for gRPC vendored builds (36e26f by wesm)
2019-01-31: ARROW-4430: [C++] Fix untested TypedByteBuffer::Append method (e59bf7 by bkietz)
2019-01-31: ARROW-3846: [Gandiva][C++] Build Gandiva C++ libraries and get unit tests passing on Windows (48de82 by wesm)