Ursa Labs February 2019 Report
The team had a busy 28 days this February. The Apache Arrow community is discussing a 0.13 release toward the end of March, so we spent February helping the project toward the next release milestone. We have been pushing projects on multiple fronts and discuss some of those here.
The Apache Arrow project just had its 3rd birthday, and we are pleased to report that the community is thriving and growing fast after only a short time as a top-level project in The Apache Software Foundation. We’re really looking forward to what the next few years will bring as the Arrow columnar format and cross-language development platform becomes even more widely adopted.
Open Roles in the Team
We have open roles in the team for an Engineering Manager and Senior Software Engineer. Ursa Labs is a rare opportunity to spend 100% of your time on an ambitious and wide-reaching open source systems project. These are full-time remote roles, so if you live in NYC or the Bay Area and would like to move somewhere else (I think Nashville is pretty great) this could be your chance.
Sponsor Acknowledgements
We are grateful for the support of our sponsors:
- RStudio
- NVIDIA AI Lab
- ODSC Conference
- Two Sigma Investments
If you or your company would be interested in sponsoring the work of Ursa Labs, please contact us at info@ursalabs.org.
Development Highlights
One significant project not represented in the Apache Arrow open source project is setting up a physical build and test cluster for Ursa Labs. NVIDIA has provided us with two DGX stations and a Jetson TX2 (an AArch64-based computer). To this we have added a 2018 Mac Mini and will continue to add machines as needed. This build cluster will be used for nightly tests and packaging builds as well as performance benchmarking. The Arrow community has been discussing public daily performance benchmarking, and there is a new SQL schema for a proposed benchmark database.
In Apache Arrow, we have been working in several areas:
- Line-delimited JSON reader: an initial C++ implementation of reading JSON files to the Arrow columnar format. We have more work to do here, but this work will form the basis of utilizing directories of JSON files as a data source for in-memory query processing
- Arrow Flight, a new RPC / messaging system: we have been collaborating with Two Sigma, one of our gracious sponsors, and Dremio on the development of this new gRPC-based Arrow-native messaging framework. We believe this will form the backbone of future distributed systems powered by Apache Arrow.
- C++ Arrow Dataset Framework: we have proposed a general purpose C++ framework for interacting with large datasets stored in a number of different formats. This is an essential component for general-purpose in-memory query processing. This work will replace and generalize some of the pure Python code we already have for pyarrow.parquet.ParquetDataset
- Computational Kernels: to lay the foundations for an Arrow-native in-memory query engine, we have been implementing aggregation functions to enable parallel aggregation of Arrow datasets
- Gandiva testing and packaging support: we are working diligently to make it possible to ship Gandiva, our LLVM-based expression compiler (for projections and filters), in various package artifacts including Python wheels and conda packages
- User-defined Extension Types in C++: we have proposed an initial C++ API for defining custom data types in C++ (eventually for Python, too) that are backed by one of the built-in Arrow columnar data types
- LLVM 7 migration: we have upgraded the project, including Gandiva, to use the stable LLVM 7 version
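To make the "parallel aggregation" idea from the computational kernels item more concrete, here is a minimal pure-Python sketch of the general technique: each chunk of a column is reduced independently to a small partial state, and the partial states are then merged into a final result. This is not the Arrow C++ kernel API. The names `MeanState`, `consume`, and `parallel_mean` are hypothetical, and plain Python lists with `None` for nulls stand in for Arrow arrays.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Iterable, List, Optional, Sequence


@dataclass
class MeanState:
    """Hypothetical partial state for one chunk: sum and count of non-null values."""

    total: float = 0.0
    count: int = 0

    def merge(self, other: "MeanState") -> "MeanState":
        # Merging is associative, so chunks can be combined in any grouping.
        return MeanState(self.total + other.total, self.count + other.count)

    def finalize(self) -> Optional[float]:
        # An aggregate over zero non-null values is null (None here).
        return self.total / self.count if self.count else None


def consume(chunk: Sequence[Optional[float]]) -> MeanState:
    """Reduce a single chunk to its partial state, skipping nulls."""
    state = MeanState()
    for value in chunk:
        if value is not None:
            state.total += value
            state.count += 1
    return state


def parallel_mean(
    chunks: Iterable[Sequence[Optional[float]]], max_workers: int = 4
) -> Optional[float]:
    """Compute the mean of a chunked column by merging per-chunk states."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials: List[MeanState] = list(pool.map(consume, chunks))
    result = MeanState()
    for partial in partials:
        result = result.merge(partial)
    return result.finalize()


if __name__ == "__main__":
    column = [[1.0, 2.0, None], [3.0], [None, 6.0]]
    print(parallel_mean(column))
```

Keeping the per-chunk state separate from the final value is what allows the same code path to serve both a sum and a mean: a sum would simply return `total` in its finalize step.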
Upcoming Focus Areas
In March, one of our main priorities will be working with the Arrow community to get the 0.13 release out the door. We will be focusing on several areas that follow on from the above:
- Getting our build cluster up and running, to help make Arrow developers more productive, and helping set up automated daily performance benchmarks with the Arrow community
- Working toward getting initial Arrow Flight support into our packages (like conda packages and Python wheels)
- Continuing to develop and improve our computational kernels
- New data type additions to the Arrow format: the community is discussing a new-and-improved timedelta or interval type, as well as a “packed C struct” data type. We are interested in helping implement these new data types. See the Arrow developer mailing list for more
Team Changelog
The team had 68 commits merged into Apache Arrow in February 2019. You can click on the ASF JIRA links to learn more about the discussion on a particular issue or the commit hash to see each patch.
(Note: some of the patches from early in the month have a “February 8” commit date due to a rebase)
- 2019-02-08: ARROW-4431: [C++] Fixes for gRPC vendored builds (5d742d by wesm)
- 2019-02-08: ARROW-4446: [C++][Python] Run Gandiva C++ unit tests in Appveyor, get build and tests working in Python (2b9155 by wesm)
- 2019-02-08: ARROW-4500: [C++] Remove pthread / librt hacks causing linking issues in some Linux environments (6f60e3 by wesm)
- 2019-02-08: ARROW-3606: [Crossbow] Fix flake8 crossbow warnings (5ad1ed by wesm)
- 2019-02-08: ARROW-3422: [C++] Uniformly add ExternalProject builds to the “toolchain” target. Fix gRPC EP build on Linux (4c6e1c by wesm)
- 2019-02-08: ARROW-3903: [Python] Random array generator for Arrow conversion and Parquet testing (d06c66 by kszucs)
- 2019-02-08: ARROW-3972: [C++] Migrate to LLVM 7. Add option to disable using ld.gold (40cfbc by wesm)
- 2019-02-08: ARROW-4430: [C++] Fix untested TypedByteBuffer::Append method (4a80fd by bkietz)
- 2019-02-08: ARROW-4472: [Website][Python] Blog post about string memory use work in Arrow 0.12 (308c0d by wesm)
- 2019-02-08: PARQUET-1521: [C++] Use pure virtual interfaces for parquet::TypedColumnWriter, remove use of ‘extern template class’ (490c6b by wesm)
- 2019-02-08: ARROW-3239: [C++] Implement simple random array generation (fb23ed by fsaintjacques)
- 2019-02-08: ARROW-4440: [C++] Revert recent changes to flatbuffers EP causing flakiness (b117a8 by wesm)
- 2019-02-08: ARROW-4469: [CI] Pin conda-forge binutils version to 2.31 for now (04ad21 by wesm)
- 2019-02-08: ARROW-3846: [Gandiva][C++] Build Gandiva C++ libraries and get unit tests passing on Windows (5232a4 by wesm)
- 2019-02-08: [Website] Edits to Python string blog post (65f37f by wesm)
- 2019-02-09: ARROW-4124: [C++] Draft Aggregate and Sum kernels (9a10ba by fsaintjacques)
- 2019-02-11: ARROW-4499: [CI] Unpin flake8 in lint script, fix warnings in dev/ (9db7a6 by wesm)
- 2019-02-11: ARROW-4498: [Plasma] Fix building Plasma with CUDA enabled (18f9e6 by pitrou)
- 2019-02-11: ARROW-3631: [C#] Add Appveyor configuration (62dd09 by wesm)
- 2019-02-11: ARROW-4434: [Python] Allow creating trivial StructArray (6b78fb by pitrou)
- 2019-02-11: ARROW-331: [Doc] Add statement about Python 2.7 compatibility (4cf1c7 by pitrou)
- 2019-02-11: ARROW-4457: [Python] Allow creating Decimal array from Python ints (1d72a8 by pitrou)
- 2019-02-11: ARROW-4363: [CI] [C++] Add CMake format checks (fc7977 by pitrou)
- 2019-02-12: ARROW-4481: [Website] Remove generated specification docs from site after docs migration (b31845 by wesm)
- 2019-02-12: ARROW-4181: [Python] Fixes for Numpy struct array conversion (a5d8cc by pitrou)
- 2019-02-12: ARROW-3292: [C++] Test Flight RPC in Travis CI (af60c2 by pitrou)
- 2019-02-13: ARROW-47: [C++] Preliminary arrow::Scalar object model (d831e2 by wesm)
- 2019-02-13: ARROW-4558: [C++][Flight] Implement gRPC customizations without UB (69d595 by wesm)
- 2019-02-14: ARROW-4340: [C++][CI] Build IWYU for LLVM 7 in iwyu docker-compose job (2571b0 by fsaintjacques)
- 2019-02-14: ARROW-4563: [Python] Validate decimal128() precision input (b9819e by pitrou)
- 2019-02-14: ARROW-1896: [C++] Do not allocate memory inside CastKernel. Clean up template instantiation to not generate dead identity cast code (47ebb1 by wesm)
- 2019-02-15: ARROW-4529: [C++] Add test for BitUtil::RoundDown (10e894 by fsaintjacques)
- 2019-02-15: ARROW-4576: [Python] Fix error during benchmarks (bf138a by pitrou)
- 2019-02-15: ARROW-3669: [Python] Raise error on Numpy byte-swapped array (40b0c8 by pitrou)
- 2019-02-16: ARROW-4341: [C++] Refactor Primitive builders and BooleanBuilder to use TypedBufferBuilder (bbca71 by bkietz)
- 2019-02-18: ARROW-4546: Update LICENSE.txt with parquet-cpp licenses (d0d810 by fsaintjacques)
- 2019-02-18: ARROW-4531: [C++] Support slices for SumKernel (568004 by fsaintjacques)
- 2019-02-18: ARROW-4565: [R] Fix decimal record batches with no null values (aeb40e by javierluraschi)
- 2019-02-18: ARROW-4420: [INTEGRATION] Make spark integration test pass and test against spark’s master branch (76979c by kszucs)
- 2019-02-19: ARROW-4624: [C++] Fix building benchmarks (707bac by pitrou)
- 2019-02-19: ARROW-4347: [CI][Python] Also run Python builds when Java affected. (6fd507 by wesm)
- 2019-02-19: ARROW-4623: [R] update Rcpp version (bd5770 by romainfrancois)
- 2019-02-19: ARROW-4618: [Docker] Makefile to build dependent docker images (ef28f2 by kszucs)
- 2019-02-19: ARROW-4581: [C++] Do not require googletest_ep or gbenchmark_ep for library targets (09cfd4 by wesm)
- 2019-02-20: ARROW-4562: [C++] Avoid copies when serializing Flight data (6c4118 by pitrou)
- 2019-02-20: ARROW-694: [C++] Initial parser interface for reading JSON into RecordBatches (9c19bb by bkietz)
- 2019-02-20: ARROW-3532: [Python] Emit warning when looking up for duplicate struct or schema fields (d3c5b8 by pitrou)
- 2019-02-21: ARROW-4559: [Python] Allow Parquet files with special characters in their names (671140 by pitrou)
- 2019-02-21: ARROW-3981: [C++] Rename json.h (a97725 by pitrou)
- 2019-02-21: ARROW-4372: [C++] Embed precompiled bitcode in the gandiva library (e8cc48 by kszucs)
- 2019-02-21: ARROW-3985: [C++] Let ccache preserve comments (345b09 by pitrou)
- 2019-02-22: ARROW-4654: [C++] Explicit flight.cc source dependencies (25b566 by fsaintjacques)
- 2019-02-22: ARROW-4643: [C++] Force compiler diagnostic colors (981460 by fsaintjacques)
- 2019-02-23: ARROW-4659: [CI] ubuntu/debian nightlies fail because of missing gandiva files (48f7b3 by kszucs)
- 2019-02-25: ARROW-4638: [R] install instructions using brew (1b78eb by romainfrancois)
- 2019-02-25: ARROW-4192: [CI] Fix broken dev/run_docker_compose.sh script (a9a766 by fsaintjacques)
- 2019-02-25: ARROW-585: [C++] Experimental public API for user-defined extension types and arrays (a79cc8 by wesm)
- 2019-02-26: ARROW-4641: [C++][Flight] Suppress strict aliasing warnings from “unsafe” casts in client.cc (49f1ff by wesm)
- 2019-02-26: ARROW-4520: [C++] use voidified expr to ignore DCHECK() custom messages in NDEBUG (37d9d3 by bkietz)
- 2019-02-26: ARROW-3816: [R] nrow.RecordBatch method (41fc38 by romainfrancois)
- 2019-02-27: ARROW-2392: [C++] Check schema compatibility when writing a RecordBatch (4a084b by pitrou)
- 2019-02-27: ARROW-4672: [CI] Fix clang-7 build entry (e648a7 by fsaintjacques)
- 2019-02-27: ARROW-4657: Don’t build benchmarks in release verify script (b3f3db by fsaintjacques)
- 2019-02-27: ARROW-3361: [R] Also run cpplint on Rcpp source files (d092dd by wesm)
- 2019-02-27: ARROW-4560: [R] array() needs to take single input, not … (2a14c7 by romainfrancois)
- 2019-02-27: ARROW-2627: [Python] Add option to pass memory_map argument to ParquetDataset (f2fb02 by wesm)
- 2019-02-27: ARROW-3121: [C++] Mean aggregate kernel (29aa92 by fsaintjacques)
- 2019-02-28: ARROW-4687: [Python] Stop Flight server on incoming signals (05ce0a by pitrou)