Ursa Labs March 2019 Report
The first quarter of 2019 has now wrapped up. In March we spent much of our time getting the Apache Arrow 0.13.0 release out the door. Below I mention a few development highlights from the month, followed by the full changelog of patches.
Development Highlights
We are continuing to set up our physical build and test cluster, which we’ll use to run integration tests, GPU-enabled builds, benchmark comparisons, and other automated tests to help with Arrow development.
Some highlights from our work in the Apache Arrow codebase:
- C++ CMake Revamp: we collaborated with Uwe Korn, Kouhei Sutou, and other members of the Arrow community on a major revamp of the CMake build system for C++, with associated improvements and fixes to the downstream packages.
- C++ expression algebra: we have begun prototyping an expression algebra to use for query engine development in C++. This work is loosely modeled after our prior work in the Ibis project.
- Arrow Flight: the C++ build dependencies (including gRPC) for the new Arrow Flight messaging and RPC framework are now available in conda-forge. We have also added a URI library for file paths to our build toolchain to help describe network locations and protocols for Flight. We have been working closely with Two Sigma and Dremio on this important effort.
- C++ Query Engine Discussions: we wrote a 10-page discussion document for the design of an embeddable Arrow-native analytical query engine in C++.
- R packaging support: we are working to get the Arrow R package submitted to CRAN for R users.
- Gandiva packaging: Gandiva (LLVM expression compiler) is now shipped in the Arrow 0.13.0 Python wheels.
We have much work ahead of us and look forward to seeing you on GitHub, JIRA, and the dev@arrow.apache.org developer mailing list.
Sponsor Acknowledgements
We are grateful for the support of our sponsors:
- RStudio
- NVIDIA AI Labs
- ODSC Conference
- Two Sigma Investments
If you or your company would be interested in sponsoring the work of Ursa Labs, please contact us at info@ursalabs.org.
Team Changelog
The team had 68 commits merged into Apache Arrow in March. You can click on the ASF JIRA links to learn more about the discussion on a particular issue or the commit hash to see each patch.
- 2019-03-01: ARROW-4297: [C++] Fix build error with MinGW-w64 32-bit (d4931c by javierluraschi)
- 2019-03-02: ARROW-4696: Better CUDA detection in release verification script (503502 by fsaintjacques)
- 2019-03-04: ARROW-4707: [C++] moving BitsetStack to BitUtil:: (4e8e07 by bkietz)
- 2019-03-04: ARROW-3123: [C++] Implement Count aggregate kernel (1b30ab by fsaintjacques)
- 2019-03-04: ARROW-4448: [Java][Flight] Disable flaky TestBackPressure (8724c1 by fsaintjacques)
- 2019-03-05: ARROW-3770: [C++] Validate schema for each table written with parquet::arrow::FileWriter (bf7ed4 by bkietz)
- 2019-03-05: ARROW-4766: [C++] Fix empty array cast segfault (f1bc19 by fsaintjacques)
- 2019-03-07: ARROW-3550: [C++] use kUnknownNullCount for the default null_count argument (09466c by bkietz)
- 2019-03-07: ARROW-4710: [C++][R] New linting script skip files with “cpp” extension (0249f1 by romainfrancois)
- 2019-03-08: ARROW-4782: [C++] Prototype array and scalar expression types to help with building a deferred compute graph (08ca13 by wesm)
- 2019-03-08: ARROW-4774: [C++] Fix FileWriter::WriteTable segfault (ef9938 by fsaintjacques)
- 2019-03-08: ARROW-4699: [C++] remove json chunker’s requirement of null terminated buffers (4afd2e by bkietz)
- 2019-03-11: ARROW-4790: [Python/Packaging] Update manylinux docker image in crossbow task (3db579 by kszucs)
- 2019-03-12: ARROW-4837: [C++] Support c++filt on a custom path in the run-test.sh script (0fb9e5 by kszucs)
- 2019-03-12: ARROW-4664: [C++] Do not execute expressions inside DCHECK macros in release builds (082aa4 by wesm)
- 2019-03-12: ARROW-4789: [C++] Deprecate and later remove arrow::io::ReadableFileInterface (7a539f by wesm)
- 2019-03-13: ARROW-4834: [R] Feature flag when building parquet (95d62c by javierluraschi)
- 2019-03-13: ARROW-1639: [Python] Serialize RangeIndex as metadata via Table.from_pandas instead of converting to a column of integers (86f480 by wesm)
- 2019-03-13: ARROW-4776: [C++] Add DictionaryBuilder constructor which takes a dictionary array (65d0e1 by bkietz)
- 2019-03-13: ARROW-4811: [C++] Fix misbehaving CMake dependency on flight_grpc_gen (e7713a by wesm)
- 2019-03-13: ARROW-4831: [C++] CMAKE_AR is not passed to ZSTD thirdparty dependency (0c4f85 by kszucs)
- 2019-03-13: ARROW-4850: [CI] Ensure integration_test.py returns non-zero on failures (4fefff by fsaintjacques)
- 2019-03-14: ARROW-3364: [Docs] Add docker-compose integration documentation (9198f6 by fsaintjacques)
- 2019-03-14: ARROW-4251: [C++][Release] Add option to set ARROW_BOOST_VENDORED environment variable in verify-release-candidate.sh (954e3f by wesm)
- 2019-03-14: ARROW-4866: [C++] Fix zstd_ep build for Debug, static CRT builds. Add separate CMake variable for propagating compiler toolchain to ExternalProjects (431fc1 by wesm)
- 2019-03-14: ARROW-4673: [C++] Implement Scalar::Equals and Datum::Equals (548e19 by fsaintjacques)
- 2019-03-15: ARROW-4878: [C++] Append to CONDA_PREFIX when using ARROW_DEPENDENCY_SOURCE=CONDA (d65503 by wesm)
- 2019-03-15: ARROW-4751: [C++] Add pkg-config to conda_env_cpp.yml now that it’s available on Windows (be8f94 by wesm)
- 2019-03-15: ARROW-4056: [C++] Unpin boost-cpp in conda_env_cpp.yml (b601a6 by wesm)
- 2019-03-15: ARROW-4873: [C++] Clarify documentation about how to use external ARROW_PACKAGE_PREFIX while also using CONDA dependency resolution (359ba4 by wesm)
- 2019-03-16: ARROW-4867: [Python] Respect ordering of columns argument passed to Table.from_pandas (76e8fe by wesm)
- 2019-03-17: ARROW-4339: [C++][Python] Developer documentation overhaul for 0.13 release (d94a9f by wesm)
- 2019-03-18: Fix markdown syntax in python’s and rust’s readme (#3964) (377104 by kszucs)
- 2019-03-18: ARROW-4855: [Packaging] Generate default package version based on cpp tags in crossbow.py (71c529 by kszucs)
- 2019-03-18: ARROW-4909: [CI] Use hadolint to lint Dockerfiles (410752 by kszucs)
- 2019-03-18: ARROW-3824: [R] Add basic build and test documentation (7a495e by fsaintjacques)
- 2019-03-19: ARROW-4961: [C++] Add documentation note that GTest_SOURCE=BUNDLED is currently required on Windows (8abed5 by wesm)
- 2019-03-19: ARROW-4640: [Python] Add docker-compose configuration to build and test the project without pandas installed (50bc9f by kszucs)
- 2019-03-19: ARROW-4413: [Python] Fix pa.hdfs.connect() on Python 2 (8281a5 by pitrou)
- 2019-03-19: ARROW-4869: [C++] Fix gmock usage in compute/kernels/util-internal-test.cc (8d5733 by bkietz)
- 2019-03-19: ARROW-4928: [Python] Fix Hypothesis test failures (ee59aa by pitrou)
- 2019-03-19: ARROW-4954: [Python] Fix test failure with Flight enabled (af8686 by pitrou)
- 2019-03-19: ARROW-4637: [Python] Conditionally import pandas symbols if they are used. Do not require pandas as a test dependency (286bf7 by wesm)
- 2019-03-20: ARROW-4697: [C++] Add URI parsing facility (ca2351 by pitrou)
- 2019-03-20: ARROW-4969: [C++] Set RPATH in correct order for test executables on OSX (bd00f8 by kszucs)
- 2019-03-20: ARROW-549: [C++] Add arrow::Concatenate function to combine multiple arrays into a single Array (43f2a3 by bkietz)
- 2019-03-20: ARROW-3208: [C++] Fix Cast dictionary to numeric segfault (37f898 by fsaintjacques)
- 2019-03-21: ARROW-4951: [C++] Turn off cpp benchmarks in cpp docker images (50e9f6 by kszucs)
- 2019-03-21: ARROW-4862: [C++] Fix gcc warnings in CHECKIN (ad1697 by fsaintjacques)
- 2019-03-21: ARROW-4881: [C++] remove references to ARROW_BUILD_TOOLCHAIN (3bf1e3 by bkietz)
- 2019-03-21: [Release] Apache Arrow JavaScript 0.4.1 (e9cf83 by kszucs)
- 2019-03-24: ARROW-4989: [C++] Find re2 on Ubuntu if asked to (3cd5df by pitrou)
- 2019-03-24: ARROW-4688: [C++][Parquet] Chunk binary column reads at 2^31 - 1 byte boundaries to avoid splitting chunk inside nested string cell (fc7d07 by wesm)
- 2019-03-24: ARROW-3843: [C++][Python] Allow a “degenerate” Parquet file with no columns (080c83 by wesm)
- 2019-03-25: ARROW-4250: [C++] adding explicit epsilon for ApproxEquals and corresponding assert macro (d0626c by bkietz)
- 2019-03-25: ARROW-5006: [R] parquet.cpp does not include enough Rcpp (537bfb by romainfrancois)
- 2019-03-25: ARROW-5003: [R] remove dependency on withr (0536ef by romainfrancois)
- 2019-03-26: ARROW-5012: [C++] Install testing headers (fd8887 by pitrou)
- 2019-03-26: ARROW-4872: [Python] Keep backward compatibility for ParquetDatasetPiece (f70dbd by wesm)
- 2019-03-26: ARROW-5011: [Release] Add support in source release script for custom git hash (52ca07 by fsaintjacques)
- 2019-03-26: ARROW-5010: [Release] Fix source release docker (eb8bc6 by fsaintjacques)
- 2019-03-26: ARROW-4995: [R] Support for winbuilder for CRAN checks (c43a7f by javierluraschi)
- 2019-03-26: ARROW-4645: [C++/Packaging] Ship Gandiva with OSX and Windows wheels (9c174f by kszucs)
- 2019-03-26: ARROW-4952: [C++] Floating-point comparisons should consider NaNs unequal (d2b5b3 by pitrou)
- 2019-03-27: ARROW-4646: [C++/Packaging] Ship gandiva with the conda-forge packages (4e3c73 by kszucs)
- 2019-03-28: ARROW-5041: [C++] add GTest_SOURCE=BUNDLED to verify-release-candidate.bat (b5f842 by bkietz)
- 2019-03-28: ARROW-5031: [Dev] Run CUDA Python tests in release verification script (4b325d by pitrou)
- 2019-03-28: ARROW-5029: [C++] Fix compilation warnings in release mode (97abab by pitrou)