Announcing Ursa Labs: an innovation lab for open source data science

work

ursa labs

apache arrow

Author

Wes McKinney

Published

April 19, 2018

Funding open source software development is a complicated subject. I’m excited to announce that I’ve founded Ursa Labs (https://ursalabs.org), an independent development lab with the mission of innovation in data science tooling.

I am initially partnering with RStudio and Two Sigma to assist me in growing and maintaining the lab’s operations, and to align engineering efforts on creating interoperable, cross-language computational systems for data science, all powered by Apache Arrow.

In this post, I explain the rationale for forming Ursa Labs and what to expect in the future.

Funding Open Source Software: Maintenance and Innovation

In recent years, the world’s businesses have become more dependent than ever on open source software (“OSS”, henceforth). How and why this happened will surely be the subject of future books and research, but at present we are faced with existential challenges as we endeavor to keep making open source “work” for everyone.

In my experience, open source projects feature two dominant archetypal modes: innovation and maintenance. The innovation stage often occurs at the beginning of projects: there are few users and the software changes or evolves rapidly. When a project becomes successful, it can become more conservative. Development shifts to stability, bug fixes, and gradual change and growth. There are many more users, and changes, especially “breaking” ones, can have a high cost to the project’s reputation and future. OSS maintainers, which are often volunteers, routinely “burn out” under the strain of supporting burgeoning user bases who sometimes take a project’s existence and maintenance for granted.

Supporting Maintenance

Some OSS projects become so important that their users consider them to be mission-critical infrastructure software, like Linux or security libraries like OpenSSL. The consequences of under-maintained infrastructure have been studied extensively in recent years in the wake of shocking security vulnerabilities exposed in widely used projects.

Funding OSS maintenance, while challenging, has a clear value-proposition to the world’s organizations, who increasingly view their dependence on OSS as a liability. Companies like RedHat have built their businesses on providing peace-of-mind around mission-critical OSS like Linux.

We are starting to see new business models emerge for funding OSS maintenance, such as Tidelift, which has begun selling a type of “insurance policy” for the package dependency graph of mission-critical OSS frameworks like React and AngularJS. The understanding is that funds from these insurance policies will be paid to the maintainers of projects in the dependency graph to provide timely bug fixes and support the healthy operation of the top-level projects.

Supporting Innovation

Funding the innovation stage of OSS is can be more difficult because of the heightened risk profile. A new project may or may not become successful or widely-used.

Most people know that I struggled for many years to obtain support for developing pandas; in the end I convinced Adam Klein and Chang She to take time away from their well-paying New York finance day jobs (at Goldman Sachs and Citigroup, respectively) to work on the project with me in 2012. I estimate between the three of us pandas cost at least $500,000 in opportunity cost as we did not earn wages during the thousands of hours we invested in the project in 2011 and 2012. If we had refused to build pandas unless we raised enough money to pay for our rent and families’ cost of living, the project likely would not be what it is today.

Open source data science software has become incredibly important to how the world analyzes data and builds production machine learning and AI models. In Google, Facebook, and other industry research labs, Python has become the primary machine learning user interface. If you had told me this in 2008 when I started building pandas, I might not have believed you.

The risks to not funding innovation in OSS for data science are many. The ones I think most about are:

Data scientists’ productivity will suffer, especially as data sizes continue to grow.
Computing costs will remain high as less efficient computing tools are applied as well as possible to process the world’s data.
Organizations continue to rely on less flexible, more expensive proprietary software because they perceive OSS as inadequate.

Traps, and avoiding them

OSS developers have employed various strategies to support their work in lieu of direct funding. Sometimes they work, and sometimes they can be “traps”. I have directly experienced some variant of all of these problems.

The Consulting Trap: project creators hustle for services contracts with users of their software. The contract dealmaking hustle distracts from development, and the services work itself fragments attention from the core development of projects.
The Startup Trap: startups build businesses that monetize the growing use of one or more open source projects. While some of these businesses have succeeded, the creators of open source projects generally must divide their attention between building a business and building a software project. This is obviously a tradeoff: with venture capital and revenue, one can build a larger engineering team. But, startups can have governance conflicts with their user and developer communities. Businesses with hybrid open-source models sometimes must short-change OSS work in favor of work that will grow revenue; the company founders’ desire to invest in OSS may come into conflict with the expectations of the board of directors, who are usually venture capital investors. In some cases, unfortunately, OSS developers are laid off to cut costs.
The Corporate User Trap: a large company that depends on OSS hires or grows developers of those projects to innovate and maintain them going forward. In some cases a company may start a closed-source project, then open source it later. There are many possible problems that arise with this model. Developers may leave a company and fail to find another that will support their work on a project. A company may lose interest in a project and assign the developers to a different project. In some cases, a company will build the new features they need and then “disappear” as they have gotten what they need out of the project. A developer’s ability to grow a larger development team may be limited by budgeting concerns that are out of their control.

2013 to now: DataPad, Cloudera, Two Sigma, and Apache Arrow

Hot on the heels of getting pandas off the ground and publishing my book Python for Data Analysis in 2012, Chang She and I founded DataPad, a venture-funded startup, with the objective of building a data product and later investing R&D budget back into the Python ecosystem. We handed off day-to-day maintenance of pandas to Jeff Reback, Phillip Cloud, and others, who’ve done an amazing job growing the project over the last 5 years.

By mid-2014, at DataPad we found ourselves working on complex systems engineering problems in enterprise analytics that would be more effectively solved in a larger enterprise software company. After the experience of building out pandas and developing the DataPad product, I had accumulated a list of complaints and grievances against pandas’s computational foundations that I summarized infamously in my talk 10 Things I Hate about pandas. In September 2014, the DataPad team and I joined Cloudera to work on these problems and more.

When I arrived at Cloudera, one of my objectives was to form alliances with the big data and analytic database communities to collaborate to solve shared data systems problems for the benefit of the data science world. The two major artifacts of my time at Cloudera were Ibis, a lazy computational expression framework geared toward SQL-style execution engines, and Apache Arrow, a cross-language in-memory data frame format and analytics development platform.

By mid-2016, facing a competitive big data infrastructure market and an arduous path to profitability, Cloudera was not well-positioned to build a team to join me in developing Apache Arrow and improve computational systems for data science. While there was some obvious low-hanging fruit to accelerate Python-on-Spark, overall ROI from investing in Arrow was likely to be several years away and thus was deemed too risky to justify a large budget allocation.

Around this time, I was lucky to connect with Two Sigma, a financial technology and investment management company with a growing OSS development practice and a petascale data warehouse being actively used with Apache Spark and the Python data science stack. I joined Two Sigma in 2016 as a software architect in the analysis tools group, with a plan to make a forward-looking long-term investment in performance and scalability for the Python data stack via the Apache Arrow project. Working with the Two Sigma engineering team, we have helped reach some major Arrow-related milestones. The project has made 11 releases, grown over 130 contributors, and established exciting collaborations with Apache Spark (accelerated data access and Python function execution in Apache Spark), Berkeley RISELab (Fast Python Serialization with Ray and Apache Arrow), and the GPGPU community.

As Apache Arrow has gotten off the ground over the last few years, it has become apparent that the problems we are tackling are much larger in scope than the interests of a single organization or even programming language. As I have argued extensively in talks over the last few years (Data Science Without Borders at JupyterCon, Memory Interoperability for Analytics and Machine Learning at Stanford’s ScaledML, Raising the Tides: Open Source Analytics for Data Science at the Newsweek AI and Data Science Conference), we are solving the same kinds of problems across Python, R, and other languages, and Arrow provides a unifying technology for creating shared computational infrastructure for data science.

After many years collaborating with and learning from the Python, R, JVM, Julia, and other data science communities, I have become convinced that the data science world would benefit from shared computational libraries. I envision a portable, community-standard “data science runtime” that can be utilized for processing native Arrow-based data frames in just about any programming language. This is a huge project. Some of the major areas of work for this include:

Portable C++ shared libraries with bindings for each host language (Python, R, Ruby, etc.)
Portable, multithreaded Apache Arrow-based execution engine for efficient evaluation of lazy data frame expressions created in the host language.
Reusable operator “kernel” containing functions utilizing Arrow format as input and output. This includes pandas-style array functions as well as SQL-style relational operations (joins, aggregations, etc.)
Compilation of operator “subgraphs” using LLVM; optimization of common operator patterns.
Support for user-defined operators and function kernels.
Comprehensive interoperability with existing data representations (e.g. data frames in R, pandas / NumPy in Python).
New front end interfaces for host languages (e.g. dplyr and other “tidy” front ends for R, evolution of pandas for Python)

Enter the Dragon Bear

In light of my experiences building data science software over the last 10 years, I believe the way that I can best serve the open source data science world is by creating an independent organization, Ursa Labs, dedicated to advancing cross-language computational systems for data science. The immediate purpose of this organization is to hire and support developers of data science systems that are part of the burgeoning Apache Arrow ecosystem. The lab will partner with larger organizations to be supported through direct funding and engineering collaborations.

While I am primarily looking for direct funding relationships with companies to grow my development team, I will also be accepting smaller direct donations to the lab which can hopefully support additional developer headcount in time.

Partnering with RStudio

RStudio will be helping me with the administrative side of operating Ursa Labs (HR, benefits, finances, etc., which amounts to a lot of hard work.) I will manage the money raised by the lab, which will primarily be used to pay for salary and benefits for full-time engineers on the Ursa Labs team. The Ursa team and I will operate as a functionally independent engineering group within the RStudio organization and collaborate with other members of RStudio on R-related development work. While it might seem strange to some that I, a long-time Python developer, would be partnering with a company that builds software for R programmers, it actually makes perfect sense.

In 2016, Hadley Wickham and I had a brief collaboration to create the Feather file format, an Arrow-based interoperable binary file format for data frames that can be used from Python and R. The idea of Feather was to socialize the idea of interoperable data technology using Apache Arrow. Many people were surprised to see Hadley and I working together when Python and R are “supposed” to be enemies. The reality is that Hadley and I think the “language wars” are stupid when the real problem we are solving is human user interface design for data analysis. The programming languages are our medium for crafting accessible and productive tools. It has long been a frustration of mine that it isn’t easier to share code and systems between R and Python. This is part of why working on Arrow has been so important for me; it provides a path to sharing of systems code outside of Python by enabling free interoperability at the data level.

R, like Python, faces systems-level problems around fast and scalable in-memory data processing. Since the problems we are solving are so structurally similar, we have long believed that a more extensive collaboration between the communities should happen. It is my goal for the software that I am building to work equally well for R programmers as for Python programmers. As part of the collaboration with RStudio, Hadley Wickham will act as a technical adviser to the work to ensure that we are looking after the needs of R users. We’re all very excited about this.

In the last several years, I have been extremely impressed with the RStudio organization and its founder and CEO, J.J. Allaire. As he, Hadley, and I have gotten to know each other at data science events, I found that we share a passion for the long-term vision of empowering data scientists and building a positive relationship with the open source user community. Critically, RStudio has avoided the “startup trap” and managed to build a sustainable business while still investing the vast majority of its engineering resources in open source development. Nearly 9 years have passed since J.J. started building the RStudio IDE, but in many ways he and Hadley and others feel like they are just getting started.

Partnering with Two Sigma

During my time at Two Sigma, I worked towards a shared vision of data science tools with Matt Greenwood, who heads the company’s Modeling Engineering organization and David Palaitis who manages Two Sigma’s open source efforts. After almost two years, we realized the problems I’m trying to solve with Apache Arrow are bigger than any one company can support. Eventually a project’s scope and needs expand beyond the interests and capabilities of any single organization.

Having access to real problems in data science at massive scale at Two Sigma has informed and validated my vision for Ursa, and my departure to start Ursa Labs does not mean a break on my relationship with the company. By partnering with them as I begin my next venture, I can keep the feedback loop open as they work as early adopters of Arrow software. Two Sigma’s interest in Arrow is part of a larger commitment to creating a productive future for data science, including commitment to communities built around Pandas, Ibis, Jupyter, Spark, Mesos and Tensorflow, among others.

Two Sigma will contribute to Ursa Labs through employee contributions to Ursa Labs projects like Arrow and funding external open sources devs as needed.They will also collaborate on technical advising and rallying support in the community. Matt can help generate the new initiative through his seats on the board of NumFOCUS and TS Ventures. I’ll continue my work with Two Sigma core OSS engineers like Jeff Reback (pandas) and Phillip Cloud (Ibis), and you can look forward to joint talks, such as an upcoming presentation with Jeff Reback at PyData.

Getting involved

We are only now at the beginning of a long journey ahead of us to advance the state of the art in data science tools.

We will soon be posting some full-time engineering positions, so if you are a software engineer and are interested in joining the lab’s mission, please stay tuned. In the meantime, we’d love to have you involved with Apache Arrow.

If you are with an organization in a position to sponsor our work or partner with us in some other way, please reach out to info@ursalabs.org.