Summary: It's finally time we worked as a community to create a reliable, community-governed repository of trusted Python binary package artifacts, just like Linux, R, Java, and many other open source tool ecosystems have already done. Enterprise-friendly platform distributions do play an important role, though. I examine the various nuances within. I also talk about the new conda-forge project which may offer the way forward.
Python environment management hell: a personal story
When I needed to get Python code using "primordial pandas" into production at AQR in 2008, the hardest part by far was the "installation problem". At that time we didn't have a centralized cluster framework for running all production jobs; many processes where Python needed to run were on Windows desktops sitting under people's desks. Later, in 2009, we bought a rack of machines and we wrote a distributed task queue and scheduler (similar to Celery) that ran on these systems, but they were still Windows.
I didn't want my efforts to use more Python to be stymied, so I engaged
in a stressful tango with our IT staff to get consistent Python environments
with NumPy, SciPy, matplotlib, and other heavy packages deployed on dozens of
Windows desktops. Part of the difficulty was Windows, but the bigger issue was
that installing a "blessed" Python environment was not as simple as "hey, run
this one weird batch script". It required a lot of typing, clicking and
python.exe (and a large helping of tears and
R, where I had also done a lot of work, comparatively had its act completely
together. Install R, then run
install.packages(c(...)) and you were
done. Like magic. By comparison, especially on Windows, installing anything
containing C extensions or depending on 3rd-party libraries using
Binary package installers: apt, yum, conda, enstaller, brew, and friends
One of the ways to solve the packaging hell is to have a tool that can analyze package dependency graphs and download and install the appropriate binary artifacts from one or more trusted channels. Strong emphasis here must be put on the word trusted.
Installing any software, whether open source or proprietary, is a risky proposition because you are purposefully executing code written by someone else. You may be giving that software access to your networks, data, secrets (security credentials, SSH keys), or even existing production processes. If you are a business working with sensitive data, it is justifiable to be extremely paranoid about the provenance of any x86 instruction executed on hardware under your control. The most sensitive of businesses (e.g. 3-letter US government agencies) may use air-gapping to protect against hackers or malicious code run from within their walls.
The way around the trust issue is to use a packaging tool, like
along with a trusted binary artifact repository. Often the binary artifacts are
provided by a company you trust not to modify the code maliciously
(e.g. inserting telemetry or malware code) in binary builds of otherwise open
source software. The packaging tool verifies an MD5 or SHA1 hash of the
artifacts to make sure that a man-in-the-middle has not tampered with the
compiled code inside. Such tampering is not that unrealistic: it recently
happened with Transmission.
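To make the hash check concrete, here is a minimal sketch of what a packaging tool does after downloading an artifact. The function names `file_digest` and `verify_artifact` are hypothetical and are not taken from any real package manager. I default to SHA-256 rather than the MD5 or SHA1 mentioned above, which is my own assumption; the algorithm is a parameter, so either of the others can be passed instead.

```python
import hashlib
import hmac


def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    """Return the hex digest of the file at `path`, reading it in chunks
    so that large artifacts do not have to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected_hex, algorithm="sha256"):
    """Return True if the downloaded artifact matches the published digest.

    `expected_hex` is assumed to come from the trusted repository's
    metadata, not from the same place the artifact was downloaded from.
    """
    actual = file_digest(path, algorithm)
    # compare_digest does a constant-time comparison of the two strings.
    return hmac.compare_digest(actual, expected_hex.lower())
```

A check like this is only as good as the channel that delivers the expected digest: if an attacker can replace both the artifact and its published hash, the comparison still passes. That is why the repository itself has to be trusted.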
Platform distributions: making open source work for the enterprise
As soon as a collection of related open source projects becomes viable as a solution to a major business problem (obvious examples: Hadoop, Linux, R, Python, Kafka, etc.), a common business strategy is to create a platform distribution, a big bundle of code, with the intent of making open source software more palatable to large companies.
The notion of a platform distribution is appealing to enterprise users for many reasons. The distribution provider is handling a bunch of annoying problems for you:
- Assembling components and all of the correct versions which are known to work well together.
- Packaging components together and making them easy to install.
- Compiling binaries for multiple platforms and running the test suites for individual components to verify a valid build.
- Performing integration testing to verify that the components work well together.
- Providing tools for upgrading components over time.
Usually, the distribution gets its own umbrella version number, like "Red Hat Enterprise Linux 5", to indicate the "blessed" collection of software.
Making money from 100% open-source platforms is very difficult. One of the more successful models used is known as "open core" or (increasingly) "hybrid open source", where anyone (individuals or businesses) can download and use the open source components for free, but you can buy services, support, indemnity, and valuable add-on proprietary software from the platform vendor.
One of the most important aspects of paid support for open source is having priority bug fixes and patched builds when something goes wrong in production. All software has bugs, and given the inherently anarchic nature of open source software, paying for peace of mind is something many big companies are willing to do.
The importance of community-governed package channels
In the late 1990s and early 2000s, there were many efforts to create community-led Linux distributions. Red Hat was founded in 1993, and as Red Hat and other enterprise vendors worked to commercialize open source Linux in the enterprise, I suspect the push for community-governed distributions became all the stronger. I won't present a revisionist history for what motivated the creators of Debian, CentOS, and others, but the commercialization of Linux likely played some significant role.
In Linux, like other open source ecosystems, one of the most important components outside of the Linux kernel itself is the package repository. From a minimal kernel installation with networking and a package manager, you can install a complete system. Thus, the stewardship of the source and binary packages is extremely important, including:
- Governance: in general, no single commercial entity can decide what packages can be installed or not installed
- Quality standards: Packages are deemed of acceptable quality and suitable in a production environment
- Build verification: a binary's build has been tested as appropriate for that package
- Trusted distribution: Packages are signed so that package managers and users can trust the provenance of a binary build
- Dependency management: Installing a package also installs its dependencies, which have been similarly verified and known to work together
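The dependency-management point above can be sketched in a few lines: given metadata saying which packages depend on which, a package manager collects the transitive dependencies of what you asked for and installs them in an order where every dependency comes first. This is only an illustration under stated assumptions. The `DEPENDENCIES` table and the `install_order` function are hypothetical, and a real tool also has to solve version constraints, which this sketch ignores. It uses `graphlib` from the Python standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical metadata for illustration: package -> packages it depends on.
DEPENDENCIES = {
    "pandas": {"numpy", "python-dateutil"},
    "scipy": {"numpy"},
    "numpy": set(),
    "python-dateutil": {"six"},
    "six": set(),
}


def install_order(requested, dependencies):
    """Return the requested packages plus all transitive dependencies,
    ordered so each package appears after everything it depends on."""
    needed = {}
    stack = list(requested)
    while stack:
        name = stack.pop()
        if name in needed:
            continue
        # A KeyError here means the package is unknown to the repository.
        needed[name] = dependencies[name]
        stack.extend(dependencies[name])
    # static_order() yields dependencies before their dependents and
    # raises graphlib.CycleError if the graph contains a cycle.
    return list(TopologicalSorter(needed).static_order())
```

Calling `install_order(["pandas"], DEPENDENCIES)` returns a list in which `numpy`, `six`, and `python-dateutil` all precede `pandas`, while `scipy` is left out because nothing requested depends on it.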
Each distribution may have different goals. CentOS, for example, aims for
compatibility with Red Hat Enterprise Linux and accordingly uses the
package management tool. Debian and Ubuntu, by contrast, don't target
compatibility with any enterprise distributions, but provide multiple flavors
focusing respectively on long-term stability versus bleeding-edge innovation.
Community-led packaging and distribution is not unique to Linux: R, for example, has CRAN and the CRAN submission policy. It also has R-Forge to provide a community-governed service for posting project builds.
Python: Enterprise distributions and
Anaconda is a freely-available Python platform distribution created by the venture-backed start-up Continuum Analytics (folks I know quite well!), which has grown extremely popular in recent years. It plays a role similar to that of any other enterprise platform distribution based on open source software, as Red Hat (Linux), Revolution Analytics (R), and Cloudera (Hadoop; disclaimer: this is where I work) have done.
Anaconda is not the only Python platform distribution, nor the first. Canopy (formerly EPD: Enthought Python Distribution) preceded it, and many of the same people have worked on both projects.
One of the sources of Anaconda's success is that it makes cross-platform Python environment management much easier than it used to be, and Continuum provides trusted builds of multiple Python versions and all of the Python and non-Python library dependencies needed to assemble a complete Python data analysis environment.
At the heart of Anaconda is a new packaging tool,
conda. I won't go into the
technical or open-source-politics reasons why Continuum created a new binary
packaging tool for installation and dependency management for Anaconda. The
bottom line is that it:
- Provides an alternative to the pip / distribute / setuptools / virtualenv stack: use one command-line tool for everything and in general it just works.
- Works well for managing both Python and non-Python (e.g. C/C++) library dependencies
- Works consistently on all major platforms (Windows, OS X, Linux)
- Is freely available (the conda tool)
In practice, conda works extremely well. It's the packaging tool I wish existed in 2008, and I'm glad we got it eventually!
The downside of the Anaconda distribution itself is that ultimately, just like other enterprise platform companies, Continuum will inevitably be faced with decisions that pit the needs of enterprise customers (valuing stability and long-term support) against those of the community (valuing innovation and community governance). The most obvious source of conflict would be getting new packages included in Anaconda, which requires a lot of work (from Continuum employees) to build, test, and package. Another would be providing updated builds for projects that only matter to a small but passionate subset of users (who may not be paying customers).
Community governance of the code is typically handled through organizations like the Apache Software Foundation. This is a whole different beast of a problem, and I'll write about my thoughts on open source project governance some other time.
conda-forge: Community-led packaging using conda
The point of this is not to say "Anaconda is bad for the Python
community". Quite the contrary: Anaconda has been and continues to be good for
the community, and
conda is an excellent packaging tool. Bigger picture:
acceptance of the Python data stack in the enterprise is existentially
important for the continued success and growth of the ecosystem. Just like
Linux needed RHEL and Hadoop needed CDH, PyData needed Anaconda to get where it
has gotten now. High quality projects like pandas and scikit-learn were not
enough by themselves.
Having a community-governed package channel for conda and a community process for submitting, verifying, and storing signed project releases would be ideal. Additionally, there would need to be shared build and continuous integration infrastructure so that we aren't all having to install Visual Studio on our desktops to be able to create reliable Windows builds.
Some will note that there is anaconda.org, a product created by Continuum for free and paid use (for private builds). The problem with anaconda.org is that it is mainly a place to put binary artifacts.
Given all this, I was incredibly excited when I learned about conda-forge. If we all as a community can throw our weight behind this effort, I believe we can achieve:
- A community-led process for posting trusted binary builds of projects, as CRAN has provided for R for many years.
- Easier integration testing amongst groups of projects, especially on Windows
- Someday, a community-governed Anaconda-like Python distribution
It's complicated, though. Doing this well will require a lot of money and people's time. The R community has sustained itself over the years through the support of academic institutions, which the Python data stack has less of. NumFOCUS may be able to provide a fiscal conduit for tax-deductible support of conda-forge.
I've long been envious of the community package management infrastructure that the R community has. But many enterprises prefer to use "blessed" platform distributions; R, for example, has Revolution R (now Microsoft R Open). So we need both, and I encourage readers to help the nascent community effort (i.e. conda-forge) grow where possible, through either development or money.