Feather: it's about metadata


Summary: Feather's good performance is a side effect of its design, but the primary goal of the project is to have a common memory layout (Apache Arrow) and metadata (type information) for use in multiple programming languages.
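For instance, a table written with Feather from Python can be read back in R (and vice versa) because the file carries both the Arrow columnar layout and the per-column type metadata. A minimal sketch, assuming pyarrow is installed and using made-up column names:

    import pandas as pd
    import pyarrow.feather as feather

    df = pd.DataFrame({'city': ['NYC', 'SF'],
                       'population': [8_500_000, 880_000]})

    # The Feather file stores Arrow columnar data plus per-column type
    # metadata, so the column types survive the round trip unchanged
    feather.write_feather(df, 'cities.feather')
    round_tripped = feather.read_feather('cities.feather')

    # The same file can be opened from R, e.g. with arrow::read_feather()
    print(round_tripped.dtypes)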




conda-forge and PyData's CentOS moment


Summary: It's finally time we worked as a community to create a reliable, community-governed repository of trusted Python binary package artifacts, just as Linux, R, Java, and many other open source ecosystems have already done. Enterprise-friendly platform distributions still play an important role, though, and I examine the nuances involved. I also discuss the new conda-forge project, which may offer a way forward.


On Software Demos and Potemkin Villages


Summary: It's much easier to create impressive demos than it is to create dependable, functionally comprehensive production software. I discuss my thoughts on this topic.


Avoid unsigned integers in C++ if you can


Unsigned integers (size_t, uint32_t, and friends) can be hazardous, as signed-to-unsigned integer conversions can happen without so much as a compiler warning.


Compiling DataFrame code is harder than it looks


Many people have asked me about the proliferation of DataFrame APIs like Spark DataFrames, Ibis, Blaze, and others.

As it turns out, executing pandas-like code in a scalable environment is a difficult compiler engineering problem: composable, imperative Python or R code has to be translated into a SQL or Spark/MapReduce-like representation. I show an example of what I mean, along with some of the work I've done to create a better "pandas compiler" with Ibis.
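To make the gap concrete, here is a minimal sketch (plain pandas, hypothetical column names) of the kind of imperative, step-by-step code a "pandas compiler" has to recognize and collapse into a single declarative query:

    import pandas as pd

    df = pd.DataFrame({
        'customer': ['a', 'a', 'b', 'b', 'c'],
        'amount':   [10.0, 5.0, 3.0, 7.0, 1.0],
    })

    # Imperative style: mutate a column, filter with a boolean mask, aggregate
    df['amount_usd'] = df['amount'] * 1.1
    filtered = df[df['amount_usd'] > 5]
    result = filtered.groupby('customer', as_index=False)['amount_usd'].sum()

    # A SQL engine wants the same logic as one declarative statement, roughly:
    #
    #   SELECT customer, SUM(amount * 1.1) AS amount_usd
    #   FROM t
    #   WHERE amount * 1.1 > 5
    #   GROUP BY customer
    #
    # Recovering that single expression from the mutation and masking steps
    # above is the compiler engineering problem described here.
    print(result)

Ibis sidesteps part of this by having the user build a deferred expression tree, which can then be compiled to SQL or another backend in one shot.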


Do average consumers still need Dropbox?


TL;DR: At the risk of stating the obvious, manually managing files on disks in 2016 is increasingly old-fashioned and largely unnecessary, especially among the non-technorati. Encapsulated, managed cloud services and consumer web applications have made it anachronistic for most normal people. Whether this is a good thing can be debated, but it is happening nonetheless.

In this post, I explore this topic in some detail as it relates to my personal experience.


Why pandas users should be excited about Apache Arrow


I'm super excited to be involved in the new open source Apache Arrow community initiative. For Python (and R, too!), it will help enable:

  • Substantially improved data access speeds
  • Python extensions for big data systems like Apache Spark with closer-to-native performance
  • New in-memory analytics functionality for nested / JSON-like data
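
As a small taste, here is a minimal sketch (assuming pyarrow is installed; the column names are made up) of moving a pandas DataFrame into Arrow's columnar memory and back:

    import pandas as pd
    import pyarrow as pa

    df = pd.DataFrame({'ints': [1, 2, 3],
                       'strs': ['a', 'b', 'c']})

    # Convert the DataFrame into an Arrow Table: columnar memory with an
    # explicit, language-independent schema (type metadata) for each column
    table = pa.Table.from_pandas(df)
    print(table.schema)

    # Convert back to pandas; other Arrow-aware systems can consume the same
    # columnar memory without a bespoke serialization step
    df_roundtrip = table.to_pandas()

Because the schema travels with the data, the same columnar memory can be shared with other Arrow-speaking systems rather than re-serialized for each one.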

There are plenty of places where you can learn more about Arrow, but this post is about how it's specifically relevant to pandas users. See, for example: