About

The views and opinions on this website are my opinions and do not reflect the views of my current or former employers.

Short biography

Since 2007, I have been creating fast, easy-to-use data wrangling and statistical computing tools, mostly in the Python programming language. I am best known for creating the pandas project and writing the book Python for Data Analysis. I am also a contributor to the Arrow, Kudu (incubating), and Parquet projects within the Apache Software Foundation. I was the co-founder and CEO of DataPad. I later spent a couple of years leading efforts to bring Python and Hadoop together at Cloudera. I'm now working for Two Sigma in New York.

Open source projects

pandas (website): Python in-memory data wrangling, preparation, and analytics

  • I created pandas and am its Benevolent Dictator for Life.

Ibis (blog, code)

  • I created Ibis at Cloudera.

Feather: a language-agnostic data frame file format

  • Hadley Wickham and I designed Feather in January 2016 and released it in March.

Apache Arrow

  • I am a committer on Apache Arrow, focusing on the C++ and Python implementations.

Apache Kudu (incubating)

  • I originally created the Python interface to Kudu, using Cython to wrap the C++ API.

Apache Parquet

  • I am a committer and member of the PMC for Apache Parquet. I have been focusing on the C++ implementation.

statsmodels (website, code)

  • I worked on time series models (e.g. VAR) and pandas integration.

Long form biography

I'm an American computer programmer working for Two Sigma in New York. I studied theoretical mathematics at MIT (graduating in late 2006) before becoming very interested in programming and tools for data analysis, especially for industry use cases, in 2007.

From August 2007 to July 2010, I worked on the front office quant research team at AQR Capital Management, a large quantitative investment manager in Greenwich, CT. During this time, I led a very successful effort to migrate research and production model-building processes to the Python programming language. I started building pandas on April 6, 2008, as part of a skunkworks effort to reproduce some econometric research in Python. As part of this work, we formed a new Research Development team for the global macro group to drive software innovation in the front office.

I joined the PhD program in the Statistical Science Department at Duke University before taking leave in summer 2011 to explore ways to develop open source software (such as pandas) in a sustainable way. I found that entrepreneurship often offers more leverage than consulting as a way to fund open source.

From November 2011 through August 2012, I wrote Python for Data Analysis.

In January 2012, I co-founded Lambda Foundry and we explored developing value-add financial software for the Python data stack. Ultimately the team and I went our separate ways.

In January 2013, I co-founded DataPad with Chang She, a fellow MIT grad and former AQR colleague. We were developing a full-stack visual analytics product for business users, using the Python data stack for most of our core technology. We raised venture capital from Accel Partners, Google Ventures, SV Angel, Andreessen Horowitz, and other investors.

In September 2014, DataPad's technology assets were acquired by Cloudera and we joined the engineering team there.

Earlier life

I was born in 1985. I grew up mostly in northeast Ohio, preceded by four generations of newspapermen.

From 1998 to 2001, I was very involved in the video game speed run community. I managed the website for the GoldenEye 007 speed run community and competed in the game myself.

I'm an avid traveler and, since a young age, have had a keen interest in linguistics, accents, and foreign languages. I've spent the most time in Spain and Germany.