SciPy 2011 Conference Highlights

SciPy 2011 was a blast! Intense and fun but of course tiring due to burning the candle at both ends. I was delighted to see a lot of familiar faces after my first SciPy conference last year: the Enthought crew (Travis Oliphant, Eric Jones, et al.), Peter Wang (chaco), Fernando Pérez and MinRK / Min Ragan-Kelley (of IPython fame), my statsmodels collaborator Skipper Seabold, Stefan van der Walt, Josh Hemann, and too many others to mention. It was also great to finally meet a lot of other well-known SciPythonistas for the first time: Chris Fonnesbeck (PyMC), John D. Cook, Gaël Varoquaux (scikit-learn, Mayavi), and many others. A 2-day slate of talks seems too short to do justice to all the awesome stuff people are doing!

Here’s a partial list of my favorite parts of the conference and general thoughts (in no particular order). Since there were two talk tracks running in parallel I unfortunately could only see half the talks; fortunately videos should be posted soon. I’ll return to this post and add video links once they’re available. I’ll write a separate post about our statsmodels sprint!

Fernando Pérez’s IPython talk: Slides

I always tell people that IPython is one of scientific Python’s killer apps. For proof, watch the video of Fernando’s talk once it’s up. Chris Fonnesbeck wrote a great blog article on the new features in IPython. In essence they’ve taken an already innovative and hugely productivity-enhancing tool and rearchitected it to be a) much more easily embedded (see the rich Qt console and the web interface) and b) an extremely powerful parallel computing environment. Truly inspiring work from MinRK, Fernando, Brian Granger, and crew. My hat’s off to you guys.

On my end I started using the rich Qt console (ipython-qtconsole) for interactive work and demos a few months ago. Having inline matplotlib plots while doing tech demos is a huge, huge win!

IPython Rich Qt Console

Peter Wang’s talk on Metagraph: Slides

Fortunately Peter’s slides have notes on them, which help in understanding the slides in more detail. He’s working to solve some really major problems here at the core of how we do computation on arrays. The notes do far better justice to the talk, but the nutshell is that he’s building a loop-fusing (a.k.a. “stream fusion”) compiler for expressing array computations and processing streaming data. This is extremely exciting as for the longest time this has been a watershed between NumPy-based tools and work being done in the APL/J/K family and functional languages like Haskell. Especially for big data / streaming data applications, a “fusing” compiler / VM will eliminate redundant loops and intermediate arrays, enabling the program to make only a single pass over the data as opposed to the multiple passes that commonly happen now. This is related to Python projects like numexpr and Theano (which are also worth checking out).
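To make the "single pass versus multiple passes" point concrete, here is a minimal sketch of what fusion buys you. It is my own illustration and has nothing to do with Peter's actual implementation: plain Python lists stand in for arrays, and the function names `unfused` and `fused` are made up for the example.

```python
# Hypothetical illustration of loop fusion; plain lists stand in for arrays.

def unfused(a, b, c):
    """Compute sum(a * b + c) the way eager array code usually evaluates it."""
    tmp1 = [x * y for x, y in zip(a, b)]     # pass 1: temporary holding a * b
    tmp2 = [t + z for t, z in zip(tmp1, c)]  # pass 2: temporary holding tmp1 + c
    return sum(tmp2)                         # pass 3: the reduction


def fused(a, b, c):
    """Same result in one pass over the data, with no intermediate lists."""
    total = 0
    for x, y, z in zip(a, b, c):
        total += x * y + z
    return total


a = [1, 2, 3]
b = [4, 5, 6]
c = [7, 8, 9]
print(unfused(a, b, c), fused(a, b, c))
```

Both functions return the same number, but the first one walks the data three times and allocates two throwaway lists. A fusing compiler would, roughly speaking, let you write the first form and execute something closer to the second.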

The extent of what’s possible with these kinds of ideas is a bit hard to grok but I’m very much looking forward to seeing how this project develops over the coming 6 months. As he says “The goal is to make scientific computing Pythonic”. Bringing the full power of array-based and functional languages at the core computational level with a very Pythonic interface could have very serious impact on the direction of scientific computing.

Gaël Varoquaux’s Neuroscience / scikit-learn Talk: Slides

If you’ve ever tried making a presentation in LaTeX, it should be clear that Gaël is at one with the beamer gods. That is one swanky looking deck of slides; it should come as no shock that it’s had nearly 30k views on SlideShare! Aesthetics aside, I’m excited to see how much progress the scikit-learn folks have made on building a really excellent machine learning library. Now if only we can get that much muscle poured into statsmodels! I think generally machine learning is an optimal application area for the scientific Python stack and this talk shows why: solid algos, great data visualization, and excellent task / workflow / big data management tools (e.g. joblib).

As an aside, Gaël’s point that you “cannot develop science and software separately” is highly relevant to all data-driven fields of academic research. Far too often in academia, software development is viewed as a distant second to innovations in methodology. Faculty very rarely get tenure based on their contributions to research software, no matter how impactful. As a result, graduate students and faculty alike are left re-inventing the wheel more often than not, leading to a wasteland of essentially throwaway MATLAB or R code. I am hopeful that this pattern will someday change at a grand scale.

Matthew Goodman’s Lightning Talk: Slides

Here’s another talk where having the video makes a big difference. MG gives a great list of indispensable scientific Python tools. I especially enjoyed the bit about IPython: “If you are not using this tool, you are doing it wrong!”

Hilary Mason’s keynote talk

Hilary gave a really fun keynote presentation about scientific computing and the work going on at bit.ly. There’s a lot more happening in URL-shortening-land than I thought! When asked about other technologies like R she said, “There’s no way we [bit.ly] will run R in production.” Having experienced the misery associated with trying to run R in production in the quant finance world I must say that I definitely agree with that sentiment.

Python in Finance Panel

Travis Oliphant, Henry Ward (CEO of Second Sight), and I participated in a panel discussion moderated by Peter Wang. We were asked about our experiences using Python for financial applications as well as the institutional and technical challenges associated with using Python to build research and production systems within larger financial enterprises.

It’s a topic for another blog post, but I spoke at length about the role that Python plays in solving the so-called “research-production gap”. That is, often financial firms do research in one language (R and MATLAB are quite popular) and do production implementations in another (Java/C++). I made the argument that, outside of low-latency, high-frequency trading, there are huge organizational benefits to building a one-language platform in Python. Also, Python may be the only programming language that is high-productivity and suitable for interactive research as well as building production-worthy, robust, and maintainable systems. More and more hedge funds, banks, and other financial firms are realizing this as time goes on.

Travis spoke about many of the challenges of using Python within larger organizations like investment banks (something I’ve no experience with, but Enthought has done significant consulting work in that space). He had a number of other valuable perspectives, many of which are escaping me (will need to go back and watch the video!).

Henry Ward, who is leading a startup effort to bring quantitative investment tools (such as portfolio optimization) to retail investors, spoke about his experiences building scalable backend systems in Python which also need to carry out extensive statistical computations and data manipulations, for example using pandas. Since Python has great web application tools like Django, rich RPC protocols (e.g. Thrift), and scientific computing libraries, it’s possible to build a pure Python architecture which can scale very naturally on the cloud. Very cool stuff. He also mentioned starting out the project in Ruby and concluded that “Friends don’t let friends do science in Ruby.” Well put, sir.

Other links of interest

I’ll add stuff here when I think of it (and when I get a chance to see videos for the talks I missed).