Why I’m not on the Julia bandwagon (yet)

EDIT (2013-07-11): The following article may or may not be accurate anymore since it’s been more than a year. I recommend running your own benchmarks on array expressions with the latest Julia builds and drawing your own conclusions.

I apologize in advance for writing what might be the first neutral-to-negative article about Julia that you’ve read. For those of you who don’t know, Julia is a new dynamic language with an LLVM-based JIT compiler designed for technical computing. Its intent is to bridge the gap between high-level and low-level languages: to give you the power, flexibility, and general-purpose usefulness of a Lisp- or Python-like language, but with low-level performance. It has MATLAB-like syntax but also a lot of the features of Lisp (macros, metaprogramming) and Python that we’ve all come to love. Before anything else, I’d like to say that I’m very impressed with the project, the team, and overall their technology choices.

Many people are very excited about Julia, and one of the main reasons is a set of very impressive benchmarks. That’s great, but in my opinion those benchmarks are fairly misleading and don’t represent the reality of a lot of technical computing.

I compiled Julia in advance of Stefan Karpinski’s (one of the creators) talk in New York this week. I ran a few benchmarks and then proceeded to raise hell on the Julia mailing list.

TL;DR: if you’re an R, MATLAB, or scientific Python user, it’s probably not time to start programming in Julia if your goal is the highest-performance code for the minimum effort. It may be in a couple of years, when the JIT compiler improves. However, Julia may soon be a reasonable free computational alternative to MATLAB. I say this even imagining a perfect world in which Julia had all the libraries that Python and R have; as a sole language for building large scientific production applications in industry right now, I think Python + Cython is pretty much the only game in town.

Why am I complaining about the benchmarks? Well, one of the first things I did was look at the performance of array operations. In scientific and technical computing, we work a lot with arrays, right? In fact, conventional wisdom is “don’t write loops”. Yet many people have told me that in Julia I should be unrolling array operations into loops, and my experience is that this is indeed the way to get the best performance there. Keep in mind that I am a performance-obsessed tool maker, and in libraries like pandas I am in control of the vast majority of the areas of people’s applications that would otherwise be bottlenecks.

Here’s a simple benchmark: compute the inner product of two arrays, in other words, the sum of the element-wise product. Julia code (I disabled the garbage collector to remove its effect from the equation):

x = randn(10000000)
y = randn(10000000)

# explicit-loop version of the inner product
function inner(x, y)
    result = 0.0
    for i = 1:length(x)
        result += x[i] * y[i]
    end
    result
end

gc_disable()

# vectorized version: element-wise product plus sum (allocates a temporary array)
tic()
for i = 1:50 result = sum(x .* y); end
timing = toc();
println("Array operations took ", (timing / 50) * 1000, " ms")

# unrolled-loop version
tic()
for i = 1:50 result = inner(x, y); end
timing = toc();
println("inner took ", (timing / 50) * 1000, " ms")

And the result:

11:28 ~/code/repos/julia (master)$ ./julia bench.jl
Array operations took 104.71534252166748 ms
inner took 36.00101947784424ms

Finally, on the Julia side, you might recognize that an array inner product can be expressed as a matrix operation and carried out by BLAS in Fortran:

# 1 x N row vector times N x 1 column vector: a matrix product dispatched to BLAS
x = randn(1, 10000000)
y = randn(10000000, 1)

tic()
for i = 1:50 result = x * y; end
timing = toc();
println("BLAS took ", (timing / 50) * 1000, "ms")

=> BLAS took 50.72967529296875ms

Most of you know me as a Python hacker, so I’ll run the same array benchmark with NumPy:

In [9]: x = randn(10000000)
In [10]: y = randn(10000000)
In [11]: timeit (x * y).sum()
10 loops, best of 3: 57.8 ms per loop
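
Incidentally, the matrix-shaped formulation from the Julia BLAS snippet above translates directly to NumPy. This is only a sketch for comparison; I have not timed it here:

import numpy as np

# 1 x N row vector times N x 1 column vector, handled by BLAS just as in Julia
x = np.random.randn(1, 10000000)
y = np.random.randn(10000000, 1)
result = np.dot(x, y)  # a 1 x 1 array containing the inner product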

OK, so NumPy is faster than Julia for array operations, temporary array and all. But how much faster can you go by eliminating the temporary, the way Julia’s inner function does? Here is the Cython code:

from numpy cimport ndarray, float64_t

def inner(ndarray[float64_t] x, ndarray[float64_t] y):
    cdef Py_ssize_t i, n = len(x)
    cdef float64_t result = 0
    # typed buffer access: no temporary array is allocated
    for i in range(n):
        result += x[i] * y[i]
    return result
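
For completeness, here is roughly how such a Cython function gets compiled into an importable extension module (in my case it lives in pandas’ _sandbox module, as the import below shows). This is only a minimal build sketch; the file name inner_prod.pyx is hypothetical:

# setup.py: minimal Cython build sketch (inner_prod.pyx is a hypothetical file name)
from distutils.core import setup
from Cython.Build import cythonize
import numpy as np

setup(
    ext_modules=cythonize("inner_prod.pyx"),
    include_dirs=[np.get_include()],  # header path needed for the numpy cimport
)

Build it in place with "python setup.py build_ext --inplace" and import the resulting module.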

And the timing:

1
2
3
4
5
6
7
8
9
10
11
In [1]: x = randn(10000000)
In [2]: y = randn(10000000)
In [3]: from pandas._sandbox import inner
In [4]: inner(x, y)
Out[4]: -3117.5608515640943
In [5]: timeit inner(x, y)
100 loops, best of 3: 14.5 ms per loop

So Julia is about 2.5 times slower than Cython in this trivial benchmark (straight C wouldn’t be that much faster). You can do even better by using the Intel MKL in NumPy:

In [7]: timeit np.dot(x, y)
100 loops, best of 3: 9.43 ms per loop
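
For 1-D contiguous float64 arrays, np.dot dispatches to the BLAS ddot routine, which is why linking NumPy against MKL pays off. If you want to see the BLAS call spelled out, SciPy exposes it directly; a sketch, assuming a reasonably recent SciPy linked against the same BLAS:

from scipy.linalg.blas import ddot  # low-level BLAS double-precision dot product

result = ddot(x, y)  # same value as np.dot(x, y), computed by the BLAS library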

So anyway, Julia is not a magic solution. It _is_ pretty good, though, especially if you never want to have to compile code. Having to unroll array expressions into loops is still not good, and I don’t think it’s a habit that should be encouraged. And the JIT will get better over time. The Julia devs are, I believe, spending their time more fruitfully on language design, libraries and features, and “big science” topics like distributed computing and parallel processing, which I think will be very impactful. But if you’re looking for a high-performance alternative to R for implementing scientific algorithms with low effort, the sweet spot (at least in my opinion) between performance and fast development time is Cython. Still, if a perfect marriage of Python, Julia, and Cython is made, perhaps we can all have our cake and eat it too.