Why I’m not on the Julia bandwagon (yet)

I apologize in advance for writing what might be the first neutral to negative article about Julia that you’ve read. For those of you who don’t know, Julia is a new dynamic language with an LLVM-based JIT compiler designed for technical computing. It’s intent is to try to bridge the gap between high level languages and low level ones in terms of giving you the power, flexibility, and general-purpose usefulness of a Lisp or Python-like language but with low-level performance. It has MATLAB-like syntax but has a lot of the features of Lisp (macros, metaprogramming) and Python that we’ve all come to love. Before anything else, I’d like to say that I’m very impressed with the project, the team, and overall their technology choices.

Many people are very excited about Julia, and one of the main reasons is because of a number of very impressive benchmarks. This is great and I’m also impressed with the benchmarks, but they are in my opinion fairly misleading and don’t represent the reality of a lot of technical computing.

I compiled Julia in advance of Stefan Karpinski’s (one of the creators) talk in New York this week. I ran a few benchmarks and then proceeded to raise hell on the Julia mailing list.

TL;DR: if you’re an R, MATLAB, or scientific Python user, it’s probably not time to start programming in Julia if the highest performance code is your goal for minimum effort. It may be in a couple of years when the JIT-compiler improves. However, Julia may soon be a reasonable free computational alternative to MATLAB. I say this even in a perfect world where Julia had all the libraries that Python and R have; as far as a sole language for building large scientific production applications in industry right now, I think Python + Cython are pretty much the only game in town.

Why am I complaining about the benchmarks? Well, one of the first things I did was look at the performance of array operations. In scientific and technical computing, we work a lot with arrays, right? In fact, convention wisdom is “don’t write loops”. Many people have told me that I should be unrolling array operations into loops and my experience is that this is the way to get the best performance. Keep in mind that I am a performance-obsessed tool maker, and in libraries like pandas I am in control of the vast majority of the areas of people’s applications that would otherwise be bottlenecks.

Here’s a simple benchmark: compute the inner product of two arrays, in other words, the sum of the element-wise product. Julia code (I disabled the garbage collector to remove its effect from the equation):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
x = randn(10000000)
y = randn(10000000)
function inner(x, y)
result = 0.0
for i=1:length(x)
result += x[i] * y[i]
end
result
end
gc_disable()
tic()
for i=1:50 result = sum(x .* y); end
timing = toc();
println("Array operations took ", (timing / 50) * 1000, " ms")
tic()
timing = @time for i=1:50 result = inner(x, y); end
timing = toc();
println("inner took ", (timing / 50) * 1000, " ms")

And the result:

1
2
3
11:28 ~/code/repos/julia (master)$ ./julia bench.jl
Array operations took 104.71534252166748 ms
inner took 36.00101947784424ms

Lastly, you might recognize that an array inner product can be expressed as a matrix operation and carried out by BLAS in Fortran:

1
2
3
4
5
6
7
8
9
x = randn(1, 10000000)
y = randn(10000000, 1)
tic()
timing = @time for i=1:50 result = x * y; end
timing = toc();
println("BLAS took ", (timing / 50) * 1000, "ms")
=> BLAS took 50.72967529296875ms

So most of you know me as a Python hacker. So I’ll run the same array benchmark with NumPy:

1
2
3
4
5
6
In [9]: x = randn(10000000)
In [10]: y = randn(10000000)
In [11]: timeit (x * y).sum()
10 loops, best of 3: 57.8 ms per loop

OK, so NumPy is faster than Julia for array operations, temporary array and all. But how much faster can you go by eliminating temporaries like Julia’s inner function? Here is Cython code:

1
2
3
4
5
6
def inner(ndarray[float64_t] x, ndarray[float64_t] y):
cdef Py_ssize_t i, n = len(x)
cdef float64_t result = 0
for i in range(n):
result += x[i] * y[i]
return result

And the timing:

1
2
3
4
5
6
7
8
9
10
11
In [1]: x = randn(10000000)
y
In [2]: y = randn(10000000)
In [3]: from pandas._sandbox import inner
In [4]: inner(x, y)
Out[4]: -3117.5608515640943
In [5]: timeit inner(x, y)
100 loops, best of 3: 14.5 ms per loop

So Julia is about 2.5 times slower than Cython in this trivial benchmark (straight C wouldn’t that much faster). You can do even better by using the Intel MKL in NumPy:

1
2
In [7]: timeit np.dot(x, y)
100 loops, best of 3: 9.43 ms per loop

So anyway, Julia is not a magic solution. It _is_ pretty good though, especially if you never want to have to compile code. Having to unroll array expressions into loops is still not good and I don’t think it’s a habit that should be encouraged. And the JIT will get better over time. The Julia devs are spending a lot of their time I believe more fruitfully on language design, libraries and features, and “big science” topics like distributed computing and parallel processing problems which I think will be very impactful. But if you’re looking for a high performance alternative to R for implementing scientific algorithms with low effort, the sweet spot (at least in my opinion) between performance and fast development time is Cython. But if a perfect marriage of Python, Julia, and Cython is made, perhaps we can all have our cake and eat it too.

  • Dan Becker

    Can you comment on how you expect Julia to compare to R and Python in terms of ease-of-development?

    [Reply]

    Wes McKinney Reply:

    Right now it’s the wild west with no debugger and no profiler. I prefer Python’s somewhat more concise and readable syntax and being 0-indexed (being 1-indexed like MATLAB and R seems like a mistake repeated to me). I don’t know how well Julia supports object-oriented programming which is one of Python’s big wins in a lot of applications.

    [Reply]

  • http://www.facebook.com/people/Bill-Hart/676964474 Bill Hart

    It’s not the Jit which needs work. The Jit that is used is the LLVM Jit, which produces basically the same code as the Clang compiler, which is roundly as fast as GCC. It’s Julia’s use of the Jit that needs improving.

    The microbenchmarks that you chose make use of the Julia library which is very immature. In this case the speed issues are probably more to do with the Julia compiler than problems in the library itself.

    I guess this does emphasise what you stated, it is too early to switch to Julia yet. But of course it is. It is at version 0.0.0 prerelease.

    [Reply]

    Wes McKinney Reply:

    You’re right, by “the JIT needs work” I meant “the Julia to LLVM IR generation needs work”

    [Reply]

    Bill Hart Reply:

    Thanks for the clarification. It would be an interesting problem to track down this performance problem and try to fix it. Hopefully it is not a problem with the way Julia represents the data, or with the type inference. I believe Jeff has at least had a look to see what IR is output by the Julia compiler.

    [Reply]

  • http://karpinski.org/ Stefan Karpinski

    Hi, Wes!

    Thanks for the post — we need constructive criticism even more than praise right now, so this post is really greatly appreciated.

    There’s already a “dot” function predefined in Julia for computing the dot product. (Mathy nit: there are many different inner products — the dot product is a specific inner product with respect to a chosen basis; that’s why we chose the name “dot” rather than “inner”.) The standard way to write this benchmark in Julia using “dot” would be this:

    x = randn(10000000);
    y = randn(10000000);

    timing = @elapsed for i=1:50 result = dot(x,y); end
    println(“dot took “, (timing/50)*1000, ” ms”)

    (Also note the use of the @elapsed macro for an easier way to get the timing value.)

    On my system, I get a timing of 35ms for this version, but I don’t have Cython installed, so I can’t benchmark on my system for a fair comparison. Maybe you could try this code on your machine?

    We support using MKL as well, if you have a licensed copy of it (e.g. from a Matlab install). There’s a branch in progress to make switching to using MKL trivial: https://github.com/JuliaLang/julia/tree/mkl. It’s currently doable, but not quite as easy as it ought to be.

    [Reply]

    Yann Le Du Reply:

    On my system

    [Reply]

    Joshua Adelman Reply:

    In some ways I feel like when I use Cython I’m engaging in less wizardry than I often do when I’m doing crazy vectorization/broadcasting/indexing tricks in Numpy to eek out performance. Coming from a background in Matlab, I’m pretty adept at this, but it often results in code that is difficult for others to understand (or me for that matter given enough temporal separation). I’ve started writing more Cython for this reason for both performance and readability. Most of the time there’s a bit of difference in the method declaration, but otherwise it looks like Python code where I’m explicitly looping over numpy array indices.

    Having left Matlab for Python a couple of years back, I wish they would have dispensed with the 1-base indexing and terminating for-loops with ‘end’ statements. I may change my mind in the future, but for now I’m happy with the Python/Cython, with the promise of Numba on the horizon.

    [Reply]

    Hernan Grecco Reply:

    I agree that vectorization of certain operations to make use of NumPy speed is very awkward sometimes. Cython provides a nice alternative but it deviates from Python too much. That’s why I find Numba so attractive https://github.com/numba/numba

    Wes McKinney Reply:

    I also get ~13.2 ms with dot(x, y) on the same machine, so that appears to be the fastest way to do it right now in Julia.

    [Reply]

  • http://www.holehouse.org/programming/julia-an-exciting-time-for-statistical-programming/ Julia – an exciting time for statistical programming

    [...] benchmarks Stefan showed and the feedback from the developers who presented was pretty impressive. Wes McKinney blogged about benchmarking some pretty trivial array operations more slowly than you’d expect given the [...]

  • http://lebigot.pip.verisignlabs.com/ Eric O LEBIGOT (EOL)

    Thank you for this interesting point of view.

    I looked at a couple of Python benchmarks from the Julia site, and some benchmarks are arguably not Pythonic and, as a consequence, much slower than they have to be (the pi_sum code is an example: a for loop is used, ala C, where Python coders would certainly use the built-in sum() function with an iterator).

    So, my take is the Julia is indeed fast, but that at least some Python benchmarks are arguably not representative and, therefore, useful.

    [Reply]

  • Arjun Variar

    If you replace the sum(x.*y) with x’*y it performs way better, at least it does so in the juliarepl. Can someone verify this on their machine.

    [Reply]

    Wes McKinney Reply:

    That’s likely true since it uses BLAS– I imagine that Julia has gotten quite a bit faster in the last 6 months too

    [Reply]

  • Anonymous

    Is there a numpy equivalent in julia?

    [Reply]