Python, R, and the allure of magic

R is much more magical than Python. What do I mean by this? In R, things like this are a part of everyday life:

> a <- rnorm(10)
> b <- rnorm(10)
> cbind(a, b)
               a          b
 [1,]  0.8729978  0.5170078
 [2,] -0.6885048 -0.4430447
 [3,]  0.4017740  1.8985843
 [4,]  2.1088905 -1.4121763
 [5,]  0.9375273  0.4703302
 [6,]  0.5558276 -0.5825152
 [7,] -2.1606252  0.7379874
 [8,] -0.7651046 -0.4534345
 [9,] -4.2604901  0.9561077
[10,]  0.3940632 -0.8331285

If you’re a seasoned Python programmer, you might have the same visceral negative reaction to this that I do. Seriously, just where in the hell did those variable names come from? So when I say magic here, I’m talking about abusing the language’s parser. Nothing special about R makes the above behavior possible; it simply follows a fundamentally different design philosophy from, say, Python. As any Python programmer knows: explicit is better than implicit. I happen to agree.

There is also a semantic difference between R and Python: assignment in R typically copies data, whereas variables in Python are simply references (labels) for a particular object. So you could argue that in R the names a and b above are more strongly linked to the underlying data.
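To make the reference semantics concrete, here is a small sketch in plain Python. The list values and the names are made up for illustration; the point is only that two names can label the same object, so an object has no single name for a function to recover.

```python
# Python assignment binds a name to an object; it does not copy the object.
a = [0.87, -0.69, 0.40]
b = a              # b is a second label for the same list
b.append(2.11)

print(a is b)      # True: both names refer to one object
print(len(a))      # 4: the append made through b is visible through a

# An explicit copy breaks the link between the two names.
c = list(a)
c.append(0.94)
print(len(a), len(c))  # 4 5
```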

While building pandas over the last several years, I have occasionally grappled with issues like the above. Maybe I should just break from the Python ethos and embrace magic? I mean, how hard would it be to get the above behavior in Python? Python gives you stack frames and the ast module, after all. So I went down the rabbit hole and wrote this little code snippet:
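What follows is a minimal sketch of the idea, not that snippet itself. It assumes Python 3.9+ (for ast.unparse), assumes the call fits on a single source line, and returns a plain dict where a DataFrame would normally go. The helper name call_arg_labels and the fallback labels x0, x1, ... are hypothetical.

```python
import ast
import inspect
import linecache


def call_arg_labels(source, func_name):
    """Return the source text of each positional argument of the first
    call to `func_name` found in `source`, or None if there is no such call."""
    tree = ast.parse(source.strip())
    for node in ast.walk(tree):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == func_name):
            # ast.unparse turns each argument node back into source text
            return [ast.unparse(arg) for arg in node.args]
    return None


def merge(*columns):
    """Label each argument with the expression the caller wrote for it."""
    # Look one stack frame up to find the line that called us.
    frame = inspect.currentframe()
    caller = frame.f_back if frame is not None else None
    line = ""
    if caller is not None:
        line = linecache.getline(caller.f_code.co_filename, caller.f_lineno)

    labels = None
    try:
        labels = call_arg_labels(line, "merge")
    except SyntaxError:
        pass  # e.g. the call spans several lines

    # Fall back to positional labels when the source is unavailable
    # or does not match the arguments actually received.
    if not labels or len(labels) != len(columns):
        labels = [f"x{i}" for i in range(len(columns))]
    return dict(zip(labels, columns))
```

Called from a script file as merge(a, np.log(b)), this labels the two columns a and np.log(b). When the calling line cannot be read back, it quietly falls back to x0 and x1, which is exactly the sort of context-dependent behavior that makes this style fragile.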

While this is woefully unpythonic, it’s also kind of cool:

In [27]: merge(a, b)
Out[27]:
            a         b      
2000-01-03 -1.35      0.8398
2000-01-04  0.999    -1.617  
2000-01-05  0.2537    1.433  
2000-01-06  0.6273   -0.3959
2000-01-07  0.7963   -0.789  
2000-01-10  0.004295 -1.446

This can even parse and format more complicated expressions (harder than it looks, because you have to walk the whole AST):

In [30]: merge(a, np.log(b))
Out[30]:
            a        np.log(b)
2000-01-03  0.6243   0.7953  
2000-01-04  0.3593  -1.199    
2000-01-05  2.805   -1.059    
2000-01-06  0.6369  -0.9067  
2000-01-07 -0.2734   NaN      
2000-01-10 -1.023    0.3326

Now, I am *not* suggesting we do this any time soon. I’m going to prefer the explicit approach (cf. the Zen of Python) any day of the week:

In [32]: DataFrame({'a' : a, 'log(b)' : np.log(b)})
Out[32]:
            a        log(b)
2000-01-03  0.6243   0.7953
2000-01-04  0.3593  -1.199  
2000-01-05  2.805   -1.059  
2000-01-06  0.6369  -0.9067
2000-01-07 -0.2734   NaN    
2000-01-10 -1.023    0.3326

  • Dirk Eddelbuettel (http://dirk.eddelbuettel.com)

    The advantage of *optional* behaviour is that you’re not forced to use it. If you prefer to be explicit, just do “newDF <- data.frame(myNewName=a, myOtherName=b)” instead. Better?

    Wes McKinney Reply:

    Indeed, you certainly have the option of being explicit in R. My point was rather that enabling / encouraging “parser abuse” (or “S-expression magic” in Lisp parlance, if you will) leads to badly (or confusingly) designed software. R doesn’t even do a consistent job of it:

    > cbind(a, exp(b))
                       a
     [1,]  1.3246856 2.3932937
     [2,]  2.2343485 0.6771349
     [3,] -0.3247855 1.3599159
     [4,] -0.2181146 0.3034037
     [5,]  1.5601856 0.9904395
     [6,] -1.3391083 2.1795645
     [7,]  0.7141858 0.2738150
     [8,]  0.7324488 0.6660562
     [9,] -0.9724181 0.6383835
    [10,] -0.9690034 0.4292268

    So I guess maybe when variables get passed R keeps track of their bound variable names, but not if they are “unbound”.

    Dirk Eddelbuettel Reply:

    Still a non-issue, as no experienced R coder would use cbind to create lasting data structures for later (human) consumption. Different strokes for different folks…

    Joshua Ulrich Reply:

    In the case of cbind, you could force explicitness by setting deparse.level=0. “Abuse” is a bit of a harsh description considering the behavior is documented.

    Your cbind example is consistent with the documentation, which says deparse.level=1 will only assign a column name if it is “sensible” (a valid name/symbol), which “exp(b)” is not:

    > make.names("exp(b)")
    [1] "exp.b."

    Set deparse.level=2 and R then “does a consistent job of it”. But you will have difficulty using the result of cbind(a,exp(b),deparse.level=2) in a model because the second argument isn’t a valid name. Trade-offs…

    Wes McKinney Reply:

    Like I said, this kind of stuff is deeply entrenched in the R ethos, and you’re welcome to it. I realize that it’s there for domain-specific reasons (making it easier to munge vectors and data.frames together without having to manually assign names to 1d objects). All I was saying is that I reject it as an acceptable design pattern in Python. To me this falls into the category of “give someone an inch and they’ll take a mile”.

    I realize I’m being a bit cheeky calling it “parser abuse”, but I mean seriously, using information from the previous stack frame (or worse, the global namespace)? Thinking about it makes me want to wash my hands :)

  • Joon Ro

    Maybe

  • Joon Ro

    How about making that behavior a little more explicit with something like “merge(a, b, use_argname_as_label=True)”?
    (With a prettier argument name, of course.)

    I must admit that using a dictionary is more explicit. :)

  • Jan

    One place where such magic could be really great would be in statsmodels:

    df = df_full.dropna()
    y = df["var1"]
    X = df.ix[:,["Var2", "Var3", "Var4"]]
    result = sm.OLS(y, X).fit()
    print (result.summary())

    -> the variable names are not shown, just some "y" and "x1" :-(

    It would be even cooler if statsmodels/pandas supported some kind of model object like R does:

    m = Model("var1 ~ var2 + ln(var3) + var4")
    sm.OLS(m, df).fit().summary()

    Wes McKinney Reply:

    Some pandas integration in statsmodels (e.g. variable names in the output) is coming in the next release, or as soon as I can find some time to hack on the project. We’ll eventually have a formula system too.
