R is much more magical than Python. What do I mean by this? In R, things like this are a part of everyday life:
> a <- rnorm(10)
> b <- rnorm(10)
> cbind(a, b)
               a          b
 [1,]  0.8729978  0.5170078
 [2,] -0.6885048 -0.4430447
 [3,]  0.4017740  1.8985843
 [4,]  2.1088905 -1.4121763
 [5,]  0.9375273  0.4703302
 [6,]  0.5558276 -0.5825152
 [7,] -2.1606252  0.7379874
 [8,] -0.7651046 -0.4534345
 [9,] -4.2604901  0.9561077
[10,]  0.3940632 -0.8331285
If you're a seasoned Python programmer, you might have the same sort of visceral negative reaction to this that I do. Seriously, just where in the hell did those variable names come from? So when I say magic here, I'm talking about abusing the language's parser: cbind recovers the names from the unevaluated argument expressions. Nothing special about R makes the above behavior possible; rather, R takes a fundamentally different design philosophy from, say, Python. As any Python programmer knows: Explicit is better than implicit. I happen to agree. There is also a bit of a semantic difference between R and Python: assignment in R behaves as though it copies data (strictly speaking, R copies on modification), whereas variables in Python are simply references (labels) for a particular object. So you could make the argument that the names a and b above are more strongly linked to the underlying data.
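To make the Python half of that comparison concrete, here is a minimal sketch using a plain list rather than pandas objects (the variable names are purely illustrative). Binding a second name to an object does not copy it, so any given name is just one of possibly several labels for the same data:

```python
import copy

a = [1.0, 2.0, 3.0]
b = a              # binds a second name to the same list; no data is copied
b.append(4.0)      # the change is visible through both names

c = copy.copy(a)   # an explicit (shallow) copy creates a separate list
c.append(5.0)      # only c sees this element

print(a is b)      # True
print(a)           # [1.0, 2.0, 3.0, 4.0]
print(c)           # [1.0, 2.0, 3.0, 4.0, 5.0]
```

Since the object itself has no idea which names point at it, a function that receives it has to go digging in the caller's frame to find out what it was called there.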
While building pandas over the last several years, I have occasionally grappled with issues like the above. Maybe I should just break from the Python ethos and embrace magic? I mean, how hard would it be to get the above behavior in Python? Python gives you stack frames and the ast module, after all. So I went down the rabbit hole and wrote this little code snippet:
import ast
import inspect

import numpy as np  # used by the np.log example further down
import pandas.util.testing as tm
from pandas import DataFrame


def merge(a, b):
    # Recover the argument expressions from the caller's source line
    # and use them as the column names.
    f, args, _ = parse_stmt(inspect.currentframe().f_back)
    return DataFrame({args[0]: a,
                      args[1]: b})


def parse_stmt(frame):
    # code_context holds the source line of the statement being executed
    # in the given frame; strip it so that indented call sites still parse.
    info = inspect.getframeinfo(frame)
    call = info.code_context[0].strip()
    mod = ast.parse(call)
    body = mod.body[0]
    if isinstance(body, (ast.Assign, ast.Expr)):
        call = body.value
    elif isinstance(body, ast.Call):
        call = body
    return _parse_call(call)


def _parse_call(call):
    func = _maybe_format_attribute(call.func)

    str_args = []
    for arg in call.args:
        if isinstance(arg, ast.Name):
            # Plain variable: use its name.
            str_args.append(arg.id)
        elif isinstance(arg, ast.Call):
            # Nested call such as np.log(b): format it recursively.
            formatted = _format_call(arg)
            str_args.append(formatted)
        # Any other kind of argument (literal, operator, ...) is skipped.

    # Keyword arguments are not parsed; an empty dict stands in for them.
    return func, str_args, {}


def _format_call(call):
    func, args, kwds = _parse_call(call)
    content = ''
    if args:
        content += ', '.join(args)
    if kwds:
        fmt_kwds = ['%s=%s' % item for item in kwds.items()]
        joined_kwds = ', '.join(fmt_kwds)
        if args:
            content = content + ', ' + joined_kwds
        else:
            content += joined_kwds
    return '%s(%s)' % (func, content)


def _maybe_format_attribute(name):
    if isinstance(name, ast.Attribute):
        return _format_attribute(name)
    return name.id


def _format_attribute(attr):
    # Rebuild dotted names such as np.log from nested Attribute nodes.
    obj = attr.value
    if isinstance(attr.value, ast.Attribute):
        obj = _format_attribute(attr.value)
    else:
        obj = obj.id
    return '.'.join((obj, attr.attr))


a = tm.makeTimeSeries()
b = tm.makeTimeSeries()
df = merge(a, b)
While this is woefully unpythonic, it's also kind of cool:
In [27]: merge(a, b)
Out[27]:
                   a        b
2000-01-03     -1.35   0.8398
2000-01-04     0.999   -1.617
2000-01-05    0.2537    1.433
2000-01-06    0.6273  -0.3959
2000-01-07    0.7963   -0.789
2000-01-10  0.004295   -1.446
This can even parse and format more complicated expressions (harder than it looks, because you have to walk the whole AST):
In [30]: merge(a, np.log(b))
Out[30]:
                  a  np.log(b)
2000-01-03   0.6243     0.7953
2000-01-04   0.3593     -1.199
2000-01-05    2.805     -1.059
2000-01-06   0.6369    -0.9067
2000-01-07  -0.2734        NaN
2000-01-10   -1.023     0.3326
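As an aside, much of the hand-rolled formatting could probably be delegated to the standard library on newer Pythons. The following is only a minimal sketch, assuming Python 3.9 or later (where ast.unparse is available). The helper name arg_labels is hypothetical and not part of pandas, and it takes the call as a source string rather than pulling it out of a stack frame:

```python
import ast


def arg_labels(call_source):
    """Return one label per positional argument of a call given as source text."""
    stmt = ast.parse(call_source.strip()).body[0]
    # Accept both `merge(a, b)` and `df = merge(a, b)`.
    if not isinstance(stmt, (ast.Expr, ast.Assign)):
        raise ValueError("expected an expression or assignment statement")
    call = stmt.value
    if not isinstance(call, ast.Call):
        raise ValueError("expected a function call")
    # ast.unparse (Python 3.9+) turns each argument node back into source text.
    return [ast.unparse(arg) for arg in call.args]


print(arg_labels("merge(a, b)"))               # ['a', 'b']
print(arg_labels("df = merge(a, np.log(b))"))  # ['a', 'np.log(b)']
```

Unlike the snippet above, this would also label arguments that are neither plain names nor calls (for example a + b), at the cost of requiring a recent interpreter.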
Now, I am *not* suggesting we do this any time soon. I'm going to prefer the explicit approach (cf. the Zen of Python) any day of the week:
In [32]: DataFrame({'a' : a, 'log(b)' : np.log(b)})
Out[32]:
                  a   log(b)
2000-01-03   0.6243   0.7953
2000-01-04   0.3593   -1.199
2000-01-05    2.805   -1.059
2000-01-06   0.6369  -0.9067
2000-01-07  -0.2734      NaN
2000-01-10   -1.023   0.3326