Even easier frequency tables in pandas 0.7.0

I put in a little work on a new crosstab function in the main pandas namespace. It’s basically a convenient shortcut to calling pivot_table to make it easy to compute cross-tabulations for a set of factors using pandas DataFrame or even vanilla NumPy arrays!

Here’s an example:

Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
In [1]: from pandas import *
In [2]: df = DataFrame({'a' : np.random.randint(0, 2, size=20),
...: 'b' : np.random.randint(0, 5, size=20),
...: 'c' : np.random.randint(0, 3, size=20)})
In [3]: df
Out[3]:
a b c
0 0 1 2
1 1 4 1
2 1 4 2
3 1 2 2
4 0 0 0
5 0 0 2
6 1 2 2
7 1 2 0
8 1 1 1
9 1 2 0
10 1 4 1
11 1 1 2
12 0 1 2
13 1 1 0
14 1 4 1
15 1 4 0
16 0 4 1
17 0 3 1
18 1 0 1
19 0 4 1
In [4]: crosstab(df['b'], [df['a'], df['c']])
Out[4]:
a 0 1
c 0 1 2 0 1 2
b
0 1 0 1 0 1 0
1 0 0 2 1 1 1
2 0 0 0 2 0 2
3 0 1 0 0 0 0
4 0 2 0 1 3 1

This makes it very easy to produce an easy-on-the-eyes frequency table. crosstab can also take NumPy arrays. Suppose we had 1 million draws from a normal distribution, and we wish to produce a histogram-like table showing the number of draws whose absolute values fall into the bins defined by [0, 1, 2, 3]. Also, let’s divide things up by sign. Using crosstab this becomes dead simple.

Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
In [19]: arr = np.random.randn(1000000)
In [20]: buckets = [0, 1, 2, 3]
In [22]: crosstab(np.digitize(np.abs(arr), buckets), np.sign(arr),
....: rownames=['bucket'], colnames=['sign'])
....:
Out[22]:
sign -1.0 1.0
bucket
1 341678 340498
2 136104 135999
3 21424 21607
4 1334 1356

Of course since the result is a DataFrame you can customize the indexes:

Python
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
In [28]: table = crosstab(np.digitize(np.abs(arr), buckets),
....: np.sign(arr), rownames=['bucket'],
....: colnames=['sign'])
....:
In [29]: table.index = Index(['[0, 1)', '[1, 2)', '[2, 3)',
....: '[3, inf)'], name="bin")
....:
In [30]: table.columns = Index(['negative', 'positive'], name="sign")
In [31]: table
Out[31]:
sign negative positive
bin
[0, 1) 341678 340498
[1, 2) 136104 135999
[2, 3) 21424 21607
[3, inf) 1334 1356