Specialized Tools for Aggregation, Summary, Analysis and Reporting¶
I use a wide variety of tools for aggregation ranging from relatively simple to complex and fully featured. Once I have filtered and extracted the data I want, I have a set of core tools I use for most summarization. Some are small, single-file scripts that I just copy to a new machine if needed, while others are full-blown packages.
Note that I won’t really talk about actual databases, as the data import/export process usually ends up being a significant factor (except for csv data via csvquery).
The ones that I tend to use most are:
stats¶
stats is a companion script to [scut](https://github.com/hjmangalam/scut) that prints descriptive statistics for a single column of numeric input.
https://github.com/hjmangalam/scut/blob/master/stats
seq 10 | stats
Sum 55
Number 10
Mean 5.5
Median 5.5
Mode FLAT
NModes No # was represented more than once
Min 1
Max 10
Range 9
Variance 9.16666666666667
Std_Dev 3.02765035409749
SEM 0.957427107756338
95% Conf 3.62344286879758 to 7.37655713120242
(for a normal distribution - see skew)
Skew 0
(skew is 0 for a symmetric dist)
Std_Skew 0
Kurtosis -1.40181818181818
(K=3 for a normal dist)
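If stats isn’t available on a machine and copying the script over isn’t convenient, a rough awk stand-in can reproduce a few of the same figures. This sketch is my own illustration, not part of scut:

```shell
# A rough awk stand-in for some of stats' output (sum, count, mean,
# min, max, sample standard deviation) for one column of numbers.
seq 10 | awk '
  NR == 1 { min = max = $1 }
  {
    sum += $1; sumsq += $1 * $1; n++
    if ($1 < min) min = $1
    if ($1 > max) max = $1
  }
  END {
    mean = sum / n
    printf "Sum %g\nNumber %d\nMean %g\nMin %g\nMax %g\n", sum, n, mean, min, max
    printf "Std_Dev %g\n", sqrt((sumsq - n * mean * mean) / (n - 1))
  }'
```

On `seq 10` this prints the same Sum, Mean, Min, Max, and Std_Dev values as the stats run above.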
csvstat¶
csvstat is another part of the great csvkit toolbox. It prints descriptive stats from csv files. You can also process a single numeric column without a header by using the -H flag, although it’s not particularly fast. Here’s an example of using it to generate some stats on a single column, although it will do similar goodness for a csv file with multiple columns. Naturally, for non-numeric columns, fewer statistics are reported.
jot -r 1500 1 999 | csvstat -H
/usr/local/lib/python2.7/site-packages/agate/table/from_csv.py:88: RuntimeWarning: Column names not specified. "('a',)" will be used as names.
1. "a"
Type of data: Number
Contains null values: False
Unique values: 764
Smallest value: 1
Largest value: 999
Sum: 746,595
Mean: 497.73
Median: 494.5
StDev: 288.637
Most common values: 370 (7x)
994 (7x)
770 (6x)
316 (5x)
488 (5x)
Row count: 1500
Here’s an example of getting stats on a simple 2-column csv file. Notice that the warning has gone away because we added the headers.
(seq 10; jot -r -c 10 97 120 ) | column -c 20 | csvformat -t | ( echo "num,letter"; cat;) | tee sample.csv
num,letter
1,g
2,k
3,a
4,p
5,m
6,q
7,l
8,w
9,r
10,d
Now, generate stats on all of the columns:
csvstat sample.csv
1. "num"
Type of data: Number
Contains null values: False
Unique values: 10
Smallest value: 1
Largest value: 10
Sum: 55
Mean: 5.5
Median: 5.5
StDev: 3.028
Most common values: 1 (1x)
2 (1x)
3 (1x)
4 (1x)
5 (1x)
2. "letter"
Type of data: Text
Contains null values: False
Unique values: 10
Longest value: 1 characters
Most common values: a (1x)
c (1x)
f (1x)
h (1x)
k (1x)
Row count: 10
datamash¶
GNU datamash does computation and stats on the command line.
quick grouped stats with datamash¶
For very simple summary stats, I often turn to the stats
command or
csvstat
. However, if I want to do a bit more, like aggregate by one or more
columns, gnu datamash
is very useful. It handles large streamed data sets very quickly, and has a
variety of statistical functions available. By default it splits fields on tabs; use the -W option to split on whitespace instead. If you have csv input data, convert it to TSV first via csvformat -T.
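As a quick illustration (my own, not from the datamash manual): for csv data with no quoted fields, a plain tr swap also works, though csvformat -T is the safe choice once quoting and embedded commas are involved.

```shell
# Quick-and-dirty CSV-to-TSV for data with no quoted or embedded commas;
# csvformat -T handles real CSV quoting correctly, tr does not.
printf 'state,amount\nCA,100\nNJ,75\n' | tr ',' '\t'
```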
From the datamash manual:
# NOTE: the manual's sample data uses tabs and has more rows than this excerpt (hence the extra groups in the output below); this example uses whitespace
cat > scores.txt << EOF
Name Subject Score
Bryan Arts 68
Isaiah Arts 80
Gabriel Health-Medicine 100
Tysza Business 92
Zackery Engineering 54
EOF
datamash -W --sort --headers --group 2 mean 3 sstdev 3 < scores.txt
GroupBy(Subject) mean(Score) sstdev(Score)
Arts 68.9474 10.4215
Business 87.3636 5.18214
Engineering 66.5385 19.8814
Health-Medicine 90.6154 9.22441
Life-Sciences 55.3333 20.606
Social-Sciences 60.2667 17.2273
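If datamash isn’t installed, the grouped mean (though not the tidy header or sstdev) can be sketched in plain awk. This is my own fallback, assuming the whitespace-separated scores.txt above; on the five-row excerpt the numbers will differ from the manual’s output, which uses the full dataset:

```shell
# A rough awk version of the grouped mean only (no sstdev, no header);
# assumes whitespace-separated fields with a one-line header, as above.
awk 'NR > 1 { sum[$2] += $3; n[$2]++ }
     END { for (g in sum) printf "%s\t%g\n", g, sum[g] / n[g] }' scores.txt | sort
```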
Cross tables/pivot tables with datamash¶
cat <<EOF | csvformat -T > data2.tsv
year,state,name,amount
2017,CA,"Foo, Bar",100
2017,CA,"Boo, Baz",200
2016,NJ,"Hoo, Dat",75
2016,CO,"Why, Not",33
EOF
Create a pivot table that has the year column vs. the state column, summing
the amount
column for each cell.
cat data2.tsv | datamash -H crosstab 1,2 sum 4
GroupBy(year) GroupBy(state) sum(amount)
CA CO NJ
2016 N/A 33 75
2017 300 N/A N/A
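For comparison, the same per-cell sums can be sketched in awk, albeit as flat group-by rows rather than datamash’s grid layout. This is my own illustration, assuming the tab-separated data2.tsv above:

```shell
# A flat awk version of the pivot's cell sums (not the grid layout);
# assumes tab-separated input with a one-line header, like data2.tsv.
awk -F'\t' 'NR > 1 { sum[$1 "\t" $2] += $4 }
            END { for (k in sum) print k "\t" sum[k] }' data2.tsv | sort
```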