Specialized Tools for Aggregation, Summary, Analysis and Reporting

I use a wide variety of tools for aggregation ranging from relatively simple to complex and fully featured. Once I have filtered and extracted the data I want, I have a set of core tools I use for most summarization. Some are small, single-file scripts that I just copy to a new machine if needed, while others are full-blown packages.

Note that I won’t really talk about actual databases, since the data import/export process usually ends up being a significant factor (except for csv data via csvquery).

The ones that I tend to use most are:

stats

stats is a companion script to [scut](https://github.com/hjmangalam/scut) that prints descriptive statistics for a single column of numeric input.

https://github.com/hjmangalam/scut/blob/master/stats

seq 10 | stats
 Sum       55
 Number    10
 Mean      5.5
 Median    5.5
 Mode      FLAT  
 NModes    No # was represented more than once
 Min       1
 Max       10
 Range     9
 Variance  9.16666666666667
 Std_Dev   3.02765035409749
 SEM       0.957427107756338
 95% Conf  3.62344286879758 to 7.37655713120242
           (for a normal distribution - see skew)
 Skew      0
           (skew is 0 for a symmetric dist)
 Std_Skew  0
 Kurtosis  -1.40181818181818
            (K=3 for a normal dist)
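As a cross-check, the Mean and Std_Dev lines above are the ordinary sample mean and sample (n-1) standard deviation, which can be reproduced with a short awk one-liner (my sketch here, not part of the scut package):

```shell
# Sample mean and sample (n-1) standard deviation of a numeric column.
# A plain-awk sketch for sanity-checking stats' output.
seq 10 | awk '{ s += $1; ss += $1 * $1; n++ }
              END { m = s / n; printf "%.1f %.4f\n", m, sqrt((ss - n * m * m) / (n - 1)) }'
```

This prints 5.5 3.0277, matching the Mean and Std_Dev lines from stats.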

csvstat

csvstat is another part of the great csvkit toolbox. It prints descriptive stats from csv files. You can also process a single headerless numeric column by using the -H flag, although it’s not particularly fast. Here’s an example of using it to generate some stats on a single column; it will do similar goodness for a csv file with multiple columns. Naturally, for non-numeric columns, there will not be as many statistics reported.

jot -r 1500 1 999 | csvstat -H
 /usr/local/lib/python2.7/site-packages/agate/table/from_csv.py:88: RuntimeWarning: Column names not specified. "('a',)" will be used as names.
   1. "a"
 
  Type of data:          Number
  Contains null values:  False
  Unique values:         764
  Smallest value:        1
  Largest value:         999
  Sum:                   746,595
  Mean:                  497.73
  Median:                494.5
  StDev:                 288.637
  Most common values:    370 (7x)
                         994 (7x)
                         770 (6x)
                         316 (5x)
                         488 (5x)
 
 Row count: 1500

Here’s an example of getting stats on a simple 2-column csv file. Notice that the earlier warning goes away because we added a header row.

(seq 10; jot -r -c 10 97 120 ) | column -c 20 | csvformat -t | ( echo "num,letter"; cat;) | tee sample.csv
 num,letter
  1,g
  2,k
  3,a
  4,p
  5,m
  6,q
  7,l
  8,w
  9,r
  10,d

Now, generate stats on all of the columns:

csvstat sample.csv
    1. "num"
  
   Type of data:          Number
   Contains null values:  False
   Unique values:         10
   Smallest value:        1
   Largest value:         10
   Sum:                   55
   Mean:                  5.5
   Median:                5.5
   StDev:                 3.028
   Most common values:    1 (1x)
                          2 (1x)
                          3 (1x)
                          4 (1x)
                          5 (1x)
  
    2. "letter"
  
   Type of data:          Text
   Contains null values:  False
   Unique values:         10
   Longest value:         1 characters
   Most common values:    a (1x)
                          c (1x)
                          f (1x)
                          h (1x)
                          k (1x)
 Row count: 10 
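If csvkit isn’t handy, the “Most common values” part of that report can be approximated for a single column with standard tools (a rough sketch; csvstat does proper csv parsing and type inference, this does not):

```shell
# Top 5 most common values in column 2 of a csv file, skipping the header row.
# Naive field splitting: this breaks on quoted commas, unlike csvstat.
cut -d, -f2 sample.csv | tail -n +2 | sort | uniq -c | sort -rn | head -5
```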

datamash

datamash does computation and statistics directly on the command line.

Quick grouped stats with datamash

For very simple summary stats, I often turn to the stats command or csvstat. However, if I want to do a bit more, like aggregate by one or more columns, GNU datamash is very useful. It handles large streamed data sets very quickly, and has a variety of statistical functions available. By default it breaks on tabs; use the -W option to break on whitespace instead. If you have csv input data, convert it to TSV first via csvformat -T.
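For simple data with no quoted or embedded commas, plain tr can stand in for the csv-to-TSV conversion (csvformat -T is the safe choice for real csv, since it understands quoting):

```shell
# Naive csv-to-tab conversion; only safe when no field contains a comma.
printf 'num,letter\n1,a\n2,b\n' | tr ',' '\t'
```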

From the datamash manual:

# NOTE: the manual's sample file uses tabs, while this excerpt uses spaces (hence -W);
# the output below is from the manual's full scores.txt, so this five-row excerpt
# won't reproduce those exact numbers
cat > scores.txt << EOF
Name        Subject          Score
Bryan       Arts             68
Isaiah      Arts             80
Gabriel     Health-Medicine  100
Tysza       Business         92
Zackery     Engineering      54
EOF

datamash -W --sort --headers --group 2 mean 3 sstdev 3 < scores.txt

GroupBy(Subject)   mean(Score)   sstdev(Score)
Arts               68.9474       10.4215
Business           87.3636       5.18214
Engineering        66.5385       19.8814
Health-Medicine    90.6154       9.22441
Life-Sciences      55.3333       20.606
Social-Sciences    60.2667       17.2273
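If datamash isn’t installed, the grouped mean (though not its full function set) can be emulated with awk; here’s a fallback sketch run against the five-row excerpt above:

```shell
# Mean of column 3 grouped by column 2; whitespace-separated, header skipped.
# An awk fallback sketch; datamash is faster and offers many more statistics.
awk 'NR > 1 { sum[$2] += $3; n[$2]++ }
     END { for (g in sum) printf "%s\t%.4f\n", g, sum[g] / n[g] }' scores.txt | sort
```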

Cross tables/pivot tables with datamash

cat <<EOF | csvformat -T > data2.tsv
year,state,name,amount
2017,CA,"Foo, Bar",100
2017,CA,"Boo, Baz",200
2016,NJ,"Hoo, Dat",75
2016,CO,"Why, Not",33
EOF

Create a pivot table that has the year column vs. the state column, summing the amount column for each cell.

cat data2.tsv | datamash -H crosstab 1,2 sum 4
 GroupBy(year)   GroupBy(state)  sum(amount)
         CA  CO  NJ
 2016    N/A 33  75
 2017    300 N/A N/A