# Specialized Tools for Aggregation, Summary, Analysis and Reporting

I use a wide variety of tools for aggregation ranging from relatively simple to
complex and fully featured. Once I have filtered and extracted the data I want,
I have a set of core tools I use for most summarization. Some are small,
single-file scripts that I just copy to a new machine if needed, while others
are full-blown packages.

Note that I won't really talk about actual databases, as the data import/export process usually ends up
being a significant factor (except for csv data via `csvquery`)

The ones that I tend to use most are:

## stats

`stats` is a compantion script to (scut)[https://github.com/hjmangalam/scut]
that prints descriptive statistics for a single column of numeric inputs. 

https://github.com/hjmangalam/scut/blob/master/stats

    seq 10 | stats
     Sum       55
     Number    10
     Mean      5.5
     Median    5.5
     Mode      FLAT  
     NModes    No # was represented more than once
     Min       1
     Max       10
     Range     9
     Variance  9.16666666666667
     Std_Dev   3.02765035409749
     SEM       0.957427107756338
     95% Conf  3.62344286879758 to 7.37655713120242
               (for a normal distribution - see skew)
     Skew      0
               (skew is 0 for a symmetric dist)
     Std_Skew  0
     Kurtosis  -1.40181818181818
                (K=3 for a normal dist)

## csvstat

`csvstat` is another part of the great `csvkit` toolbox. It prints descriptive
stats from csv files. You can also process a single numeric column without a
header by using the -H flag, although it's not particularly fast. Here's an example of using it to generate some stats on a single column, although it will do similar goodness for a csv file with multiple columns. Naturally, for non-numeric columns, there will not be as many 

    jot -r 1500 1 999 | csvstat -H
     /usr/local/lib/python2.7/site-packages/agate/table/from_csv.py:88: RuntimeWarning: Column names not specified. "('a',)" will be used as names.
       1. "a"
     
      Type of data:          Number
      Contains null values:  False
      Unique values:         764
      Smallest value:        1
      Largest value:         999
      Sum:                   746,595
      Mean:                  497.73
      Median:                494.5
      StDev:                 288.637
      Most common values:    370 (7x)
                             994 (7x)
                             770 (6x)
                             316 (5x)
                             488 (5x)
     
     Row count: 1500

Here's an example of getting stats on a simple 2-column csv file. Notice that
the error message has gone away because we added the headers.

    (seq 10; jot -r -c 10 97 120 ) | column -c 20 | csvformat -t | ( echo "num,letter"; cat;) | tee sample.csv
     num,letter
      1,g
      2,k
      3,a
      4,p
      5,m
      6,q
      7,l
      8,w
      9,r
      10,d

Now, generate stats on all of the columns:

    csvstat sample.csv
        1. "num"
      
       Type of data:          Number
       Contains null values:  False
       Unique values:         10
       Smallest value:        1
       Largest value:         10
       Sum:                   55
       Mean:                  5.5
       Median:                5.5
       StDev:                 3.028
       Most common values:    1 (1x)
                              2 (1x)
                              3 (1x)
                              4 (1x)
                              5 (1x)
      
        2. "letter"
      
       Type of data:          Text
       Contains null values:  False
       Unique values:         10
       Longest value:         1 characters
       Most common values:    a (1x)
                              c (1x)
                              f (1x)
                              h (1x)
                              k (1x)
     Row count: 10 

## datamash

do computation and stats on the command line

### quick grouped stats with datamash

For very simple summary stats, I often turn to the `stats` command or
`csvstat`. However, if I want to do a bit more, like aggregate by one or more
columns, [gnu datamash](https://www.gnu.org/software/datamash/manual/datamash.html)
is very useful. It handles large streamed data sets very quickly, and has a
variety of statistical functions available. By default it breaks on tabs. Use
the -W option to break on whitespace. Convert CSV data to TSV transforming data
via `csvformat -T` if you have csv input data.

From the [datamash manual](https://www.gnu.org/software/datamash/manual/datamash.html):

```
# NOTE: apparently the sample data uses tabs, while this example uses whitespace
cat > scores.txt << EOF
Name        Subject          Score
Bryan       Arts             68
Isaiah      Arts             80
Gabriel     Health-Medicine  100
Tysza       Business         92
Zackery     Engineering      54
EOF
```
datamash -W --sort --headers --group 2 mean 3 sstdev 3 < scores.txt

```
GroupBy(Subject)   mean(Score)   sstdev(Score)
Arts               68.9474       10.4215
Business           87.3636       5.18214
Engineering        66.5385       19.8814
Health-Medicine    90.6154       9.22441
Life-Sciences      55.3333       20.606
Social-Sciences    60.2667       17.2273
```

### Cross tables/pivot tables with datamash

    cat <<EOF | csvformat -T > data2.tsv
    year,state,name,amount
    2017,CA,"Foo, Bar",100
    2017,CA,"Boo, Baz",200
    2016,NJ,"Hoo, Dat",75
    2016,CO,"Why, Not",33
    EOF

Create a pivot table that has the year column vs. the state column, summing
the `amount` column for each cell.

    cat data2.tsv | datamash -H crosstab 1,2 sum 4
     GroupBy(year)   GroupBy(state)  sum(amount)
             CA  CO  NJ
     2016    N/A 33  75
     2017    300 N/A N/A