Frequency Counts and Distributions¶
Get a frequency count of items, or find common items¶
# pipe into head or tail to get the most or least frequent items
cat data | sort | uniq -c | sort -nr
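If the input is large, the first `sort` can dominate the runtime. A single-pass alternative with awk holds the counts in memory instead, so only the final (much smaller) sort is needed — a sketch, assuming one item per line in a file named `data`:

```shell
# count occurrences in one pass; only the per-item summary gets sorted
awk '{ counts[$0]++ } END { for (k in counts) print counts[k], k }' data | sort -nr
```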
Find the n most common items¶
# find top 7 most common items
cat data | sort | uniq -c | sort -nr | head -7
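As the comment above notes, the least common items come from the same pipeline — either keep the second sort descending and pipe into `tail`, or sort ascending and keep `head`:

```shell
# find the 7 least common items
cat data | sort | uniq -c | sort -n | head -7
```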
Better frequency counts¶
https://github.com/wizzat/distribution
Let's take some random words, repeat each one a random number of times, and then see what the frequency distribution of the most common words looks like:
cat randwords | perl -ne'$a=$_; print $a for 0..int(rand(25))' | distribution.py
Key|Ct (Pct) Histogram
accelerator|25 (6.05%) ------------------------------------------------------
absterge|25 (6.05%) ------------------------------------------------------
Acanthodini|25 (6.05%) ------------------------------------------------------
Abramis|25 (6.05%) ------------------------------------------------------
acclimation|24 (5.81%) ---------------------------------------------------
acatharsy|24 (5.81%) ---------------------------------------------------
accelerative|23 (5.57%) -------------------------------------------------
acanthopterous|23 (5.57%) -------------------------------------------------
Abraham|22 (5.33%) -----------------------------------------------
Abipon|21 (5.08%) ---------------------------------------------
accentuable|18 (4.36%) ---------------------------------------
Acanthocephala|18 (4.36%) ---------------------------------------
abduce|17 (4.12%) -------------------------------------
Abba|17 (4.12%) -------------------------------------
absume|14 (3.39%) ------------------------------
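If you'd rather not install anything, the basic shape of that output — count per key plus a bar scaled to the most common item — can be roughed out with awk alone. This is only a sketch of the idea, not a replacement for distribution.py:

```shell
# bar chart of frequency counts, scaled so the top item gets 40 columns
sort data | uniq -c | sort -nr | awk '
  NR == 1 { max = $1 }                  # first line holds the largest count
  {
    bar = ""
    len = int($1 / max * 40)            # scale this bar relative to the max
    for (i = 0; i < len; i++) bar = bar "-"
    printf "%-15s %5d %s\n", $2, $1, bar
  }'
```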
Histogram of values¶
https://github.com/bitly/data_hacks
pip install data_hacks
Generate 1000 random values between 0 and 300 and plot a histogram:
perl -E'say rand(300) for 1..1000' | histogram.py
# NumSamples = 1000; Min = 0.12; Max = 299.94
# Mean = 148.416700; Variance = 7602.103173; SD = 87.190041; Median 151.018961
# each ∎ represents a count of 1
0.1198 - 30.1021 [ 115]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
30.1021 - 60.0845 [ 94]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
60.0845 - 90.0668 [ 104]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
90.0668 - 120.0491 [ 98]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
120.0491 - 150.0315 [ 84]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
150.0315 - 180.0138 [ 92]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
180.0138 - 209.9962 [ 109]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
209.9962 - 239.9785 [ 118]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
239.9785 - 269.9608 [ 103]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
269.9608 - 299.9432 [ 83]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
Note: I removed some of the bucket indicators to make the lines shorter.
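The bucketing that histogram.py performs can also be approximated in awk with fixed-width buckets — a sketch assuming numeric values on stdin, one per line, in a known range (here 0–300, bucket width 30; both are assumptions, not histogram.py's adaptive defaults):

```shell
# drop each value into a fixed-width bucket, then print the bucket counts
perl -E'say rand(300) for 1..1000' |
awk -v width=30 -v max=300 '
  { buckets[int($1 / width)]++ }        # bucket index = floor(value / width)
  END {
    for (b = 0; b * width < max; b++)
      printf "%6.1f - %6.1f [%4d]\n", b * width, (b + 1) * width, buckets[b]
  }'
```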