Grouping Data¶
Find distinct items, removing duplicates¶
cat data | sort -u
cat data | sort | uniq
Find distinct items without sorting¶
The recipe above relies on a sort. However, sometimes you know that your dataset will fit in memory, and you just want to remove duplicate items without sorting. Here are four approaches, from Stack Overflow:
cat data | awk '!seen[$0]++'
cat data | perl -ne'print unless $seen{$_}++'
These first two work by building a hash map whose counter is incremented as each entry is seen, but the magic lies in the autovivification and post-increment behavior. The ‘seen’ hash is checked, and only AFTER the check is the value incremented. Both awk and Perl initialize a missing hash entry to a zero-equivalent value, so the test is true only the first time a line appears.
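To see why that works, here’s a tiny made-up input: the first time a line appears its counter is zero (so the negated test is true and the line prints), and the post-increment then bumps it, so every repeat fails the test:
$ printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
a
b
c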
The next approach relies on datamash, which is described in more detail in the datamash section later on:
cat data | datamash rmdup 1
Another tool that claims to be faster:
cat data | huniq
Find duplicate items¶
cat data | sort | uniq -d
Find duplicate items without sorting¶
This is almost the inverse of the perl approach above:
cat data | perl -ne'print if ++$seen{$_} == 2'
In this case, we increment the value BEFORE we check it.
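The same thing can be written in awk (a sketch, mirroring the dedup one-liner above): a bare pattern prints matching lines by default, so pre-increment and compare:
cat data | awk '++seen[$0] == 2'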
Find lines that are in one file, but not in another¶
Sometimes I have a list of all items, and then a list of items that I want to remove, and so I need to exclude (subtract) the rejects and work on the rest.
If both files are sorted (or can be sorted), then you can use either the comm utility or diff.
comm takes two sorted files and reports lines that are only in the first, only in the second, or in both.
So, to show items that are in all, but not in reject, tell comm to suppress the lines that appear only in the second file (reject, -2) and the lines that appear in both files (-3):
comm -2 -3 all reject
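comm requires both inputs to be sorted. If yours aren’t, one common approach (a sketch, assuming bash or another shell with process substitution) is to sort them on the fly:
comm -2 -3 <(sort all) <(sort reject)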
See also: join, which gives more control on a column-by-column basis.
You can also grep the output of diff, which is most useful if you want to get a bit of context around missing lines. Lines removed from the first file are prefixed with ‘< ’, or ‘-’ in unified (-u) mode. You’ll need to do a little post-processing on the output to remove the diff prefixes, so comm is often an easier choice.
diff -u all keep | egrep '^-'
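As a sketch of that post-processing: in default (non-unified) mode the prefix is ‘< ’, which sed can strip:
diff all keep | grep '^<' | sed 's/^< //'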
If it’s not practical to sort the files, then you may need to do a little actual coding to put the lines from one file into a dictionary or set, and remove the lines from the other.
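With awk, for example, a minimal sketch (file names here are placeholders) reads the reject list into an array on the first pass, then prints only lines of the second file that aren’t in it:
awk 'NR==FNR { skip[$0]; next } !($0 in skip)' reject all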
Split data into files based on a field¶
awk has a really simple way to split data into separate files based on a field:
{print > $1}
This prints each line into a file named after the value in the first column.
Here’s a more concrete example, using this technique in conjunction with a program called average that, not surprisingly, computes the average of its inputs. The input is a request rate, and the date portion of a timestamp extracted from a log file.
The input data:
10:50:41 $ head -5 /tmp/a
2017-11-22 17918
2017-11-22 22122
2017-11-22 23859
2017-11-22 24926
2017-11-22 25590
Put each rate into a file named for the day:
10:51:12 $ awk '{ print $2>$1}' /tmp/a
Verify the files:
10:51:30 $ ls
2017-11-22 2017-11-23 2017-11-24 2017-11-25 2017-11-26 2017-11-27
To add a suffix or prefix, use this awk syntax:
10:51:12 $ awk '{ print $2>($1 ".txt" )}' /tmp/a
Finally, compute an average using a special-purpose program (or your own one-liner):
10:51:39 $ for f in 2017*;do echo -n "$f "; average $f; done;
2017-11-22 28623.5339943343
2017-11-23 32164.1470966969
2017-11-24 41606.0775438271
2017-11-25 44660.3379886831
2017-11-26 43758.5492501466
2017-11-27 43080.1879794521
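If you don’t have an average program handy, a plain awk one-liner (a rough sketch) gives the same per-file result:
for f in 2017*; do echo -n "$f "; awk '{ sum += $1 } END { if (NR) print sum/NR }' "$f"; done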
Naturally, there are other ways to do this specific computation, e.g. cat /tmp/a | datamash -W --group 1 mean 2, but sometimes it’s useful to split the files for later processing.
See http://www.theunixschool.com/2012/06/awk-10-examples-to-split-file-into.html for some more examples.