Grouping Data¶
Find distinct items, removing duplicates¶
cat data | sort -u
cat data | sort | uniq
Find distinct items without sorting¶
The recipe above relies on a sort. However, sometimes you know that your dataset will fit in memory, and you just want to remove any duplicate items without the sort. Here are four approaches, from Stack Overflow:
cat data | awk '!seen[$0]++'
cat data | perl -ne'print unless $seen{$_}++'
These first two work by building a hash that counts each line as it is seen, but the magic lies in the autovivification and post-increment behavior. The 'seen' entry is checked, and only AFTER the check is the value incremented, so only the first occurrence of each line is printed. Both awk and perl treat a missing hash entry as a zero-equivalent value.
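To see the post-increment at work, this illustrative one-liner (not part of the original recipes) prints the running count next to each line; the first occurrence of any line is prefixed with 0, which is exactly when !seen[$0]++ is true:
cat data | awk '{print seen[$0]++, $0}'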
The next approach relies on datamash, which is described in more detail later under datamash:
cat data | datamash rmdup 1
Another tool that claims to be faster:
cat data | huniq
Find duplicate items¶
cat data | sort | uniq -d
Find duplicate items without sorting¶
This is almost the inverse of the perl approach above:
cat data | perl -ne'print if ++$seen{$_} == 2'
In this case, we increment the value BEFORE we check it, so each duplicated line is printed exactly once, on its second appearance.
Find lines that are in one file, but not in another¶
Sometimes I have a list of all items, and then a list of items that I want to remove, and so I need to exclude (subtract) the rejects and work on the rest.
If both files are sorted (or can be sorted), then you can use either the comm utility, or diff.
comm takes two sorted files, and reports the lines that are only in the first, only in the second, or in both.
So, to show items that are in all but not in reject, tell comm to suppress
the lines unique to the second file (reject, -2) and the lines common to both files (-3):
comm -2 -3 all reject
See also: join, which gives more control on a column-by-column basis.
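For instance, if both files are sorted on their first field, a join sketch like this (assuming the comparison should happen on column 1) prints only the lines of all whose first field has no match in reject:
join -v 1 all reject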
You can also grep the output of diff, which is most useful if you want to get a
bit of context around missing lines. Lines that appear only in the first file are
prefixed with '< ', or with '-' in unified (-u) mode. You'll need to do a little
post-processing on the output to strip the diff markers (in unified mode the
'---' header line also starts with a dash), so comm is often an easier choice.
diff -u all reject | egrep '^-'
If it’s not practical to sort the files, then you may need to do a little actual coding to put the lines from one file into a dictionary or set, and remove the lines from the other.
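One possible sketch, staying with awk rather than a separate script, and assuming the same all and reject files as above (neither needs to be sorted): the first pass (NR==FNR) loads every line of reject into an array, and the second pass prints only the lines of all that are not present as keys.
awk 'NR==FNR { reject[$0]; next } !($0 in reject)' reject all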
Split data into files based on a field¶
awk has a really simple way to split data into separate files based on a field:
{print > $1}, which prints each line into a file named after the first column.
Here’s a more concrete example, using this technique in conjunction with a program called average that, not surprisingly, computes the average of its inputs. Each input line holds the date portion of a timestamp extracted from a log file, followed by a request rate.
The input data:
10:50:41 $ head -5 /tmp/a
2017-11-22 17918
2017-11-22 22122
2017-11-22 23859
2017-11-22 24926
2017-11-22 25590
Put each rate into a file named for the day:
10:51:12 $ awk '{ print $2>$1}' /tmp/a
Verify the files:
10:51:30 $ ls
2017-11-22 2017-11-23 2017-11-24 2017-11-25 2017-11-26 2017-11-27
To add a suffix or prefix, use this awk syntax:
10:51:12 $ awk '{ print $2>($1 ".txt" )}' /tmp/a
Finally, compute an average for each file, using a special-purpose program (or your own one-liner):
10:51:39 $ for f in 2017*;do echo -n "$f "; average $f; done;
2017-11-22 28623.5339943343
2017-11-23 32164.1470966969
2017-11-24 41606.0775438271
2017-11-25 44660.3379886831
2017-11-26 43758.5492501466
2017-11-27 43080.1879794521
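If you don’t have an average program handy, a plain awk one-liner computing the arithmetic mean (a rough stand-in, not the same program) works the same way:
for f in 2017*; do echo -n "$f "; awk '{ s += $1; n++ } END { if (n) print s / n }' "$f"; done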
Naturally, there are other ways to do this specific computation,
e.g. cat /tmp/a | datamash -W --group 1 mean 2,
but sometimes it’s useful to split the files for later processing.
See http://www.theunixschool.com/2012/06/awk-10-examples-to-split-file-into.html for some more examples.