# Grouping Data

## Find distinct items, removing duplicates

    cat data | sort -u
    cat data | sort | uniq

## Find distinct items without sorting

The recipe above relies on a sort. However, sometimes you know that your dataset will fit in memory, and you just want to remove duplicate items without sorting. Here are four approaches, from [SO](https://stackoverflow.com/questions/1444406/how-to-delete-duplicate-lines-in-a-file-without-sorting-it-in-unix):

    cat data | awk '!seen[$0]++'
    cat data | perl -ne'print unless $seen{$_}++'

These first two work by building a hash of lines already seen, incrementing a count as each line arrives, but the magic lies in the autovivification and post-increment behavior: the `seen` entry is tested, and only AFTER the test is the value incremented. Both awk and Perl treat a missing hash entry as a zero-equivalent (false) value, so each line is printed only the first time it appears.

The next approach relies on [datamash](https://www.gnu.org/software/datamash/), which is described later under [datamash](project:specialized-data-tools.md#datamash):

    cat data | datamash rmdup 1

Another tool that [claims to be faster](https://github.com/koraa/huniq):

    cat data | huniq

## Find duplicate items

    cat data | sort | uniq -d

## Find duplicate items without sorting

This is almost the inverse of the perl approach above:

    cat data | perl -ne'print if ++$seen{$_} == 2'

In this case, we increment the value BEFORE we check it, so each duplicated line is printed exactly once, on its second appearance.

## Find lines that are in one file, but not in another

Sometimes I have a list of all items, and then a list of items that I want to remove, and so I need to exclude (subtract) the rejects and work on the rest.

If both files are sorted (or can be sorted), then you can use either the `comm` utility or `diff`.

`comm` takes two **sorted** files and reports lines that are in a, b, or both. So, to show items that are in `all` but not in `reject`, tell `comm` to suppress the lines unique to the second file (`reject`, `-2`) and the lines common to both files (`-3`):

    comm -2 -3 all reject

See also: `join`, which gives more control on a column-by-column basis.

You can also grep the output of `diff`, which is most useful if you want to get a bit of context around the missing lines. Lines removed from the first file are prefixed with '< ', or '-' in unified (`-u`) mode. You'll need to do a little post-processing on the output to remove the diff characters (and the `---` header line that `-u` adds), so `comm` is often an easier choice.

    diff -u all reject | egrep '^-'

If it's not practical to sort the files, then you may need to do a little actual coding to put the lines from one file into a dictionary or set, and remove the lines from the other.
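In a shell context, awk can do that set-subtraction in a single command. A minimal sketch of the idea, assuming the same `all` and `reject` file names as above:

    # While reading the first file (reject), NR==FNR is true: store each line in a set.
    # While reading the second file (all), print only the lines that are not in the set.
    awk 'NR==FNR { skip[$0] = 1; next } !($0 in skip)' reject all

Neither file needs to be sorted, at the cost of holding all of `reject` in memory.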
## Split data into files based on a field

awk has a really simple way to split data into separate files based on a field: `{print > $1}`, which prints each line into a file named after the first column. Here's a more concrete example, using this technique in conjunction with a program called `average` that, not surprisingly, computes the average of its inputs.

The input is a request rate, and the date portion of a timestamp extracted from a log file:

    10:50:41 $ head -5 /tmp/a
    2017-11-22 17918
    2017-11-22 22122
    2017-11-22 23859
    2017-11-22 24926
    2017-11-22 25590

Put each rate into a file named for the day:

    10:51:12 $ awk '{ print $2>$1 }' /tmp/a

Verify the files:

    10:51:30 $ ls
    2017-11-22 2017-11-23 2017-11-24 2017-11-25 2017-11-26 2017-11-27

To add a suffix or prefix, use this awk syntax:

    10:51:12 $ awk '{ print $2>($1 ".txt") }' /tmp/a

Finally, compute an average with a special-purpose program (or your own one-liner; a sketch follows at the end of this section):

    10:51:39 $ for f in 2017*; do echo -n "$f "; average $f; done
    2017-11-22 28623.5339943343
    2017-11-23 32164.1470966969
    2017-11-24 41606.0775438271
    2017-11-25 44660.3379886831
    2017-11-26 43758.5492501466
    2017-11-27 43080.1879794521

Naturally, there are other ways to do this specific computation, e.g. `cat /tmp/a | datamash -W --group 1 mean 2`, but sometimes it's useful to split the files for later processing.

See http://www.theunixschool.com/2012/06/awk-10-examples-to-split-file-into.html for some more examples.
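As a sketch of the "your own one-liner" option: since each per-day file produced above contains just one rate per line, a plain awk loop can stand in for the `average` program (the `2017*` glob is assumed to match the files created by the split step):

    # Average the single column in each per-day file; FILENAME is the current input file.
    for f in 2017*; do
      awk '{ sum += $1 } END { if (NR) printf "%s %.2f\n", FILENAME, sum / NR }' "$f"
    done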