# Solutions: Putting it all together

The following examples combine multiple recipes and techniques to solve real-world problems I have encountered. Note that, as always, there are **many** approaches to each of these, and I often try to work through each one quickly and iteratively with small, simple transformations.

## Find available data by user, event class and date

This is a more detailed exploration of the example I showed in the [Introduction](project:introduction.md).

Recently, I got a question about what data was available for a machine learning project. The data was in S3, organized by event code, user and date over a period of several weeks. However, data ingestion had begun at different times for different groups, so my coworker wanted to know the available date ranges for the various event codes.

I was able to quickly list the files using an `aws s3 ls --recursive` command (sketched below) and store the result in a local file for quick summarization. Each line has a file creation date, a size, and then a file path that encodes:

- an event class
- a user id
- a date
- the actual filename for a data segment

```
2024-09-22 139769207 data/class=3002/user=aaa/date=2024-09-21/part-00005
2024-10-04 8235665 data/class=1007/user=aaa/date=2024-10-03/part-00009
2024-10-04 26936 data/class=103006/user=bbb/date=2024-10-03/part-00008
2024-09-13 92385199 data/class=1010/user=aaa/date=2024-09-12/part-00003
2024-10-03 51873127 data/class=1007/user=aaa/date=2024-10-02/part-00001
2024-09-10 52364 data/class=3001/user=aaa/date=2024-09-09/part-00003
2024-09-11 53877463 data/class=1007/user=aaa/date=2024-09-10/part-00001
2024-10-05 36197597 data/class=1007/user=bbb/date=2024-10-04/part-00005
2024-09-14 54249 data/class=103006/user=aaa/date=2024-09-13/part-00000
2024-09-03 146174053 data/class=3002/user=aaa/date=2024-09-02/part-00001
```

I wanted to see the first and last dates available for each event code and user, and get a rough idea of whether there were any big gaps by counting the number of dates between the first and last. Here are the first few rows of output, with the user, event class, first and last dates, and count of files:

```
user=aaa  class=1007  date=2024-09-05  date=2024-10-05  31
user=aaa  class=3001  date=2024-08-25  date=2024-10-05  42
user=aaa  class=3002  date=2024-08-25  date=2024-10-05  42
user=bbb  class=1007  date=2024-09-30  date=2024-10-05  6
user=bbb  class=3001  date=2024-09-25  date=2024-10-05  11
user=bbb  class=3002  date=2024-09-25  date=2024-10-04  10
```

And here is the transformation I used to summarize the data:

```shell
cat /tmp/files | awk '{print $NF}' \
  | grep user= \
  | perl -pe's/part-.*//' \
  | perl -pe's/.*data\///' \
  | tr / '\t' \
  | sort -u \
  | datamash --group 2,1 first 3 last 3 count 3 \
  | column -t \
  | sort
```

I'll work through the small (10 line) file list from above for conciseness and reproducibility. As a result, the final counts will not match the output shown above, which used the entire dataset.
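If you want to reproduce this yourself, the file list came from a command along these lines. This is a sketch rather than the exact command: the bucket name and prefix are hypothetical, and real `aws s3 ls` output also includes a time column that I trimmed from the sample above.

```shell
# Recursively list every object under the (hypothetical) data/ prefix
# and save the listing locally, so we can iterate on it cheaply.
aws s3 ls --recursive s3://my-bucket/data/ > /tmp/files
```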
So, step by step, the process is as follows:

1. Take the filename (the last field) from each line

```
| awk '{print $NF}'
```

produces

```
data/class=3002/user=aaa/date=2024-09-21/part-00005
data/class=1007/user=aaa/date=2024-10-03/part-00009
data/class=103006/user=bbb/date=2024-10-03/part-00008
data/class=1010/user=aaa/date=2024-09-12/part-00003
data/class=1007/user=aaa/date=2024-10-02/part-00001
data/class=3001/user=aaa/date=2024-09-09/part-00003
data/class=1007/user=aaa/date=2024-09-10/part-00001
data/class=1007/user=bbb/date=2024-10-04/part-00005
data/class=103006/user=aaa/date=2024-09-13/part-00000
data/class=3002/user=aaa/date=2024-09-02/part-00001
```

See [Printing the last column, awk and perl](project:extraction.md#printing-the-last-column-awk-and-perl)

2. Keep only the lines that contain the string `user=`

It turned out that there were other files (metadata files and the like) in the s3 output that were not specific to a user, and I wanted to filter those out. Looking for the `user=` string eliminated them. This is an example of "selecting in".

```
| grep user=
```

See [Filter and Select](project:filter-and-select.md)

3. Remove the trailing filename

Each date may have multiple files, but I don't care about individual files, so I trim that part off. In reality, I have a small script I've written called `dirnames` that simply strips off the final path portion of each line, from the last `/`, but I wanted to rely on standard tools for this example, so I remove everything from `part-` onward.

```
| perl -pe's/part-.*//'
```

```
data/class=3002/user=aaa/date=2024-09-21/
data/class=1007/user=aaa/date=2024-10-03/
data/class=103006/user=bbb/date=2024-10-03/
data/class=1010/user=aaa/date=2024-09-12/
data/class=1007/user=aaa/date=2024-10-02/
data/class=3001/user=aaa/date=2024-09-09/
data/class=1007/user=aaa/date=2024-09-10/
data/class=1007/user=bbb/date=2024-10-04/
data/class=103006/user=aaa/date=2024-09-13/
data/class=3002/user=aaa/date=2024-09-02/
```

See [Transformations](project:transformation.md)

4. Remove the `data/` prefix

Remove everything up to and including `data/`. This is a relatively general transformation; I used the regular expression because in the actual data I had a few more path components to remove. This is an example of "selecting out".

```
| perl -pe's/.*data\///'
```

```
class=3002/user=aaa/date=2024-09-21/
class=1007/user=aaa/date=2024-10-03/
class=103006/user=bbb/date=2024-10-03/
class=1010/user=aaa/date=2024-09-12/
class=1007/user=aaa/date=2024-10-02/
class=3001/user=aaa/date=2024-09-09/
class=1007/user=aaa/date=2024-09-10/
class=1007/user=bbb/date=2024-10-04/
class=103006/user=aaa/date=2024-09-13/
class=3002/user=aaa/date=2024-09-02/
```

See [Transformations](project:transformation.md)

5. Split the directory components with tabs

I want the user, class and date as separate "words" so I can do additional processing. Converting the directory separator `/` into a tab suits tools (like `datamash`) that separate fields by whitespace or tabs.

```
| tr / '\t'
```

```
class=3002      user=aaa        date=2024-09-21
class=1007      user=aaa        date=2024-10-03
class=103006    user=bbb        date=2024-10-03
class=1010      user=aaa        date=2024-09-12
class=1007      user=aaa        date=2024-10-02
class=3001      user=aaa        date=2024-09-09
class=1007      user=aaa        date=2024-09-10
class=1007      user=bbb        date=2024-10-04
class=103006    user=aaa        date=2024-09-13
class=3002      user=aaa        date=2024-09-02
```

Naturally, I could have also used perl for this, but `tr` is shorter to type and automatically converts all matches!
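For comparison, a perl spelling of the same step might look like this sketch; note the `g` flag, which is needed to replace every `/` on the line, whereas `tr` does that by default:

```
| perl -pe's{/}{\t}g'
```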
Also note that steps 3, 4 and 5 could all have been done by a single, more complicated match and replacement, but I was working quickly and wanted simple steps I could easily iterate through. See [Create several simple filters rather than one complicated one](project:transformation.md#create-several-simple-filters-rather-than-one-complicated-one)

6. Sort, and keep only one line per day

Because I removed the filename in step 3 above, the full dataset had multiple identical directory lines per day; `sort -u` cleans those up and sorts the data for the next step. (The 10-line sample happens to contain no duplicates, so here the only visible effect is the sorting.)

```
| sort -u
```

```
class=1007      user=aaa        date=2024-09-10
class=1007      user=aaa        date=2024-10-02
class=1007      user=aaa        date=2024-10-03
class=1007      user=bbb        date=2024-10-04
class=1010      user=aaa        date=2024-09-12
class=103006    user=aaa        date=2024-09-13
class=103006    user=bbb        date=2024-10-03
class=3001      user=aaa        date=2024-09-09
class=3002      user=aaa        date=2024-09-02
class=3002      user=aaa        date=2024-09-21
```

7. Group by user and event class, then show the date range and count

[GNU Datamash](https://www.gnu.org/software/datamash/) makes it easy to do quick grouping and aggregation on delimited data, and provides a few useful operations like count, first, last and collapse. Because I'm actually interested in what data is available for each user, I change the grouping order: `--group 2,1` groups by the second column (user) and then the first (class), and the "aggregation" operations first, last and count all work on the third column (the date) within each group.

```
| datamash --group 2,1 first 3 last 3 count 3
```

```
user=aaa        class=1007      date=2024-09-10 date=2024-10-03 3
user=bbb        class=1007      date=2024-10-04 date=2024-10-04 1
user=aaa        class=1010      date=2024-09-12 date=2024-09-12 1
user=aaa        class=103006    date=2024-09-13 date=2024-09-13 1
user=bbb        class=103006    date=2024-10-03 date=2024-10-03 1
user=aaa        class=3001      date=2024-09-09 date=2024-09-09 1
user=aaa        class=3002      date=2024-09-02 date=2024-09-21 2
```

Note that datamash prints items separated by tabs, so the output doesn't line up with my default tab settings. See [datamash](project:specialized-data-tools.md#datamash)

8. Pretty-print the data into aligned columns for easy viewing and copy/paste

```
| column -t
```

```
user=aaa  class=1007    date=2024-09-10  date=2024-10-03  3
user=bbb  class=1007    date=2024-10-04  date=2024-10-04  1
user=aaa  class=1010    date=2024-09-12  date=2024-09-12  1
user=aaa  class=103006  date=2024-09-13  date=2024-09-13  1
user=bbb  class=103006  date=2024-10-03  date=2024-10-03  1
user=aaa  class=3001    date=2024-09-09  date=2024-09-09  1
user=aaa  class=3002    date=2024-09-02  date=2024-09-21  2
```

9. Finally, sort the data so it shows up as desired, with each user's event classes grouped together

```
| sort
```

```
user=aaa  class=1007    date=2024-09-10  date=2024-10-03  3
user=aaa  class=1010    date=2024-09-12  date=2024-09-12  1
user=aaa  class=103006  date=2024-09-13  date=2024-09-13  1
user=aaa  class=3001    date=2024-09-09  date=2024-09-09  1
user=aaa  class=3002    date=2024-09-02  date=2024-09-21  2
user=bbb  class=1007    date=2024-10-04  date=2024-10-04  1
user=bbb  class=103006  date=2024-10-03  date=2024-10-03  1
```

10. Final steps

After summarizing the data, I determined that the data began arriving on different days not only for different users, but even for different event classes within the same user. I did a bit more cleanup (removed the `=`), got a list of all available days for each line using the datamash `collapse` aggregator, and so on; see the sketch below. Then I was ready not only to answer my coworker's question, but also to send a message to the data ingestion team about the missing days.
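A sketch of that `collapse` variation (not necessarily the exact command, and leaving aside the `=` cleanup) just swaps the aggregation operations in the pipeline above:

```shell
cat /tmp/files | awk '{print $NF}' \
  | grep user= \
  | perl -pe's/part-.*//' \
  | perl -pe's/.*data\///' \
  | tr / '\t' \
  | sort -u \
  | datamash --group 2,1 collapse 3   # one comma-separated list of dates per group
```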
At each step in the process, I would typically pipe the output to `head` so I saw only the first ten lines. That was enough for me to check each step before moving on to the next, and allowed me to iterate rapidly through the examples.
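In practice, that looks like growing the pipeline one stage at a time, with `head` keeping each check quick and small:

```shell
cat /tmp/files | awk '{print $NF}' | head
cat /tmp/files | awk '{print $NF}' | grep user= | head
cat /tmp/files | awk '{print $NF}' | grep user= | perl -pe's/part-.*//' | head
# ...and so on, adding one transformation per iteration.
```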