Transformation¶
perl is my tool of choice for many line-oriented transformations. It’s worth learning a few tricks, and investing some time with one or more of perl, sed or awk.
General transformation with perl -pE and -nE¶
The perl -p option turns on filter mode. Any changes made by the expression argument (-e or -E) will be applied, and then each line will be printed. Using regular expressions is a good way to remove parts of the line, or add to it.
If you don’t know anything about regular expressions, this will all seem very mysterious, but if you do much text (or log) munging, it’s worthwhile to learn the basics.
Here’s an example of removing the date prefix of a timestamp from a log file, which I have done when I want to compare activity at various times across several days. My expression is a substitution: s/XXX/YYY/ and I’m replacing everything up to the first T with the empty string:
printf "2017-11-01T12:14:22.12352 ERROR critical" \
| perl -pe's/^.*?T//'
12:14:22.12352 ERROR critical
You can also extract portions of the line by matching against the entire line. Here’s a moderately complicated regular expression that extracts the hour:minute pair, and the log level (ERROR or FATAL). This might be the first step in analyzing errors per minute.
printf "2017-11-01T12:14:22.12352 ERROR critical" \
| perl -pe's/^.*?T(\d\d:\d\d):\S+ (ERROR|FATAL) .*$/$1 $2/'
12:14 ERROR
I tend to prefer to only take the parts I want, rather than replacing the
entire line. perl’s -n
flag loops over all the input, but doesn’t print
anything. The -E flag is an updated version of the -e flag that just makes some
of the more modern perl features available. I use -E
mostly so that I can use
say $var
instead of print "$var\n"
, because say is shorter and automatically
adds a trailing newline.
printf "2017-11-01T12:14:22.12352 ERROR critical" \
| perl -nE'/T(\d\d:\d\d):\S+ (ERROR|FATAL)/ and say "$1 $2"'
12:14 ERROR
Create several simple filters rather than one complicated one¶
Like any other part of your pipeline, it’s fine to clean up your output progressively with multiple smaller, simpler filters. I often do this because it’s easier to apply fixes than to get one large regex just right. Naturally, if you’re building a high-volume or production pipeline, it’s probably worthwhile to take the time to get it right in fewer steps.
Here’s the filter from the previous recipe broken down into several steps.
Note that in this example I’m using tee /dev/stderr
to give some diagnostic
output at each stage in the pipeline so you can see how the line is
progressively refined. You would only want to do that for debugging or
development. The output appears at the end, and I’ve added some blank lines \
just to visually separate the steps.
printf "2017-11-01T12:14:22.12352 ERROR critical" \
| cut -d ' ' -f 1,2 \
| tee /dev/stderr \
\
| perl -pe's/^.*?T//' \
| tee /dev/stderr \
\
| perl -pe's/\.\S+//' \
| tee /dev/stderr \
\
| perl -pe's/:\d\d / /'
2017-11-01T12:14:22.12352 ERROR
12:14:22.12352 ERROR
12:14:22 ERROR
12:14 ERROR
Another advantage of several simple filters is that you don’t have to spend time looking up the particular syntax for a more complicated regular expression.
Recently, I’ve been dealing with billions of records in blocks of 10 million or so. In the logfiles for these tools, I use numbers with comma separation so it’s a little easier to quickly see the exact magnitude of the numbers. However, the comma format isn’t as easy for doing math.
Here’s a partial log line. I typically use key=value format in my log as well, as it is both clear and easy to parse.
2017-11-20T15:33.16 DEBUG component.func line=9,241,821 per_sec=22,142
I wanted to get the average of these per_second values, and so I wrote a little filter to extract the number:
head log -1 | perl -nE'/per_sec=(\S+)/ and say $1'
22,142
However that gave me output like “22,124”, which wasn’t yet ready for averaging. So I spent a minute or two fiddling with the filter and ended up with the following:
cat log | perl -nE'/per_sec=(\S+)/ and do { ( $a =$1 ) =~ s/,//g; say $a}'
Not too bad, and with a bit more golfing I could have gotten it down to something shorter. However, splitting this procedure up into two separate transformations would have made it much easier:
head -1 log | perl -nE'/per_sec=(\S+)/ and say $1' | tr -d ','
22142
The advantage of this approach is that I don’t have to remember some special
perl syntax. I spent too much time messing around, trying to get it right when
it would have been simpler to refine it in two minimal filters that I could
write correctly the first time (or with 5 seconds looking at the tr manpage
after tr ',' ""
didn’t work.
To conclude this recipe, to compute the average, I just piped the resulting
values into the stats
script, referenced below. Again, I could have written
some more perl to aggregate values and then print the results at eof or in an
END block, but I’d have spent a bit more time fiddling (or googling), and I
really just had a simple 60 second question to see if the average was above
or below the last few values I saw in the logfile.
collapse or replace spaces and newlines¶
perl has a few character classes in regular expressions that are worthwhile:
\s is for general whitespace (spaces, newlines and tabs)
\h is for horizontal whitespace (spaces and tabs, NOT newlines)
\v is for vertical whitespace (newlines)
\t is for tabs
So, to collapse multiple spaces or tabs into a single character (“:” in this case):
printf "a b c\nfoo bar baz\n" | column -t | tee data
a b c
foo bar baz
To collapse the (horizontal) spaces and tabs:
perl -pE's/\h+/:/g' data
a:b:c
foo:bar:baz
To collapse the vertical newlines:
perl -pE's/\v+/:/g' data # or \n
a b c:foo bar baz:
To collapse both spaces and newlines:
perl -pE's/\s+/:/g' data
a:b:c:foo:bar:baz:
convert spaces to newline with tr or perl¶
tr is a simple solution here, as long as you only want one to one replacement:
echo "1 2 3" | tr " " '\n'
1
2
3
Collapse multiple spaces with the squeeze (-s) option:
echo "1 2 3" | tr -s " " '\n'
1
2
3
As described in the perl section above, you can use perl to replace spaces with newlines:
echo "1 2 3" | perl -pe's/ +/\n/g'
1
2
3
Or via autosplit + join:
echo "1 2 3" | perl -anE'say join "\n", @F'
remove newlines with perl¶
Replace newlines, or vertical whitespace (a bit more cross-platform):
seq 10 | perl -pe's/\v/ /g'
1 2 3 4 5 6 7 8 9 10
Keep the final newline:
seq 10 | perl -pe's/\v/ / unless eof'
1 2 3 4 5 6 7 8 9 10
reshape text with rs¶
A little tool I discovered recently is rs
:
http://manpages.ubuntu.com/manpages/xenial/man1/rs.1.html
https://github.com/chneukirchen/rs
appears built-in on mac
seq 12 | rs 3 4 1 2 3 4 5 6 7 8 9 10 11 12
or to transpose row and column order with -t
:
seq 11 | rs -t 3 4
1 4 7 10
2 5 8 11
3 6 9
Or to collapse. Use 0 to automatically determine the other value:
seq 6 | rs 1 0
1 2 3 4 5 6
And just to show it is smart about collapsing arrays:
seq 6 | rs 2 3 | rs 1 0
1 2 3 4 5 6
seq 6 | rs 2 3 | rs 0 5
1 2 3 4 5
6
I’m just playing with rs
a bit, there are a lot more options, and it doesn’t
appear to be widely available. However, it’s nice if you want to specify the
columns and rows.
merge sort multiple files of sorted data¶
Sometimes I have data that has already been sorted by another process. GNU sort
is very powerful and has a variety of features like parallel sorting and large
file sorting (more on sort below). It also provides merge sorting of pre-sorted
files via the --merge
switch.
sort --merge sorted1 sorted2 sorted3
paste: add files side by side¶
seq 1 5 > a
seq -w 0 .5 2 > b
seq 6 10 > c
paste a b c
1 0.0 6
2 0.5 7
3 1.0 8
4 1.5 9
5 2.0 10
Note that you can also use paste to transform a single stream into multiple
columns by including the desired number of stdin reads, paste - -
or
paste - - - - -
:
seq 10 | paste - - -
1 2 3
4 5 6
7 8 9
10
join: intersect two files¶
join
is use to match rows or items in one file with another.
It can be used, much like a database, to join rows that match other rows based on a field from each file. It requires each file to be sorted on the join field, however.
Other than basic lookup, I also use join to fill in missing values in a sequence. For example, I have a file with per-minute error counts, and I want to see both the minutes with errors, and the minutes without errors (which are not present in the input). The errors data could be the output of a frequency count pipeline, described below.
cat > errors <<EOM
12 12:31
12 12:34
19 12:32
23 12:36
99 12:37
EOM
And we’ll generate all the minutes in our range. See the generation section or the section on gnu parallel for some additional ideas, but here’s an example of using the seq command with a template. tee prints the output both to a file, and to the screen.
seq -f "12:%02.0f" 31 37 | tee minutes
12:31
12:32
12:33
12:34
12:35
12:36
12:37
join
requires that the both input files are pre-sorted by the join key, and so
we will have to re-sort the errors table before joining. I’ll sort inplace, by
telling sort to output to a file (the input file) when complete. Note that unlike
a normal output redirect, sort is smart enough to only create/rename the output
file when the sort operation is complete. However, a temp file would work just
as well. I’m sorting on field #2, and sort uses whitespace as column delimiters
by default. We’ll skip the items table, since it’s already sorted.
sort -o errors -k 2 errors
Now we’re ready to do the join. We need to specify what fields we want to join
on from each file. Since we want the first (and only) field from minutes, we can
omit it. -a 2
tells join to show missing matches from the minutes file,
and -1 2
tells join to use the join key from file 1, field 2.
join -a 2 -1 2 errors minutes
12:31 12
12:32 19
12:33
12:34 12
12:35
12:36 23
12:37 99
With an additional filter, we could add a default value of zero, but it is now clear in context which values are missing.
See also: comm
Concatenate files, skipping header line¶
Often I want to combine multiple files that already have headers, most commonly with CSV data. However, sometime I have data with a comment block at the top.
the csvstack
command (pip install csvkit
) is ideal if the data is csv.
csvstack f1.csv f2.csv f3.csv
Here’s a simple script to join files with headers:
#!/usr/bin/env python
"""join files with a header line"""
from __future__ import print_function
import fileinput
for line in fileinput.input():
if not fileinput.isfirstline() or fileinput.lineno() == 1:
print(line, end="")
Remove the first n lines of a file with tail¶
Tail is typically used to display the last n lines of a file, e.g. get the final (largest) 5 values with sort data | tail -5
However, it can also skip lines if you provide a positive offset, e.g.
tail +10
or tail -n +10
The catch is that the number you provide is where
it will start printing, one less than the number of lines will be skipped.
# starts on line 3
seq 5 | tail +3
3
4
5
Note: If you actually want the top 5 values from a dataset, it’s more common to
reverse the sort and take the first values, e.g. sort -nr data | head -5
which should be just as fast to sort, and avoids reading through the entire
file just to get the last few values, which is what tail must do when reading
from a pipe.
Sort a file with a header¶
Sometimes you have a file that has a multiline header, and you’d like to sort the data but keep the header. One nice technique is to print the header to stderr, and then process the rest of the file before displaying it. This is also a nice way to provide users with a variety of options to sort the output, assuming it is in easily sortable form.
It’s pretty easy to do this with head and tail, although you have to remember that the offsets are off by one:
cat > data <EOM
# Here's some data
# with a header
20
5
1
15
EOM
Now, I’d like to sort the data in descending order, but the header gets sorted as well.
sort -nr data
20
13
5
1
# with a header
# Here's some data
Instead, if we put the header on stderr, and then sort the rest, we’ll get what we want printed to the terminal.
NOTE This will NOT put the header into the output pipeline. See the bottom of this recipe for an alternative.
head -2 data > /dev/stderr; tail +3 data | sort -nr
# Here's some data
# with a header
20
13
5
1
Now, with this small behead script, we can sort only the data portion of a file:
#!/usr/bin/env perl
use Getopt::Std;
my $opt_n = 1;
getopts('n:');
while(<>) {
if ($. <= $opt_n ) {
print STDERR;
} else {
print
}
}
behead -2 data | sort -nr
# Here's some data
# with a header
20
15
5
1
Another interesting use of behead is to quickly see both the first and last value of some input, like a range query.
IMPORTANT NOTE: behead prints output to stderr, so this isn’t suitable for piping to another command, but can be useful just to see in the terminal.
seq -w 1 10 | behead | tail -1
01
10
If you want to capture both the header and the sorted body, use the command-block/subshell technique:
( head -2 data; tail +3 data | sort -nr ) > data.sorted
put data into a specific number of columns with pr¶
The pr command is used to format text files for printing, and it has a large
set of options. It can also be used to do some useful things for display.
Unlike rs
, it is standard on nearly every linux system.
Note that the -t option is required to skip the unwanted page header that is intended for print output.
Three columns. Fill down columns first.
seq 10 | pr -t -3
1 5 9
2 6 10
3 7
4 8
Three columns. Fill across rows first.
seq 10 | pr -t -3 -a
1 2 3
4 5 6
7 8 9
10
making data tables with column¶
Sometime you have unevenly spaced fields (or words), and you’d like to turn it into nice white-space separated columns. The column command has a table mode for just this. I often use this to either pretty print a command’s fields, or for pretty printing parts of a log file (note that it’s not good for the entire line, as it works best when there are a limited (and constant) number of columns.
cat > mytxt <<EOF
the quick brown fox
jumped over the lazy
dogs and it was
so very, very funny
EOF
Use the table mode -t
to turn it into variable width columns, each sized
according to the largest item.
column -t mytxt
the quick brown fox
jumped over the lazy
dogs and it was
so very, very funny
Use column to create a flexible number of columns to fill the width.¶
pr is useful when you know how many columns you want to create. column
also
has a mode to create columns, but unlike pr, you set the width you want, and
column will create an appropriate number of columns to fill it.
Fill down columns first.
seq 30 | column -c 40
1 7 13 19 25
2 8 14 20 26
3 9 15 21 27
4 10 16 22 28
5 11 17 23 29
6 12 18 24 30
Same as above, but fill across first.
seq 30 | column -c 40 -x
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
21 22 23 24 25
26 27 28 29 30
joining all lines with xargs or paste¶
If you just want all the items on the same line, xargs
is quick and dirty,
joining with spaces. xargs echo
can also be used.
seq 20 | xargs
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
To have a specific number of columns, still space separated, have xargs break
it up for you. here’s we’re choosing 5 at a time, and notice that the alignment
isn’t very good. See column
above for how to align columns.
seq 20 | xargs -n5 echo
1 2 3 4 5
6 7 8 9 10
11 12 13 14 15
16 17 18 19 20
See below for the section on xargs.
paste also provides a way to join all lines, the -s
option. By default, paste
joins using a tab, but you can change that with the -d
option.
seq 5 | paste -s -
1 2 3 4 5
And to create comma-separated lists, provide a delimiter with the -d
flag:
seq 5 | paste -s -d, -
1,2,3,4,5
joining/transforming all except the last line with perl¶
Here’s another way to join a set of lines, but keep the last one intact. Useful if you want to create a comma-separated.
Use a perl transformation like perl -pe's/\n/:/ unless eof'
to join with other characters.
seq 5 | perl -pe’s/\n/:/ unless eof’ 1:2:3:4:5
This recipe can be generalized to apply any transformation to all except the last line.
Transform one column at a time¶
Sometimes I want to work on only one column of a multi-column file. A very common case is transforming the timestamp of some data, or a time-based ID to a timestamp. Other possible cases include doing some lookup, hashing, and data obfuscation.
If the transformation is quite simple, perl or awk can be used:
cat <<EOF | tee data
1 2018-04-01 foo
2 2018-04-01 bar
3 2018-04-02 baz
4 2018-04-03 cat
EOF
Here’s an example of using two simple awk filters to increment column 1, and uppercase column 3:
cat data | awk '$1=$1+7' | awk '$3=toupper($3)'
8 2018-04-01 FOO
9 2018-04-01 BAR
10 2018-04-02 BAZ
11 2018-04-03 CAT
A relatively complicated way to do the same thing with perl’s autosplit:
perl -anE'$F[0]+= 7; $F[-1] = uc($F[-1]); say join(" ", @F)' data
8 2018-04-01 FOO
9 2018-04-01 BAR
10 2018-04-02 BAZ
11 2018-04-03 CAT
Split, transform and recombine columns¶
However, sometimes the transformation is more complicated than I’d want to try
inline, or I have an existing tool or filter that will work on a column of
data. One technique is to split the input into separate files by column,
process each column separately, and then recombine the column files with the
paste
command. Note that this most useful when the number of columns is
relatively small. Here I’m doing it with the same 3-column file.
Create one file per column:
awk '{print $1}' data > a.01
awk '{print $2}' data > a.02
awk '{print $3}' data > a.03
Transform a column. Here we use a tempfile and would overwrite the original file only if the command succeeds:
cat a.02 | tr -d 'a' > tmp
mv tmp a.02
Verify that the files still have the same number of lines:
wc -l a.0*
4 a.01
4 a.02
4 a.03
12 total
Recombine the columns using paste. Separate with a space to match the input file format.
paste -d " " a.01 a.02 a.03
1 2018-04-01 foo
2 2018-04-01 br
3 2018-04-02 bz
4 2018-04-03 ct
Note that there are some problems with this approach. The most significant is that your transformation script must return a single line of output for every line of input, or the columns will become misaligned. Also, you end up re-reading the input file several times, which may be ok with a small number of columns, or a small file. Naturally, column extraction could be scripted with a for loop, or a dedicated, smarter tool, but then things begin to get complicated.
Loop to extract columns 1,2 and 3:
for i in `seq 3`; do awk "{print \$$i}" data > a.$i ; done
Finally, awk and paste are best for simple, whitespace delimited files, but you
could use something like csvcut
(from csvkit
, described below) and paste -d,
to handle CSV files.
Update:
Here’s a cool solution for many columns via stack overflow
(https://stackoverflow.com/a/41863438). It makes use of the GNU split command
to group every Nth line together into a separate file without multiple passes through the
input. WARNING: The built-in split on OSX DOESN’T have the required
functionality, so you’ll need to install the GNU version. e.g. brew install coreutils
. Under that same stack overflow question is a pure-awk solution, but
it’s also a bit complicated.
Create a sample file with 10 columns:
seq 50 | xargs -n10 | tee data
1 2 3 4 5 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20
21 22 23 24 25 26 27 28 29 30
31 32 33 34 35 36 37 38 39 40
41 42 43 44 45 46 47 48 49 50
Create one file per column by first collapsing the input rows into one cell per line. These lines are split into 10 files, with every line going round-robin into a separate (per-column) file. Then transform the 9th column, and paste the files back together:
cat data | tr ' ' '\n' | gsplit -nr/10 -d - /tmp/transform.
perl -ni -E'say $_*10' /tmp/transform.08
paste -d ' ' /tmp/transform.*
1 2 3 4 5 6 7 8 90 10
11 12 13 14 15 16 17 18 190 20
21 22 23 24 25 26 27 28 290 30
31 32 33 34 35 36 37 38 390 40
41 42 43 44 45 46 47 48 490 50
The gsplit
(or split
on linux) command magic is as follows:
extract every 10th line into its own file, round robin style
-nr/10
create numeric filename suffixes
-d
(default to 2 digits. Starts with 00)read from STDIN
-
write output files with a prefix
/tmp/transform.
Note that the split files start at 00, so we transformed the 9th column in
/tmp/transform.08
. We did another trivial transformation (multiply the value
by 10).