Extraction¶
Extraction is a subset of transformation, but it is important enough to have its own section.
perl, sed, and awk are all common tools for both selection and extraction.
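sed has no notion of fields, so extraction there usually means a capture group over the whole line; a minimal sketch on made-up colon-delimited data:

```shell
# keep only the text before the first colon, via a capture group
printf 'alice:x:1000\nbob:x:1001\n' | sed -E 's/^([^:]*):.*/\1/'
# alice
# bob
```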
Extracting one or more columns with awk¶
One trivial but common use of awk is to extract column(s) from text with variable whitespace, like formatted text or the output of a command like ls -l:
ls -l | tail -n +2 | awk '{print $5}'
To print multiple columns, remember to join them with , in the print statement, not a space (a space concatenates the fields with no separator):
ls -l | tail -n +2 | awk '{print $2,$1}'
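The comma works because print joins its arguments with the output field separator, OFS, which defaults to a single space; set OFS to get a different delimiter. A small sketch:

```shell
# OFS replaces the default space between printed fields
printf 'a b c\n' | awk 'BEGIN{OFS="|"} {print $2,$1}'
# b|a
```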
Field extraction via perl -anE¶
Perl also has an autosplit mode, -a, which can break up each input line by whitespace and put it into an array @F. Index the array to pick out columns. Note that the fields are zero-indexed.
ls -l | perl -anE'say $F[1]' # the second field
Printing the last column, awk and perl¶
Sometimes you just want to print the last column, when you don’t know (or don’t want to count) how many columns there are. This is also useful if you have a variable number of columns.
In awk, use the $NF variable (NF holds the number of fields, so $NF is the last field):
ls -l | tail -n +2 | awk '{print $NF}'
Because perl allows negative array indexes, I often use one to select the last field (or the Nth-from-last field), combined with the autosplit option.
Print the 2nd from the last column:
ls -l | perl -anE'say $F[-2]'
Extract simple fields via cut¶
cut is designed to extract fields from a line, given a single-character delimiter or a position list. It will not split on patterns or multi-character delimiters; use awk or one of the tools described below if you have more complicated data. By default, it splits fields on a single tab character, but you can easily specify something else with the -d option.
cat > data <<EOF
foo:bar:baz
fun:stuff:today
EOF
cat data | cut -d: -f 1,3
foo:baz
fun:today
Note that cut ignores the order of the -f list; fields always come out in their original order:
cat data | cut -d: -f 3,1
foo:baz
fun:today
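cut cannot reorder output, but awk can; a sketch on the same colon-delimited data:

```shell
# -F sets the input delimiter; the trailing OFS=: sets the output delimiter
printf 'foo:bar:baz\n' | awk -F: '{print $3,$1}' OFS=:
# baz:foo
```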
Extract by character position with cut¶
cat > alpha <<EOF
abcdef
gehijk
EOF
cat alpha | cut -c '2-4,6'
bcdf
ehik
Extract fixed-width fields with awk¶
Sometimes you will have fields of known (but possibly different) widths, AKA fixed width format.
awk can be used to extract the fields (you may need to install gnu awk). Found this via https://stackoverflow.com/a/28562381
echo aaaaBBcccccDe | awk '$1=$1' FIELDWIDTHS='4 2 5 1 1' OFS=,
aaaa,BB,ccccc,D,e
The $1=$1 just forces awk to re-split each line using the FIELDWIDTHS you've specified. OFS is the output field separator.
Note that this recipe gets every field, and every character between fields. See the cut recipe above, or the in2csv approach below, if you only want part of each line or only some of the fields.
Extract fixed-width fields with in2csv¶
You can also use in2csv (part of csvkit) to convert fixed-width data to CSV files with headers. in2csv requires a schema file, so this is probably most practical if you find yourself doing this frequently for the same schema, or there are a LOT of fields and you need to iterate through them anyway.
Define a schema file:
cat > fields.csv <<EOM
column,start,length
name,1,4
id,5,2
field1,7,5
section,12,1
category,13,1
EOM
And then use the schema file to extract fixed-width fields:
echo aaaaBBcccccDe | in2csv -s fields.csv
name,id,field1,section,category
aaaa,BB,ccccc,D,e
Convert whitespace-delimited columns to csv¶
This can be relatively simple. If you know that your data doesn't contain any additional commas, you can do a simple substitution:
# squeeze multiple spaces
tr -s ' ' ,
perl -pE's/\h+/,/g' # \h is horizontal whitespace, no newline
perl -anE'say join(",", @F)'
If you can't be sure that there won't be commas in the fields, you'll want proper quoting: use a real CSV tool. Either of these converts to TSV first and then lets csvformat handle the quoting:
perl -anE'say join("\t", @F)' | csvformat -t -H
tr -s ' ' '\t' | csvformat -t -H
Cutting columns: other tools¶
Cutting columns of text is such a common operation that it’s worthwhile to have a few other tools on hand.
f - trivial field extractor¶
f is a tool with a laser-sharp focus: extract a single column from a whitespace-delimited file. If you find yourself reaching for awk for something like awk '{print $3}', then add f to your arsenal.
# quickly extract one column
printf "the quick brown fox\nand so it goes" | f 3
brown
it
Note that f supports negative indexes, and can also be used to print the last column.
seq 10 | paste - - - | f -1 # grab the last column on each line
3
6
9
10
scut - swiss army knife of column cutters¶
scut is a better (if slower) cut: it extracts arbitrary columns, which can be selected by index or by regex.
# zero indexed, easy to get many columns
cat data | scut -f '2 1'
# negative columns are removed
cat data | scut -f 'ALL -1'
To extract columns from CSV data, use csvcut¶
csvcut is part of the csvkit suite. It allows you to process CSV and TSV files, and you can specify fields by name or by index.
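A quick sketch with made-up column names, selecting (and reordering) fields by name:

```shell
# unlike cut, csvcut understands CSV quoting and honors the order you ask for
printf 'name,id,category\nalice,1,x\nbob,2,y\n' | csvcut -c category,name
# category,name
# x,alice
# y,bob
```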