# Extraction

Extraction is a subset of transformation, but it is important enough to have its own section. perl, sed and awk are all common tools for both selection and extraction.

## Extracting one or more columns with awk

One trivial but common use of awk is to extract column(s) from text with variable whitespace, like formatted text or the output of a command such as `ls -l`:

```
ls -l | tail -n +2 | awk '{print $5}'
```

To print multiple columns, remember to join them with `,`, not a space:

```
ls -l | tail -n +2 | awk '{print $2,$1}'
```

## Field extraction via perl -anE

Perl also has an autosplit mode, `-a`, which breaks each input line up on whitespace and puts the pieces into the array `@F`. Index the array to pick out columns. Note that the fields are zero-indexed.

```
ls -l | perl -anE'say $F[1]'   # the second field
```

### Printing the last column, awk and perl

Sometimes you just want to print the last column when you don't know (or don't want to count) how many columns there are. This is also useful if you have a variable number of columns.

In awk, the variable `NF` holds the number of fields, so `$NF` is the last field:

```
ls -l | tail -n +2 | awk '{print $NF}'
```

Because perl allows negative array indices to count back from the end of an array, I often use this to select the last field (or the Nth-from-last field), combining it with autosplit. To print the 2nd-from-last column:

```
ls -l | perl -anE'say $F[-2]'
```

## Extract simple fields via cut

cut is designed to extract fields from a line, given a single-character delimiter or a list of positions. It will not split on patterns or multi-character delimiters; use `awk` or one of the tools described below if you have more complicated data.

By default, cut splits fields on a single tab character, but you can easily specify something else with the `-d` option.
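For example (the contents of the sample files `data` and `fields.csv` below are invented for illustration):

```
# data: tab-separated fields; cut's default delimiter is a tab
printf 'alpha\tbeta\tgamma\n' > data
cut -f2 data              # prints: beta

# fields.csv: comma-separated fields, so pass -d,
cat > fields.csv <<EOF
alpha,beta,gamma
one,two,three
EOF
cut -d, -f1,3 fields.csv  # prints: alpha,gamma  then  one,three
```

Note that `-f` takes a comma-separated list (or ranges like `1-3`), and cut always emits the selected fields in their original order.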
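One caveat worth spelling out: cut treats every single occurrence of the delimiter as a field boundary, so it does not collapse runs of whitespace the way awk does. A rough comparison on `ls -l` output:

```
# Splitting on single spaces miscounts fields when columns are padded
# with extra spaces, so this usually prints the wrong thing (or nothing):
ls -l | tail -n +2 | cut -d' ' -f5

# awk collapses runs of whitespace, so $5 really is the size column:
ls -l | tail -n +2 | awk '{print $5}'
```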