Batch and parallel execution with xargs and parallel¶
There are a few commands that are generally useful when working with many files.
xargs¶
xargs allows you to generate commands by piping in parameters.
A trivial example is to compress the 3 oldest files in the directory. I list the csv files sorted by age (most recent first) and take the last 3. These 3 files, one per line, are passed to xargs, which appends them as arguments to whatever command I specify. The -t option prints each command before it runs.
ls -1 -t *.csv | tail -3 | xargs -t gzip
gzip 19.csv 18.csv 17.csv
I use a few options frequently:
-n 1 passes only one argument at a time, like a for loop. Note that many common uses of xargs can also be replaced by a simple bash for loop.
-I % enables parameter substitution. This is a good way to use the argument more than once or to add an extension. -I already runs one command per input line, so -n 1 is not needed alongside it.
ls -1 *.csv | xargs -I % echo mycommand --logfile %.log %
mycommand --logfile 201701.csv.log 201701.csv
mycommand --logfile 201702.csv.log 201702.csv
-P N runs up to N commands in parallel.
# gzip csv files, with four parallel processes.
# print lines as we execute them, and send one file
# at a time to each invocation
ls -t -1 *.csv | xargs -P 4 -t -n 1 gzip
gzip 04.csv
gzip 03.csv
gzip 02.csv
gzip 01.csv
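The note above that -n 1 behaves like a for loop can be checked directly. The following is a minimal sketch, assuming a scratch directory with three made-up csv files; it uses echo so that nothing is actually compressed.

```shell
#!/usr/bin/env bash
# Sketch: xargs -n 1 and a plain for loop issue the same commands.
# The scratch directory and file names are made up for illustration.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch 01.csv 02.csv 03.csv

# One echo per file via xargs
via_xargs=$(ls -1 *.csv | xargs -n 1 echo gzip)

# The same commands generated by a for loop
via_loop=$(for f in *.csv; do echo gzip "$f"; done)

echo "$via_xargs"
[ "$via_xargs" = "$via_loop" ] && echo "same commands"
```

The loop runs its commands one after another, so xargs is mainly worth the extra syntax when -P is added to run them concurrently.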
GNU parallel¶
GNU parallel is a large and powerful tool with many pages of manuals and examples.
However, there are a few key things that I like about running commands via parallel:
easily create separate log files for each invocation.
run commands on multiple machines
Here are a few simple examples:
gzip all csv files in a directory. Create one job per core, and provide some diagnostic output.
ls *.csv | parallel --eta gzip
A more complicated example uses parallel's replacement strings to build a directory name from each input file. {} is the input, {.} is the input without the extension, and {/.} is the basename without the extension.
ls *.zip | parallel --eta 'mkdir {/.} && cd {/.} && unzip ../{}'
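The separate log files per invocation mentioned earlier can be sketched with the --results option, which stores each job's stdout and stderr under a directory named after its argument. This assumes GNU parallel is installed; the directory name logs and the sample files are hypothetical, and wc -l stands in for a real command.

```shell
#!/usr/bin/env bash
# Sketch: one stdout/stderr log per invocation via --results.
# Assumes GNU parallel; "logs" and the sample files are made-up names.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'a\nb\n' > 01.csv
printf 'a\nb\nc\n' > 02.csv

if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # Each job's output is stored under logs/1/<argument>/
    parallel --results logs wc -l ::: *.csv >/dev/null
    # Show the captured stdout of the job that handled 01.csv
    cat logs/1/01.csv/stdout
else
    echo "GNU parallel not found; skipping" >&2
fi
```

Here the inputs are given after ::: instead of being piped in, which is an equivalent way to pass arguments to parallel.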
See also: sem, part of the GNU parallel package, which lets you limit the number of concurrent processes without the complexity of parallel. It is very useful for running N jobs in parallel inside a simple for loop.
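A rough sketch of that for-loop pattern, assuming sem is installed along with GNU parallel. The semaphore name gzipjobs and the sample files are made up for illustration, and the limit of 4 jobs is arbitrary.

```shell
#!/usr/bin/env bash
# Sketch: cap a plain for loop at 4 concurrent jobs with sem.
# Assumes GNU parallel (which provides sem); names are hypothetical.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
for i in 1 2 3 4 5 6; do echo "a,b,c" > "0$i.csv"; done

if command -v sem >/dev/null 2>&1; then
    for f in *.csv; do
        # sem starts the job in the background once a slot is free
        sem --id gzipjobs -j 4 gzip "$f"
    done
    # Block until every job queued under this semaphore has finished
    sem --id gzipjobs --wait
    ls
else
    echo "sem not found; install GNU parallel to try this" >&2
fi
```

The explicit --id keeps the semaphore name independent of the controlling terminal, which matters when the script runs non-interactively.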