Note: If things are not working as described below, check the version information for parallel:

parallel --version  

Either add --gnu to all your parallel commands, or edit /etc/parallel/config and make sure it lists --gnu and not --tollef.
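To see which of the two flags a config file currently lists, something like the following may help. It is only a sketch: `parallel_mode` is a made-up helper name, and it assumes the config file simply contains the flag as plain text.

```shell
# Hypothetical helper: report which compatibility flag a parallel
# config file lists. The default path is the one mentioned above.
parallel_mode() {
    config=${1:-/etc/parallel/config}
    if [ ! -r "$config" ]; then
        echo "no-config"          # file missing or unreadable
    elif grep -q -e '--tollef' "$config"; then
        echo "tollef"             # needs changing to --gnu
    elif grep -q -e '--gnu' "$config"; then
        echo "gnu"
    else
        echo "unset"              # neither flag is listed
    fi
}

parallel_mode
```

If it reports anything other than gnu, adding --gnu on the command line is the quick workaround.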

Most Linux utilities are single-threaded, and when dealing with really large files a single core can be a severe bottleneck. One way to work around this limitation is the GNU parallel command. I'm still playing around with it and finding uses for it, but it looks like a promising way to speed up a lot of things I do on a daily basis.

A common time sink with big computers is big logs, and waiting forever to grep them.

ls -lh big.txt  
-rw-r--r--  1 westlund  staff   102M Nov  1 21:23 big.txt

time grep regex big.txt  
real    0m2.376s  
user    0m2.354s  
sys     0m0.021s

time cat big.txt | parallel --pipe grep regex  
real    0m2.592s  
user    0m5.908s  
sys     0m2.498s  

Let's break down the command a bit. We cat the contents of the file to parallel and give it the --pipe option, which splits STDIN across multiple copies of the command we supplied. (--pipe is the same as --spreadstdin, which is more readable but longer.)

But you'll notice it actually took longer, and it chewed up a lot more CPU. What's going on here? By default parallel splits the input into 1 megabyte blocks and hands each one to a separate instance of the command, with the number of simultaneous instances defaulting to the number of cores. Launching grep 102 times to search 1 meg each is pretty wasteful. Let's tune things a little bit:
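The arithmetic behind that is easy to check. This is a rough sketch that assumes sizes in whole megabytes; `block_count` is a made-up helper, not part of parallel, and the real count can differ slightly because parallel cuts blocks on line boundaries.

```shell
# Hypothetical helper: roughly how many blocks (and so how many command
# launches) a file of SIZE_MB produces when cut into BLOCK_MB pieces.
block_count() {
    size_mb=$1
    block_mb=$2
    # Integer ceiling division: a partial final block still needs a launch.
    echo $(( (size_mb + block_mb - 1) / block_mb ))
}

block_count 102 1     # default 1M blocks
block_count 102 25    # 25M blocks
```

For the 102M file this works out to 102 launches with the default block size and 5 with 25M blocks.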

time cat big.txt | parallel --block 25M --pipe grep regex  
real    0m1.626s  
user    0m4.634s  
sys     0m0.620s  

Now we've got 25M blocks being handed out, and a reduction of over 30% in the time to search the file. If the regex is more CPU intensive, the gains are better:

time grep -i regex big.txt  
real    0m3.412s  
user    0m3.362s  
sys     0m0.049s

time cat big.txt | parallel --block 25M --pipe grep -i regex  
real    0m2.040s  # >40% improvement  
user    0m6.155s  
sys     0m0.666s  

Another common command to throw large files at is awk. However, depending on what you're trying to accomplish, splitting the input may break the task, for example adding up all the values of a field. To work around this you can rewrite your command to do the same job in two steps: the first step processes blocks of the input, and the second step combines the results.

ls -lh random_numbers.txt  
-rw-r--r--  1 westlund  staff    22M Nov  1 21:56 random_numbers.txt

time cat random_numbers.txt | awk '{sum+=$1} END {print sum}'  
real    0m2.408s  
user    0m2.402s  
sys     0m0.021s

time cat random_numbers.txt | parallel --block 2M --pipe awk \'{sum+=\$1} END {print sum}\'  
real    0m1.436s  
user    0m4.390s  
sys     0m0.358s

time cat random_numbers.txt | parallel --block 2M --pipe awk \'{sum+=\$1} END {print sum}\' | awk '{sum+=$1} END {print sum}'  
real    0m1.408s  
user    0m4.341s  
sys     0m0.356s  
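If you want to convince yourself that the two-step version gives the same answer, here is a small check that doesn't need parallel at all. It is only a sketch: `two_step_sum` is a made-up helper, and it uses split(1) as a stand-in for the blocks that parallel --pipe would hand out.

```shell
# Hypothetical helper: sum column 1 of FILE in two steps, with split(1)
# standing in for the blocks that parallel --pipe would hand out.
two_step_sum() {
    file=$1
    lines_per_chunk=$2
    tmpdir=$(mktemp -d)
    # Cut the input into chunks on line boundaries.
    split -l "$lines_per_chunk" "$file" "$tmpdir/chunk."
    # Step 1: one partial sum per chunk. Step 2: add the partial sums.
    for chunk in "$tmpdir"/chunk.*; do
        awk '{sum+=$1} END {print sum}' "$chunk"
    done | awk '{sum+=$1} END {print sum}'
    rm -rf "$tmpdir"
}

# Compare against the single-pass answer on a small generated file.
numbers=$(mktemp)
seq 1 1000 > "$numbers"
awk '{sum+=$1} END {print sum}' "$numbers"   # single pass
two_step_sum "$numbers" 64                   # two steps
rm -f "$numbers"
```

Both lines of output should show the same total. This works because addition doesn't care how the input is grouped; a task like an average would need each block to emit both a sum and a count.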

The same trick can be used with commands like wc:

ls -lh big.log  
-rw-r--r-- 1 root root 164M Nov  2 13:33 big.log

time wc -w big.log  
11993589 big.log  
real    0m7.633s  
user    0m7.515s  
sys     0m0.078s

time cat big.log | parallel --block 10M --pipe wc -w | awk '{sum+=$1} END {print sum}'  
real    0m3.812s  
user    0m5.003s  
sys     0m1.368s  

The caveat is that because parallel runs these commands concurrently, the results aren't always as reproducible as they would be on a single core. You've got to be really careful with the options to parallel, and with the assumptions you make when assembling your commands. When in doubt, use -k (--keep-order) to return the output in the same order as the input, rather than as soon as the command running on a particular block of input finishes.

echo -n 2 1 4 3 | parallel -d " " -k -j4 sleep {}\; echo {}  

echo -n 2 1 4 3 | parallel -d " " -j4 sleep {}\; echo {}  

In the above commands the delimiter is set to a space (-d " ") to split the input numbers, and the number of simultaneous jobs is limited to 4 (-j4). In the first command the output order is maintained (-k). The input numbers are inserted in place of the {} in the sleep {}\; echo {} command.
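The difference between the two orderings can be imitated with plain background jobs. This is a simplified sketch of the idea, not how parallel does it internally: the function names are made up, it waits for every job before printing anything, and it assumes your sleep accepts fractional seconds (GNU and BSD sleep do).

```shell
# Completion order: each job prints as soon as it finishes,
# like parallel without -k.
completion_order() {
    for n in "$@"; do
        ( sleep "0.$n"; echo "$n" ) &
    done
    wait
}

# Input order: each job writes to its own numbered file, and the files
# are printed in input order once all jobs are done, like parallel -k.
input_order() {
    tmpdir=$(mktemp -d)
    i=0
    for n in "$@"; do
        i=$((i + 1))
        ( sleep "0.$n"; echo "$n" ) > "$tmpdir/$i" &
    done
    wait
    j=1
    while [ "$j" -le "$i" ]; do
        cat "$tmpdir/$j"
        j=$((j + 1))
    done
    rm -rf "$tmpdir"
}

completion_order 2 1 4 3
input_order 2 1 4 3
```

The first call usually prints the numbers sorted by how long each job slept, while the second always prints them in the order they were given.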

By maintaining output order, parallel can be used to bzip2 files as well:

ls -lh /boot/initramfs-3.8.9*  
-rw------- 1 root root 20M May  2  2013 /boot/initramfs-3.8.9-200.fc18.x86_64.img

time cat /boot/initramfs-3.8.9* | bzip2 --best > /tmp/initramfs.bz2

real    0m5.823s  
user    0m5.753s  
sys     0m0.088s

time cat /boot/initramfs-3.8.9* | parallel --block 10M --pipe --recend '' -k bzip2 --best > /tmp/initramfs.bz2  
real    0m4.089s  
user    0m6.632s  
sys     0m0.299s

bzip2 --test /tmp/initramfs.bz2 # a new-ish version of bzip2 is required  
echo $?  
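The reason this works is that bzip2 accepts a file made of several compressed streams glued together, as long as they are in the right order. Here is a small sketch that checks this without parallel. `concat_roundtrip` is a made-up helper, and the compressor name is a parameter only so the same check can be run against other tools that take the same -c and -d flags, such as gzip.

```shell
# Hypothetical helper: compress two halves of a file separately with
# COMPRESSOR, concatenate the results in order, and check that
# decompressing the concatenation reproduces the original bytes.
concat_roundtrip() {
    compressor=$1
    tmpdir=$(mktemp -d)
    seq 1 20000 > "$tmpdir/original"
    head -n 10000 "$tmpdir/original" | "$compressor" -c >  "$tmpdir/joined"
    tail -n 10000 "$tmpdir/original" | "$compressor" -c >> "$tmpdir/joined"
    if "$compressor" -dc "$tmpdir/joined" | cmp -s - "$tmpdir/original"; then
        echo match
    else
        echo differ
    fi
    rm -rf "$tmpdir"
}

# Only run the check when bzip2 is actually installed.
if command -v bzip2 >/dev/null 2>&1; then
    concat_roundtrip bzip2
fi
```

Swap the two compressed halves and the check fails, which is why -k matters in the parallel version.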

Pretty cool.