Use Multiple CPU Cores (Parallelize) with Single-Threaded Linux Commands

Note: If things are not working as described below, check the version information for parallel.

parallel --version
WARNING: YOU ARE USING --tollef. IF THINGS ARE ACTING WEIRD USE --gnu.
...

Either add --gnu to all your parallel commands, or edit /etc/parallel/config and make sure it lists --gnu and not --tollef.
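
One way to make --gnu the default everywhere is to write it into the system-wide config (a sketch; note this overwrites any other defaults you may already have in that file):

```shell
# Write --gnu as the system-wide default option for parallel.
# /etc/parallel/config holds default options, one per line.
echo --gnu | sudo tee /etc/parallel/config
```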


Most Linux utilities are single threaded, and when dealing with really large files a single core can be a severe bottleneck. One way to work around this limitation is the GNU parallel command, which I'm still playing around with and finding uses for, but it looks like a promising way to speed up a lot of things I do on a daily basis.

A common time sink with big computers is big logs, and waiting forever to grep them.

ls -lh big.txt
-rw-r--r--  1 westlund  staff   102M Nov  1 21:23 big.txt

time grep regex big.txt
real    0m2.376s
user    0m2.354s
sys     0m0.021s

time cat big.txt | parallel --pipe grep regex
real    0m2.592s
user    0m5.908s
sys     0m2.498s

Let's break down the command a bit. We cat the contents of the file into parallel, giving it the --pipe option, which splits STDIN across multiple copies of the command we supplied. (--pipe is a synonym for --spreadstdin, which is more readable but longer.)

But you'll notice it actually took longer, and chewed up a lot more CPU. What's going on here? By default parallel splits the input into 1 megabyte blocks and hands each to a single instance of the command, with the number of simultaneous instances defaulting to the number of cores. Launching grep 102 times to search 1 meg each is pretty wasteful. Let's tune things a little bit:

time cat big.txt | parallel --block 25M --pipe grep regex
real    0m1.626s
user    0m4.634s
sys     0m0.620s

Now we've got 25M blocks being handed out, and an over 30% reduction in the time to search the file. If the regex is more CPU intensive, the gains are even better:

time grep -i regex big.txt
real    0m3.412s
user    0m3.362s
sys     0m0.049s

time cat big.txt | parallel --block 25M --pipe grep -i regex
real    0m2.040s  # >40% improvement
user    0m6.155s
sys     0m0.666s

Another common command to throw large files at is awk. However, depending on what you're trying to accomplish, splitting the input may prevent you from performing certain tasks, like summing all the values of a field. To work around this you can rewrite your command to perform the same task in two steps: a first step that processes blocks of the input, and a second step that combines the partial results.

ls -lh random_numbers.txt
-rw-r--r--  1 westlund  staff    22M Nov  1 21:56 random_numbers.txt

time cat random_numbers.txt | awk '{sum+=$1} END {print sum}'
65538384640
real    0m2.408s
user    0m2.402s
sys     0m0.021s

time cat random_numbers.txt | parallel --block 2M --pipe awk \'{sum+=\$1} END {print sum}\'
6067097462
6064980068
6074889658
6068292593
6073256962
6065663642
6068441658
6071296753
4846052534
6072985491
6065427819
real    0m1.436s
user    0m4.390s
sys     0m0.358s

time cat random_numbers.txt | parallel --block 2M --pipe awk \'{sum+=\$1} END {print sum}\' | awk '{sum+=$1} END {print sum}'
65538384640
real    0m1.408s
user    0m4.341s
sys     0m0.356s

The same trick can be used with commands like wc:

ls -lh big.log
-rw-r--r-- 1 root root 164M Nov  2 13:33 big.log

time wc -w big.log
11993589 big.log
real	0m7.633s
user	0m7.515s
sys	0m0.078s


time cat big.log | parallel --block 10M  --pipe wc -w | awk '{sum+=$1} END {print sum}'
11993602
real    0m3.812s
user    0m5.003s
sys     0m1.368s

The caveat is that because these commands are being parallelized, the results sometimes aren't as reproducible as they would inherently be on a single core. You've got to be really careful with the options to parallel, and with the assumptions you make when assembling your commands. When in doubt, use -k (--keep-order) to return the output in the same order as the input, rather than in whatever order each block of input happens to finish.

echo -n 2 1 4 3 | parallel -d " " -k -j4 sleep {}\\; echo {}
2
1
4
3

echo -n 2 1 4 3 | parallel -d " " -j4 sleep {}\\; echo {}
1
2
3
4

In the commands above the delimiter is set to a space (-d " ") to split between each of the input numbers; in the first the output order is maintained (-k), and both are limited to 4 jobs (-j4). Each input number is substituted in place of the {} in the sleep {}\; echo {} command.

By maintaining output order, parallel can be used to bzip2 files as well:

ls -lh /boot/initramfs-3.8.9*
-rw------- 1 root root 20M May  2  2013 /boot/initramfs-3.8.9-200.fc18.x86_64.img

time cat /boot/initramfs-3.8.9* | bzip2 --best > /tmp/initramfs.bz2

real    0m5.823s
user    0m5.753s
sys     0m0.088s

time cat /boot/initramfs-3.8.9* | parallel --block 10M --pipe --recend '' -k bzip2 --best > /tmp/initramfs.bz2 
real    0m4.089s
user    0m6.632s
sys     0m0.299s

bzip2 --test /tmp/initramfs.bz2 # a new-ish version of bzip2 is required
echo $?
0

Pretty cool.