Use Multiple CPU Cores (Parallelize) with Single-Threaded Linux Commands
Note: If things are not working as described below, check the version information for parallel.
parallel --version
WARNING: YOU ARE USING --tollef. IF THINGS ARE ACTING WEIRD USE --gnu.
...
Either add --gnu to all your parallel commands, or edit /etc/parallel/config and make sure it lists --gnu and not --tollef.
Most Linux utilities are single threaded, and when dealing with really large files a single core can be a severe bottleneck. One way to work around this limitation is the GNU parallel command, which I'm still playing around with and finding uses for -- but it looks like a promising way to speed up a lot of things I do on a daily basis.
A common time sink with big computers is big logs, and waiting forever to grep them.
ls -lh big.txt
-rw-r--r-- 1 westlund staff 102M Nov 1 21:23 big.txt
time grep regex big.txt
real 0m2.376s
user 0m2.354s
sys 0m0.021s
time cat big.txt | parallel --pipe grep regex
real 0m2.592s
user 0m5.908s
sys 0m2.498s
Let's break down the command a bit. We cat the contents of the file to the parallel command, to which we give the --pipe option, which splits STDIN across multiple copies of the command we supplied. (--pipe is the same as --spreadstdin, which is more readable but longer.)
But you'll notice it actually took longer, and it chewed up a lot more CPU. What's going on here? By default parallel splits the input into 1 megabyte blocks and hands each to a single instance of the command, with the number of simultaneous instances set to the number of cores. Launching grep 102 times to search 1 megabyte each is pretty wasteful. Let's tune things a little bit:
time cat big.txt | parallel --block 25M --pipe grep regex
real 0m1.626s
user 0m4.634s
sys 0m0.620s
Now we've got 25M blocks being handed out, and an over 30% reduction in the time to search the file. If the regex is more CPU intensive, the gains are even better:
time grep -i regex big.txt
real 0m3.412s
user 0m3.362s
sys 0m0.049s
time cat big.txt | parallel --block 25M --pipe grep -i regex
real 0m2.040s # >40% improvement
user 0m6.155s
sys 0m0.666s
Another common command to throw large files at is awk. However, depending on what you're trying to accomplish, splitting the input may prevent certain tasks -- like adding all the values of a field. To work around this you can rewrite your command to perform the same task in two steps: the first step processes blocks of the input, and the second combines the results.
ls -lh random_numbers.txt
-rw-r--r-- 1 westlund staff 22M Nov 1 21:56 random_numbers.txt
time cat random_numbers.txt | awk '{sum+=$1} END {print sum}'
65538384640
real 0m2.408s
user 0m2.402s
sys 0m0.021s
time cat random_numbers.txt | parallel --block 2M --pipe awk \'{sum+=\$1} END {print sum}\'
6067097462
6064980068
6074889658
6068292593
6073256962
6065663642
6068441658
6071296753
4846052534
6072985491
6065427819
real 0m1.436s
user 0m4.390s
sys 0m0.358s
time cat random_numbers.txt | parallel --block 2M --pipe awk \'{sum+=\$1} END {print sum}\' | awk '{sum+=$1} END {print sum}'
65538384640
real 0m1.408s
user 0m4.341s
sys 0m0.356s
The same trick can be used with commands like wc:
ls -lh big.log
-rw-r--r-- 1 root root 164M Nov 2 13:33 big.log
time wc -w big.log
11993589 big.log
real 0m7.633s
user 0m7.515s
sys 0m0.078s
time cat big.log | parallel --block 10M --pipe wc -w | awk '{sum+=$1} END {print sum}'
11993602
real 0m3.812s
user 0m5.003s
sys 0m1.368s
The caveat is that because the work is being parallelized, the results aren't always as reproducible as they would inherently be on a single core (note that the two word counts above differ slightly). You've got to be really careful with the options to parallel, and the assumptions you make when assembling your commands. When in doubt use -k (--keep-order) to return the output in the same order as the input, as opposed to whenever the command running on a particular block of input finishes.
echo -n 2 1 4 3 | parallel -d " " -k -j4 sleep {}\\; echo {}
2
1
4
3
echo -n 2 1 4 3 | parallel -d " " -j4 sleep {}\\; echo {}
1
2
3
4
In the above commands the delimiter is set to a space (-d " ") to split between each of the input numbers, both runs are limited to 4 simultaneous jobs (-j4), and only the first maintains output order (-k). The input numbers are inserted in place of the {} in the sleep {}\; echo {} command.
By maintaining output order, parallel can be used to bzip2 files as well:
ls -lh /boot/initramfs-3.8.9*
-rw------- 1 root root 20M May 2 2013 /boot/initramfs-3.8.9-200.fc18.x86_64.img
time cat /boot/initramfs-3.8.9* | bzip2 --best > /tmp/initramfs.bz2
real 0m5.823s
user 0m5.753s
sys 0m0.088s
time cat /boot/initramfs-3.8.9* | parallel --block 10M --pipe --recend '' -k bzip2 --best > /tmp/initramfs.bz2
real 0m4.089s
user 0m6.632s
sys 0m0.299s
bzip2 --test /tmp/initramfs.bz2 # a new-ish version of bzip2 is required
echo $?
0
Pretty cool.