GNU Parallel's 20th birthday


On 2022-01-06 GNU Parallel will be 20 years old. The birthday is an opportunity to take stock.

Last year I found an old backup that contained the very first version of Parallel. It also had an Emacs backup file (parallel~) that did not contain working code. This gives a firm birth date for the very first working code of Parallel: 2002-01-06. The code is ~15 lines of Perl, and it still runs:

  #!/usr/bin/perl
  
  $processes=shift;
  
  chomp(@jobs=<>);
  for (@jobs) {
      $jobnr++;
      push @makefile,
      (".PHONY : job$jobnr\n",
       "job$jobnr :\n",
       "\t$_\n");
  }
  unshift @makefile, "all : ",(map { "job$_ " } 1 .. $jobnr),"\n";
  
  open (MAKE, "| make -k -f - -j $processes") || die;
  print MAKE @makefile;
  close MAKE;

This was obviously before Parallel was adopted as a GNU tool. The adoption happened in April 2011.

  gnu parallel is a good program
    -- Pwn A. Day @pwnaday@twitter

Not all software manages to survive for 20 years and stay relevant to more than fans of retro-computing: Most of GNU Parallel's competitors (see list in man parallel_alternatives) have withered away after a few years.

  This is a fantastic tool, and I wish I had upgraded from xargs years ago!
    -- Stuart Anderson

But it seems GNU Parallel has managed to stay relevant.

With around 1000 citations in scientific articles and a slow rise in year-by-year citation counts, the relevance does not seem to have peaked yet.

The articles cover a diverse range, from pruning fruit trees (https://arxiv.org/abs/2102.03700) through checking programs for the Mars rovers (https://www-robotics.jpl.nasa.gov/publications/Mark_Maimone/rp_check_v4.pdf) to COVID-19 research (https://www.nature.com/articles/s41588-021-00862-7).

The number of citations puts GNU Parallel firmly in the top 1% most cited papers and possibly in the top 0.1% (https://www.nature.com/news/nature-top-100-papers-infographicv2-30-10-14-jpg-7.21204?article=1.16224). I think that would not have happened if the citation notice had not been implemented.

  Safe to say, @GnuParallel was a life changer during my PhD! It helped
  me optimise so many of my tasks and analyses.
    -- Parice Brandies @PariceBrandies@twitter

The citation notice has been the single issue that has caused the most contention in the past 20 years. Looking back, I should have cleared the wording with RMS earlier and made it known that RMS had accepted it.

Given that we still do not have a surefire way of earning a living from free software, I find it important that we try out different ways, and the citation notice is one such attempt.

You may disagree with the citation notice, and that is fine: There are more than 50 alternatives to GNU Parallel for you to choose from (see man parallel_alternatives).

  It is, beyond absolutely any doubt whatsoever, the single most
  important tool I use in making me a productive bioinformatician.
    -- A-N-Other@reddit.com

Unfortunately, there are people working directly against the citation notice. I welcome different opinions, but I find it inconsiderate to work directly against the wishes of the author without providing an alternative source of income.

This behaviour will make it harder to attract developers of free software in the future: If potential developers see that there are people who are willing to spend a considerable amount of time on making it harder to earn a living from free software, it is likely fewer developers will join us. And that will hurt us all in the long term.

It is much preferable if these people simply ignore GNU Parallel and instead choose to use another tool, and channel their energy into building other software. That would be more productive and waste fewer resources.

Instead of working against each other we should be working together to find better solutions.

  I get a weird sense of satisfaction every single time I see the
  lovely logo of #GNU Parallel (plus, what an underrated piece of
  great software!)
    -- Emre Sevinç @EmreSevinc@twitter

One of the best examples of finding better solutions together is the design of {==} in GNU Parallel.

GNU Parallel has some predefined replacement strings: {} is replaced with the input - just like in xargs and find -exec. {.} removes the extension of the file name, and {/} removes the dir.
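For readers who know shell parameter expansion, the effect of these replacement strings on a simple file name can be sketched without GNU Parallel at all. The variable names and the path photos/cat.jpg are made up for illustration, and the expansions only approximate {.} and {/} (they can differ in corner cases such as a dot in the directory name):

```shell
# Hypothetical input, standing in for one argument given to GNU Parallel.
input="photos/cat.jpg"

whole="$input"          # what {} gives: the input unchanged
no_ext="${input%.*}"    # roughly {.}: strip the last ".extension"
no_dir="${input##*/}"   # roughly {/}: strip everything up to the last "/"

echo "$whole $no_ext $no_dir"
```

With GNU Parallel itself the corresponding command would be: parallel echo {} {.} {/} ::: photos/cat.jpg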

In 2014 I wanted to make more of these replacement strings, but Malcolm Cook came up with the brilliant idea to simply allow Perl expressions, and thus make it possible for users to define their own. So {= perl expression =} will run the Perl expression with the input in $_ and use the resulting $_ as the value:

  $ parallel echo '{= $_ = length $_ =}' ::: aaa bb cccc
  3
  2
  4

Today even the predefined replacement strings are made this way.

  My favorite man page is that of GNU parallel.
    -- Jeroen Janssens @jeroenhjanssens@twitter

Initially I had expected most support would happen on the email lists. But today most support is done via unix.stackexchange.com. I am happy that the content from the site is available under a free license (CC-By-SA), but I would prefer if the site ran on free software.

If someone builds a competitor to unix.stackexchange.com using only free software, you can expect me to show up there.

  GNU parallel is very easy to use and has many features for
  specialized use cases. It’s a Perl script.
    -- @harlekyn@twitter uʎʞǝlɹɐɥ 

Many options can trace their origin to users trying to do something that GNU Parallel could not do at the time.

That is not to say that every single idea is implemented. From the initial idea to the implementation it is not uncommon to take a full year - especially if I am not convinced that it is a generally useful idea. But sometimes thinking of an idea for an extended period of time will improve the idea and change it into something generally useful.

One example is using multiple input sources. Earlier versions of GNU Parallel paired the inputs, taking one value from each source for each job:

  $ parallel echo ::: Blue Gray ::: Whale Elephant
  Blue Whale
  Gray Elephant

but it turned out to be more useful to generate all combinations, which is what the same command does today:

  $ parallel echo ::: Blue Gray ::: Whale Elephant
  Blue Whale
  Blue Elephant
  Gray Whale
  Gray Elephant
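The difference between the two behaviours in the example above can be sketched in plain shell, without GNU Parallel. The variable names are made up for illustration; the nested loop corresponds to generating all combinations, the single loop to the earlier pairing:

```shell
# Two input sources, as in the example above.
colours="Blue Gray"
animals="Whale Elephant"

# Current behaviour: every value of the first source is combined
# with every value of the second source.
all_combinations=$(
  for c in $colours; do
    for a in $animals; do
      echo "$c $a"
    done
  done
)

# Early behaviour: the n-th value of one source is paired with
# the n-th value of the other.
set -- $animals              # put the animals in $1, $2, ...
paired=$(
  for c in $colours; do
    echo "$c $1"
    shift                    # move on to the next animal
  done
)

echo "$all_combinations"
echo "$paired"
```

As far as the GNU Parallel documentation goes, the pairing behaviour has not disappeared: it should still be available through the --link option or the :::+ separator.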

If you develop software, listen to your users. Don't accept every crazy idea, but use your users to discuss how new functionality should work.

Sometimes you will discover users using your software in ways you never intended. This is also why you should push your software to the limit. Do not artificially limit input to 100 bytes if you can support 4 GBytes with no extra work; and if it only costs a little extra, try to remove the limit altogether.

Pushing your software to the limit will also uncover bugs that might be more serious.

  I have gotten a *ton* of mileage out of jq, awk, and GNU parallel,
  even at multi-GB sizes.
    -- Eric Wolak @ericthewolak@twitter 

Both the videos (https://youtube.com/playlist?list=PL284C9FF2488BC6D1) and the book (https://www.lulu.com/shop/ole-tange/gnu-parallel-2018/paperback/product-23558902.html) were prompted by Hans Schou: He teased that one of GNU Parallel's competitors had a video, and this made me record a few screencasts of the basic usage.

  GNU parallel(1): The first CLI utility I have seen that gives a link
  to a YouTube video for a "quick introduction". WTF.
    -- Ralf (RPW) @esizkur@twitter

Hans Schou also started calling GNU Parallel's man page "The Book", and this got me thinking that there really ought to be a book, which introduces you to GNU Parallel. A man page is a decent reference manual, but it is a lousy document for learning what is important and what is not.

So in 2018 I wrote "GNU Parallel 2018". Contrary to other technical books it is still not outdated 4 years later: The examples still work, because the core functionality remains backwards compatible. None of the new functionality added since 2018 is relevant for new users - it is all for advanced users.

Writing the book was a lot of work, and since it has only sold very few copies, it will be hard for me to justify spending time on an update any time soon.

  I wish more command line software had example pages as robust as GNU
  Parallel
    -- Lucidbeaming @lucidbeaming

I made the cheat sheet after learning about the four types of documentation (https://documentation.divio.com/).

  How I love GNU parallel 
    -- @dathanvp@twitter Dathan Pattishall

It is clear that some users really like GNU Parallel, and I must admit I also find it quite satisfying to see a 64 core machine max out all cores instead of just using a single one.

I had hoped I could convince some of these users to make short video testimonials to show the wide range of usage of GNU Parallel. So far only Juan Sierra Pons has done so: http://www.elsotanillo.net/wp-content/uploads/GnuParallel_JuanSierraPons.mp4

If you choose to make one, it really does not have to be as detailed as Juan's. It is perfectly fine if you make a 15-second video in which you just say your name and what field you are using GNU Parallel in. Please put it under the (CC-By) or (CC-By-SA) license, so I can re-use it.

  Deus salve o gnu parallel
    -- marcos @guv_Tuv@twitter

I have used GNU Parallel as a guinea pig to test whether people read the source code of free software.

And people do.

Three times I have secretly inserted a comment asking people to contact me when they read the comment. The first took 3 months, the second 23 months, and the final one 5 months. So GNU Parallel's source code is read by users roughly once a year.

The first two comments are covered in: https://www.fsf.org/blogs/community/who-actually-reads-the-code

  Every time I install @ubuntu, one of the first tools I install is
  @gnuparallel. I love it.
    -- Necati Demir @ndemir@twitter

Some of GNU Parallel's options are a bit dated: I cannot remember the last time I used --trim, which removes white space at either end of the argument. The dated options seemed like a good idea at the time, and as long as they do not cause problems, they will be supported for backwards compatibility.

Other options are a bit over-engineered: Dynamic replacement strings are replacement strings that take multiple arguments, so you can define {/foo/bar} to take the two arguments (foo and bar) and let it mean "replace foo with bar" (or any perl expression). I have yet to see normal users of GNU Parallel define their own dynamic replacement strings.

  [L]earning about parallel was amazing for me, it gives us many
  beautiful solutions. 
     -- SergioAraujo@stackoverflow

Other developments have proven surprisingly useful, such as env_parallel.

env_parallel exports aliases, functions, and shell variables to a remote system through ssh. So you can define a complex function locally and have that run on the remote system without having to deal with getting the function and variables to the remote system.

Novice UNIX users do not understand how surprising that is. But senior UNIX users will initially see that as magic.

This is so convenient that I use it even if I only run a single job on a remote system.
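The underlying idea can be sketched in plain bash, without ssh and without env_parallel itself. This is not env_parallel's actual code, only an illustration of the principle: a non-exported variable and a function (both with made-up names) are turned into shell source code with the bash built-ins declare -p and declare -f, and that code is replayed in a fresh shell, which stands in for the remote system:

```shell
# Run a small bash session explicitly, because "declare -f" is a bash feature.
result=$(bash -c '
  greeting="Hello"                     # a plain, non-exported shell variable
  greet() { echo "$greeting, $1"; }    # a shell function using that variable

  # Serialize the variable and the function as shell source code ...
  env_dump=$(declare -p greeting; declare -f greet)

  # ... and replay that code in a brand-new shell before running the command.
  # With ssh in between, the new shell would be on the remote system.
  bash -c "$env_dump; greet world"
')
echo "$result"
```

A new shell normally knows nothing about greet or greeting, since neither is exported; the replayed definitions are what make the call work there.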

parset and parallel --embed are also interesting, because it is not trivial to run computations in parallel and have the outputs stored in different shell variables or to include all source code of a program in a single shell script.
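To see why storing parallel outputs in separate variables is not trivial, here is a minimal sketch of doing it by hand in plain shell. The jobs and the variable names are made up for illustration; the point is the bookkeeping with temporary files that is needed:

```shell
# Scratch directory for the intermediate results.
tmpdir=$(mktemp -d)

# Start two jobs in the background; each writes to its own file,
# because a background job cannot set a variable in the parent shell.
( echo "alpha" | tr 'a-z' 'A-Z' > "$tmpdir/first" ) &
( echo "beta"  | tr 'a-z' 'A-Z' > "$tmpdir/second" ) &
wait    # block until both jobs have finished

# Only now can the outputs be read into separate variables.
first=$(cat "$tmpdir/first")
second=$(cat "$tmpdir/second")
rm -r "$tmpdir"

echo "$first $second"
```

parset is meant to replace this bookkeeping with a single line of the form: parset first,second echo ::: alpha beta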

  It's not a data migration party until GNU Parallel is involved...
  involved
  involved
    -- rrees @rrees@twitter

env_parallel started as some simple code around GNU Parallel. A similar story is the provenance of parsort: I had bought a 64 core machine and was amazed how slow GNU Sort is on this machine. GNU Sort scales really badly on multicore machines. And that is sad because a lot of data processing requires sorting.

parsort is a wrapper around sort that makes better use of the cores giving a speedup of 3x on a 64 core machine compared to normal sort. But even so you still get >80% idle CPU time.

This is because GNU Sort does not use algorithms that are 100% parallelized.
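The general idea of a parallel wrapper around sort can be sketched as follows. This is an illustration under assumptions, not parsort's actual code: the input is assumed to be already split into two chunks (with made-up contents), each chunk is sorted by its own sort process, and the results are merged with sort -m:

```shell
tmpdir=$(mktemp -d)

# Two hypothetical chunks of a larger input file.
printf '%s\n' 5 3 9 1 > "$tmpdir/chunk1"
printf '%s\n' 8 2 7 4 > "$tmpdir/chunk2"

# Sort each chunk in its own process, so two cores can work at once.
sort -n "$tmpdir/chunk1" > "$tmpdir/sorted1" &
sort -n "$tmpdir/chunk2" > "$tmpdir/sorted2" &
wait

# Merge the pre-sorted chunks; -m merges without sorting again.
merged=$(sort -n -m "$tmpdir/sorted1" "$tmpdir/sorted2")
rm -r "$tmpdir"

echo "$merged"
```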

If you are a student or a teacher, updating GNU Sort to 100% parallelized algorithms could be a good project for a programming course.

  GNU parallel should be taught in class, it is one of the best tools
  to run grids of experiments
    -- no love deep learning @tetraduzione@twitter

All in all the design of GNU Parallel has stood the test of time: It has been possible to make changes, so GNU Parallel can emulate most of the functionality of the alternatives.

Perl was chosen as the language because I wanted to be able to run GNU Parallel on old systems by simply copying a single file. And back in 2002 all old systems had Perl installed.

The only serious problem today is that parts of GNU Parallel are single-threaded, and with systems having more than 100 CPU cores you hit this limitation more often.

To me it is a bit amazing that GNU Parallel can run both on my ancient access point with 32 MB RAM and on a supercomputer with 100s of nodes and 1000s of cores.

  I think many people would be surprised to learn that GNU parallel is
  "just" a single Perl script.
    -- Peter Menzel @ptr_menzel@twitter

Contrary to many free software projects, GNU Parallel has a fixed release cycle: A new version has been released around the 22nd of every month, with very few exceptions, for the past 10 years.

The 22nd was chosen because Henrik Sandklef noticed that two of the earliest releases by coincidence happened to be released on the 22nd, so he suggested that the next release should be on the 22nd, too; and it stuck.

So in 2022 we will see the only version number consisting of only 2 different digits: 20220222.

  Have you heard of our lord and saviour GNU parallel?
    -- kxyne @Kxyne@twitter

I chose to use the date as version number.

It gives some benefits: it is easy to determine if the project is still being maintained, users can easily tell how old their version is and know which version is newer. This would be harder if the versions were named after big cats, locations in California, or characters in a movie.
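Because the version is a date written as a single number, these checks need nothing more than integer comparison. A minimal sketch, with two made-up version numbers:

```shell
# Hypothetical version numbers in GNU Parallel's YYYYMMDD format.
installed=20180222
required=20220222

# Comparing two versions is a plain integer comparison.
if [ "$installed" -lt "$required" ]; then
  verdict="installed version is older than required"
else
  verdict="installed version is new enough"
fi
echo "$verdict"

# Dropping the last four digits (month and day) leaves the release year.
release_year=${installed%????}
echo "released in $release_year"
```

For scripts that depend on a certain version, GNU Parallel also documents a --minversion option, which makes this comparison against the installed version.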

A small drawback is that it is impossible to see if there are major changes from one release to the next. Surprisingly, this drawback seems to be quite small: If the major changes are in parts you do not use, you could not care less.

  With multicore systems everywhere GNU Parallel is a must have tool.
    -- Neil H. Watson @neil_h_watson@twitter

Each version also carries a code name. The name is typically inspired by current events. So for most releases you would not be able to predict the name 2 months in advance. Can you tell which event each release refers to? https://savannah.gnu.org/news/?group=parallel

  Today I'm grateful for GNU parallel, especially with the --colsep and
  --jobs parameters #GiveThanks
    -- Erin Young @ErinYoun@twitter

The past 20 years of developing GNU Parallel have forced me to learn details about UNIX and shell programming that I really would prefer not to have learned. Like: Why is the limit for execve not standardized? Why do shells each have their own way of accessing the environment? And all sorts of race conditions.

But it has also given insight into some interesting problems that people try to solve.

I expect the next 20 years of GNU Parallel will see less development: It seems GNU Parallel has reached a stable level, where new features mostly will be for very specialized cases.

In April 2011 GNU Parallel was adopted as a GNU tool. We had prepared celebrations for the 10 year anniversary as a GNU tool in 2021, but COVID put a stop to that. This, unfortunately, also goes for the 20th birthday on 2022-01-06.

But you should feel free to celebrate the birthday by re-posting this article with #gnuparallel

GNU Parallel - For people who live life in the parallel lane.

Copenhagen, 2022-01-06, Ole Tange, Author of GNU Parallel.