In the previous sub-sections, we went through a series of steps: downloading the necessary datasets (in Setup and data download), detecting the objects in the image, and finally selecting a particular subset of them to inspect visually (in Reddest clumps, cutouts and parallelization). To benefit most from this sub-section, please go through the previous sub-sections first; if you have not actually done them, we recommend doing/running them before continuing here.
Each step of the sub-sections above involved several commands on the command-line.
Therefore, if you want to reproduce the previous results (for example, to change only one part and see its effect), you will have to go through all the sections above and read through them again.
If you have run the commands recently, you may also have them in the history of your shell (command-line environment).
You can see many of your previous commands on the shell (even if you have closed the terminal) with the history command, like this:

$ history

Try it in your terminal to see for yourself. By default in GNU Bash, it shows the last 500 commands. You can also save this “history” of previous commands to a file using shell redirection (to keep it beyond your next 500 commands), with this command:
$ history > my-previous-commands.txt
This is a good way to temporarily keep track of every single command you ran. But mixed in with the useful commands, you will have many extra ones: tests you did before/after the good output of a step (the one you decided to continue working with), or unrelated jobs you had to do in the middle of this project. Because of these impurities, reading this full history after a few days (when you have forgotten the context: tests you did not end up using, or unrelated jobs) will be very frustrating.
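As a sketch of how to dig through such a saved history, you can filter it with grep. The file contents and command names below are only illustrative (in practice you would grep the real my-previous-commands.txt saved above):

```shell
# Build a small stand-in for a saved history file, so this example is
# self-contained (in practice, this file comes from 'history > ...').
printf '%s\n' "ls" \
              "astconvertt a.txt --output=a.fits" \
              "rm tmp.txt" \
              "aststatistics a.fits" > my-previous-commands.txt

# Keep only the commands that called a Gnuastro program (all Gnuastro
# program names start with 'ast').
grep '^ast' my-previous-commands.txt > useful-commands.txt
cat useful-commands.txt
```

This only helps when you remember a keyword from the command you are looking for; it does not solve the underlying problem of keeping a clean record, which is what scripts (below) are for.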
Keeping the final commands that were used in each step of an analysis is a common problem for anyone doing something serious with a computer. But simply keeping the most important commands in a text file is not enough: the small steps in the middle (like making a directory to keep the outputs of one step) are also important. In other words, the only way you can be sure that you are in control of your processing (and actually understand how you produced your final result) is to run the commands automatically.
Fortunately, typing commands interactively with your fingers is not the only way to operate the shell. The shell can also take its orders/commands from a plain-text file, which is called a script. When given a script, the shell will read it line-by-line as if you have actually typed it manually.
Let’s continue with an example: try typing the commands below in your shell.
With these commands we are making a text file (a.txt) containing a simple 3x3 matrix, converting it to a FITS image, and computing its basic statistics.
After the first three commands, open a.txt with a text editor to actually see the values we wrote in it; after the fourth, open the FITS file to see the matrix as an image.
a.txt is created through the shell’s redirection feature: ‘>’ overwrites the existing contents of a file, and ‘>>’ appends the new contents after the old contents.
$ echo "1 1 1"  > a.txt
$ echo "1 2 1" >> a.txt
$ echo "1 1 1" >> a.txt
$ astconvertt a.txt --output=a.fits
$ aststatistics a.fits
To automate these series of commands, you should put them in a text file. But that text file must have two special features: 1) It should tell the shell what program should interpret the script. 2) The operating system should know that the file can be directly executed.
For the first, Unix-like operating systems define the shebang concept (also known as sha-bang or hashbang).
In the shebang convention, the first two characters of a file should be ‘#!’.
When confronted with these characters, the script will be interpreted with the program that follows them.
In this case, we want to write a shell script and the most common shell program is GNU Bash which is installed in /bin/bash.
So the first line of your script should be ‘#!/bin/bash’.
It may happen (rarely) that GNU Bash is in another location on your system.
In other cases, you may prefer to use a non-standard version of Bash installed in another location (that has higher priority in your PATH, see Installation directory).
In such cases, you can use the ‘#!/usr/bin/env bash’ shebang instead.
Through the env program, this shebang will look in your PATH and use the first bash it finds to run your script.
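For example, the commands below build and run a minimal script with the env shebang (the file name env-shebang-demo.sh is arbitrary, and the script body is written with a here-document purely for this demonstration):

```shell
# Write a two-line script that uses the 'env' shebang.
cat > env-shebang-demo.sh <<'EOF'
#!/usr/bin/env bash
echo "interpreted by Bash version: $BASH_VERSION"
EOF

# Make it executable (the executable flag is explained later in this
# sub-section) and run it.
chmod +x env-shebang-demo.sh
./env-shebang-demo.sh
```

Whichever bash comes first in your PATH will run the script and report its version.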
But for simplicity in the rest of the tutorial, we will continue with the ‘#!/bin/bash’ shebang.
Using your favorite text editor, make a new empty file, let’s call it my-first-script.sh.
Write the GNU Bash shebang (above) as its first line.
After the shebang, copy the series of commands we ran above.
Just note that the ‘$’ sign at the start of every line above is the prompt of the interactive shell (you never actually typed it, remember?).
Therefore, commands in a shell script should not start with a ‘$’.
Once you add the commands, close the text editor and run the
cat command to confirm its contents.
It should look like the example below.
Recall that you should only type the line that starts with a ‘$’; the lines without a ‘$’ are printed automatically on the command-line (they are the contents of your script).
$ cat my-first-script.sh
#!/bin/bash
echo "1 1 1"  > a.txt
echo "1 2 1" >> a.txt
echo "1 1 1" >> a.txt
astconvertt a.txt --output=a.fits
aststatistics a.fits
The script contents are now ready, but to run it, you should activate the script file’s executable flag.
In Unix-like operating systems, every file has three types of flags: read (or r), write (or w) and execute (or x).
To change a file’s flags, you should use the chmod (for “change mode”) command.
To activate a flag, you put a ‘+’ before the flag character (for example, +x).
To deactivate it, you put a ‘-’ before it (for example, -x).
In this case, you want to activate the script’s executable flag, so you should run:
$ chmod +x my-first-script.sh
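You can inspect a file’s flags with ls -l. Here is a small sketch with a throw-away file name; the exact flag string depends on your umask, so the values in the comments are only typical examples:

```shell
# Make a small file; by default it is not executable.
echo "echo hello" > flag-demo.sh
ls -l flag-demo.sh      # no 'x' in the flags column, e.g. -rw-r--r--

# Activate the executable flag and check again.
chmod +x flag-demo.sh
ls -l flag-demo.sh      # an 'x' now appears, e.g. -rwxr-xr-x
```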
Your script is now ready to run/execute the series of commands.
To run it, you should call it while specifying its location in the file system.
Since you are currently in the same directory as the script, it is easiest to use relative addressing like below (where ‘./’ means the current directory).
But before running your script, first delete the two a.txt and a.fits files that were created when you interactively ran the commands.
$ rm a.txt a.fits
$ ls
$ ./my-first-script.sh
$ ls
The script immediately prints the statistics after re-doing all the previous steps.
With the last ls, you see that it automatically re-built the a.txt and a.fits files; open them and have a look at their contents.
An extremely useful feature of shell scripts is that the shell will ignore anything after a ‘#’ character on each line.
You can thus add descriptions/comments to the commands and make them much more useful for the future.
For example, after adding comments, your script might look like this:
$ cat my-first-script.sh
#!/bin/bash
# This script is my first attempt at learning to write shell scripts.
# As a simple series of commands, I am just building a small FITS
# image, and calculating its basic statistics.

# Write the matrix into a file.
echo "1 1 1"  > a.txt
echo "1 2 1" >> a.txt
echo "1 1 1" >> a.txt

# Convert the matrix to a FITS image.
astconvertt a.txt --output=a.fits

# Calculate the statistics of the FITS image.
aststatistics a.fits
Is this not much easier to read now? Comments provide human-friendly context for the raw commands. At the time you write a script, comments may seem like extra effort that slows you down. But in a year, you will have forgotten almost everything about your script, and you will appreciate the effort so much! Think of the comments as an email to your future self, and always put a well-written description of the context/purpose (most importantly, things that are not directly clear from reading the commands) in your scripts.
The example above was a very basic and mostly redundant series of commands, just to show the concepts behind scripts.
You can put any (arbitrarily long and complex) series of commands in a script by following the two rules: 1) add a shebang, and 2) enable the executable flag.
In fact, as you continue your own research projects, you will find that any time you are dealing with more than two or three commands, keeping them in a script (and modifying and re-running that script) is much easier and more future-proof than typing the commands directly on the command-line and relying on things like history.
As a more realistic example, let’s have a look at a script that will do the steps of Setup and data download and Dataset inspection and cropping. In particular, note how often we use variables to avoid repeating fixed strings of characters (usually file/directory names). This greatly helps in scaling up your project, and in avoiding hard-to-find bugs caused by typos in those fixed strings.
$ cat gnuastro-tutorial-1.sh
#!/bin/bash

# Download the input datasets
# ---------------------------
#
# The default file names have this format (where `FILTER' differs for
# each filter):
#   hlsp_xdf_hst_wfc3ir-60mas_hudf_FILTER_v1_sci.fits
# To make the script easier to read, a prefix and suffix variable are
# used to sandwich the filter name into one short line.
dldir=download
xdfsuffix=_v1_sci.fits
xdfprefix=hlsp_xdf_hst_wfc3ir-60mas_hudf_
xdfurl=http://archive.stsci.edu/pub/hlsp/xdf

# The file name and full URLs of the input data.
f105w_in=$xdfprefix"f105w"$xdfsuffix
f160w_in=$xdfprefix"f160w"$xdfsuffix
f105w_url=$xdfurl/$f105w_in
f160w_url=$xdfurl/$f160w_in

# Go into the download directory and download the images there,
# then come back up to the top running directory.
mkdir $dldir
cd $dldir
wget $f105w_url
wget $f160w_url
cd ..

# Only work on the deep region
# ----------------------------
#
# To help in readability, each vertex of the deep/flat field is stored
# as a separate variable. They are then merged into one variable to
# define the polygon.
flatdir=flat-ir
vertice1="53.187414,-27.779152"
vertice2="53.159507,-27.759633"
vertice3="53.134517,-27.787144"
vertice4="53.161906,-27.807208"
f105w_flat=$flatdir/xdf-f105w.fits
f160w_flat=$flatdir/xdf-f160w.fits
deep_polygon="$vertice1:$vertice2:$vertice3:$vertice4"

mkdir $flatdir
astcrop --mode=wcs -h0 --output=$f105w_flat \
        --polygon=$deep_polygon $dldir/$f105w_in
astcrop --mode=wcs -h0 --output=$f160w_flat \
        --polygon=$deep_polygon $dldir/$f160w_in
The first thing you may notice is that even if you already have the downloaded input images, this script will always try to re-download them.
Also, if you re-run the script, you will notice that mkdir prints an error message that the download directory already exists.
Therefore, the script above is not ideal as-is, and some modifications are necessary to make it more generally useful.
Here are some general tips that are often very useful when writing scripts:
By default, if a command in a script crashes (aborts and fails to do what it was meant to do), the script will continue on to the next command. In GNU Bash, you can tell the shell to stop a script in the case of a crash by adding this line at the start of your script:

set -e
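A small sketch of the effect (the file names here are just for demonstration): without ‘set -e’ the final echo would still run after the failed ls; with it, the script aborts as soon as the failure happens.

```shell
# Write a script that crashes in the middle ('ls' of a non-existent
# file fails, and 'set -e' makes the script stop right there).
cat > abort-demo.sh <<'EOF'
#!/bin/bash
set -e
echo "before the crash"
ls file-that-does-not-exist.txt
echo "after the crash (never printed)"
EOF
chmod +x abort-demo.sh

# The '||' keeps this demonstration itself from aborting.
./abort-demo.sh || echo "the script aborted early"
```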
Conditionals are a very useful feature of scripts. One common conditional is to check if a file exists or not. Assuming the file’s name is FILENAME, you can check its existence (to avoid re-doing the commands that build it) like this:
if [ -f FILENAME ]; then
    echo "FILENAME exists"
else
    # Some commands to generate the file
    echo "done" > FILENAME
fi
To check the existence of a directory instead of a file, use -d instead of -f.
To negate a conditional, use ‘!’. Note that conditionals can also be written in one line (useful when they are short).
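For example (the file names are only illustrative):

```shell
# A file that exists, and one that does not.
touch report.txt

# One-line conditionals: plain, and negated with '!'.
if   [ -f report.txt  ]; then echo "report.txt exists";     fi
if ! [ -f missing.txt ]; then echo "missing.txt is absent"; fi
```

Both messages are printed: the first conditional is true because report.txt was just created, and the second is true because missing.txt was never made.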
One common scenario where you will need to check the existence of a directory is when you are making it: the default mkdir command will crash if the desired directory already exists.
On some systems (including GNU/Linux distributions), mkdir has options to deal with such cases. But if you want your script to be portable, it is best to do the check yourself, like below:
if ! [ -d DIRNAME ]; then mkdir DIRNAME; fi
Avoid changing directories (with ‘cd’) within the script:
You can directly read and write files within other directories.
Using cd to enter a directory (like what we did above, around the wget commands), running commands there, and coming back out is extra work and not good practice.
This is because the running directory is part of the environment of a command.
You can simply give the directory name before the input and output file names to use them from anywhere on the file system.
See the same wget commands in the improved script below for an example.
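As a small sketch of this point (the directory and file names are only examples), everything can be done from the top directory by prefixing file names with their directory, with no ‘cd’ anywhere:

```shell
# Make the directory if necessary (no 'cd' needed).
if ! [ -d download ]; then mkdir download; fi

# Write and read a file inside it, directly from the top directory.
echo "pretend this is a downloaded image" > download/image.txt
cat download/image.txt
```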
Copyright notice: A very important thing to put at the top of your script is a one-line description of what it does and its copyright information (see the example below). Here, we specify who the author(s) of the script are, in which years, and under what license others are allowed to use this file. Without it, your script has no credibility or identity, and others cannot trust, use or acknowledge your work on it. Since Gnuastro is itself licensed under a copyleft license (see Your rights and GNU Gen. Pub. License v3 or GNU GPL; the license finishes with a template on how to add it), any script that uses Gnuastro should also have a copyleft license: we recommend the same GNU GPL v3+, like below.
Taking the above points into consideration, we can write a better version of the script above. Please compare this script with the previous one carefully to spot the differences. These are very important points that you will definitely encounter in your own research, and knowing them can greatly help your productivity, so pay close attention (even to the comments).
#!/bin/bash
# Script to download and keep the deep region of the XDF survey.
#
# Copyright (C) 2021-2022 Original Author <email@example.com>
# Copyright (C) 2022 Your Name <firstname.lastname@example.org>
#
# This script is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This script is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
# General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with Gnuastro. If not, see <http://www.gnu.org/licenses/>.

# Abort the script in case of an error.
set -e

# Download the input datasets
# ---------------------------
#
# The default file names have this format (where `FILTER' differs for
# each filter):
#   hlsp_xdf_hst_wfc3ir-60mas_hudf_FILTER_v1_sci.fits
# To make the script easier to read, a prefix and suffix variable are
# used to sandwich the filter name into one short line.
dldir=download
xdfsuffix=_v1_sci.fits
xdfprefix=hlsp_xdf_hst_wfc3ir-60mas_hudf_
xdfurl=http://archive.stsci.edu/pub/hlsp/xdf

# The file name and full URLs of the input data.
f105w_in=$xdfprefix"f105w"$xdfsuffix
f160w_in=$xdfprefix"f160w"$xdfsuffix
f105w_url=$xdfurl/$f105w_in
f160w_url=$xdfurl/$f160w_in

# Make sure the download directory exists, and download the images
# only if they have not been downloaded already.
if ! [ -d $dldir ]; then mkdir $dldir; fi
if ! [ -f $dldir/$f105w_in ]; then wget $f105w_url -O $dldir/$f105w_in; fi
if ! [ -f $dldir/$f160w_in ]; then wget $f160w_url -O $dldir/$f160w_in; fi

# Crop out the deep region
# ------------------------
#
# To help in readability, each vertex of the deep/flat field is stored
# as a separate variable. They are then merged into one variable to
# define the polygon.
flatdir=flat-ir
vertice1="53.187414,-27.779152"
vertice2="53.159507,-27.759633"
vertice3="53.134517,-27.787144"
vertice4="53.161906,-27.807208"
f105w_flat=$flatdir/xdf-f105w.fits
f160w_flat=$flatdir/xdf-f160w.fits
deep_polygon="$vertice1:$vertice2:$vertice3:$vertice4"

if ! [ -d $flatdir ]; then mkdir $flatdir; fi

if ! [ -f $f105w_flat ]; then
    astcrop --mode=wcs -h0 --output=$f105w_flat \
            --polygon=$deep_polygon $dldir/$f105w_in
fi
if ! [ -f $f160w_flat ]; then
    astcrop --mode=wcs -h0 --output=$f160w_flat \
            --polygon=$deep_polygon $dldir/$f160w_in
fi
When the script is to be run by the same shell that is calling it (as in this tutorial), the shebang is optional. But it is still recommended, because it ensures that even if the user is not using GNU Bash, the script will still be run in GNU Bash: given the differences between various shells, writing truly portable shell scripts that can be run by many shell programs/implementations is not easy (and sometimes not possible!).