In this section we will review how Gnuastro manages your input data in your system’s memory. Knowing this can help you optimize your usage (in speed and memory consumption) when the data volume is large and approaches, or exceeds, your available RAM (usually in various calls to multiple programs simultaneously). But before diving into the details, let’s have a short basic introduction to memory in general and in particular the types of memory most relevant to this discussion.
Input datasets (that are later fed into programs for analysis) are commonly first stored in non-volatile memory. This type of memory does not need a constant power supply to keep the data and is therefore primarily aimed at long-term storage, like HDDs or SSDs. So data in this type of storage is preserved when you turn off your computer. But by its nature, non-volatile memory is much slower, in reading or writing, than the speed at which CPUs can process the data. Thus relying on this type of memory alone would create a bad bottleneck in the input/output (I/O) phase of any processing.
The first step to decrease this bottleneck is to have a faster storage space, but with a much more limited storage volume. For this type of storage, computers have a Random Access Memory (or RAM). RAM is classified as a volatile memory because it needs a constant flow of electricity to keep the information. In other words, the moment power is cut off, all the stored information in your RAM is gone (hence the “volatile” name). But thanks to that constant supply of power, it can access any random address with equal (and very high!) speed.
Hence, the general/simplistic way that programs deal with memory is the following (this is general to almost all programs, not just Gnuastro’s): 1) Load/copy the input data from the non-volatile memory into RAM. 2) Use the copy of the data in RAM as input for all the internal processing as well as the intermediate data that is necessary during the processing. 3) Finally, when the analysis is complete, write the final output data back into non-volatile memory, and free/delete all the used space in the RAM (the initial copy and all the intermediate data). Usually the RAM is most important for the data of the intermediate steps (that you never see as a user of a program!).
When the input dataset(s) to a program are small (compared to the available space in your system’s RAM at the moment it is run) Gnuastro’s programs and libraries follow the standard series of steps above. The only exception is that deleting the intermediate data is not only done at the end of the program. As soon as an intermediate dataset is no longer necessary for the next internal steps, the space it occupied is deleted/freed. This allows Gnuastro programs to minimize their usage of your system’s RAM over the full running time.
The situation gets complicated when the datasets are large (compared to your available RAM when the program is run). For example, if a dataset is half the size of your system’s available RAM, and the program’s internal analysis needs three or more intermediately processed copies of it at one moment in its analysis, there will not be enough RAM to keep those higher-level intermediate data. In such cases, programs that do not do any memory management will crash. But fortunately Gnuastro’s programs do have a memory management plan for such situations.
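As a rough illustration of the arithmetic in this scenario, here is a minimal shell sketch. The 16 GB figure for the available RAM is purely an assumed value for illustration; replace it with what your own system reports.

```shell
# Hypothetical number: assume 16 GB of RAM is available right now.
avail_ram=16000000000
# The dataset is half the size of the available RAM.
dataset=$((avail_ram / 2))
# The analysis needs three intermediate copies at the same moment.
copies=3
needed=$((dataset * copies))
if [ "$needed" -gt "$avail_ram" ]; then
    echo "Need $needed bytes, but only $avail_ram are available."
fi
```

With these assumed numbers the intermediate copies alone need one and a half times the available RAM, before even counting the input itself.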
When the necessary amount of space for an intermediate dataset cannot be allocated in the RAM, Gnuastro’s programs will not use the RAM for that dataset at all. They will use the “memory-mapped file” concept in modern operating systems to create a randomly-named file in your non-volatile memory and use that in place of the RAM. That file will have the exact size (in bytes) of that intermediate dataset. Any time the program needs that intermediate dataset, the operating system will directly go to that file, and bypass your RAM. As soon as that file is no longer necessary for the analysis, it will be deleted. But as mentioned above, non-volatile memory has much slower I/O speed than the RAM. Hence in such situations, the programs will become noticeably slower (sometimes by a factor of 10, depending on your non-volatile memory speed).
Because of the drop in I/O speed (and thus the speed of your running program), the moment that any to-be-allocated dataset is memory-mapped, Gnuastro’s programs and libraries will notify you with a descriptive statement like below (can happen in any phase of their analysis). It shows the location of the memory-mapped file, its size, complemented with a small description of the cause, a pointer to this section of the book for more information on how to deal with it (if necessary), and what to do to suppress it.
astarithmetic: ./gnuastro_mmap/Fu7Dhs: temporary memory-mapped file (XXXXXXXXXXX bytes) created for intermediate data that is not stored in RAM (see the "Memory management" section of Gnuastro's manual for optimizing your project's memory management, and thus speed). To disable this warning, please use the option '--quiet-mmap'
Finally, when the intermediate dataset is no longer necessary, the program will automatically delete it and notify you with a statement like this:
astarithmetic: ./gnuastro_mmap/Fu7Dhs: deleted
To disable these messages, you can run the program with --quietmmap, or set the quietmmap variable in the allocating library function to be non-zero.
An important component of these messages is the name of the memory-mapped file. Knowing that the file has been deleted is important for the user if the program crashes for any reason: internally (for example, a parameter is given wrongly) or externally (for example, you mistakenly kill the running job). In the event of a crash, the memory-mapped files will not be deleted, so you have to delete them manually: they are usually large and, after successive crashes, they can soon fill your storage if left in place.
This brings us to managing the memory-mapped files in your non-volatile memory. In other words: knowing where they are saved, or intentionally placing them in different places of your file system, or deleting them when necessary. As the examples above show, memory-mapped files are stored in a sub-directory of the running directory called gnuastro_mmap. If this directory does not exist, Gnuastro will automatically create it when memory mapping becomes necessary. Alternatively, it may happen that the gnuastro_mmap sub-directory exists and is not writable, or it cannot be created. In such cases, the memory-mapped file for each dataset will be created in the running directory with a gnuastro_mmap_ prefix.
Therefore one easy way to delete all memory-mapped files in case of a crash is to delete everything within the sub-directory (first command below), or all files starting with this prefix (second command):
rm -f gnuastro_mmap/*
rm -f gnuastro_mmap_*
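Before deleting, it can help to first see which leftover files exist and how large they are. The following is only a sketch of that workflow: it runs in a throw-away directory and simulates the leftovers with empty files whose names are hypothetical. It is safest to clean up only when no Gnuastro program is still running in that directory.

```shell
# Demonstration in a throw-away directory (created only for this sketch).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Simulate leftovers from a crashed run (file names are hypothetical).
mkdir gnuastro_mmap
touch gnuastro_mmap/Fu7Dhs gnuastro_mmap_Ab3Xyz

# First see what is there and how much space it takes...
ls -lh gnuastro_mmap/ gnuastro_mmap_*

# ...then delete the files inside the sub-directory and the prefixed files.
rm -f gnuastro_mmap/*
rm -f gnuastro_mmap_*
```

The gnuastro_mmap directory itself is left in place, so later runs can reuse it.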
A much more common issue when dealing with memory-mapped files is their location. For example, you may be running a program in a partition that is hosted by an HDD. But you also have another partition on an SSD (which has much faster I/O). So you want your memory-mapped files to be created in the SSD to speed up your processing. In this scenario, you want your project source directory to only contain your plain-text scripts and you want your project’s built products (even the temporary memory-mapped files) to be built in a different location because they are large; thus I/O speed becomes important.
To host the memory-mapped files in another location (with fast I/O), you can make gnuastro_mmap a symbolic link to that location. For example, let’s assume you want your memory-mapped files to be stored in /path/to/dir/for/mmap. All you have to do is to run the following command before your Gnuastro analysis command(s).
ln -s /path/to/dir/for/mmap gnuastro_mmap
The programs will delete a memory-mapped file when it is no longer needed, but they will not delete the gnuastro_mmap directory that hosts them. So if your project involves many Gnuastro programs (possibly called in parallel) and you want your memory-mapped files to be in a different location, you just have to make the symbolic link above once at the start, and all the programs will use it if necessary.
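In a project script you may want to create the link only once and confirm that it points to a writable location. The sketch below is one possible way to do that; the two temporary directories merely stand in for a fast disk and a project directory and are assumptions for the sake of a runnable example.

```shell
# Throw-away directories standing in for a fast disk and a project
# directory (both hypothetical; use your own paths in practice).
fastdir=$(mktemp -d)
project=$(mktemp -d)
cd "$project" || exit 1

# Create the link only if nothing named gnuastro_mmap is there yet.
if [ ! -e gnuastro_mmap ] && [ ! -L gnuastro_mmap ]; then
    ln -s "$fastdir" gnuastro_mmap
fi

# Confirm where the link points and that the target is writable.
ls -ld gnuastro_mmap
[ -w gnuastro_mmap ] && echo "gnuastro_mmap is writable"
```

Checking for an existing gnuastro_mmap first avoids an error from ln when the script is run a second time in the same directory.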
Another memory-management scenario that may happen is this: you do not want a Gnuastro program to allocate internal datasets in the RAM at all. For example, the speed of your Gnuastro-related project does not matter at that moment, and you have higher-priority jobs that are being run at the same time which need to have RAM available. In such cases, you can use the --minmapsize option that is available in all Gnuastro programs (see Processing options). Any intermediate dataset that has a size larger than the value of this option will be memory-mapped, even if there is space available in your RAM. For example, if you want any dataset larger than 100 megabytes to be memory-mapped, use --minmapsize=100000000 (8 zeros!).
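To get a feeling for which datasets would cross such a threshold, the sketch below estimates the size of a few images as width × height × bytes per pixel (4 bytes for 32-bit and 8 bytes for 64-bit floating point). The image dimensions are hypothetical and were chosen only for illustration.

```shell
# Value given to --minmapsize in the example above (100 megabytes).
minmapsize=100000000
mapped=0

# Hypothetical intermediate images: "width height bytes-per-pixel".
for img in "6000 6000 4" "4000 4000 4" "4000 4000 8"; do
    set -- $img
    bytes=$(($1 * $2 * $3))
    if [ "$bytes" -gt "$minmapsize" ]; then
        echo "$1 x $2 ($3 bytes/pixel): $bytes bytes, memory-mapped"
        mapped=$((mapped + 1))
    else
        echo "$1 x $2 ($3 bytes/pixel): $bytes bytes, RAM attempted first"
    fi
done
```

Note how the same 4000 by 4000 image falls below the threshold in 32-bit floating point but above it in 64-bit floating point.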
You should not set the value of --minmapsize to be too small, otherwise even small intermediate values (that are usually very numerous) in the program will be memory-mapped. However the kernel can only host a limited number of memory-mapped files at every moment (by all running programs combined). For example, in the default Linux kernel on GNU/Linux operating systems this limit is roughly 64000. If the total number of memory-mapped files exceeds this number, all the programs using them will crash. Gnuastro’s programs will warn you if your given value is too small and may cause a problem later.
Actually, the default behavior for Gnuastro’s programs (to only use memory-mapped files when there is not enough RAM) is a side-effect of --minmapsize. The pre-defined value of this option is an extremely large value in the lowest-level Gnuastro configuration file (the installed gnuastro.conf described in Configuration file precedence). This value is larger than the largest possible available RAM. You can check by running any Gnuastro program with the -P option. Because no dataset will be larger than this, by default the programs will first attempt to use the RAM for temporary storage. But if writing in the RAM fails (for any reason, mainly due to lack of available space), then a memory-mapped file will be created.
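As a small sketch of that check, the commands below print only the line for this option from the output of -P. They assume Gnuastro is installed and use astarithmetic merely as an example program; if it is not found, a note is printed instead.

```shell
# Show the value of --minmapsize that astarithmetic would use
# (assumes Gnuastro is installed; otherwise a note is printed).
if command -v astarithmetic >/dev/null 2>&1; then
    report=$(astarithmetic -P | grep minmapsize)
else
    report="astarithmetic was not found in your PATH"
fi
echo "$report"
```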
If you need to host more memory-mapped files at one moment than the default Linux kernel limit allows, you need to build your own customized Linux kernel.