GNU Astronomy Utilities



13.1 Why C programming language?

Currently, the programming languages commonly used in scientific applications are C++ (279), Java (280), Python (281), and Julia (282) (a newcomer that is swiftly gaining ground). One of the main reasons for choosing them is their high-level abstractions. However, GNU Astronomy Utilities is written entirely in the C programming language (283). The reasons can be summarized as simplicity, portability and efficiency/speed. All three are very important in scientific software and are discussed below.

Simplicity is best demonstrated by comparing the main books of C++ and C. The “C programming language” (284) book, written by the authors of C, is only 286 pages long and covers a very good fraction of the language; it has also remained unchanged since 1988. C is the main programming language of nearly all operating systems and there is no plan for any significant update. On the other hand, the most recent “C++ programming language” (285) book, also written by its author, has 1366 pages and its fourth edition came out in 2013! As discussed in Gnuastro manifesto: Science and its tools, it is very important for other scientists to be able to readily read the code of a program at will, with minimum requirements.

In C++ or Java, inheritance in the object-oriented programming paradigm and the languages' internal functions make the code very easy to write for a programmer who is deeply invested in those objects and understands all their relations well. But it simultaneously makes the program extremely hard to read for a first-time reader (a curious scientist who only wants to know how a small step was done). Before understanding the methods, the scientist has to invest a lot of time and energy in understanding those objects and their relations. In C, however, everything is done with basic language types (for example ints or floats) and pointers to them to define arrays. So when an outside reader is only interested in one part of the program, that part is all they have to understand.
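As a minimal sketch of what "basic types and their pointers" means in practice, the function below averages an array of pixel values. The name pixel_mean and its signature are hypothetical and chosen only for illustration; it is not a Gnuastro function. Everything a reader needs in order to follow it is on the page: a pointer to floats, a length, and a loop.

```c
#include <stddef.h>

/* Mean of `n` pixel values stored in a plain array of floats.
   Hypothetical helper for illustration only (not part of Gnuastro). */
double
pixel_mean(const float *pixels, size_t n)
{
  double sum = 0.0;
  size_t i;

  /* An empty array has no mean; return 0.0 by convention here. */
  if (n == 0) return 0.0;

  /* Accumulate in double precision to limit rounding error. */
  for (i = 0; i < n; ++i)
    sum += pixels[i];

  return sum / n;
}
```

A caller only has to declare an array such as float img[4] = {1.0f, 2.0f, 3.0f, 6.0f}; and call pixel_mean(img, 4). No class hierarchy has to be understood first.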

It has recently also become common to write scientific software in Python, or in a combination of Python with C or C++. Python is a high-level scripting language that does not need compilation. It is very useful when you want to do something on the go and do not want to be halted by the troubles of compiling, linking, memory checking, etc. When the datasets are small and the job is temporary, this ability of Python is great and is highly encouraged. A very good example is plotting, in which Python is undoubtedly one of the best.

But as the datasets increase in size and the processing becomes more complicated, the speed of Python scripts decreases significantly. So when the program does not change too often and is widely used in a large community, mostly on large datasets (like astronomical images), using Python will waste a lot of valuable research-hours. It is possible to wrap C or C++ functions with Python to fix the speed issue. But this creates further complexity, because the interested scientist has to master two programming languages and their connection (which is not trivial).

Like C++, Python is object-oriented, so as explained above, a high level of experience with a particular program is needed to reasonably understand its inner workings. To make things worse, since Python is mainly for on-the-go programming (286), it can undergo significant changes. One recent example is how Python 2.x and Python 3.x are not compatible. Many research teams that invested heavily in Python 2.x cannot benefit from Python 3.x or future versions any more. Some converters are available, but since they are automatic, many complications might arise in the conversion (287). If a research project begins using Python 3.x today, there is no telling how compatible its investments will be when Python 4.x or 5.x comes out.

Java is also fully object-oriented, but uses a different paradigm: its compilation generates hardware-independent bytecode, and a Java Virtual Machine (JVM) is required to actually execute this bytecode on a computer. Java has also evolved with time while trying to remain backward compatible, but this evolution inevitably required discontinuities and the replacement of a few Java components, which were first declared deprecated and then removed from later versions.

This stems from the core principles of high-level languages like Python or Java: they evolve significantly on a scale of roughly 5 to 10 years. They are therefore useful when you want to solve a short-term problem and you are ready to pay the high cost of keeping your software up to date with all the changes in the language. This is fine for private companies, but usually too expensive for scientific projects that have limited funding for a fixed period. As a result, the reproducibility of the result (the ability to regenerate the result in the future, which is a core principle of any scientific result) and the reusability of all the investments that went into the scientific software will be lost to future generations! Rebuilding all the dependencies of software written in an obsolete language is not easy, and may not even be possible. Future-proof code (as long as current operating systems are in use) is therefore written in C.

The portability of C is best demonstrated by the fact that C++, Java and Python are part of the C-family of programming languages, which also includes Julia, Perl, and many other languages. C libraries can be immediately included in C++, and it is easy to write wrappers for them in all C-family programming languages. This allows other scientists to benefit from C libraries using any C-family language that they prefer. As a result, Gnuastro’s library is already usable in C and C++, and wrappers will be (288) added for higher-level languages like Python, Julia and Java.
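The usual way a C library is made directly usable from C++ can be sketched as follows. The names MYLIB_H and mylib_to_radians are hypothetical and exist only for this illustration; this is the common extern "C" idiom, not a copy of Gnuastro's own headers. In a real project the declaration would live in a header file and the definition in a separate .c file; they are shown together here for brevity.

```c
/* --- mylib.h (hypothetical header) --------------------------------- */
#ifndef MYLIB_H
#define MYLIB_H

/* A C++ compiler defines __cplusplus; asking for C linkage stops it
   from mangling the function names, so they match the C library. */
#ifdef __cplusplus
extern "C" {
#endif

double mylib_to_radians(double degrees);

#ifdef __cplusplus
}
#endif

#endif /* MYLIB_H */

/* --- mylib.c (hypothetical implementation) ------------------------- */
double
mylib_to_radians(double degrees)
{
  /* Pi is written out because M_PI is not part of ISO C. */
  return degrees * 3.14159265358979323846 / 180.0;
}
```

A C compiler never sees the extern "C" lines, so the same header serves both languages. Wrappers for other languages typically call the same compiled function through their own foreign-function interface.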

The final reason is speed. This is another very important aspect of C which is not independent of simplicity (the first reason discussed above). The abstractions provided by the higher-level languages (which also make learning them harder for a newcomer) come at the cost of speed. Since C is a low-level language (289) (closer to the hardware), it has direct access to the CPU (290), is generally considered to be faster in its execution, and is much less complex for both the human reader and the computer. The benefits of simplicity for a human were discussed above. Simplicity for the computer translates into more efficient (faster) programs. This creates a much closer relation between the scientist/programmer (or their program) and the actual data and processing. The GNU coding standards (291) also encourage the use of C over all other languages when generality of usage and “high speed” is desired.
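The footnote on long double can be checked on any given machine with the standard <float.h> macros. The sketch below is only an illustration with hypothetical function names. The result depends on the platform: the C standard only guarantees that long double is at least as precise as double, so a mantissa of 64 bits or more is common (for example on x86 with GCC) but not universal.

```c
#include <float.h>

/* Number of mantissa (significand) bits in a double on this platform,
   assuming the usual radix-2 floating point (FLT_RADIX == 2). */
int
double_mantissa_bits(void)
{
  return DBL_MANT_DIG;
}

/* Same for long double; never smaller than the value for double. */
int
long_double_mantissa_bits(void)
{
  return LDBL_MANT_DIG;
}

/* Non-zero when long double offers a mantissa of at least 64 bits. */
int
has_64bit_mantissa(void)
{
  return LDBL_MANT_DIG >= 64;
}
```

Printing the two values with printf("%d %d\n", double_mantissa_bits(), long_double_mantissa_bits()); shows at a glance whether the extra precision is available on the machine at hand.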


Footnotes

(279)

https://isocpp.org/

(280)

https://en.wikipedia.org/wiki/Java_(programming_language)

(281)

https://www.python.org/

(282)

https://julialang.org/

(283)

https://en.wikipedia.org/wiki/C_(programming_language)

(284)

Brian Kernighan, Dennis Ritchie. The C programming language. Prentice Hall, Inc., Second edition, 1988. It is also commonly known as K&R and is based on the ANSI C and ISO C90 standards.

(285)

Bjarne Stroustrup. The C++ programming language. Addison-Wesley Professional, Fourth edition, 2013.

(286)

Note that Python is good for fast programming, not fast programs.

(287)

For example, see Jenness 2017, which describes how LSST is managing the transition.

(288)

http://savannah.gnu.org/task/?13786

(289)

Low-level languages are those that directly operate the hardware, like assembly languages. So C is actually a high-level language, but it can be considered one of the lowest-level languages among all high-level languages.

(290)

For instance, long double numbers with at least a 64-bit mantissa are not accessible in Python or Java.

(291)

http://www.gnu.org/prep/standards/