Gnuastro text table format (GNU Astronomy Utilities)

4.7.2 Gnuastro text table format

Plain text files are the most generic and portable format for tables, and the easiest to create, inspect, or edit by hand. In this format, a row ends at the new-line character, so when you view the file in a text editor, every row occupies one line. The delimiters (the characters separating the columns) are the white space characters (space, horizontal tab, vertical tab) and the comma (,). The only further requirement is that all rows/lines must have the same number of columns.

The columns do not have to be exactly under each other and the rows can be arbitrarily long with different lengths. For example, the following contents in a file would be interpreted as a table with 4 columns and 2 rows, with each element interpreted as a 64-bit floating point type (see Numeric data types).

1     2.234948   128   39.8923e8
2 , 4.454        792     72.98348e7
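The reading rules so far can be sketched in a few lines of Python. This is only an illustration and not Gnuastro's own reader: the function name `read_bare_table` is hypothetical, and treating a run of several delimiters (for example ` , `) as a single separator is an assumption based on the two rows above.

```python
import re

# Delimiters named in the text: space, horizontal tab, vertical tab, comma.
# Assumption: a run of consecutive delimiters counts as one separator.
DELIMITERS = re.compile(r"[ \t\v,]+")

def read_bare_table(text):
    """Parse raw rows (no meta-data) into lists of 64-bit Python floats."""
    rows = []
    # Split on '\n' only: str.splitlines() would also break on vertical tabs,
    # which are column delimiters in this format.
    for line in text.split("\n"):
        tokens = [tok for tok in DELIMITERS.split(line) if tok]
        if not tokens:
            continue  # nothing but white space on this line
        rows.append([float(tok) for tok in tokens])
    # The format requires every row to have the same number of columns.
    if len({len(row) for row in rows}) > 1:
        raise ValueError("rows do not have the same number of columns")
    return rows

sample = "1     2.234948   128   39.8923e8\n2 , 4.454        792     72.98348e7\n"
table = read_bare_table(sample)
```

Here `table` holds two rows of four floats each, matching the interpretation described above.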

However, the example above has no other information about the columns (it is just raw data, with no meta-data). To use this table, you have to remember what the numbers in each column represent. Also, when you want to select columns, you have to count their position within the table. This can become frustrating and prone to serious errors (getting the columns wrong in your scientific project!), especially as the number of columns increases. It is also bad for sending to a colleague, because they will find it hard to remember/use the columns properly.

To solve these problems, Gnuastro’s programs/libraries do not limit you to the column’s number, see Selecting table columns. If the columns have names, units, or comments, you can also select your columns based on searches/matches in these fields, for example, see Table. Raw data like the example above also gives you no way to guide the program reading the table on how to read the numbers. As an example, the first and third columns above can be read as integer types: the first column might be an ID and the third can be the number of pixels an object occupies in an image. So there is no need to read these two columns as a 64-bit floating point type (which takes more memory, and is slower).

In the bare-minimum example above, you also cannot use strings of characters, for example, the names of filters, or some other identifier that includes non-numerical characters. In the absence of any information, only numbers can be read robustly. Even if columns with non-numerical characters were read as strings, there would still be the problem that the strings might contain a space (or any other delimiter) character in some rows. Each ‘word’ in the string would then be interpreted as a column and the program would abort with an error that the rows do not have the same number of columns.

To correct for these limitations, Gnuastro defines the following convention for storing the table meta-data along with the raw data in one plain text file. The format is primarily designed for ease of reading/writing by eye/fingers, but is also structured enough to be read by a program.

When the first non-white character in a line is #, or there are no non-white characters in it, then the line will not be considered as a row of data in the table (this is a pretty standard convention in many programs, and higher level languages). In the first case (when the first character of the line is #), the line is interpreted as a comment.
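As a minimal sketch of this rule (the function name `classify_line` is hypothetical and not part of Gnuastro), each line can be sorted into one of three kinds before any data is parsed:

```python
def classify_line(line):
    """Return 'empty', 'comment', or 'data' for one line of a text table."""
    stripped = line.strip()
    if not stripped:
        return "empty"      # no non-white characters at all
    if stripped.startswith("#"):
        return "comment"    # first non-white character is '#'
    return "data"           # everything else is a row of the table
```

Only lines classified as `data` count as rows; `comment` lines may additionally carry column meta-data.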

If the comment line starts with ‘# Column N:’, then it is assumed to contain information about column N (a number, counting from 1). Comment lines that do not start with this pattern are ignored and you can use them to include any further information you want to store with the table in the text file. The most generic column information comment line has the following format:

# Column N: NAME [UNIT, TYPE(NUM), BLANK] COMMENT

Any sequence of characters between ‘:’ and ‘[’ will be interpreted as the column name (so it can contain anything except the ‘[’ character). Anything between the ‘]’ and the end of the line is defined as a comment. Within the brackets, anything before the first ‘,’ is the units (physical units, for example, km/s, or erg/s), and anything between the first and second ‘,’ is the short type identifier (see below, and Numeric data types).

If the type identifier is not recognized, the default 64-bit floating point type will be used. The type identifier can optionally be followed by an integer within parentheses. If the parentheses are present and the integer is larger than 1, the column is assumed to be a “vector column” (which can have multiple values, for more see Vector columns).

Finally (still within the brackets), any non-white characters after the second ‘,’ are interpreted as the blank value for that column (see Blank pixels). The blank value can either be in the same type as the column (for example, -99 for a signed integer column), or any string (for example, NaN in that same column). In both cases, the values will be stored in memory as Gnuastro’s fixed blank values for each type. For floating point types, Gnuastro’s internal blank value is IEEE NaN (Not-a-Number). For signed integers, it is the smallest possible value and for unsigned integers it is the largest possible value.
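The mapping from a declared blank to the internal blank value can be sketched as follows. The helper names are hypothetical, the identifiers are assumed to have the form of a letter (`i`, `u`, or `f`) followed by a bit width, as in `i8`, `u8`, and `f32` used on this page, and the plain string comparison against the declared blank is a simplification.

```python
import math

def internal_blank(type_id):
    """Internal blank for an identifier such as 'i8', 'u16', or 'f32'."""
    kind, bits = type_id[0], int(type_id[1:])
    if kind == "f":
        return math.nan               # IEEE NaN for floating point types
    if kind == "i":
        return -(2 ** (bits - 1))     # smallest possible signed value
    if kind == "u":
        return 2 ** bits - 1          # largest possible unsigned value
    raise ValueError(f"unrecognized type identifier: {type_id}")

def to_internal(token, declared_blank, type_id):
    """Store a raw token, replacing the declared blank by the internal one."""
    if token == declared_blank:       # simplification: exact text match
        return internal_blank(type_id)
    return float(token) if type_id.startswith("f") else int(token)
```

For example, with a declared blank of `-99` in an `i16` column, a `-99` in the file would be stored as the smallest 16-bit signed value, while every other value is kept as written.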

The comment information line will be ignored when it has a formatting problem, when the column was already given meta-data in a previous comment, or when the column number is larger than the actual number of columns in the table (as counted from the lines that are neither comments nor empty).
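These three conditions can be written as a short filter. The sketch assumes the comment lines were already parsed into `(column_number, info)` pairs in file order, with `info` set to `None` when the line had a formatting problem; the function name and this representation are hypothetical.

```python
def collect_column_info(entries, num_columns):
    """Keep the first usable meta-data entry for each column."""
    kept = {}
    for number, info in entries:
        if info is None:            # formatting problem in the comment line
            continue
        if number in kept:          # column already has meta-data
            continue
        if number > num_columns:    # more than the columns found in the data
            continue
        kept[number] = info
    return kept

entries = [
    (1, {"name": "ID"}),
    (1, {"name": "DUPLICATE"}),     # ignored: column 1 is already described
    (7, {"name": "TOO_FAR"}),       # ignored: the table has only 3 columns
    (2, None),                      # ignored: formatting problem
    (3, {"name": "MAG"}),
]
usable = collect_column_info(entries, num_columns=3)
```

Only the first entry for column 1 and the entry for column 3 survive in `usable`.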

When a comment information line can be used, the leading and trailing white space characters will be stripped from all of the elements. For example, in this line:

# Column 5:  column name   [km/s,    f32,-99] Redshift as speed

The NAME field will be ‘column name’ and the TYPE field will be ‘f32’. Note how the white space characters before and after each string are discarded, but those in the middle remain. White space around the separators is also not mandatory: in the example above, the BLANK field will be given the value of ‘-99’ even though no space follows the second comma.
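The field splitting and white space stripping described above can be sketched with a regular expression. This is an illustration under stated assumptions, not Gnuastro's parser: the name `parse_column_info` is hypothetical, the line is assumed to start with exactly `# Column ` followed by the number, and a line that does not fit the pattern simply yields `None`.

```python
import re

COLUMN_INFO = re.compile(
    r"^# Column (?P<num>\d+):"                      # '# Column N:'
    r"(?P<name>[^\[]*)"                             # NAME: anything up to '['
    r"(?:\[(?P<meta>[^\]]*)\](?P<comment>.*))?$"    # optional [...] COMMENT
)

def parse_column_info(line):
    """Split one column information line into its stripped fields."""
    match = COLUMN_INFO.match(line.rstrip("\n"))
    if match is None:
        return None                                 # not a usable info line
    # Inside the brackets: UNIT, TYPE(NUM), BLANK (all optional).
    fields = [f.strip() for f in (match["meta"] or "").split(",", 2)]
    fields += [""] * (3 - len(fields))
    unit, type_id, blank = fields
    # TYPE may carry a repeat count in parentheses, for example 'f32(3)'.
    repeat = 1
    vector = re.fullmatch(r"([^()]*)\((\d+)\)", type_id)
    if vector:
        type_id, repeat = vector[1].strip(), int(vector[2])
    return {
        "number": int(match["num"]),
        "name": match["name"].strip(),
        "unit": unit, "type": type_id, "repeat": repeat, "blank": blank,
        "comment": (match["comment"] or "").strip(),
    }

info = parse_column_info(
    "# Column 5:  column name   [km/s,    f32,-99] Redshift as speed")
```

For the example line, `info` has the name `column name`, the unit `km/s`, the type `f32`, and the blank value `-99`.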

Except for the column number (N), the rest of the fields are optional. Also, the column information comments do not have to be in order. In other words, the information for column \(N+m\) (\(m>0\)) can be given in a line before column \(N\). Furthermore, you do not have to specify information for all columns. Those columns that do not have this information will be interpreted with the default settings (like the case above: values are double precision floating point, and the column has no name, unit, or comment). So these lines are all acceptable for any table (the first one, with nothing but the column number, is redundant):

# Column 5:
# Column 1: ID [,i8] The Clump ID.
# Column 3: mag_f160w [AB mag, f32] Magnitude from the F160W filter
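To illustrate how out-of-order and missing entries resolve, the sketch below fills every column without meta-data with the defaults. The names are hypothetical, and `f64` is assumed to be the identifier of the default 64-bit floating point type.

```python
# Defaults for a column with no usable '# Column N:' line.
# Assumption: 'f64' identifies the 64-bit floating point type.
DEFAULT_INFO = {"name": "", "unit": "", "type": "f64", "comment": ""}

def fill_defaults(known, num_columns):
    """Return one info dict per column (counting from 1)."""
    return {
        n: {**DEFAULT_INFO, **known.get(n, {})}
        for n in range(1, num_columns + 1)
    }

# The three lines above, in the order they were given.
known = {
    5: {},
    1: {"name": "ID", "type": "i8", "comment": "The Clump ID."},
    3: {"name": "mag_f160w", "unit": "AB mag", "type": "f32",
        "comment": "Magnitude from the F160W filter"},
}
columns = fill_defaults(known, num_columns=5)
```

Columns 2, 4, and 5 end up with the same default settings, which is why the line for column 5 adds nothing.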

The data type of the column should be specified with one of the following values:

For a numeric column, you can use any of the numeric types (and their recognized identifiers) described in Numeric data types.
‘strN’: for strings. The N value identifies the length of the string (how many characters it has). The start of the string on each row is the first non-delimiter character of the column that has the string type. The next N characters will be interpreted as a string and all leading and trailing white space will be removed.
If the next column’s characters are closer than N characters to the start of the string column in that line/row, they will be considered part of the string column. If there is a new-line character before the ending of the space given to the string column (in other words, the string column is the last column), then reading of the string will stop, even if the N characters are not complete yet. See tests/table/table.txt for one example. Therefore, the only time you have to pay attention to the positioning and spaces given to the string column is when it is not the last column in the table.
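The fixed-width behavior of ‘strN’ columns can be sketched as below. This is a simplified illustration and not Gnuastro's reader: the function name `read_row` is hypothetical, numeric tokens are kept as text, and the filter names in the sample rows are made up.

```python
DELIMS = " \t\v,"

def read_row(line, types):
    """Split one row according to `types`, e.g. ['u8', 'str12', 'f32']."""
    line = line.rstrip("\n")
    pos, values = 0, []
    for type_id in types:
        # Skip delimiters to reach the first character of this column.
        while pos < len(line) and line[pos] in DELIMS:
            pos += 1
        if type_id.startswith("str"):
            width = int(type_id[3:])
            end = min(pos + width, len(line))   # end of line stops the read
            values.append(line[pos:end].strip())
        else:
            end = pos
            while end < len(line) and line[end] not in DELIMS:
                end += 1
            values.append(line[pos:end])        # numeric token, kept as text
        pos = end
    return values

types = ["u8", "str12", "f32"]
good = read_row("1  " + "F160W wide".ljust(12) + " 24.5", types)
bad = read_row("2  F814W 23.1", types)          # next column starts too early
last = read_row("3  F125W\n", ["u8", "str12"])  # string is the last column
```

In `good`, the string keeps its internal space and the number is read separately. In `bad`, the number falls inside the 12 characters reserved for the string, so it is absorbed into the string and the third column is left empty. In `last`, reading stops at the end of the line.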

The only limitation in this format is that trailing and leading white space characters will be removed from the columns that are read. In most cases, this is the desired behavior, but if trailing and leading white space is critically important to your analysis, define your own starting and ending characters and remove them after the table has been read. For example, in the sample table below, the two ‘|’ characters (which are arbitrary) will remain in the value of the second column and you can remove them manually later. If only the leading or only the trailing white space is important for your work, you can use just one of the ‘|’s.
```
# Column 1: ID [label, u8]
# Column 2: Notes [no unit, str50]
1    leading and trailing white space is ignored here    2.3442e10
2   |         but they will be preserved here        |   8.2964e11
```
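Removing the guard characters after the table has been read could look like the sketch below. The function name `strip_guards` is hypothetical and ‘|’ is just the arbitrary character chosen in the sample table.

```python
def strip_guards(value, guard="|"):
    """Remove one leading and one trailing guard character, if present,
    keeping the white space that they protected."""
    if value.startswith(guard):
        value = value[len(guard):]
    if value.endswith(guard):
        value = value[:-len(guard)]
    return value

note = strip_guards("|   padded text   |")
```

The white space between the guards survives in `note`, and a value with no guards, or with only one of them, is handled the same way.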

Note that the FITS binary table standard does not define the unsigned int and unsigned long types, so if you want to convert your tables to FITS binary tables, use other types. Also, note that in the FITS ASCII table, there is only one integer type (long). So if you convert a Gnuastro plain text table to a FITS ASCII table with the Table program, the type information for integers will be lost. Conversely, if integer types are important to you, you have to manually set them when reading a FITS ASCII table (for example, with the Table program when reading/converting into a file, or with the gnuastro/table.h library functions when reading into memory).