GNU Astronomy Utilities



12.3.25 Labeled datasets (label.h)

A labeled dataset is one where each element/pixel has an integer label (or counter). The label identifies the group/class that the element belongs to. This form of labeling allows the higher-level study of all pixels within a certain class.

For example, to detect objects/targets in an image/dataset, you can apply a threshold to separate the noise from the signal (to detect diffuse signal, a threshold is useless and more advanced methods are necessary, for example NoiseChisel). But the output of detection is a binary dataset (which is just a very low-level labeling of 0 for noise and 1 for signal).

The raw detection map is therefore hardly useful for any kind of analysis on objects/targets in the image. One solution is to use a connected-components algorithm (see gal_binary_connected_components in Binary datasets (binary.h)). It is a simple and useful way to separate/label connected patches in the foreground. This higher-level (but still elementary) labeling therefore allows you to count how many connected patches of signal there are in the dataset and is a major improvement compared to the raw detection.

However, when your objects/targets are touching, the simple connected components algorithm is not enough and a still higher-level labeling mechanism is necessary. This brings us to the necessity of the functions in this part of Gnuastro’s library. The main inputs to the functions in this section are already labeled datasets (for example, with the connected components algorithm above).

Each of the labeled regions are independent of each other (the labels specify different classes of targets). Therefore, especially in large datasets, it is often useful to process each label on independent CPU threads in parallel rather than in series. Therefore the functions of this section actually use an array of pixel/element indices (belonging to each label/class) as the main identifier of a region. Using indices will also allow processing of overlapping labels (for example, in deblending problems). Just note that overlapping labels are not yet implemented, but planned. You can use gal_label_indexs to generate lists of indices belonging to separate classes from the labeled input.

Macro: GAL_LABEL_INIT
Macro: GAL_LABEL_RIVER
Macro: GAL_LABEL_TMPCHECK

Special negative integer values used internally by some of the functions in this section. Recall that meaningful labels are considered to be positive integers (\(\geq1\)). Zero is conventionally kept for regions with no labels, therefore negative integers can be used for any extra classification in the labeled datasets.

Function:
gal_data_t *
gal_label_indexs (gal_data_t *labels, size_t numlabs, size_t minmapsize, int quietmmap)

Return an array of gal_data_t containers, each containing the pixel indices of the respective label (see Generic data container (gal_data_t)). labels contains the label of each element and has to have an GAL_TYPE_INT32 type (see Library data types (type.h)). Only positive (greater than zero) values in labels will be used/indexed, other elements will be ignored.

Meaningful labels start from 1 and not 0, therefore the output array of gal_data_t will contain numlabs+1 elements. The first (zero-th) element of the output (indexs[0] in the example below) will be initialized to a dataset with zero elements. This will allow easy (non-confusing) access to the indices of each (meaningful) label.

numlabs is the number of labels in the dataset. If it is given a value of zero, then the maximum value in the input (largest label) will be found and used. Therefore if it is given, but smaller than the actual number of labels, this function may/will crash (it will write in un-allocated space). numlabs is therefore useful in a highly optimized/checked environment.

For example, if the returned array is called indexs, then indexs[10].size contains the number of elements that have a label of 10 in labels and indexs[10].array is an array (after casting to size_t *) containing the indices of each one of those elements/pixels.

By index we mean the 1D position: the input number of dimensions is irrelevant (any dimensionality is supported). In other words, each element’s index is the number of elements/pixels between it and the dataset’s first element/pixel. Therefore it is always greater or equal to zero and stored in size_t type.

Function:
size_t
gal_label_watershed (gal_data_t *values, gal_data_t *indexs, gal_data_t *label, size_t *topinds, int min0_max1)

Use the watershed algorithm273 to “over-segment” the pixels in the indexs dataset based on values in the values dataset. Internally, each local extrema (maximum or minimum, based on min0_max1) and its surrounding pixels will be given a unique label. For demonstration, see Figures 8 and 9 of Akhlaghi and Ichikawa 2015. If topinds!=NULL, it is assumed to point to an already allocated space to write the index of each clump’s local extrema, otherwise, it is ignored.

The values dataset must have a 32-bit floating point type (GAL_TYPE_FLOAT32, see Library data types (type.h)) and will only be read by this function. indexs must contain the indices of the elements/pixels that will be over-segmented by this function and have a GAL_TYPE_SIZE_T type, see the description of gal_label_indexs, above. The final labels will be written in the respective positions of labels, which must have a GAL_TYPE_INT32 type and be the same size as values.

When indexs is already sorted, this function will ignore min0_max1. To judge if the dataset is sorted or not (by the values the indices correspond to in values, not the actual indices), this function will look into the bits of indexs->flag, for the respective bit flags, see Generic data container (gal_data_t). If indexs is not already sorted, this function will sort it according to the values of the respective pixel in values. The increasing/decreasing order will be determined by min0_max1. Note that if this function is called on multiple threads and values points to a different array on each thread, this function will not return a reasonable result. In this case, please sort indexs prior to calling this function (see gal_qsort_index_multi_d in Qsort functions (qsort.h)).

When indexs is decreasing (increasing), or min0_max1 is 1 (0), local minima (maxima), are considered rivers (watersheds) and given a label of GAL_LABEL_RIVER (see above).

Note that rivers/watersheds will also be formed on the edges of the labeled regions or when the labeled pixels touch a blank pixel. Therefore this function will need to check for the presence of blank values. To be most efficient, it is thus recommended to use gal_blank_present (with updateflag=1) prior to calling this function (see Library blank values (blank.h). Once the flag has been set, no other function (including this one) that needs special behavior for blank pixels will have to parse the dataset to see if it has blank values any more.

If you are sure your dataset does not have blank values (by the design of your software), to avoid an extra parsing of the dataset and improve performance, you can set the two bits manually (see the description of flags in Generic data container (gal_data_t)):

input->flag |=  GAL_DATA_FLAG_BLANK_CH; /* Set bit to 1. */
input->flag &= ~GAL_DATA_FLAG_HASBLANK; /* Set bit to 0. */
Function:
void
gal_label_clump_significance (gal_data_t *values, gal_data_t *std, gal_data_t *label, gal_data_t *indexs, struct gal_tile_two_layer_params *tl, size_t numclumps, size_t minarea, int variance, int keepsmall, gal_data_t *sig, gal_data_t *sigind)

This function is usually called after gal_label_watershed, and is used as a measure to identify which over-segmented “clumps” are real and which are noise.

A measurement is done on each clump (using the values and std datasets, see below). To help in multi-threaded environments, the operation is only done on pixels which are indexed in indexs. It is expected for indexs to be sorted by their values in values. If not sorted, the measurement may not be reliable. If sorted in a decreasing order, then clump building will start from their highest value and vice-versa. See the description of gal_label_watershed for more on indexs.

Each “clump” (identified by a positive integer) is assumed to be surrounded by at least one river/watershed pixel (with a non-positive label). This function will parse the pixels identified in indexs and make a measurement on each clump and over all the river/watershed pixels. The number of clumps (numclumps) must be given as an input argument and any clump that is smaller than minarea is ignored (because of scatter). If variance is non-zero, then the std dataset is interpreted as variance, not standard deviation.

The values and std datasets must have a float (32-bit floating point) type. Also, label and indexs must respectively have int32 and size_t types. values and label must have the same size, but std can have three possible sizes: 1) a single element (which will be used for the whole dataset, 2) the same size as values (so a different error can be assigned to every pixel), 3) a single value for each tile, based on the tl tessellation (see Tile grid). In the last case, a tile/value will be associated to each clump based on its flux-weighted (only positive values) center.

The main output is an internally allocated, 1-dimensional array with one value per label. The array information (length, type, etc.) will be written into the sig generic data container. Therefore sig->array must be NULL when this function is called. After this function, the details of the array (number of elements, type and size, etc) will be written in to the various components of sig, see the definition of gal_data_t in Generic data container (gal_data_t). Therefore sig must already be allocated before calling this function.

Optionally (when sigind!=NULL, similar to sig) the clump labels of each measurement in sig will be written in sigind->array. If keepsmall zero, small clumps (where no measurement is made) will not be included in the output table.

This function is initially intended for a multi-threaded environment. In such cases, you will be writing arrays of clump measures from different regions in parallel into an array of gal_data_ts. You can simply allocate (and initialize), such an array with the gal_data_array_calloc function in Arrays of datasets. For example, if the gal_data_t array is called array, you can pass &array[i] as sig.

Along with some other functions in label.h, this function was initially written for Segment. The description of the parameter used to measure a clump’s significance is fully given in Akhlaghi 2019.

Function:
void
gal_label_grow_indexs (gal_data_t *labels, gal_data_t *indexs, int withrivers, int connectivity)

Grow the (positive) labels of labels over the pixels in indexs (see description of gal_label_indexs). The pixels (position in indexs, values in labels) that must be “grown” must have a value of GAL_LABEL_INIT in labels before calling this function. For a demonstration see Columns 2 and 3 of Figure 10 in Akhlaghi and Ichikawa 2015.

In many aspects, this function is very similar to over-segmentation (watershed algorithm, gal_label_watershed). The big difference is that in over-segmentation local maximums (that are not touching any already labeled pixel) get a separate label. However, here the final number of labels will not change. All pixels that are not directly touching a labeled pixel just get pushed back to the start of the loop, and the loop iterates until its size does not change any more. This is because in a generic scenario some of the indexed pixels might not be reachable through other indexed pixels.

The next major difference with over-segmentation is that when there is only one label in growth region(s), it is not mandatory for indexs to be sorted by values. If there are multiple labeled regions in growth region(s), then values are important and you can use qsort with gal_qsort_index_single_d to sort the indices by values in a separate array (see Qsort functions (qsort.h)).

This function looks for positive-valued neighbors of each pixel in indexs and will label a pixel if it touches one. Therefore, it is very important that only pixels/labels that are intended for growth have positive values in labels before calling this function. Any non-positive (zero or negative) value will be ignored as a label by this function. Thus, it is recommended that while filling in the indexs array values, you initialize all the pixels that are in indexs with GAL_LABEL_INIT, and set non-labeled pixels that you do not want to grow to 0.

This function will write into both the input datasets. After this function, some of the non-positive labels pixels will have a new positivelabel and the number of useful elements in indexs will have decreased. The index of those pixels that could not be labeled will remain inside indexs. If withrivers is non-zero, then pixels that are immediately touching more than one positive value will be given a GAL_LABEL_RIVER label.

Note that the indexs->array is not re-allocated to its new size at the end274. But since indexs->dsize[0] and indexs->size have new values after this function is returned, the extra elements just will not be used until they are ultimately freed by gal_data_free.

Connectivity is a value between 1 (fewest number of neighbors) and the number of dimensions in the input (most number of neighbors). For example, in a 2D dataset, a connectivity of 1 and 2 corresponds to 4-connected and 8-connected neighbors.


Footnotes

(273)

The watershed algorithm was initially introduced by Vincent and Soille. It starts from the minima and puts the pixels in, one by one, to grow them until the touch (create a watershed). For more, also see the Wikipedia article: https://en.wikipedia.org/wiki/Watershed_%28image_processing%29.

(274)

Note that according to the GNU C Library, even a realloc to a smaller size can also cause a re-write of the whole array, which is not a cheap operation.