A labeled dataset is one where each element/pixel has an integer label (or counter). The label identifies the group/class that the element belongs to. This form of labeling allows the higher-level study of all pixels within a certain class.

For example, to detect objects/targets in an image/dataset, you can apply a
threshold to separate the noise from the signal (to detect diffuse signal,
a threshold is useless and more advanced methods are necessary, for example
NoiseChisel). But the output of detection is a binary dataset (which
is just a very low-level labeling of `0`

for noise and `1`

for
signal).

The raw detection map is therefore hardly useful for any kind of analysis
on objects/targets in the image. One solution is to use a
connected-components algorithm (see `gal_binary_connected_components`

in Binary datasets (`binary.h`)). It is a simple and useful way to separate/label
connected patches in the foreground. This higher-level (but still
elementary) labeling therefore allows you to count how many connected
patches of signal there are in the dataset and is a major improvement
compared to the raw detection.

However, when your objects/targets are touching, the simple connected components algorithm is not enough and a still higher-level labeling mechanism is necessary. This brings us to the necessity of the functions in this part of Gnuastro’s library. The main inputs to the functions in this section are already labeled datasets (for example, with the connected components algorithm above).

Each of the labeled regions are independent of each other (the labels
specify different classes of targets). Therefore, especially in large
datasets, it is often useful to process each label on independent CPU
threads in parallel rather than in series. Therefore the functions of this
section actually use an array of pixel/element indices (belonging to each
label/class) as the main identifier of a region. Using indices will also
allow processing of overlapping labels (for example, in deblending
problems). Just note that overlapping labels are not yet implemented, but
planned. You can use `gal_label_indexs`

to generate lists of indices
belonging to separate classes from the labeled input.

- Macro:
**GAL_LABEL_INIT**¶ - Macro:
**GAL_LABEL_RIVER**¶ - Macro:
**GAL_LABEL_TMPCHECK**¶ Special negative integer values used internally by some of the functions in this section. Recall that meaningful labels are considered to be positive integers (\(\geq1\)). Zero is conventionally kept for regions with no labels, therefore negative integers can be used for any extra classification in the labeled datasets.

- Function:

`gal_data_t *`

**gal_label_indexs**`(gal_data_t`

¶`*labels`

, size_t`numlabs`

, size_t`minmapsize`

, int`quietmmap`

) -
Return an array of

`gal_data_t`

containers, each containing the pixel indices of the respective label (see Generic data container (`gal_data_t`

)).`labels`

contains the label of each element and has to have an`GAL_TYPE_INT32`

type (see Library data types (`type.h`)). Only positive (greater than zero) values in`labels`

will be used/indexed, other elements will be ignored.Meaningful labels start from

`1`

and not`0`

, therefore the output array of`gal_data_t`

will contain`numlabs+1`

elements. The first (zero-th) element of the output (`indexs[0]`

in the example below) will be initialized to a dataset with zero elements. This will allow easy (non-confusing) access to the indices of each (meaningful) label.`numlabs`

is the number of labels in the dataset. If it is given a value of zero, then the maximum value in the input (largest label) will be found and used. Therefore if it is given, but smaller than the actual number of labels, this function may/will crash (it will write in unallocated space).`numlabs`

is therefore useful in a highly optimized/checked environment.For example, if the returned array is called

`indexs`

, then`indexs[10].size`

contains the number of elements that have a label of`10`

in`labels`

and`indexs[10].array`

is an array (after casting to`size_t *`

) containing the indices of each one of those elements/pixels.By

*index*we mean the 1D position: the input number of dimensions is irrelevant (any dimensionality is supported). In other words, each element’s index is the number of elements/pixels between it and the dataset’s first element/pixel. Therefore it is always greater or equal to zero and stored in`size_t`

type.

- Function:

`size_t`

**gal_label_watershed**`(gal_data_t`

¶`*values`

, gal_data_t`*indexs`

, gal_data_t`*label`

, size_t`*topinds`

, int`min0_max1`

) -
Use the watershed algorithm

^{265}to “over-segment” the pixels in the`indexs`

dataset based on values in the`values`

dataset. Internally, each local extrema (maximum or minimum, based on`min0_max1`

) and its surrounding pixels will be given a unique label. For demonstration, see Figures 8 and 9 of Akhlaghi and Ichikawa [2015]. If`topinds!=NULL`

, it is assumed to point to an already allocated space to write the index of each clump’s local extrema, otherwise, it is ignored.The

`values`

dataset must have a 32-bit floating point type (`GAL_TYPE_FLOAT32`

, see Library data types (`type.h`)) and will only be read by this function.`indexs`

must contain the indices of the elements/pixels that will be over-segmented by this function and have a`GAL_TYPE_SIZE_T`

type, see the description of`gal_label_indexs`

, above. The final labels will be written in the respective positions of`labels`

, which must have a`GAL_TYPE_INT32`

type and be the same size as`values`

.When

`indexs`

is already sorted, this function will ignore`min0_max1`

. To judge if the dataset is sorted or not (by the values the indices correspond to in`values`

, not the actual indices), this function will look into the bits of`indexs->flag`

, for the respective bit flags, see Generic data container (`gal_data_t`

). If`indexs`

is not already sorted, this function will sort it according to the values of the respective pixel in`values`

. The increasing/decreasing order will be determined by`min0_max1`

. Note that if this function is called on multiple threads*and*`values`

points to a different array on each thread, this function will not return a reasonable result. In this case, please sort`indexs`

prior to calling this function (see`gal_qsort_index_multi_d`

in Qsort functions (`qsort.h`)).When

`indexs`

is decreasing (increasing), or`min0_max1`

is`1`

(`0`

), local minima (maxima), are considered rivers (watersheds) and given a label of`GAL_LABEL_RIVER`

(see above).Note that rivers/watersheds will also be formed on the edges of the labeled regions or when the labeled pixels touch a blank pixel. Therefore this function will need to check for the presence of blank values. To be most efficient, it is thus recommended to use

`gal_blank_present`

(with`updateflag=1`

) prior to calling this function (see Library blank values (`blank.h`). Once the flag has been set, no other function (including this one) that needs special behavior for blank pixels will have to parse the dataset to see if it has blank values any more.If you are sure your dataset does not have blank values (by the design of your software), to avoid an extra parsing of the dataset and improve performance, you can set the two bits manually (see the description of

`flags`

in Generic data container (`gal_data_t`

)):input->flag |= GAL_DATA_FLAG_BLANK_CH; /* Set bit to 1. */ input->flag &= ~GAL_DATA_FLAG_HASBLANK; /* Set bit to 0. */

- Function:

`void`

**gal_label_clump_significance**`(gal_data_t`

¶`*values`

, gal_data_t`*std`

, gal_data_t`*label`

, gal_data_t`*indexs`

, struct gal_tile_two_layer_params`*tl`

, size_t`numclumps`

, size_t`minarea`

, int`variance`

, int`keepsmall`

, gal_data_t`*sig`

, gal_data_t`*sigind`

) -
This function is usually called after

`gal_label_watershed`

, and is used as a measure to identify which over-segmented “clumps” are real and which are noise.A measurement is done on each clump (using the

`values`

and`std`

datasets, see below). To help in multi-threaded environments, the operation is only done on pixels which are indexed in`indexs`

. It is expected for`indexs`

to be sorted by their values in`values`

. If not sorted, the measurement may not be reliable. If sorted in a decreasing order, then clump building will start from their highest value and vice-versa. See the description of`gal_label_watershed`

for more on`indexs`

.Each “clump” (identified by a positive integer) is assumed to be surrounded by at least one river/watershed pixel (with a non-positive label). This function will parse the pixels identified in

`indexs`

and make a measurement on each clump and over all the river/watershed pixels. The number of clumps (`numclumps`

) must be given as an input argument and any clump that is smaller than`minarea`

is ignored (because of scatter). If`variance`

is non-zero, then the`std`

dataset is interpreted as variance, not standard deviation.The

`values`

and`std`

datasets must have a`float`

(32-bit floating point) type. Also,`label`

and`indexs`

must respectively have`int32`

and`size_t`

types.`values`

and`label`

must have the same size, but`std`

can have three possible sizes: 1) a single element (which will be used for the whole dataset, 2) the same size as`values`

(so a different error can be assigned to every pixel), 3) a single value for each tile, based on the`tl`

tessellation (see Tile grid). In the last case, a tile/value will be associated to each clump based on its flux-weighted (only positive values) center.The main output is an internally allocated, 1-dimensional array with one value per label. The array information (length, type, etc.) will be written into the

`sig`

generic data container. Therefore`sig->array`

must be`NULL`

when this function is called. After this function, the details of the array (number of elements, type and size, etc) will be written in to the various components of`sig`

, see the definition of`gal_data_t`

in Generic data container (`gal_data_t`

). Therefore`sig`

must already be allocated before calling this function.Optionally (when

`sigind!=NULL`

, similar to`sig`

) the clump labels of each measurement in`sig`

will be written in`sigind->array`

. If`keepsmall`

zero, small clumps (where no measurement is made) will not be included in the output table.This function is initially intended for a multi-threaded environment. In such cases, you will be writing arrays of clump measures from different regions in parallel into an array of

`gal_data_t`

s. You can simply allocate (and initialize), such an array with the`gal_data_array_calloc`

function in Arrays of datasets. For example, if the`gal_data_t`

array is called`array`

, you can pass`&array[i]`

as`sig`

.Along with some other functions in

`label.h`

, this function was initially written for Segment. The description of the parameter used to measure a clump’s significance is fully given in Akhlaghi [2019].

- Function:

`void`

**gal_label_grow_indexs**`(gal_data_t`

¶`*labels`

, gal_data_t`*indexs`

, int`withrivers`

, int`connectivity`

) Grow the (positive) labels of

`labels`

over the pixels in`indexs`

(see description of`gal_label_indexs`

). The pixels (position in`indexs`

, values in`labels`

) that must be “grown” must have a value of`GAL_LABEL_INIT`

in`labels`

before calling this function. For a demonstration see Columns 2 and 3 of Figure 10 in Akhlaghi and Ichikawa [2015].In many aspects, this function is very similar to over-segmentation (watershed algorithm,

`gal_label_watershed`

). The big difference is that in over-segmentation local maximums (that are not touching any already labeled pixel) get a separate label. However, here the final number of labels will not change. All pixels that are not directly touching a labeled pixel just get pushed back to the start of the loop, and the loop iterates until its size does not change any more. This is because in a generic scenario some of the indexed pixels might not be reachable through other indexed pixels.The next major difference with over-segmentation is that when there is only one label in growth region(s), it is not mandatory for

`indexs`

to be sorted by values. If there are multiple labeled regions in growth region(s), then values are important and you can use`qsort`

with`gal_qsort_index_single_d`

to sort the indices by values in a separate array (see Qsort functions (`qsort.h`)).This function looks for positive-valued neighbors of each pixel in

`indexs`

and will label a pixel if it touches one. Therefore, it is very important that only pixels/labels that are intended for growth have positive values in`labels`

before calling this function. Any non-positive (zero or negative) value will be ignored as a label by this function. Thus, it is recommended that while filling in the`indexs`

array values, you initialize all the pixels that are in`indexs`

with`GAL_LABEL_INIT`

, and set non-labeled pixels that you do not want to grow to`0`

.This function will write into both the input datasets. After this function, some of the non-positive

`labels`

pixels will have a new positivelabel and the number of useful elements in`indexs`

will have decreased. The index of those pixels that could not be labeled will remain inside`indexs`

. If`withrivers`

is non-zero, then pixels that are immediately touching more than one positive value will be given a`GAL_LABEL_RIVER`

label.Note that the

`indexs->array`

is not re-allocated to its new size at the end^{266}. But since`indexs->dsize[0]`

and`indexs->size`

have new values after this function is returned, the extra elements just will not be used until they are ultimately freed by`gal_data_free`

.Connectivity is a value between

`1`

(fewest number of neighbors) and the number of dimensions in the input (most number of neighbors). For example, in a 2D dataset, a connectivity of`1`

and`2`

corresponds to 4-connected and 8-connected neighbors.

The watershed algorithm was initially introduced by Vincent and Soille. It starts from the minima and puts the pixels in, one by one, to grow them until the touch (create a watershed). For more, also see the Wikipedia article: https://en.wikipedia.org/wiki/Watershed_%28image_processing%29.

Note that according to the GNU C Library, even a `realloc`

to a smaller size can also cause a re-write of the whole array, which is not a cheap operation.

JavaScript license information

GNU Astronomy Utilities 0.20 manual, April 2023.