A labeled dataset is one where each element/pixel has an integer label (or counter). The label identifies the group/class that the element belongs to. This form of labeling allows the higher-level study of all pixels within a certain class.
For example, to detect objects/targets in an image/dataset, you can apply a
threshold to separate the noise from the signal (to detect diffuse signal,
a threshold is useless and more advanced methods are necessary, for example
NoiseChisel). But the output of detection is a binary dataset (which
is just a very low-level labeling of
0 for noise and
The raw detection map is therefore hardly useful for any kind of analysis
on objects/targets in the image. One solution is to use a
connected-components algorithm (see
in Binary datasets (binary.h)). It is a simple and useful way to separate/label
connected patches in the foreground. This higher-level (but still
elementary) labeling therefore allows you to count how many connected
patches of signal there are in the dataset and is a major improvement
compared to the raw detection.
However, when your objects/targets are touching, the simple connected components algorithm is not enough and a still higher-level labeling mechanism is necessary. This brings us to the necessity of the functions in this part of Gnuastro’s library. The main inputs to the functions in this section are already labeled datasets (for example with the connected components algorithm above).
Each of the labeled regions are independent of each other (the labels
specify different classes of targets). Therefore, especially in large
datasets, it is often useful to process each label on independent CPU
threads in parallel rather than in series. Therefore the functions of this
section actually use an array of pixel/element indices (belonging to each
label/class) as the main identifier of a region. Using indices will also
allow processing of overlapping labels (for example in deblending
problems). Just note that overlapping labels are not yet implemented, but
planned. You can use
gal_label_indexs to generate lists of indices
belonging to separate classes from the labeled input.
Special negative integer values used internally by some of the functions in this section. Recall that meaningful labels are considered to be positive integers (\(\geq1\)). Zero is conventionally kept for regions with no labels, therefore negative integers can be used for any extra classification in the labeled datasets.
Return an array of
gal_data_t containers, each containing the pixel
indices of the respective label (see Generic data container (
labels contains the label of each element and has to
GAL_TYPE_INT32 type (see Library data types (type.h)). Only
positive (greater than zero) values in
labels will be used/indexed,
other elements will be ignored.
Meaningful labels start from
1 and not
0, therefore the
output array of
gal_data_t will contain
The first (zero-th) element of the output (
indexs in the example
below) will be initialized to a dataset with zero elements. This will allow
easy (non-confusing) access to the indices of each (meaningful) label.
numlabs is the number of labels in the dataset. If it is given a
value of zero, then the maximum value in the input (largest label) will be
found and used. Therefore if it is given, but smaller than the actual
number of labels, this function may/will crash (it will write in
numlabs is therefore useful in a highly
For example, if the returned array is called
indexs.size contains the number of elements that have a label of
indexs.array is an array (after
size_t *) containing the indices of each one of those
By index we mean the 1D position: the input number of dimensions is
irrelevant (any dimensionality is supported). In other words, each
element’s index is the number of elements/pixels between it and the
dataset’s first element/pixel. Therefore it is always greater or equal to
zero and stored in
Use the watershed algorithm210 to
“over-segment” the pixels in the
indexs dataset based on values in
values dataset. Internally, each local extrema (maximum or
minimum, based on
min0_max1) and its surrounding pixels will be
given a unique label. For demonstration, see Figures 8 and 9 of
Akhlaghi and Ichikawa . If
topinds!=NULL, it is assumed to point to an already allocated space
to write the index of each clump’s local extrema, otherwise, it is ignored.
values dataset must have a 32-bit floating point type
GAL_TYPE_FLOAT32, see Library data types (type.h)) and will only be
read by this function.
indexs must contain the indices of the
elements/pixels that will be over-segmented by this function and have a
GAL_TYPE_SIZE_T type, see the description of
gal_label_indexs, above. The final labels will be written in the
respective positions of
labels, which must have a
GAL_TYPE_INT32 type and be the same size as
indexs is already sorted, this function will ignore
min0_max1. To judge if the dataset is sorted or not (by the values
the indices correspond to in
values, not the actual indices), this
function will look into the bits of
indexs->flag, for the respective
bit flags, see Generic data container (
indexs is not
already sorted, this function will sort it according to the values of the
respective pixel in
values. The increasing/decreasing order will be
min0_max1. Note that if this function is called on
multiple threads and
values points to a different array on
each thread, this function will not return a reasonable result. In this
case, please sort
indexs prior to calling this function (see
gal_qsort_index_multi_d in Qsort functions (qsort.h)).
indexs is decreasing (increasing), or
0), local minima (maxima), are considered rivers
(watersheds) and given a label of
GAL_LABEL_RIVER (see above).
Note that rivers/watersheds will also be formed on the edges of the labeled
regions or when the labeled pixels touch a blank pixel. Therefore this
function will need to check for the presence of blank values. To be most
efficient, it is thus recommended to use
updateflag=1) prior to calling this function (see Library blank values (blank.h). Once the flag has been set, no other function (including this one)
that needs special behavior for blank pixels will have to parse the dataset
to see if it has blank values any more.
If you are sure your dataset doesn’t have blank values (by the design of
your software), to avoid an extra parsing of the dataset and improve
performance, you can set the two bits manually (see the description of
flags in Generic data container (
input->flag |= GAL_DATA_FLAG_BLANK_CH; /* Set bit to 1. */ input->flag &= ~GAL_DATA_FLAG_HASBLANK; /* Set bit to 0. */
*indexs, struct gal_tile_two_layer_params
This function is usually called after
gal_label_watershed, and is
used as a measure to identify which over-segmented “clumps” are real and
which are noise.
A measurement is done on each clump (using the
datasets, see below). To help in multi-threaded environments, the operation
is only done on pixels which are indexed in
indexs. It is expected
indexs to be sorted by their values in
values. If not
sorted, the measurement may not be reliable. If sorted in a decreasing
order, then clump building will start from their highest value and
vice-versa. See the description of
gal_label_watershed for more on
Each “clump” (identified by a positive integer) is assumed to be
surrounded by at least one river/watershed pixel (with a non-positive
label). This function will parse the pixels identified in
make a measurement on each clump and over all the river/watershed
pixels. The number of clumps (
numclumps) must be given as an input
argument and any clump that is smaller than
minarea is ignored
(because of scatter). If
variance is non-zero, then the
dataset is interpreted as variance, not standard deviation.
std datasets must have a
floating point) type. Also,
label must have the same size, but
std can have three
possible sizes: 1) a single element (which will be used for the whole
dataset, 2) the same size as
values (so a different error can be
assigned to every pixel), 3) a single value for each tile, based on the
tl tessellation (see Tile grid). In the last case, a
tile/value will be associated to each clump based on its flux-weighted
(only positive values) center.
The main output is an internally allocated, 1-dimensional array with one
value per label. The array information (length, type, etc) will be
written into the
sig generic data container. Therefore
sig->array must be
NULL when this function is called. After
this function, the details of the array (number of elements, type and size,
etc) will be written in to the various components of
the definition of
gal_data_t in Generic data container (
sig must already be allocated before calling
sigind!=NULL, similar to
sig) the clump
labels of each measurement in
sig will be written in
keepsmall zero, small clumps (where no
measurement is made) will not be included in the output table.
This function is initially intended for a multi-threaded environment.
In such cases, you will be writing arrays of clump measures from different regions in parallel into an array of
You can simply allocate (and initialize), such an array with the
gal_data_array_calloc function in Arrays of datasets.
For example if the
gal_data_t array is called
array, you can pass
Along with some other functions in
label.h, this function was initially written for Segment.
The description of the parameter used to measure a clump’s significance is fully given in Akhlaghi .
Grow the (positive) labels of
labels over the pixels in
indexs (see description of
The pixels (position in
indexs, values in
labels) that must be “grown” must have a value of
labels before calling this function.
For a demonstration see Columns 2 and 3 of Figure 10 in Akhlaghi and Ichikawa .
In many aspects, this function is very similar to over-segmentation (watershed algorithm,
The big difference is that in over-segmentation local maximums (that aren’t touching any alreadylabeled pixel) get a separate label.
However, here the final number of labels will not change.
All pixels that aren’t directly touching a labeled pixel just get pushed back to the start of the loop, and the loop iterates until its size doesn’t change any more.
This is because in a generic scenario some of the indexed pixels might not be reachable through other indexed pixels.
The next major difference with over-segmentation is that when there is only one label in growth region(s), it is not mandatory for
indexs to be sorted by values.
If there are multiple labeled regions in growth region(s), then values are important and you can use
gal_qsort_index_single_d to sort the indices by values in a separate array (see Qsort functions (qsort.h)).
This function looks for positive-valued neighbors of each pixel in
indexs and will label a pixel if it touches one.
Therefore, it is very important that only pixels/labels that are intended for growth have positive values in
labels before calling this function.
Any non-positive (zero or negative) value will be ignored as a label by this function.
Thus, it is recommended that while filling in the
indexs array values, you initialize all the pixels that are in
GAL_LABEL_INIT, and set non-labeled pixels that you don’t want to grow to
This function will write into both the input datasets.
After this function, some of the non-positive
labels pixels will have a new positivelabel and the number of useful elements in
indexs will have decreased.
The index of those pixels that couldn’t be labeled will remain inside
withrivers is non-zero, then pixels that are immediately touching more than one positive value will be given a
Note that the
indexs->array is not re-allocated to its new size at the end211.
indexs->size have new values after this function is returned, the extra elements just won’t be used until they are ultimately freed by
Connectivity is a value between
1 (fewest number of neighbors) and the number of dimensions in the input (most number of neighbors).
For example in a 2D dataset, a connectivity of
2 corresponds to 4-connected and 8-connected neighbors.
The watershed algorithm was initially introduced by Vincent and Soille. It starts from the minima and puts the pixels in, one by one, to grow them until the touch (create a watershed). For more, also see the Wikipedia article: https://en.wikipedia.org/wiki/Watershed_%28image_processing%29.
Note that according to the GNU C Library, even a
realloc to a smaller size can also cause a re-write of the whole array, which is not a cheap operation.