A labeled dataset is one where each element/pixel has an integer label (or counter). The label identifies the group/class that the element belongs to. This form of labeling allows the higher-level study of all pixels within a certain class.
For example, to detect objects/targets in an image/dataset, you can apply a
threshold to separate the noise from the signal (to detect diffuse signal,
a threshold is useless and more advanced methods are necessary, for example
NoiseChisel). But the output of detection is a binary dataset (which
is just a very low-level labeling of 0
for noise and 1
for
signal).
The raw detection map is therefore hardly useful for any kind of analysis
on objects/targets in the image. One solution is to use a
connected-components algorithm (see gal_binary_connected_components
in Binary datasets (binary.h)). It is a simple and useful way to separate/label
connected patches in the foreground. This higher-level (but still
elementary) labeling therefore allows you to count how many connected
patches of signal there are in the dataset and is a major improvement
compared to the raw detection.
However, when your objects/targets are touching, the simple connected components algorithm is not enough and a still higher-level labeling mechanism is necessary. This brings us to the necessity of the functions in this part of Gnuastro’s library. The main inputs to the functions in this section are already labeled datasets (for example, with the connected components algorithm above).
Each of the labeled regions are independent of each other (the labels
specify different classes of targets). Therefore, especially in large
datasets, it is often useful to process each label on independent CPU
threads in parallel rather than in series. Therefore the functions of this
section actually use an array of pixel/element indices (belonging to each
label/class) as the main identifier of a region. Using indices will also
allow processing of overlapping labels (for example, in deblending
problems). Just note that overlapping labels are not yet implemented, but
planned. You can use gal_label_indexs
to generate lists of indices
belonging to separate classes from the labeled input.
Special negative integer values used internally by some of the functions in this section. Recall that meaningful labels are considered to be positive integers (\(\geq1\)). Zero is conventionally kept for regions with no labels, therefore negative integers can be used for any extra classification in the labeled datasets.
gal_data_t *
(gal_data_t *labels
, size_t numlabs
, size_t minmapsize
, int quietmmap
)
¶Return an array of gal_data_t
containers, each containing the pixel
indices of the respective label (see Generic data container (gal_data_t
)). labels
contains the label of each element and has to
have an GAL_TYPE_INT32
type (see Library data types (type.h)). Only
positive (greater than zero) values in labels
will be used/indexed,
other elements will be ignored.
Meaningful labels start from 1
and not 0
, therefore the
output array of gal_data_t
will contain numlabs+1
elements.
The first (zero-th) element of the output (indexs[0]
in the example
below) will be initialized to a dataset with zero elements. This will allow
easy (non-confusing) access to the indices of each (meaningful) label.
numlabs
is the number of labels in the dataset. If it is given a
value of zero, then the maximum value in the input (largest label) will be
found and used. Therefore if it is given, but smaller than the actual
number of labels, this function may/will crash (it will write in
unallocated space). numlabs
is therefore useful in a highly
optimized/checked environment.
For example, if the returned array is called indexs
, then
indexs[10].size
contains the number of elements that have a label of
10
in labels
and indexs[10].array
is an array (after
casting to size_t *
) containing the indices of each one of those
elements/pixels.
By index we mean the 1D position: the input number of dimensions is
irrelevant (any dimensionality is supported). In other words, each
element’s index is the number of elements/pixels between it and the
dataset’s first element/pixel. Therefore it is always greater or equal to
zero and stored in size_t
type.
size_t
(gal_data_t *values
, gal_data_t *indexs
, gal_data_t *label
, size_t *topinds
, int min0_max1
)
¶Use the watershed algorithm265 to
“over-segment” the pixels in the indexs
dataset based on values in
the values
dataset. Internally, each local extrema (maximum or
minimum, based on min0_max1
) and its surrounding pixels will be
given a unique label. For demonstration, see Figures 8 and 9 of
Akhlaghi and Ichikawa [2015]. If
topinds!=NULL
, it is assumed to point to an already allocated space
to write the index of each clump’s local extrema, otherwise, it is ignored.
The values
dataset must have a 32-bit floating point type
(GAL_TYPE_FLOAT32
, see Library data types (type.h)) and will only be
read by this function. indexs
must contain the indices of the
elements/pixels that will be over-segmented by this function and have a
GAL_TYPE_SIZE_T
type, see the description of
gal_label_indexs
, above. The final labels will be written in the
respective positions of labels
, which must have a
GAL_TYPE_INT32
type and be the same size as values
.
When indexs
is already sorted, this function will ignore
min0_max1
. To judge if the dataset is sorted or not (by the values
the indices correspond to in values
, not the actual indices), this
function will look into the bits of indexs->flag
, for the respective
bit flags, see Generic data container (gal_data_t
). If indexs
is not
already sorted, this function will sort it according to the values of the
respective pixel in values
. The increasing/decreasing order will be
determined by min0_max1
. Note that if this function is called on
multiple threads and values
points to a different array on
each thread, this function will not return a reasonable result. In this
case, please sort indexs
prior to calling this function (see
gal_qsort_index_multi_d
in Qsort functions (qsort.h)).
When indexs
is decreasing (increasing), or min0_max1
is
1
(0
), local minima (maxima), are considered rivers
(watersheds) and given a label of GAL_LABEL_RIVER
(see above).
Note that rivers/watersheds will also be formed on the edges of the labeled
regions or when the labeled pixels touch a blank pixel. Therefore this
function will need to check for the presence of blank values. To be most
efficient, it is thus recommended to use gal_blank_present
(with
updateflag=1
) prior to calling this function (see Library blank values (blank.h). Once the flag has been set, no other function (including this one)
that needs special behavior for blank pixels will have to parse the dataset
to see if it has blank values any more.
If you are sure your dataset does not have blank values (by the design of
your software), to avoid an extra parsing of the dataset and improve
performance, you can set the two bits manually (see the description of
flags
in Generic data container (gal_data_t
)):
input->flag |= GAL_DATA_FLAG_BLANK_CH; /* Set bit to 1. */ input->flag &= ~GAL_DATA_FLAG_HASBLANK; /* Set bit to 0. */
void
(gal_data_t *values
, gal_data_t *std
, gal_data_t *label
, gal_data_t *indexs
, struct gal_tile_two_layer_params *tl
, size_t numclumps
, size_t minarea
, int variance
, int keepsmall
, gal_data_t *sig
, gal_data_t *sigind
)
¶This function is usually called after gal_label_watershed
, and is
used as a measure to identify which over-segmented “clumps” are real and
which are noise.
A measurement is done on each clump (using the values
and std
datasets, see below). To help in multi-threaded environments, the operation
is only done on pixels which are indexed in indexs
. It is expected
for indexs
to be sorted by their values in values
. If not
sorted, the measurement may not be reliable. If sorted in a decreasing
order, then clump building will start from their highest value and
vice-versa. See the description of gal_label_watershed
for more on
indexs
.
Each “clump” (identified by a positive integer) is assumed to be
surrounded by at least one river/watershed pixel (with a non-positive
label). This function will parse the pixels identified in indexs
and
make a measurement on each clump and over all the river/watershed
pixels. The number of clumps (numclumps
) must be given as an input
argument and any clump that is smaller than minarea
is ignored
(because of scatter). If variance
is non-zero, then the std
dataset is interpreted as variance, not standard deviation.
The values
and std
datasets must have a float
(32-bit
floating point) type. Also, label
and indexs
must
respectively have int32
and size_t
types. values
and
label
must have the same size, but std
can have three
possible sizes: 1) a single element (which will be used for the whole
dataset, 2) the same size as values
(so a different error can be
assigned to every pixel), 3) a single value for each tile, based on the
tl
tessellation (see Tile grid). In the last case, a
tile/value will be associated to each clump based on its flux-weighted
(only positive values) center.
The main output is an internally allocated, 1-dimensional array with one
value per label. The array information (length, type, etc.) will be
written into the sig
generic data container. Therefore
sig->array
must be NULL
when this function is called. After
this function, the details of the array (number of elements, type and size,
etc) will be written in to the various components of sig
, see
the definition of gal_data_t
in Generic data container (gal_data_t
). Therefore sig
must already be allocated before calling
this function.
Optionally (when sigind!=NULL
, similar to sig
) the clump
labels of each measurement in sig
will be written in
sigind->array
. If keepsmall
zero, small clumps (where no
measurement is made) will not be included in the output table.
This function is initially intended for a multi-threaded environment.
In such cases, you will be writing arrays of clump measures from different regions in parallel into an array of gal_data_t
s.
You can simply allocate (and initialize), such an array with the gal_data_array_calloc
function in Arrays of datasets.
For example, if the gal_data_t
array is called array
, you can pass &array[i]
as sig
.
Along with some other functions in label.h
, this function was initially written for Segment.
The description of the parameter used to measure a clump’s significance is fully given in Akhlaghi [2019].
void
(gal_data_t *labels
, gal_data_t *indexs
, int withrivers
, int connectivity
)
¶Grow the (positive) labels of labels
over the pixels in indexs
(see description of gal_label_indexs
).
The pixels (position in indexs
, values in labels
) that must be “grown” must have a value of GAL_LABEL_INIT
in labels
before calling this function.
For a demonstration see Columns 2 and 3 of Figure 10 in Akhlaghi and Ichikawa [2015].
In many aspects, this function is very similar to over-segmentation (watershed algorithm, gal_label_watershed
).
The big difference is that in over-segmentation local maximums (that are not touching any already labeled pixel) get a separate label.
However, here the final number of labels will not change.
All pixels that are not directly touching a labeled pixel just get pushed back to the start of the loop, and the loop iterates until its size does not change any more.
This is because in a generic scenario some of the indexed pixels might not be reachable through other indexed pixels.
The next major difference with over-segmentation is that when there is only one label in growth region(s), it is not mandatory for indexs
to be sorted by values.
If there are multiple labeled regions in growth region(s), then values are important and you can use qsort
with gal_qsort_index_single_d
to sort the indices by values in a separate array (see Qsort functions (qsort.h)).
This function looks for positive-valued neighbors of each pixel in indexs
and will label a pixel if it touches one.
Therefore, it is very important that only pixels/labels that are intended for growth have positive values in labels
before calling this function.
Any non-positive (zero or negative) value will be ignored as a label by this function.
Thus, it is recommended that while filling in the indexs
array values, you initialize all the pixels that are in indexs
with GAL_LABEL_INIT
, and set non-labeled pixels that you do not want to grow to 0
.
This function will write into both the input datasets.
After this function, some of the non-positive labels
pixels will have a new positivelabel and the number of useful elements in indexs
will have decreased.
The index of those pixels that could not be labeled will remain inside indexs
.
If withrivers
is non-zero, then pixels that are immediately touching more than one positive value will be given a GAL_LABEL_RIVER
label.
Note that the indexs->array
is not re-allocated to its new size at the end266.
But since indexs->dsize[0]
and indexs->size
have new values after this function is returned, the extra elements just will not be used until they are ultimately freed by gal_data_free
.
Connectivity is a value between 1
(fewest number of neighbors) and the number of dimensions in the input (most number of neighbors).
For example, in a 2D dataset, a connectivity of 1
and 2
corresponds to 4-connected and 8-connected neighbors.
The watershed algorithm was initially introduced by Vincent and Soille. It starts from the minima and puts the pixels in, one by one, to grow them until the touch (create a watershed). For more, also see the Wikipedia article: https://en.wikipedia.org/wiki/Watershed_%28image_processing%29.
Note that according to the GNU C Library, even a realloc
to a smaller size can also cause a re-write of the whole array, which is not a cheap operation.
JavaScript license information
GNU Astronomy Utilities 0.20 manual, April 2023.