GNU Astronomy Utilities


Next: , Previous: , Up: Gnuastro library   [Contents][Index]


10.3.19 K-d tree (kdtree.h)

K-d tree is a space-partitioning binary search tree for organizing points in a k-dimensional space. They are a very useful data structure for multidimensional searches like range searches and nearest neighbor searches. For a more formal and complete introduction see the Wikipedia page.

Each non-leaf node in a k-d tree divides the space into two parts, known as half-spaces. To select the top/root node for partitioning, we find the median of the points and make a hyperplane normal to the first dimension. The points to the left of this space are represented by the left subtree of that node and points to the right of the space are represented by the right subtree. This is then repeated for all the points in the input, thus associating a “left” and “right” branch for each input point.

Gnuastro uses the standard algorithms of the k-d tree with one small difference that makes it much more memory and CPU optimized. The set of input points that define the tree nodes are given as a list of Gnuastro’s data container type, see List of gal_data_t. Each gal_data_t in the list represents the point’s coordinate in one dimension, and the first element in the list is the first dimension. Hence the number of data values in each gal_data_t (which must be equal in all of them) represents the number of points. This is the same format that Gnuastro’s Table reading/writing functions read/write columns in tables, see Table input output (table.h).

The output k-d tree is a list of two gal_data_ts, representing the input’s row-number (or index, counting from 0) of the left and right subtrees of each row. Each gal_data_t thus has the same number of rows (or points) as the input, but only containing integers with a type of uint32_t (unsigned 32-bit integer). If a node has no left, or right subtree, then GAL_BLANK_UINT32 will be used. Below you can see the simple tree for 2D points from Wikipedia. The input point coordinates are represented as two input gal_data_ts (X and Y, where X->next=Y and Y->next=NULL). If you had three dimensional points, you could define an extra gal_data_t such that Y->next=Z and Z->next=NULL. The output is always a list of two gal_data_ts, where the first one contains the index of the left sub-tree in the input, and the second one, the index of the right subtree. The index of the root node (2 in the case below) is also returned as a single number.

INDEX         INPUT              OUTPUT                 K-D Tree
(as guide)    X --> Y        LEFT --> RIGHT           (visualized)
----------    -------        --------------        ------------------
0             5     4        1        4                   (7,2)
1             2     3        BLANK    BLANK               /   \
2             7     2        0        3               (5,4)    \
3             9     6        5        BLANK           /   \    (9,6)
4             4     7        BLANK    BLANK       (2,3) (4,7)  /
5             8     1        BLANK    BLANK                   (8,1)

This format is therefore scalable to any number of dimensions: the number of dimensions are determined from the number of nodes in the input list of gal_data_ts (for example, using gal_list_data_number). In Gnuastro’s k-d tree implementation, there are thus no special structures to keep every tree node (which would take extra memory and would need to be moved around as the tree is being created). Everything is done internally on the index of each point in the input dataset: the only thing that is flipped/sorted during tree creation is the index to the input row for any number of dimensions. As a result, Gnuastro’s k-d tree implementation is very memory and CPU efficient and its two output columns can directly be written into a standard table (without having to define any special binary format).

Function:
gal_data_t *
gal_kdtree_create (gal_data_t *coords_raw, size_t *root)

Create a k-d tree in a bottom-up manner (from leaves to the root). This function returns two gal_data_ts connected as a list, see description above. The first dataset contains the indexes of left and right nodes of the subtrees for each input node. The index of the root node is written into the memory that root points to. coords_raw is the list of the input points (one gal_data_t per dimension, see above).

#include <stdio.h>
#include <gnuastro/table.h>
#include <gnuastro/kdtree.h>

int
main (void)
{
  size_t root;
  gal_data_t *input, *kdtree;
  char kdtreefile[]="kd-tree.txt";    /* Inputs and outputs can */
  char inputfile[]="coordinates.txt"; /* also be FITS tables.   */

  /* Read the input table. Note: this assumes the table only
   * contains your input point coordinates (one column for each
   * dimension). If it contains more columns with other properties
   * for each point, you can specify which columns to read by
   * name or number, see the documentation of 'gal_table_read'. */
  input=gal_table_read(inputfile, "1", NULL, NULL,
                       GAL_TABLE_SEARCH_NAME, 0, -1, 0, NULL);

  /* Construct a k-d tree. The index of root is stored in `root` */
  kdtree=gal_kdtree_create(coords_raw, &root);

  /* Write output to a file. */
  gal_table_write(kdtree, NULL, GAL_TABLE_FORMAT_BFITS,
                  kdtreefile, "kdtree", 0);

  /* Report k-d tree root. */
  printf("Root row of '%s': %zu\n", kdtreefile, root);

  /* Clean up and return. */
  gal_list_data_free(input);
  gal_list_data_free(kdtree);
  return EXIT_SUCCESS;
}
Function:
size_t
gal_kdtree_nearest_neighbour (gal_data_t *coords_raw, gal_data_t *kdtree, size_t root, double *point, double *least_dist)

Returns the index of the nearest input point to the query point (point, assumed to be an array with same number of elements as gal_data_ts in coords_raw). The distance between the query point and its nearest neighbor is stored in the space that least_dist points to. This search is efficient due to the constantly checking for the presence of possible best points in other branches. If it isn’t possible for the other branch to have a better nearest neighbor, that branch is not searched.

For example the function below reads the input, and the k-d tree file created in the example of gal_kdtree_create (so you won’t have to re-create the k-d tree every time) and finds the input point that is closest to a given query point.

#include <stdio.h>
#include <gnuastro/table.h>
#include <gnuastro/kdtree.h>

int
main (void)
{
  size_t root=KDTREE_ROOT;
  gal_data_t *input, *kdtree;
  char kdtreefile[]="kd-tree.txt";    /* Inputs and outputs can */
  char inputfile[]="coordinates.txt"; /* also be FITS tables.   */

  double point[2]={9,8};              /* assuming a 2-d tree.   */
  size_t nearest_index;

  /* Read the input coordinates, see comments in example of
   * 'gal_kdtree_create' for more. */
  input=gal_table_read(inputfile, "1", NULL, NULL,
                       GAL_TABLE_SEARCH_NAME, 0, -1, 0, NULL);

  /* Read the k-d tree (created before). */
  kdtree=gal_table_read(kdtreefile, "1", NULL, NULL,
                        GAL_TABLE_SEARCH_NAME, 0, -1, 0, NULL);

  /* Find the nearest neighbour of the point. */
  nearest_index=gal_kdtree_nearest_neighbour(input, kdtree, root,
                                             point);

  /* Clean up and return. */
  gal_list_data_free(input);
  gal_list_data_free(kdtree);
  return EXIT_SUCCESS;
}

Next: , Previous: , Up: Gnuastro library   [Contents][Index]