GNU Astronomy Utilities



12.3.19 K-d tree (kdtree.h)

K-d tree is a space-partitioning binary search tree for organizing points in a k-dimensional space. They are a very useful data structure for multidimensional searches like range searches and nearest neighbor searches. For a more formal and complete introduction see the Wikipedia page.

Each non-leaf node in a k-d tree divides the space into two parts, known as half-spaces. To select the top/root node for partitioning, we find the median of the points and make a hyperplane normal to the first dimension. The points to the left of this space are represented by the left subtree of that node and points to the right of the space are represented by the right subtree. This is then repeated for all the points in the input, thus associating a “left” and “right” branch for each input point.

Gnuastro uses the standard algorithms of the k-d tree with one small difference that makes it much more memory and CPU optimized. The set of input points that define the tree nodes are given as a list of Gnuastro’s data container type, see List of gal_data_t. Each gal_data_t in the list represents the point’s coordinate in one dimension, and the first element in the list is the first dimension. Hence the number of data values in each gal_data_t (which must be equal in all of them) represents the number of points. This is the same format that Gnuastro’s Table reading/writing functions read/write columns in tables, see Table input output (table.h).

The output k-d tree is a list of two gal_data_ts, representing the input’s row-number (or index, counting from 0) of the left and right subtrees of each row. Each gal_data_t thus has the same number of rows (or points) as the input, but only containing integers with a type of uint32_t (unsigned 32-bit integer). If a node has no left, or right subtree, then GAL_BLANK_UINT32 will be used. Below you can see the simple tree for 2D points from Wikipedia. The input point coordinates are represented as two input gal_data_ts (X and Y, where X->next=Y and Y->next=NULL). If you had three dimensional points, you could define an extra gal_data_t such that Y->next=Z and Z->next=NULL. The output is always a list of two gal_data_ts, where the first one contains the index of the left sub-tree in the input, and the second one, the index of the right subtree. The index of the root node (0 in the case below267) is also returned as a single number.

INDEX         INPUT              OUTPUT              K-D Tree
(as guide)    X --> Y        LEFT --> RIGHT        (visualized)
----------    -------        --------------     ------------------
0             5     4        1        2               (5,4)
1             2     3        BLANK    4               /   \
2             7     2        5        3           (2,3)    \
3             9     6        BLANK    BLANK           \    (7,2)
4             4     7        BLANK    BLANK         (4,7)  /   \
5             8     1        BLANK    BLANK               (8,1) (9,6)

This format is therefore scalable to any number of dimensions: the number of dimensions are determined from the number of nodes in the input list of gal_data_ts (for example, using gal_list_data_number). In Gnuastro’s k-d tree implementation, there are thus no special structures to keep every tree node (which would take extra memory and would need to be moved around as the tree is being created). Everything is done internally on the index of each point in the input dataset: the only thing that is flipped/sorted during tree creation is the index to the input row for any number of dimensions. As a result, Gnuastro’s k-d tree implementation is very memory and CPU efficient and its two output columns can directly be written into a standard table (without having to define any special binary format).

Function:
gal_data_t *
gal_kdtree_create (gal_data_t *coords_raw, size_t *root)

Create a k-d tree in a bottom-up manner (from leaves to the root). This function returns two gal_data_ts connected as a list, see description above. The first dataset contains the indexes of left and right nodes of the subtrees for each input node. The index of the root node is written into the memory that root points to. coords_raw is the list of the input points (one gal_data_t per dimension, see above). If the input dataset has no data (coords_raw->size==0), this function will return a NULL pointer.

For example, assume you have the simple set of points below (from the visualized example at the start of this section) in a plain-text file called coordinates.txt:

$ cat coordinates.txt
5     4
2     3
7     2
9     6
4     7
8     1

With the program below, you can calculate the kd-tree, and write it in a FITS file (while keeping the root index as a FITS keyword inside of it).

#include <stdio.h>
#include <gnuastro/table.h>
#include <gnuastro/kdtree.h>

int
main (void)
{
  gal_data_t *input, *kdtree;
  char kdtreefile[]="kd-tree.fits";
  char inputfile[]="coordinates.txt";

  /* To write the root within the saved file. */
  size_t root;
  char *unit="index";
  char *keyname="KDTROOT";
  gal_fits_list_key_t *keylist=NULL;
  char *comment="k-d tree root index (counting from 0).";

  /* Read the input table. Note: this assumes the table only
   * contains your input point coordinates (one column for each
   * dimension). If it contains more columns with other properties
   * for each point, you can specify which columns to read by
   * name or number, see the documentation of 'gal_table_read'. */
  input=gal_table_read(inputfile, "1", NULL, NULL,
                       GAL_TABLE_SEARCH_NAME, 0, -1, 0, NULL);

  /* Construct a k-d tree. The index of root is stored in `root` */
  kdtree=gal_kdtree_create(input, &root);

  /* Write the k-d tree to a file and write root index and input
   * name as FITS keywords ('gal_table_write' frees 'keylist').*/
  gal_fits_key_list_title_add(&keylist, "k-d tree parameters", 0);
  gal_fits_key_write_filename("KDTIN", inputfile, &keylist, 0, 1);
  gal_fits_key_list_add_end(&keylist, GAL_TYPE_SIZE_T, keyname, 0,
                            &root, 0, comment, 0, unit, 0);
  gal_table_write(kdtree, &keylist, NULL, GAL_TABLE_FORMAT_BFITS,
                  kdtreefile, "kdtree", 0, 1);

  /* Clean up and return. */
  gal_list_data_free(input);
  gal_list_data_free(kdtree);
  return EXIT_SUCCESS;
}

You can inspect the saved k-d tree FITS table with Gnuastro’s Table (first command below), and you can see the keywords containing the root index with Fits (second command below):

asttable kd-tree.fits
astfits kd-tree.fits -h1
Function:
size_t
gal_kdtree_nearest_neighbour (gal_data_t *coords_raw, gal_data_t *kdtree, size_t root, double *point, double *least_dist)

Returns the index of the nearest input point to the query point (point, assumed to be an array with same number of elements as gal_data_ts in coords_raw). The distance between the query point and its nearest neighbor is stored in the space that least_dist points to. This search is efficient due to the constant checking for the presence of possible best points in other branches. If it is not possible for the other branch to have a better nearest neighbor, that branch is not searched.

As an example, let’s use the k-d tree that was created in the example of gal_kdtree_create (above) and find the nearest row to a given coordinate (point). This will be a very common scenario, especially in large and multi-dimensional datasets where the k-d tree creation can take long and you do not want to re-create the k-d tree every time. In the gal_kdtree_create example output, we also wrote the k-d tree root index as a FITS keyword (KDTROOT), so after loading the two table data (input coordinates and k-d tree), we will read the root from the FITS keyword. This is a very simple example, but the scalability is clear: for example, it is trivial to parallelize (see Library demo - multi-threaded operation).

#include <stdio.h>
#include <gnuastro/table.h>
#include <gnuastro/kdtree.h>

int
main (void)
{
  /* INPUT: desired point. */
  double point[2]={8.9,5.9};

  /* Same as example in description of 'gal_kdtree_create'. */
  gal_data_t *input, *kdtree;
  char kdtreefile[]="kd-tree.fits";
  char inputfile[]="coordinates.txt";

  /* Processing variables of this function. */
  char kdtreehdu[]="1";
  double *in_x, *in_y, least_dist;
  size_t root, nkeys=1, nearest_index;
  gal_data_t *rkey, *keysll=gal_data_array_calloc(nkeys);

  /* Read the input coordinates, see comments in example of
   * 'gal_kdtree_create' for more. */
  input=gal_table_read(inputfile, "1", NULL, NULL,
                       GAL_TABLE_SEARCH_NAME, 0, -1, 0, NULL);

  /* Read the k-d tree contents (created before). */
  kdtree=gal_table_read(kdtreefile, "1", NULL, NULL,
                        GAL_TABLE_SEARCH_NAME, 0, -1, 0, NULL);

  /* Read the k-d tree root index from the header keyword.
   * See example in description of 'gal_fits_key_read_from_ptr'.*/
  keysll[0].name="KDTROOT";
  keysll[0].type=GAL_TYPE_SIZE_T;
  gal_fits_key_read(kdtreefile, kdtreehdu, keysll, 0, 0, NULL);
  keysll[0].name=NULL; /* Since we did not allocate it. */
  rkey=gal_data_copy_to_new_type(&keysll[0], GAL_TYPE_SIZE_T);
  root=((size_t *)(rkey->array))[0];

  /* Find the nearest neighbour of the point. */
  nearest_index=gal_kdtree_nearest_neighbour(input, kdtree, root,
                                             point, &least_dist);

  /* Print the results. */
  in_x=input->array;
  in_y=input->next->array;
  printf("(%g, %g): nearest is (%g, %g), with a distance of %g\n",
         point[0], point[1], in_x[nearest_index],
         in_y[nearest_index], least_dist);

  /* Clean up and return. */
  gal_data_free(rkey);
  gal_list_data_free(input);
  gal_list_data_free(kdtree);
  gal_data_array_free(keysll, nkeys, 1);
  return EXIT_SUCCESS;
}

Footnotes

(267)

This example input table is the same as the example in Wikipedia (as of December 2020). However, on the Wikipedia output, the root node is (7,2), not (5,4). The difference is primarily because there are 6 rows and the median element of an even number of elements can vary by integer calculation strategies. Here we use 0-based indexes for finding median and round to the smaller integer.