GNU Astronomy Utilities



7.5.3 Matching algorithms

Matching involves two catalogs, let’s call them catalog A (with N rows) and catalog B (with M rows). The most basic matching algorithm that immediately comes to mind is this: for each row in A (let’s call it \(A_i\)), go over all the rows in B (\(B_j\), where \(0

However, this basic parsing algorithm is very computationally expensive: \(N\times M\) distances have to measured, and calculating the distance requires a square root and power of 2: in 2 dimensions it would be \(d(A_i,B_j)=\sqrt{(B_{ix}-A_{ix})^2+(B_{iy}-A_{iy})^2}\). If an elliptical aperture is necessary, it can even get more complicated (see Defining an ellipse and ellipsoid). Such operations are not simple, and will consume many cycles of your CPU! As a result, this basic algorithm will become terribly slow as your datasets grow in size. For example, when N or M exceed hundreds of thousands (which is common in the current days with datasets like the European Space Agency’s Gaia mission). Therefore that basic parsing algorithm will take too much time and more efficient ways to find the nearest neighbor need to be found. Gnuastro’s Match currently has the following algorithms for finding the nearest neighbor:

Sort-based

In this algorithm, we will use a moving window over the sorted datasets:

  1. Sort the two datasets by their first coordinate. Therefore \(A_i
  2. Use the radial distance threshold to define the width of a moving interval over both A and B. Therefore, with a single parsing of both simultaneously, for each A-point, we can find all the elements in B that are sufficiently near to it (within the requested aperture).

You can use this method by disabling the default algorithm that is described next with --kdtree=disable. The reason the sort-based algorithm is not the default is that it has some caveats:

  • It requires sorting, which can again be slow on large numbers.
  • It can only be done on a single thread! So it cannot benefit from the modern CPUs with many threads, or GPUs that have hundreds/thousands of computing units.
  • There is no way to preserve intermediate information for future matches (and not have to repeat them).
k-d tree

The k-d tree concept is much more abstract, but powerful (addressing all the caveats of the sort-based method described above.). In short a k-d tree is a partitioning of a k-dimensional space (“k” is just a place-holder, so together with “d” for dimension, “k-d” means “any number of dimensions”!).

The k-d tree of table A is another table with the same number of rows, but only two integer columns: the integers contain the row indexs (counting from zero) of the left and right “branch” (in the “tree”) of that row. With a k-d tree we can find the nearest point with much fewer checks (statistically: compared to always parsing everything from the top-down). We won’t go deeper into the concept of k-d trees here and will focus on the high-level usage in of k-d trees in Match. In case you are interested to learn more on the k-d tree concept and Gnuastro’s implementation, please see its Wikipedia page and K-d tree (kdtree.h).

When given two catalogs (like the command below), Gnuastro’s Match will internally construct a k-d tree for catalog A (the first catalog given to it) and use the k-d tree of A, for finding the nearest row in B to each row in A. This is done in parallel on all available threads (unless you specify a certain number of threads to use with --numthreads, see Multi-threaded operations)

$ astmatch A.fits --ccol1=ra,dec B.fits --ccol2=RA,DEC \
           --aperture=1/3600

In scenarios where your reference (A) catalog is the same (and it is large!), you can save time by building the k-d tree of A and saving it into a file once, and simply use that k-d tree in all future matches. The command below shows how to do the first step (to build the k-d tree and keep it in a file).

$ astmatch A.fits --ccol1=ra,dec --kdtree=build \
           --output=A-kdtree.fits

This external k-d tree (A-kdtree.fits) can be fed to Match later (to avoid having to reconstruct it every time you want to match a new catalog with A). The commands below show how to do this by matching both B.fits and C.fits with A.fits using its pre-built k-d tree. Note that the same --kdtree option above (which has a value of build), is now given the file name of the already-built k-d tree.

$ astmatch A.fits --ccol1=ra,dec --kdtree=A-kdtree.fits \
           B.fits --ccol2=RA,DEC --aperture=1/3600 \
           --output=A-B.fits
$ astmatch A.fits --ccol1=ra,dec --kdtree=A-kdtree.fits \
           C.fits --ccol2=RA,DEC --aperture=1/3600 \
           --output=A-C.fits

There is just one technical issue however: when there is no neighbor within the acceptable distance of the k-d tree, it is forced to parse all elements to confirm that there is no match! Therefore if one catalog only covers a small portion (in the coordinate space) of the other catalog, the k-d tree algorithm will be forced to parse the full k-d tree for the majority of points! This will dramatically decrease the running speed of Match.

To mitigate this, Match first divides the range of the first input in all its dimensions into bins that have a width of the requested aperture (similar to a histogram), and will only do the k-d tree based search when the point in catalog B actually falls within a bin that has at least one element in A.

In summary, here are the points to consider when selecting an algorithm, or the order of your inputs (for optimal speed, the match will be the same):

  • For larger datasets, the k-d tree based method (when running on all threads possible) is much more faster than the classical sort-based method.
  • If you always need to match against one catalog (that is large!), the k-d tree construction itself can take a significant fraction of the running time. In such cases, save the k-d tree into a file and simply give it to later calls (as shown above)
  • For the inner or full arrangement of the output (described in Arranging match output), the order of inputs does not matter. But if you put the table with fewer rows as the first input, you will gain a lot in processing time (depending on the size of the other table and the number of threads). This is because of the following facts:
    • The k-d tree is constructed for the first input table and the construction of a larger dataset’s k-d tree will take longer (as a non-linear function of the number of rows).
    • Multi-threading is done on the rows of the second input table and it scales in a linear way with more rows.