Matching algorithms (GNU Astronomy Utilities)

Next: Invoking Match, Previous: Match, Up: Match [Contents][Index]

7.5.1 Matching algorithms ¶

Matching involves two catalogs, let’s call them catalog A (with N rows) and catalog B (with M rows). The most basic matching algorithm that immediately comes to mind is this: for each row in A (let’s call it \(A_i\)), go over all the rows in B (\(B_j\), where \(0

This basic parsing algorithm is very computationally expensive: \(N\times M\) distances have to measured, and calculating the distance requires a square root and power of 2: in 2 dimensions it would be \(\sqrt{(B_{ix}-A_{ix})^2+(B_{iy}-A_{iy})^2}\). If an elliptical aperture is necessary, it can even get more complicated, see Defining an ellipse and ellipsoid. Such operations are not simple, and will consume many cycles of your CPU! As a result, this basic algorithm will become terribly slow as your datasets grow in size. For example, when N or M exceed hundreds of thousands (which is common in the current days with datasets like the European Space Agency’s Gaia mission). Therefore that basic parsing algorithm will take too much time and more efficient ways to find the nearest neighbor need to be found. Gnuastro’s Match currently has algorithms for finding the nearest neighbor:

Sort-based

In this algorithm, we will use a moving window over the sorted datasets:

Sort the two datasets by their first coordinate. Therefore \(A_i
Use the radial distance threshold to define the width of a moving interval over both A and B. Therefore, with a single parsing of both simultaneously, for each A-point, we can find all the elements in B that are sufficiently near to it (within the requested aperture).

This method has some caveats: 1) It requires sorting, which can again be slow on large numbers. 2) It can only be done on a single CPU thread! So it cannot benefit from the modern CPUs with many threads. 3) There is no way to preserve intermediate information for future matches, for example, this can greatly help when one of the matched datasets is always the same. To use this sorting method in Match, use --kdtree=disable.

k-d tree based

The k-d tree concept is much more abstract, but powerful (addressing all the caveats of the sort-based method described above.). In short a k-d tree is a partitioning of a k-dimensional space (“k” is just a place-holder, so together with “d” for dimension, “k-d” means “any number of dimensions”!). The k-d tree of table A is another table with the same number of rows, but only two integer columns: the integers contain the row indexs (counting from zero) of the left and right “branch” (in the “tree”) of that row. With a k-d tree we can find the nearest point with much fewer (statistically) checks, compared to always parsing from the top-down. For more on the k-d tree concept and Gnuastro’s implementation, please see K-d tree (kdtree.h).

When given two catalogs (like the command below), Gnuastro’s Match will internally construct a k-d tree for catalog A (the first catalog given to it) and use the k-d tree of A, for finding the nearest B-point(s) to each A-point (this is done in parallel on all available CPU threads, unless you specify a certain number of threads to use with --numthreads, see Multi-threaded operations)

$ astmatch A.fits --ccol1=ra,dec B.fits --ccol2=RA,DEC \
           --aperture=1/3600

However, optionally, you can also build the k-d tree of A and save it into a file, with a separate call to Match, like below

$ astmatch A.fits --ccol1=ra,dec --kdtree=build \
           --output=A-kdtree.fits

This external k-d tree (A-kdtree.fits) can be fed to Match later (to avoid having to reconstruct it every time you want to match a new catalog with A) like below for matching both B.fits and C.fits with A.fits. Note that the same --kdtree option above, is now given the file name of the k-d tree, instead of build.

$ astmatch A.fits --ccol1=ra,dec --kdtree=A-kdtree.fits \
           B.fits --ccol2=RA,DEC --aperture=1/3600 \
           --output=A-B.fits
$ astmatch A.fits --ccol1=ra,dec --kdtree=A-kdtree.fits \
           C.fits --ccol2=RA,DEC --aperture=1/3600 \
           --output=A-C.fits

Irrespective of how the k-d tree is made ready (by importing or by constructing internally), it will be used to find the nearest A-point to each B-point. The k-d tree is parsed independently (on different CPU threads) for each row of B.

There is just one technical issue however: when there is no neighbor within the acceptable distance of the k-d tree, it is forced to parse all elements to confirm that there is no match! Therefore if one catalog only covers a small portion (in the coordinate space) of the other catalog, the k-d tree algorithm will be forced to parse the full k-d tree for the majority of points! This will dramatically decrease the running speed of Match. Therefore, Match first divides the range of the first input in all its dimensions into bins that have a width of the requested aperture (similar to a histogram), and will only do the k-d tree based search when the point in catalog B actually falls within a bin that has at least one element in A.

Above, we described different ways of finding the \(A_i\) that is nearest to each \(B_j\). But this is not the whole matching process! Let’s go ahead with a “basic” description of what happens next... You may be tempted to remove \(A_i\) from the search of matches for \(B_k\) (where \(k>j\)). Therefore, as you go down B (and more matches are found), you have to calculate less distances (there are fewer elements in A that remain to be checked). However, this will introduce an important bias: \(A_i\) may actually be closer to \(B_k\) than to \(B_j\)! But because \(B_j\) happened to be before \(B_k\) in your table, \(A_i\) was removed from the potential search domain of \(B_k\). The good match (\(B_k\) with \(A_i\) will therefore be lost, and replaced by a false match between \(B_j\) and \(A_i\)!

In a single-dimensional match, this bias depends on the sorting of your two datasets (leading to different matches if you shuffle your datasets). But it will get more complex as you add dimensionality. For example, catalogs derived from 2D images or 3D cubes, where you have 2 and 3 different coordinates for each point.

To address this problem, in Gnuastro (the Match program, or the matching functions of the library) similar to above, we first parse over the elements of B. But we will not associate the first nearest-neighbor with a match! Instead, we will use an array (with the number of rows in A, let’s call it “B-in-A”) to keep the list of all nearest element(s) in B that match each A-point. Once all the points in B are parsed, each A-point in B-in-A will (possibly) have a sorted list of B-points (there may be multiple B-points that fall within the acceptable aperture of each A-point). In the previous example, the \(i\) element (corresponding to \(A_i\)) of B-in-A will contain the following list of B-points: \(B_j\) and \(B_k\).

A new array (with the number of points in B, let’s call it A-in-B) is then used to find the final match. We parse over B-in-A (that was completed above), and from it extract the nearest B-point to each A-point (\(B_k\) for \(A_i\) in the example above). If this is the first A-point that is found for this B-point, then we put this A-point into A-in-B (in the example above, element \(k\) is filled with \(A_k\)). If another A-point was previously found for this B-point, then the distance of the two A-points to that B-point are compared, and the A-point with the smaller distance is kept in A-in-B. This will give us the best match between the two catalogs, independent of any sorting issues. Both the B-in-A and A-in-B will also keep the distances, so distances are only measured once.

In summary, here are the points to consider when selecting an algorithm, or the order of your inputs (for optimal speed, the match will be the same):

For larger datasets, the k-d tree based method (when running on all threads possible) is much more faster than the classical sort-based method.
The k-d tree is constructed for the first input table and the multi-threading is done on the rows of the second input table. The construction of a larger dataset’s k-d tree will take longer, but multi-threading will work better when you have more rows. As a result, the optimal way to place your inputs is to give the smaller input table (with fewer rows) as the first argument (so its k-d tree is constructed), and the larger table as the second argument (so its rows are checked in parallel).
If you always need to match against one catalog (that is large!), the k-d tree construction itself can take a significant fraction of the running time. Therefore you can save its k-d tree into a file and simply give it to later calls, like the example given in the description of the k-d algorithm mentioned above.