A word difference finder (and others): mdiff

3 The multi-difference finder

The name mdiff stands for multi-diff; the program is meant to encompass the functionality of a few other diff-type programs. The prefix multi- also reflects the fact that the program is often able to study more than two input files at once.

The theory of operation is simple. The program splits each input file into a sequence of items, which may be lines or words. mdiff is then said to operate either in line mode or in word mode. It then tries to find sequences of items which are repeated in the input files. Such common sequences are called clusters of items, and each occurrence of a repetition is called a cluster member. What remains, once all cluster members are conceptually removed from all input files, is a set of differences. The role of mdiff is to conveniently list either cluster members or differences.
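The vocabulary above can be illustrated with a minimal Python sketch. It is not mdiff's actual algorithm: it only looks for runs of a fixed length (the hypothetical parameter `min_items`) that occur more than once, and it does not extend them to their maximal size. The function names are likewise invented for the illustration.

```python
from collections import defaultdict

def split_items(text, word_mode=False):
    """Split one input into items: words in word mode, lines otherwise."""
    return text.split() if word_mode else text.splitlines()

def find_clusters(files, min_items=2, word_mode=False):
    """Map each repeated run of `min_items` items to its members.

    A member is a (file_index, item_index) pair telling where one
    occurrence of the run starts.  Fixed-length runs are an assumption
    made to keep the sketch short.
    """
    occurrences = defaultdict(list)
    for file_index, text in enumerate(files):
        items = split_items(text, word_mode)
        for start in range(len(items) - min_items + 1):
            run = tuple(items[start:start + min_items])
            occurrences[run].append((file_index, start))
    # Only runs seen more than once count as clusters.
    return {run: members for run, members in occurrences.items()
            if len(members) > 1}

old = "alpha\nbeta\ngamma\ndelta\n"
new = "alpha\nbeta\nGAMMA\ndelta\n"
clusters = find_clusters([old, new])
```

Here the only cluster is the run of lines alpha, beta, with one member in each file; the lines that belong to no member are the differences.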

When input files are very similar, clusters are likely to encompass many items (lines or words) and differences are likely to be small. So, most listing options inhibit the printing of cluster members. However, one may ask for the first or last few items of cluster members to be printed nevertheless, as a way to provide some context for the difference. Those context items are sometimes said to be at the horizon of the difference. In merged listings, cluster members may be left unprinted, except perhaps for a few context items at the beginning of the member (just after a difference) and a few context items at the end of the member (just before a difference).

When cluster members are short or, if you prefer, when the differences are not far from each other, the required context items may well cover the full extent of the cluster members, which are then printed in full rather than inhibited. A run of differences intermixed with such non-suppressed members is called a hunk. Some reports produced by mdiff are shown as a list of hunks, and it is to be understood that common items are elided between hunks. However, each hunk in itself has no item missing, and each item of the hunk is analysed as pertaining either to only one of the input files or to several of them. Each hunk is preceded by a header, which gives the line position in each input file just before the hunk. By comparing a hunk header with the previous hunk header, the user can get a hint of how much printing was spared.
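How context items glue neighbouring differences into a hunk can be sketched as follows. This is an illustration only, under the assumption that a merged listing is available as one flag per item saying whether the item is a difference; `group_hunks` and its `context` parameter are hypothetical names, not mdiff options.

```python
def group_hunks(is_difference, context=1):
    """Return (start, end) item ranges, end exclusive, one per hunk.

    `is_difference[i]` is True when item i of the merged listing is a
    difference and False when it belongs to a cluster member.
    """
    hunks = []
    for index, differs in enumerate(is_difference):
        if not differs:
            continue
        start = max(0, index - context)
        end = min(len(is_difference), index + context + 1)
        if hunks and start <= hunks[-1][1]:
            # The context reaches the previous hunk: the common run in
            # between is fully covered, so both differences share a hunk.
            hunks[-1] = (hunks[-1][0], max(hunks[-1][1], end))
        else:
            hunks.append((start, end))
    return hunks

flags = [False, True, False, False, True,
         False, False, False, False, True, False]
hunks = group_hunks(flags, context=1)
```

With one context item on each side, the two common items between the first two differences are fully covered, so those differences end up in the same hunk. The four common items before the last difference are not, so two of them are elided and a second hunk begins.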

When two input files are quite similar, clusters usually appear in the same order in all files. If a cluster member A in the first file corresponds to a cluster member A in the second file, it is likely that another cluster member B which appears after A in the first file will correspond to a cluster member B in the second file which appears after A as well. So, in many cases, while producing a merged listing of files, cluster members may be made to naturally correspond to one another. However, this is not always true, in particular when the second file has been produced from the first by moving a big chunk of code away from its original position. In such cases, we say that members have crossed. When members are crossed and mdiff has to make a merged listing, it selects one cluster member as being naturally associated with its counterpart (either the pair of A’s or the pair of B’s) and then considers the other cluster as being part of a difference. The crossed nature of the member may still be analysed and reported, or it may be ignored.
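The text does not say how mdiff decides which pair to keep. As one possible policy, assumed here purely for illustration, the sketch below keeps the largest set of clusters that appear in the same order in both files (a longest increasing subsequence) and reports the rest as crossed. Each cluster is described by the hypothetical pair (position in first file, position in second file).

```python
def split_crossed(pairs):
    """Split (first_pos, second_pos) pairs into aligned and crossed lists."""
    pairs = sorted(pairs)
    if not pairs:
        return [], []
    # best[i] is the length of the longest in-order chain ending at pairs[i].
    best = [1] * len(pairs)
    prev = [None] * len(pairs)
    for i in range(len(pairs)):
        for j in range(i):
            if pairs[j][1] < pairs[i][1] and best[j] + 1 > best[i]:
                best[i] = best[j] + 1
                prev[i] = j
    # Walk back from the end of one longest chain.
    i = max(range(len(pairs)), key=best.__getitem__)
    keep = set()
    while i is not None:
        keep.add(i)
        i = prev[i]
    aligned = [p for k, p in enumerate(pairs) if k in keep]
    crossed = [p for k, p in enumerate(pairs) if k not in keep]
    return aligned, crossed

# Cluster B (line 30 in the first file) was moved to line 5 in the second.
aligned, crossed = split_crossed([(10, 40), (30, 5), (50, 60)])
```

In this example the moved cluster is the one reported as crossed, while the two clusters that kept their relative order stay aligned.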

The standard diff program is meant for the case of exactly two input files, for which crossed members should be ignored. The mdiff output format has been designed to resemble diff output for this precise case. However, diff formats are not sufficient for representing all the cases which mdiff may address, and the mdiff formats for those cases are not mature yet. That is why mdiff, in its current state, still experiments with output formats, which are subject to change.

When the input files are not very similar, or are rather different, merged listings are neither very significant nor useful, and may even be rather confusing. The best thing to do in such cases is to use mdiff to make an annotated relisting of all input files, in which cluster members are properly identified and referred to one another.

Statistics.

Read summary: 137 files, 41975 lines
Work summary: 439 clusters, 1608 members, 8837 duplicate lines

The summary lines, triggered by the -s option, say that about 8837 non-ignorable lines could be removed out of the 41975 which have been read, by using functions, #include, #define, or similar devices.
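To put these figures in proportion, the duplicate lines amount to roughly 21 percent of the lines read, and each cluster has between three and four members on average. The short Python sketch below reads the two summary lines quoted above and derives both ratios; `parse_summary` is a hypothetical helper written for this example, not part of mdiff.

```python
import re

summary = """\
Read summary: 137 files, 41975 lines
Work summary: 439 clusters, 1608 members, 8837 duplicate lines
"""

def parse_summary(text):
    """Collect every 'count label' pair found in the summary lines."""
    counts = {}
    for number, label in re.findall(r"(\d+) ([a-z ]+)", text):
        counts[label.strip()] = int(number)
    return counts

counts = parse_summary(summary)
# Share of the lines read that sit in duplicated clusters.
duplicate_share = counts["duplicate lines"] / counts["lines"]
# Average number of occurrences of each cluster.
members_per_cluster = counts["members"] / counts["clusters"]
```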

If one manages to execute mdiff within GNU Emacs so that the output described above is collected into the *compilation* buffer, the command C-x ` (‘M-x next-error’) will proceed to the next cluster member in the other window, and similarly for other compilation mode commands. This is a useful way of handling mdiff output.

Each line in the hunk, after the header, comes from the compared files, but is shifted right so that the first column (or the first few columns) of each line gives information about where the line comes from. A space indicates a line which is common to all files. When there are only two input files, a minus sign indicates a line from the first file and a plus sign a line from the second file. Otherwise, a letter from ‘a’ to ‘z’, or more than one letter if there are more than 26 files, indicates to which file the line pertains. If a line or a block of lines pertains to several files but not to all of them, the first column holds a vertical bar, and the line or block of lines is bracketed between ‘@/’ and ‘@\’ lines, which are a kind of comment within the hunk. The initial bracket lists all file letters that are related to the incoming line.
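The marker rules just described can be restated as a small Python function. This is a sketch of the rules as worded here, not code taken from mdiff. In particular, the text does not specify how file names longer than one letter are formed, so the ‘aa’, ‘ab’, ... scheme in `file_letter` is an assumption, and both function names are hypothetical.

```python
def file_letter(index):
    """Name file number `index` (zero-based): a..z, then aa, ab, ...

    The multi-letter scheme is an assumption made for this sketch.
    """
    letters = ""
    index += 1
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("a") + remainder) + letters
    return letters

def line_marker(owners, file_count):
    """Return the first-column marker for a line held by `owners`.

    `owners` is the set of zero-based indexes of the files that
    contain the line; `file_count` is the number of input files.
    """
    if len(owners) == file_count:
        return " "                  # common to all files
    if len(owners) > 1:
        return "|"                  # several files, but not all of them
    (owner,) = owners
    if file_count == 2:
        return "-" if owner == 0 else "+"
    return file_letter(owner)

markers = [line_marker({0}, 2), line_marker({1}, 2),
           line_marker({2}, 3), line_marker({0, 2}, 3)]
```

A line marked with a vertical bar would additionally be enclosed between ‘@/’ and ‘@\’ lines naming the files concerned; the sketch leaves that bracketing out.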

I initially wrote mdiff specifically to help clean up a rather large C++ project, in which many big monolithic classes had been derived from each other, most probably by rough copying followed by local modifications. I intended to fragment the most common clusters and segregate the parts into virtual methods in outer classes, and to override these methods, as appropriate, with less common variants within inner classes. mdiff was good at pointing me to exactly where I should look. Of course, it never did the cleanup for me, but it helped with the research about what should be done. Rerunning mdiff over the half-cleaned project gave me a finer-grained analysis of what was left to consider.

• mdiff invocation:		Invoking `mdiff`
• Efficiency:		Resource considerations and efficiency