A word difference finder (and others): Efficiency

3.2 Resource considerations and efficiency

Memory consumption

mdiff can easily handle medium-sized project. For a 32 bits architecture, the memory requirements may computed like this:

8 bytes per file
8 bytes per line
4 bytes per cluster
8 bytes per cluster member

Time consumption

To evaluate the speed, consider the example shown above (see mdiff), and yielding these statistics:

Read summary: 137 files, 41975 lines
Work summary: 439 clusters, 1608 members …

Once many files in the memory cache, and redirecting the output to /dev/null, the processing takes 3 seconds of real time on an Intel 486/100, which looks good. I was indeed afraid of some hidden O(n^2) behaviour¹, even if the program is mostly O(n*log(n)). Maybe one will discover or construct cases putting mdiff on its knees. So far, mdiff seemingly behaves well for the little problems given to it. If we devise and generate a more traditional diff-like output, in which all input files are relisted, this will add some time to the processing, but it will be only linear with regard with the total length of input files.

There is a clever optimized sorting algorithm for all substrings of a file, which might be generalised to handle words or lines for mdiff. But since the program is already faster than we initially expected, there is no emergency to resort to using such an algorithm.

Trading complexity for clarity

When lines repeat a lot, there are surprisingly many ways to relate blocks of lines, and reporting them all can make very hairy listings. Any choice about reporting similarities, or not, is somewhat arbitrary, but we ought to make some of such choices for the program to be practical. Some of these choices are detailed here.

If all members of a given cluster A are proper subsets of all members of another given cluster B, then cluster A is wholly forgotten. However, let’s presume for example that there are more members in A than in B. Then, some members of A necessarily appear unrelated to any member of B. In such case, it has been decided more useful to report all occurrences of A members, even those embedded within occurrences of B members. When only interested in members B, annotations pertaining to A may be perceived as clutter. However, when interested in members of A, getting all of them is probably the most useful choice.

It sometimes happen that members of a very same cluster overlap. In the string ‘a a a’, there are two overlapping members for the cluster represented by the string ‘a a’, one from the first two ‘a’, another from the last two ‘a’. In such cases, one member of such an overlap is automatically chopped so the overlap does not occur.

White lines and items containing only delimiters are the possible source of a lot of complexity, if these are fully taken as significant. Since this does not add much to clarity, they are better ignored, usually, through using --ignore-blank-lines (-B) or --ignore-delimiters (-j). Increasing the value of --minimum-size=items (‘-J items’) option also cut off complexity in favor of clarity, yet some small matches may then go unnoticed. Exactly how to best adjust the items value is left for the user to decide.

Footnotes

(1)

n is the total number of lines.