mdiff can easily handle medium-sized project. For a 32 bits
architecture, the memory requirements may computed like this:
To evaluate the speed, consider the example shown above (see mdiff), and yielding these statistics:
Read summary: 137 files, 41975 lines Work summary: 439 clusters, 1608 members …
Once many files in the memory cache, and redirecting the output to
/dev/null, the processing takes 3 seconds of real time on an
Intel 486/100, which looks good. I was indeed afraid of some hidden
even if the program is mostly O(n*log(n)). Maybe one will
discover or construct cases putting
mdiff on its knees. So far,
mdiff seemingly behaves well for the little problems given to it.
If we devise and generate a more traditional
in which all input files are relisted, this will add some time to the
processing, but it will be only linear with regard with the total length
of input files.
There is a clever optimized sorting algorithm for all substrings
of a file, which might be generalised to handle words or lines for
mdiff. But since the program is already faster than we initially
expected, there is no emergency to resort to using such an algorithm.
When lines repeat a lot, there are surprisingly many ways to relate blocks of lines, and reporting them all can make very hairy listings. Any choice about reporting similarities, or not, is somewhat arbitrary, but we ought to make some of such choices for the program to be practical. Some of these choices are detailed here.
If all members of a given cluster A are proper subsets of all members of another given cluster B, then cluster A is wholly forgotten. However, let’s presume for example that there are more members in A than in B. Then, some members of A necessarily appear unrelated to any member of B. In such case, it has been decided more useful to report all occurrences of A members, even those embedded within occurrences of B members. When only interested in members B, annotations pertaining to A may be perceived as clutter. However, when interested in members of A, getting all of them is probably the most useful choice.
It sometimes happen that members of a very same cluster overlap. In the string ‘a a a’, there are two overlapping members for the cluster represented by the string ‘a a’, one from the first two ‘a’, another from the last two ‘a’. In such cases, one member of such an overlap is automatically chopped so the overlap does not occur.
White lines and items containing only delimiters are the possible source of a lot of complexity, if these are fully taken as significant. Since this does not add much to clarity, they are better ignored, usually, through using --ignore-blank-lines (-B) or --ignore-delimiters (-j). Increasing the value of --minimum-size=items (‘-J items’) option also cut off complexity in favor of clarity, yet some small matches may then go unnoticed. Exactly how to best adjust the items value is left for the user to decide.
n is the total number of lines.