The core of the XML file comparison engine is Chawathe's matching algorithm. This matching algorithm is a heuristic method based on a scoring system. This means that comparison results could be unexpected when many elements in each document are very similar.
See the following sections for some examples.
XML text documents are hierarchical data structures. Users can insert, delete, or reorder elements, modify their contents, or move elements across different parts of the hierarchy. The Chawathe algorithm can detect these different types of changes within the hierarchy of the document. As with conventional text differencing utilities, the Chawathe algorithm detects local text that is added, deleted, or changed, and additionally can prepare an edit script that can be used to create a report of the hierarchical location of detected differences.
The Chawathe algorithm attempts to match elements that are of
the same category. The Chawathe paper refers to these categories as labels.
In the following XML example documents (with labels
C elements on the left
are compared with the three
C elements on the right
B element on the left
is compared with the two
B elements on the right
The Chawathe algorithm matches a particular label by extracting a flat sequence of elements from the hierarchical document and attempting to match the elements in the sequences. In the example above, elements of the sequence
(<C> First </C>, <C> Second </C>, <C> Third </C>)
(<C> First </C>, <C> Third </C>, <C> Fourth </C>)
Sequences are matched using a Longest Common Subsequence (LCS)
algorithm. For example, if
C elements are matched
on their text content, the LCS of the above sequences is given by:
(<C> First </C>, <C> Third </C>)
You can define a score for matching elements
of a particular label in different ways. For instance, in the above
C elements can be matched on text content,
be matched on text content and on
the number of
they have in common. To determine whether elements match or not, the
Chawathe algorithm compares the score to a threshold.
The implementation can specify scoring methods, thresholds, the definition of labels, and the order in which labels are processed. These can be defined separately for each problem domain or type of XML file. The XML comparison tool provides suitable definitions for a set of common XML file types, and uses a default definition for any type of XML document it does not recognize.
Chawathe's algorithm is a heuristic. That is, it cannot guarantee to return the optimal matching between two sequences. It is the use of a threshold mechanism in combination with an LCS algorithm that makes the algorithm a heuristic. A heuristic algorithm is preferable to an optimal matching because the heuristic is much faster.
An algorithm can only guarantee a mathematically optimal matching by exhaustively computing the score between all pairs of elements in the two sequences and choosing those pairs that maximize an overall matching score between the two sequences. This exhaustive approach is computationally very expensive because its running time increases exponentially with the length of the sequences to be matched.
Also a user's expectations can depend on context information that is not available to the matching algorithm (e.g., prior knowledge of the precise sequence of changes applied). This means even a mathematically optimal algorithm might match elements unexpectedly from a user's perspective.
In contrast with the mathematically optimal approach, Chawathe's algorithm guarantees linear running time for sequences that are the same or very similar. The worst-case scenario is quadratic running time for sequences that are entirely different.
The XML comparison tool performs best when the files to be compared are mostly similar. It becomes slower for files that contain more differences.
Elements could fail to match even if they were matched in comparisons of previous versions of the documents. A seemingly small change in one of the properties used for matching can cause this to happen if it tips the score under the threshold.
Consider the following example where
B elements are scored on the value
A elements are scored on the ratio
score is compared with a threshold of 0.5.
A and the middle
two out of three
B elements in common, resulting
in a matching score of 2/3=0.66. The XML comparison tool marks the
as matched and the report shows that their contents have been modified.
When a user makes a further change to the middle document (resulting
in the right document), and this new document is compared again to
the left document, the matching score for
to 1/3=0.33. The algorithm considers the
unmatched this time. In this case, the difference between the two
documents is marked as a deletion of
the left document and an insertion of a new
the right document.
This problem is likely to occur when there is little information available inside a single element to score a match. A seemingly small change in one of the properties used for matching could tip the score under the threshold, and therefore result in a large change in the outcome of the comparison.
Sometimes unexpected matches of similar items occur across different
parts of the hierarchy. In the following example,
are matched on
In this case, the user might expect to see the very first
on the left marked as deleted, with the second and third
matched to the corresponding
C element on the right.
However, this might not happen, if the first
the left is matched to the second
C on the right,
even though these two
C elements exist in very
different parts of the document hierarchy. This mismatch would result
in the third
C element on the left being marked
as deleted, which the user might find unexpected.
This case is likely to occur when there are several potential
matching candidates for a particular element. In other words, when
elements of a particular label tend to be very similar. Whether such
a spurious cross-matching occurs or not depends on all of the other
within the two documents. The LCS algorithm used for matching the
two sequences favors local matches over distant ones.
In other words, sub-sequences of elements that are close together
in the first sequence tend to be matched to sub-sequences of elements
that are close together in the second sequence. However, this locality
is not always guaranteed, and the outcome depends on how other elements
in the sequence are matched.
It is difficult to distinguish many similar potential matches
and this could produce unexpected results. In the following example,
are scored on
and the score is compared to a threshold of 0.5.
The right document contains one
more than the left document, and therefore one of the
on the right must remain unmatched and the tool will mark one as inserted.
However, since most
B elements on the left potentially
B elements on the right, it is impossible
to predict exactly how the sequences will be matched. For instance,
the comparison could generate the following result:
"B name="1" …" > "B name="2" …" "B name="2" …" > "B name="new" …" "B name="3" …" > "B name="3" …" "B name="4" …" > "B name="4" …"
In this case, "B name= "1" on the right
remains unmatched. As in the previous example, this depends on how
all of the other
B elements in the two documents
are matched. This situation is likely to occur when elements have
several potential matching candidates.