|On this page…|
The core of the XML file comparison engine is Chawathe's matching algorithm. This matching algorithm is a heuristic method based on a scoring system. This means that comparison results could be unexpected when many elements in each document are very similar.
See the following sections for some examples.
XML text documents are hierarchical data structures. Users can insert, delete, or reorder elements, modify their contents, or move elements across different parts of the hierarchy. The Chawathe algorithm can detect these different types of changes within the hierarchy of the document. As with conventional text differencing utilities, the Chawathe algorithm detects local text that is added, deleted, or changed, and additionally can prepare an edit script that can be used to create a report of the hierarchical location of detected differences.
The Chawathe algorithm attempts to match elements that are of the same category. The Chawathe paper refers to these categories as labels. In the following XML example documents (with labels A, B, and C):
The three C elements on the left are compared with the three C elements on the right
The single B element on the left is compared with the two B elements on the right
The Chawathe algorithm matches a particular label by extracting a flat sequence of elements from the hierarchical document and attempting to match the elements in the sequences. In the example above, elements of the sequence
(<C> First </C>, <C> Second </C>, <C> Third </C>)
are matched against elements of the sequence
(<C> First </C>, <C> Third </C>, <C> Fourth </C>)
Sequences are matched using a Longest Common Subsequence (LCS) algorithm. For example, if C elements are matched on their text content, the LCS of the above sequences is given by:
(<C> First </C>, <C> Third </C>)
You can define a score for matching elements of a particular label in different ways. For instance, in the above example, C elements can be matched on text content, B can be matched on text content and on Name, and A on the number of B and C elements they have in common. To determine whether elements match or not, the Chawathe algorithm compares the score to a threshold.
The implementation can specify scoring methods, thresholds, the definition of labels, and the order in which labels are processed. These can be defined separately for each problem domain or type of XML file. The XML comparison tool provides suitable definitions for a set of common XML file types, and uses a default definition for any type of XML document it does not recognize.
Chawathe's algorithm is a heuristic. That is, it cannot guarantee to return the optimal matching between two sequences. It is the use of a threshold mechanism in combination with an LCS algorithm that makes the algorithm a heuristic. A heuristic algorithm is preferable to an optimal matching because the heuristic is much faster.
An algorithm can only guarantee a mathematically optimal matching by exhaustively computing the score between all pairs of elements in the two sequences and choosing those pairs that maximize an overall matching score between the two sequences. This exhaustive approach is computationally very expensive because its running time increases exponentially with the length of the sequences to be matched.
Also a user's expectations can depend on context information that is not available to the matching algorithm (e.g., prior knowledge of the precise sequence of changes applied). This means even a mathematically optimal algorithm might match elements unexpectedly from a user's perspective.
In contrast with the mathematically optimal approach, Chawathe's algorithm guarantees linear running time for sequences that are the same or very similar. The worst-case scenario is quadratic running time for sequences that are entirely different.
The XML comparison tool performs best when the files to be compared are mostly similar. It becomes slower for files that contain more differences.
Elements could fail to match even if they were matched in comparisons of previous versions of the documents. A seemingly small change in one of the properties used for matching can cause this to happen if it tips the score under the threshold.
Consider the following example where
B elements are scored on the value of x
A elements are scored on the ratio of matching B elements
For both A and B the score is compared with a threshold of 0.5.
The left A and the middle A have two out of three B elements in common, resulting in a matching score of 2/3=0.66. The XML comparison tool marks the A elements as matched and the report shows that their contents have been modified.
When a user makes a further change to the middle document (resulting in the right document), and this new document is compared again to the left document, the matching score for A drops to 1/3=0.33. The algorithm considers the A elements unmatched this time. In this case, the difference between the two documents is marked as a deletion of A from the left document and an insertion of a new A into the right document.
This problem is likely to occur when there is little information available inside a single element to score a match. A seemingly small change in one of the properties used for matching could tip the score under the threshold, and therefore result in a large change in the outcome of the comparison.
Sometimes unexpected matches of similar items occur across different parts of the hierarchy. In the following example, C elements are matched on name:
In this case, the user might expect to see the very first C element on the left marked as deleted, with the second and third C elements matched to the corresponding C element on the right. However, this might not happen, if the first C on the left is matched to the second C on the right, even though these two C elements exist in very different parts of the document hierarchy. This mismatch would result in the third C element on the left being marked as deleted, which the user might find unexpected.
This case is likely to occur when there are several potential matching candidates for a particular element. In other words, when elements of a particular label tend to be very similar. Whether such a spurious cross-matching occurs or not depends on all of the other C elements within the two documents. The LCS algorithm used for matching the two sequences favors local matches over distant ones. In other words, sub-sequences of elements that are close together in the first sequence tend to be matched to sub-sequences of elements that are close together in the second sequence. However, this locality is not always guaranteed, and the outcome depends on how other elements in the sequence are matched.
It is difficult to distinguish many similar potential matches and this could produce unexpected results. In the following example, B elements are scored on name, p1, and p2, and the score is compared to a threshold of 0.5.
The right document contains one B element more than the left document, and therefore one of the B elements on the right must remain unmatched and the tool will mark one as inserted. However, since most B elements on the left potentially match most B elements on the right, it is impossible to predict exactly how the sequences will be matched. For instance, the comparison could generate the following result:
"B name="1" …" > "B name="2" …" "B name="2" …" > "B name="new" …" "B name="3" …" > "B name="3" …" "B name="4" …" > "B name="4" …"
In this case, "B name= "1" on the right remains unmatched. As in the previous example, this depends on how all of the other B elements in the two documents are matched. This situation is likely to occur when elements have several potential matching candidates.