XML Canonicalization, Part 2
by Bilal Siddiqui
October 09, 2002
In the
previous installment of this article, I introduced Canonical XML, and
I discussed when and why you need to canonicalize an XML file. I also
demonstrated a step-by-step process that results in the canonical form of
an XML document.
In this second and final installment, I'll take the concept further and
explain the canonicalization requirements of CDATA sections, processing
instructions, comments, external entity references and XML document
subsets.
Let's start with an example. Listing 1 is an XML file that
contains, among other things, a CDATA section, comments, a processing
instruction, and an external entity reference. The thirteen steps of part
1 are not sufficient to canonicalize it. We need to perform a few
additional steps.
14. CDATA Sections
The canonical form requires all CDATA sections to be replaced with
their equivalent PCDATA XML content. This is what we have done in Listing 2. If you compare the
two listings, you will find that the markup for the CDATA section
("<![CDATA[" at the beginning and "]]>" at the end) has been deleted
and the "<" character in the CDATA section of Listing 1 has been replaced
with its equivalent escape sequence (&lt;) in Listing 2.
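As a quick check of this rule, here is a minimal Python sketch. It assumes Python 3.8 or later, whose standard library function xml.etree.ElementTree.canonicalize implements the newer Canonical XML 2.0 rather than the version discussed in this article; the two treat CDATA sections the same way. The sample document is made up for illustration and is not Listing 1.

```python
from xml.etree.ElementTree import canonicalize

# Hypothetical input (not Listing 1): a CDATA section that contains a "<".
source = "<note><![CDATA[price < 100]]></note>"

# The CDATA markup disappears and "<" becomes the escape sequence "&lt;".
canonical = canonicalize(source)
print(canonical)  # <note>price &lt; 100</note>
```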
15. Processing Instructions
We need to normalize whitespace inside processing instructions. This
means that the whitespace between the target and its data will be reduced
to a single space (the #x20 character).
There is only one processing instruction in the XML file of Listing 2. The target in this
processing instruction is xml-stylesheet, which is followed
by the data string. Listing
3 is the same as Listing
2, except that all whitespace between the target and its data has
been normalized.
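The same normalization can be observed with a short Python sketch, again assuming Python 3.8 or later and its xml.etree.ElementTree.canonicalize function (a Canonical XML 2.0 implementation, which handles this case the same way). The stylesheet name doc.xsl is a placeholder, not taken from the listings.

```python
from xml.etree.ElementTree import canonicalize

# Hypothetical input: several spaces separate the target from its data.
pi_source = '<?xml-stylesheet     href="doc.xsl"?><doc/>'

# After canonicalization a single space separates the target and the data.
pi_canonical = canonicalize(pi_source)
print(pi_canonical)
```

Only the whitespace between the target and the data is collapsed; whitespace inside the data string itself is left alone.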
16. External Entity References
Recall the section on entity references in part 1, where we
demonstrated how to canonicalize parsed internal entity references. In a
similar fashion, parsed external entity references also need to be
replaced with the content they refer to, as shown in Listing 4.
17. Comments
The canonical XML specification allows both retaining and removing
comments from an XML file. An XML canonicalization engine will receive a
boolean parameter (flag) along with the XML file to be canonicalized,
which will tell the canonicalization engine whether to include or exclude
XML comments in the canonical form.
For example, Listing 5
shows the removal of comments from Listing 4 (canonical XML
without comments).
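The flag described above might look like this in practice. This is a sketch that assumes Python 3.8 or later, where xml.etree.ElementTree.canonicalize (a Canonical XML 2.0 implementation) exposes the choice as a with_comments keyword argument that defaults to False. The input is invented for illustration and is not Listing 4.

```python
from xml.etree.ElementTree import canonicalize

# Hypothetical input containing one comment between two pieces of text.
commented = "<doc>a<!-- reviewer note -->b</doc>"

# Default: comments are dropped from the canonical form.
without_comments = canonicalize(commented)

# With the flag set, comments are retained.
with_comments = canonicalize(commented, with_comments=True)

print(without_comments)  # <doc>ab</doc>
print(with_comments)     # <doc>a<!-- reviewer note -->b</doc>
```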
We are now ready to apply the thirteen steps to Listing 5 (as described in
part 1). The result is shown in Listing 6.
18. Document Subsets
XML document subsets or fragments (portions of complete XML files) are
an interesting case. When we extract a portion from an XML file, we
essentially separate a child node from its parent (call it an
orphan node). This separation may invalidate the child's namespace
context if that context was declared in a parent that has been omitted
from the document subset.
The Canonical XML specification proposes a method to preserve the
namespace context while extracting a document subset. However, there are
application scenarios in which preserving the namespace context may create
other problems. W3C has released a separate recommendation named Exclusive
XML Canonicalization which deals with such scenarios.
The only difference between the Canonical XML and Exclusive XML
Canonicalization specifications is whether the ancestor context is
preserved or excluded.
Preserving the Ancestor Context
Have a look at Listing
7, which is a SOAP message. Let's assume we need to canonicalize the
booking element in Listing 7 whose
unitCharge attribute shows "50" as the value. The first step
in doing this is to write an XPath expression that will extract the
required document fragment from the XML file. While trying to identify
which element I intend to canonicalize, I said "the booking element in
Listing 7 whose unitCharge attribute shows '50' as the value". The
equivalent XPath expression with the same meaning is
(//. | //@* | //namespace::*)[ancestor-or-self::bs:booking[@unitCharge="50"]]
(with namespace declaration xmlns:bs="http://www.FictitiousTourismInterface/BookingService")
This XPath expression will extract the required booking element from
the XML file of Listing
7. The expression in the parentheses (//. | //@* |
//namespace::*) selects every node of the XML file, including its
attribute and namespace nodes. The expression in the outer pair of square
brackets (ancestor-or-self::bs:booking) keeps only those nodes that are a
booking element or lie inside one, and the expression in the inner pair
of square brackets (@unitCharge="50") narrows the match to the
booking element whose unitCharge attribute has
the value "50".
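To get a feel for the predicate part of this expression, the following Python sketch locates the same kind of element with the standard library's xml.etree.ElementTree. Two assumptions apply: ElementTree supports only a small subset of XPath, so the query below selects the element itself rather than the full node-set of elements, attributes, and namespace nodes; and the sample message is a made-up stand-in, not Listing 7.

```python
import xml.etree.ElementTree as ET

BS = "http://www.FictitiousTourismInterface/BookingService"

# Hypothetical stand-in for the SOAP body: two bookings with different charges.
message = f"""<bs:bookingPackage xmlns:bs="{BS}" xml:lang="en">
  <bs:booking unitCharge="50"><bs:hotel>A</bs:hotel></bs:booking>
  <bs:booking unitCharge="80"><bs:hotel>B</bs:hotel></bs:booking>
</bs:bookingPackage>"""

root = ET.fromstring(message)

# The prefix-to-URI mapping plays the role of the xmlns:bs declaration
# that accompanies the XPath expression.
matches = root.findall(".//bs:booking[@unitCharge='50']", {"bs": BS})

for booking in matches:
    print(booking.get("unitCharge"), booking[0].text)  # 50 A
```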
Listing 8 is a subset
of Listing 7 and consists
of the booking element. Some readers might be tempted at this point to
apply the thirteen steps of part 1 to canonicalize Listing 8. However, there are
a couple of problems that require additional processing before we can
apply those thirteen steps:
The namespace declarations for the bs and
hs prefixes were made in the booking element's parent tag, which
is not included in the document subset shown in Listing 8.
The xml:lang attribute
of the bookingPackage element of Listing 7 was applicable to all its
children. This attribute is also missing in the document subset of Listing 8.
These problems clearly indicate that extracting document fragments
should be accompanied by actions to preserve the namespace context and the
effect of attributes from the xml: namespace. The Canonical
XML specification requires the following measures to be taken while
canonicalizing document subsets (in addition to all the requirements of
canonicalizing complete XML files).
Namespace declarations in the omitted ancestors of the
document subset are included in the canonical form.
Attributes in the xml namespace are also included in the canonical
form, if they are not already present in the fragment being canonicalized.
These two steps are intended to conserve the ancestor context of a
document subset. Have a look at Listing 9 (the required
canonical form), which includes the four namespace declarations made in
the ancestors of the booking element of Listing 7. Listing 9 also includes the
xml:lang attribute. Also notice that the canonical form of
this document subset does not have any line breaks (#xA); that is, the
entire fragment appears on a single line.
Once the ancestor context has been included, the ordering of namespace
declarations and attributes is the same as for canonicalizing the complete
XML file.
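The two measures for preserving the ancestor context can be sketched in Python with the standard library's xml.dom.minidom. This is a simplified illustration rather than a conforming implementation: the helper inherited_context is a hypothetical name, the sample message is invented (it is not Listing 7), and the sketch only gathers the declarations and xml: attributes that would have to be copied onto the orphan element.

```python
from xml.dom import minidom

def inherited_context(element):
    """Collect namespace declarations and xml:* attributes from ancestors.

    Walks upward from the element's parent. The nearest ancestor wins, and
    anything the element already declares itself is skipped.
    """
    inherited = {}
    node = element.parentNode
    while node is not None and node.nodeType == node.ELEMENT_NODE:
        for name, value in node.attributes.items():
            is_namespace = name == "xmlns" or name.startswith("xmlns:")
            is_xml_attr = name.startswith("xml:")
            if not (is_namespace or is_xml_attr):
                continue
            if name in inherited or element.hasAttribute(name):
                continue
            inherited[name] = value
        node = node.parentNode
    return inherited

# Hypothetical message: the declarations live on ancestors of bs:booking.
doc = minidom.parseString(
    '<env:Envelope xmlns:env="http://schemas.xmlsoap.org/soap/envelope/">'
    '<bs:bookingPackage'
    ' xmlns:bs="http://www.FictitiousTourismInterface/BookingService"'
    ' xml:lang="en">'
    '<bs:booking unitCharge="50"/>'
    '</bs:bookingPackage></env:Envelope>'
)
booking = doc.getElementsByTagName("bs:booking")[0]
context = inherited_context(booking)

for name in sorted(context):
    print(name, "=", context[name])
```

Walking from the nearest ancestor outward means an inner declaration shadows an outer one for the same prefix, which matches how namespace scoping works in XML.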