
Utility Stylesheets
by Bob DuCharme
April 07, 2004
I have several useful little stylesheets that I've never mentioned
in this column because each is so short that describing it would make
for a pretty short column. I recently realized, though, that by
combining them I have enough to fill up two columns, so this month
we'll look at the first few.
Despite their brevity -- counting white space and comments, the
longest is 25 lines long -- they're each useful in a wide variety of
situations and can be used on nearly any XML document. I say "nearly"
because some are focused on XHTML, but you can easily modify them to
handle DocBook documents or other document types.
Most follow a similar pattern (or, to use the appropriate buzz
phrase, "design pattern"): one template rule copies everything in the
source document verbatim to the result tree, and another template
rule, or even another instruction, takes care of the particular
problem that the stylesheet addresses. As pipelining approaches to
processing XML become more popular, stylesheets like these can be
useful building blocks when creating larger, more complex
processes.
Stripping Empty Paragraphs
Sometimes, when using something like Perl or Python to convert a
text file to XML, you have to assume that a carriage return in your
text file input marks the end of a paragraph, so multiple carriage
returns in a row get converted to empty paragraphs. The following
stylesheet adds one template rule to the "copy everything verbatim"
template rule: a rule for p elements that copies them only if
they still have content after their extraneous white space is
removed.
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <!-- Only copy non-empty p elements. -->
  <xsl:template match="p">
    <xsl:if test="normalize-space(.)">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

  <!-- Copy everything else verbatim. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
You can customize the first template rule's match condition to look
for other empty elements to suppress. For example, to have it check
for empty p, pre, and h4 elements, the
match condition would be "p|pre|h4". If you want it to look
for empty para elements in a DocBook document, set the match
condition to "para".
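As a minimal sketch of that customization, assuming XHTML-style input
with no namespace declared, here is the same stylesheet with only the
match condition changed. Everything else is carried over unchanged from
the stylesheet above.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <!-- Only copy p, pre, and h4 elements that have some text content. -->
  <xsl:template match="p|pre|h4">
    <xsl:if test="normalize-space(.)">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

  <!-- Copy everything else verbatim. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

One thing to watch for: normalize-space(.) only looks at text, so a p
element holding nothing but an img child counts as empty and is
dropped. If your documents have elements like that, a test such as
"normalize-space(.) or *" may be a better fit.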
Convert Mixed Content to Element Content
Technically, any element that has character content at all is
considered to have mixed content, but in more popular usage, "mixed
content" describes an element that has both character data and child
elements mixed together in the same element. For example,
<doc>
<title>In the Mix</title>
<para>Technically, this element has mixed content.</para>
<para>This one is <keyterm>really <emph>very</emph> mixed</keyterm>,
as you can see. </para>
</doc>
The presence of text nodes and elements as children of the same
element can present problems when processing and storing XML
documents. For example, if you were storing each element of the
document above in its own record in a database, the
first para element would be simple enough to store, but what
about the second one? Would you store the keyterm element in
its own record? How could its relationship to the phrases that precede
and follow it be tracked? How would you map the relationship of its
two text node children and one element child to database records or
objects?
The following stylesheet can help in these situations. Its second
template rule, like the second template rule in the stylesheet above,
copies everything not addressed by the first template rule. The first
template rule looks for non-whitespace text nodes that have element
nodes as siblings and wraps them in a textnode element. (You
might want to name them text or PCDATA elements
instead. If your source documents are XHTML, you could name the text
node wrapper elements span elements and give
them a class attribute set to something useful for your
application, thereby making the result valid XHTML.)
When a document is parsed without checking its DTD, a template rule
that wraps a textnode element around all text nodes
with element siblings would also wrap the carriage returns at the end
of each line (for example, the "text node" between the title
end-tag and the first para start-tag in the example document
above), which you probably don't want. The
[normalize-space(.)] predicate in the following
stylesheet's first template rule prevents this by limiting the match
to text nodes that still have something left after extraneous white
space is removed.
<!-- mixed2ec.xsl: convert mixed content to element content: wrap any non-blank
     text nodes that have element siblings in <textnode></textnode> tags.
-->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:template match="text()[normalize-space(.)][../*]">
    <textnode><xsl:value-of select="."/></textnode>
  </xsl:template>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
(Note for XSLT geeks: you could also use the preceding-sibling and
following-sibling axes to check for siblings of the text node,
but [../*], which checks whether the parent has any element
children, is more concise.) This stylesheet converts the sample
document above to the document shown here, in which no non-blank text
node has an element as a sibling:
<doc>
<title>In the Mix</title>
<para>Technically, this element has mixed content.</para>
<para><textnode>This one is </textnode><keyterm><textnode>really </textnode>
<emph>very</emph><textnode> mixed</textnode></keyterm><textnode>,
as you can see. </textnode></para>
</doc>
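The XHTML variation suggested earlier, with span elements as the
wrappers, might look something like the following sketch. It assumes
that the source document declares the XHTML namespace as its default
namespace, so the stylesheet does the same to put the new span elements
in that namespace. The class value "textnode" is only a placeholder;
pick whatever suits your application.

```xml
<!-- Sketch: wrap non-blank text nodes that have element siblings in
     XHTML span elements. The class value is a placeholder. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns="http://www.w3.org/1999/xhtml"
                version="1.0">

  <!-- The span literal result element picks up the default (XHTML)
       namespace declared on the xsl:stylesheet element. -->
  <xsl:template match="text()[normalize-space(.)][../*]">
    <span class="textnode"><xsl:value-of select="."/></span>
  </xsl:template>

  <!-- Copy everything else verbatim. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

If your source documents have no namespace declared, leave out the
xmlns="http://www.w3.org/1999/xhtml" declaration so that the span
elements end up in no namespace as well.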
Adding ID Values to Elements
Ask devotees of object-oriented development about the value of
object identity, and then just try to shut them up. When an XML
element has an attribute with a value that's guaranteed to be unique
within the document, it has identity, and this brings several
advantages. Like a record's key value in a database, it can provide a
hook for referring to it from elsewhere, which lets you associate new
data with it. If the attribute's name is "id" (which is a common
convention) it makes it easier to link to that element, especially if
it's within an XHTML document -- just add a pound sign and the ID
value to the document's URL to link to that point in the
document. Adding IDs to your elements is a simple, quick way to add
value to your data.
The following XSLT stylesheet copies a source document to a result
tree, taking advantage of XSLT 1.0's generate-id() function
along the way to create unique IDs for every element that doesn't
already have an id attribute. (Technically, there's a small
chance that one of the created ones will be the same as an existing
one, but it's a very small chance.) The first template rule copies all
elements, adding the id value if it's not already there, and
the second template rule copies all the other node types.
<!-- addids1.xsl: Add ID values to all elements that don't have them. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:template match="*">
    <xsl:copy>
      <xsl:if test="not(@id)">
        <xsl:attribute name="id">
          <xsl:value-of select="generate-id()"/>
        </xsl:attribute>
      </xsl:if>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="@*|processing-instruction()|comment()">
    <xsl:copy>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
When describing the mixed2ec.xsl stylesheet, I mentioned how it
could make loading narrative XML into a relational database
easier. Combining that stylesheet with this one would make it even
easier, because assigning an identifier to every element of a document
makes the pieces easier to track if you split up and rejoin the document.
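One way to combine the two, sketched here as a single-pass stylesheet
rather than as two steps in a pipeline, is to merge the template rules
of mixed2ec.xsl and addids1.xsl. Giving each textnode wrapper an ID
generated from the text node that it wraps is an assumption of this
sketch; running the two original stylesheets one after the other would
also give the wrappers IDs, because the second pass would treat them
like any other element.

```xml
<!-- Sketch: wrap mixed-content text nodes and add IDs in one pass. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <!-- From mixed2ec.xsl, plus an id generated from the text node. -->
  <xsl:template match="text()[normalize-space(.)][../*]">
    <textnode id="{generate-id()}"><xsl:value-of select="."/></textnode>
  </xsl:template>

  <!-- From addids1.xsl: copy each element, adding an id if needed. -->
  <xsl:template match="*">
    <xsl:copy>
      <xsl:if test="not(@id)">
        <xsl:attribute name="id">
          <xsl:value-of select="generate-id()"/>
        </xsl:attribute>
      </xsl:if>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Copy attributes, processing instructions, and comments as-is.
       Other text nodes fall through to the built-in rule, which
       copies their text. -->
  <xsl:template match="@*|processing-instruction()|comment()">
    <xsl:copy/>
  </xsl:template>

</xsl:stylesheet>
```

The actual ID strings that generate-id() produces vary from one XSLT
processor to another, so don't count on getting the same values if you
switch processors.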
The following variation on the stylesheet above is one that I use
often. In theory, it's nice to have ID values on every single element,
but it won't add much value to inline elements other than linking
elements, which can then hold their own as part of a two-way
link. Instead of adding IDs to every element, this next stylesheet
adds them to a specific list of elements: my most-used HTML block
elements (plus the a element), which will make them all valid
link destinations. Because the HTML documents I use with this
stylesheet may or may not have the XHTML namespace declared as the
default namespace, the first template rule's match attribute
lists each of these elements twice: once with a prefix bound to that
namespace and once with no prefix, for elements that aren't in any
namespace.
<!-- addids2.xsl: Add ID values to the elements listed in the
     first xsl:template element's match attribute -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:h="http://www.w3.org/1999/xhtml"
                version="1.0">

  <!-- Add id attributes to these elements -->
  <xsl:template match="p|pre|li|h1|h2|h3|h4|a|blockquote|h:p|h:pre|
                       h:li|h:h1|h:h2|h:h3|h:h4|h:a|h:blockquote">
    <xsl:copy>
      <xsl:if test="not(@id)">
        <xsl:attribute name="id">
          <xsl:value-of select="generate-id()"/>
        </xsl:attribute>
      </xsl:if>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
If you don't want to add IDs to every element in your document, you
can modify the match attribute in this stylesheet's first
template rule to list whatever elements you like.
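For example, here is a sketch of the same stylesheet with the
element-listing template rule adjusted for DocBook. It assumes DocBook
4.x documents, whose elements are in no namespace and use a plain id
attribute, and the particular elements listed are only an illustration,
not a recommended set.

```xml
<!-- Sketch: add ID values to a hypothetical list of DocBook elements. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <!-- Add id attributes to these elements -->
  <xsl:template match="para|title|listitem|programlisting">
    <xsl:copy>
      <xsl:if test="not(@id)">
        <xsl:attribute name="id">
          <xsl:value-of select="generate-id()"/>
        </xsl:attribute>
      </xsl:if>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Copy everything else verbatim. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Because the documents are assumed to have no namespace, the match
attribute needs only one entry per element.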
Next month we'll look at some stylesheets for indenting, for
cleaning up potential namespace headaches, and for converting document
encodings. And, if you have any short general-purpose stylesheets like
these that you're interested in sharing with XML.com readers, let me
know; maybe this can be a three-part series.