
Converting XML to RDF
by Bob DuCharme
September 01, 2004
Last month
we looked at the REST interface to Amazon Web Services (AWS), and how
an f parameter in a URL calling this interface can point to
an XSLT stylesheet. If you set it to "xml" instead of pointing it at a
stylesheet, Amazon returns data in formats that conform to either the
"lite" or "heavy" DTDs (and corresponding schemas) included with their
SDK; if you do point it at a stylesheet, Amazon applies that stylesheet to the
data on its server before returning the result to you.
In that column, I promised to show how to use this
feature to pull RDF from the Amazon servers. I had written a
stylesheet called aws2rdf.xsl, but the more I thought about it the
more I realized that such a stylesheet needed very few dependencies on
the Amazon Web Services DTDs, and that it could convert a wide variety
of XML to RDF. So, I revised and renamed it to xml2rdf.xsl, and we'll
look at it here.
RDF and Data-Oriented XML
RDF/XML sometimes looks strange, but it doesn't need
to. RDF-friendly
XML adds a few things to otherwise typical-looking XML so that an RDF
parser can treat all of its information as RDF triples.
This is not too difficult as long as your XML has no
text nodes with elements as siblings. For example, <p>this p
element has <emph>three</emph> text nodes and
<emph>two</emph> emph elements</p>. XML
developers often call this "mixed content" because the p
element's contents are a mix of text nodes and elements. The official definition of mixed content, however, covers any element type that may
contain character data at all, <p>even a p element like
this</p>.
An element that has only character data and isn't
"mixed" in the more popular sense can often be converted to RDF/XML
without much trouble. Many applications use these elements, along
with container elements that have only element content and serve to group them, to represent
transactions and database records — what people often call
"data-oriented" XML, despite the fact that all XML is data (or rather,
data objects). The kind of XML used to describe narrative content for
publication in one medium or another — what people call "document-oriented" XML, despite all XML being in documents — is more likely to have elements and text nodes
as siblings of each other (as in the first p example in the
preceding paragraph), and is not a good candidate for automated
conversion to RDF.
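One quick way to check whether a source document has mixed content in the popular sense is to look for elements that have both child elements and non-whitespace text nodes as children. The following stylesheet is a minimal sketch of such a check; it is not part of xml2rdf.xsl, and the "mixed: " label is just an arbitrary choice for the report.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Report the name of every element that has at least one child
       element and at least one child text node containing something
       other than whitespace. -->
  <xsl:template match="/">
    <xsl:for-each select="//*[* and text()[normalize-space()]]">
      <xsl:text>mixed: </xsl:text>
      <xsl:value-of select="name()"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Run against the first p example above, this reports the p element and nothing else, because the emph elements contain only text. A document for which it prints nothing is a more promising candidate for automated conversion.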
The data being returned by Amazon Web Services, which
obviously came from relational databases somewhere, is a fine
candidate for conversion to RDF. Besides, Amazon is in the business of
selling physical objects, and its site provides metadata about those
objects. Having that data in RDF-friendly XML makes it easier to link
this metadata with other metadata, thereby extending the potential
reach of the Semantic Web.
A Somewhat Generic XML to RDF Converter
When processing XML documents that are good candidates
for conversion to RDF/XML, a stylesheet can handle certain tasks
generically. Other tasks require modifications to the conversion
stylesheet to prepare it for the specific input that's coming. The
generic parts of the stylesheet below, which come after the comment
beginning with the words "End of template rules addressing," automate
the advice given in the XML.com article Make Your XML RDF-Friendly. Rule numbers mentioned below refer to the
numbered pieces of advice in that article.
The first half of the stylesheet has the parts that
require editing to prepare the stylesheet for your particular source
documents. Everything before the "End of template rules" comment reflects my customizations to tailor the
stylesheet for documents returned by Amazon Web Services:
As Rule Number 1 says, make sure that
every element comes from a specific namespace, so the namespace must
be declared. I clipped the filename off the URIs used for the U.S./Japan
versions of the DTDs and schemas to come up with
http://xml.amazon.com/schemas3/ as an Amazon Web Services namespace
URI.
The result of the transformation will be
metadata about a single resource, and the "resourceURL" variable is
where the stylesheet stores the URL of that resource. While there are
several variations on the basic URI that take you to the web page
describing a particular book on Amazon, the developer's kit describes
a format of http://www.amazon.com/exec/obidos/ASIN/ followed by the
ASIN number, so the stylesheet below constructs this URL by appending
the ASIN number (using an XPath expression to pull it out of the XML)
to that URI string.
The generic code later in the stylesheet
uses the namespace prefix for the described resource's properties in
several different places, so storing it in a variable lets us leave
the generic code alone. This should be the prefix declared with the
namespace URI added to the xsl:stylesheet start-tag — in
this case, "aws."
You won't necessarily want every element
in your source document passed along to your RDF version, so add the
names of the ones to suppress to the stylesheet's first template
rule.
Similarly, certain container elements in
the source won't add anything to the RDF version, so adding their
names to the second template rule tells the stylesheet to pass along
their contents without their enclosing tags. (As we'll see, certain
containers are very useful, so we'll keep them.)
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:aws="http://xml.amazon.com/schemas3/">
  <!-- Convert XML to RDF that all describes one resource. Template
       rules after "End of template rules" comment are generic; those
       before are for customizing treatment of source XML
       (e.g. deleting elements). -->
  <!-- URL of the resource being described. -->
  <xsl:variable name="resourceURL">
    <xsl:text>http://www.amazon.com/exec/obidos/ASIN/</xsl:text>
    <xsl:value-of select="/ProductInfo/Details/Asin"/>
  </xsl:variable>
  <!-- Namespace prefix for predicates. Needs a corresponding xmlns
       declaration in the xsl:stylesheet start-tag above. If your set
       of predicates comes from more than one namespace, then this
       stylesheet is too simple for your needs. -->
  <xsl:variable name="nsPrefix">aws</xsl:variable>
  <!-- Elements to suppress. priority attribute necessary
       because of template that adds rdf:parseType below: its
       predicate gives it a default priority of 0.5, which would
       otherwise beat a plain element-name pattern (priority 0). -->
  <xsl:template priority="1" match="Request|TotalResults|TotalPages"/>
  <!-- Just pass along contents without tags. -->
  <xsl:template match="ProductInfo|Details">
    <xsl:apply-templates/>
  </xsl:template>
  <!-- ========================================================
       End of template rules addressing specific element types.
       Remaining template rules are generic xml2rdf template rules.
       ======================================================== -->
  <xsl:template match="/">
    <rdf:RDF>
      <rdf:Description rdf:about="{$resourceURL}">
        <xsl:apply-templates/>
      </rdf:Description>
    </rdf:RDF>
  </xsl:template>
  <!-- Elements with URLs as content: convert them to store
       their value in rdf:resource attribute of empty element -->
  <xsl:template match="*[starts-with(.,'http://') or starts-with(.,'urn:')]">
    <xsl:element name="{$nsPrefix}:{name()}">
      <xsl:attribute name="rdf:resource">
        <xsl:value-of select="."/>
      </xsl:attribute>
    </xsl:element>
  </xsl:template>
  <!-- Container elements: if the element has children and an element parent
       (i.e. it isn't the root element) and it has no attributes, add
       rdf:parseType = "Resource". -->
  <xsl:template match="*[* and ../../* and not(@*)]">
    <xsl:element name="{$nsPrefix}:{name()}">
      <xsl:attribute name="rdf:parseType">Resource</xsl:attribute>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>
  <!-- Copy remaining elements, putting them in a namespace. -->
  <xsl:template match="*">
    <xsl:element name="{$nsPrefix}:{name()}">
      <xsl:apply-templates select="@*|node()"/>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>
The generic part of the stylesheet has four template rules:
The first template rule in the generic
part (the third template rule in the stylesheet) wraps the contents in
an rdf:RDF element and identifies the resource being
described.
The next template rule implements
RDF-friendliness Rule Number 4, converting any elements whose contents
consist of a URI (or rather, any elements whose contents begin with
"http://" or "urn:") into empty elements with the URI stored in
an rdf:resource attribute.
The stylesheet's second-to-last template
rule follows the advice given near the end of RDF-friendliness Rule 6
by adding an rdf:parseType attribute with a value of
"Resource" to container elements that aren't the root element of the
document. This way, these containers won't throw off the striping
pattern of nested predicate/object pairs that an RDF processor expects
to find in an RDF/XML document.
The stylesheet's last template rule
copies any elements not covered by the other template rules to the
result tree with the namespace prefix from the nsPrefix
variable added onto their names.
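To make the effect of these four rules concrete, here is a small hand-made input document in the shape the stylesheet expects. It is a sketch for illustration, not actual Amazon output: the ASIN is a placeholder, and the contents of Request, the url attribute on Details, and the ProductName, Authors, Author, and ImageUrlSmall elements are all assumptions about what a response might contain.

```xml
<!-- Hypothetical input; names other than ProductInfo, Request,
     Details, and Asin are assumed for this example. -->
<ProductInfo>
  <Request>
    <Args>
      <Arg name="locale" value="us"/>
    </Args>
  </Request>
  <Details url="http://www.amazon.com/exec/obidos/ASIN/0000000000">
    <Asin>0000000000</Asin>
    <ProductName>An Example Title</ProductName>
    <Authors>
      <Author>A. N. Author</Author>
    </Authors>
    <ImageUrlSmall>http://example.com/images/small.jpg</ImageUrlSmall>
  </Details>
</ProductInfo>
```

Working through the template rules by hand, the stylesheet above should turn that input into something like the following (whitespace tidied, XML declaration omitted):

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:aws="http://xml.amazon.com/schemas3/">
  <rdf:Description rdf:about="http://www.amazon.com/exec/obidos/ASIN/0000000000">
    <aws:Asin>0000000000</aws:Asin>
    <aws:ProductName>An Example Title</aws:ProductName>
    <aws:Authors rdf:parseType="Resource">
      <aws:Author>A. N. Author</aws:Author>
    </aws:Authors>
    <aws:ImageUrlSmall rdf:resource="http://example.com/images/small.jpg"/>
  </rdf:Description>
</rdf:RDF>
```

Request disappears because of the suppressing rule, ProductInfo and Details are unwrapped, Authors picks up rdf:parseType="Resource", and the element whose content begins with "http://" becomes an empty element with an rdf:resource attribute. Note that Details in this sample carries an attribute, so the container rule, which only matches elements without attributes, does not compete with the pass-through rule for it.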
I tested this with both "lite" and "heavy" XML returned
by Amazon Web Services for various books, CDs, authors, and bands, and
the ARP2 RDF
parser had no problem with any of the results. (For authors and bands,
though, the RDF isn't quite semantically correct, because all of the
triples created by the stylesheet have the same subject, so it makes
more sense to use this for Amazon pages that describe a single work
such as a book or CD.) For example, with the stylesheet stored at http://www.snee.com/xsl/xml2rdf.xsl,
the following REST URL (with carriage returns deleted and a working
developer ID substituted for "dev-ID-here") retrieves kosher RDF
metadata (saved version here; when viewing
with a browser, do a View Source to see the RDF/XML) about the boxed
set of Robert Quine's live recordings of the Velvet Underground:
http://xml.amazon.com/onca/xml3?locale=us&t=bobducharmeA
&dev-t=dev-ID-here&AsinSearch=B00005Q567&mode=music
&type=heavy&f=http://www.snee.com/xsl/xml2rdf.xsl
With the appropriate revisions to the customizable first half of the stylesheet above, there's a lot of regularly structured XML out there that could be converted to RDF. The great thing about using it on XML returned by Amazon Web Services is that we can execute the XSLT transformation on Amazon's servers, so a single REST URL can retrieve
RDF directly from Amazon. This is the power that Amazon has put into
our hands by letting us use its server-side XSLT processor with
its database.
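As a sketch of what revising the customizable half for a different source might look like, here is that half reworked for a hypothetical catalog document holding a single book. Everything specific in it is invented for illustration: the ex prefix and its namespace URI, the example.com base URL, and the catalog, book, isbn, and meta element names. The generic template rules are not repeated; they would follow unchanged where the comment indicates.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ex="http://example.com/ns/books#">

  <!-- URL of the resource being described (hypothetical base URL). -->
  <xsl:variable name="resourceURL">
    <xsl:text>http://example.com/books/</xsl:text>
    <xsl:value-of select="/catalog/book/isbn"/>
  </xsl:variable>

  <!-- Must match the prefix declared in the start-tag above. -->
  <xsl:variable name="nsPrefix">ex</xsl:variable>

  <!-- Elements to suppress. -->
  <xsl:template priority="1" match="meta"/>

  <!-- Wrappers to pass along without tags. The explicit priority is
       an addition in this sketch: it keeps this rule ahead of the
       generic container rule even if book has no attributes. -->
  <xsl:template priority="1" match="catalog|book">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Generic xml2rdf template rules from the listing above go
       here, unchanged. -->

</xsl:stylesheet>
```

Giving the pass-through rule an explicit priority is a design choice rather than something the original stylesheet does. A pattern with a predicate, such as the one on the generic container rule, has a default priority of 0.5, while a plain element name has a default priority of 0, so an attribute-less wrapper would otherwise be treated as a container and kept.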
(For more on mapping XML to RDF using XSLT, see Michael
Sperberg-McQueen and Eric Miller's Extreme 2004 paper On mapping from colloquial XML to RDF using XSLT.)