Translating XML Documents with xml:tm
by Andrzej Zydron
January 07, 2004
A Russian translation of this article is available at xmlhack.ru
Introduction
Sooner or later someone will want to have your XML document
translated into another language. In fact, XML
documents are much easier to translate than other
electronic documents because they separate form
from content and conform to a rigorous standard
with a defined syntax. There are various approaches to
improving the translation process.
Machine Translation
Language technology has had a mixed history over
the past 40 years. The early promises of cheap
automated translation soon led to disillusionment and
effectively to a marginal role for this technology in
providing "gisting" information for certain foreign
language texts. There have been significant advances
in language technology since then, and we all benefit
from these on a day-to-day basis when we use spelling
and grammar checkers and complex search
engines. Nevertheless, we are still a long way away
from usable machine translation based on free-format
text, although there has been some success for very
tightly controlled text in very narrow domains.
Translation memory
In order to reduce translation costs in an
environment where documentation can change frequently
to reflect improvements and innovation in a product
lifecycle, the best answer to date has been the use of
translation memory. In comparison with machine
translation, this is a relatively primitive approach to
language technology, but it can bring considerable
benefits.
Translation memory works by aligning previously
translated text in a target language with the source
language. This is accomplished either by the use of a
manual tool or automatically by using a controlled
environment for the translation process. Alignment is
usually done at a sentence level. This affords the
best level of usable granularity. The aligned source
and target text is held in a repository. The next time
the document is updated the repository is searched in
order to locate any text that has not changed. Where
such a sentence is identified the source language text
can be replaced with the target language text. This
low-tech method has nevertheless provided benefits in
terms of translation consistency and reduced
costs.
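The lookup described above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a naive punctuation-based sentence splitter, an in-memory dictionary as the repository, and source and target texts that already align one sentence to one sentence. The function names (`split_sentences`, `build_memory`, `pretranslate`) and the sample sentences are hypothetical and are not taken from any particular translation memory product.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def build_memory(source_text, target_text):
    """Align source and target sentence by sentence (assumes equal counts)."""
    source = split_sentences(source_text)
    target = split_sentences(target_text)
    if len(source) != len(target):
        raise ValueError("sentence counts differ; manual alignment needed")
    return dict(zip(source, target))

def pretranslate(updated_text, memory):
    """Reuse stored translations for unchanged sentences; flag the rest."""
    result = []
    for sentence in split_sentences(updated_text):
        if sentence in memory:
            result.append(("match", memory[sentence]))
        else:
            result.append(("new", sentence))
    return result

# Repository built from a previously translated document.
memory = build_memory(
    "Press the red button. Wait for the light.",
    "Appuyez sur le bouton rouge. Attendez le voyant.",
)

# The updated source document contains one new sentence.
draft = pretranslate(
    "Press the red button. Wait ten seconds. Wait for the light.", memory
)
for status, text in draft:
    print(status, text)
```

Only the sentence marked "new" needs to go to a translator. The matched sentences are exact matches on the source text alone, so they carry no information about where they appear in the document.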
The main weakness of this approach is that
how a piece of text is translated into a given target
language can depend on its context. When text is
pulled in from a translation memory repository it does
not possess any of the context within which it existed
in the original document. Because there is no
contextual information regarding the target language
text, a translator is still required to proofread the
matched text and adapt it if required. The
proofreading process, although less expensive than
straightforward translation, still consumes time and money.
Translating XML Documents
The approach to translating XML documents to date has been
to extract the translatable text and attributes into an
external, typically proprietary format, where translation
memory matches are performed on the data. On completion of
the translation process the newly translated sentences are
written to traditional non-standard translation memory
repositories. XML in these sorts of environments is treated
merely as yet another encoding format.
Special mention must be made here of some important
XML based standards concerning translation technology:
- The OASIS XLIFF (XML Localisation
  Interchange File Format - http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=xliff)
  specification, which provides an XML framework for
  interchanging translatable text from any native
  format. More about XLIFF later.
- LISA (the Localisation Industry
  Standards Association) provides many
  XML based initiatives under the
  auspices of its OSCAR working
  committee (Open Standards for
  Container/Content Allowing Re-use -
  http://www.lisa.org/oscar),
  which include, amongst others, TMX (
  http://www.lisa.org/tmx/tmx.htm). TMX
  allows for the interchange of
  translation memories using XML.
All of these excellent standards address the
interchange of information using XML rather than the
actual translation of XML documents.
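As a rough illustration of what interchanging a translation memory through TMX can look like, the following Python sketch writes one aligned sentence pair as a small TMX-style document. The element and attribute names (`tmx`, `header`, `body`, `tu`, `tuv`, `seg`, `xml:lang`) follow the TMX 1.4 structure as best understood here and should be checked against the specification before use. The function name `memory_to_tmx` and all header values are placeholders chosen for this example.

```python
import xml.etree.ElementTree as ET

# xml:lang lives in the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def memory_to_tmx(pairs, src_lang, tgt_lang):
    """Build a TMX-style tree from (source, target) sentence pairs."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "example-sketch",   # placeholder tool name
        "creationtoolversion": "0.1",       # placeholder version
        "segtype": "sentence",
        "o-tmf": "none",                    # placeholder original format
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")      # one translation unit per pair
        for lang, text in ((src_lang, source), (tgt_lang, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return tmx

tmx = memory_to_tmx(
    [("Press the red button.", "Appuyez sur le bouton rouge.")], "en", "fr"
)
tmx_text = ET.tostring(tmx, encoding="unicode")
print(tmx_text)
```

The file describes the memory itself, a set of aligned segments, and says nothing about the XML document the segments came from.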
xml:tm
xml:tm is a new approach to the problem of
translation for XML documents. It is an XML namespace
based syntax that uses the power of XML to embed
additional information within the XML document itself.
At the core of xml:tm is the concept of “text
memory”. Text memory is made up of two
components:
- Author Memory
- Translation Memory
Author Memory
An XML namespace is used to map a text memory view
onto a document. This process is called
segmentation. The text memory view works at the
level of the sentence, which is called the text unit.
Each individual xml:tm text unit is allocated a unique
identifier. This unique identifier is immutable for
the life of the document. As a document goes through
its life cycle the existing identifiers are maintained
and new ones are allocated as required. This aspect
of text memory is called author memory. It can be
used to build author memory systems that
simplify authoring and improve its
consistency.
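The way identifiers are maintained across revisions can be sketched as follows. This is a minimal illustration, not the xml:tm implementation: it assumes a naive sentence splitter, uses Python's standard `difflib` to decide which sentences are unchanged, and invents a simple `uN` identifier format. The function name `revise` and the sample text are hypothetical.

```python
import re
from difflib import SequenceMatcher

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def revise(previous_units, new_text, next_number):
    """Segment new_text, keeping identifiers of unchanged sentences.

    previous_units is a list of (identifier, sentence) pairs from the
    last revision. Returns the new list and the next free number.
    """
    old_sentences = [sentence for _, sentence in previous_units]
    new_sentences = split_sentences(new_text)
    units = [None] * len(new_sentences)

    # Carry identifiers over for sentences found in both revisions.
    matcher = SequenceMatcher(a=old_sentences, b=new_sentences, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            old_id = previous_units[block.a + k][0]
            units[block.b + k] = (old_id, new_sentences[block.b + k])

    # Allocate fresh identifiers for everything else; never reuse a number.
    for i, sentence in enumerate(new_sentences):
        if units[i] is None:
            units[i] = (f"u{next_number}", sentence)
            next_number += 1
    return units, next_number

v1, n = revise([], "Open the lid. Insert the disc.", 1)
v2, n = revise(v1, "Open the lid. Check the tray. Insert the disc.", n)
for unit_id, sentence in v2:
    print(unit_id, sentence)
```

The inserted sentence receives a new identifier, while the two original sentences keep theirs even though one of them has moved.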
The following diagram shows how the tm
namespace maps onto an existing XML document:
Figure 1: xml:tm mapping diagram
In this diagram "te" stands for "text element" (an
XML element that contains text) and "tu" stands for "text unit" (a
single sentence or stand alone piece of text).
The following is an example of part of an xml:tm
document. The xml:tm elements are highlighted in red to show how
xml:tm maps onto an existing XML document:
<?xml version="1.0" encoding="UTF-8" ?>
<office:document-content
    xmlns:text="http://openoffice.org/2000/text"
    xmlns:tm="urn:xmlintl-tm-tags"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <tm:tm>
    ...
    <text:p text:style-name="Text body">
      <tm:te id="e1" tuval="2">
        <tm:tu id="u1.1">
          Xml:tm is a revolutionary technology for dealing with the
          problems of translation memory for XML documents by using
          XML techniques to embed memory directly into the XML
          documents themselves.
        </tm:tu>
        <tm:tu id="u1.2">
          It makes extensive use of XML namespace.
        </tm:tu>
      </tm:te>
    </text:p>
    <text:p text:style-name="Text body">
      <tm:te id="e2">
        <tm:tu id="u2.1">
          The “tm” stands for “text memory”.
        </tm:tu>
        <tm:tu id="u2.2">
          There are two aspects to text memory:
        </tm:tu>
      </tm:te>
    </text:p>
    <text:ordered-list text:continue-numbering="false" text:style-name="L1">
      <text:list-item>
        <text:p text:style-name="P3">
          <tm:te id="e3">
            <tm:tu id="u3.1">Author memory</tm:tu>
          </tm:te>
        </text:p>
      </text:list-item>
      <text:list-item>
        <text:p text:style-name="P3">
          <tm:te id="e4">
            <tm:tu id="u4.1">Translation memory</tm:tu>
          </tm:te>
        </text:p>
      </text:list-item>
And the composed document:
Figure 2: Composed Document
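Because the text memory view is ordinary namespaced XML, the text units in a document like the listing above can be read with any namespace-aware parser. The following Python sketch collects each text unit by identifier. The namespace URI `urn:xmlintl-tm-tags` and the `tm:te` and `tm:tu` elements come from the listing. The host elements `doc`, `para`, and `item`, and the function name `text_units`, are made up for this example so that the sample is a complete, well-formed document.

```python
import xml.etree.ElementTree as ET

# Prefix map for searching; the URI is the one used in the listing.
TM_NS = {"tm": "urn:xmlintl-tm-tags"}

# Small, complete sample; doc/para/item are hypothetical host elements.
SAMPLE = """<doc xmlns:tm="urn:xmlintl-tm-tags">
  <para>
    <tm:te id="e1">
      <tm:tu id="u1.1">
        The "tm" stands for "text memory".
      </tm:tu>
      <tm:tu id="u1.2">There are two aspects to text memory:</tm:tu>
    </tm:te>
  </para>
  <item><tm:te id="e2"><tm:tu id="u2.1">Author memory</tm:tu></tm:te></item>
</doc>"""

def text_units(xml_text):
    """Return {text unit id: whitespace-normalised text} for a document."""
    root = ET.fromstring(xml_text)
    units = {}
    for tu in root.iterfind(".//tm:tu", TM_NS):
        # itertext() also picks up text inside any inline child elements.
        units[tu.get("id")] = " ".join("".join(tu.itertext()).split())
    return units

for unit_id, text in text_units(SAMPLE).items():
    print(unit_id, text)
```

A tool that ignores the tm namespace still sees the original document structure, since the xml:tm elements only wrap existing text.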