Standard Data Vocabularies Unquestionably Harmful
by Walter Perry
May 29, 2002
At the onset of XML four long years ago, I commenced a jeremiad against
Standard Data Vocabularies (SDVs), to little effect. Almost immediately after
the light bulb moment -- you mean, I can get all the cool benefits of the Web in
HTML and create my own tags? I can call the price of my crullers
<PricePerCruller>, right beside <PricePerDonutHole> in my
menu? -- new users realized the problem: a browser knows how to display a
heading marked as <h1> bigger and more prominently than a lowlier
<h3>. Yet there are no standard display expectations or semantics
for the XML tags which users themselves create.
That there is no specific display for <Cruller> and, especially,
not as distinct from <DonutHole> has been readily understood to
demonstrate the separation of data structure expressed in XML from its display,
which requires the application of styling to accommodate the fixed expectations
of the browser. What has not been so readily accepted is that there should not
be a standard expectation for how a data element, as identified by its markup,
should be processed by programs doing something other than simple
display. To a new user of XML bidding for a contract, the newfound advantage of
calling a <DonutHole> a <DonutHole> disappears if the
pastry procurement protocol expects the bids to be
<PricePerPastry DonutType="hole" SpecializedType="brown sugar rolled"
PricingQuantityStandardRange="50K/day - 150K/day" />
Now, clearly, a mom-and-pop shop wanting to leverage the Web into a supplier
contract to General Motors is perfectly happy to label its wares however GM
expects. Over the past three years GM's expectations, and the expectations of
the dominant players in more than a thousand vertical markets, have been
codified as Standard Data Vocabularies. Where the dynamics of a vertical
industry are like the automobile sector, this makes some sense and has resulted
in marketplaces narrowly specific to vertical industries growing up around SDVs,
as with Covisint (for automobile parts) or
NewView (for steel).
There have also emerged a few "horizontal" data vocabularies, intended for
expressing business communication in more general terms. One of these is the eXtensible Business Reporting Language (XBRL),
about which more below. Most recently, governments and governmental
organizations have begun to suggest and eventually mandate particular SDVs for
required filings, a development which expands what troubles me about these
vocabularies by an order of magnitude.
At the just-concluded XML Europe 2002
conference in Barcelona I delivered a presentation
which explained how an Enron-like organization could disseminate its particular
spin and its own, perhaps not generally accepted, interpretations in financial
reporting.
Adhering to standard vocabularies has recently meant all too often that an
item properly labeled and conforming to an expected form is naively accepted as
being actually what it purports to be. My talk was scheduled in the Legal and
Government track of the conference, which makes sense given the topic, but it
was not what an audience which came for news on the latest governmental
initiatives in standard vocabularies might have wanted, and I found myself with
a room of fourteen people. That number includes the session chair, me, and the
two previous speakers in this track: an official of the European Patent Office
and one from the Japan Patent Office now at the World Intellectual Property
Organization, where they are working toward promulgating an SDV of 500 elements
intended to express patent filings to 180+ patent offices worldwide. These
patent office officials immediately understood the import of my argument to
their work, and by question time the session had become a discussion of how
firmly rooted in the nature of SDVs themselves is the problem of misstatement,
of misdirection of naive interpretation, and the potential for fraud.
I have argued for years that, on the basis of their mechanism for elaborating
semantics, SDVs are inherently unreliable for the transmission or repository of
information. They become geometrically less reliable as the number of types or roles
of either the sources or the consumers of that information increases, ending at a
nightmarish worst case of a third-order diminution of the reliability of
information. And what is the means by which SDVs convey meaning? By simple
assertion against the expected semantic interpretations hard-coded into a
process consuming the data in question.
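To make the mechanism concrete, here is a minimal sketch of a consuming process of this kind. The vocabulary (Report, Revenue, Liability) and the interpretations attached to it are hypothetical, invented only for illustration; no real SDV is being quoted. The point is that the tag name alone selects the interpretation, so whatever the sender asserts is what the process believes.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: each tag name is bound to one fixed interpretation.
# These names and rules are assumptions for illustration, not a real SDV.
HARD_CODED_SEMANTICS = {
    "Revenue": lambda value: ("income", value),
    "Liability": lambda value: ("debt", -value),
}

def consume(xml_text):
    """Total the amounts in a report, trusting each element's label as asserted."""
    totals = {}
    for element in ET.fromstring(xml_text):
        handler = HARD_CODED_SEMANTICS.get(element.tag)
        if handler is None:
            continue  # tags outside the vocabulary are silently ignored
        category, amount = handler(float(element.text))
        totals[category] = totals.get(category, 0.0) + amount
    return totals

# A sender who labels a 900 loan as Revenue is taken at its word:
# nothing in the process can distinguish it from the genuine 100.
SAMPLE_REPORT = (
    "<Report><Revenue>100</Revenue><Revenue>900</Revenue>"
    "<Liability>50</Liability></Report>"
)
```

Calling consume(SAMPLE_REPORT) books both Revenue elements as income. The process has no access to how either figure was produced, only to the label under which it arrived.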
One recurring theme at the Barcelona conference was the need to break down
"silos of information". Clearly new uses for data and the realization of
synergies between previously unrelated functions require that information be
released from a single vertical path of use within narrowly-defined areas of
expertise. The uncritically accepted assumption is that this laudable goal
should or can be reached through bisecting the silos of expertise with a
horizontal common denominator which offers access to different narrow areas of
expertise through a single shared vocabulary. Conceptually, that solution
misunderstands what expertise is based on and how it operates.
Expert analysis or other processing depends at least as much on knowing what
to process, where and how to find it, what form to expect those inputs to
exhibit if they are valid, and what form of output most precisely conveys the
effects of the expert process, as it does on the detail of how those inputs are
manipulated into those outputs. In short, the bulk of expertise is in
understanding the detail of connections between data and the processes which
produced it or must consume it. It is precisely these expert connections which
standard data vocabularies are intended to sever.
Patent filing
In the case of the SDV for worldwide patent filings, the presenters at the
Barcelona conference lamented that, once what had seemed the hard work of
designing the vocabulary was finished, they were surprised and frustrated by how
much salesmanship and evangelism was required to encourage patent filers to use
the vocabulary and governments to mandate its use.
In my opinion that will change quickly as filers realize that power to shape
the outcome of a patent process has been shifted to them by the SDV. By design,
the patenting process will begin with the filer's own assertions conveyed in the
SDV. Filers can learn to effect particular outcomes by these assertions (or
perhaps by unexpected combinations of assertions), which they submit to trigger
hard-coded semantics from the patenting process. In effect, the SDV hands the
general public a patenting process API, capable of significant remote imperative
invocation of particular outcomes, precisely because the semantic outcome of the
process is, by design, conditioned on the submission of specific items from the
standard data vocabulary.
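A deliberately small sketch may show why this amounts to an API. Everything in it is assumed for illustration: the element names (ExpeditedReview, ComplexClaims) and the routing outcomes are invented and do not come from any actual patent filing vocabulary. The designers of this imagined process assumed that only simple filings would request expedited review, so that rule is checked first.

```python
import xml.etree.ElementTree as ET

def route_filing(filing_xml):
    """Pick a process outcome from the tags a filer chose to assert.

    Hypothetical rules: the tag names and outcomes are invented for this sketch.
    """
    asserted = {element.tag for element in ET.fromstring(filing_xml)}
    if "ExpeditedReview" in asserted:
        # Written on the assumption that complex filings never ask for this.
        return "fast-track"
    if "ComplexClaims" in asserted:
        return "extended-examination"
    return "standard-examination"

# A combination the designers never expected: both assertions at once.
UNEXPECTED_FILING = "<Filing><ComplexClaims/><ExpeditedReview/></Filing>"
```

Here route_filing(UNEXPECTED_FILING) returns the fast-track outcome, because the first matching rule wins. The filer has invoked that outcome remotely, simply by choosing which items of the vocabulary to submit together.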
Security measures which generally protect remote invocation interfaces cannot
be used to screen out submitters where the interface is intended, even mandated
for use by the general public. General Motors might simply refuse bids from a
particular submitter, but governmental organizations face steeper barriers to
discriminating against individuals using a mandated vocabulary for an official
communication.
In fact, precisely identifying the submitter, which would be the basis of
discouragement in many security systems, is in this case a chief goal of a
submitter seeking to be granted an individual right of entitlement by
governmental authority. Submitters who want to game such a system have a better
perspective on how it works than do the designers of its standard data
vocabulary. Particular combinations of components from the SDV which might seem
illogical to designers of that vocabulary may be found to result in process
outcomes which benefit the submitters in ways never anticipated by
designers.
Remember that what is at stake is control of intellectual property and the
lucrative fruits of its use, obtainable by asserting effective incantations from
the standard vocabulary. The gamesters have every incentive, while the guardians
of the system can at best run to patch their process code whenever they discover
it has yielded an unanticipated result. The vulnerability itself can never be
removed so long as the principal design premise of the system is open access to
the process code for anyone who uses the SDV to convey established semantics.
Worrisome as this is, it gets worse. The patent filing SDV only standardizes
what is already the case: patent application in current practice begins from a
submitter's formulation of its own claims. At present there is a human examiner
in the patent office to restate those claims (if they seem to have some initial
merit) into the terms on which they will be evaluated in the patent application
process. Rather than the simple mechanical mapping of the semantics of the SDV
to the execution of various processes, there is a complex expertise embodied in
a human being which transforms a variety of incoming vocabularies into a
functional one internal to the expert domain of the patent office. The proposed
patent filing SDV will replace that expertise at the door of the patent office
with a single fixed mapping of vocabulary items to the specific semantics of
process outcomes.
XBRL
The eXtensible Business Reporting Language (XBRL) carries the consequences of
such mechanically mapped semantics to another order of magnitude and effectively
dumbs down the professional expertise of accountancy to the generalities of an
SDV. Rather than the starting point, as in the patent filing process, the SDV of
XBRL is the midpoint and interface between complex expert processes which
acquire and prepare data and other processes which report and otherwise render
that data through professional expertise.
XBRL bisects the closed silo of accountancy with a general-purpose common
denominator SDV, which by design lacks the specificity required to proceed from
input to output with the precision and nuance which both sides require. The
rationale of the design is to open the silo so that other expert data collection
processes may submit their product to a generalized repository, out of which
reports and renderings in many areas of expertise might be
generated. Unfortunately, this design ignores an inevitable outcome:
generalizing data between the specific demands of domain expertise in collection
and corresponding domain expertise in reporting will introduce vagueness,
ambiguity, doubt, and error, wasting the expertise of the collection effort and
reducing the reporting to meaninglessness or worse.
Again, the stakes are considerable. The United Kingdom has mandated XBRL for
corporate tax reporting beginning in 2006, and the XBRL consortium is actively
lobbying for other such government support. Again, however, the users of XBRL,
and their purposes in using it, may not be what the designers of the SDV
expect. Within the silo of accounting, the reasonable assumption is that the
data is prepared by the same or equivalent experts to those who report it. The
very rationale of the SDV is to break down that assumption. By design, data
expressed in the SDV will be reported, rendered, and otherwise manipulated by
those specifically inexpert in, and quite possibly unaware of, the nature of its
collection. Data integrity in such circumstances is simply unachievable.
This methodology itself strikes at the heart of domain expertise, which
demands intimate knowledge of the details of the data which defines the
field. Instead we have an open invitation -- indeed a government mandate -- to
gamers of the system to concatenate those specific items of the SDV which will
produce desired outcomes in reports ranging across taxation, securities
regulation, investment analysis, and other high profit opportunities for
fraud. What is not at all clear is that this gaming of the outcomes is in fact
fraud, for the SDV itself severs the connection between input and output which
would allow a reasonable inference of intent from the result.
It didn't and doesn't have to be this way. Instead of the static mapping of
process semantics to particular items of the SDV, we can have processes which
demonstrate specific expertise in their instantiation of data for their own
unique purposes. They exhibit, that is, the crucial expertise of understanding
their own data needs.
That expertise permits a process to operate upon data from a variety of
sources, in each case available in a form particular to the expertise that
created it and without regard to the nature or needs of the process -- or
multiple very different processes -- which might consume or manipulate it. Each
process produces only one expert rendition or other process outcome. Yet taken
together with the variety of similarly expert processes which supply their input
data, the group of such processes more than meets the ostensible goal of SDVs in
opening the silo to the sharing of data on a many-to-many basis among different
expert domains. That goal is not achieved without the effort of a strict
discipline in designing process intercommunication and interaction, which I
shall describe in a subsequent article, "The Natural Process Model of XML".
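In the meantime, one possible reading of a process which instantiates data for its own purposes can be sketched as follows. This is not the model promised for the later article, only an illustration under stated assumptions: the consuming process wants a table of unit prices in dollars, and it carries its own knowledge of two source forms. The Menu form reuses the element names from the cruller example above; the Offers form, with its item and cents attributes, is hypothetical.

```python
import xml.etree.ElementTree as ET

def from_menu(root):
    """Read a shop's own menu vocabulary (tag names from the cruller example)."""
    names = {"PricePerCruller": "cruller", "PricePerDonutHole": "donut hole"}
    return {names[el.tag]: float(el.text) for el in root if el.tag in names}

def from_offers(root):
    """Read a hypothetical wholesaler form that quotes prices in cents."""
    return {el.get("item"): int(el.get("cents")) / 100 for el in root.iter("Offer")}

# The consuming process, not the sender, decides which source forms it understands.
READERS = {"Menu": from_menu, "Offers": from_offers}

def instantiate_prices(xml_text):
    """Build this process's own price table, or refuse input it cannot read."""
    root = ET.fromstring(xml_text)
    reader = READERS.get(root.tag)
    if reader is None:
        raise ValueError(f"no expertise for source form <{root.tag}>")
    return reader(root)
```

Each source keeps the form natural to whoever created it, and the consumer does the work of turning that form into the data it needs. A source form the process does not understand is refused rather than guessed at.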