
The Cost of XML
by Edd Dumbill
December 15, 2004
In this week's column, I cover two debates that consider the
cost of XML. In the first discussion, the cost is that of
file size and processing overhead. In the second, it's actual
dollars charged for access to a web service. Also, watch out for
the special twilight zone moment as we find ourselves considering
CSV files as a serious option.
CSV Will Save Us All
The overhead or otherwise of using XML is once again a hot
topic, as we have seen in recent XML-DEV discussions. The strong
resurgence of the debate leads me to consider that XML might be
crossing the threshold of yet another order of magnitude in
adoption, causing a rush of reconsideration of old issues.
A newcomer to XML, Tedd Sperling, asked
how best to model tabular data in XML:
In everything I have read, it appears that every chunk of content
must be encapsulated by tags, such as:
<data>123.456</data>
But what about streams of data, like from a x/y recorder where one
may have thousands of pieces of data? Is there some way to wrap this
data into a series of comma delimited fields, such as:
<data>
123.456,
234.567,
...
</data>
A detailed
answer came from no less than Steve DeRose, who gets 10 SGML
Old Guard points for managing to mention SHORTREF by the second
paragraph of his reply. DeRose goes on to sum up why
tagging each data item is useful, even though it seems like
tremendous overhead.
1: If your data is "text files that are literally tens of
thousands of characters in length", that is small enough that the
overhead won't disturb most software running even on a cell
phone. If we were talking many millions or billions of *records*,
then this would be more of an issue (as it is for some users).
2: If you want the data formatted by CSS or XSL-FO,
or transformed by XSLT, or whatever, having all the data in one
syntax that the applications *already* know about is much easier
than rewriting the applications or working around them to add
some syntax (like commas) that they *don't* know about. You'll
never have to debug the XML parser you use to parse all those
"<data>" tags, but you will spend a lot of time if you try to
introduce a new syntax in your process.
3: Any text file that contains zillions of instances of a certain
string is necessarily very compressible...
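DeRose's second point can be made concrete with a small sketch. The Python below is an illustration only, not anything posted to XML-DEV: the element names `readings` and `data` and the two helper functions are assumptions. It reads the same three values from both layouts with the standard library's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Layout 1: every value in its own element (hypothetical element names).
TAGGED = """<readings>
  <data>123.456</data>
  <data>234.567</data>
  <data>345.678</data>
</readings>"""

# Layout 2: one element holding comma-delimited text, as in the question.
PACKED = """<readings>
  <data>
123.456,
234.567,
345.678,
  </data>
</readings>"""


def read_tagged(xml_text):
    """The XML parser does all the work; each value arrives as its own node."""
    root = ET.fromstring(xml_text)
    return [float(el.text) for el in root.iter("data")]


def read_packed(xml_text):
    """The XML parser only finds the blob; a second, hand-written step
    must split on commas and cope with whitespace and the trailing comma."""
    root = ET.fromstring(xml_text)
    blob = root.find("data").text
    return [float(field) for field in blob.split(",") if field.strip()]


tagged_values = read_tagged(TAGGED)
packed_values = read_packed(PACKED)
```

Both functions return the same list, but only the second contains format-specific logic that XSLT, CSS, or any other XML tool would know nothing about.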
DeRose then went on to give some figures that show, as we have
heard before, that XML compression is reasonably competitive with
compression of more basic delimited formats.
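A rough way to try the compression claim for yourself is sketched below. It does not reproduce DeRose's figures or his tools: the data is synthetic, the element names are made up, and zlib (DEFLATE) is simply the compressor that ships with Python.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
values = [f"{rng.uniform(100, 999):.3f}" for _ in range(10_000)]

# The same synthetic readings, once with a tag pair per value, once delimited.
xml_bytes = (
    "<readings>\n"
    + "".join(f"<data>{v}</data>\n" for v in values)
    + "</readings>\n"
).encode("ascii")
csv_bytes = "".join(f"{v},\n" for v in values).encode("ascii")

xml_packed = zlib.compress(xml_bytes, 9)
csv_packed = zlib.compress(csv_bytes, 9)

# How many times larger the XML is, before and after compression.
raw_ratio = len(xml_bytes) / len(csv_bytes)
packed_ratio = len(xml_packed) / len(csv_packed)

print(f"raw:        XML {len(xml_bytes)}, CSV {len(csv_bytes)}, ratio {raw_ratio:.2f}")
print(f"compressed: XML {len(xml_packed)}, CSV {len(csv_packed)}, ratio {packed_ratio:.2f}")
```

On data like this the repeated tags should shrink to a small fraction of their raw size, so the compressed ratio comes out below the raw one. The exact numbers depend on the data and the compressor.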
And then this seemingly well-trodden debate went just a little
wild. Enter
Stephen Beller, who repeated DeRose's experiment with a
spreadsheet. He saved data from Excel into both XML and CSV:
The XML file was 840MB, the CSV 34MB -- a 2,500% difference.
Compressed, the XML file was 2.5MB, the CSV [0.15]MB (150KB) --
a 1,670% difference.
Equally dramatic is the time it
took to uncompress and render the files as an Excel spreadsheet:
It took about 20 minutes with the XML file; the CSV took 1
minute -- a 2,000% difference.
Now, reasonable people will be willing to accept some
performance difference between XML and CSV for a spreadsheet
"filled with a single-digit number." It's not the world's most
realistic test, though, and the XML export is likely to contain much more
metadata than the CSV export. As Tim
Bray implied, Excel's "Save as XML" isn't quite the same as
having designed a schema for one's data.
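To see how much the choice of vocabulary matters, here is a sketch under stated assumptions. The first layout is only loosely modeled on the Row/Cell/Data nesting of Excel's XML spreadsheet format; a real export also carries workbook, style, and namespace declarations that are left out here. The second layout, with its `grid`, `r`, and `v` elements, is a hypothetical purpose-built schema. The grid size is arbitrary.

```python
ROWS, COLS = 100, 10  # a small grid standing in for a sheet of single digits


def spreadsheet_style(rows, cols):
    """Simplified imitation of a generic spreadsheet export: every cell
    carries wrapper elements and a type attribute."""
    cell = '<Cell><Data ss:Type="Number">1</Data></Cell>'
    row = "<Row>" + cell * cols + "</Row>\n"
    return "<Table>\n" + row * rows + "</Table>\n"


def designed_schema(rows, cols):
    """Hypothetical vocabulary designed for exactly this kind of data."""
    row = "<r>" + "<v>1</v>" * cols + "</r>\n"
    return "<grid>\n" + row * rows + "</grid>\n"


def csv_style(rows, cols):
    """Plain comma-separated rows with no metadata at all."""
    row = ",".join("1" for _ in range(cols)) + "\n"
    return row * rows


sizes = {
    "spreadsheet-style XML": len(spreadsheet_style(ROWS, COLS)),
    "designed XML": len(designed_schema(ROWS, COLS)),
    "CSV": len(csv_style(ROWS, COLS)),
}
for name, size in sizes.items():
    print(f"{name}: {size} characters")
```

The designed vocabulary is still larger than CSV, but it sits much closer to it than the generic export does, which is the gap a test based on "Save as XML" cannot show.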
In a further exchange, Bill
Kearney made the same point about Excel's XML format and
also offered the viewpoint that XML's self-documenting nature will
stand the test of time better than bald CSV. So just why, asked
Kearney, is Beller arguing for CSV?
Beller's response
indicates that he accepts the extra power of XML but bemoans the
"greater consumption of resources during transport and
parsing." And it gets worse, we are told:
And when you throw in all sorts of attributes and formatting
instructions, the consumption climbs even more. Hence, the XML
backlash. We'd [b]e wise, IMO, to recognize this trade-off and act
accordingly.
By now you are probably as agog as I am to find out what, after
six years, we really ought to be doing. I'll delay no longer:
There is an elegant solution, which involves using CSV data in
novel ways, but it's a proprietary process and this is not the
right venue to discuss it.
Elegance! CSV! Proprietary processes!
Bill Kearney certainly
wasn't joining the line to pay royalties. Besides, the
argument was getting very silly, he said. "What next, railing
against using Unicode?" Kearney's quip was just a little more
depressingly likely than it was funny. I certainly recall plenty of
U.S.-based developers vociferously rejecting the need for anything
other than ASCII. But what can you do when seemingly self-evident
truths are denied by blinkered zealots?
I'll leave the last
word on this strange debate to Mike Kay:
You have totally missed the point, Steve. The benefit of XML is
that we no longer have to reinvent clever ways of representing
complex data, and can exercise our innovative skills at higher
level of the system where it gives a greater return.
With that all said, I can't suppress a somewhat morbid desire
to see how XML's expressiveness can be packed into comma-separated
value files and still remain "elegant." Do CSVs dream of QNames?
What Price Web Services?
An interesting debate blew up this week in the weblog of open
source developer Alex Graveley. A programmer working with the
GNOME desktop platform, Graveley wanted to
create a system-tray notification program that worked with
eBay's web services to notify users of the status of their active eBay
bids.
Unfortunately, Graveley ran afoul of the current pricing and
registration requirements around eBay's service. It seems that
even if you sign up for eBay's developer program yourself, at a
cost of $100, the users of your software must also pay eBay to be
able to use your program. Graveley thought this counterproductive
and somewhat at odds with eBay's "viral" business model:
Of course the only option for most developers (open or
proprietary) given these restrictions is to screen-scrape,
completely defeating the stated purpose of the Developer
Program.
It's amazing that a large company, built largely around a viral
business model can be this hypocritical.
In response to Graveley's post, Ryan Thiessen reckoned that
the web service is priced so high in order to contain
usage:
I think eBay is just trying to give a monetary incentive for
developers to use as few API calls as possible to reduce the
load for eBay's servers, which is different tha[n] not allowing
open source applications because of any perceived quality
difference.
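For a notifier like Graveley's, the obvious way to keep the call count down is to cache each answer and refuse to ask again too soon. The sketch below shows that idea only. It makes no real eBay API calls: `CachedBidStatus`, `fake_fetch`, and the five-minute interval are all hypothetical, and the clock is injected so the behavior can be shown without waiting.

```python
import time


class CachedBidStatus:
    """Serve repeated status requests from a cache so that the remote
    service is asked at most once per `min_interval` seconds per item."""

    def __init__(self, fetch, min_interval=300.0, clock=time.monotonic):
        self._fetch = fetch              # hypothetical callable: item_id -> status
        self._min_interval = min_interval
        self._clock = clock
        self._cache = {}                 # item_id -> (fetched_at, status)
        self.remote_calls = 0            # how many times the service was hit

    def status(self, item_id):
        now = self._clock()
        cached = self._cache.get(item_id)
        if cached is not None and now - cached[0] < self._min_interval:
            return cached[1]             # fresh enough: no API call
        self.remote_calls += 1
        result = self._fetch(item_id)
        self._cache[item_id] = (now, result)
        return result


def fake_fetch(item_id):
    """Stand-in for a real web service call."""
    return {"item": item_id, "high_bid": 42.0}


fake_now = [0.0]
poller = CachedBidStatus(fake_fetch, min_interval=300.0, clock=lambda: fake_now[0])

poller.status("A1")      # first request goes to the "service"
poller.status("A1")      # answered from the cache
fake_now[0] = 301.0
poller.status("A1")      # cache entry has expired, so a second call is made
```

Three status checks cost two remote calls here. A longer interval trades freshness for fewer calls, which is the behavior a per-call fee would encourage.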
EBay's web services evangelist, Jeff McManus, joined
in the conversation, agreeing with Thiessen's
diagnosis. McManus points to an entry
on his own weblog, where the topic of open source programs
written against the eBay API is discussed more fully. McManus' position
is that he doesn't see why developers should object to paying, as
he perceives eBay as being akin to a telephone operator. If you
want to use the service, in whatever way, you pay the bill.
In a subsequent post, Graveley reiterates that it's not just a matter of the $100 fee for the developer, but that all users of the software will face a similar fee.
It isn't a one-time fee. It's a per-user $100 fee,
plus a multi-phase disconnected registration process that cannot
be automated. How much of a percentage drop in purchases could
you expect if Ebay charged $100 for a user's first purchase, no
matter what?
Really, what do you have to lose by opening up the read-only
methods for all to use for free? I mean, it isn't as if people
aren't screen-scraping already.
Amazon and Google have similar, though free, read-only APIs, and
it's not beyond imagination that for simple things such as
checking on bid status, eBay might introduce a similar service.
The likely reason this hasn't happened so far is the
load it would place on eBay's servers.
This discussion highlights the fact that companies must
be wary when introducing public web services. But we also know
from cases such as Amazon's that public web services can be
fantastic success stories. I sense there really is an opportunity
for eBay here if they can come up with a cheaper solution for
offering web service access.
Births, Deaths, and Marriages
This week's announcements are taken from the RDF Interest list, for lack of
XML-specific ones:
- Resources of a Resource
A somewhat strange XML/RDF vocabulary for describing
"resources," which is what I thought RDF did anyway...
- New RDF Grammar-Based N3 Parser and Test Suite
Sean Palmer has developed a second implementation of Tim
Berners-Lee's N3 notation for RDF.
Scrapings
A change in Microsoft's XML team, but it looks like Software
AG's given up on XML ... Apple's
chance to influence XQuery ... lest
we forget Engelbart ...
a perverse brain teaser for the holidays ...
148 messages to XML-DEV last week, 26% XQuery bickering ...
Sean McGrath neatly sums up XML vs RDF.