An XML annoyance
A couple of days ago, I looked into XML as a possibility for an exchange format for a program I was working on. Using an off-the-shelf XML parser wasn't an option, so a relatively simple format was needed. XML seemed like a relatively good fit due to its hierarchical tag-based nature, and if I was going to use a simple text-based format, using a ubiquitous one seemed to be a good idea. I've acquired a bit of a distaste for XML over the years, primarily from seeing people convert 10MB of binary data to 100MB of XML for parsing in an interpreted language. For a simple file with a few data items, though, it makes a lot of sense.
The first set of warning bells went off when I pulled down the XML 1.0 standard from the W3C and discovered it was 35 pages long. W3C standards don't seem to be organized well in general, since they delve immediately into details without giving a good overview first. Well, I could deal with that -- I've survived ISO standard documents before, and these aren't that bad. Much of the standard deals with document type declarations (DTDs) and validation, which could be omitted.
That is, until I discovered the horrors of the internal DTD subset.
The internal DTD subset allows you to embed the DTD directly into the document. That's fine, and since it's wrapped in <!DOCTYPE>, in theory it should be easily skippable. Well, it would be, were it not for two little problems called character entities and attribute value defaults:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE data [
<!ENTITY foo "The quick brown fox quickly jumped over the lazy dog's back.">
<!ENTITY bar "&foo;">
<!ATTLIST text mode CDATA "preformatted">
]>
<data><text>&bar;</text></data>
If you load this XML into a web browser like Firefox or Internet Explorer, you'll see the effects of the DTD, which is to introduce a mode attribute into the text tag, and to expand the &bar; character entity. These two features have a number of annoying consequences:
- All XML parsers, including non-validating ones, must parse the internal DTD subset. This means that an alternate tag parsing path must be introduced since the DTD doesn't follow the same attribute=value format that the rest of XML uses.
- The internal DTD subset cannot be ignored, since it can change the interpretation of the data.
- Character entities can now expand to arbitrary lengths. This prohibits in-place conversion and requires dynamic memory allocation. Even more fun is the possibility of nested expansion, which leads to the billion laughs attack.
- XML parsers must both parse elements and interpret them, due to the need to inject attribute defaults.
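Both effects are easy to observe from code. This is a minimal sketch, assuming Python's standard xml.etree.ElementTree module (which sits on top of expat, a non-validating parser) and a document along the lines of the example; the <data><text>&bar;</text></data> body is my assumption for illustration.

```python
import xml.etree.ElementTree as ET

# Example document with an internal DTD subset. The <data>/<text> body
# is an assumption made for this sketch.
DOC = b"""<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE data [
<!ENTITY foo "The quick brown fox quickly jumped over the lazy dog's back.">
<!ENTITY bar "&foo;">
<!ATTLIST text mode CDATA "preformatted">
]>
<data><text>&bar;</text></data>"""

root = ET.fromstring(DOC)
text = root.find("text")

# The start tag carries no attributes; "mode" is injected from the
# ATTLIST default declared in the internal subset.
print(text.attrib)

# &bar; expands to &foo;, which in turn expands to the full sentence.
print(text.text)
```

The parser never validates anything here, yet it still has to read the DTD to produce the right element content and attributes.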
Suddenly XML didn't seem like a simple tag-based format anymore. I guess there's always CSV or INI....
Unfortunately, it seems that this has led to some compatibility problems in XML. The idea behind XML is that well-formedness is both strictly defined and strictly enforced in order to prevent the format from decaying. TinyXML was once recommended to me, and it's one of the parsers that doesn't parse the internal DTD subset, which means it doesn't really parse XML. SOAP apparently forbids DTDs as well, and both MSXML 6.0 and .NET 2.0 prohibit them by default. The result is that there's now an effectively undocumented subset of XML. Ugh.
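For reference, refusing DTDs outright takes very little code. The following is only a sketch of the "no DTD" policy using Python's xml.parsers.expat bindings, not how any of the parsers named above actually implement it; DTDForbidden and parse_without_dtd are hypothetical names invented for the example.

```python
import xml.parsers.expat


class DTDForbidden(ValueError):
    """Hypothetical error type for documents that carry a DOCTYPE."""


def parse_without_dtd(data):
    """Return element names in document order; reject any DOCTYPE."""
    names = []
    parser = xml.parsers.expat.ParserCreate()

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Called when the parser reaches <!DOCTYPE ...>, before any
        # entity or attribute-list declaration is processed.
        raise DTDForbidden("DOCTYPE %r is not allowed" % name)

    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(data, True)  # True marks the final chunk of input
    return names


print(parse_without_dtd(b"<a><b/></a>"))
try:
    parse_without_dtd(b'<!DOCTYPE a [<!ENTITY x "y">]><a>&x;</a>')
except DTDForbidden as exc:
    print("rejected:", exc)
```

This closes off the billion laughs problem, but it also means the parser rejects documents that the standard considers perfectly well-formed.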
I really wonder how much benefit there was in including user-defined character entities and attribute defaults in the XML standard. It seems to me that if these two features had been omitted, there could have been a clear delineation in the standard between DTD/validation and data, and the core non-validating part could have been made much simpler.
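To make the simplicity argument concrete: with only the five predefined entities and numeric character references, every reference is at least as long as the text it stands for, so decoding can run in place with no allocation. A rough sketch, assuming UTF-8 data; decode_in_place is a made-up helper, not something taken from the standard or any library.

```python
# The five entities that XML predefines; no DTD is needed for these.
PREDEFINED = {b"amp": b"&", b"lt": b"<", b"gt": b">",
              b"quot": b'"', b"apos": b"'"}


def decode_in_place(buf):
    """Decode references inside a bytearray without resizing it.

    Returns the decoded length; bytes past that point are leftovers.
    """
    read = write = 0
    n = len(buf)
    while read < n:
        if buf[read] != ord("&"):
            buf[write] = buf[read]
            read += 1
            write += 1
            continue
        end = buf.find(b";", read)
        if end < 0:
            raise ValueError("unterminated reference")
        name = bytes(buf[read + 1:end])
        if name.startswith(b"#x"):
            out = chr(int(name[2:], 16)).encode("utf-8")
        elif name.startswith(b"#"):
            out = chr(int(name[1:])).encode("utf-8")
        elif name in PREDEFINED:
            out = PREDEFINED[name]
        else:
            # This is exactly where user-defined entities would break
            # the scheme: their replacement text can be any length.
            raise ValueError("undefined entity: %r" % name)
        # The replacement is never longer than the reference itself,
        # so the write cursor cannot overtake the read cursor.
        buf[write:write + len(out)] = out
        read = end + 1
        write += len(out)
    return write


buf = bytearray(b"a &lt; b &amp;&amp; c &#x41;&#66;")
length = decode_in_place(buf)
print(bytes(buf[:length]))
```

Once a DTD can define &bar; as an arbitrary string, the "never longer" guarantee is gone and the decoder needs a growable output buffer.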
Years have passed, so maybe it's changed, but when I tried to read the XML standard I got extremely annoyed because it referred to things *before* it had defined what they were. It didn't even tell you that it had not defined them yet, let alone say vaguely what they were.
So I kept coming across terms and wondering how I had forgotten their meaning already. I'd re-read and re-read but I still couldn't see where they'd defined this thing that suddenly they were talking about very specific, esoteric aspects of when I didn't have the first clue what the **** the thing was at all. Then I realised that thing was not properly defined until much later in the text.
I had similar problems trying to read a book on XML, perhaps because it was written in line with the standard?
I gave up trying to learn all there is about XML in depth. Instead I've just learned about the bits I've needed when I've needed them. I use XML for stuff like config files, and I've used tiny snippets of XPath in .Net, but that's it.
What XML does is so bloody simple, conceptually, yet the people behind it have somehow managed to make it very complex, partly by making it do simple things in a needlessly complex way and partly by doing such a bad job of explaining it all.
I'd say XML with a DTD is the exception, not the rule, FWIW. I'd also say that in most simple XML formats the DTD itself is more likely to be wrong than the data written by any application as the way the DTD is defined can be so orthogonal to the way the data is parsed by both apps and humans.
Obviously there are cases where a DTD makes a lot of sense, and you only need one expert to properly define one for everyone else, but I think those are the exceptions not the rules.
It's a bit like expecting every programmer to code using formal methods (http://en.wikipedia.org/wiki/Formal_meth..) which, while fantastic for some people and some problems, are just another thing to get wrong for most people most of the time.
(That is, someone can only mathematically derive a program or prove it is correct if they're good enough at maths to do the derivation/proof correctly. If they get the maths wrong then they'll prove nothing and probably produce broken code. Same problem with DTDs, IMO.)
Leo Davidson (link) - 15 05 09 - 20:44
The internal DTD subset is indeed seldom used and I haven't ever seen it in the wild. Problem is, it's still a required part of the standard. Part of the reason I'm annoyed is that I really like the attempts of the committee to prevent XML from being subsetted ad-hoc and thus would try to avoid implementing or using a parser that wasn't compliant in this manner.
On the other hand, part of me is also tempted to slap a sliding window compressor on top of user-defined entities just to be evil....
Phaeron - 15 05 09 - 21:01
YAML is a popular alternative to XML. It, too, is a standardized, hierarchical text-based format for structured data. Many projects find it preferable to XML for things like configuration files that can't be represented using the simpler INI syntax due to hierarchical data requirements.
That said, it's still a fairly complex format (http://yaml.org/spec/1.1/).
Jon Parise (link) - 15 05 09 - 22:59
YAML looks like a simple idea that's grown too big through accretion -- if it had just stayed a way to represent node trees using Python-style indentation, it would have been fine. They've added so many formatting and compression options, though, that it's fairly complex and it sort of resembles a sendmail config file. The kind/type diagram makes my head hurt. The relative paucity of YAML tools would also make me prefer mini-XML = XML sans DTDs instead, because at least in that case effective compatibility with XML tools would be very high.
Phaeron - 16 05 09 - 00:28
Forget about DTDs and move to xml schema :D
Mastermnd - 16 05 09 - 10:08
Maybe give JSON a spin? I found it less "bloated" than XML and easy to work with. Depends what you wanna do, though.
igro - 16 05 09 - 13:25
Honestly, I had no idea you could define character entities inline or that the DTD is a requirement of XML parsers! Given the inclusion of "annoyance" in your title, I'm supposing you haven't given up on XML? Why are you writing a new parser, is it from scratch? With W3C recommendations almost never being implemented "to spec" (see SVG!), is it much of a surprise that major parsers and protocols have ignored this aspect? :) However, if you are planning on going all the way, then I have confidence you'll come up with decent defenses against things like the billion laughs attack.
Neil C. Obremski (link) - 16 05 09 - 14:25
@Mastermnd:
Unfortunately, you don't get a choice in this matter, because the DTD is the only schema that is part of the XML spec, and the problem is what you can receive. It's perfectly OK to not use a schema or DTD when writing XML, and I don't know of anyone who does.
@igro:
Looks interesting. I like the format, but it's still less supported than XML.
@Neil C. Obremski:
There are many rules in play, but one of them is that the size of the parser used to read the config file may not exceed the size of the rest of the application. Another is that I don't see a good reason to cheat on implementation here, because unlike many of W3C's standards, it is actually feasible to implement all of XML. You certainly couldn't say that of XHTML, for instance.
I haven't done anything one way or another since I've been working on other stuff. I'm tempted just to try writing an XML parser from scratch just to see how bad it is -- reading the spec is probably worse than actually writing the code, since I've done both tag parsing and recursive descent parsing before. In terms of effort to goal, though, JSON looks like a much faster way to go.
Phaeron - 16 05 09 - 14:51
The annotated standard (http://www.xml.com/axml/testaxml.htm) is worth reading, as it explains some of the history and the forward references.
As an alternative to going the SOAP/XMPP route of disallowing DTDs, it's not unheard of to have a lightweight embedded parser without DTD support, and fall-back to an external full parser to pre-process anything with a DTD to expand entities.
Pete Kirkham (link) - 17 05 09 - 05:34
XML tries to do everything, and that's not actually a good thing for most people.
XML badly needs to take the MPEG approach, and define specific subsets of the API. Give us XML without custom entities, without DTDs (allow specifying them for validators, but ignore them), mandate UTF-8, and so on--turn off the stuff that's relatively less useful and relatively hard or heavyweight (eg. character conversion) to support. People would then have much less need to define their own subsets.
> exceed the size of the rest of the application
Well, in theory, a major benefit of using a common format is to use a common library, so the relative size of the parser is zero.
Of course, non-Microsoft libraries are rarely actually shared in Windows, and even in Linux, the hassle of library compatibility often strongly encourages statically linking some libraries if you're distributing binaries (though libexpat probably isn't a problem).
Glenn Maynard - 17 05 09 - 18:30
The MAME emulator has a command line switch (-listxml) that dumps a database in XML format of the games that it supports. They include a DTD in the dump. It appears to be mostly for validation purposes, but it also defines some default values that don't exist in the XML until it gets processed by the parser. Here's the XML file for MAME 0.131 if you're curious...(WARNING it's over 34MB)... http://files.3feetunder.com/mame0131.xml
Tankadin (link) - 18 05 09 - 01:43
Sam - 25 05 09 - 19:12