§ An XML annoyance

A couple of days ago, I looked into XML as a possibility for an exchange format for a program I was working on. Using an off-the-shelf XML parser wasn't an option, so a relatively simple format was needed. XML seemed like a relatively good fit due to its hierarchical tag-based nature, and if I was going to use a simple text-based format, using a ubiquitous one seemed to be a good idea. I've acquired a bit of a distaste for XML over the years, primarily from seeing people convert 10MB of binary data to 100MB of XML for parsing in an interpreted language. For a simple file with a few data items, though, it makes a lot of sense.

The first set of warning bells went off when I pulled down the XML 1.0 standard from the W3C and discovered it was 35 pages long. W3C standards don't seem to be organized well in general, since they delve immediately into details without giving a good overview first. Well, I could deal with that -- I've survived ISO standard documents before, and these aren't that bad. Much of the standard deals with document type declarations (DTDs) and validation, which could be omitted.

That is, until I discovered the horrors of the internal DTD subset.

The internal DTD subset allows you to embed the DTD directly into the document. That's fine, and since it's wrapped in <!DOCTYPE>, in theory it should be easy to skip. Well, it would be, were it not for two little problems called entity declarations and attribute value defaults:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE data [
    <!ENTITY foo "The quick brown fox quickly jumped over the lazy dog's back.">
    <!ENTITY bar "&foo;">
    <!ATTLIST text
              mode  CDATA   "preformatted">
]>
<data>
    <text>
        &bar;
    </text>
</data>

If you load this XML into a web browser like Firefox or Internet Explorer, you'll see the effects of the DTD, which are to introduce a mode attribute into the text tag and to expand the &bar; entity reference. These two features have a number of annoying consequences:

Suddenly XML didn't seem like a simple tag-based format anymore. I guess there's always CSV or INI....
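
As a rough way to check the behavior of the sample document above outside a browser, here is a minimal sketch using Python's standard library, whose ElementTree parser is built on Expat. The choice of Python and of this particular parser is mine, not something the project in question used, and other parsers may behave differently (some refuse DTDs outright).

```python
import xml.etree.ElementTree as ET

# The sample document from above, unchanged.
DOC = b"""<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE data [
    <!ENTITY foo "The quick brown fox quickly jumped over the lazy dog's back.">
    <!ENTITY bar "&foo;">
    <!ATTLIST text
              mode  CDATA   "preformatted">
]>
<data>
    <text>
        &bar;
    </text>
</data>
"""

root = ET.fromstring(DOC)
text = root.find("text")

# The mode attribute never appears in the <text> start tag; it is
# supplied by the ATTLIST default in the internal subset.
print(text.attrib)

# &bar; is replaced by the value of &foo;, which is itself declared
# in the internal subset.
print(text.text.strip())
```

Neither the attribute nor the sentence appears anywhere in the element content as written, so a parser that skips the DOCTYPE block would hand the application different data.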

Unfortunately, it seems that the internal DTD subset has led to some compatibility problems in XML. The idea behind XML is that well-formedness is both strictly defined and strictly enforced in order to prevent the format from decaying. TinyXML was once recommended to me, and it's one of the parsers that doesn't parse the internal DTD subset, which means it doesn't really parse XML. SOAP apparently forbids DTDs as well, and both MSXML 6.0 and .NET 2.0 prohibit them by default. The result is that there's now an effectively undocumented subset of XML. Ugh.
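
For illustration, the "reject any DTD" policy can be sketched on top of Expat through Python's xml.parsers.expat module. The names parse_without_dtd and DtdForbidden are hypothetical, made up for this example; this is not how TinyXML, MSXML or .NET implement their restrictions, only one way to get a similar effect.

```python
import xml.parsers.expat


class DtdForbidden(ValueError):
    """Raised when a document contains a DOCTYPE declaration (hypothetical name)."""


def parse_without_dtd(data):
    """Return the element names in document order, refusing any DOCTYPE."""
    names = []
    parser = xml.parsers.expat.ParserCreate()

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Bail out as soon as the DOCTYPE is seen, before any entity
        # or attribute-default declarations can take effect.
        raise DtdForbidden("DOCTYPE %r is not allowed" % name)

    def on_start(name, attrs):
        names.append(name)

    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = on_start
    parser.Parse(data, True)
    return names


print(parse_without_dtd(b"<data><text/></data>"))
```

A document with no DOCTYPE parses normally, while one that declares entities in an internal subset is refused before any of its content is delivered. That is exactly the undocumented subset described above: valid XML that such a reader will not accept.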

I really wonder how much benefit there was in including user-defined entities and attribute defaults in the XML standard. It seems to me that if these two features had been omitted, there could have been a clear delineation in the standard between DTD/validation and data, and the core non-validating part could have been made much simpler.

Comments

Comments posted:


Years have passed, so maybe it's changed, but when I tried to read the XML standard I got extremely annoyed because it referred to things *before* it had defined what they were. It didn't even tell you that it had not defined them yet, let alone say vaguely what they were.

So I kept coming across terms and wondering how I had forgotten their meaning already. I'd re-read and re-read but I still couldn't see where they'd defined this thing that suddenly they were talking about very specific, esoteric aspects of when I didn't have the first clue what the **** the thing was at all. Then I realised that thing was not properly defined until much later in the text.

I had similar problems trying to read a book on XML, perhaps because it was written in line with the standard?

I gave up trying to learn all there is about XML in depth. Instead I've just learned about the bits I've needed when I've needed them. I use XML for stuff like config files, and I've used tiny snippets of XPath in .Net, but that's it.

What XML does is so bloody simple, conceptually, yet the people behind it have somehow managed to make it very complex, partly by making it do simple things in a needlessly complex way and partly by doing such a bad job of explaining it all.

I'd say XML with a DTD is the exception, not the rule, FWIW. I'd also say that in most simple XML formats the DTD itself is more likely to be wrong than the data written by any application as the way the DTD is defined can be so orthogonal to the way the data is parsed by both apps and humans.

Obviously there are cases where a DTD makes a lot of sense, and you only need one expert to properly define one for everyone else, but I think those are the exceptions not the rules.

It's a bit like expecting every programmer to code using formal methods (http://en.wikipedia.org/wiki/Formal_meth..) which, while fantastic for some people and some problems, are just another thing to get wrong for most people most of the time.

(That is, someone can only mathematically derive a program or prove it is correct if they're good enough at maths to do the derivation/proof correctly. If they get the maths wrong then they'll prove nothing and probably produce broken code. Same problem with DTDs, IMO.)

Leo Davidson (link) - 15 05 09 - 20:44


The internal DTD subset is indeed seldom used and I haven't ever seen it in the wild. Problem is, it's still a required part of the standard. Part of the reason I'm annoyed is that I really like the attempts of the committee to prevent XML from being subsetted ad-hoc and thus would try to avoid implementing or using a parser that wasn't compliant in this manner.

On the other hand, part of me is also tempted to slap a sliding window compressor on top of user-defined entities just to be evil....

Phaeron - 15 05 09 - 21:01


YAML (http://www.yaml.org/) is a popular alternative to XML. It, too, is a standardized, hierarchical text-based format for structured data. Many projects find it preferable to XML for things like configuration files that can't be represented using the simpler INI syntax due to hierarchical data requirements.

That said, it's still a fairly complex format (http://yaml.org/spec/1.1/).

Jon Parise (link) - 15 05 09 - 22:59


YAML looks like a simple idea that's grown too big through accretion -- if it had just stayed a way to represent node trees using Python-style indentation, it would have been fine. They've added so many formatting and compression options, though, that it's fairly complex and it sort of resembles a sendmail config file. The kind/type diagram makes my head hurt. The relative paucity of YAML tools would also make me prefer mini-XML = XML sans DTDs instead, because at least in that case effective compatibility with XML tools would be very high.

Phaeron - 16 05 09 - 00:28


Forget about DTDs and move to xml schema :D

Mastermnd - 16 05 09 - 10:08


Maybe give JSON a spin? I found it less "bloated" than XML and easy to work with. Depends what you wanna do, though.

igro - 16 05 09 - 13:25


Honestly, I had no idea you could define character entities inline or that the DTD is a requirement of XML parsers! Given the inclusion of "annoyance" in your title, I'm supposing you haven't given up on XML? Why are you writing a new parser, is it from scratch? With W3C recommendations almost never being implemented "to spec" (see SVG!), is it much of a surprise that major parsers and protocols have ignored this aspect? :) However, if you are planning on going all the way, then I have confidence you'll come up with decent defenses against things like the billion laughs attack.

Neil C. Obremski (link) - 16 05 09 - 14:25


@Mastermind:
Unfortunately, you don't get a choice in this matter, because the DTD is the only schema that is part of the XML spec, and the problem is what you can receive. It's perfectly OK to not use a schema or DTD when writing XML, and I don't know of anyone who does.

@igro:
Looks interesting. I like the format, but it's still less supported than XML.

@Neil C. Obremski:
There are many rules in play, but one of them is that the size of the parser used to read the config file may not exceed the size of the rest of the application. Another is that I don't see a good reason to cheat on implementation here, because unlike many of W3C's standards, it is actually feasible to implement all of XML. You certainly couldn't say that of XHTML, for instance.

I haven't done anything one way or another since I've been working on other stuff. I'm tempted just to try writing an XML parser from scratch just to see how bad it is -- reading the spec is probably worse than actually writing the code, since I've done both tag parsing and recursive descent parsing before. In terms of effort to goal, though, JSON looks like a much faster way to go.

Phaeron - 16 05 09 - 14:51


The annotated standard http://www.xml.com/axml/testaxml.htm is worth reading, as it explains some of the history and the forward references.

As an alternative to going the SOAP/XMPP route of disallowing DTDs, it's not unheard of to have a lightweight embedded parser without DTD support, and fall-back to an external full parser to pre-process anything with a DTD to expand entities.

Pete Kirkham (link) - 17 05 09 - 05:34


XML tries to do everything, and that's not actually a good thing for most people.

XML badly needs to take the MPEG approach, and define specific subsets of the API. Give us XML without custom entities, without DTDs (allow specifying them for validators, but ignore them), mandate UTF-8, and so on--turn off the stuff that's relatively less useful and relatively hard or heavyweight (eg. character conversion) to support. People would then have much less need to define their own subsets.

> exceed the size of the rest of the application

Well, in theory, a major benefit of using a common format is to use a common library, so the relative size of the parser is zero.

Of course, non-Microsoft libraries are rarely actually shared in Windows, and even in Linux, the hassle of library compatibility often strongly encourages statically linking some libraries if you're distributing binaries (though libexpat probably isn't a problem).

Glenn Maynard - 17 05 09 - 18:30


The MAME emulator has a command line switch (-listxml) that dumps a database in XML format of the games that it supports. They include a DTD in the dump. It appears to be mostly for validation purposes, but it also defines some default values that don't exist in the XML until it gets processed by the parser. Here's the XML file for MAME 0.131 if you're curious...(WARNING it's over 34MB)... http://files.3feetunder.com/mame0131.xml

Tankadin (link) - 18 05 09 - 01:43


Timeless classic:
http://www.schnada.de/grapt/eriknaggum-x..

Sam - 25 05 09 - 19:12
