A few days ago, in a similar mood to the one that caused me to start an Atari emulator, I decided to write my own XML parser.
I've had an increasing interest in language parsers ever since I got to the point of parsing algebraic infix expressions and simple C-like languages. I've written about XML annoyances before, but I don't actually have much occasion to work with XML at the code level, because:
- I work mainly in C++, and picking C++ to do your XML processing makes most people question your sanity.
- I work mainly with video, and picking XML to do your video processing also makes everyone question your sanity.
And yet, one of the advantages of XML is that it keeps people from creating their own interchange formats, which are typically far more broken. Since I occasionally do need to import and export little bits of metadata, I wanted to see just how much would be involved in having a little XML parser on the side. It wouldn't need to be terribly fast, as we're talking about a couple of kilobytes of data at most being parsed on a fast CPU, but it would need to be small to be usable. And I just wanted to see if I could do it. So I sat down with the XML 1.0 spec, and started writing a parser.
I have to say, my opinion of XML has dropped several notches in the process (er, lower than it already was), and I'm convinced that we need a major revision or a replacement. I got as far as having a working non-validating, internal-subset-only parser that passed all of the applicable tests in the XML test suite, but after writing more than 2000 lines of code just for the parser and not having even started the DOM yet, I had already run into the following:
- The internal DTD subset. This is ugly, due to the goofy non-standard syntax that just borders on being unparseable with a recursive descent parser. About half of the code I had to write was dedicated just to parsing the internal DTD subset. And remember, the internal DTD subset has to be parsed, because it can affect a document in two ways: entity expansion, and attribute value normalization.
- Entities. This was much more of a mess than I had imagined. It turns out that not only can you define entities out of order, but you can include elements in entities, which turns into a recursive parsing architecture that could probably be adapted as a C preprocessor. And to add insult to injury, parameter entities (%foo;) have to be supported since they can be used in the internal DTD subset. At least they didn't allow elements to span entity boundaries.
- Character rules. The rules for what code points constitute a character or a name in XML are crazy, with the base Char production covering six ranges, the one for name start characters covering sixteen, and the one for name following characters covering twenty-two. And no, I refuse to use an array of size 0x110000 to handle this.
- CDATA sections. Not only do I have to check for the starting <![CDATA sequence, which partially overlaps with the prefixes for processing instructions (PIs) and comments, but I also have to scan every text section for ]]> just so I can ban it, even though I don't see why this is necessary. XML doesn't ban > in text spans.
- Other apparently nonsensical rules. For instance, everything in XML for the most part is case sensitive, including the XML declaration which must be lowercase "<?xml"... except when checking processing instructions, where all case forms including <?xML and <?XMl must be rejected. WTF?
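For the character rules in particular, one way to avoid a 0x110000-entry array is a small sorted table of ranges searched with a binary search. The following is only a sketch, not the code from the parser described here: the names `CodeRange`, `InRanges`, `IsXmlChar`, and `IsXmlNameStartChar` are made up for illustration, and the tables are assumed to match the Char and NameStartChar productions of the XML 1.0 spec (Fifth Edition).

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>

// Inclusive range of code points (hypothetical helper type).
struct CodeRange {
    char32_t lo;
    char32_t hi;
};

// Char production: six ranges, sorted by lower bound.
static const CodeRange kCharRanges[] = {
    {0x09, 0x09}, {0x0A, 0x0A}, {0x0D, 0x0D},
    {0x20, 0xD7FF}, {0xE000, 0xFFFD}, {0x10000, 0x10FFFF},
};

// NameStartChar production: sixteen ranges, sorted by lower bound.
static const CodeRange kNameStartRanges[] = {
    {0x3A, 0x3A},     {0x41, 0x5A},     {0x5F, 0x5F},     {0x61, 0x7A},
    {0xC0, 0xD6},     {0xD8, 0xF6},     {0xF8, 0x2FF},    {0x370, 0x37D},
    {0x37F, 0x1FFF},  {0x200C, 0x200D}, {0x2070, 0x218F}, {0x2C00, 0x2FEF},
    {0x3001, 0xD7FF}, {0xF900, 0xFDCF}, {0xFDF0, 0xFFFD}, {0x10000, 0xEFFFF},
};

// Binary search over a sorted, non-overlapping range table.
template <std::size_t N>
bool InRanges(const CodeRange (&ranges)[N], char32_t c) {
    // Find the first range whose upper bound is not below c.
    const CodeRange* it = std::lower_bound(
        std::begin(ranges), std::end(ranges), c,
        [](const CodeRange& r, char32_t v) { return r.hi < v; });
    return it != std::end(ranges) && it->lo <= c;
}

bool IsXmlChar(char32_t c) { return InRanges(kCharRanges, c); }
bool IsXmlNameStartChar(char32_t c) { return InRanges(kNameStartRanges, c); }
```

The name-following check would work the same way, with a twenty-two-entry table. A fast path for ASCII in front of the search would probably be worthwhile in practice, since most names are plain ASCII.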
All of this adds up to a lot of flexibility, and thus overhead, that simply isn't necessary for most uses of XML that I've seen. For those of you who say "who cares, modern systems are fast," I'd like to remind you that every piece of complexity is a piece that can go wrong: an export/import failing, a parser glitch turning into an exploit, or a source of stability problems. This can be true even with a parser that is 100% compliant with the standard, if the parser does not have guards against infinite expansion or excessive parser recursion depth. It'd be so much easier if someone would just go through and strip down XML to an "embedded subset" that only contains what most programmers really think is XML and actually use, but I don't see this happening any time soon.
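To make the point about guards concrete, here is a rough sketch of entity expansion with a nesting limit and an output size limit. It is an illustration only and not the parser's actual code: `ExpansionLimits`, `EntityMap`, `ExpandInto`, and `ExpandEntities` are hypothetical names, the limit values are arbitrary, and the sketch only handles simple `&name;` references in plain text (no character references, predefined entities, or markup inside entities).

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical limits; the defaults are arbitrary illustration values.
struct ExpansionLimits {
    std::size_t max_depth = 16;        // maximum entity nesting
    std::size_t max_output = 1 << 20;  // maximum expanded size in bytes
};

using EntityMap = std::map<std::string, std::string>;

// Appends 'text' to 'out', recursively expanding &name; references.
void ExpandInto(const std::string& text, const EntityMap& entities,
                const ExpansionLimits& limits, std::size_t depth,
                std::string& out) {
    if (depth > limits.max_depth)
        throw std::runtime_error("entity nesting too deep");

    std::size_t pos = 0;
    while (pos < text.size()) {
        std::size_t amp = text.find('&', pos);
        if (amp == std::string::npos) amp = text.size();

        // Copy the literal run before the next reference.
        out.append(text, pos, amp - pos);
        if (out.size() > limits.max_output)
            throw std::runtime_error("expanded text too large");
        if (amp == text.size()) break;

        std::size_t semi = text.find(';', amp);
        if (semi == std::string::npos)
            throw std::runtime_error("unterminated entity reference");

        auto it = entities.find(text.substr(amp + 1, semi - amp - 1));
        if (it == entities.end())
            throw std::runtime_error("undefined entity");

        // The replacement text may itself contain references.
        ExpandInto(it->second, entities, limits, depth + 1, out);
        pos = semi + 1;
    }
}

std::string ExpandEntities(const std::string& text, const EntityMap& entities,
                           const ExpansionLimits& limits = ExpansionLimits()) {
    std::string out;
    ExpandInto(text, entities, limits, 0, out);
    return out;
}
```

The depth limit stops an entity that refers back to itself, and the size limit stops the legal but exponential case where each entity references the previous one several times.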
So, in the end, I stopped working on the XML parser and started working on a JSON parser instead. First, it's so much easier to work off of a spec that essentially fits on one page and doesn't have spaghetti hyperlinks like a Choose Your Own Derivation Adventure book. Second, it's so much simpler. Names? Parsed just like strings, which can contain every character except unescaped quotes, backslashes, and control codes. Entities? Just a reduced set of C-like escapes in strings, and thankfully sans octal. Comments? None. Processing instructions? None. Normalization? None. And as a bonus, it's ideal for serializing property sets or tables. The JSON parser and DOM combined were less than half the size of the XML parser at under 1K lines and took less than a day total to write, and half of that is just UTF-8/16/32 input code (surrogates suck).
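As a rough illustration of how small the escape set is, here is a sketch of decoding the contents of a JSON string (the part between the quotes) into UTF-8, including the surrogate pair handling for `\u` escapes. This is not the code from the parser described here: `AppendUtf8`, `ParseHex4`, and `DecodeJsonString` are hypothetical names, the input is assumed to already be valid UTF-8, and rejecting lone surrogates is a choice made by this sketch rather than something the JSON grammar requires.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Appends code point 'cp' to 'out' encoded as UTF-8.
void AppendUtf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Parses exactly four hex digits starting at s[pos].
std::uint32_t ParseHex4(const std::string& s, std::size_t pos) {
    if (pos + 4 > s.size()) throw std::runtime_error("truncated \\u escape");
    std::uint32_t v = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        const char c = s[pos + i];
        v <<= 4;
        if (c >= '0' && c <= '9') v |= static_cast<std::uint32_t>(c - '0');
        else if (c >= 'a' && c <= 'f') v |= static_cast<std::uint32_t>(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') v |= static_cast<std::uint32_t>(c - 'A' + 10);
        else throw std::runtime_error("bad hex digit in \\u escape");
    }
    return v;
}

// Decodes the text between the quotes of a JSON string into UTF-8.
std::string DecodeJsonString(const std::string& body) {
    std::string out;
    for (std::size_t i = 0; i < body.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(body[i]);
        if (c < 0x20 || c == '"')
            throw std::runtime_error("character must be escaped");
        if (c != '\\') { out += body[i]; continue; }
        if (++i >= body.size()) throw std::runtime_error("dangling backslash");
        switch (body[i]) {
        case '"':  out += '"';  break;
        case '\\': out += '\\'; break;
        case '/':  out += '/';  break;
        case 'b':  out += '\b'; break;
        case 'f':  out += '\f'; break;
        case 'n':  out += '\n'; break;
        case 'r':  out += '\r'; break;
        case 't':  out += '\t'; break;
        case 'u': {
            std::uint32_t cp = ParseHex4(body, i + 1);
            i += 4;  // now at the last hex digit
            if (cp >= 0xD800 && cp <= 0xDBFF) {
                // High surrogate: a \uDC00-\uDFFF escape must follow.
                if (body.compare(i + 1, 2, "\\u") != 0)
                    throw std::runtime_error("lone high surrogate");
                const std::uint32_t lo = ParseHex4(body, i + 3);
                if (lo < 0xDC00 || lo > 0xDFFF)
                    throw std::runtime_error("bad low surrogate");
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                i += 6;
            } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
                throw std::runtime_error("lone low surrogate");
            }
            AppendUtf8(out, cp);
            break;
        }
        default:
            throw std::runtime_error("invalid escape");  // includes octal
        }
    }
    return out;
}
```

Everything except the `\u` case is a one-line table entry, which is most of the reason the string code stays short.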
To be fair, there are a few downsides to JSON, although IMO they're minor in comparison:
- JSON requires Unicode, but doesn't allow a byte order mark (BOM) at the beginning of the file. Argh!
- UTF-32 support is required, which adds two more required formats over XML (UTF-32 LE and BE).
- Duplicate keys are allowed in objects, but behavior isn't specified in that case.
- The production for numbers is the most complex part of the spec.
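Even the number production, the most complex part, fits in a few lines. As a sketch (with `IsJsonNumber` being a hypothetical name, and validation only, with no conversion to a numeric value), the grammar of an optional minus sign, an integer part with no leading zeros, an optional fraction, and an optional exponent can be checked like this:

```cpp
#include <cstddef>
#include <string>

// Returns true if 's' matches the JSON number grammar exactly:
//   [ "-" ] ( "0" | digit1-9 *digit ) [ "." 1*digit ] [ ("e"|"E") ["+"|"-"] 1*digit ]
bool IsJsonNumber(const std::string& s) {
    std::size_t i = 0;
    const std::size_t n = s.size();

    // Consumes a run of decimal digits and returns how many were consumed.
    auto digits = [&]() {
        const std::size_t start = i;
        while (i < n && s[i] >= '0' && s[i] <= '9') ++i;
        return i - start;
    };

    if (i < n && s[i] == '-') ++i;

    // Integer part: a lone zero, or a nonempty run not starting with zero.
    if (i < n && s[i] == '0') {
        ++i;
    } else if (digits() == 0) {
        return false;
    }

    // Optional fraction: at least one digit after the point.
    if (i < n && s[i] == '.') {
        ++i;
        if (digits() == 0) return false;
    }

    // Optional exponent: optional sign, then at least one digit.
    if (i < n && (s[i] == 'e' || s[i] == 'E')) {
        ++i;
        if (i < n && (s[i] == '+' || s[i] == '-')) ++i;
        if (digits() == 0) return false;
    }

    return i == n;  // no trailing characters allowed
}
```

Note the things a C programmer might expect to work that don't: a leading plus sign, leading zeros, a bare `.5`, and a trailing `1.` are all rejected.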
Still, JSON looks much more lightweight for interchange. I'm especially pleased that native parsing support is making it into the next round of browser versions, which hopefully will improve its uptake and therefore available tools.