JSON >> XML (at least for me)

A few days ago, in a similar mood to the one that caused me to start an Atari emulator, I decided to write my own XML parser.

I've had an increasing interest in language parsers ever since I got to the point of parsing algebraic infix expressions and simple C-like languages. I've written about XML annoyances before, but I don't actually have much occasion to work with XML at the code level, because:

And yet, one of the advantages of XML is that it keeps people from creating their own interchange formats, which are typically far more broken. Since I occasionally do need to import and export little bits of metadata, I wanted to see just how much would be involved in having a little XML parser on the side. It wouldn't need to be terribly fast, as we're talking about a couple of kilobytes of data at most being parsed on a fast CPU, but it would need to be small to be usable. And I just wanted to see if I could do it. So I sat down with the XML 1.0 spec, and started writing a parser.

I have to say, my opinion of XML has dropped several notches in the process (er, lower than it already was), and I'm convinced that we need a major revision or a replacement. I got as far as having a working non-validating, internal-subset-only parser that passed all of the applicable tests in the XML test suite, but after writing more than 2000 lines of code just for the parser and not having even started the DOM yet, I had already run into the following:

All of this adds up to a lot of flexibility and thus overhead that simply isn't necessary for most uses of XML that I've seen. For those of you who say who cares and modern systems are fast, I'd like to remind you that every piece of complexity is another place where things can go wrong: an export/import that fails, a parser glitch that turns into an exploit, or a source of stability problems. This can be true even with a parser that is 100% compliant with the standard if the parser does not have guards against infinite entity expansion or parser recursion depth. It'd be so much easier if someone would just go through and strip down XML to an "embedded subset" that only contains what most programmers really think is XML and actually use, but I don't see this happening any time soon.
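To illustrate the kind of guard being described, here is a minimal C++ sketch (hypothetical, not code from the actual parser; the class name and limits are made up) of the bookkeeping a fully compliant XML parser still needs so that hostile input such as a "billion laughs" document can't exhaust memory or blow the stack:

    // Illustrative sketch only -- not the actual parser's code. It shows the
    // kind of guard a compliant XML parser still needs: limits on element
    // nesting depth and on total entity expansion.
    #include <cstddef>
    #include <stdexcept>

    class ParseGuard {
    public:
        void EnterElement() {
            if (++mDepth > kMaxDepth)
                throw std::runtime_error("XML nesting too deep");
        }

        void LeaveElement() {
            --mDepth;
        }

        void AccountExpansion(size_t chars) {
            mExpandedChars += chars;
            if (mExpandedChars > kMaxExpansion)
                throw std::runtime_error("entity expansion limit exceeded");
        }

    private:
        static constexpr size_t kMaxDepth = 256;          // arbitrary illustrative limits
        static constexpr size_t kMaxExpansion = 1 << 20;  // 1 MiB of expanded entity text

        size_t mDepth = 0;
        size_t mExpandedChars = 0;
    };

None of this is required by the spec, which is exactly the problem: the spec permits constructs that a robust implementation has to defend against anyway.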

So, in the end, I stopped working on the XML parser and started working on a JSON parser instead. First, it's so much easier to work off of a spec that essentially fits on one page and doesn't have spaghetti hyperlinks like a Choose Your Own Derivation Adventure book. Second, it's so much simpler. Names? Parsed just like strings, which can contain every character except backslashes, double quotes, and control codes. Entities? Just a reduced set of C-like escapes in strings, and thankfully sans octal. Comments? None. Processing instructions? None. Normalization? None. And as a bonus, it's ideal for serializing property sets or tables. The JSON parser and DOM combined were less than half the size of the XML parser, coming in at under 1K lines and taking less than a day total to write, and half of that is just the UTF-8/16/32 input code (surrogates suck).
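As an illustration of why the UTF handling is the annoying half, here is a rough C++ sketch (hypothetical, not the author's code; the function names are made up) of decoding a JSON \uXXXX escape into a Unicode code point. Characters outside the BMP are written as two escapes forming a UTF-16 surrogate pair, so the decoder has to pair them up itself:

    // Hypothetical sketch of decoding a JSON \uXXXX escape to a code point --
    // not the actual parser's code.
    #include <cstdint>
    #include <optional>
    #include <string_view>

    // Reads exactly four hex digits from the front of s, or fails.
    static std::optional<uint32_t> ReadHex4(std::string_view& s) {
        if (s.size() < 4)
            return std::nullopt;
        uint32_t v = 0;
        for (int i = 0; i < 4; ++i) {
            char c = s[i];
            uint32_t digit;
            if (c >= '0' && c <= '9')      digit = c - '0';
            else if (c >= 'a' && c <= 'f') digit = c - 'a' + 10;
            else if (c >= 'A' && c <= 'F') digit = c - 'A' + 10;
            else return std::nullopt;
            v = (v << 4) | digit;
        }
        s.remove_prefix(4);
        return v;
    }

    // Called just after "\u" has been consumed; s points at the hex digits.
    static std::optional<uint32_t> DecodeEscapedCodePoint(std::string_view& s) {
        auto first = ReadHex4(s);
        if (!first)
            return std::nullopt;

        // Not a surrogate: a plain BMP code point.
        if (*first < 0xD800 || *first > 0xDFFF)
            return first;

        // A low surrogate on its own is malformed.
        if (*first >= 0xDC00)
            return std::nullopt;

        // High surrogate: it must be followed by "\uXXXX" holding a low surrogate.
        if (s.size() < 6 || s[0] != '\\' || s[1] != 'u')
            return std::nullopt;
        s.remove_prefix(2);

        auto second = ReadHex4(s);
        if (!second || *second < 0xDC00 || *second > 0xDFFF)
            return std::nullopt;

        // Combine the pair into a supplementary-plane code point.
        return 0x10000 + ((*first - 0xD800) << 10) + (*second - 0xDC00);
    }

Even so, this is the whole of JSON's escaping story, which is a far cry from XML's entity and character-reference machinery.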

To be fair, there are a few downsides to JSON, although IMO they're minor in comparison:

Still, JSON looks much more lightweight for interchange. I'm especially pleased that native parsing support is making it into the next round of browser versions, which hopefully will improve its uptake and therefore the available tools.

Comments

This blog was originally open for comments when this entry was first posted, but commenting was later closed due to spam, and the comments were removed after a migration away from the original blog software. Unfortunately, it would have been a lot of work to reformat the comments for republishing. The author thanks everyone who posted comments and added to the discussion.