Dare Obasanjo recently posted a message to the xml-dev mailing list as part of the ancient and venerable binary XML permathread (just a bit down the list from attributes vs. elements, DOM vs. SAX, and why use CDATA?). His message included the following:
I don’t understand this obsession with SAX and DOM. As APIs go they both suck[0,1]. Why would anyone come up with a simplified binary format then decide to cruft it up by layering a crufty XML API on it is beyond me.
I suppose that I should rush to SAX’s defense. I can at least point to my related posting about SAX’s good points, but to be fair, I have to admit that Dare is absolutely right: building complex applications that use SAX and DOM is very difficult and usually results in messy, hard-to-maintain code.
The problem is that I have not yet been able to find an XML API that doesn’t, um, suck. So-called simplified APIs like StAX or JDOM always look easier in the simple examples used in introductions and tutorials, but as soon as you try to use them in a real-world application, their relatively minor advantages disappear in the noise of dealing with the complexity of XML structure. For example, late last week I decided to use StAX instead of SAX for a library I was writing, since it was getting very hard to manage context and flow control in a push-parsing environment, and my SAX handler had become (predictably) long and messy. After an hour I realized that my StAX code had become even longer and harder to read than the original SAX-based code, even though StAX lets me use the Java runtime stack to manage context instead of forcing me to do context management on my own. Oh well. StAX looked so much easier in Elliotte Rusty Harold’s excellent tutorial, but as soon as I moved away from toy examples to a real XML data format, everything fell apart.
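To make the push/pull contrast concrete, here is a minimal sketch, and very much one of the toy examples warned about above, not the library code in question. The catalog format, the `ContextDemo` class, and its method names are all made up for illustration. Both methods collect the `title` of each `book` while ignoring the catalog’s own `title`: the SAX version has to keep its own stack of open elements, while the StAX version lets nested method calls carry that context.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ContextDemo {
    // Hypothetical sample document: only the titles inside <book> are wanted.
    static final String XML =
        "<catalog><title>Catalog</title>"
        + "<book><title>First</title></book>"
        + "<book><title>Second</title></book></catalog>";

    // Push style: the parser calls us, so we track the open-element path by hand.
    static List<String> titlesWithSax(String xml) throws Exception {
        List<String> titles = new ArrayList<>();
        Deque<String> path = new ArrayDeque<>();   // manually managed context
        StringBuilder text = new StringBuilder();  // text of the current element
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                path.push(qName);
                text.setLength(0);
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
            @Override
            public void endElement(String uri, String local, String qName) {
                path.pop();
                // After the pop, peek() is the parent element (or null at the root).
                if (qName.equals("title") && "book".equals(path.peek())) {
                    titles.add(text.toString());
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return titles;
    }

    // Pull style: we ask for events, so the call stack records where we are.
    static List<String> titlesWithStax(String xml) throws Exception {
        XMLStreamReader r = XMLInputFactory.newInstance()
            .createXMLStreamReader(new StringReader(xml));
        List<String> titles = new ArrayList<>();
        r.nextTag(); // advance to the root <catalog> start tag
        while (r.nextTag() == XMLStreamConstants.START_ELEMENT) {
            if (r.getLocalName().equals("book")) {
                readBook(r, titles);
            } else {
                skip(r);
            }
        }
        return titles;
    }

    // Called with the reader on a <book> start tag; returns on its end tag.
    static void readBook(XMLStreamReader r, List<String> titles) throws Exception {
        while (r.nextTag() == XMLStreamConstants.START_ELEMENT) {
            if (r.getLocalName().equals("title")) {
                titles.add(r.getElementText()); // leaves the reader on </title>
            } else {
                skip(r);
            }
        }
    }

    // Consumes everything up to the end tag matching the current start tag.
    static void skip(XMLStreamReader r) throws Exception {
        int depth = 1;
        while (depth > 0) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT) depth++;
            else if (event == XMLStreamConstants.END_ELEMENT) depth--;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(titlesWithSax(XML));
        System.out.println(titlesWithStax(XML));
    }
}
```

On this tidy input both methods return the two book titles, and the pull version arguably reads a little better. The sketch also hints at where things go wrong with real data: `nextTag()` throws as soon as it meets non-whitespace text between elements, and the single text buffer in the SAX handler is reset by every child start tag, so either version needs considerably more bookkeeping once mixed content, optional elements, or deeper nesting show up.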
My old SGMLSpl library was also hard to use, so we have a long history of awkward APIs in the markup world. The user APIs can get a little simpler only if you can somehow restrict the kind of XML you’re dealing with, say by banning mixed content or by using a data metaformat like RDF or XTM (more on these in a later posting), because then the library can do some preprocessing for you and give you a predigested view of the information.