The complexity of XML parsing APIs

Dare Obasanjo recently posted a message to the xml-dev mailing list as part of the ancient and venerable binary XML permathread (just a bit down the list from attributes vs. elements, DOM vs. SAX, and why use CDATA?). His message included the following:

I don’t understand this obsession with SAX and DOM. As APIs go they both suck[0,1]. Why would anyone come up with a simplified binary format then decide to cruft it up by layering a crufty XML API on it is beyond me.

[0] http://www.megginson.com/blogs/quoderat/archives/2005/01/31/sax-the-bad-the-good-and-the-controversial/

[1] http://www.artima.com/intv/dom.html

I suppose that I should rush to SAX’s defense. I can at least point to my related posting about SAX’s good points, but to be fair, I have to admit that Dare is absolutely right: building complex applications that use SAX and DOM is very difficult and usually results in messy, hard-to-maintain code.

The problem is that I have not yet been able to find an XML API that doesn’t, um, suck. So-called simplified APIs like StAX or JDOM always look easier with the simple examples used in introductions and tutorials, but as soon as you try to use them in a real-world application, their relatively minor advantages disappear in the noise of trying to deal with the complexity of XML structure. For example, late last week I decided to use StAX instead of SAX for a library I was writing, since it was getting very hard to manage context and flow control in a push-parsing environment and my SAX handler had become (predictably) long and messy. After an hour I realized that my StAX code had become even longer and harder to read than the original SAX-based code, even though StAX lets me use the Java runtime stack to manage context instead of forcing me to do context management on my own. Oh well. StAX looked so much easier in Elliotte Rusty Harold’s excellent tutorial, but as soon as I moved away from toy examples to a real XML data format, everything fell apart.
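For readers who haven’t used a pull API, here is a minimal sketch of the StAX style (this is my own toy code, not the library from the post; the class and element names are invented). The caller drives a single loop, so nesting context can live in ordinary local variables or recursive calls; with SAX, the same logic is inverted into startElement/endElement callbacks and the context has to be kept in handler fields or an explicit stack.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class StaxDemo {

    // Pull parsing: the application asks for the next event when it is
    // ready. With SAX the parser pushes events at you, so any context
    // (e.g. "am I inside a <title>?") must be tracked by hand.
    static List<String> collectNames(String xml) {
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    names.add(reader.getLocalName());
                }
            }
        } catch (XMLStreamException e) {
            throw new RuntimeException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        // prints [doc, title]
        System.out.println(collectNames("<doc><title>hi</title></doc>"));
    }
}
```

Of course, as the post argues, a loop this tidy only survives contact with toy documents; a real format pushes the dispatch logic back into the same kind of sprawl a SAX handler ends up with.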

My old SGMLSpl library was also hard to use, so we have a long history of awkward APIs in the markup world. Only if you can restrict the kind of XML you’re dealing with somehow — say, by banning mixed content or even using a data metaformat like RDF or XTM (more on these in a later posting) — can the user APIs get a little simpler, because the library can do some preprocessing for you and give you a predigested view of the information.

About David Megginson

Scholar, tech guy, Canuck, open-source/data/information zealot, urban pedestrian, language geek, tea drinker, pater familias, red tory, amateur musician, private pilot.

11 Responses to The complexity of XML parsing APIs

  1. Pingback: Danny Ayers, Raw Blog

  2. Pingback: Martins Notepad

  3. Well, since you admit that SAX sucks, I can agree that DOM sucks too 🙂 It’s very interesting and somewhat disheartening that you found the next generation of APIs better for simple things but no better for realistically hard problems. Obviously we have to do better, and it’s probably a good thing that nobody is trying to standardize XML APIs anymore, so that innovation and competition can drive progress.

    I wonder if you could suggest at least the outlines of a realistic application scenario that is complex enough to exhibit the problems you noted but simple enough to become sort of a common example that can be used as a reference point. Something more complex than Hello, World or the XQuery book database of course … maybe at the level of the Employee-Department-Manager-etc. Personnel database that one tends to see in every RDBMS book and tutorial.

    We know how to make trivial XML programs trivial to write (in the next-generation-after-SAX/DOM APIs, anyway). It would be nice to make an interesting class of less trivial XML applications at least easy, and having a common reference example might help us all evolve in that direction.

  4. David Megginson says:

    Thanks, Mike — that sounds like a great idea for a future posting.

  5. Pingback: Dare Obasanjo's WebLog

  6. Anthony B. Coates says:

    Here’s the thing. I’ve been doing Java for rather longer than I’ve been doing XML, and yet, whenever I have to convert one kind of XML into another, I go straight for XSLT. Why? Well, I guess there are two things:
    (i) XML is a genuine part of the data model in XSLT. So you don’t get stupid complications that arise from forcing the square peg XML data model through some round hole language data model. JavaBeans and C/C++/C# classes just don’t map neatly onto the XML infoset.
    (ii) When you write XSLT, you can write fragments of XML directly, and embed bits of code within the XML fragments for the dynamic parts of the code. That often makes it vastly easier to see what the result will be like, and that just speeds up the whole development process.
    What it all comes down to is that any XML API will suck where XML isn’t a first-class data type and isn’t integrated into the language syntax. Languages that build XML (or even tree-structured data) into the syntax make code that is just easier to write and maintain (XSLT, E4X, Comega, Groovy), and some if not all can be compiled, so that you don’t lose much in performance but gain a lot in development time.
    That’s how I see it anyway. Cheers, Tony.
    PS With JDK 5.0, it would be an afternoon task to write a compiler front-end that compiles both Java and XSLT 1.0 sources into Java bytecodes. Horses for courses? Could be.
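Tony’s point (i) and (ii) can be made concrete with a small, self-contained sketch; the toy stylesheet and names below are mine, not from the comment. The literal <name> element in the stylesheet is the output, written directly as XML, with xsl:value-of filling in the dynamic parts; the standard javax.xml.transform API runs it from Java:

```java
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import java.io.StringReader;
import java.io.StringWriter;

public class XsltDemo {

    // The stylesheet is itself XML: the <name> element below is a literal
    // result fragment, and the xsl:value-of instructions supply the
    // dynamic parts. (xsl:text preserves the space between them, since
    // whitespace-only text nodes in a stylesheet are otherwise stripped.)
    static final String XSL =
          "<xsl:stylesheet version='1.0' "
        +     "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        +   "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        +   "<xsl:template match='/person'>"
        +     "<name><xsl:value-of select='first'/>"
        +     "<xsl:text> </xsl:text>"
        +     "<xsl:value-of select='last'/></name>"
        +   "</xsl:template>"
        + "</xsl:stylesheet>";

    static String transform(String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)),
                        new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(
                "<person><first>Ada</first><last>Lovelace</last></person>"));
        // prints <name>Ada Lovelace</name>
    }
}
```

Compare that with doing the same rename-and-concatenate in DOM or SAX: the output XML never appears as XML anywhere in the program, only as a sequence of method calls.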

  7. Thanks for the comment, Tony. The template approach is a pleasant way to generate XML, whether through XSLT transformations (when used in a templatey way) or JSP — you have everything in front of you at once, just like in a fullscreen text editor (vs. the old line-oriented editors, which have a lot in common with writing program code to manipulate XML).

    The trouble is that the templates don’t really work for reading XML into objects or data structures, or for taking complex programming actions based on the XML. Also, XSLT (or DOM) can bring a busy web server to its knees because of the processor and memory requirements, though for some applications smart caching can help a lot.

    By the time people get to the point of using a streaming API like SAX or StAX, they often have a serious problem like a sluggish application server and are willing to accept a lot of pain for the sake of curing it. It would be nice if we could figure out some middle ground, something that wasn’t as hard to use as a streaming API but not as maddeningly inefficient as DOM or XSLT.

  8. Pingback: Software Documentation Weblog

  9. Are you referring just to the W3C DOM in Java (which definitely sucks) or to all DOM toolkits? Have you looked at XOM?

  10. Pingback: Dion Hinchcliffe's Blog - Musings and Ruminations on Building Great Systems

  11. Pingback: Dion Hinchcliffe's Blog - Musings and Ruminations on Building Great Systems

Comments are closed.