The complexity of XML parsing APIs

Dare Obasanjo recently posted a message to the xml-dev mailing list as part of the ancient and venerable binary XML permathread (just a bit down the list from attributes vs. elements, DOM vs. SAX, and why use CDATA?). His message included the following:

I don’t understand this obsession with SAX and DOM. As APIs go they both suck[0,1]. Why would anyone come up with a simplified binary format then decide to cruft it up by layering a crufty XML API on it is beyond me.

[0] http://www.megginson.com/blogs/quoderat/archives/2005/01/31/sax-the-bad-the-good-and-the-controversial/

[1] http://www.artima.com/intv/dom.html

I suppose that I should rush to SAX’s defense. I can at least point to my related posting about SAX’s good points, but to be fair, I have to admit that Dare is absolutely right — building complex applications that use SAX and DOM is very difficult and usually results in messy, hard-to-maintain code.

The problem is that I have not yet been able to find an XML API that doesn’t, um, suck. So-called simplified APIs like StAX or JDOM always look easier with the simple examples used in introductions and tutorials, but as soon as you try to use them in a real-world application, their relatively minor advantages disappear in the noise of trying to deal with the complexity of XML structure. For example, late last week I decided to use StAX instead of SAX for a library I was writing, since it was getting very hard to manage context and flow control in a push-parsing environment and my SAX handler had become (predictably) long and messy. After an hour I realized that my StAX code had become even longer and harder to read than the original SAX-based code, even though StAX lets me use the Java runtime stack to manage context instead of forcing me to do context management on my own. Oh well. StAX looked so much easier in Elliotte Rusty Harold’s excellent tutorial, but as soon as I moved away from toy examples to a real XML data format, everything fell apart.
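
For reference, here is the pattern that makes StAX look so attractive in tutorials: each container element gets its own method, so the Java call stack holds the parsing context. This is only a minimal sketch with invented element names, not code from my library:

import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StaxSketch {

    public static void main(String[] args) throws XMLStreamException {
        String doc = "<library><book><title>Ulysses</title></book></library>";
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(doc));
        reader.nextTag();            // advance to <library>
        parseLibrary(reader);
    }

    // We know we are inside <library>: no explicit context stack needed.
    private static void parseLibrary(XMLStreamReader reader)
            throws XMLStreamException {
        while (reader.nextTag() == XMLStreamConstants.START_ELEMENT) {
            parseBook(reader);
        }
    }

    private static void parseBook(XMLStreamReader reader)
            throws XMLStreamException {
        while (reader.nextTag() == XMLStreamConstants.START_ELEMENT) {
            if ("title".equals(reader.getLocalName())) {
                System.out.println("title: " + reader.getElementText());
            }
        }
    }
}

The trouble starts when the real document is less regular than this: optional elements, mixed content, and error recovery quickly bury the clean recursive shape.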

My old SGMLSpl library was also hard to use, so we have a long history of awkward APIs in the markup world. Only if you can restrict the kind of XML you’re dealing with somehow — say, by banning mixed content or even using a data metaformat like RDF or XTM (more on these in a later posting) — can the user APIs get a little simpler, because the library can do some preprocessing for you and give you a predigested view of the information.

Blame Larry Wall

Late yesterday I was working on a mind-numbingly simple XML data library in Java for use with a larger project. I spent about an hour on the first iteration, which could read and write through an event interface and/or into a data tree but used only simple names. After supper, I came back and spent another hour writing a beautifully elegant XMLName class and refactoring the rest of the code to support namespace-qualified names. The class had getters and setters for the namespace URI and local name, proper equals and hashCode methods, and, at one point, support for the Comparable and Serializable interfaces — but it went even further: to support the flyweight design pattern it was declared final and had a weak-reference lookup table for interning, like the Java String class. It even had a static intern method that took two arguments, so that you could create an interned XMLName directly without having to construct a non-interned version first:

XMLName name = XMLName.intern("http://www.w3.org/1999/xlink", "href");
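
Something along these lines — a simplified sketch of the design described above, not the actual (now deleted) class:

import java.lang.ref.WeakReference;
import java.util.Map;
import java.util.WeakHashMap;

// Flyweight XML name: one shared instance per URI/local-name pair.
public final class XMLName {

    // Weak references, so unused names can be garbage-collected,
    // just like the String intern table.
    private static final Map<XMLName, WeakReference<XMLName>> TABLE =
            new WeakHashMap<XMLName, WeakReference<XMLName>>();

    private final String namespaceURI; // null means no namespace
    private final String localName;

    private XMLName(String namespaceURI, String localName) {
        this.namespaceURI = namespaceURI;
        this.localName = localName;
    }

    // Look up (or register) the shared instance for this pair.
    public static synchronized XMLName intern(String namespaceURI,
                                              String localName) {
        XMLName candidate = new XMLName(namespaceURI, localName);
        WeakReference<XMLName> ref = TABLE.get(candidate);
        XMLName interned = (ref == null) ? null : ref.get();
        if (interned == null) {
            TABLE.put(candidate, new WeakReference<XMLName>(candidate));
            interned = candidate;
        }
        return interned;
    }

    public String getNamespaceURI() { return namespaceURI; }
    public String getLocalName() { return localName; }

    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof XMLName)) return false;
        XMLName other = (XMLName) o;
        return (namespaceURI == null ? other.namespaceURI == null
                : namespaceURI.equals(other.namespaceURI))
                && localName.equals(other.localName);
    }

    public int hashCode() {
        return (namespaceURI == null ? 0 : namespaceURI.hashCode()) * 31
                + localName.hashCode();
    }
}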

In other words, it was pretty cool — fast, memory-efficient, and properly designed. I’m sure that many of the people reading this posting have designed similar classes for XML work and taken similar pride in them. Unfortunately, before I went to bed, I realized I’d have to delete the class when I got up in the morning.

Why? I blame Larry Wall for all my grief, because it was his voice that started playing in my head, saying “easy things should be easy, and hard things should be possible.”

I messed up because I was focussing on the harder part of the problem. For simple XML configuration files, most people won’t be using namespaces most of the time, so forcing them to write

branch.setName(XMLName.intern(null, "foo"))

instead of

branch.setName("foo")

is a bad idea. Of course, I could hide that behind the scenes by adding extra method calls, say, setNameString and getNameString, but then I end up cluttering up my code (harder to learn, more bugs, trickier maintenance, etc.), again, just in an attempt to make the hard case easier.

The right solution for this particular library is one that James Clark suggested back in 1998 or 1999 when we were first trying to figure out how to get namespace support into SAX, and one that I sometimes wish we had taken up (though it’s not one of my biggest regrets): represent any XML name as a single string, with the namespace URI and the local name merged together. James preferred enclosing the namespace URI in braces, like this: “{http://www.w3.org/1999/xlink}href”; another option is to separate the two with a space, like this: “http://www.w3.org/1999/xlink href”. Of course, any library that does this should provide helper functions for splitting the string into its two parts and recombining them.

So, while I’m still channelling Larry’s voice, let’s see how well this solution fits. First, the easy case:

String name = branch.getName();
branch.setName("foo");

OK, looks good: the easy thing is easy. Now, the hard case:

String name = branch.getName();
String[] parts = Utils.splitName(name);
branch.setName("{http://www.example.org/ns}foo");

The hard thing is not easy, but it’s possible. Perhaps Larry’s voice will leave my head now, and I can get on with life and coding, in that order.
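
For completeness, here is what the hypothetical Utils helpers used above might look like, assuming Clark’s brace notation (a sketch, not a published library):

// Helpers for the single-string name representation.
public final class Utils {

    private Utils() {}

    // Split "{uri}local" into { uri, local }; a plain name comes
    // back with a null namespace URI.
    public static String[] splitName(String name) {
        if (name.startsWith("{")) {
            int end = name.indexOf('}');
            if (end == -1) {
                throw new IllegalArgumentException(
                        "Unterminated namespace URI: " + name);
            }
            return new String[] { name.substring(1, end),
                                  name.substring(end + 1) };
        }
        return new String[] { null, name };
    }

    // Recombine a URI/local-name pair into the single-string form.
    public static String joinName(String namespaceURI, String localName) {
        return (namespaceURI == null) ? localName
                : "{" + namespaceURI + "}" + localName;
    }
}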

Perl XML::Writer has a good home

I just stumbled on this posting, and was happy to see that the Perl version of my XML writer (a library for creating XML) has found a good home. I originally wrote the XML writer in both Java and Perl versions, but the Perl version was always the neglected sibling — I just don’t use Perl that much any more, and wasn’t motivated to fix bugs, add features, etc.

Over the years, several people offered to take over maintenance of the Perl branch, but usually nothing came of it, and I lost track of who, if anyone, was supposed to be managing it. I recently revived the XML-Writer Sourceforge project and have been doing some maintenance on the Java branch, but again, hadn’t looked at the Perl.

So I’ll do some work on the Java, but will leave the Perl in better hands. This is a small but nice example of how open source is supposed to work: the people who care the most are the ones who do the work, and when the original maintainer loses interest, others are ready to step in.

SAX: biggest satisfactions

Recently, I mentioned my biggest regrets about SAX. When we were building SAX, however, there were an awful lot of things that went right. Here are the three things that I’m happiest about:

SAX was useful right from the start

Not just useful, in fact, but more useful than any alternative at the time. When I wrote the first draft of SAX over Christmas 1997 and put it up on the xml-dev mailing list for discussion and review in January 1998, the package included not only an interface definition but also drivers/adapters for all four existing Java XML parsers: James Clark’s XP, Tim Bray’s Lark, Microsoft’s MSXML (I don’t think a Java version is still available), and my own AElfred (now maintained by others). That meant that right away, a Java developer would be able to write code that worked with any existing XML parser.

This was an important point because I was afraid that the big computer companies (IBM and Oracle were also working on parsers) were going to try to lock developers into their platforms through proprietary parser interfaces. XML is an open format, but if all your code and all the libraries available to you work only with (say) IBM’s or Microsoft’s parser interface, then you haven’t gained much over using a proprietary format.

Another advantage, one that I hadn’t anticipated, was that people started developing large-scale projects with SAX right away, so they shook out bugs and design problems very quickly. Running code is always a good thing, but running code that actually makes developers’ lives easier trumps anything else.

SAX is efficient

There are so many things that we could have done to kill SAX’s efficiency: we could have returned strings for character data instead of arrays (which can point directly into the parser’s buffer); we could have returned elaborate objects for events, managed from some kind of pool; we could have managed a context stack for the user, whether she needed it or not; but we did none of those things. I was tempted, sometimes, but the other volunteers in the project quickly slapped me back into line.
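
The characters callback is the canonical example: the parser hands over a slice of what may be its own internal buffer, and the application copies only if it needs to. A minimal handler (the class name is invented for illustration):

import org.xml.sax.helpers.DefaultHandler;

public class TextCollector extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();

    // No String is allocated by the parser; we get array, start, length.
    public void characters(char[] ch, int start, int length) {
        // Copy now if the text is needed later: the parser may reuse
        // the array as soon as this call returns.
        text.append(ch, start, length);
    }

    public String getText() {
        return text.toString();
    }
}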

The rationale was simple: it is easy to build all of those things on top of SAX if you need them (and, in fact, Michael Kay’s SAXON started life as a friendly SAX helper library, before it evolved into an XSLT framework), but there is no way to remove them if you don’t need them. As a result, SAX concentrated on standardizing the way that parsers deliver information rather than providing a friendly user experience — once that was standardized, it would be easy to build layers on top that would work with any parser. In short, the motto was do no harm rather than make it fun and simple; it turned out to be a perfect example of worse is better.

I had assumed that just about everyone would work through those higher-level libraries, but in the end (to my surprise), lots of developers learned to love the clumsy, low-level SAX interfaces in all their ugly glory. I myself have messed around with writing higher-level libraries on top of SAX, only to go back to the raw ContentHandler and its friends every time. For some reason, hard-core XML developers like to stay close to the metal, no matter how many friendly high-level tools people offer them.

SAX supports filter chains

SAX filter chains may seem obvious now, but I doubt I would ever have been able to think them up. I cannot remember who first suggested using SAX handlers in chains, like a Unix pipeline — perhaps the idea just evolved gradually as a kind of groupthink — but it was well established by SAX2 and officially supported by a dedicated interface. We don’t support filters perfectly (error handling is a bit kludgy), but people make beautifully simple yet powerful systems using them.
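
A filter is just an XMLFilterImpl subclass that overrides the events it cares about and passes everything else through. For example, this little filter (an invented illustration) renames one element as the events stream by:

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

public class RenameFilter extends XMLFilterImpl {

    public RenameFilter(XMLReader parent) {
        super(parent);
    }

    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if ("oldName".equals(localName)) {
            super.startElement(uri, "newName", "newName", atts);
        } else {
            super.startElement(uri, localName, qName, atts);
        }
    }

    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        if ("oldName".equals(localName)) {
            super.endElement(uri, "newName", "newName");
        } else {
            super.endElement(uri, localName, qName);
        }
    }
}

Because a filter is itself an XMLReader, filters nest naturally — wrap one around another around a real parser from XMLReaderFactory.createXMLReader() and the events flow through like a Unix pipeline.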

I don’t think that there will ever be substantial changes to SAX. Now that I’ve resumed maintaining it, I’ll try to fix bugs and keep it up to date with any new XML versions, but otherwise, it is what it is. Perhaps something newer, like StAX or some other pull interface, will eventually displace SAX, and that would be fine too. For now, though, it is an essential part of the XML infrastructure, used at tens or hundreds of thousands of sites, and the best thing I can do is keep it stable and make as few changes as possible.

Wikipedia URLs as blog subject codes

[Updated] Over in my aviation weblog, I find myself more and more linking to Wikipedia whenever I’m discussing a concept, person, place, or anything else that doesn’t have its own, canonical home page. If, as I suspect, lots of other bloggers are doing the same, then links to Wikipedia articles may soon be the blogsphere’s answer to subject codes.

News wire services like Reuters or Dow Jones put a lot of time and money into maintaining long lists of subject codes to attach to their news products. Unlike the simple categories used in blogs, subject codes tell you not just that an article is about (say) computer technology, but that it is about specific companies, industries, people, places, and concepts. News customers use the codes to classify stories automatically, routing them to the appropriate editorial sections, displaying them on trading screens, sorting them into categories on web sites, or using them to improve searches. The providers are constantly sending out updated lists, keeping their customers’ technical departments very busy.

Should weblogs be using some kind of subject code (beyond categories)? Some areas already have standard identifiers that we could use, such as ICAO codes for airports, UPCs for retail products, ISBNs for books, CUSIPs for financial instruments, or ISO codes for countries, languages, and currencies. However, each of those requires some surrounding context: you need not only the code, but some indication that it refers to a currency or an airport. They’re also managed by central authorities, making them less attractive to the weblog community.

Enter Wikipedia. If I’m posting about Washington the U.S. state, I can link to the Wikipedia article about the state; if I’m posting about Washington the U.S. president, I can link to the article about the president; if I’m posting about Washington the U.S. capital, I can link to the article about the city; and if I’m using the word Washington by metonymy to refer to the U.S. government, I can link to the article about the government.

Bingo — subject codes, just like the big newswires use, only a lot more useful and totally open. I can link to abstract subjects like love or communism or to time periods like the middle ages just as easily as I can link to concrete people, places, or things; if there’s not already a Wikipedia article on my subject, I can always start a stub. If people keep linking to Wikipedia, search engines like Technorati and aggregators like Bloglines might start taking advantage of those links to do some automatic categorization, right down to offering links to other postings on the same subject (“Click here for other postings about Open Source”). Once people know the search engines are doing that, they’ll be bound to link to Wikipedia even more than they already are, creating a virtuous circle where both Wikipedia and the blogsphere become more valuable.

Of course, like anything that people actually do on the web (as opposed to drawing-board architectures that never get implemented), this approach is far from perfect. Once the search engines are paying attention to Wikipedia links, some people will deliberately include misleading links to have their weblog entries miscategorized, though rankings like Technorati’s should help make sure that the most relevant ones stay near the top of the list. Furthermore, Wikipedia URLs do change, especially for the sake of disambiguation, so the Wikipedia URLs will never be 100% accurate as subject codes. And finally, the Wikipedia project itself could shut down, leaving all of the subject codes orphaned. Still, since linking to Wikipedia is something many of us do anyway, it looks like a good, quick-and-dirty webby alternative to the news industry’s subject codes — it might even work better.

Update: James Tauber posted the same idea with slightly different language back in October, and has just put up a followup.

SAX: biggest regrets

It was seven years ago this January that I put out the first prerelease of SAX for consideration by the xml-dev mailing list. The final SAX releases contain the wisdom of a lot of people, but in the end, I had to make the final decisions about how it would work, and my record was mixed. Now that SAX is a standard (if unremarkable) part of the XML infrastructure, I thought it would be worth making two or three posts about what went wrong and what went right. In this post, I’ll start with my three biggest regrets about SAX/Java:

SAXException does not extend IOException

XML parsing is a kind of I/O, and the exception should have reflected that. If we had done things that way, any library that does XML parsing could simply have thrown IOException, without having to expose any XML stuff at all or to force tunnelling of exceptions inside other exceptions, etc. This one bugs me every time I code with SAX.
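
Here is the nuisance in miniature: a library method that wants to present plain I/O semantics has to tunnel the SAXException by hand (ConfigLoader is an invented example):

import java.io.IOException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class ConfigLoader {

    // Callers see only IOException; the XML machinery stays hidden.
    public void load(String url) throws IOException {
        try {
            XMLReader reader = XMLReaderFactory.createXMLReader();
            reader.parse(new InputSource(url));
        } catch (SAXException e) {
            // SAXException is not an IOException, so wrap it by hand.
            throw new IOException("XML parsing failed: " + e.getMessage());
        }
    }
}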

SAX uses callbacks instead of a pull interface

In this case, though, I probably wouldn’t do things differently if I could go back in time. To get acceptance, SAX had to work with all existing Java/XML parsers. They used callbacks, and the only way to get a pull interface would have been to run the parser in a separate thread, an approach that wasn’t all that stable back in early 1998 (especially not on Windows). Callbacks are not a serious problem for most applications, but they do make event dispatching much more difficult and sometimes they make for messy, hard-to-maintain code. Now that Java thread support is rock-solid on all platforms, it’s easy enough to write a good pull-parsing adapter for SAX (I have one that I can release, if anyone cares). I’ve played around with StAX a bit, but none of the StAX drivers seems as stable as the SAX ones.
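
The basic shape of such an adapter is easy to sketch: run the parser in its own thread and let it feed a queue that the caller drains. This is only a rough sketch, not my adapter — a real one would use a proper event class and a bounded queue for backpressure:

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLReaderFactory;

public class PullAdapter {

    private final BlockingQueue<String> events =
            new LinkedBlockingQueue<String>();

    public void start(final String url) {
        new Thread(new Runnable() {
            public void run() {
                try {
                    XMLReader reader = XMLReaderFactory.createXMLReader();
                    reader.setContentHandler(new DefaultHandler() {
                        public void startElement(String uri, String local,
                                                 String qName,
                                                 Attributes atts) {
                            events.add("start:" + qName);
                        }
                        public void endElement(String uri, String local,
                                               String qName) {
                            events.add("end:" + qName);
                        }
                    });
                    reader.parse(new InputSource(url));
                    events.add("#end-of-document");
                } catch (Exception e) {
                    events.add("#error:" + e);
                }
            }
        }).start();
    }

    // The pull side: blocks until the parser thread delivers an event.
    public String next() throws InterruptedException {
        return events.take();
    }
}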

SAX2 isn’t really simple

The original vision for SAX was to keep it dead simple. The XML 1.0 REC required that we report certain information, like processing instructions, but otherwise, I wanted to keep it as close to elements-attributes-content as humanly possible. SAX1 didn’t do too bad a job of that. SAX2 had to add support for namespaces, which messed up all the interfaces; at that point, people were screaming for all kinds of esoteric stuff that about 12 people in the world care about (e.g. entity boundaries). Instead of making SAX even more complicated, I invented the property and extension interfaces so that people could invent new things without cluttering the core. Then SAX ended up with all kinds of new, optional interfaces in the distribution anyway, so it’s quite nightmarish for a new user trying to figure out what matters and what doesn’t. If I ever put out a SAX3, I’ll do most of the work using the delete key, but that’s probably not possible when things like JAXP depend so heavily on SAX.
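
For those who haven’t met it, the feature/property mechanism names optional behaviour with URIs, so extensions never have to touch the core interfaces:

import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class FeatureDemo {

    public static void main(String[] args) throws Exception {
        XMLReader reader = XMLReaderFactory.createXMLReader();
        try {
            // Two of the standard SAX2 feature URIs.
            reader.setFeature("http://xml.org/sax/features/namespaces", true);
            reader.setFeature("http://xml.org/sax/features/validation", false);
        } catch (SAXException e) {
            // SAXNotRecognizedException and SAXNotSupportedException both
            // extend SAXException: a parser may decline an optional feature.
            System.err.println("Feature not supported: " + e.getMessage());
        }
    }
}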

The weblog stack

Networking people love to talk about the network stack, like the 4-layer DoD model or the 7-layer OSI model, and web services boosters have picked up on that with their talk about the web services stack (an example from Judith Myerson at IBM, an example from David Orchard at BEA, and a bit of skepticism from Kendall Grant Clark).

Should we be talking about a weblog stack? The web services stack almost always starts with HTTP rather than going all the way down into the lower-level networking protocols, so a similar weblog stack using RSS 2.0 would look something like this:

HTTP
XML
Namespaces
RSS 2.0
RSS 2.0 extensions (like the well-formed web extensions)

A diagram like this helps me to write an RSS library or aggregator, but does it leave me any more aware of how the blogsphere ticks? Not really, because not everything passes through this stack. For a non-full-text feed, for example, the headline and description show up this way, but then the main posting gets to me through a normal web HTTP+HTML route, totally independent of XML or RSS. Other kinds of communication bypass my proposed stack completely, like trackback and pingback, or even Technorati rankings for that matter.

Building a stack provides a cute technical model of one step in the weblog process, but it doesn’t explain how the whole thing works, much less why it works. In fact, human social products are almost always too messy to capture in simple trees or stacks. I faced exactly the same issue when I used to teach the history of the English language at university — technically, English is descended in a straight line from Old English, which is descended from proto-Germanic, which is descended from Indo-European. In reality, though, English borrowed an enormous amount of vocabulary and even syntax from languages like Latin, Greek, and French, which are not direct ancestors: imagine that you had your grandmother’s ears, but the nose of someone your mother happened to pass by on the sidewalk one day and a heart condition inherited from your father’s favourite 17th century Dutch painter, and you’ll see the problem.

Maybe the fact that weblog activities do not fit into a simple stack is not an unfortunate sign of a lack of intellectual rigour but the very reason for the weblog’s success. Web services people, take note — you might want to try thinking less about new specifications and more about human behaviour.

Linking XML documents

[Update: help is on the way.] If you start with an XML document online (and granted, there are precious few of them), how do you use it to find other XML documents? If they’re XML+XHTML documents, you can follow the URLs in any xhtml:a/@href attributes you find in the document; if they’re XML+RDF documents, you can follow the @rdf:about and @rdf:resource attributes; if they’re XML+Docbook documents, you can follow the ulink/@url attributes; and so on.

But what about plain old XML? The best candidate seems to be XLink. While the specification is excessively complicated, it does offer the global xlink:href attribute as a simple linking attribute that any type of XML document can use: some document types, like XML Topic Maps, have taken full advantage of it.

Unfortunately, there is no conformant way to use just xlink:href in an XML document; every time it appears, you also need to have the xlink:type attribute set to the value “simple”. Oops! XTM gets around that by declaring the attribute with a #FIXED value in its DTD, so that it does not have to be repeated in the document itself, but we can hardly require every XML document online to use a DTD or schema, and if they don’t include xlink:type, they’re not conformant. So we cannot simply have

<musician xlink:href="http://www.example.org/bach/"/>
<musician xlink:href="http://www.example.org/beethoven/"/>
<musician xlink:href="http://www.example.org/vivaldi/"/>

but rather, we are forced to use

<musician xlink:type="simple" xlink:href="http://www.example.org/bach/"/>
<musician xlink:type="simple" xlink:href="http://www.example.org/beethoven/"/>
<musician xlink:type="simple" xlink:href="http://www.example.org/vivaldi/"/>

That gets extremely annoying after a few hundred times, probably enough to prevent it from getting universal acceptance. So what do we do? Is there any way to cheat and say something like all XML documents that do not have a DTD are assumed to have an implied DTD with a fixed declaration of xlink:type for every element? I don’t think so. The XLink recommendation was written by some of the brightest people in XML, and I’m sure that they didn’t intend for it to be so awkward for the simplest (and most common) case. It would be wonderful if the W3C could put out some kind of corrigendum stating that when xlink:type is missing, it defaults to “simple”. That’s all we need. Really.
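
In the meantime, a crawler can simply cheat: harvest every xlink:href it sees and ignore the conformance question. A SAX-based sketch (the class name is invented):

import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class XLinkHarvester extends DefaultHandler {

    private static final String XLINK_NS = "http://www.w3.org/1999/xlink";

    private final List<String> links = new ArrayList<String>();

    // Collect the link whether or not xlink:type="simple" is present.
    // Requires the (default) SAX2 namespaces feature to be on.
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        String href = atts.getValue(XLINK_NS, "href");
        if (href != null) {
            links.add(href);
        }
    }

    public List<String> getLinks() {
        return links;
    }
}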

[Update: I forgot to mention that the W3C’s XML Linking working group no longer exists to make any changes to the spec.]

[Update: it turns out that the XML Core WG is working on this very issue: two days after my original posting, entirely by coincidence, Norm Walsh posted that xlink:type will likely become optional.]

Welcome to Quoderat

Welcome to Quoderat, David Megginson’s middle-aged grumblings about technology. My background is largely in XML and the web (I led the development of SAX and spent some time on W3C working groups), so markup, networking, and software development will be the primary topics, though the weblog will likely strike off in other directions. If you get bored of all this, feel free to wander off and read about airplanes in my other weblog, Land and Hold Short.
