I was listening to Tim Bray’s excellent talk On Language Creation today at the XML 2005 conference in Atlanta. Tim was talking about creating new XML-based markup languages (summary: “please don’t”), and in passing he mentioned the must-ignore/must-understand design pattern. For the first time, it occurred to me that this pattern has a serious flaw.
The pattern works this way: you want to let people extend your XML-based language with new elements, and you want forward compatibility so that systems don’t break if or when you upgrade the language. It’s usually a good idea, then, to let applications simply ignore what they don’t understand (as HTML does). That’s called must-ignore. For example, if your application sees this XML document
<record> <a>xxx</a> <b>xxx</b> <w>xxx</w> <c>xxx</c> </record>
but it does not understand the w element (maybe you added it to hold extra information for a different application), it will just pretend that the w element wasn’t there, and might process the document as if it read
<record> <a>xxx</a> <b>xxx</b> <c>xxx</c> </record>
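Here is a minimal sketch of that must-ignore behaviour, using Python’s xml.etree and assuming a hypothetical application that knows only the a, b, and c elements:

```python
import xml.etree.ElementTree as ET

# Elements this hypothetical application understands
KNOWN = {"a", "b", "c"}

def process_lenient(xml_text):
    """must-ignore: silently skip any child element we don't recognize."""
    record = ET.fromstring(xml_text)
    return [(child.tag, child.text) for child in record if child.tag in KNOWN]

doc = "<record> <a>xxx</a> <b>xxx</b> <w>xxx</w> <c>xxx</c> </record>"
print(process_lenient(doc))  # <w> vanishes: [('a', 'xxx'), ('b', 'xxx'), ('c', 'xxx')]
```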
On the other hand, if w contained some kind of crucial information that would change the application’s processing — say, by reversing the outcome or specifying an essential prerequisite (“turn off the oxygen first”) — it would be better to have the application quit and report an error instead of chugging on ahead. That’s called must-understand. Some specifications, like SOAP, actually specify these rules inside the XML instance on an instance-by-instance basis, but most simply frame them in general terms in the specification.
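A minimal sketch of the must-understand alternative, again assuming a hypothetical application that knows only the a, b, and c elements: rather than skipping an unknown element, the processor refuses the whole document.

```python
import xml.etree.ElementTree as ET

# Elements this hypothetical application understands
KNOWN = {"a", "b", "c"}

def process_strict(xml_text):
    """must-understand: refuse the whole document on any unknown element,
    rather than risk acting on a record whose meaning may have changed."""
    record = ET.fromstring(xml_text)
    for child in record:
        if child.tag not in KNOWN:
            raise ValueError(f"cannot process: unknown element <{child.tag}>")
    return [(child.tag, child.text) for child in record]

print(process_strict("<record><a>1</a><b>2</b></record>"))  # [('a', '1'), ('b', '2')]
# process_strict("<record><a>1</a><w>2</w></record>") would raise ValueError
```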
I realized today, however, that there’s a huge problem with this approach: must-ignore and must-understand are properties of a processing model, not a markup language. Consider an XML language for a business report: if I designate an element as must-understand, what do I really mean?
- An application must understand this element to copy this information into a database?
- A search engine must understand this element to index it?
- A formatting engine must understand this element to generate a PDF?
- An XML editing tool must understand this element to open the document?
- An XSLT engine must understand this element to do a transformation?
- An archiver must understand this element to save the report for auditing purposes (say, Sarbanes-Oxley requirements)?
Each of these represents a different processing model for the same XML document. The must-understand and must-ignore constraints will likely be different for each one, so they’re obviously not properties of the XML-based markup language. Some XML languages, like SOAP and Atom, are specified explicitly as parts of protocols, so the must-understand/must-ignore constraints are part of the protocol specification, but even then, once you have XML, you never know what clever things people will decide to do with it.
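One way to make the point concrete is to model the must-understand set as a property of each processor rather than of the vocabulary. Everything below — the processor names and the element names — is invented for illustration; none of it comes from any real specification.

```python
# Hypothetical per-processor must-understand sets for a single
# business-report vocabulary. Same document, different constraints.
MUST_UNDERSTAND = {
    "database-loader": {"record-id", "amount", "currency"},
    "search-indexer":  {"title", "body"},
    "pdf-formatter":   {"title", "body", "page-break"},
    "archiver":        set(),  # stores the bytes verbatim; nothing is critical
}

def fatal_elements(model, understood, document_elements):
    """Elements that force this processor to stop: critical for *this*
    processing model, present in the document, but not understood."""
    return (MUST_UNDERSTAND[model] & document_elements) - understood

doc = {"title", "body", "amount", "audit-note"}
print(fatal_elements("search-indexer", {"title"}, doc))  # {'body'} stops the indexer
print(fatal_elements("archiver", set(), doc))            # set(): the archiver never cares
```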
If I understand you rightly, Walter Perry has been saying this sort of thing for years, and I think it’s a very important insight for making sure that XML is not too strongly tied to far more ephemeral application models. Schema and modeling systems should basically talk about what they know how to talk about, and not try to philosophize about every possible hypothetical characteristic of data in the system. It’s up to the application layer to use the right modeling tools to satisfy the *local* business semantics.
It seems to me that the true meaning of “must-understand” conventions is this: if you get something that is implicitly or explicitly marked “must-understand”, then if you don’t understand it, you don’t understand anything else either. The fact that Atom has no “must-understand”, and treats everything not documented as belonging in an Atom document (even things in the Atom namespace) as “must-ignore” means that no new Atom element can change the intended semantics of an existing Atom element. (Tim says this is okay with him/them.)
But if adding a new element might change the purpose of old elements, then “must-understand” is a sensible notion: RELAX NG and XML Catalogs are “must-ignore” for elements and attributes outside the RNG namespace, but “must-understand” for things in the namespace: if you see a (currently unknown) element, you are probably trying to process a new version, and all bets are off as far as the interpretation of existing elements. (Just my opinion, not attributable to any OASIS TCs.)
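The namespace-based rule described in this comment could be sketched like this. The set of known RELAX NG element names below is deliberately partial, and the dispatch logic is only an illustration of the idea, not anything taken from the RELAX NG spec:

```python
RNG_NS = "http://relaxng.org/ns/structure/1.0"

# Deliberately partial list of RELAX NG pattern elements, for illustration
KNOWN_RNG = {"grammar", "start", "element", "attribute", "text", "choice"}

def classify(tag):
    """Given a Clark-notation tag ({namespace}local), decide how to react.
    An unknown element *inside* the core namespace suggests a newer version
    of the language, so all bets are off: abort. Unknown elements in any
    other namespace are safely ignorable annotations."""
    if tag.startswith("{" + RNG_NS + "}"):
        local = tag.split("}", 1)[1]
        return "process" if local in KNOWN_RNG else "must-understand: abort"
    return "must-ignore"

print(classify("{%s}element" % RNG_NS))          # process
print(classify("{%s}futurePattern" % RNG_NS))    # must-understand: abort
print(classify("{http://example.org/doc}note"))  # must-ignore
```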
Thanks for the comment, John. I agree with what you write, but only in reference to a specific processing context. The elements that an application must understand to complete a business transaction are not necessarily the same elements that an application must understand to index a document for search, or to format it as PDF.
mU/mI are a way to tunnel evaluators through XML. It’s like obfuscated code-sharing 🙂
Atom is different insofar as the best thing to do with it is squint your eyes and try and see the serialization of a dictionary. It’s by-design additive, so it makes no sense (imo) to hobble it with mumi directives – indeed it would be nuts – imagine throwing a runtime exception because a hashmap has keys you can’t dispatch on. Atom drags the XML crowd about halfway to RDF.
Actually, there are two different ways to “ignore” an element as well, to complicate matters further … Either you can ignore the element and the contents of that element (as you do in your example), or you can ignore the element and treat its child elements as belonging to the parent element (as in HTML).
Or actually, there’s a third possibility: just treat the element as an element with no semantics (this differs from method #2 in how selectors on the unknown element’s children behave).
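The three readings of “ignore” in the comment above could be sketched like this; the tag names are invented, with w playing the unknown element:

```python
import xml.etree.ElementTree as ET

def visible_children(xml_text, strategy):
    """Three readings of "ignore <w>":
    1: drop <w> and its whole subtree
    2: splice <w>'s children into the parent (roughly what HTML
       parsers do with unknown tags)
    3: keep <w> in the tree, but as a wrapper with no semantics"""
    out = []
    for child in ET.fromstring(xml_text):
        if child.tag != "w":
            out.append(child.tag)
        elif strategy == 2:
            out.extend(c.tag for c in child)  # children now belong to the parent
        elif strategy == 3:
            out.append(child.tag)             # still present, just meaningless
        # strategy 1: skip the element and its subtree entirely
    return out

doc = "<record><a/><w><x/><y/></w><b/></record>"
print(visible_children(doc, 1))  # ['a', 'b']
print(visible_children(doc, 2))  # ['a', 'x', 'y', 'b']
print(visible_children(doc, 3))  # ['a', 'w', 'b']
```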