Thinking about structure

Douglas Crockford left an excellent comment on my recent posting All markup ends up looking like XML, which he later made into its own blog posting, For the trees. I agree with his reworking of the structure: given the data that I provided, the JSON, LISP, and XML markup all could have been simpler.

If he’s right about the examples, though, he’s wrong about two things. First, my posting doesn’t represent any kind of softening to JSON among its opponents in the XML community, simply because I’ve never been one of those opponents. Second, I spend at least one order of magnitude more time working with SQL and programming languages (not processing XML) than I do with XML, so if anything, my perspective on XML would likely be tainted by them rather than the other way around. Instead, I think the examples were complicated because I built for tomorrow instead of today.

Tomorrow

So what might tomorrow look like for an application dealing with names? Consider, for example, this XML markup, moving gender out of the element/property name as Doug suggests, and eliminating the other attributes (since they don’t add much to the discussion):

<names>
  <name gender="male"><surname>Saddam</surname> Hussein</name>
  <name gender="female">Susan B. <surname>Anthony</surname></name>
  <name gender="male">Al <surname>Unser</surname> Jr.</name>

  <name gender="male">Don Alonso <surname>Quixote</surname>
    de la Mancha</name>
</names>

It’s surprisingly messy breaking each name down into a simple property list. If we tried the approach Doug used for my simpler examples, we’d end up with this (note that this is a list of names, not of people):

{"names": [
    {"gender": "male", "given-name": "Hussein", "surname": "Saddam"},
    {"gender": "female", "given-name": "Susan B.", "surname": "Anthony"},
    {"gender": "male", "given-name": "Al Jr.", "surname": "Unser"},
    {"gender": "male", "given-name": "Don Alonso Quixote de la",
      "surname": "Mancha"}
]}

This list needs a bit of patching. First, if we reconstruct the names as strings, we don’t want to end up with “Hussein Saddam” instead of “Saddam Hussein”, so we’ll have to add a property specifying whether the surname comes first or last:

{"gender": "male", "given-name": "Hussein", "surname": "Saddam",
  "surname-after-given-name": false}

Great — that’s all we need to fix that, and now we know to print “Saddam Hussein”. Now, let’s look at Susan — there’s no problem recreating the string “Susan B. Anthony” from these properties, but we probably should rename the property given-name to given-names, just to avoid confusion:

{"gender": "female", "given-names": "Susan B.", "surname": "Anthony",
  "surname-after-given-names": true}

Al Unser Jr. is a bit trickier, because there was no obvious place to put the “Jr.”. Strictly speaking, it’s neither a given name nor a surname, so for now, let’s just call it a postfix (although that assumes a physical position that might not apply to all languages):

{"gender": "male", "given-names": "Al", "surname": "Unser",
  "surname-after-given-names": true, "postfix": "Jr."}

Don Quixote, however, forces us to reconsider some of our assumptions, because “Don” is not a given name but an honorific. Assuming that we don’t care whether it’s a name or an honorific, let’s just call it a prefix for now, to go with postfix:

{"gender": "male", "prefix": "Don", "given-names": "Alonso",
  "surname": "Quixote", "surname-after-given-names": true,
  "postfix": "de la Mancha"}

Finally, just to throw a wrench into things, let’s assume that our list might contain things other than names, so that we need to add a type property:

{"type": "name", "gender": "male", "prefix": "Don",
  "given-names": "Alonso", "surname": "Quixote",
  "surname-after-given-names": true, "postfix": "de la Mancha"}

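To make the cost of the flat property list concrete, here is a minimal sketch of the code a consumer would need just to print the names again. The function name format_name, the use of "given-names" for every record, and the default of given-names-first when the flag is missing are my assumptions for illustration, not part of any agreed format.

```python
def format_name(record):
    """Rebuild a display string from the flat property list."""
    given = record.get("given-names", "")
    surname = record.get("surname", "")
    # Assumption: when the flag is absent, the surname follows the given names.
    if record.get("surname-after-given-names", True):
        core = [given, surname]
    else:
        core = [surname, given]
    # "prefix" and "postfix" hard-code a physical position, as noted in the text.
    parts = [record.get("prefix", ""), *core, record.get("postfix", "")]
    return " ".join(part for part in parts if part)


names = [
    {"type": "name", "gender": "male", "given-names": "Hussein",
     "surname": "Saddam", "surname-after-given-names": False},
    {"type": "name", "gender": "female", "given-names": "Susan B.",
     "surname": "Anthony", "surname-after-given-names": True},
    {"type": "name", "gender": "male", "given-names": "Al",
     "surname": "Unser", "surname-after-given-names": True, "postfix": "Jr."},
    {"type": "name", "gender": "male", "prefix": "Don",
     "given-names": "Alonso", "surname": "Quixote",
     "surname-after-given-names": True, "postfix": "de la Mancha"},
]

formatted = [format_name(record) for record in names]
for text in formatted:
    print(text)
```

Every new kind of name part means another property and another branch in a function like this one.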
Granted, that sort-of works, but it’s really not very nice, and it’s extremely brittle: for example, there are names with extra words in the middle (such as “de”) that are properly not part of the given names or the surname. Then again, why overtag it? Perhaps we don’t need to know what’s a given name or honorific, as long as we can distinguish the surname. One possibility is simply to break it down into four properties:


{"type": "name", "gender": "male", "presurname": "Don Alonso",
  "surname": "Quixote", "postsurname": "de la Mancha"}

I’m a big fan of Agile development in principle, but I’ve worked on enough broken legacy systems to leave a little wiggle room for future requirements, like, say, a need to isolate the primary given name for a mail merge or index, even if we’re not going to isolate it right now. Fortunately JSON, like XML, has a natural ability to represent ordered information much more elegantly, so let’s make the name into an ordered array:

{"type": "name", "gender": "male",
  "value": ["Don Alonso", {"type": "surname", "value": "Quixote"}, "de la Mancha"]}

This approach provides us with almost limitless flexibility (for example, if we start isolating honorifics, we can deal with a language where the honorific comes at the end of the name with no extra trouble), and is just as simple and easy to read as the much less flexible presurname/postsurname approach. Building for today is great, but if you have a choice between two roughly equivalent approaches where one provides an easy future upgrade path and the other doesn’t, which is the better choice? JSON is new enough that the JSON community hasn’t yet had to deal much with the life cycle of information. Once enough people have built apps relying on specific JSON formats, it will be very, very hard to make any changes: v.2 of any popular data format generally results in enormous costs (in money and goodwill), and v.3 rarely happens.
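As a rough sketch of what that upgrade path could look like in practice, the code below reads the ordered-array form. The helper names display_name and parts_of_type, the "honorific" and "given-name" part types, and the "refined" record are hypothetical: they stand in for a later revision of the data, not for anything defined above.

```python
def display_name(record):
    """Join the ordered parts, whether plain strings or typed objects."""
    parts = [part if isinstance(part, str) else part["value"]
             for part in record["value"]]
    return " ".join(parts)


def parts_of_type(record, part_type):
    """Return the values of all parts tagged with the given type."""
    return [part["value"] for part in record["value"]
            if isinstance(part, dict) and part.get("type") == part_type]


# Today's data: only the surname is isolated.
quixote = {"type": "name", "gender": "male",
           "value": ["Don Alonso",
                     {"type": "surname", "value": "Quixote"},
                     "de la Mancha"]}

# A hypothetical later revision that also isolates the honorific and given name.
refined = {"type": "name", "gender": "male",
           "value": [{"type": "honorific", "value": "Don"},
                     {"type": "given-name", "value": "Alonso"},
                     {"type": "surname", "value": "Quixote"},
                     "de la Mancha"]}

print(display_name(quixote))
print(display_name(refined))
print(parts_of_type(refined, "given-name"))
```

The same two functions handle both records, so older consumers keep working when the data becomes more finely tagged.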

Some people might prefer to shorten the above example a bit by following a simple convention: the first member of each array is a label, the second is a map with properties describing the rest of the array, and the remainder is the value, where order may be significant:

["name", {"gender": "male"},
  "Don Alonso", ["surname", {}, "Quixote"],  "de la Mancha"]

That is trickier to dump straight into a data structure or database table, but it’s a much more natural way to represent the information, and a lot easier to read on the screen. And just in case it doesn’t look familiar, compare:

<name gender="male">Don Alonso <surname>Quixote</surname>
  de la Mancha</name>
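The correspondence is close enough to be mechanical. Here is a small sketch that turns the label/properties/children array into XML with Python’s standard xml.etree.ElementTree module. The function name to_element is my own, and I have made the spaces around the surname explicit in the strings, which the array example leaves implicit.

```python
import xml.etree.ElementTree as ET


def to_element(node):
    """Convert [label, properties, *children] into an ElementTree element."""
    label, attrs, *children = node
    elem = ET.Element(label, attrs)
    last = None  # most recently appended child element, if any
    for child in children:
        if isinstance(child, str):
            # Text before the first child element is the element's text;
            # text after a child element is that child's tail.
            if last is None:
                elem.text = (elem.text or "") + child
            else:
                last.tail = (last.tail or "") + child
        else:
            last = to_element(child)
            elem.append(last)
    return elem


# Assumption: whitespace is written out explicitly in the string members.
name = ["name", {"gender": "male"},
        "Don Alonso ", ["surname", {}, "Quixote"], " de la Mancha"]

xml_text = ET.tostring(to_element(name), encoding="unicode")
print(xml_text)
```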

If your information isn’t this complicated, JSON, XML, or LISP can be simple, as Doug pointed out — the XML could just as easily be


<name gender="male" presurname="Don Alonso" surname="Quixote"
  postsurname="de la Mancha"/>

The reason you don’t see that much is not because XML people never thought of it — read the xml-dev archives from ten years ago to read megabytes of discussion — but because it kept breaking in production systems as soon as the customer (or users) thought of a new requirement. When the information gets complicated, as I pointed out, there’s a bit of a tendency for all markup to end up looking like XML; when the information is simple, of course, XML can just as easily look like JSON or LISP.
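To see how little separates the two simple forms, this sketch loads the attribute-only XML and the flat presurname/postsurname JSON into Python dictionaries and compares them. Mapping the element name onto the "type" property is my assumption. It is one plausible convention, not a standard.

```python
import json
import xml.etree.ElementTree as ET

xml_text = ('<name gender="male" presurname="Don Alonso" surname="Quixote"\n'
            '  postsurname="de la Mancha"/>')
json_text = ('{"type": "name", "gender": "male", "presurname": "Don Alonso",\n'
             '  "surname": "Quixote", "postsurname": "de la Mancha"}')

elem = ET.fromstring(xml_text)
# Assumption: the element name plays the role of the JSON "type" property.
from_xml = {"type": elem.tag, **elem.attrib}
from_json = json.loads(json_text)

same = from_xml == from_json
print(same)
```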

10 Responses to Thinking about structure

John Cowan says:

January 29, 2007 at 3:18 am

Yeah, really what you need is “full name” (display name) and “sort name”. In general, you get the right answer most of the time if you mark up the primary sort key of the name, thus:

[name]John [sort]Cowan[/sort][/name]

[name][sort]Szilard[/sort] Leo[/name]

[name][sort]Mao[/sort] Zedong[/name]

[name][sort]Sukarno[/sort][/name]

[name]Vicente [sort]Fox[/sort] Quesada[/name]

[name]Alexis de [sort]Tocqueville[/sort][/name]

[name]Charles [sort]de Gaulle[/sort][/name]

[name]Henry [sort]Ford[/sort] II[/name]

[name][sort]Elizabeth II[/sort][/name]
Pingback: Martins Notepad » Blog Archive » Markups raison d’être
Martin says:

January 29, 2007 at 4:33 am

John: but there you let application knowledge slip into the data structure. Didn’t we all agree that is a bad idea? And you didn’t even specify secondary and tertiary sort keys, and maybe the collation to use …
Jonathan Buchanan says:

January 29, 2007 at 5:08 am

Have you thought about how the JSON formats you’re proposing would actually be used?

e.g. the proposed “surname-after-given-names” property would never, ever be necessary, as you’d be doing something like this when actually using the data:

person["given-name"] + " " + person.surname

It seems to me like you’re thinking too much about representing the data in a way which is somehow equivalent with the XML, and not enough about how the data structures defined in your JSON (“That is trickier to dump straight into a data structure” – JSON *is* your data structure) will actually be used – in which case, an object for each name with properties for each section of the name really is a natural and intuitive way to represent the information.

In this case, all you’d really need to do would be to add prefix, and postfix properties to the format Douglas Crockford proposed in his blog post.
david says:

January 29, 2007 at 8:09 am

Thanks, Jonathan. My main point — to both the JSON and XML communities — is that if the information structure is simple, the XML markup or JSON representation can both be simple, and if the information structure is complex, the XML markup or JSON representation will both end up being more complex.

It’s misleading to argue that JSON is better because it’s simpler, or that XML is better because it’s more expressive, because there’s practically no difference on either count. It just happens that XML is generally used in situations where the information is more elaborate than a simple data structure, or, more importantly, in situations where you cannot constrain how the formats are actually used (one person might index the information, one might publish it, one might build a web app around it, etc.). JSON can work there too, but when it does, it will end up looking a lot like the XML. For the simple, RPC-like data structures that are currently JSON’s domain, XML can just as easily look a lot like JSON. I’ve made some changes to the posting along these lines.
Dave Newton says:

January 29, 2007 at 9:04 am

Why not just have a sequence that defines the order of the names? This way it can be customised on a per-object basis if necessary:

{"names!": [
  {"name-order": ["honorific", "given-name", "middle-name", "surname"],
   "given-name": "Anna",
   "middle-name": "Maria",
   "surname": "Mozart",
   "honorific": "Dr."}]}

While I generally shudder at mixing semantics with data, I’m not so sure I care in this case.
Qea says:

January 29, 2007 at 11:48 am

(names
(Saddam (Hussein) male)
(Al (Unser) Jr. male)
(Don Alonso (Quixote) de la Mancha male))
David Carver says:

January 29, 2007 at 3:26 pm

The whole JSON/XML debate is less scary to me than seeing the above markup out in the real world. The sad thing here is that too much web-service XML ends up looking like the above because of the reliance on code generators instead of data architects to construct the XML. The prevalence of the above markup is still scary for systems that need expandability.
pwb says:

March 3, 2007 at 1:09 am

The problem with XML is that it encourages unnecessary data complexity. At the end of the day, data ends up needing to be representable in 2D and 3D so it’s not wise to design such complex structures. The biggest databases in the world still use flat files after all.
Len Bullard says:

March 12, 2007 at 7:07 am

Actually you will be better off separating your 2D and 3D data for reasons that don’t become apparent without trying that. There is a reason that 3D languages provide 2D layers. Flattening data at the source doesn’t change that. Like David, I spent more time working with SQL when working on production systems and grew to appreciate clean comma-delimited ASCII unless I wasn’t the generating source.

That said, David gets it exactly right: the expressiveness of the structure has to match the complexity of the information, something that can be expressed in path metrics, not to be described here. JSON is a low cost alternative. That’s fine. One thing years of markup experience taught me was to recant the wall-to-wall markup advocacy. It’s wrong. Can it all be wrangled in under the InfoSet? To be demonstrated by someone with more time and need to do that.