REST design question #4: how much normalization?

[Update: why this has to do with REST] Here is the fourth in a series of REST design questions: how much should the XML data files returned by a REST web application be normalized into separate XML files? For example, if an application is returning information about the film Sixteen Candles, should it try to put most of the relevant information into a single XML file, like this?

<film>
  <title>Sixteen Candles</title>
  <director>John Hughes</director>
  <year>1984</year>
  <production-companies>
    <company>Channel Pictures</company>
    <company>Universal Pictures</company>
  </production-companies>
</film>

Or should it link to separate XML documents containing information about people, companies, and so on, like this?

<film xml:base="http://www.example.org/objects/014002.xml" xmlns:xlink="http://www.w3.org/1999/xlink">
  <title>Sixteen Candles</title>
  <director xlink:href="487847.xml"/>
  <year>1984</year>
  <production-companies>
    <company xlink:href="559366.xml"/>
    <company xlink:href="039548.xml"/>
  </production-companies>
</film>

(Of course, you can take this a lot further, making the relationships themselves, like isDirectorOf, into separate XML files, but this is enough to give a good flavour.)

Presumably, the REST server is creating the XML information from a relational database that is normalized, so the regular arguments about maintainability, etc. are not an issue. Still, each example has its disadvantages:

  • In the first example, the client application cannot be certain whether two separate records refer to the same director or production company, or to different ones that happen to share a name. It will also be hard for the server to handle a PUT request to update the (normalized) database.
  • In the second example, the client application will have to make a ridiculous number of GET requests to assemble enough information for even the most basic application, like a cast list: complete information about cast, crew, and locations, even for a single movie, will likely involve retrieving hundreds or thousands of tiny XML files.
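
To make the cost of the second example concrete, here is a rough sketch of what a client would have to do before it could even display the director's name. It is written in Python with only the standard library, which is my choice for illustration and not something the original project used; the function name linked_urls is hypothetical, and for simplicity it honours only the xml:base on the root element.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# ElementTree exposes namespaced attributes in {namespace-uri}local form.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

# The fully normalized document from the second example.
NORMALIZED = """\
<film xml:base="http://www.example.org/objects/014002.xml" xmlns:xlink="http://www.w3.org/1999/xlink">
  <title>Sixteen Candles</title>
  <director xlink:href="487847.xml"/>
  <year>1984</year>
  <production-companies>
    <company xlink:href="559366.xml"/>
    <company xlink:href="039548.xml"/>
  </production-companies>
</film>
"""


def linked_urls(xml_text):
    """Return the absolute URL of every xlink:href, in document order."""
    root = ET.fromstring(xml_text)
    # Simplifying assumption: only the root element carries xml:base.
    base = root.get(XML_BASE, "")
    return [
        urljoin(base, el.get(XLINK_HREF))
        for el in root.iter()
        if XLINK_HREF in el.attrib
    ]


urls = linked_urls(NORMALIZED)
print(len(urls), "extra GET requests before any names can be shown:")
for url in urls:
    print(" ", url)
```

Even this tiny record needs three follow-up requests; a full cast and crew list would multiply that quickly.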

Would imitating HTML be the best compromise? HTML links (the a element) typically include both a reference to an external resource and a short, local description of the resource at the other end of the link (i.e. the blue, underlined text). There is no reason that XML data files in a REST application cannot do the same thing, combining the advantages of the normalized and unnormalized approaches, as in this example:

<film xml:base="http://www.example.org/objects/014002.xml" xmlns:xlink="http://www.w3.org/1999/xlink">
  <title>Sixteen Candles</title>
  <director xlink:href="487847.xml">John Hughes</director>
  <year>1984</year>
  <production-companies>
    <company xlink:href="559366.xml">Channel Pictures</company>
    <company xlink:href="039548.xml">Universal Pictures</company>
  </production-companies>
</film>

Now, a simple REST client application does not need to retrieve extra data files simply to find the name of the director or production company, but it still knows where to look for more complete information. It can also use the link URLs as identifiers for disambiguating people, companies, and so on. The approach will also be familiar to web developers, the ones who will eventually decide whether to use REST for data retrieval.
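
As an illustration only, a client for the labeled-link version might look something like the following Python sketch (standard library only; the language and the function name labeled_links are my assumptions, not part of any real project). It reads each label locally, with no further requests, and resolves each link against the root xml:base so that the absolute URL can serve as an identifier.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# ElementTree exposes namespaced attributes in {namespace-uri}local form.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

# The labeled-link document from the third example.
HYBRID = """\
<film xml:base="http://www.example.org/objects/014002.xml" xmlns:xlink="http://www.w3.org/1999/xlink">
  <title>Sixteen Candles</title>
  <director xlink:href="487847.xml">John Hughes</director>
  <year>1984</year>
  <production-companies>
    <company xlink:href="559366.xml">Channel Pictures</company>
    <company xlink:href="039548.xml">Universal Pictures</company>
  </production-companies>
</film>
"""


def labeled_links(xml_text):
    """Return (element name, local label, absolute URL) for each link."""
    root = ET.fromstring(xml_text)
    # Simplifying assumption: only the root element carries xml:base.
    base = root.get(XML_BASE, "")
    links = []
    for el in root.iter():
        href = el.get(XLINK_HREF)
        if href is not None:
            label = (el.text or "").strip()
            links.append((el.tag, label, urljoin(base, href)))
    return links


for name, label, url in labeled_links(HYBRID):
    # The label is for display; the URL is the identity to compare on.
    print(f"{name}: {label} <{url}>")
```

Two records that both say "John Hughes" can then be compared by URL rather than by name.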

Now, what about a REST application that supports not only GET but PUT? What should it do when someone tries to check in this document? I’d suggest that any information under an element with an xlink:href attribute should be considered non-canonical and ignored during the checkin — you don’t want to rename John Hughes on the basis of the description of one of his films — and that the label information inside the link be autogenerated at the next GET (presumably from the resource at http://www.example.org/objects/487847.xml).
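
Here is one way the server side of that suggestion could be sketched, again in Python with the standard library (my choice for illustration). The function strip_link_labels is hypothetical: it simply discards everything underneath any element that has an xlink:href attribute, so that only the link itself and the unlinked fields remain to be written to the database.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes namespaced attributes in {namespace-uri}local form.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# A PUT body in which the client has (wrongly) edited the director's label.
PUT_BODY = """\
<film xml:base="http://www.example.org/objects/014002.xml" xmlns:xlink="http://www.w3.org/1999/xlink">
  <title>Sixteen Candles</title>
  <director xlink:href="487847.xml">Jon Hughs</director>
  <year>1984</year>
  <production-companies>
    <company xlink:href="559366.xml">Channel Pictures</company>
    <company xlink:href="039548.xml">Universal Pictures</company>
  </production-companies>
</film>
"""


def strip_link_labels(xml_text):
    """Parse a submitted document and drop the non-canonical link labels."""
    root = ET.fromstring(xml_text)
    # Snapshot the elements first, since children are removed in the loop.
    for el in list(root.iter()):
        if XLINK_HREF in el.attrib:
            el.text = None            # discard the text label
            for child in list(el):    # discard any nested label markup
                el.remove(child)
    return root


film = strip_link_labels(PUT_BODY)
print(film.findtext("title"), film.findtext("year"))
print(film.find("director").get(XLINK_HREF), film.find("director").text)
```

The misspelled label never reaches the database; the link survives, and the next GET would regenerate the label from the linked resource.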

This particular design question comes from personal experience during the late 1990s — the project involved moving precisely this kind of information in very large quantities to eCommerce customers. In that case, PUT was not an issue, since the customers did not have write access to the provider’s database.

(Josh Sled quite reasonably asks what this question has to do specifically with REST. The main selling point of REST is linking resources together, so I believe that figuring out when to link and when to embed will be critical to making REST-based applications work. Josh also mentions RDF. The project I mentioned actually was trying to use RDF [first the 1.0 WD, then the REC]; unfortunately, RDF makes an example like my third one difficult, since in 1.0 at least, a property had to have either a link or content, but not both; you end up having to create a new, inline resource for every link, which is messy. I’m not too familiar with the newer RDF version, so I don’t know if they’ve fixed that by allowing labeled links.)

About David Megginson

Scholar, tech guy, Canuck, open-source/data/information zealot, urban pedestrian, language geek, tea drinker, pater familias, red tory, amateur musician, private pilot.

3 Responses to REST design question #4: how much normalization?

  3. Danny says:

    I think the RDF version does force you to be more explicit about the nature of the label (and really it uses the URI as an identifier; there’s no inbuilt notion of linking), but that doesn’t make it much more messy:

    <director rdf:about="http://www.example.org/objects/014002.xml/487847.xml" xxx:name="John Hughes" />

    There is quite a bit of information being conveyed here, the direct statements being:

    487847.xml rdf:type yyy:director
    487847.xml xxx:name "John Hughes"

    but also by inference:
    487847.xml rdf:type rdfs:Resource
    yyy:director rdf:type rdfs:Class
    xxx:name rdf:type rdf:Property
    etc.
