REST design question #2: listing and discovering resources

The second in my series of REST design questions is how to handle listing and paging, or, in fancier jargon, resource discovery. I prefer concrete examples, so I’ll start with one that I know is flawed and then try to find ways to fix it.

Let’s say that I have a large collection of XML data records with URLs like http://www.example.org/airports/cyow.xml and http://www.example.org/airports/cyyz.xml, and so on. Since they all share the same prefix, it would be reasonable to assume that performing an HTTP GET operation on that prefix (http://www.example.org/airports/) would return a list of links to all of the data records (though I acknowledge that URLs are opaque and no one should rely on that, etc. etc.):

<airport-listing xmlns:xlink="http://www.w3.org/1999/xlink" xml:base="http://www.example.org/airports/">
  <airport-ref xlink:href="cyow.xml"/>
  <airport-ref xlink:href="cyyz.xml"/>
  <airport-ref xlink:href="cyxu.xml"/>
  ...
</airport-listing>

This is a wonderfully RESTful example, since it shows how (say) a search-engine spider could eventually find and index every XML resource. However, anyone who’s ever worked on a large, production-grade system can see that there’s a huge scalability problem here (I’m leaving out other possible issues like privacy and security). For a listing of a few dozen resources, this is a great approach. For a listing of a few hundred, it’s manageable. A listing of a few thousand resources will start to consume serious bandwidth every time someone GETs it, and a listing of a few million resources is simply ridiculous.

HTML-based web applications designed for humans typically employ a combination of querying and paging to deal with discovering resources from a large collection. For example, I might start by specifying that I’m interested only in airports with instrument approaches within 500 nautical miles of Toronto; then the application will return a single page of results (say, the first 20 matches), with a link to let me see the next page if I’m interested.

How would this work for a REST-based data application? Clearly, we want to use GET rather than POST requests, since pure queries are side-effect free, so presumably, I’d end up adding some request parameters to limit the results:

http://www.example.org/airports/?ref-point=cyyz&radius=500nm&has-iap=yes

That’s certainly not the kind of pretty REST URL that we see in the examples, but it does look a lot like the ones used in Amazon’s REST web services, so perhaps I’m on the right track. Of course, there will have to be some way for systems to know what the available request parameters are. Now, perhaps, the result will look something like this (assuming 20 results to the page):

<airport-listing xmlns:xlink="http://www.w3.org/1999/xlink"
    xml:base="http://www.example.org/airports/?ref-point=cyyz&amp;radius=500nm&amp;has-iap=yes">
  <airport-ref xlink:href="cyow.xml"/>
  <airport-ref xlink:href="cyyz.xml"/>
  <airport-ref xlink:href="cyxu.xml"/>
  ...
  <next-page-link xlink:href="http://www.example.org/airports/?ref-point=cyyz&amp;radius=500nm&amp;has-iap=yes&amp;start=21"/>
</airport-listing>

As far as I understand, this is good REST, because the XML resource contains its own transition information (i.e. a link to the next page). However, this is pretty unbelievably ugly. Presumably, the same kind of paging could work on the entire collection when there are no query parameters, so that

http://www.example.org/airports/

or

http://www.example.org/airports/?start=1

would return the first 20 airport references, followed by a link to http://www.example.org/airports/?start=21, which would return the next 20 entries, and so on. The potential power of REST and XLink together is clear: it is still possible to start at a single URL with a simple crawler (sketched below) and discover all of the available resources automatically, and unlike WS-*, I managed it without having to deal with extra, cumbersome specs like UDDI and WSDL. Still, this looks a bit like an ugly solution to me. I’ll look forward to hearing if anyone can come up with something more elegant.
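To make the crawling claim concrete, here is a rough sketch of such a crawler in Python, using only the standard library. The element names and the XLink namespace come from the listings above; everything else (the function name, the lack of error handling, the assumption that each page carries at most one next-page-link) is my own invention for illustration, not part of any spec:

import xml.etree.ElementTree as ET
from urllib.parse import urljoin
from urllib.request import urlopen

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

def crawl_airports(start_url):
    """Follow next-page-link elements until the whole collection has been seen."""
    refs = []
    page_url = start_url
    while page_url:
        root = ET.parse(urlopen(page_url)).getroot()
        # xml:base, if present, takes precedence over the retrieval URL
        base = root.get(XML_BASE, page_url)
        for ref in root.findall("airport-ref"):
            refs.append(urljoin(base, ref.get(XLINK_HREF)))
        next_link = root.find("next-page-link")
        page_url = urljoin(base, next_link.get(XLINK_HREF)) if next_link is not None else None
    return refs

# e.g. crawl_airports("http://www.example.org/airports/") would eventually
# return the URL of every airport record, twenty references per request.

The same loop works whether or not the starting URL carries query parameters, which is really the whole point.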


The best Firefox extension

Anchor in Firefox.

The Firefox browser has a lot of well-loved extensions like AdBlock and ImageZoom (especially useful for looking at weather maps online), but my personal favourite is a little-known one called Show Anchors.

Anyone writing for the web — and especially a blogger — needs to link to web pages a lot. Often, the web pages contain anchors that would let us link to the exact spot we need rather than to the top of a long document, but unless you can grab them from a table of contents or you are willing to spend a while reading through View Source, those anchors are pretty hard to find. For example, here is a screenshot of Firefox viewing the W3C’s XML Recommendation (click on the thumbnail for full size):




The page is full of anchors, but you cannot see them. With the Show Anchors extension in Firefox, I simply right-click on the browser window, select Show Anchors from the pop-up menu, and the display in Firefox changes (again, click on the thumbnail for full size):



Inside the Firefox window, clicking on one of the anchor icons copies a full URL, with fragment identifier, to the clipboard. It’s a real timesaver for writing weblog entries.


xml:lang is an accessibility issue

Charl van Niekerk has an interesting posting on a topic that should have been more obvious to me: that the xml:lang attribute (and HTML’s lang) is critical for making online information accessible to the visually impaired. Voice synthesizers that read documents aloud need to know what language they’re reading, and it wouldn’t take much effort for us to tell them.

Obviously, this is a less critical issue for data-oriented XML, but even then, XML data often contains large chunks of prose (like product descriptions) that are, eventually, intended for human consumption. I won’t promise to rush and fix all of this today in my existing XML and HTML, but I’m certainly going to try harder in the future.
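For what it’s worth, here is a toy sketch (mine, not Charl’s) of what a consumer such as a speech pipeline has to do with that information: xml:lang inherits down the element tree, so each run of text gets the language of the nearest ancestor that declares one. The element names and sample text are invented for illustration:

import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # i.e. xml:lang

def spoken_runs(element, inherited="en"):
    """Yield (language, text) pairs, letting xml:lang inherit down the tree."""
    lang = element.get(XML_LANG, inherited)
    if element.text and element.text.strip():
        yield (lang, element.text.strip())
    for child in element:
        yield from spoken_runs(child, lang)
        if child.tail and child.tail.strip():
            yield (lang, child.tail.strip())  # tail text belongs to the parent's scope

doc = ET.fromstring(
    '<description xml:lang="en">A bilingual sign: '
    '<phrase xml:lang="fr">Arrêt</phrase> / Stop.</description>')

for lang, text in spoken_runs(doc):
    print(lang, text)
# en A bilingual sign:
# fr Arrêt
# en / Stop.

A synthesizer that gets the fr/en distinction right will pronounce “Arrêt” properly instead of mangling it; without the attributes, it can only guess.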


REST design question #1: identification

My first REST design question is about the fact that RESTafarians seem to consider identification and location to be the same thing, and following from that, the question of how to make identification persistent in XML resources. For example, assume that http://www.example.org/airports/ca/cyow.xml is both the unique identifier of an XML data object and the location of that object on the web. That’s the whole point of REST, really. RESTafarians don’t like interfaces where identifiers are hidden inside XML objects returned from POST requests to unrelated URLs, for example (in fact, they get angry in quite an amusing way).

GET and PUT

So, here’s a simple use case. Let’s say that I download the XML data file at http://www.example.org/airports/ca/cyow.xml and it looks like this simple example:

<airport>
 <icao>CYOW</icao>
 <name>Macdonald-Cartier International Airport</name>
 <political>
  <municipality>Ottawa</municipality>
  <region>ON</region>
  <country>CA</country>
 </political>
 <geodetic>
  <latitude-deg>45.322</latitude-deg>
  <longitude-deg>-75.669167</longitude-deg>
  <elevation-msl-m>114</elevation-msl-m>
 </geodetic>
</airport>

I then copy it onto a USB memory stick, bring it home from work, copy it onto my notebook computer, and work on it while offline during a business flight. The file no longer has any direct connection with its URL: it has gone through other transfers since the HTTP GET request I used to download it. How do I know what I’m working on or where I should PUT it when I’m done?

If this information has to be kept out of line, then some of REST’s advantages evaporate, because now I have to start using custom-designed clients again instead of simply piggybacking on existing web technologies. As an identifier, the URL is clearly part of the resource’s state, and belongs in the XML data file; as a location, however, it is superfluous information and belongs only at the protocol (HTTP) level.
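To make the annoyance concrete, here is a sketch of the kind of out-of-band bookkeeping such a custom client has to do when the URL lives only at the protocol level: a sidecar file that remembers where the document came from so that it can be PUT back later. The file-naming convention and helper names are invented for illustration, and the sidecar has to survive every USB-stick and notebook hop along with the data file, which is exactly the fragility described above:

from pathlib import Path
from urllib.request import Request, urlopen

def fetch(url, path):
    """GET the resource and remember its source URL in a sidecar file."""
    Path(path).write_bytes(urlopen(url).read())
    Path(path + ".source-url").write_text(url)

def put_back(path):
    """PUT the (possibly edited) file back to wherever it came from."""
    url = Path(path + ".source-url").read_text().strip()
    urlopen(Request(url, data=Path(path).read_bytes(),
                    headers={"Content-Type": "application/xml"},
                    method="PUT"))

# fetch("http://www.example.org/airports/ca/cyow.xml", "cyow.xml")
# ... edit cyow.xml offline, carry both files around ...
# put_back("cyow.xml")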

Where does the document identifier go?

Let’s assume that I get over my squeamishness and decide that the URL is a proper identifier and belongs in the XML representation. Now, how do I do that in a fairly generic way? xml:id is out of the question, since it’s designed only to hold an XML name for identifying part of a document, not a URL to identify an entire document. I could use (or abuse) xml:base, like this:

<airport xml:base="http://www.example.org/airports/ca/cyow.xml">
 ...
</airport>

I’m not certain, though, how XLink processors would deal with that. Would the relative URL “cyyz.xml” end up being resolved to http://www.example.org/airports/ca/cyyz.xml or http://www.example.org/airports/ca/cyow.xmlcyyz.xml? There’s also the possibility that some highly-cooked APIs might predigest the xml:base attribute so that application code never sees it. Do the XML standards people believe this kind of xml:base usage is legit?
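For what it’s worth, plain RFC 3986 reference resolution (which xml:base builds on) gives the first answer: the last path segment of the base URL is dropped before the relative reference is merged in. Here is a quick check with Python’s standard library, which is only a sketch of the generic resolution rules, not a guarantee about what any particular XLink processor will do:

from urllib.parse import urljoin

base = "http://www.example.org/airports/ca/cyow.xml"
print(urljoin(base, "cyyz.xml"))
# http://www.example.org/airports/ca/cyyz.xml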

If xml:id is unusable, and xml:base is problematic, it looks like there might be no standard way to identify RESTful XML documents, and each XML document type will need its own ad-hoc solution. Any suggestions? Does the world need one more xml:* attribute (I hope not)?

I’d be interested in hearing how REST developers have dealt with identifier persistence and round-tripping when the identifier is the URL.


REST design questions

[Update: fifth and final question added] I’ve been thinking a bit about REST recently while working on a new data-oriented application. REST in its now-broadened meaning is easy to explain: pieces of data (likely XML-encoded) sit out there on the web, and you manipulate them using HTTP’s GET, PUT, and DELETE methods (practically CRUD, except that the Create and Update parts are combined into PUT). Try explaining SOAP, much less the essence of the whole WS-* family, in one easy sentence like that, and you’ll see the difference.
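To show how little machinery that model needs, here is a minimal sketch in Python against the airport record URL used elsewhere in these postings; it assumes a hypothetical server that really does accept PUT and DELETE, and the “corrected elevation” edit is just a stand-in for whatever change a client might make:

from urllib.request import Request, urlopen

URL = "http://www.example.org/airports/ca/cyow.xml"  # hypothetical resource

# Read: GET the current representation
current = urlopen(URL).read()

# Create/Update: PUT a new representation back to the same URL
updated = current.replace(b"<elevation-msl-m>114<", b"<elevation-msl-m>115<")
urlopen(Request(URL, data=updated,
                headers={"Content-Type": "application/xml"},
                method="PUT"))

# Delete: remove the resource entirely
urlopen(Request(URL, method="DELETE"))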

This very simplicity should raise some alarm bells, though. RDF also has an apparently simple data model, but for RDF 1.0, at least, the model turned out to be painfully incomplete, as I found out when I implemented my RDF parsing library. Is REST hiding any of the same traps? RESTafarians point out that REST is the basis of the Web’s success, but that’s really only the GET part (and its cousin, POST). Despite WebDAV, we have very little experience using PUT and DELETE even for regular web pages, much less for maintaining a data repository. Even the much-touted RESTful web services from Amazon and eBay are GET-only (or GET and POST, in eBay’s case); in fact, many, if not most, firewalls come preconfigured to block PUT and DELETE, since web admins see them mainly as security holes.

My gut feeling is that REST is, in fact, more manageable than XML-RPC or WS-* for XML on the Web, but that we have a lot of issues we’ll need to work out first. Data management is never really simple, and while WS-* makes it harder than it has to be, even the simplest REST model cannot make it trivial. I’m going to post some of my own questions about REST design from time to time in this weblog, as I think of them, and I’ll look forward to hearing from people who have already dealt with or at least thought about these problems on their own.

Here are my questions so far:


Open Web, Closed Databases?

Web site developers seem to be getting open specifications: more and more, I’m seeing sites developed for specifications like (X)HTML, CSS2, DOM, etc., not sites developed for applications like MSIE or Firefox or Opera; I’m seeing Java-based web apps that work with any J2EE-enabled web server, instead of apps that work only with Tomcat or WebSphere or WebLogic; and so on.

After all this, then, I’m surprised to see how many open source web apps specifically require MySQL rather than just “a SQL database.” MySQL is a fine database, of course, but here we have an open specification, SQL, that’s been around far longer than most of the web specs, and many open source developers are choosing to lock themselves into a single database anyway.

I wonder what gives. I don’t have a lot of experience with PHP, which is the platform for many simpler web apps (including WordPress, which drives this weblog, though it offers an alternative) — is there no generic SQL database interface for PHP, or do the developers just not care? Are there serious performance issues using generic database interfaces? Or are my observations not representative, and in fact most open source web app developers do avoid locking themselves in to MySQL?


Rumours of xml:id trouble in the W3C


[Updated: see below] Norman Walsh has just posted an unusual essay. The gist of it seems to be that the W3C has decided, at some level, to modify the xml:id specification (released only days ago as a Candidate Recommendation, as I mentioned here): some other specification (not named) has a bug, most likely an incorrect closed enumeration of all the possible attributes in the XML namespace, and rather than upset the people who messed up that spec, the W3C will rename the attribute to the unqualified xmlid.

Norm sounds mad, and I don’t blame him. I remember when I was on the original XML working group and we were ordered from above to rewrite the XML Namespaces spec substantially for extremely questionable reasons (mainly the ability to embed XML inside non-XML HTML documents for v3 browsers — seriously).

[Update: Norm has revised the essay, adding enough extra information to let us figure out the problem — it has to do with the interaction between XML Canonicalization (C14N) and xml: attributes, where C14N mistakenly assumes that all xml: attributes should be automatically inherited. Here’s the official request to deal with the issue.

I had already mentioned the incompatibility with C14N in my first posting on xml:id, then forgot about it completely when reading Norm’s essay. So far, this is just a dispute, not a final decision.]


Hub URLs and feudalism in the blogsphere

Web pages, and especially weblogs, include apparently unnecessary links all the time. For example, is there really any need to link to Microsoft every time I mention the company’s name? Is anyone reading this posting going to follow the link (and if so, would that person have had trouble finding the site otherwise)?

Hub and Spoke

The best term I can think of to describe these links is hub URLs. They’re very much like airport hubs — connections from many smaller places feed into them, and often the only way to get from one small place to another is by passing through the hub as an intermediate point: for example, if I link to Microsoft and you link to Microsoft, someone can trace a route from my web page to your web page by changing planes, so to speak, at the Microsoft hub. One way to make the trip is to put http://www.microsoft.com/ into Technorati or a similar search engine that can supply ongoing results in an RSS or Atom feed, then read the postings that congregate around this hub URL in the blogsphere. The weblog postings are not linking to Microsoft so that you can find Microsoft; they’re linking to Microsoft so that you can find them. The nature of a hub URL is that the spoke web sites need it more than it needs any one of the spokes.

To take a less hackneyed example, here is a Technorati RSS feed of all weblog postings that link to Roy Fielding’s famous dissertation on web architecture. Granted, that’s not a very active hub URL, but still, all of the postings that link there form a community of interest, and a RESTafarian will almost certainly want to subscribe to such a feed. I expect that, more and more, the blogsphere will start grouping itself around hub URLs at least as much as it groups itself around individual personalities today.

Travel agencies

So far, so good. Search engines, the travel agencies of the web and blogsphere, already know how to take advantage of these hub URLs, as in the Technorati example I just cited above. Unofficial rumour has it that Google, for example (there I go again with a hub URL), makes great use of hub URLs for determining the relevance of search results. In fact, the whole push towards tags and folksonomies by sites like Technorati, Flickr, and del.icio.us is really an attempt to set up their own hub URLs.

In Technorati’s case, the travel agent wants not only to plan trips but to own the airport hub itself: that’s why they’re encouraging bloggers to link to the tags section of their site, making URLs like http://www.technorati.com/tags/web into hub URLs that are entirely under their control; it does not seem likely that their competitors will go along with that idea, though.

Castles and Boroughs

One problem is that the most popular URLs might end up becoming not only hubs but castles. Castles are cute tourist attractions today, associated mainly with pseudo-medieval romantic kitsch like knights and tournaments, but in the Middle Ages they were often instruments of oppression. While free landowners may originally have congregated around them for protection, they often lost their freedom (either by choice or coercion) and became feudal serfs, little more than the property of the powerful thugs who controlled the castles. If we start building our weblogs and sites in clusters around powerful hub URLs the way that free peasants built their huts around castles, are we risking the same fate?

Castles don’t show up automatically whenever people congregate together, of course. The alternative is the borough. Most of us in the developed world crowd together in suburbs, towns or cities, the ideological descendants of the boroughs, so that we can share services like water, electricity, roads, and shopping. While we have to make some compromises to live in close proximity, we do not have to give up fundamental freedoms the way that serfs around a castle did. The reason for that is that most economically-advanced countries have cities that are governed democratically rather than by a single strongman like a feudal lord; even in the Western European Middle Ages, boroughs enjoyed many freedoms and privileges, and were at least partly self-governing. So, getting back to the blogsphere, the question is this: do we want our hub URLs to be more like castles or boroughs?

This is an important question, because it is not farfetched to suggest that the owners of the most popular hub URLs could eventually start limiting the rights of the sites or blog entries linking to them. The entertainment industry has already had great success shutting down Bittorrent trackers, which simply link to files rather than actually hosting them; several courts have issued rulings against deep linking, like this one in Munich in 2002. Even when specific rulings are later overturned, it should be clear that linking is not off limits for legal action, and it is not impossible to imagine a future where someone has to agree to restrictive terms of service or even pay for the right to link to a popular hub URL like a Technorati tag or the Microsoft web site.

Wikipedia-boro

I have already suggested that Wikipedia would be a good source of subject codes, and in essence, that means using Wikipedia URLs as hub URLs. Wikipedia is not the only choice, of course, but it seems to be a particularly good one for a few reasons:

  1. it is a collaborative site where anyone can add new potential hub URLs and modify the information in the pages they point to
  2. our rights to use it now and in the future are guaranteed by the GNU Free Documentation License (though to be strictly pedantic, that applies to the content rather than the URL itself)
  3. linking to the Wikipedia is more likely to give you a fair description of a subject than linking to the subject’s own website (think of the difference between a politician’s own web site and the Wikipedia article on the politician, and you’ll see what I mean)

If enough people start linking to Wikipedia articles in their weblog postings, topic-based RSS or Atom feeds will become very easy: for example, Feedster will happily give you an RSS feed of weblog postings linking to the current U.S. President Bush, or a feed of postings that explicitly link to the country Canada. Presumably, these postings treat these topics as major subjects rather than just mentioning them in passing, so the contents of the search feeds should be highly relevant (imagine how many false hits you’d get from mailing addresses, etc., just searching for the word “Canada”).


L10N out of control

[Update: a mitigating factor] Localization (L10N) is a good thing in general: people like to see the languages, punctuation, and systems of measure that they’re used to. So, hats off to Google’s new beta map service for putting most of the street names in Ottawa’s west end in French.

The only trouble is that the street names are actually English — we have Carling St, Holland Ave, and Island Park Drive, not Rue Carling, Avenue Parkdale, or Promenade Island Park.

What went wrong? My guess is that Google (or their data provider) uses a vector map for L10N, either in real time or (more likely) pregenerated. Ottawa is right on the Quebec border, and the streets might have been misidentified as located in Quebec because the map doesn’t have enough resolution to follow the bends in the provincial border. [Update: to be fair, I should mention that some streets in the west end of Ottawa do have bilingual signs that say both rue and St., for example — since we’re the capital of a bilingual country, the city tries to set an example.]

Overall, Google’s mapping service is very impressive, especially for a beta, and this particular glitch is more funny than disruptive. I’m grateful that they even included Canadian cities in the first release.


xml:id

Anne van Kesteren’s posting is the first report to reach me that the W3C’s xml:id spec has just moved up the food chain to Candidate Recommendation. I’m usually one of the first people to whine about too many XML-related specs, but I think this is a good one, despite a few minor problems like an incompatibility with XML Canonicalization.

Why does this matter? Any use of XML over the web that requires DTD or schema processing is broken because of all the extra security and availability risks involved in processing external files, especially when they’re hosted at other sites. The xml:id spec gives a quick and dirty way of identifying parts of an XML document without requiring a schema, DTD, or even a namespace declaration (since the xml: prefix is predeclared for XML documents). Basically, you just use something like this inside your XML document:

<employee xml:id="dmeg123">
 <name>David Megginson</name>
 <role>Housekeeping</role>
</employee>

and you’re done. Other XML documents can refer to part of yours using a fragment identifier, as in http://www.example.org/employees.xml#dmeg123, and that’s that — no schemas are harmed in the making of this link. I don’t know if XML data on the web ever will take off, but this small spec is a critical step in the right direction. Congrats to the editors and the working group for pushing it through this far.
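As a sanity check that the consuming side really is that cheap, here is a sketch of resolving a fragment identifier like #dmeg123 with nothing but a vanilla XML parser. It is deliberately naive: a real xml:id processor would also check that each value is a legal NCName and unique within the document, which this sketch skips:

import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # i.e. xml:id

doc = ET.fromstring("""
<employee xml:id="dmeg123">
 <name>David Megginson</name>
 <role>Housekeeping</role>
</employee>
""")

def find_by_xml_id(root, fragment):
    """Return the first element whose xml:id matches the fragment identifier."""
    for element in root.iter():
        if element.get(XML_ID) == fragment:
            return element
    return None

print(find_by_xml_id(doc, "dmeg123").tag)  # employee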

If only we could make everything in XML this simple.
