The XML 2006 site is now pickled and preserved for long-term storage. Almost all of the presenters got their papers or slides in for the proceedings, if not on time, at least in time. Unfortunately, if you want to see a paper or slides from one of the few who didn’t send us anything, you’ll now have to pester them directly.
Recipe for pickling a web site
The original site was a hand-rolled LAMP implementation, but it was designed from the start to be amenable to a static copy. To pickle it, I started by doing a recursive slurp of the live site using wget (with the -m option), which generated permanent, static HTML copies of the dynamic, database-driven pages on the site. At that point, I had an almost, but not quite, perfect static copy of the site, because there were two things that wget missed:
- Images referred to only in CSS stylesheets (such as the banner).
- CSS stylesheets referred to by other CSS stylesheets.
It took only a few minutes to add all of that by hand, and the site was ready to go.
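For anyone repeating the recipe, the two gaps above could be found automatically rather than by hand. The following is a minimal Python sketch, not what was actually done for this site: the function names and the example.org stylesheet URL are hypothetical, and the regular expressions only cover the common url(...) and @import "..." forms, not every corner of CSS syntax.

```python
import re
from urllib.parse import urljoin

# Matches url(...) references, with or without quotes around the target.
URL_RE = re.compile(r"""url\(\s*['"]?([^'")\s]+)['"]?\s*\)""", re.IGNORECASE)
# Matches the quoted @import "file.css" form that does not use url().
IMPORT_RE = re.compile(r"""@import\s+['"]([^'"]+)['"]""", re.IGNORECASE)


def css_references(css_text):
    """Return the sorted, de-duplicated resources a stylesheet refers to."""
    refs = set(URL_RE.findall(css_text)) | set(IMPORT_RE.findall(css_text))
    # data: URIs are embedded in the stylesheet, so there is nothing to fetch.
    return sorted(r for r in refs if not r.startswith("data:"))


def absolute_references(css_url, css_text):
    """Resolve each reference against the stylesheet's own URL."""
    return sorted({urljoin(css_url, ref) for ref in css_references(css_text)})


if __name__ == "__main__":
    sample = (
        '@import "print.css";\n'
        "body { background: url(../images/banner.png); }\n"
        '@import url("base.css");\n'
    )
    # Hypothetical stylesheet location, used only to resolve relative paths.
    for url in absolute_references("http://example.org/css/main.css", sample):
        print(url)
```

Each URL printed is a candidate to fetch and drop into the mirror alongside the pages that wget already saved.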
Why it worked
This will be old news to a lot of people reading, but a few simple advance steps (during site design) made later static preservation easy. Here’s what I did:
- Every page has its own URL, period, end of discussion. No AJAX, no POST.
- Every page (or at least, every page that we want to archive) is reachable, directly or indirectly, from the home page.
- Script names are not shown to the public, so there are no URLs ending in “php” (hint: exposed script extensions like “php”, “asp”, or “jsp” are signs of gross incompetence in web design).
- No web pages rely on exposed GET request parameters: for example, the URLs looked like /programme/presentations/123.html, not /programme/presentation?code=123 or, even worse, /show-presentation.php?code=123.
And that’s it. Of course, if the site had included live forms, I would have had to remove those as well (and any links to them), but that wouldn’t have been much extra work.
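The URL rules in the list above are easy to check mechanically. Here is a rough Python sketch of such an audit; the function name and the set of extensions are assumptions made for illustration (the extensions are just the three named above), and it only looks at the URL itself, so it says nothing about reachability from the home page or about POST and AJAX.

```python
import posixpath
from urllib.parse import urlsplit

# The script extensions named in the list above; extend as needed.
SCRIPT_EXTENSIONS = {".php", ".asp", ".jsp"}


def archive_problems(url):
    """Return a list of reasons a URL would be awkward to pickle statically."""
    parts = urlsplit(url)
    problems = []
    if parts.query:
        # A static copy has no script to interpret ?code=123 and friends.
        problems.append("relies on GET parameters")
    ext = posixpath.splitext(parts.path)[1].lower()
    if ext in SCRIPT_EXTENSIONS:
        problems.append("exposes script extension " + ext)
    return problems


if __name__ == "__main__":
    for url in (
        "/programme/presentations/123.html",
        "/programme/presentation?code=123",
        "/show-presentation.php?code=123",
    ):
        print(url, archive_problems(url))
```

An empty list means the URL passes these two checks; the three sample URLs are the ones from the list above.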
On a final note, while the live site was hosted on an Apache server (the “A” in “LAMP”), the pickled site is hosted on a Microsoft IIS server. It made no difference at all — that’s the way Web standards are supposed to work.
An issue you don’t mention is preservation of the media type. A cooler implementation would use “/programme/presentations/123” without the .html extension leaving the choice of media type to content negotiation. However, that doesn’t work so well when the files are stored in most file systems which do not preserve media type.
I assume that, in practice, you required a reliable mapping between filename/URL extension and media type.
Thanks, Ed. Content negotiation sounded cool back in the 1990s when Tim B-L and others at the W3C were pushing it so hard, but outside of their own site, I haven’t seen it much (if at all) in the wild. I wonder if it’s one abstraction too far.
Relying simply on well-known file extensions (html, png, jpg, pdf) for media identification worked very well in this case, with one exception, which I didn’t notice until after I made the posting: I was relying on the web server to send out the right character encoding (I had used .htaccess in Apache), so I’ve had to send a note to IDEAlliance’s ISP asking all files to be served out as UTF-8. Right now, IIS is sending them simply as ‘Content-type: text/html’ with no encoding specified. Firefox guesses UTF-8 correctly, but MSIE doesn’t. That should be fixed early next week. IIS also sends out stupid caching headers, but I’ve also requested that those be fixed.
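A check like the following would have caught the missing encoding before the posting went out. It is only a sketch in Python: the function names are made up for illustration, and it inspects a Content-Type value passed in as a string rather than fetching anything from the server.

```python
from email.message import Message


def _parse(content_type):
    """Wrap a raw Content-Type value so the stdlib can parse its parameters."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg


def declared_charset(content_type):
    """Return the charset named in the header (lower-cased), or None."""
    return _parse(content_type).get_content_charset()


def needs_charset_fix(content_type):
    """True for text/* responses that leave the encoding for the browser to guess."""
    msg = _parse(content_type)
    return msg.get_content_maintype() == "text" and msg.get_content_charset() is None


if __name__ == "__main__":
    for header in ("text/html", "text/html; charset=UTF-8", "image/png"):
        print(header, declared_charset(header), needs_charset_fix(header))
```

Running it over the Content-Type of each page in the pickled site would flag the bare text/html responses described above.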
What’s your rationale behind “Script names are not shown to the public, so there are no URLs ending in “php” (hint: exposed script extensions like “php”, “asp”, or “jsp” are signs of gross incompetence in web design).”?
Deepak:
Exposing scripting extensions leads to a huge range of problems for web sites:
- The extension exposes both the scripting environment being used and the architecture of the site (e.g. get-bank-account.asp?account=12345), providing valuable information to any would-be cracker.
- The scripting extension makes long-term maintenance of the site very difficult, since you will have to either break all existing links and bookmarks or add a complicated, high-maintenance set of redirects as the site’s architecture or web framework changes over the years (even something as simple as splitting one script into two or vice-versa could break thousands of external links and bookmarks).
- Search engines cannot index pages that rely on POST parameters, and often won’t index pages that rely on GET parameters, so you’re damaging the site’s search-engine placement.
- (Less important) scripting extensions look tacky and amateurish, and reflect especially badly on big companies like banks and merchants who rely on online trust (if they don’t know enough to hide the script extensions, do they really know enough to protect my credit card number?).
For a more detailed and thoughtful discussion of this topic, see Tim Berners-Lee’s famous paper, Cool URLs don’t change — there are some real-world examples at the end of sites bombing miserably from not following the advice.
Hi David, just found your blog and I like it…
I agree that ideally you shouldn’t show {.asp,.jsp,.php} extensions… Yet many of the planet’s most successful websites (both in terms of availability and users) do use them (well, at least they don’t bother to hide these extensions).
Regarding the security concern, hiding the extension is just “security by obscurity”. Now I’m certainly not saying the website is *less* secure… but you’re not gaining much (some would say that with security by obscurity you’re gaining nothing).
Regarding the banks’ websites: I know that, out of security concerns, most of the banking industry is using Java and Java-backed webapp servers. I like that, for I know that there hasn’t been a single buffer overflow targeting any JVM since Java has existed (there have been exploits in third-party libraries written in C, like zlib, though). So when I’m on a website and I see “.jsp” I tend to think: it’s backed by Java, it’s probably not as insecure as a .asp or .php website.
But that may be just me.
Anyway, this is a false argument, for only a very small percentage of the population, actually an insignificant one, knows what it means when they see .php or .asp or .jsp etc.
I agree it’s better not to show the extensions, but I wouldn’t consider showing them to be such a huge problem.