The XML 2006 site is now pickled and preserved for long-term storage. Almost all of the presenters got their papers or slides in for the proceedings, if not on time, at least in time. Unfortunately, if you want to see a paper or slides from one of the few who didn’t send us anything, you’ll now have to pester them directly.
Recipe for pickling a web site
The original site was a hand-rolled LAMP implementation, but it was designed from the start to be amenable to a static copy. To pickle it, I started by doing a recursive slurp of the live site using wget (with the
-m option), which generated permanent, static HTML copies of the dynamic, database-driven pages on the site. At that point, I had an almost, but not quite, perfect static copy of the site, because there were two things that wget missed:
- Images referred to only in CSS stylesheets (such as the banner).
- CSS stylesheets referred to by other CSS stylesheets.
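Both gaps come from the same limitation: wget parses HTML for links, but not stylesheets. A hypothetical stylesheet (not from the actual site) showing the two constructs an HTML-only crawl skips:

```css
/* Hypothetical stylesheet: both references below are invisible to an
   HTML-only crawler, so the files they name are never downloaded. */
@import url("print.css");  /* a stylesheet referred to by another stylesheet */

#banner {
    /* an image referred to only in CSS, not in any HTML page */
    background-image: url("../images/banner.png");
}
```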
It took only a few minutes to add all of that by hand, and the site was ready to go.
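The slurp step itself amounts to a single command. A minimal sketch, shown as a dry run, with a placeholder URL since the real 2006 address isn't reproduced here:

```shell
# Placeholder for the live site's base URL (not the real conference host).
SITE="http://conference.example.org/"

# -m (--mirror) expands to -r -N -l inf --no-remove-listing: recursive
# retrieval to unlimited depth, with timestamping, so each dynamic,
# database-driven page is written out as a permanent static HTML file.
MIRROR_CMD="wget -m $SITE"

# Dry run: print the command; drop the echo to actually mirror the site.
echo "$MIRROR_CMD"
```

In practice one might also add -k (--convert-links) so links in the saved pages point at the local copies rather than the live host.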
Why it worked
This will be old news to a lot of people reading, but a few simple advance steps (during site design) made later static preservation easy. Here’s what I did:
- Every page has its own URL, period, end of discussion. No AJAX, no POST.
- Every page (or at least, every page that we want to archive) is reachable, directly or indirectly, from the home page.
- Script names are not shown to the public, so there are no URLs ending in “.php” (hint: exposed script extensions like “.php”, “.asp”, or “.jsp” are signs of gross incompetence in web design).
- No web pages rely on exposed GET request parameters: had they done so, the URLs would have looked like /programme/presentation?code=123, or even worse, and wget could not have saved the results as ordinary static files.
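Hiding script names and query strings like that is typically a job for URL rewriting at the server. A sketch of what the Apache side might look like with mod_rewrite; the path and script names here are invented, not taken from the actual site:

```apache
# Hypothetical .htaccess rules; the paths and script names are invented.
RewriteEngine On

# Map a clean public URL like /programme/presentation/123
# onto the PHP script that actually renders the page.
RewriteRule ^programme/presentation/([0-9]+)$ programme/presentation.php?code=$1 [L]
```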
And that’s it. Of course, if the site had included live forms, I would have had to remove those as well (and any links to them), but that wouldn’t have been much extra work.
On a final note, while the live site was hosted on an Apache server (the “A” in “LAMP”), the pickled site is hosted on a Microsoft IIS server. It made no difference at all — that’s the way Web standards are supposed to work.