Today, the US government’s data.gov temporarily went dark, and along with it, what is likely the world’s most important collection of open data sets:
The US government’s data.gov home page on 1 October 2013.
You are welcome to use this as a chance to rail against the juvenile hijinks in the US Congress, but I think there’s a far more important lesson: if you depend on any centralised data source, even one run by the world’s richest and most-powerful government, it can fail and leave you cut off.
Nuclear bombs and censorship
There is a proverbial story that ARPAnet, which later grew into the Internet, was designed to route around failed nodes so that it could keep functioning after a nuclear attack. Even if that story is not strictly true (the design had more to do with the unreliable networking hardware of the time), the actual networking layers of the Internet are highly failure-tolerant.
Information-freedom activist John Gilmore took that ARPAnet creation myth a step further, and argued that
The Net interprets censorship as damage and routes around it.
Just as the Internet (as a network) can route around damage, “The Net” (as a culture) can route around censorship using Internet-the-network as a tool. History has proven Mr. Gilmore right: the entertainment industry, for example, has entirely failed to control and restrict the distribution of movies and music online, and the US government — which could reduce dozens of other countries to ash with the push of a button — could do nothing to prevent the spread of the (unauthorized) WikiLeaks data.
Replication, not routing
The Internet consists of a large collection of specifications and standards that define and enable its ability to route around damage; there’s no similar set of standards for getting information around censorship barriers (whether related to intellectual property or restriction of basic human rights). So how does it work? Why can’t the music industry, for example, take a song offline once people have started sharing it? How does “The Net” route so-called “pirated” content around huge, angry corporations spending millions of dollars hiring lawyers and lobbying legislators?
The trick with content seems not to be routing, but replication. To survive online, a piece of information simply has to be copied faster than its opponents can take it offline. Because it’s possible to make perfect, lossless copies of digital content, it becomes irrelevant whether a copy is first generation or 10th generation. For example, if five people make copies of content, then five more people make copies of each of those copies, etc., by the 10th generation you have 5^10 (nearly 10 million) perfect copies spread out around the world, and with extremely popular content, that process can take place in minutes.
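To make that arithmetic concrete, here is a minimal Python sketch. The function names and the fixed fan-out of five copies per copy are illustrative assumptions; real sharing is far messier than a clean tree.

```python
def copies_in_generation(generation: int, fanout: int = 5) -> int:
    """Copies created in one generation if every copy is copied `fanout` times."""
    return fanout ** generation


def total_copies(generations: int, fanout: int = 5) -> int:
    """All copies made across generations 1..generations (the original is not counted)."""
    return sum(fanout ** g for g in range(1, generations + 1))


print(copies_in_generation(10))  # 9765625, the "nearly 10 million" figure
print(total_copies(10))          # every copy made along the way
```

The point is less the exact number than the shape of the curve: each generation multiplies the copies by the fan-out, so anyone trying to remove them falls behind very quickly.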
Could we rebuild data.gov?
It’s likely that most Americans won’t suffer any real harm from today’s shutdown of data.gov: open data is still in its infancy. However, if we in the open-data community realize our hopes and succeed at making open data a critical part of how the world works, then the next shutdown could be far more harmful. Companies that rely on open data might have to close their doors and furlough employees; emergency responders in the field might have trouble helping victims of a flood or earthquake; maps or navigation systems might stop working; and so on. The more-successful open data becomes, the higher the cost of having it fail.
It would not be a complete disaster, however. A lot of the open data on data.gov exists in copies elsewhere, and if the site were to disappear, we could probably find copies of individual datasets on hard drives scattered around the world, and reproduce most of the data that was on it on 30 September. It would take time, and we wouldn’t know if the data was corrupt or fraudulent, but in most cases, it would probably be OK. As the world moved further and further beyond 30 September 2013, we’d also have to figure out how to get new data from the departments, offices, and organisations who had previously centralised their datasets in data.gov.
Learning from the pirates …
How can we make this recovery process easier? Let’s imitate the people who have already solved this problem: the so-called content “pirates.” We expect centralized open-data sites like data.gov to be available all of the time; the pirates expect their sources to vanish at any moment. We expect data providers to help and encourage us to use their data; the pirates expect legal action trying to shut them down. We get funding; they get fines or even sometimes go to jail. Yet they flourish, while we’re vulnerable to any government’s or organisation’s internal financial squabbling.
The answer is to copy, copy, copy, and copy. Make copies of all the open data you can find, share your copies with as many people as you can, and keep the copies somewhere safe, just as you would with MP3s of your favourite artist. Open-data sites that discourage bulk downloading need to rethink their priorities, but if they don’t, find a way around any barriers that they throw up. We need 1,000 sites providing the data.gov data, spread around the world, some publicly-funded, and some private. In a sense, a litigious recording company and a government-funded open-data site present exactly the same risks to their users, and we have to learn not to trust the availability of any single site.
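On the consuming side, "don't trust any single site" can be as simple as trying several mirrors in turn. The following is only a sketch under stated assumptions: the function names are hypothetical, the mirror URLs are supplied by the caller, and the `fetch` parameter exists so the logic can be exercised without a network.

```python
import urllib.request
from typing import Callable, Iterable


def fetch_url(url: str, timeout: float = 10.0) -> bytes:
    """Download one URL with the standard library and return the raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_from_mirrors(
    mirrors: Iterable[str],
    fetch: Callable[[str], bytes] = fetch_url,
) -> bytes:
    """Return the dataset from the first mirror that answers.

    Network failures (URLError, HTTPError, and timeouts are all OSError
    subclasses) move us on to the next mirror instead of aborting.
    """
    errors = []
    for url in mirrors:
        try:
            return fetch(url)
        except OSError as exc:
            errors.append(f"{url}: {exc}")
    raise RuntimeError("all mirrors failed:\n" + "\n".join(errors))
```

A caller would pass its own list of mirror URLs, ideally hosted by different organisations in different countries, so that no single shutdown takes them all out at once.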
… but sailing under true colours
But still, we’re not pirates. Unlike content piracy, open-data sharing can stay out in the daylight. Ministers and heads of state support us, international organisations and foundations fund us, and the media praise us. That means that we have the opportunity to get together and come up with real standards or specs for keeping open data available, just as the ARPAnet founders did for network resiliency. These processes are hard, they will take time, and most ideas will go nowhere, but eventually, we could come up with something as useful as the collection of Internet standards and less-formal, ad-hoc conventions that allow you to get to this blog even around a broken router.
Working in the open also allows us to address issues of trust that are difficult to deal with in the piracy world. If you download an unauthorised copy of Microsoft Word, for example, how do you know that it’s authentic? Is it going to introduce malicious software onto your computer? If you download an unauthorized Disney movie, how do you know that it won’t suddenly flash a Goatse on the screen at minute 51?
In the open, we can talk about how to sign digital content and build a web of trust, so that you can rely on a US government dataset even if you loaded it from a Russian web site. We can talk about standardizing how open-data sites notify other systems about new or updated datasets (e.g. using RSS or Atom), so that sites can easily and automatically mirror one another. And we can talk about discarding, in our field, broken concepts like the Creative Commons attribution licenses, which actually discourage sharing and using open data.
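Neither idea needs exotic technology. As a rough sketch (not a proposed standard), the Python below shows two small building blocks: checking a mirrored file against a SHA-256 digest published by the original source, and reading an Atom feed to find entries updated since the last sync. The function names are hypothetical, and the code assumes the feed uses simple UTC timestamps such as `2013-10-01T12:00:00Z`. A digest check is weaker than a real digital signature, which would need public-key tools such as OpenPGP; it only helps if you obtained the digest from somewhere you already trust.

```python
import hashlib
import xml.etree.ElementTree as ET
from datetime import datetime
from typing import List, Tuple

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom 1.0 XML namespace


def matches_digest(data: bytes, expected_sha256_hex: str) -> bool:
    """True if the bytes we downloaded hash to the digest the publisher announced."""
    return hashlib.sha256(data).hexdigest() == expected_sha256_hex.strip().lower()


def updated_since(feed_xml: str, last_sync: datetime) -> List[Tuple[str, str]]:
    """Return (entry id, link href) for Atom entries updated after `last_sync`.

    `last_sync` must be timezone-aware, because Atom timestamps carry an offset.
    """
    root = ET.fromstring(feed_xml)
    changed = []
    for entry in root.iter(f"{ATOM}entry"):
        updated_text = entry.findtext(f"{ATOM}updated")
        link = entry.find(f"{ATOM}link")
        if updated_text is None or link is None:
            continue  # skip entries we cannot act on
        # Older Python versions do not accept a trailing "Z", so spell out the offset.
        updated = datetime.fromisoformat(updated_text.strip().replace("Z", "+00:00"))
        if updated > last_sync:
            changed.append((entry.findtext(f"{ATOM}id", default=""), link.get("href", "")))
    return changed
```

A mirror could run `updated_since` on a schedule, fetch whatever changed, and refuse to publish any file for which `matches_digest` returns false.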
If we get this stuff right, we’ll be ready the next time data.gov goes down, when open data really matters to the world. And maybe we’ll see the pirates starting to imitate us.