Comments on: CSV linked data

By: David Megginson

David Megginson — Fri, 21 Feb 2014 16:21:16 +0000

In reply to John Cowan. :-)

By: John Cowan

John Cowan — Fri, 21 Feb 2014 14:10:56 +0000

You know, I was going to point the person who wrote "CSV is a bugger for distributed, linked data" as part of a comment on Tim Bray's latest to this page for an example of how to do it — and then I saw that it was you!

By: John Cowan

John Cowan — Fri, 12 Apr 2013 20:26:08 +0000

In reply to Bob DuCharme. Bob, David's idea is stronger than you think. In fact, if the features of the Hausenblas draft are adopted, his scheme is completely equivalent to RDF triples, except that there's no way to specify b-nodes, xml:lang, or XML Schema typed literals. Each cell represents a triple: the subject is the URI of the row (using the #row:n or #where:name-value fragment identifiers), the verb is the URI of the column (using the #col:name or #col:n fragment identifiers), and the object is the content of the cell, which can be a literal or a URI. To add that power, I'd recommend switching to the Hausenblas draft wholesale, as it neatly distinguishes the different kinds of fragment identifiers that are needed, for interchange with conventional triple stores. What is more, this scheme applies not only to CSV documents but to relational database tables, provided you have a method for a URI to specify a table within a database. If the RDBMS provides typed columns, specifically a URI type, then you don't even need the double braces.
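A minimal Python sketch of the cell-to-triple mapping described above. The base URI, the sample data, and the function name csv_to_triples are hypothetical; numbering data rows from 1 and excluding the header row is also an assumption, and the draft may count rows differently.

```python
import csv
import io


def csv_to_triples(base_uri, text):
    """Yield one (subject, predicate, object) tuple per cell of a CSV document."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)  # the first line supplies the column names
    for n, row in enumerate(reader, start=1):
        subject = f"{base_uri}#row:{n}"  # URI of the row
        for name, value in zip(header, row):
            predicate = f"{base_uri}#col:{name}"  # URI of the column
            yield (subject, predicate, value)  # cell content: literal or URI


# Hypothetical sample data, not taken from the original post.
SAMPLE = "code,name\nCA,Canada\nUS,United States\n"
triples = list(csv_to_triples("http://example.org/countries.csv", SAMPLE))
for triple in triples:
    print(triple)
```

Every object comes out as a plain string here, which matches the limitation noted above: nothing in the cell says whether it is a typed literal, a language-tagged string, or a blank node.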

By: John Cowan

John Cowan — Fri, 12 Apr 2013 19:03:45 +0000

In reply to David Megginson. The slice-based selection syntax #where:name=value is definitely better, though more verbose.
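As a rough illustration, a client could resolve a #where:name=value fragment along these lines. This is a sketch under assumptions, not the draft's algorithm: the function name select_rows and the sample data are hypothetical, and only a single equality test is handled.

```python
import csv
import io


def select_rows(text, fragment):
    """Return the rows (as dicts) whose column `name` equals `value`.

    `fragment` is the part after '#', e.g. 'where:employee_id=12345'.
    """
    prefix = "where:"
    if not fragment.startswith(prefix):
        raise ValueError(f"unsupported fragment: {fragment!r}")
    name, sep, value = fragment[len(prefix):].partition("=")
    if not sep:
        raise ValueError(f"expected name=value in fragment: {fragment!r}")
    reader = csv.DictReader(io.StringIO(text))
    # Plain string comparison; no type coercion is attempted.
    return [row for row in reader if row.get(name) == value]


# Hypothetical sample data.
SAMPLE = "employee_id,name\n12345,Alice\n67890,Bob\n"
matches = select_rows(SAMPLE, "where:employee_id=12345")
print(matches)
```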

By: John Cowan

John Cowan — Fri, 12 Apr 2013 18:56:03 +0000

In reply to Remy Porter. Indeed, a link is the simplest case of a query.

By: mera

mera — Tue, 09 Apr 2013 10:32:43 +0000

Presently there a few interesting points over time in this posting however I don’t know if I view these center to heart. There can be many validity but I’ll take hold viewpoint until I discover it even more. Excellent write-up, thanks and after that we really wish for much more! Placed into FeedBurner way too

By: David Megginson

David Megginson — Thu, 31 Jan 2013 13:04:22 +0000

In reply to David Megginson. Markus is absolutely right that I misread the draft -- it does allow selection of *rows* based on multiple *column* values.

By: Markus Lanthaler

Markus Lanthaler — Tue, 22 Jan 2013 22:41:24 +0000

In reply to David Megginson.

Isn’t that exactly what they call “slice-based selection” (http://tools.ietf.org/html/draft-hausenblas-csv-fragment-01#section-2.5)?

By: David Megginson

David Megginson — Sat, 12 Jan 2013 22:12:14 +0000

In reply to Remy Porter. Thanks, Remy; that's an interesting point. Yes, fragment identifiers really are queries, though they still play a primary role in REST when you're dealing with machine-readable data, since you're actually identifying a target within a resource, rather than just prescrolling to a position on an HTML page.

By: Remy Porter

Remy Porter — Sat, 12 Jan 2013 13:47:23 +0000

You haven’t created a system for linking CSV files. You’ve created a system for querying CSV files via RESTful operations. The only queries supported are equality comparisons, although there’s no reason that there couldn’t be more operations, aside from the fact that you’d lose some simplicity and really, you’re not trying to reinvent SQL.

You can use these queries to link data together because they’re simple text and can easily be stored via CSV.

That said, this RESTful query language you’ve built could just as easily be migrated to JSON or XML if one makes certain assumptions about the tree of those hierarchical data types. It could also query RDBMS data, or really anything.

By: David Megginson

David Megginson — Wed, 09 Jan 2013 01:59:45 +0000

In reply to Bob DuCharme.

Those are good points, Bob, but whether RDF has succeeded or failed is a matter of perspective (and as with all such topics, is ultimately a futile debate). Without asserting that I have a monopoly possession of “truth” (whatever that means), I’ll just state my own contrasting point of view here.

From my perspective, RDF is, at best, to a (hypothetical) successful linked-data implementation what HyperCard was to the Web. HyperCard helped people get used to the idea of hypertext and hypermedia, and it had many more features than the Web (at least for the Web’s first decade or so), but it existed only in a very constrained environment.

Of course, there were much more-powerful (though less-popular) hypermedia systems than HyperCard. I remember reading the proceedings of the annual Hypertext conferences from 1988 or 1989 before I knew about Tim B-L’s work. It was fascinating stuff, and the systems presented addressed many of the ontological and epistemological problems that RDF, OWL, etc. try to address for linked data. These systems had seemingly huge success inside academia during my graduate-school years from 86-92, along with some marquee non-academic implementations in corporate and other environments, but when Tim B-L introduced stupid, brain-dead, unidirectional, untyped, single-linking HTML, it took off so fast that what had previously been considered “success” for all the other hypertext systems became basically a rounding error.

I don’t expect that scale of adoption for any linked-data initiative, because the potential audience is much smaller — the potential audience for the web is anyone who can read (or now, watch videos or listen to music), while the most-optimistic potential direct audience for linked data is anyone who can use a spreadsheet (i.e. millions, not billions). But the lesson holds — start simple, and don’t try to solve problems that people don’t actually have yet. Unfortunately, I think Tim B-L forgot that lesson with the Semantic Web, and in a kind-of reverse Stockholm Syndrome, became like the hypertext academicians who were fighting and mocking him so much when he first launched HTML.

By: David Megginson

David Megginson — Wed, 09 Jan 2013 01:46:37 +0000

In reply to Markus Lanthaler. Thanks, Markus. I hadn't read it, and was very excited about the possibility of using it, but I think it's fundamentally flawed — it lets me select a column by header name, but does not let me select a row by a cell's contents. Since CSV data generally represents each entity as a row, I'm assuming the most common use case will be to point to a specific row (e.g. the row where employee_id="12345"), and the RFC doesn't support that. I will update the blog post to include a link.

By: Markus Lanthaler

Markus Lanthaler — Tue, 08 Jan 2013 14:59:31 +0000

Are you aware of http://tools.ietf.org/html/draft-hausenblas-csv-fragment-01 ? No need to reinvent the wheel 😛

By: Bob DuCharme

Bob DuCharme — Tue, 08 Jan 2013 14:50:50 +0000

Hi David,

The idea that all links mean the same thing (a pointer to a more-complete and more-authoritative version of a piece of information) is very limiting. Even HTML can do better than that, because the anchor text can tell you something about why the link is there (e.g. earlier version of document, proposed revision [...]

> it could bring 95% of the benefit of linked data to CSV for 5% of the effort.

95% is a very high number. The indication of link targets is only one part of linked data. Along with the ability to describe the relationships themselves, which as noted above you’ve dropped, the ability to unambiguously identify resources is key to letting anyone in the world create a link between any two resources. How do I know that “CA” in your examples doesn’t refer to California or Computer Associates? Because it has “Canada” on the same line? The country or the Richard Ford novel? If it said http://dbpedia.org/resource/Canada or http://dbpedia.org/resource/Canada_%28novel%29, I could be sure. If it was the former, I (or an automated query using a well-implemented, standardized query language) could follow links from there to http://www4.wiwiss.fu-berlin.de/factbook/resource/Canada and find CIA World Factbook data about the country. (factbook:airports_withpavedrunways_total 509, factbook:airports_withunpavedrunways_total 828. I thought you’d like those.) The fact that the CSV file has “countries” in its name won’t scale very far; “hierarchy” and “subdivisions” are both pretty vague.

RDF is more complex and lets you do more. That’s the tradeoff with just about any technology choice outside of RELAX NG vs. XSD. I don’t want to retread any arguments about whether each aspect of the complexity is worth the effort, but many, many companies and governments are getting great value from it. XLink failed, and Topic Maps failed, but RDF is churning along very nicely, and with several good reasons.

Bob

By: Martin Davis

Martin Davis — Mon, 07 Jan 2013 22:24:08 +0000

In reply to David Megginson. Ah, good point about enhancing "crawlability". I was thinking only of usage by humans, which is a narrower viewpoint. And (as you suggested by your mention of RDF) the bigger picture is to help link data into the "Semantic Web" (if, when, and how it ever comes to pass).

By: David Megginson

David Megginson — Mon, 07 Jan 2013 22:11:32 +0000

In reply to Martin Davis.

Martin: thanks for introducing two good points into the discussion.

1. Syntax: I’m still on the fence about whether an explicit syntax is necessary or not. You’re right that an unadorned URL would probably work most of the time. I just need to make sure that there aren’t rare-but-important cases that might bite us. Is it important to capture the user’s intention that this is supposed to be a link to more data (rather than just a URL)?

2. Use cases: I’m starting with two major classes of use cases: (1) humans browsing data the way they browse the web (say, to discover data to download for further analysis), and (2) machines crawling or pulling in data as needed. On the (mostly-unstructured-content/document) web, we know that crawling is useful for building search indexes. What else might crawling be useful for on a (structured-content/data) web? Hierarchical information springs to mind: for example, an index of major aid donors, each with a link to a list of their projects, and a link to the country/region where they’re taking place. In an ideal world, there would also be links to *other* donors’ projects.

By: Kingsley Uyi Idehen (@kidehen)

Kingsley Uyi Idehen (@kidehen) — Mon, 07 Jan 2013 20:06:17 +0000

Hence the following:

1. http://dbpedia.org/page/Linked_Data : note the footer, which has a link to a CSV representation of the entity description

2. http://bit.ly/SboANR : why Linked Data integration into Google Spreadsheet and Excel is trivial, thanks to the SPARQL Protocol and the ability to hook in a CSV representation of output (be it the tree-based query results or entity description graphs)

3. http://www.slideshare.net/kidehen/accessing-linked-open-data-sources-via-virtuoso-odbc/6 : excerpt from a presentation showing the path from basic CSV all the way to Linked Data (presented in tabular form).

By: Martin Davis

Martin Davis — Mon, 07 Jan 2013 19:44:21 +0000

Does it even need the double-curly-bracket syntax? Can’t the convention be simply a value which starts with a recognizable URI prefix (e.g. http:// or file://)? The descriptive text can simply be delimited by the first blank char.

And in order to make the utility of this convention fully apparent, can you talk a bit about how clients (“user-agents” in web-speak) would make use of this information? It seems to me that one reason that linked data on the web is so useful is that it corresponds to a very useful client action – that of displaying further web content, or else carrying out some action on a remote server. In order to show this isn’t just cargo-cultism some examples would be useful.
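The prefix convention suggested in the first paragraph of this comment could be sketched roughly as follows. The function name parse_cell and the sample values are hypothetical, and including https:// in the prefix list is an assumption beyond the two prefixes named above.

```python
# Prefixes that mark a cell as a link; https:// is an added assumption.
URI_PREFIXES = ("http://", "https://", "file://")


def parse_cell(cell):
    """Split a cell into (uri, description).

    Returns (None, cell) when the value does not start with a known prefix.
    The description is whatever follows the first blank character.
    """
    text = cell.strip()
    if not text.startswith(URI_PREFIXES):
        return (None, cell)
    uri, _, description = text.partition(" ")
    return (uri, description.strip())


print(parse_cell("http://example.org/data/people People list"))
print(parse_cell("Canada"))
```

One consequence of this convention is that a cell can no longer hold a bare URL as ordinary data; that seems to be the kind of rare-but-important case an explicit syntax would avoid.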

By: David Megginson

David Megginson — Mon, 07 Jan 2013 02:23:24 +0000

Paul: thanks — I’ve corrected the id.

Michael: also thanks — agreed about the fragility of URLs (a problem faced by all online resources, tightly- or loosely-structured), but I have not yet seen a widely-implemented solution to it. There’s no reason the URLs actually have to be file names, of course; they could be something like http://example.org/data/people (with or without content negotiation).

By: Michael Sokolov

Michael Sokolov — Mon, 07 Jan 2013 02:18:19 +0000

The syntax seems workable. The interpretation doesn’t make any sense to me. When I saw this my first thought was that it provides a mechanism for representing relational data in CSV. Nobody gets tied up in knots about what many-one relationships mean in SQL, or if they do, it’s probably a waste of time. Containment, ownership, parentage, whatever. Authority? I suppose, if you like.

I think it’s a problem that the filename is critical, but files tend to be copied, moved around, renamed, versioned (get dates attached to them).