Sunday, 6 March 2011

The Missing Sixth Star of Open Linked Data?

In my previous posting I proposed the idea of the 5 stars of open access. There is of course one feature that the original "taxonomy" misses out completely - repositories! Not just "my favourite repository platform", but the idea of persistent, curated storage. Consequently, my proposal for open access doesn't mention repositories - a bit of an oversight!

At the moment, the entry level to the 5 stars is simply "put it on the web, with an open license". Perhaps we should change this to "put it in a repository with an open license"; perhaps we could designate a "zeroth star" for "just put it on the Web". However, the Linked Data Research Lab at DERI already propose a no-star level, which involves material being put on the web without an explicit license.

You can get away with putting material on the Web without any concern about its future safety - but not for long, especially if you want to build services on top of that material.

Services like CKAN (Comprehensive Knowledge Archive Network) are registries of open knowledge packages currently favoured by the open data community. The CKAN registry is built on a simple content management environment, and by November 2010 was already returning HTTP 400- and 500-class error codes for 9% of its listed data source URLs.
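To make that figure concrete, here is a minimal Python sketch of how such an error rate could be measured. It is not how the 9% figure was actually produced (the method behind that number isn't described here); the function names and the sample status codes are invented for illustration, and it assumes plain HTTP(S) URLs that answer HEAD requests.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses; keep the code.
        return err.code
    except (URLError, OSError, ValueError):
        # DNS failure, refused connection, timeout, malformed URL, ...
        return None


def is_error(status):
    """True for 400- and 500-class codes (unreachable hosts are not counted)."""
    return status is not None and 400 <= status < 600


def error_rate(statuses):
    """Fraction of statuses that are 4xx or 5xx; 0.0 for an empty list."""
    statuses = list(statuses)
    if not statuses:
        return 0.0
    return sum(is_error(s) for s in statuses) / len(statuses)


# Hypothetical status codes, as if collected with fetch_status().
sample = [200, 200, 404, 503, 200, 200, 200, 200, 200, 200]
print(f"{error_rate(sample):.0%} of sample URLs returned an error")
```

In practice you would feed `error_rate` the result of calling `fetch_status` on each URL listed in the registry. Hosts that never respond are returned as `None` and deliberately kept out of the 4xx/5xx count, so they would need to be reported separately.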

A more extreme example is seen in the UK, where police forces recently started to release data about crime reports. But "whenever a new set of data is uploaded, the previous set will be removed from public view, making comparisons impossible unless outside developers actively store it" (see The Guardian for more details).

Repositories have an opportunity to provide management, persistence and curation services to the open data community and its international collections of linked data. Whether our OA platforms are chosen (DSpace? EPrints? Fedora? Zentity?) is not the issue - it is the philosophy and practices of repositories that are vital to the Open Data community, because the data is important and long-lived.

On the other hand, I have argued that reuse (and in this case retention) are the enemy of access. "Just putting it up on the Web" is an easier injunction than "deposit it in a repository" (especially if you haven't got a repository installed) and hence more likely to succeed. So we shouldn't put repositories on the Linked Data on-ramp (step/star 1), but if not there, then where should they go?

I would argue that by step 3 (using open formats) or 4 (adding value with identifiers and semantic web tech) the data provider is being asked to make a more substantial investment, and to boost the value of their data holdings. This seems to be an appropriate point to add in extra features, especially when they will help secure the results of that investment. So the 5 stars of Linked Data would mention repositories in Level 4, but the five stars of Open Access could do so in Level 1 because they are already an accepted part of OA processes.

I'm not sure I'm comfortable with mixing the levels - it makes for confusion. Wouldn't it be much better to have one set of processes that apply to all forms of openness - the basic principles of the Web? In my previous post I pointed out that you can add 5* links to 2* PDFs and spreadsheets, so I think the solution may lie in the fact that the 5 stars are not sequential stages, but 5 more-or-less independent principles that each make openness more valuable and useful: licensing, machine readability, open standards, entity identification, interlinking. To which we could add "sustainability", making (see diagram above) a constellation of linked data properties.


  1. Rather than stars, an HTML5-esque set of icons might be more appropriate, then? That way people could quickly see which principles were being upheld, without making any assumptions about which ones come first.

  2. I've only just noticed that Tim's Five Stars of Open Data document says "Now in 2010, people have been pressing me, for government data, to add a new requirement, and that is there should be metadata about the data itself, and that that metadata should be available from a major catalog...Yes, there should be metadata about your dataset." That sounds like a vote for repository functionality to me!

  3. 1. "Use URIs to identify things..." (that's the fourth of the five stars of Tim Berners-Lee's deployment scheme for Linked Open Data) implies using *stable* HTTP URIs. It's the foundation of all linked (open) data: you cannot link if the URI is broken.
    I do not see the necessity of mentioning repositories. On the contrary - repositories can learn a lot of basics from LOD. Metadata about metadata is provided by some linked-data browsers like pubby. BTW, it's easy within linked data to set up a resource that describes the metadata of the metadata. Again, no need for a repository. I am not that much in the repository world (as you may have guessed ;) ) so maybe I am wrong - but I think you can do everything with LOD that you can do with repositories (only better :) ).
    Of course you would need all these nifty tools for cataloging, archiving, versioning etc.
    I guess in the end a good LOD service is not to be distinguished from a good repository and vice versa.

    2. "You can get away with putting material on the Web without any concern about its future safety - but not for long, especially if you want to build services on top of that material."
    I disagree - putting it "on the web" is enough when you make your data "Open Data". If you want a service on top of it, you can cache that data locally so you are independent of the original provider.

    3. The CKAN 404 errors for their listed data source URLs can be explained in many ways: e.g. the data source may be a dump of the linked open data (which is not the same as the resource URIs), or a SPARQL endpoint may have broken down (because it is an experimental service and the one sysadmin is on vacation ...). But that does not mean that the URIs will be lost forever - they are there to be used (and again resolvable, of course).