RepositoryMan: February 2010

Thursday, 18 February 2010

Repository Benefits (again)

A well-respected colleague (who shall remain nameless) recently posted a Tweet which likened "repositories with end-user benefits" to "pigs with wings". Shome mishtake, shurely?

As part of my job I am course co-ordinator for the MSc in Web Science here at Southampton. Last week I ran an Industrial Liaison day for the course, at which the students made poster presentations of their work to the industry representatives. It was a fantastic time, and the attendees all commented on how stimulated they had been by the posters. Afterwards, the secretary who had printed out all the posters was able to give me a ZIP file which I then uploaded to our teaching repository. (See http://www.edshare.soton.ac.uk/4790/.) Now I have a record of a key part of the event that I can share with the industrial delegates. It was quicker than creating a bespoke website, or uploading the posters to my home page. Plus I can use it as a marketing and publicity resource for attracting other students. And the students are happy because they have learned that their contributions aren't just "marking fodder", but that they have valuable statements to make to a public audience. Win, win, win.

Not so much pigs with wings as pigs with snouts: you have to sniff out the opportunities. Or even carpe suem!

Friday, 12 February 2010

How Repositories Can Contribute Linked Data

I've been working a lot with our repository and Linked Data teams (thanks to Hugh Glaser, Nick Gibbins and Iain Millard) on the JISC dotAC project. One of the great things about that project has been the opportunity to really get our heads around the role of repositories in the Semantic Web and the Linked Data world. Now that the project has finished, I've finally had the opportunity to sit down with Chris Gutteridge and braindump our understanding so far. The following is a description of how EPrints (v3.2 beta) exports its holdings as Linked Data. It all comes down to how it uses and resolves URIs. (As it says at the bottom of this posting, please send us comments!)

We ASSIGN a URI to all the significant entities that the repository owns: specifically, eprint, document, file, user, and "subject taxonomy" objects.

These URIs will generally be of the form http://repository.com/id/type/id where type is eprint, document etc. and id is unique within the scope of the repository.
The official URI of an eprint is of the form http://repository.com/id/eprint/id whereas the official URL of an eprint is of the form http://repository.com/id . The latter has historically been the URL of the eprint's abstract or splash page.
Where possible, resolving a URI will result in a "303 See Other" redirection to the URL of the most appropriate format export, based on content negotiation of available exporters (disseminators).
However, if text/html is deemed most appropriate for an eprint, it is redirected to the standard URL of the abstract page.
For a sub-object (e.g. documents and files of an eprint) the URI is redirected to the ancestor object.
Similarly, subjects redirect to the top-level subject.
Eprint and document objects have special "relationships" fields which allow arbitrary predicates/objects to be attached to the document/eprint.
Documents of format text/n3 and application/rdf+xml (which like all documents have their own URLs inside the repository) are linked to the parent eprint via an rdfs:seeAlso statement. This allows arbitrary triples to be associated with any eprint, irrespective of the repository schema.
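To make the resolution rules above concrete, here is a minimal sketch in Python of how an eprint URI might be resolved. It is not the EPrints code: the export URL templates, the function names and the fallback to the abstract page when nothing matches are all assumptions made for illustration. Only the /id/eprint/id URI form, the 303 redirect and the special case for text/html come from the description above. Sub-objects and subjects are left out.

```python
import re

# Hypothetical export URL templates, keyed by MIME type. The real EPrints
# export URLs are not given in this post, so these paths are invented.
EXPORT_TEMPLATES = {
    "application/rdf+xml": "/export/eprint/{id}.rdf",
    "text/n3": "/export/eprint/{id}.n3",
}


def parse_accept(header):
    """Return the MIME types in an Accept header, most preferred first."""
    prefs = []
    for position, part in enumerate(header.split(",")):
        fields = part.strip().split(";")
        mime = fields[0].strip()
        if not mime:
            continue
        q = 1.0  # default quality value when none is given
        for param in fields[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        # Sort by descending q, then by original position in the header.
        prefs.append((-q, position, mime))
    return [mime for _, _, mime in sorted(prefs)]


def resolve_eprint(path, accept):
    """Return (status, location) for a request to an /id/eprint/<id> URI."""
    match = re.fullmatch(r"/id/eprint/(\d+)", path)
    if match is None:
        return 404, None
    eprint_id = match.group(1)
    abstract_page = f"/{eprint_id}"  # the historical abstract page URL
    for mime in parse_accept(accept):
        if mime == "text/html":
            return 303, abstract_page
        if mime in EXPORT_TEMPLATES:
            return 303, EXPORT_TEMPLATES[mime].format(id=eprint_id)
    # Assumption: fall back to the abstract page if nothing matched.
    return 303, abstract_page
```

A browser sending Accept: text/html would be sent to the abstract page, while a Linked Data client asking for application/rdf+xml would be sent to the (hypothetical) RDF export URL.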

We MINT (or COIN) a URI for entities whose existence we infer from metadata.

Where there is a high degree of confidence from the metadata that two entities are the same (e.g. two conferences, journals or authors) then they will receive the same URI.
These URIs will generally be of the form http://repository.com/id/x-type/id where type is e.g. event, organisation, person, place etc and id is unique within the scope of the repository.
The unique id is generally a hash generated from the metadata, unless a better value is available e.g. an ISBN.
In the case of a book or a serial we can confidently add an owl:sameAs to the URN.
In other cases, a repository administrator can add a mechanical process for creating the sameAs using specialised knowledge to construct (or map) the metadata to an external URI. An example of this would be looking up an author's URI in a staff database based on an email address in the author metadata. Another example would be a DOI being constructed from a query to the CrossRef database.
(The reason for using an x-publication URI as well as the public URN is that it may be useful to provide a local resolution service for non-local entities.)
All x-type URIs redirect to an RDF+XML document, describing everything known locally about that entity. Content negotiation for other formats is not currently supported for these non-core entities.
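The minting rules above can be sketched in a few lines of Python. Again this is an illustration, not the EPrints implementation: the choice of SHA-1, the whitespace and case normalisation, and the function names are assumptions, and the ISBN in the usage note is made up. What comes from the description is the /id/x-type/id URI form, the use of a metadata hash unless a better identifier exists, and the owl:sameAs link to a URN for books and serials.

```python
import hashlib

BASE = "http://repository.com"  # placeholder host used throughout this post
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def normalise(value):
    """Lower-case and collapse whitespace so trivial variants hash alike."""
    return " ".join(value.lower().split())


def mint_uri(kind, metadata, better_id=None):
    """Mint a URI of the form BASE/id/x-<kind>/<id> for an inferred entity.

    The local id is `better_id` (e.g. an ISBN) when one is available,
    otherwise a hash of the normalised metadata (SHA-1 is an assumption).
    """
    if better_id is not None:
        local_id = better_id
    else:
        canonical = "|".join(
            f"{key}={normalise(metadata[key])}" for key in sorted(metadata)
        )
        local_id = hashlib.sha1(canonical.encode("utf-8")).hexdigest()
    return f"{BASE}/id/x-{kind}/{local_id}"


def same_as_isbn(uri, isbn):
    """Return an N-Triples statement linking a minted URI to an ISBN URN."""
    return f"<{uri}> <{OWL_SAME_AS}> <urn:isbn:{isbn}> ."
```

With this sketch, mint_uri("event", {"name": "Open Repositories 2010"}) and mint_uri("event", {"name": "open  repositories 2010"}) give the same URI, which is the behaviour we want when two metadata records very probably describe the same conference. For a book with a (made-up) ISBN of 1234567890, mint_uri("publication", {}, better_id="1234567890") gives http://repository.com/id/x-publication/1234567890, and same_as_isbn links it to urn:isbn:1234567890.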

A set of standard triples containing rights information about the exported metadata appears in every single RDF export, to facilitate linked data reuse.

Does this look sensible? Have we got it right? Please let us have any comments and feedback!