RepositoryMan: How Repositories Can Contribute Linked Data

Friday, 12 February 2010

How Repositories Can Contribute Linked Data

I've been working a lot with our repository and Linked Data teams (thanks to Hugh Glaser, Nick Gibbins and Iain Millard) on the JISC dotAC project. One of the great things about that project has been the opportunity to really get our heads around the role of repositories in the Semantic Web and the Linked Data world. Now that the project has finished, I've finally had the opportunity to sit down with Chris Gutteridge and braindump our understanding so far. The following is a description of how EPrints (v3.2 beta) exports its holdings as Linked Data. It all comes down to how it uses and resolves URIs. (As it says at the bottom of this posting, please send us comments!)

We ASSIGN a URI to all the significant entities that the repository owns: specifically, eprint, document, file, user, and "subject taxonomy" objects.

These URIs will generally be of the form http://repository.com/id/type/id where type is eprint, document, etc. and id is unique within the scope of the repository.
The official URI of an eprint is of the form http://repository.com/id/eprint/id whereas the official URL of an eprint is of the form http://repository.com/id . The latter has historically been the URL of the eprint's abstract or splash page.
Where possible, resolving a URI will result in a "303 See Other" redirection to the URL of the most appropriate format export, based on content negotiation of available exporters (disseminators).
However, if text/html is deemed most appropriate for an eprint, it is redirected to the standard URL of the abstract page.
For a sub-object (e.g. documents and files of an eprint) the URI is redirected to the ancestor object.
Similarly, subjects redirect to the top-level subject.
Eprint and document objects have special "relationships" fields which allow arbitrary predicates/objects to be attached to the document/eprint.
Documents of format text/n3 and application/rdf+xml (which like all documents have their own URLs inside the repository) are linked to the parent eprint via an rdfs:seeAlso statement. This allows arbitrary triples to be associated with any eprint, irrespective of the repository schema.
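The resolution rules above can be sketched roughly as follows. This is a minimal illustration, not EPrints code: the exporter table, the /export/... URL layout, the resolve and parse_accept helpers, and the fallback behaviour when nothing in the Accept header matches are all assumptions made for the example. Redirection of sub-objects and subjects to their ancestors is left out.

```python
from typing import List, Tuple

BASE = "http://repository.com"  # placeholder host, as used in the post

# Hypothetical exporter table (MIME type -> exporter name); not the real EPrints plugin list.
EXPORTERS = {
    "application/rdf+xml": "RDFXML",
    "text/n3": "RDFN3",
}


def parse_accept(header: str) -> List[str]:
    """Return the MIME types of an Accept header, most preferred first."""
    ranked = []
    for position, part in enumerate(header.split(",")):
        fields = [f.strip() for f in part.split(";")]
        mime = fields[0]
        if not mime:
            continue
        q = 1.0  # default quality value when no q= parameter is given
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0
        # Sort by descending q, then by original position in the header.
        ranked.append((-q, position, mime))
    return [mime for _, _, mime in sorted(ranked)]


def resolve(uri_path: str, accept: str) -> Tuple[int, str]:
    """Map a path of the form /id/<type>/<id> to an (HTTP status, Location) pair."""
    parts = uri_path.strip("/").split("/")
    if len(parts) != 3 or parts[0] != "id":
        return 404, ""
    _, obj_type, obj_id = parts
    for mime in parse_accept(accept):
        if mime == "text/html" and obj_type == "eprint":
            # HTML for an eprint goes to the standard abstract page URL.
            return 303, f"{BASE}/{obj_id}"
        if mime in EXPORTERS:
            # Hypothetical export URL layout, for illustration only.
            return 303, f"{BASE}/export/{obj_type}/{obj_id}/{EXPORTERS[mime]}"
    # Assumed fallback: eprints get the abstract page, anything else is 406.
    if obj_type == "eprint":
        return 303, f"{BASE}/{obj_id}"
    return 406, ""
```

A browser sending Accept: text/html for /id/eprint/42 would be redirected to the abstract page, while an RDF client asking for application/rdf+xml would be redirected to the RDF/XML export of the same eprint.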

We MINT (or COIN) a URI for entities whose existence we infer from metadata.

Where there is a high degree of confidence from the metadata that two entities are the same (e.g. two conferences, journals or authors) then they will receive the same URI.
These URIs will generally be of the form http://repository.com/id/x-type/id where type is, for example, event, organisation, person or place, and id is unique within the scope of the repository.
The unique id is generally a hash generated from the metadata, unless a better value is available e.g. an ISBN.
In the case of a book or a serial we can confidently add an owl:sameAs to the URN.
In other cases, a repository administrator can add a mechanical process for creating the sameAs using specialised knowledge to construct (or map) the metadata to an external URI. An example of this would be looking up an author's URI in a staff database based on an email address in the author metadata. Another example would be a DOI being constructed from a query to the CrossRef database.
(The reason for using an x-publication URI as well as the public URN is that it may be useful to provide a local resolution service for non-local entities.)
All x-type URIs redirect to an RDF+XML document, describing everything known locally about that entity. Content negotiation for other formats is not currently supported for these non-core entities.
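The minting rules above might look something like the sketch below. The post does not say which hash function or metadata normalisation EPrints uses, so SHA-1 over lower-cased, whitespace-collapsed fields is purely an assumption here, as are the mint_uri and normalise names and the sample metadata. The urn:isbn: form is the standard ISBN URN namespace.

```python
import hashlib
from typing import Dict, Optional, Tuple

BASE = "http://repository.com"  # placeholder host, as used in the post


def normalise(value: str) -> str:
    """Lower-case and collapse whitespace so trivial variants hash identically."""
    return " ".join(value.lower().split())


def mint_uri(entity_type: str, metadata: Dict[str, str]) -> Tuple[str, Optional[str]]:
    """Return (local x-type URI, external URN for owl:sameAs, or None)."""
    isbn = metadata.get("isbn", "").replace("-", "").replace(" ", "")
    if isbn:
        # A better identifier than a hash is available, so use it directly.
        local_id = isbn
        same_as: Optional[str] = f"urn:isbn:{isbn}"
    else:
        # Assumed scheme: hash the sorted, normalised metadata fields.
        canonical = "|".join(
            f"{key}={normalise(value)}" for key, value in sorted(metadata.items())
        )
        local_id = hashlib.sha1(canonical.encode("utf-8")).hexdigest()
        same_as = None
    return f"{BASE}/id/x-{entity_type}/{local_id}", same_as


# Two records that differ only in case and spacing share one URI.
first, _ = mint_uri("event", {"name": "Open Repositories 2010", "place": "Madrid"})
second, _ = mint_uri("event", {"name": "open  repositories 2010 ", "place": "madrid"})
assert first == second
```

Hashing only normalised metadata is a crude stand-in for the "high degree of confidence" test described above; a real deployment would presumably need stricter matching before treating two authors or conferences as the same entity.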

A set of standard triples containing rights information about the exported metadata appears in every single RDF export, to facilitate linked data reuse.

Does this look sensible? Have we got it right? Please let us have any comments and feedback!

5 comments:

Dorothea, 12 February 2010 at 19:41
What happens when an entity you're identifying has another Linked Data identity elsewhere? E.g. somebody who has a VIAF authority record.
Anonymous, 13 February 2010 at 20:12
If I understand you right...Do you risk exponential increase in metadata volume? Triples of triples of triples of...? That is, will you see an asymptotic track of degrees of separation from significant entities that might interfere with your ability to maintain the reference system?
Garret, 14 February 2010 at 11:24
Thanks for providing an approach of great clarity to this challenge for repository services. Focusing on the ongoing development of processes needed to meet an increasingly complicated and integrated data management challenge this is a great start. It strengthens the hand of the local repository by facilitating service development approaches that could integrate well with content aggregations external to the institution. This will be important given the drift toward a primary point of research output deposit that is centralised in subject based repositories with the institutional repository often playing second fiddle.
Kingsley Idehen, 23 February 2010 at 16:21
Anonymous,

The URI explosion issue is a myth. Co-reference is how you handle multiple identifiers for the same thing. This is why OWL exists: it enables the construction and use of context-specific rules, such as owl:sameAs relations between co-referenced entities.

Kingsley
Anonymous, 23 February 2010 at 20:19
Les, thanks for this. A couple of questions:

1. Do you have a graphical representation of this?

2. How do you think the resolution of persistent identifiers (esp. DOI-based HTTP URIs) fits into this model? I'm thinking especially of the discussions on my blog http://bit.ly/a7S5qD and Tony Hammond's blog http://bit.ly/bgLHLV

3. Finally, do you see this model as being consistent with, perhaps a special case of, the OAI-ORE aggregation model?

Thanks!