Friday, 4 March 2011

The Five Stars of Open Access (aka Linked Documents)

Yesterday I was having a discussion about Scholarly Communications, Open Access, Web 2 and the Semantic Web with some colleagues in our newly formed "Web and Internet Science Research Group" at Southampton. As we were comparing and contrasting more than a decade's experience of open access/open data/OER/Open Government Data, we made the following observation: reuse is the enemy of access.

There have been efforts to replace PDF with HTML as a scholarly format to make data mining easier, and movements to establish highly structured Learning Objects rich in pedagogic metadata to facilitate interoperability of e-learning material. (I have been involved in both of these!) But both have been ignored by the community: they are too hard, they fly in the face of current practice, and they involve users learning new skills or making more effort. Some would argue that similar comments could be made about preservation and open access, or even just repositories and open access.

Although "reuse is the enemy of access" is quite a bold statement, it's really just a reformulation of the old saw "the best is the enemy of the good". Attempts to do something with the material we have available are always more complex than just looking at the material we have available. Adding services, however valuable and desirable, is more problematic than "just making material available". In the repository community we've worked hard to help users get something for nothing (or something for as little effort as possible), and I'm proud that people recognise that philosophy in EPrints. But it's still a tension - you have to present Open Access as a bandwagon that's easy to climb on!

So I'm particularly impressed with Tim Berners-Lee's Five Stars of Linked Data as a means of declaring an easy onramp to the world of Linked Data, while at the same time setting out a clear means of evaluating and improving contributions and the processes required to support them. It allows the community to have their cake and eat it; to claim maximum participation (a bigger community is a more successful community) and appropriate differentiation (better value is a better agenda).

I think this approach would have served the Open Access communities (OA/OER/Open Data) very well (why didn't we think of it?). But it could yet do so, and in the spirit of reuse I offer some early thoughts on the Five Stars of Open Access.
★ Available on the web (whatever format), but with an open licence
★★ Available as machine-readable editable data (e.g. Word instead of PDF page description)
★★★ All the above, plus a non-proprietary format (e.g. HTML5 instead of Word)
★★★★ All the above, plus: use open standards from W3C (RDF and microformats) to identify things, so that people can understand your stuff
★★★★★ All the above, plus: link your data to other people’s data to provide context, i.e. link citations to DOIs and other entities to appropriate URIs (e.g. project names, author names, research groups, funders etc).
These are directly taken from Tim's document, with some subtle variations, and are intended for discussion. For a start, it shows that we haven't even got very far into 1-star territory, as we mainly fudge the licensing issue. (This comes from the fact that unlike data, our documents are often re-owned by third parties.) Pressing on, the second star is available for editable source documents rather than page images and this is also a minority activity. In our repository, there are 7271 PDFs vs 820 Office/HTML/XML documents. So a long way to go there. The third star seems even more remote (376 documents). And as for the fourth star's embedded metadata?
But the fifth star: this seems to be so valuable. If we could just get there - properly linked documents, no chasing down references, the ability to easily generate citation databases, easy lookup of the social network of authors. Sigh. What's not to like? And you can even add 5* facilities to PDF, so perhaps we will find some short cuts!
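To make the cumulative reading of the stars concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Deposit` record, its field names, and the two format sets (including the choice of ODT as a non-proprietary format) are hypothetical and do not come from EPrints or from Tim's document. It simply stops counting at the first criterion a document fails.

```python
from dataclasses import dataclass

# Hypothetical format groupings, chosen only to illustrate the idea.
EDITABLE_FORMATS = {"doc", "docx", "odt", "html", "xml"}  # star 2: editable source
OPEN_FORMATS = {"odt", "html", "xml"}                     # star 3: non-proprietary


@dataclass
class Deposit:
    """A hypothetical description of one repository document."""
    on_web: bool
    open_licence: bool
    file_format: str
    has_embedded_metadata: bool = False   # star 4: e.g. embedded RDF
    links_to_external_uris: bool = False  # star 5: e.g. citations linked to DOIs


def star_rating(d: Deposit) -> int:
    """Count stars cumulatively, stopping at the first unmet criterion."""
    fmt = d.file_format.lower()
    checks = [
        d.on_web and d.open_licence,
        fmt in EDITABLE_FORMATS,
        fmt in OPEN_FORMATS,
        d.has_embedded_metadata,
        d.links_to_external_uris,
    ]
    stars = 0
    for ok in checks:
        if not ok:
            break
        stars += 1
    return stars


if __name__ == "__main__":
    print(star_rating(Deposit(True, True, "pdf")))   # openly licensed PDF
    print(star_rating(Deposit(True, True, "docx")))  # editable but proprietary
```

Under this strict reading a PDF with linked citations still scores one star, because it fails the second check. Whether the stars should be strictly cumulative, or whether 5* facilities added to PDF should earn credit, is exactly the kind of question the list is meant to open up.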

If we develop these five stars, it will help us to function as positive Open Access evangelists, while also promoting the future benefits that we would like to work towards. No mixed messages. No confusion.


  1. Comparing Word and PDF docs in the repository may skew your perception.

    They are likely to be artefacts reflecting:
    Publishing requirements
    Political correctness (of varied motivations)
    Bandwagon pushing

    I have pressed for some time to get an automated multi-format translator in our various repositories in order to increase human and computer interoperability.

    I think we have to look at as many sorts of automated augmentation as possible in order to increase the default star rating of our collections.

  2. Microformats provide an effective mechanism to add structure to HTML content, but I would not include them in your Fourth Star of Open Access.

    While microformats are well known and widely supported, 1) they are not controlled by the W3C (as your parentheses suggest, although they work within the confines of HTML 4's semantic functionality) and 2) they are not extensible in the spirit of linked data. Microformats rely on defined vocabularies (e.g., vCard, hRecipe), as opposed to the ability to define your own vocabulary if needed (e.g., OWL). Having said that, microformats excel at making data readable by machines and humans alike, leading to faster and more widespread adoption.