RepositoryMan: March 2011

Monday, 21 March 2011

I Won't Review Green OA, It's Spam - I DO NOT LIKE IT Sam-I-Am

According to the Times Higher, Michael Mabe (chief executive of the International Association of Scientific, Medical and Technical Publishers and a visiting professor in information science at University College London) fears that repositories are essentially "electronic buckets" with no quality control. He also expressed doubts that the academy would be able to successfully introduce peer review to such repositories, partly because it would be difficult to attract reviewers who had no "brand allegiance" to the repositories.

Let's think about this....

Q: Who are the authors of papers?

A: Researchers.

Q: Who puts papers in repositories?

A: The authors.

Q: Who reviews papers?

A: The authors of other papers.

Q: Where do they get papers to review?

A: From a URL provided by the journal editorial board.

Q: Who are the editorial board?

A: Authors of other papers.

Q: Just remind me what the publishers do?

A: Their most important job is to organise the processes that get the peer review accomplished by the other authors (see above).

Q: Where does the brand value of a journal come from?

A: It's a bit complicated, but mainly from the prestige of the authors on the editorial board and the prestige of the papers that the authors write. There is a default brand that comes from the publishing company that owns the journal, but of course that comes recursively from the brand value of all the journals that it owns.

Q: "Electronic buckets" don't sound very valuable, do they?

A: No they certainly don't - I mean, imagine the kind of material that normally ends up in a bucket! Who would want to peer-review that? But hang on - who stores stuff in buckets anyway? That's a bit of a problematic metaphor for a storage system! Try replacing "buckets" with "library shelves" and the statement becomes more accurate. What kind of material do you find on library shelves? Things that people might want to read. Things that people might want to review.

Q: But how would authors know what to review in a repository without the publishing company's branding?

A: I suppose an editorial board would send them a URL.

Friday, 11 March 2011

You Can't Trust Everything You Read on the Web

Houston, we have a problem. It turns out that trusting repositories as authoritative sources of research information is all very well and good, except when the repository is an authoritative source of demonstration (fake) documents. Sebastien Francois (one of the EPrints team at Southampton) has just reported that Google Scholar is indexing the fake documents that we make available in demoprints.eprints.org.

So when your weaker students start citing

Freiwald, W. and Bonardi, X. and Leir, X. (1998) Hellbenders in the Wild. Better Farming, 1 (4). pp. 91-134.

you know that it's just a teensy misunderstanding, OK? But if anyone needs their citation count artificially boosting, I have a repository available to monetize.

Monday, 7 March 2011

Google, Content Farms and Repositories

In recent news, Google has altered its ranking algorithms to favour sites with original material rather than so-called content farms that simply redistribute material found on other sites. Although users report satisfaction with improved results, this action has caused quite a furore with some genuine sites losing significant business as well.

I have been worried about how this would affect repositories; after all, we technically fit the definition of content farms: sites that exist to redistribute material that is published elsewhere. Bearing in mind that Google delivers the vast majority of our visitors to us, if the changes were to affect our rankings, we might suffer quite badly. Now that the changes have had a couple of weeks to migrate around the planet, our usage stats point to business as usual.

First of all, downloads over the last quarter - no dramatic tailoffs in the last week.

And a comparison with last year (apologies for the different vertical scale) shows year-on-year stability.

So good news there: our repositories haven't been classed as valueless redistribution agents. That would have been a bit of a blow to our morale!

Sunday, 6 March 2011

The Missing Sixth Star of Open Linked Data?

In my previous posting I proposed the idea of the 5 stars of open access. There is of course one feature that the original "taxonomy" misses out completely - repositories! Not just "my favourite repository platform", but the idea of persistent, curated storage. Consequently, my proposal for open access doesn't mention repositories - a bit of an oversight!

At the moment, the entry level to the 5 stars is simply "put it on the web, with an open license". Perhaps we should change this to "put it in a repository with an open license"; perhaps we could designate a "zeroth star" for "just put it on the Web". However, the Linked Data Research Lab at DERI already propose a no-star level, which involves material being put on the web without an explicit license.

You can get away with putting material on the Web without any concern about its future safety - but not for long, especially if you want to build services on top of that material.

Services like CKAN (Comprehensive Knowledge Archive Network, http://ckan.net/) are registries of open knowledge packages currently favoured by the open data community. This registry is built on a simple content management environment, and by November 2010 was already returning HTTP 400- and 500-class error codes for 9% of its listed data source URLs.
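A link-rot survey of this kind is easy to reproduce in outline. The following is only a minimal Python sketch, not the method behind the 9% figure: the function names `fetch_status`, `is_error` and `error_share` are hypothetical, and it assumes that a plain HEAD request is an acceptable probe for each data source URL.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def fetch_status(url, timeout=10.0):
    """Return the HTTP status code for url, or None if no response arrived."""
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # urllib raises HTTPError for 4xx and 5xx responses; keep the code.
        return err.code
    except (URLError, OSError):
        # DNS failure, refused connection or timeout: no status code at all.
        return None


def is_error(status):
    """True for 400- and 500-class status codes."""
    return status is not None and 400 <= status <= 599


def error_share(statuses):
    """Fraction of checked URLs that returned a 4xx or 5xx code."""
    statuses = list(statuses)
    if not statuses:
        return 0.0
    return sum(is_error(s) for s in statuses) / len(statuses)
```

Given a list of registry URLs, `error_share(fetch_status(u) for u in urls)` gives the proportion that answer with an error code. URLs that do not respond at all are counted as checked but not as errors here, which is a design choice rather than a rule.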

A more extreme example is seen in the UK, where police forces recently started to release data about crime reports. But "whenever a new set of data is uploaded, the previous set will be removed from public view, making comparisons impossible unless outside developers actively store it" (see The Guardian for more details).

Repositories have an opportunity to provide management, persistence and curation services to the open data community and its international collections of linked data. Whether our OA platforms are chosen (DSpace? EPrints? Fedora? Zentity?) is not the issue - it is the philosophy and practices of repositories that are vital to the Open Data community, because the data is important and long-lived.

On the other hand, I have argued that reuse (and in this case retention) is the enemy of access. "Just putting it up on the Web" is an easier injunction than "deposit it in a repository" (especially if you haven't got a repository installed) and hence more likely to succeed. So we shouldn't put repositories on the Linked Data on-ramp (step/star 1), but if not there, then where should they go?

I would argue that by step 3 (using open formats) or 4 (adding value with identifiers and semantic web tech) the data provider is being asked to make a more substantial investment, and to boost the value of their data holdings. This seems to be an appropriate point to add in extra features, especially when they will help secure the results of that investment. So the 5 stars of Linked Data would mention repositories in Level 4, but the five stars of Open Access could do so in Level 1 because they are already an accepted part of OA processes.

I'm not sure I'm comfortable with mixing the levels - it makes for confusion. Wouldn't it be much better to have one set of processes that apply to all forms of openness - the basic principles of the Web? In my previous post I pointed out that you can add 5* links to 2* PDFs and spreadsheets, so I think the solution possibly lies in the fact that the 5 stars are not sequential stages, but 5 more-or-less independent principles that each make openness more valuable and useful: licensing, machine readability, open standards, entity identification, interlinking. To which we could add "sustainability"; the result (see diagram above) is a constellation of linked data properties.

Friday, 4 March 2011

The Five Stars of Open Access (aka Linked Documents)

Yesterday I was having a discussion about Scholarly Communications, Open Access, Web 2 and the Semantic Web with some colleagues in our newly formed "Web and Internet Science Research Group" at Southampton. As we were comparing and contrasting more than a decade's experience of open access/open data/OER/Open Government Data, we made the following observation: reuse is the enemy of access.

There have been efforts to replace PDF with HTML as a scholarly format to make data mining easier, and movements to establish highly structured Learning Objects rich in pedagogic metadata to facilitate interoperability of e-learning material. (I have been involved in both of these!) But both have been ignored by the community - they are too hard, they fly in the face of current practice, they involve users learning new skills or making more effort. Some would argue that similar comments could be made about preservation and open access, or even just repositories and open access.

Although "reuse is the enemy of access" is quite a bold statement, it's really just a reformulation of the old saw "the best is the enemy of the good". Attempts to do something with the material we have available are always more complex than just looking at it. Adding services, however valuable and desirable, is more problematic than "just making material available". In the repository community we've worked hard to help users get something for nothing (or something for as little effort as possible), and I'm proud that people recognise that philosophy in EPrints. But it's still a tension - you have to present Open Access as a bandwagon that's easy to climb on!

So I'm particularly impressed with Tim Berners-Lee's Five Stars of Linked Data as a means of declaring an easy onramp to the world of Linked Data, while at the same time setting out a clear means of evaluating and improving contributions and the processes required to support them. It allows the community to have their cake and eat it; to claim maximum participation (a bigger community is a more successful community) and appropriate differentiation (better value is a better agenda).

I think this approach would have served the Open Access communities (OA/OER/Open Data) very well (why didn't we think of it?). But it could yet do so, and so in the spirit of reuse I offer some early thoughts on the Five Stars of Open Access.

★ Available on the web (whatever format), but with an open licence
★★ Available as machine-readable editable data (e.g. Word instead of PDF page description)
★★★ All the above, plus: a non-proprietary format (e.g. HTML5 instead of Word)
★★★★ All the above, plus: use open standards from W3C (RDF and microformats) to identify things, so that people can understand your stuff
★★★★★ All the above, plus: link your data to other people’s data to provide context i.e. link citations to DOIs and other entities to appropriate URIs (e.g. project names, author names, research groups, funders etc).

These are taken directly from Tim's document, with some subtle variations, and are intended for discussion. For a start, they show that we haven't even got very far into 1-star territory, as we mainly fudge the licensing issue. (This comes from the fact that, unlike data, our documents are often re-owned by third parties.) Pressing on, the second star is available for editable source documents rather than page images, and this is also a minority activity. In our repository, there are 7271 PDFs vs 820 Office/HTML/XML documents. So a long way to go there. The third star seems even more remote (376 documents). And as for the fourth star's embedded metadata?
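To put those counts in proportion, here is a small Python sketch of the arithmetic. It rests on one assumption that the figures themselves do not settle: that the 376 third-star documents are a subset of the 820 editable ones, so the total is simply PDFs plus editable documents. The variable names are only illustrative.

```python
# Counts quoted in the post.
pdf_count = 7271
editable_count = 820      # Office/HTML/XML documents (second-star candidates)
open_format_count = 376   # third-star candidates

# Assumption: the 376 are included in the 820, so they are not added again.
total = pdf_count + editable_count


def share(count, whole):
    """Percentage of whole, rounded to one decimal place."""
    return round(100 * count / whole, 1)


pdf_share = share(pdf_count, total)
two_star_share = share(editable_count, total)
three_star_share = share(open_format_count, total)

print(f"PDF: {pdf_share}%  editable: {two_star_share}%  open format: {three_star_share}%")
```

On that assumption, roughly nine documents in ten are PDFs, about one in ten could claim a second star, and fewer than one in twenty a third.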

But the fifth star: this seems to be so valuable. If we could just get there - properly linked documents, no chasing down references, the ability to easily generate citation databases, easy lookup of the social network of authors. Sigh. What's not to like? And you can even add 5* facilities to PDF, so perhaps we will find some short cuts!

If we develop these five stars, it will help us to function as positive Open Access evangelists, while also promoting the future benefits that we would like to work towards. No mixed messages. No confusion.
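As a discussion aid, the cumulative "all the above, plus" reading of the five stars listed earlier can be written down in a few lines of Python. This is a sketch only: the `Document` class and its field names are hypothetical labels for the five criteria, not part of Tim's scheme or of any repository software.

```python
from dataclasses import dataclass


@dataclass
class Document:
    # Hypothetical flags, one per star, in the order of the list above.
    open_licence: bool
    machine_readable: bool
    non_proprietary_format: bool
    open_standard_identifiers: bool
    linked_to_other_data: bool


def star_rating(doc):
    """Count stars cumulatively: stop at the first criterion that is not met."""
    criteria = [
        doc.open_licence,
        doc.machine_readable,
        doc.non_proprietary_format,
        doc.open_standard_identifiers,
        doc.linked_to_other_data,
    ]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars


# An openly licensed PDF: one star, however good its links are.
licensed_pdf = Document(True, False, False, False, True)
print(star_rating(licensed_pdf))
```

Under this strict reading, a document with no open licence scores nothing at all, whatever its format.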