RepositoryMan: 2010

Thursday, 16 September 2010

Visibility of OER Material: the Jorum Learning and Teaching Competition

The recent ALT-C 2010 conference saw the final six winners in the Jorum Learning and Teaching Competition present their resources, and receive their prizes, with those taking the top three places announced at the gala dinner.

Louise Egan, JISC-Repositories Email

This competition was designed to promote people sharing learning resources, and that's fantastic. Since none of the six winners' resources are actually deposited in the Jorum repository (just metadata records with a web link to the actual location), this situation provides an interesting insight into the advantages of depositing material (or a reference to material) into a repository. Since copies of all the winners' material reside elsewhere on the Web, can we find out which link gets the higher ranking: Jorum or the original university? Is there any consistent pattern that emerges? In the following breakdown I'll list the Google rankings of each resource when searching for the title of the resource.

First prize: The Molecular Basis of Photosynthesis

There are actually three places to find this work: Jorum, the creator's personal website and a Cambridge support site which contains the original version of this resource (one that hasn't been split into separate sections).

Jorum: 3,4

Author: 6

Institution site (cam.ac.uk): 1,2

Second prize: The Open Dementia E-learning Programme: Living with dementia

Jorum: 3

Institution (scie.ac.uk): 1,2

Third prize: Making the Creative process visible

The home of this material is on Vimeo, although it is also referenced by the HE Academy, which sponsored the project.

Jorum: 1,2

HE Academy: 3,4

Vimeo: 6

Fourth Prize: Ayo Gorkhali

Jorum: 3,4

Professional Society Page: 1,2,7

Fifth Prize: Interpreting Skills Map

Jorum: 3,4

Professional Society Page: 1,2

Sixth Prize: Plagiarism Tutorial

There are a LOT of Plagiarism Tutorials offered by Universities all over the world!

Jorum: 66

Institution: 89

So in some cases Jorum boosts the ranking, and hence the visibility, of an item, whereas in other cases it doesn't. Can we draw any general patterns from this small sample? To be honest, I don't think so! The range of institutions is too diverse. Some of the alternative locations are highly visible, so it is not surprising that Jorum is eclipsed by their ranking (e.g. Cambridge, or the very newsworthy international Gurkhas organisation). Some 49% of Open Jorum's records provide links to external sources rather than holding bitstream contents directly. It would be very interesting to see the bigger picture of OER visibility by undertaking a more comprehensive survey.
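For anyone who wants to repeat the comparison, the rankings listed above can be tallied with a few lines of Python. This is only a sketch: the dictionary simply copies the best (lowest) rank reported for each location, and the rule "Jorum wins if its best rank beats every other location's best rank" is my assumption about how to score a resource.

```python
# Best (lowest) Google rank per location, copied from the lists above.
rankings = {
    "The Molecular Basis of Photosynthesis": {"Jorum": 3, "Author": 6, "Institution": 1},
    "Living with dementia": {"Jorum": 3, "Institution": 1},
    "Making the Creative process visible": {"Jorum": 1, "HE Academy": 3, "Vimeo": 6},
    "Ayo Gorkhali": {"Jorum": 3, "Professional Society Page": 1},
    "Interpreting Skills Map": {"Jorum": 3, "Professional Society Page": 1},
    "Plagiarism Tutorial": {"Jorum": 66, "Institution": 89},
}


def jorum_wins(ranks):
    """True when Jorum's best rank beats every other location's best rank."""
    others = [rank for site, rank in ranks.items() if site != "Jorum"]
    return ranks["Jorum"] < min(others)


winners = [title for title, ranks in rankings.items() if jorum_wins(ranks)]
print(f"Jorum ranks highest for {len(winners)} of {len(rankings)} resources:")
for title in winners:
    print(" -", title)
```

With only six resources the tally is an illustration, not evidence of a pattern, which is the point made above.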

Wednesday, 25 August 2010

More on Mendeley and Repositories

Yesterday's post Comparing Social Sharing of Bibliographic Information with Institutional Repositories created a few comments, so I thought I'd make some more observations from an outsider's point of view.

I think that Mendeley are a fascinating example of the Open Access problem. OA is about moving knowledge from researchers' private environments (their laptops, hard disks, CDs and filing cabinets) into the public space (repositories, websites, search engines). Mendeley's software spans both those environments - bibliography management for the desktop feeding researcher profiles and CVs on the Web.

As Victor Henning pointed out, Mendeley are part of Cambridge's JISC DURA project, which aims to take advantage of Mendeley's position bridging the desktop/Web to try and encourage more public repository deposits. This is a very interesting proposition: maybe users of such a service will be more inclined to make their work Open Access? Perhaps the simple act of buying into the "Mendeley proposition" will cause them to be more favourable to Open Access than they would otherwise have been?

From the outside it's difficult to understand the extent of Mendeley's penetration into a University. What is visible is the public profiles that Mendeley users have created. Although the Mendeley API doesn't allow searching for users, I have been able to identify 53 public profiles from the University of Cambridge through Google (and a lot of manual verification!) Incredibly, only TWO of those 53 researchers have any existing deposits in Cambridge's institutional repository.

This is potentially great news: Mendeley's software has gained takeup from users who aren't repository users. They aren't preaching to the converted, they are getting new users to work in the open, to start to make the transition from the desktop to the Web.

But the OA battle hasn't been won yet. Of those 53 profiles, 21 contain no publication information, and of the 32 that list their publications, only 9 have made any of their publications open access through the Mendeley service (a total of 40 PDFs).
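To put those counts in proportion, here is a small sketch that turns them into percentages. The numbers are the ones quoted in this post; the variable names and the rounding to whole percentages are just my choices for illustration.

```python
# Counts quoted in this post for University of Cambridge public profiles.
profiles = 53
with_repository_deposits = 2
no_publications = 21
sharing_pdfs = 9

# Profiles that list at least one publication.
listing_publications = profiles - no_publications


def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to a whole number."""
    return round(100 * part / whole)


print("Already depositing in the repository:", pct(with_repository_deposits, profiles), "%")
print("Profiles with no publications listed:", pct(no_publications, profiles), "%")
print("Of those listing publications, sharing PDFs:", pct(sharing_pdfs, listing_publications), "%")
print("Of all profiles, sharing PDFs:", pct(sharing_pdfs, profiles), "%")
```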

The social bibliographic approach that Mendeley are promoting is a promising way forward. It's offering people something that they haven't seen from the repository, but it's not a principally Open Access offering, and it's no silver bullet for providing open access. Commentators who have suggested that repositories are old-fashioned, and that everything can be solved by Web 2 solutions, are being over-optimistic. Repositories are hard work because changing researchers' working practices is hard work and I guess there's no single magic solution that's going to make that effort disappear!

Tuesday, 24 August 2010

Comparing Social Sharing of Bibliographic Information with Institutional Repositories

Everyone seems to have been talking about Mendeley over the past year! They have won a string of prizes, most recently the Guardian "Activate Future Technologies" workshop award for the project most likely to change the world for the better. They have achieved these accolades by providing bibliographic database software for the desktop ("like iTunes for research papers"), coupled with a social web site through which researchers can share their bibliographic collections. They have been successful to the tune of 475,671 users and 34,852,751 documents (according to figures on their home page), with some commentators suggesting that they may soon provide access to more bibliographic data than Thomson ISI!

Now a lot of these "documents" are private material that is just stored on researchers' desktops. I am not interested per se in which software is being used to manage private bibliographic metadata. But the extent to which the "social sharing" agenda is successful is obviously crucially important to the repository community - to what extent is research being shared publicly, and in particular, to what extent are full texts of scientific papers being provided as Open Access through Mendeley's site?

To investigate these issues, I took a snapshot of some of their user profiles. Mendeley have 33678 public Computer Science profiles listed, so I took a 10% (3423) sample of those. Of that sample, 2918 or 85% have no publications listed at all, while 6% have only 1 or 2. Just 2% have 10 publications or more listed. The whole sample has a total of 2317 publications listed, with 681 providing PDFs from the Mendeley website. If this sample scales up (and the method I used does not constitute a proper random or representative sample), then the computer science part of Mendeley would have about 23,000 publications listed, with just shy of 7000 full texts.
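The scaling step can be written out as a short calculation. This is a sketch using only the figures quoted above; note that the sample of 3423 is slightly more than a tenth of 33678, so I show both the flat "times ten" estimate used in this post and an estimate using the exact ratio (the second is my addition, and neither fixes the fact that the sample was not random).

```python
# Figures quoted in this post.
total_profiles = 33678
sample_size = 3423
sample_no_publications = 2918
sample_publications = 2317
sample_pdfs = 681

# Share of sampled profiles with nothing listed (about 85%).
share_empty = round(100 * sample_no_publications / sample_size)
print("Profiles with no publications:", share_empty, "%")

# Flat scale-up, treating the sample as exactly 10% of the population.
nominal_publications = sample_publications * 10
nominal_pdfs = sample_pdfs * 10
print("Nominal estimate:", nominal_publications, "publications,", nominal_pdfs, "PDFs")

# Scale-up using the actual ratio of population to sample (just under 10).
scale = total_profiles / sample_size
exact_publications = round(sample_publications * scale)
exact_pdfs = round(sample_pdfs * scale)
print("Ratio-based estimate:", exact_publications, "publications,", exact_pdfs, "PDFs")
```

Either way the estimates land in the same region: roughly 23,000 publications and a little under 7,000 full texts.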

By contrast, our departmental repository (eprints.ecs.soton.ac.uk) has about 15,500 publications listed, of which 7121 have public full texts. So, based on my quick investigation, it looks as though the part of Mendeley's social sharing site which deals with Computer Science (insert boilerplate text about disciplinary differences) is functioning on a similar level to a dedicated departmental repository. They have more bibliographic records; we have more (just) records with public full texts.

The interesting contrast between the Mendeley approach and the repository approach is that the one starts with services on the researcher's desktop that are then used as the basis for offering open access, whereas the other starts with web-based open access that leads to desktop bibliographic tools. It might appear, if my partial and approximate study is anything to go by, that neither approach trumps the other in terms of open access outcomes.

The repository community should certainly embrace and work with services like Mendeley, but we should see them as complementary to our activities, not a replacement for them!

Sunday, 16 May 2010

Visualising Repository Contents

Repositories are great for acquiring material, providing access to that material and securing it for the future. There are also other systems that purport to deal with scientific and scholarly communication of one sort or another: journal web sites, Google, email, blogs, twitter, social networks.

Scholarly communication operates on a grand scale: we have monthly and quarterly issues of journals, annual cycles of conferences, three-year cycles of project management, decade-long cycles of employment and forty-year long careers. The published literature is vast, and growing at an alarming rate. Our informal electronic communications are even more prolific - with hundreds of tweets, blogs and emails to keep up with each day.

The problem with all of the systems that supposedly support scholarly communication is that they help us keep up with approximately 1 hour's worth of recent material - a couple of dozen emails or tweets, a page of search results, a single screen of professional social network commentary. What about making sense of that material in the context of the other 99.999% of the last decade's scientific discussion in that area? What we need is really good ways of visualising really (really) large amounts of material, and exploring it in real time, so that we can make sense of and contribute effectively to current discussions. Especially when we come to topics that are just outside our areas of expertise.

I have been really excited to discover Microsoft Labs' Pivot project (see www.getpivot.com), a system for interacting with huge amounts of visual data. You can see an example of using Pivot to view the contents of an EPrints repository below.

This example only demonstrates using Pivot with the outputs of a part of a single repository, so it's not exactly showing off "the grand scale of scholarly communication". But it is a compelling example of how our repositories might be able to show us the wood and the trees of scientific endeavour, both at the same time.

(This example was constructed by Jiadi Yao, the EPrints-sponsored postgraduate student from the Web Science Doctoral Training Centre. The rest of his time is spent in investigating the ways that social networks underpin citation networks.)

Thursday, 18 February 2010

Repository Benefits (again)

A well-respected colleague (who shall remain nameless) recently posted a Tweet which likened "repositories with end-user benefits" to "pigs with wings". Shome mishtake, shurely?

As part of my job I am course co-ordinator for the MSc in Web Science here at Southampton. Last week I ran an Industrial Liaison day for the course, at which the students made poster presentations of their work to the industry representatives. It was a fantastic time, and the attendees all commented on how stimulated they had been by the posters. Consequently, the secretary who had printed out all the posters was able to give me a ZIP file which I then uploaded to our teaching repository. (See http://www.edshare.soton.ac.uk/4790/.) Now I have a record of a key part of the event that I can share with the industrial delegates. It was quicker than me creating a bespoke website, or uploading them to my home page. Plus I can use it as a marketing and publicity resource for attracting other students. And the students are happy because they have learned that their contributions aren't just "marking fodder", but they have valuable statements to make to a public audience. Win, win, win.

Not so much pigs with wings as pigs with snouts: you have to sniff out the opportunities. Or even carpe suem!

Friday, 12 February 2010

How Repositories Can Contribute Linked Data

I've been working a lot with our repository and Linked Data teams (thanks to Hugh Glaser, Nick Gibbins and Iain Millard) on the JISC dotAC project. One of the great things about that project has been the opportunity to really get our heads around the role of repositories in the Semantic Web and the Linked Data world. Now that the project has finished, I've finally had the opportunity to sit down with Chris Gutteridge and braindump our understanding so far. The following is a description of how EPrints (v3.2 beta) exports its holdings as Linked Data. It all comes down to how it uses and resolves URIs. (Like it says at the bottom of this posting, please send us comments!)

We ASSIGN a URI to all the significant entities that the repository owns: specifically, eprint, document, file, user, and "subject taxonomy" objects.

These URIs will generally be of the form http://repository.com/id/type/id where type is eprint, document etc and id is unique within the scope of the repository.
The official URI of an eprint is of the form http://repository.com/id/eprint/id whereas the official URL of an eprint is of the form http://repository.com/id . The latter has historically been the URL of the eprint's abstract or splash page.
Where possible, resolving a URI will result in a "303 See Other" redirection to the URL of the most appropriate format export, based on content negotiation of available exporters (disseminators).
However, if text/html is deemed most appropriate for an eprint, it is redirected to the standard URL of the abstract page.
For a sub-object (e.g. documents and files of an eprint) the URI is redirected to the ancestor object.
Similarly, subjects redirect to the top-level subject.
Eprint and document objects have special "relationships" fields which allow arbitrary predicates/objects to be attached to the document/eprint.
Documents of format text/n3 and application/rdf+xml (which like all documents have their own URLs inside the repository) are linked to the parent eprint via an rdfs:seeAlso statement. This allows arbitrary triples to be associated with any eprint, irrespective of the repository schema.
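The resolution rules for eprint URIs described above can be sketched in Python. This is an illustration of the behaviour, not EPrints code: the EXPORTERS table, the export URL layout and the fall-back to the abstract page when nothing matches are all hypothetical choices of mine, and repository.com is the placeholder host used in this post.

```python
BASE = "http://repository.com"  # placeholder host from this post

# Hypothetical mapping from media type to exporter name; a real repository
# would derive this from whichever export plugins are installed.
EXPORTERS = {
    "application/rdf+xml": "RDFXML",
    "text/n3": "RDFN3",
}


def entity_uri(entity_type, entity_id):
    """Official URI of a core entity, e.g. an eprint or document."""
    return f"{BASE}/id/{entity_type}/{entity_id}"


def abstract_url(eprint_id):
    """Historical URL of the eprint's abstract (splash) page."""
    return f"{BASE}/{eprint_id}"


def parse_accept(header):
    """Media types from an Accept header, highest q value first."""
    choices = []
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for field in fields[1:]:
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0
        choices.append((media_type, q))
    choices.sort(key=lambda choice: -choice[1])  # stable: ties keep header order
    return [media_type for media_type, q in choices if q > 0]


def resolve_eprint(eprint_id, accept):
    """Return (status, location) for a request to an eprint's URI."""
    for media_type in parse_accept(accept):
        if media_type == "text/html":
            return 303, abstract_url(eprint_id)
        if media_type in EXPORTERS:
            # Hypothetical export URL layout, for illustration only.
            exporter = EXPORTERS[media_type]
            return 303, f"{BASE}/export/eprint/{eprint_id}/{exporter}"
    # Assumed fall-back when no requested type is recognised.
    return 303, abstract_url(eprint_id)


print(entity_uri("eprint", 123))
print(resolve_eprint(123, "application/rdf+xml"))
print(resolve_eprint(123, "text/html,application/rdf+xml;q=0.9"))
```

A Linked Data client asking for RDF is sent to an export, while a browser asking for HTML ends up on the familiar abstract page.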

We MINT (or COIN) a URI for entities whose existence we infer from metadata.

Where there is a high degree of confidence from the metadata that two entities are the same (e.g. two conferences, journals or authors) then they will receive the same URI.
These URIs will generally be of the form http://repository.com/id/x-type/id where type is e.g. event, organisation, person, place etc and id is unique within the scope of the repository.
The unique id is generally a hash generated from the metadata, unless a better value is available e.g. an ISBN.
In the case of a book or a serial we can confidently add an owl:sameAs to the URN.
In other cases, a repository administrator can add a mechanical process for creating the sameAs using specialised knowledge to construct (or map) the metadata to an external URI. An example of this would be looking up an author's URI in a staff database based on an email address in the author metadata. Another example would be a DOI being constructed from a query to the CrossRef database.
(The reason for using an x-publication URI as well as the public URN is that it may be useful to provide a local resolution service for non-local entities.)
All x-type URIs redirect to an RDF+XML document, describing everything known locally about that entity. Content negotiation for other formats is not currently supported for these non-core entities.
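The minting rules above might look something like the following sketch. It is not the EPrints implementation: the normalisation step, the use of SHA-256 as the hash, and the function names are assumptions made for illustration, and the ISBN is just an example value.

```python
import hashlib

BASE = "http://repository.com"  # placeholder host from this post


def normalise(value):
    """Lower-case and collapse whitespace so trivial variants hash alike (assumed rule)."""
    return " ".join(value.lower().split())


def mint_uri(entity_type, metadata, preferred_id=None):
    """Mint an x-type URI for an entity inferred from metadata.

    A better identifier (e.g. an ISBN) is used when available; otherwise the
    id is a hash of the normalised metadata, so matching metadata yields the
    same URI.
    """
    if preferred_id:
        local_id = preferred_id
    else:
        key = "|".join(f"{name}={normalise(value)}" for name, value in sorted(metadata.items()))
        local_id = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{BASE}/id/x-{entity_type}/{local_id}"


def same_as_isbn(local_uri, isbn):
    """Triple linking a locally minted book URI to its public ISBN URN."""
    return (local_uri, "owl:sameAs", f"urn:isbn:{isbn}")


# Two spellings of the same conference receive the same URI.
a = mint_uri("event", {"name": "ALT-C  2010"})
b = mint_uri("event", {"name": "alt-c 2010"})
print(a == b)

# A book with a known ISBN (example value) uses it directly.
book = mint_uri("publication", {"title": "An Example Book"}, preferred_id="9780131103627")
print(same_as_isbn(book, "9780131103627"))
```

Hashing normalised metadata is only as good as the confidence in the match: it merges exact (or trivially different) values, and anything fuzzier would need the administrator-supplied mapping described above.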

A set of standard triples containing rights information about the exported metadata appears in every single RDF export, to facilitate linked data reuse.

Does this look sensible? Have we got it right? Please let us have any comments and feedback!