Sunday 30 September 2007

Self Deposit - It Jolly-Well Works You Know!

First some impartial statistics, and then some ballyhoo.

To follow up the previous posting about our School repository's apparent 100% success rate for capturing the School's research outputs, I have expanded the scope of the investigation. If you recall, I focused on a single society publisher (the ACM) as it provides an important publication venue for half of the research groups in our School. Thanks to Alma Swan's assistance, I have managed to get a more comprehensive (and more representative) report of 157 journal publications for our school from ISI's Web of Knowledge in 2006. Comparing this against the repository's holdings for that year, I find that 128 items match deposits in the database and 29 do not - a deposit rate of 82%. Of those 128 deposits, 118 have full texts and 105 are open access - making an overall OA success rate of 67%.

Running the same experiment on the WoS Conferences data I get 108 conference or workshop papers reported, of which 86 seem to be deposited in the repository (an 80% success rate, almost identical to the journal figures above). However, only 60 of those are full text and 54 are open access, meaning that only 50% of this source of material is being OAed. That seems to be a significant difference (2/3 above versus 1/2 here) which may be partly explained by the attitude of the electronics community towards conferences and workshops. [Note that the figure of 86 items deposited has yet to be carefully checked.]
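For anyone who wants to check the arithmetic behind the figures above, it reduces to a few divisions (the numbers are taken straight from the post):

```python
# Journal figures for 2006, as quoted above.
reported = 157      # journal publications reported by Web of Knowledge
deposited = 128     # matching deposits found in the repository
open_access = 105   # deposits with an open-access full text

deposit_rate = deposited / reported
oa_rate = open_access / reported

print(f"deposit rate: {deposit_rate:.0%}")  # 82%
print(f"OA rate:      {oa_rate:.0%}")       # 67%

# Conference figures work the same way: 86/108 deposited (80%),
# 54/108 open access (50%).
```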

Now the ballyhoo!

As I explained in one of my opening posts back in July, the ECS repository is a research school's working repository with minimum investment - having been established six years ago it now attracts about 2% FTE effort in management, editorial and technical support, and it is still running on the same old server (well, workstation). The policy is that the repository shouldn't put a noticeable load on the research staff - it is there to serve research, not vice versa.

So to find that self-archiving works even in such a neglectful (lax, sloppy) environment is very exciting indeed. We don't have complex processes, careful editorial QA procedures or any extensive administrative oversight. It just happens, day by day. To find that the effectiveness is running at 80-100% without any management effort on our part is just amazing.

So let's hear it for self-deposit!

Just How Long Does it Take to Establish a Successful Repository?

On Thursday I went to the Open University's official opening of Open Research Online, their Institutional Repository. It's currently the third biggest IR in the UK according to ROAR (if you ignore our ECS school repository which isn't run at institutional scales), and it looks to be running at a good sustainable rate of growth, with daily deposits coming from across the institution. It's been five years since the OU first started their eprints repository, and in that time it has gone through several iterations, and several management teams.

Like Southampton, the OU's repository is driven by the needs of the UK's Research Assessment Exercise - broadly speaking that means very high quality metadata and a preference for paper evidence with a distrust of electronic documents. So that's not helping their full text ambitions, but after November 2007 we will all be able to get back to chasing content! Of course, research assessment and research management will continue to be a driver for repositories in the UK (and Australia, and sooner or later everyone else) but there are bigger fish to fry.

Brigid Heywood, the OU's Pro-Vice Chancellor for Research, spoke powerfully at the opening ceremony about the joint missions of a university and its repository as knowledge sharing environments and the need (through constant management support - and haranguing if necessary) to encourage and stimulate change in researchers' perceptions and actions.

Trish Heffernan is their repository manager and has done a fantastic job. I've seen her speak at a publisher's meeting where she told the delegates that they had to change or go out of business (while playing Bob Dylan's "The Times They Are A-Changin"). The rest of their library team are just as highly charged!

It is so exciting to see an institution that has started to embed the repository into its institutional psyche. They've got a long way to go yet, but a well-informed and talented library team, an impressive set of contents and an energetic and supportive senior management sound like a winning formula!

Thursday 27 September 2007

correction: self deposit rates - external calibration

I've just started to recheck those results from last night and it seems that the 1 missing deposit is actually in the repository after all. The ACM had changed the hyphenation on a key word in the title, which meant that the repository search didn't return any results when it ought to have. That means we have a 40/40 success rate from a sample in 2006. For the record, I ignored a handful of items in the ACM that aren't in the scope of our repository - edited conference or workshop proceedings, panel sessions and trip reports etc.
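The hyphenation mismatch above suggests an obvious safeguard for this kind of checking: normalise titles before comparing them. This is not how the repository search itself works - it is just a sketch of the comparison I could have done, with an invented helper name:

```python
import re

def normalise_title(title: str) -> str:
    """Lower-case, replace punctuation (including hyphens) with spaces
    and collapse whitespace, so that 'Self-Deposit' and 'Self Deposit'
    compare equal."""
    title = title.lower()
    title = re.sub(r"[^\w\s]", " ", title)  # hyphens, punctuation -> space
    return re.sub(r"\s+", " ", title).strip()

# The ACM's re-hyphenated title would then still match the deposit:
assert normalise_title("Self-Deposit Works") == normalise_title("Self deposit works")
```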

I'm just starting to count up the number of full texts we have with those 40 items.

self deposit rates - external calibration

Southampton University, and our school in particular, never had a CRIS or Research Management System in which to record all publications before the repository came along. Consequently we genuinely can't answer questions about the percentage of our research output that gets put into our repository, because we have no independent way of knowing what the size of our research output is! So we have always reported a figure of "100%" in surveys, or admitted our ignorance in interviews.

My first posting listed "batch importing articles from publishers' web sites" as a summer task for me. It's not something that I got around to in a serious way - I did do a batch upload of several dozen articles and then got stuck when I realised that I would have to manually check them for duplicates.

Anyway, my colleague Stevan Harnad pushed me for a figure of the proportion of our research available in the repository as he is refining our methods for measuring the "OA Citation advantage". Since it's impossible to refuse one of Stevan's requests I manually checked a "representative sample" of ECS-affiliated publications in the ACM digital library from the year 2006 against our repository holdings. After allowing for trip reports, edited proceedings and oddities like people publishing a paper and immediately taking a post at another university, I could only detect 1 missing deposit from 40 publications - that's a success rate of 97.5%.

To be honest, I was stunned. I expected to find a lot of missing items. I still need to examine these 39 items more closely and see what percentage have the full texts uploaded! I also ought to check a second sample from ISI's Web of Science, but these are both tasks for another day. Or perhaps later on today, when I take the train to the Open University in Milton Keynes, where I have been invited to the official opening of their Repository and Newest Research Building.

Monday 24 September 2007

Outcomes of Light Touch QA attempt

Here are the results of my attempt at applying light-touch QA via emails. No-one complained about the emails, or whined about being asked to do something unreasonable. In fact, several people thanked me for keeping them on the ball and went on to request more quality alerts. (That was unexpected!)

A couple of people were still chary about putting up their full texts (or opening them up to public access). In fact, one professor sent me a list of 15 publications that he wouldn't put in the repository because "he had signed his copyright away". The startling thing was that 13 of those 15 publications (some journals, some conferences and some workshops) were published by "ROMEO green" publishers, i.e. publishers with repository-friendly policies. The other two items were book sections, about which we have no deposit policy. So I had an excuse to email the whole school and remind them about our deposit policy and encourage them about their OA practices - a very useful opportunity indeed.

As to the actual effect - after a week (with no reminders and no followup) 16% of the errors that I reported had been dealt with. To be honest, I was hoping for more, but I think that these QA reminders need to be built into a proper process which includes reporting Quality statistics to the Research Committee. As a one-off it was a useful exercise, but on reflection I think that 16% is probably a realistic rate of returns for a voluntary, one-shot activity request.

Tuesday 11 September 2007

Adding Quality Assurance

I guess the first big challenge for repositories is getting people to use them; the second big challenge is coping with what they put in them! As I said in one of my earlier postings, our researchers voted to have the editor roles removed, so that there would be no delay in material appearing in the repository. A consequence has been that there is no quality control of the metadata on deposit.

Although that means that the detail of the metadata entered for each item can be very variable (e.g. ambiguous or incomplete journal or conference information, missing page numbers, spelling mistakes in titles, creators' family name and given name swapped, no email or staff ids given for local authors) the sky hasn't fallen in. We occasionally get complaints (I got emails from colleagues to tell me that I mixed up the words "ontology" and "oncology" in the title of one of my own submissions) or criticism from library colleagues, but for our principal requirements the metadata given is good enough. Our principal requirements are (of course) visibility in Google and high profile on the Web for individuals and the School as a whole. So what if we don't manage to transcribe the full name of the journal - Google Scholar indexes the full text and seems to tie everything up just fine.

Even so, our bibliographies (automatically taken from the repository) do tend to look somewhat uneven. And sometimes it's really difficult to locate the conference website using the mangled conference name recorded in the metadata. As academics we insist that our students adopt proper bibliographic standards, and then we go and flout them ourselves. So I think it's time to adopt some QA processes!

I did some experiments at the beginning of the summer that produced listings of the most common errors, and they convinced me that there were too many problems for the Repository Manager to deal with alone. The design principle for our school repository is "low cost/low impact", so how to do QA without an editorial team? Ultimately, it has to be by sticking with the "self deposit" model and forcing (encouraging? mandating?) the authors to fix their own deposits. The model I am adopting is this: the repository manager runs a program that identifies a variety of metadata problems and inconsistencies, it generates a set of emails notifying the authors of the problems they need to address, and the authors go and fix them.
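The problem-detection-plus-email model might be sketched like this. The record fields, checks and helper names below are all invented for illustration - they are not the actual EPrints schema or my actual script:

```python
from collections import defaultdict

def find_problems(record):
    """Return a list of human-readable problems for one (hypothetical) record."""
    problems = []
    if not record.get("documents"):
        problems.append("no full text attached")
    elif not record.get("open_access"):
        problems.append("full text not set to open access")
    if record.get("status") in ("submitted", "in press"):
        problems.append(f"still marked as '{record['status']}'")
    if not any("@ecs" in a.get("email", "") for a in record.get("authors", [])):
        problems.append("no local author identified by email address")
    return problems

def draft_emails(records):
    """Group every (title, problem) pair by author email: one email per person."""
    by_author = defaultdict(list)
    for rec in records:
        for issue in find_problems(rec):
            for author in rec.get("authors", []):
                if author.get("email"):
                    by_author[author["email"]].append((rec["title"], issue))
    return {email: "\n".join(f"- {title}: {issue}" for title, issue in items)
            for email, items in by_author.items()}
```

The key design point is the grouping step: each depositor receives a single digest of everything they need to fix, rather than one nagging email per problem.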

It all sounds very easy, but the proof of the pudding will be in the eating. Will everyone ignore or overlook or "fail to prioritise" my requests? I have the chair of the Research Committee backing me up as a signatory to the emails, so hopefully that will add some weight. I am halfway through the first run-through - I write this entry while pausing before hitting the Enter key to send off all the emails.

Before I screw up the courage to hit "Enter", let me give you a breakdown of the situation. Firstly, I am concentrating on items deposited last year (2006) because they should have had time enough to go through any extant parts of the publication cycle. In 2006 there were 1128 items deposited in the repository. Of those, 475 don't have any full text associated with them and 111 have full texts but aren't set as Open Access. 56 items don't have any ECS authors identified by their email addresses. 21 are still marked as 'submitted' and 119 as 'in press'. None of the published items have completely missed off their conference or journal name. There are 782 entries in my "problems" file referring to 700 eprints (or 62% of the year's total). I am sending emails to 239 individuals who are depositors or authors of these items.
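As a sanity check, the per-problem counts quoted above do tally with the size of the problems file, and the distinct-eprint figure gives the 62%:

```python
# Counts taken from the breakdown above.
counts = {"no full text": 475, "full text not open access": 111,
          "no ECS author email": 56, "marked 'submitted'": 21,
          "marked 'in press'": 119}

print(sum(counts.values()))     # 782 entries in the problems file
print(round(700 / 1128 * 100))  # 62 - % of 2006's deposits with a problem
```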

Just a word about why these problems exist. If this were a "preservation repository" then the items would be collected after publication. There would only be one version - the final version - and once submitted and approved there would be no changing of the metadata or data. By contrast, this is an open access repository of live research - items are deposited before publication, while still works in progress. Consequently the metadata and accompanying data/documents change throughout the publishing lifecycle. In particular, a piece may be submitted to one journal and then another. Once accepted, the version may become fixed, but the publication details (volume, issue, page numbers) may not be known for 6-12 months. Consequently, QA processes for an eprint need to be regularly revisited, not just performed as a one-off on ingest.

I know that I'm going to need to extend the range of my problem identification, but that can be done later. I can always send out other requests to fix eprints! And for those problems that are too subtle for a program to spot, I can add a button to each metadata page that says "Report problems of missing or incorrect bibliographic information".

I'm going to press the button now. Wish me luck!

Monday 10 September 2007

Decommissioning Repositories

I had an interesting discussion with Chris (our repository technical guy) today. We host a couple of repositories in ECS that have been used by long-term projects which have now ended. A repository was appropriate to create for them because both were multi site projects - one UK project with six collaborating universities and one EU one with dozens and dozens of partners. Each repository formed a useful way of collecting project outputs and other publications that were relevant to the project's goals, and because each project was relatively long-lived (six years or so) then they were thought of as autonomous quasi-organisations in their own right. And for that very reason, the anticipated contents would not have fitted into a single institutional repository - the majority of course coming from the other institutions!

But now the party's over, there is no more funding, and none of the partner institutions has offered to keep the repository going in perpetuity. Not even the hosting institution or the ex-manager wants to keep their repositories going. We know that even if we don't turn them off, their hosting hardware will fail in a few years. That sounds like very bad news because a repository is supposed to be forever! Was it irresponsible to create these repositories in the first place? Should it be forbidden to create a public repository whose life is guaranteed to be less than a decade? Or perhaps that should be factored into the original policy-making - "this repository and all its contents are guaranteed up to 31st December 2017 but not after". If that were machine readable then the community could have decided whether they want to mirror the collection, or selected bits of it.

However, an easier solution appears to be at hand (at least for EPrints). A repository has two functions - (a) collection / management of information by registered users and editors and (b) dissemination of that information to all and sundry. Once a repository is decommissioned and its managers and depositors have ceased to use it then the former activity ceases, but the latter can go on in perpetuity. A static website is much easier to run than a repository - it is just a set of files, overseen by a web server instead of a database and a hundred active Perl / Java classes. The dissemination (public) part of the repository can be turned into a static website and simply grafted on to the hosting institution's static web space (using an Apache virtual host to keep the URLs identical).

To activate this change, the EPrints repository template needs to be edited to delete all reference to "logging in" or "dynamic site searching" and then all of the static pages need to be regenerated to use the new template. Once that has happened, the repository's 'html' and 'documents' subdirectories can just be transferred to a new web server. The URLs will all be retained intact, the metadata and documents will all be retained intact, the 'collections' will all be retained intact (e.g. view by research group, view by project, view by subject or view by year) and to an external user the repository will look and act much the same.
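The virtual-host grafting might look something like the following. The hostname and file paths here are invented for illustration - the real configuration would use the decommissioned repository's own hostname and the location the 'html' and 'documents' trees were copied to:

```apache
# Serve the regenerated static 'html' tree at the repository's original
# hostname, so every existing URL keeps working after decommissioning.
<VirtualHost *:80>
    ServerName eprints.example-project.org
    DocumentRoot /var/www/fossil-repo/html
    # The full-text documents sit alongside the generated pages.
    Alias /documents /var/www/fossil-repo/documents
</VirtualHost>
```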

Only two extra considerations are left - firstly an OAI-PMH static file will have to be generated by the old repository for its holdings to still be usable by OAI services in their new location. But more importantly, the hosting institution should consider establishing some light-touch policies for this repository "fossil" - especially with regard to continuing preservation of access and preservation of the documents.

According to the Veterinary Dictionary, 'senescence' is the "depression of body functions as part of the process of growing old." I think this accurately describes the process outlined above, so I shall start referring to repository senescence. Of course, this procedure could be applied to a live repository, in order to create a static copy for distribution on CD. This would be ideal for conference / workshop series or electronic journal publishers.

NB The procedure applied above is easy to achieve with EPrints precisely because it was designed to eliminate processing load on the server by making as much of the repository as possible servable as pre-generated static web files.

NNB An alternative approach might be to import all the holdings from the project repository into the institutional repository. But since each of these projects consisted of so many partners, most of the contents would fall outside the collections policy of the host IR. Very few IRs actually want a collection of material that is 95% created by other institutions, and nor do the 95% of authors want to see their work bolstering another University's profile! Various IRs have managed to square the circle when it comes to articles from journals published by their own presses, but it seems for now that project partnership is a different relationship.

Thursday 6 September 2007

Metadata Planning

After a couple of weeks' holiday, I've been spending some time developing some programs to help visualise the metadata that is used by a repository. There are two requirements - one is just to see all the fields that users are being asked to enter for the various kinds of deposits. That is useful for planning a repository, or just keeping an eye on the usefulness of what you planned some years ago. For example, our repository has a "comments" field that is supposed to be used to provide feedback on any problems that were encountered depositing each article. I don't think that it has ever been used!

The second requirement is to see an overview of the process that depositors have to walk through. Because EPrints allows the deposit workflow to be customised according to the attributes of the deposited item or the attributes of the depositing user then understanding the interplay between the various conditions can be quite taxing! Especially when the official workflow document is expressed in XML.
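To give a flavour of why the interplay is taxing, here is a fragment in the style of an EPrints 3 workflow file. The element and field names are from memory and purely illustrative - check your own repository's workflow XML rather than copying this:

```xml
<!-- Illustrative sketch only: which fields appear depends on conditions
     evaluated against the item being deposited. -->
<stage name="core">
  <component><field ref="title" required="yes"/></component>
  <epc:if test="type = 'article'">
    <component><field ref="publication"/></component>
    <component><field ref="volume"/></component>
  </epc:if>
  <epc:if test="type = 'conference_item'">
    <component><field ref="event_title"/></component>
  </epc:if>
</stage>
```

Nesting a few of these conditionals is enough to make the overall deposit experience very hard to picture from the raw XML, hence the diagramming tool.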

You can see the first (tabular) listing and an extract of the second (diagram).