Tuesday, 11 September 2007

Adding Quality Assurance

I guess the first big challenge for repositories is getting people to use them; the second big challenge is coping with what they put in them! As I said in one of my earlier postings, our researchers voted to have the editor roles removed, so that there would be no delay in material appearing in the repository. A consequence has been that there is no quality control of the metadata on deposit.

Although that means that the detail of the metadata entered for each item can be very variable (e.g. ambiguous or incomplete journal or conference information, missing page numbers, spelling mistakes in titles, creators' family name and given name swapped, no email or staff IDs given for local authors), the sky hasn't fallen in. We occasionally get complaints (I got emails from colleagues to tell me that I mixed up the words "ontology" and "oncology" in the title of one of my own submissions) or criticism from library colleagues, but for our principal requirements the metadata given is good enough. Our principal requirements are (of course) visibility in Google and a high profile on the Web for individuals and the School as a whole. So what if we don't manage to transcribe the full name of the journal - Google Scholar indexes the full text and seems to tie everything up just fine.

Even so, our bibliographies (automatically taken from the repository) do tend to look somewhat uneven. And sometimes it's really difficult to locate the conference website using the mangled conference name recorded in the metadata. As academics we insist that our students adopt proper bibliographic standards and then we go and flout them ourselves. So I think it's time to adopt some QA processes! I did do some experiments at the beginning of the summer that produced listings of the most common errors, but they convinced me that there were too many problems for the Repository Manager to deal with alone. The design principle for our school repository is "low cost/low impact", so how to do QA without an editorial team? Ultimately, it has to be by sticking with the "self deposit" model and forcing (encouraging? mandating?) the authors to fix their own deposits. The model that I am adopting has three steps: the repository manager runs a program that identifies a variety of metadata problems and inconsistencies, the program generates a set of emails that notify the authors of the problems they need to address, and the authors go and fix them.
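For the curious, the check-and-notify model can be sketched in a few lines of Python. This is only an illustration and not the program I actually run: the record layout, the field names (`status`, `has_full_text`, `open_access`, `pages`, `author_emails`) and the example.org addresses are all assumptions made up for the sketch, and none of it is the EPrints API. It checks each record against a few simple rules, then groups the problems by author so that each person gets a single message.

```python
from collections import defaultdict

# Hypothetical records: the field names are illustrative assumptions,
# not the real EPrints metadata schema.
RECORDS = [
    {"id": 101, "title": "Paper A", "status": "in press",
     "has_full_text": True, "open_access": True, "pages": "",
     "author_emails": ["a@example.org"]},
    {"id": 102, "title": "Paper B", "status": "published",
     "has_full_text": False, "open_access": False, "pages": "1-10",
     "author_emails": ["a@example.org", "b@example.org"]},
    {"id": 103, "title": "Paper C", "status": "published",
     "has_full_text": True, "open_access": True, "pages": "11-20",
     "author_emails": ["b@example.org"]},
]


def find_problems(record):
    """Return a list of human-readable problems for one record."""
    problems = []
    if record["status"] in ("submitted", "in press"):
        problems.append(f"still marked as '{record['status']}'")
    if not record["has_full_text"]:
        problems.append("no full text attached")
    elif not record["open_access"]:
        problems.append("full text is not set as Open Access")
    if record["status"] == "published" and not record["pages"]:
        problems.append("missing page numbers")
    return problems


def build_emails(records):
    """Group problems by author email and compose one message body each."""
    per_author = defaultdict(list)
    for record in records:
        problems = find_problems(record)
        if not problems:
            continue  # clean records generate no mail
        for email in record["author_emails"]:
            per_author[email].append((record, problems))

    emails = {}
    for email, items in per_author.items():
        lines = ["The following deposits need some attention:", ""]
        for record, problems in items:
            lines.append(f"{record['id']}: {record['title']}")
            lines.extend(f"  - {problem}" for problem in problems)
        emails[email] = "\n".join(lines)
    return emails


if __name__ == "__main__":
    # Print the messages instead of sending them; actual delivery is
    # deliberately left out of this sketch.
    for address, body in build_emails(RECORDS).items():
        print(f"To: {address}\n{body}\n")
```

Grouping by author rather than by item means nobody gets a flood of separate messages, and a co-authored item shows up in every co-author's mail, so any one of them can fix it.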

It all sounds very easy, but the proof of the pudding will be in the eating. Will everyone ignore or overlook or "fail to prioritise" my requests? I have the chair of the Research Committee backing me up as a signatory to the emails, so hopefully that will add some weight. I am halfway through the first run-through - I write this entry while pausing before hitting the Enter key to send off all the emails.

Before I screw up the courage to hit "Enter", let me give you a breakdown of the situation. Firstly, I am concentrating on items deposited last year (2006) because they should have had time enough to go through any extant parts of the publication cycle. In 2006 there were 1128 items deposited in the repository. Of those, 475 don't have any full text associated with them and 111 have full texts but aren't set as Open Access. 56 items don't have any ECS authors identified by their email addresses. 21 are still marked as 'submitted' and 119 as 'in press'. None of the published items have completely missed off their conference or journal name. There are 782 entries in my "problems" file referring to 700 eprints (or 62% of the year's total). I am sending emails to 239 individuals who are depositors or authors of these items.

Just a word about why these problems exist. If this were a "preservation repository" then the items would be collected after publication. There would only be one version - the final version - and once submitted and approved there would be no changing of the metadata or data. By contrast, this is an open access repository of live research - items are deposited before publication, while they are still works in progress. Consequently the metadata and accompanying data/documents change throughout the publishing lifecycle. In particular, a piece may be submitted to one journal then another. Once accepted the version may become fixed, but the publication details (volume, issue, page numbers) may not be known for 6-12 months. As a result, QA processes for an eprint need to be regularly revisited and not just performed as a one-off on ingest.

I know that I'm going to need to extend the range of my problem identification, but that can be done later. I can always send out other requests to fix eprints! And for those problems that are too subtle for a program to spot, I can add a button to each metadata page that says "Report problems of missing or incorrect bibliographic information".

I'm going to press the button now. Wish me luck!


  1. Good luck. Just so you're prepared, I will point out that another potential negative consequence is depositors rearing back on their hind legs and balking about depositing anything else!

  2. Wise words and good counsel from dorothea, so I decided to send out an explanatory email to everyone beforehand, just to ease people into the swing of things and to stop them feeling picked on. For the record, here it is:

    I am just about to send around some messages about items in the EPrints database that need some attention. Most people will receive a mail outlining some missing information for a handful of papers. DON'T PANIC - these aren't major problems - they are mainly items that have been marked as "submitted" or "in press" for over a year, or records with missing full texts. If you don't have the information to update the record, perhaps you could ask one of your co-authors to assist. There is no need to reply to the emails, unless you have any queries.

    Thanks for your help!
    Les Carr