Monday 30 July 2007

Importing Frustrations

It all sounds so easy: "just import it from BibTeX". But of course, the ACM's idea of what should go where in BibTeX doesn't fit with mine / my repository. So for example, my repository has an "Official URL" field to indicate where the "official publisher's version" (ahem) is to be found. The ACM (bless 'em) instead provide a "DOI" field. That's a straightforward enough mismatch of information and easy to work around, but to confuse matters they don't put a DOI in the DOI field, they put a URL there. The URL happens to be the URL of a DOI resolution service (their own) with the DOI stuck on the end. This (as it happens) is very easy for a human to use, but a bit of a pain for a service to interpret. Only a little bit of a pain, I hear you cry! But these import scripts are supposed to be little pieces of easy-to-write code that adapt a well-understood interop format to my database schema. Am I supposed to write a different BibTeX importer for each blooming publisher? Ick! Or am I to write a mega-disambiguation script that can understand what the data provider should have said?
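For what it's worth, the DOI case on its own is small. Here is a minimal sketch in Python (not the actual EPrints import code) that pulls a bare DOI out of whatever turns up in the field, whether that is a DOI or a resolver URL with the DOI on the end. The function names, the default resolver address and the sample values in the usage note are all my own assumptions for illustration.

```python
import re

# A DOI starts with "10.", then a numeric registrant code, a slash and a suffix.
# This pattern is a pragmatic approximation, not a full validator.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def extract_doi(field_value):
    """Return the bare DOI found in a BibTeX 'doi' field, or None if there isn't one."""
    match = DOI_PATTERN.search(field_value.strip())
    if match is None:
        return None
    # Drop punctuation that sometimes trails a pasted DOI.
    return match.group(0).rstrip(".,;")


def official_url(field_value, resolver="https://doi.org/"):
    """Build an 'Official URL' from the field (the resolver address is an assumption)."""
    doi = extract_doi(field_value)
    return resolver + doi if doi else None
```

With a made-up value such as "http://doi.acm.org/10.1145/1234567.1234568", extract_doi returns "10.1145/1234567.1234568", and it returns the same thing if the field held the bare DOI all along. That deals with one publisher's quirk; it does nothing about the wider problem of every provider having its own.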

Also, there's that little matter of the missing abstract, so I have to roll my own BibTeX by data scraping anyway. Roll on RDF! (But then of course you can make the same mistakes with RDF and all the hordes of Semantic Web technologists that you can with BibTeX.)

Or, do I just make do with whatever little scraps of help the importer does get right and manually enter the rest (using my army of self-archiving slaves)? What's the Zen thing?

Thursday 26 July 2007

Getting Rid of Lots of Material

Sometimes an import from an external data source works technically, but you rather wish you hadn't done it. This happened to me yesterday when I tried to import details of all the new publications of our staff in the ACM digital library (ACM = scholarly and professional society for Computer Scientists and sundry technophiles). It can export each item to BibTeX, and EPrints imports happily from BibTeX. Yippee, I thought! Unfortunately, the ACM do not include an article's abstract in their export, so this makes the mass deposit less useful than I thought.

But I didn't discover this until I had imported a batch of 20 items. Clicking each item, going to its Action page, pressing the "Delete" button and then *confirming* the delete left me without the will to live after dealing with only two items. (Very low pain threshold us academics - not like librarians who seem to be able to withstand banging their heads against the wall for years on end.)

Anyway, an attribute of Computer Science Geeks is that we would rather write a program capable of doing something 1000 times than actually do it 10 times. So I wrote a script called "BATCH" which allows me to delete arbitrary lists of eprints from any EPrints 3 repository - assuming that I have the correct login and password! In theory it would also allow me to do *anything* to that list of eprints, but I can't think of anything else that I would want to do. I'll sleep on it. Who knows, it might be useful to other repository managers.

Monday 23 July 2007

Welcomed to the Community

I am proud to have been officially welcomed to the community of blogging repository managers by Caveat Lector. Although I don't think I'm up to saving anybody quite yet, I hope that we will see some more blogs from repository practitioners following in her footsteps.

So in response, let me thank her with these words/anagrams:
Dorothea Salo,
Solo Data Hero,
A Haloed Torso,
Has Loot - Adore!

Sunday 22 July 2007

Bad News and Good News

The portfolio server went down sometime last week, and we realised on Thursday that it couldn't be resurrected. Luckily the disks (two RAID mirrored disks) were fine and so we could transplant them into another of our servers (Tim Brody's development server). Unfortunately he was away on holiday last week, but he had shared the root password with another EPrints developer, so everything worked out alright! Tim's going to have a bit of a shock tomorrow morning though.

The good news is that our exams officer says that the School policy on third and fourth year project reports and dissertations is that they are to be considered public material (after examination, of course). Hence, I am advised, we don't need to get individual permission from students if we want to host anything on a school repository. So I have spent some time this weekend uploading the highest scoring reports, presentations and posters onto portfolio. We will inform the students, of course, but not having to manage permission makes things a lot easier!

Of course, it's not all plain sailing. Try as I might, I can't turn the A1 PowerPoint posters into PDFs on my Mac. Goodness only knows what the problem is, so I am resorting to exporting them to PNG images instead.

Tuesday 17 July 2007

Community Solutions!

I've just found out that my Institutional Repository counterparts have extended their EPrints installation to include a thing called a "Problem Buffer" that seems to do many of the QA things that I have been trying to do. I'm going to arrange for a demo!

I've known for a long time about their Problem Buffer, but I didn't realise that they had made it quite so sophisticated. I thought that it was just a 'dumping ground'! I'm always telling people to look around and learn from other repositories, and so I'm embarrassed to have been hoist by my own petard.

Saturday 14 July 2007

Midnight Reflections

Like many repository managers, I have another job to do. Since the repository is only part of my work (and the School has certainly aimed to make sure that the repository is an important part of researchers' work without being a burden), I find myself working on it after hours or at weekends. There's just too much admin to do 9-5! This weekend my wife is away at the Larmertree Festival with our youngest daughter, so I have been able to devote some time to the repository this Saturday without guilt.

I thought I'd make a start on the QA (#2 on my list) and I've managed to put together some programs that address most of those topics. So I have some visual reports on potential duplicates, missing metadata fields and stalled publication. Chris has also run me up an EPrints plugin that allows me to embed an eprints metadata field input component into an ordinary, hand-generated web page (so that I can script up my own page designs that happen to include journal input boxes and the like). My original idea was that I would do all the metadata correction and editing manually. However, there are so many fields to correct in so many eprints that I really think that I need to go back to the self-archiving ideals and get the depositors to sort out their own mess.
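To give a flavour of the kind of checks behind those reports, here is a minimal sketch in Python. It is not the code I actually run, and it assumes each record has already been exported as a plain dictionary; the field names ("id", "title", "status", "deposited" and so on) and the one-year threshold are illustrative assumptions rather than the EPrints schema.

```python
import re
from collections import defaultdict
from datetime import date


def normalise_title(title):
    """Lowercase and collapse punctuation so near-identical titles compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def potential_duplicates(records):
    """Return lists of record ids that share a normalised title (groups of two or more)."""
    groups = defaultdict(list)
    for rec in records:
        groups[normalise_title(rec.get("title", ""))].append(rec["id"])
    return [ids for key, ids in groups.items() if key and len(ids) > 1]


def missing_fields(record, required=("title", "abstract", "pages")):
    """List the required fields that are absent or empty in a record."""
    return [field for field in required if not record.get(field)]


def is_stalled(record, today, max_days=365):
    """True if a record has been 'submitted' or 'in press' for longer than max_days."""
    if record.get("status") not in ("submitted", "in press"):
        return False
    return (today - record["deposited"]).days > max_days
```

Matching on a normalised title is deliberately crude: it catches the same paper deposited twice with different punctuation or capitalisation, but not retitled versions, which still need a human eye.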

So rather than clever batch editing, I think that I'll need to work on some methods for identifying specific problem records (e.g. missing journal titles) and then assigning them to the depositors/authors as tasks, and then getting the repository to track the users' progress against each of the tasks. A new kind of workflow - the user will see a message saying "please fix the following mistakes on this record" with the necessary input boxes embedded on the message. That'll make it nice and quick. And I will need to be able to track the status of all the 'repairs' that all the users have been asked to do. (Completed, in progress, outstanding, refused.)
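The bookkeeping side of that workflow could start out as simple as the following Python sketch. Nothing like this exists in the repository yet; the class and function names are hypothetical, and the four statuses are just the ones listed above.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OUTSTANDING = "outstanding"
    IN_PROGRESS = "in progress"
    COMPLETED = "completed"
    REFUSED = "refused"


@dataclass
class RepairTask:
    eprint_id: int
    depositor: str
    problem: str  # e.g. "missing journal title"
    status: Status = Status.OUTSTANDING


def assign_tasks(problems):
    """Turn (eprint_id, depositor, problem) tuples into outstanding repair tasks."""
    return [RepairTask(eprint_id, depositor, problem)
            for eprint_id, depositor, problem in problems]


def progress_by_depositor(tasks):
    """Count each depositor's tasks by status, for a progress report."""
    summary = {}
    for task in tasks:
        summary.setdefault(task.depositor, Counter())[task.status] += 1
    return summary
```

Each task carries exactly one problem on one record, so the message shown to the user can list the specific fix wanted, and the summary tells me who has cleared their list and who has refused.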

Some things will need to be handled by me. I have noticed, for example, that it is so common for abstracts to be cut and pasted with explicit line breaks (i.e. very short lines that don't reflow in a wider window) that it would be too onerous to expect the depositors to fix them properly.
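The line-break problem is the sort of thing a few lines of code can deal with in bulk. A minimal sketch in Python, assuming the abstract is available as a plain string and that a blank line is the only thing that marks a genuine paragraph break (the function name is my own):

```python
import re


def reflow_abstract(text):
    """Join hard-wrapped lines into paragraphs; blank lines still separate paragraphs."""
    text = text.replace("\r\n", "\n")
    paragraphs = re.split(r"\n\s*\n", text.strip())
    # Collapse every run of whitespace (including single newlines) to one space.
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)
```

This leaves words that were hyphenated across a line break as they are (with a space after the hyphen), so a pasted abstract with those would still want a quick look afterwards.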

Anyway, enough of this for now. It's midnight on a Saturday evening, I have the house to myself and I want to catch up on my unwatched sci-fi DVDs (Bicentennial Man, Battlestar Galactica and Revelation of the Daleks).

Tuesday 10 July 2007

Some Background

Just for the record, I ought to explain a bit about the repository that I manage.

The repository contains about 10,000 records and gets about 600 new deposits per year. There are 2400 papers published since 2004, of which 1400 have open access full texts. I'll have more to say on this percentage later.

It started off life as a bibliography database and was migrated to EPrints in Spring 2000. Its use as a bibliographic record of all school output was already well established but a full text mandate was added in January 2003. The explicit aim has always been for 'light touch' repository management, with all eprints being self-deposited and no editorial workflow to check the metadata. For the last two years the role of 'repository manager' has been an official school administrative task, that is, one of those jobs that are assigned to academics to take on as part of their 30% admin. Up to now, the extent of my work has really been to generate termly reports for the Research Committee that summarise the deposits made by each individual in the school and their compliance with the mandate. This is half a day's effort, three times a year or 0.68% FTE.

I also work with the repository administrator, who is also our webmaster. He keeps an eye on the repository, managing backups and fixing occasional problems (estimated "a couple of hours per month" or 1.5% FTE). He also runs about six more repositories on the same basis (on the same server) for various of the school's EU and UK funded projects. The server in question is a five-year-old PC running Red Hat Linux 7.3 - it was not particularly powerful at the time.

In the steady state, therefore, once you have got your repository up and running and everyone is used to using it, you can see that the resource requirements for a school repository are not onerous!
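For anyone who wants to check the FTE figures above, here is the arithmetic as a small Python sketch. The 220 working days per year and the 7.5-hour day are assumptions of mine that happen to reproduce the numbers quoted; substitute your own institution's values.

```python
def fte_percent(days_per_year, working_days_per_year=220):
    """Effort as a percentage of one full-time post (working-year length is an assumption)."""
    return 100.0 * days_per_year / working_days_per_year


# Repository manager: half a day's effort, three times a year.
manager = fte_percent(3 * 0.5)

# Administrator: about two hours a month, converted to days at an assumed 7.5 hours per day.
administrator = fte_percent(12 * 2 / 7.5)

print(f"manager: {manager:.2f}% FTE, administrator: {administrator:.2f}% FTE")
```

On those assumptions the manager's share comes out at about 0.68% and the administrator's at just under 1.5%.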

Having said that, we have just bought (but not commissioned) a new server and we have just spent some significant time configuring the repository to handle our RAE returns in exactly the way that we wanted. Although EPrints has a module for RAE support, our Head of School (Research) had very specific requirements for handling the data. So I haven't included that as a repository expense per se.

Monday 9 July 2007

And Another Thing

I expect that there will be lots of additions to the list while I try and get my brain (and office) in order.

  1. analyse use of ECS repository: the most popular eprints in our repository are downloaded 200 times per day, whereas the more normal rate is 2 or 3 times per day. I would like to use the tools that we have developed for the JISC IRS project to investigate the reasons behind the recorded download profiles.

Sumer Is Icumen In

All the exam boards have passed and the summer dawns on me and my academic colleagues. I am now old enough and wise enough to know that the apparently enormous stretch of free space in my diary will only allow me to accomplish two or three things before the evenings start drawing in and freshers' week arrives. I'd really like to get some things done on our repository, and I'm trying to make a list.

  1. Migrate the repository (now seven years old) into EPrints v3.

  2. Set up some QA procedures for the repository. Since the staff don't want to have any editorial oversight, we need to do this post hoc. It can all be done manually, but I'd like to have some help for (a) identifying potential duplicates (b) checking for missing full texts (c) checking for items that have been 'submitted' or 'in press' for more than a year (d) looking for missing metadata (e.g. page ranges).

  3. Set up an automatic alert / deposit startup from sources like the ACM and IEEE digital libraries, so that I can regularly find out what new things have been published in recent journal issues and conference proceedings that haven't been deposited into our repository.

  4. Set up a new student repository as an e-portfolio for undergraduate and masters coursework and related activities.

  5. Pre-deposit the best third and fourth year project reports and presentations into the students' individual work areas on the repository.

  6. Have a big publicity push at Graduation, and get the students to sign permission forms for us to make the above work public.

I've already got a lot of the work for #1 done thanks to Chris Gutteridge's effort over the last couple of months, but hopefully we can finish this off soon. I've also done a lot of work on #4 myself, so that we had something to show the students when their results were published. (See for the new repository and also the poster we put up next to their degree results.)