RepositoryMan: 2007

Saturday, 24 November 2007

Getting It Out of your System

At the moment I am torn between the model of a repository as a theme park (there's all the rides for you to enjoy in one place) or the repository as a DVD lending library (there's all the films there in one place, but you take them away to enjoy them on your DVD player and your iPod and your laptop and at your friend's house).

I've thought hard about this with EPrints in mind - should it offer a rich and engaging user experience there, inside the repository, with as many built-in services as possible, or should it just let you take its contents and use them in as many external places as possible?

At Southampton we build repositories (EPrints) and we build OAI services (citation analysis, open access monitoring, preservation assistance), so we're constantly asking ourselves the question "where does this go? Inside a repository or in a service?" My natural inclination is to go for the external service model - it's global and interoperable rather than parochial and platform-specific.

Put in those terms the answer seems to be a no-brainer. But the problem is that while it is getting easier to get repositories funded and supported, it is really difficult to get services funded and supported. What is the natural home for an international service? Difficult to say! If it doesn't cater to a particular institution, region or country then who is going to put their hands up and host it? Or rather who is going to put their hands in their pocket and bankroll it?

But when it comes to an institutional repository there's a different story. It has a natural home (the institution) and with it a support infrastructure and a mechanism for applying for further support to achieve new developments/updates, all because it's serving a local need. So innovations and services may perhaps emerge in the local repository, rather than in a global service.

Tim O'Reilly made a recent criticism of this approach (It's The Data Stupid) in the context of social networking, arguing that it is more important to allow users to use their information in lots of third party services than it is to make it easy for developers to create lots of applications local to a particular site.

And he may be right - the information captured by a single repository is going to be a very very small part of "the global literature". What researcher would want to be locked up with only the work authored by him/her and his/her research group, however interesting the ride!

So on the one hand external services seem to be the proper solution and on the other hand local repositories seem to be the pragmatic solution. Like I said at the beginning, I'm torn. I think that EPrints had better back both approaches!

Tuesday, 13 November 2007

Repository Upgrade

We switched the repository just before the weekend, so that the EPrints v3 version has now replaced the old repository. We had done all the configuration migration previously and every so often we would migrate the data contents, just to check. We suspended all new deposits and editing facilities on Wednesday, did the final data migration and changed the DNS so that eprints.ecs now points to what used to be eprints3.ecs. Having done that we left the editing & depositing locked down for a few days in case any problems became apparent. They didn't, apart from a few people reporting that the editing had stopped working :-)

So last night (Monday 12th Nov) we switched editing back on and all systems are go once again! There are a couple of niggles to sort out - the citation format isn't quite the same as it used to be (it's missing out some conference information) but apart from that it went very smoothly.

Sunday, 4 November 2007

Even Less Exciting Times

I demonstrated the new repository to our research committee but it was all a bit of a letdown in the sense that they couldn't see any problems with moving to the new repository straight away. So all of a sudden the ball is back in our court, after having been waiting since the beginning of the summer.

Wednesday, 31 October 2007

Less Exciting Times: When Your Server Goes Down

Having just spent three months preparing for the EPrints Call for Plugins (see http://www.eprints.org/software/cfp.php) I find that firstly the JISC mailing list server is having problems and it takes a whole day to get the message sent to the repository mailing list which then makes the delayed delivery of the mail coincide with a hardware fault taking our server down!!! I just don't believe it! After all that prep, just when everyone receives the email, they can't click on any of the links to find more information. ARRRGGGHHH!

More to the topic of this blog, the fault also took our school repository down for the day. This is a wakeup call, because it probably means that the hardware is on its way out. I did explain in a post at the beginning of the summer that it was running on a fairly old machine, that has done OK by us for about 5 years.

We do have a replacement waiting in the wings, because it is the EPrints v3 migration. However, we did imagine that we had a more leisurely timeframe to roll it out in - and to do user training so that people would understand the new interface. Looks like it's going to be a bit more rushed than I thought. I'll demo the new setup to our research committee today (that's a good coincidence!), so hopefully they'll greenlight it with minor changes at most.

Monday, 29 October 2007

Exciting Times: The Repository Desktop Experience

I battled with my laptop over the weekend to upgrade it to the latest version of the Mac environment (why is it that my teenage children had no problem with their upgrades?) but I think that the brave new world of "Leopard" is probably worth it.

How is this related to repositories, I hear you cry? Aren't we a little "off-piste"? Well, stick with me, because Apple have been addressing the issue of browsing through collections. First of all it was music collections (iTunes) and then picture collections (iPhoto), but now they have put some of that experience into viewing large collections of documents and files. The Finder (the Mac equivalent of the Windows Explorer) has stolen the so-called "cover flow" visualisation from iTunes, to allow you to get the experience of quickly flicking through a stack of albums to identify the one you want by its artwork. The result is that I can flick through the contents of dozens or hundreds of files on my hard disk (PowerPoint slideshows, article PDFs, conference posters, funding proposals, committee minutes, photos, videos, the lot). I don't have to open them one at a time in the application that created them. I don't have to stare at lists of file names or grids of icons any more. I can just flick through the contents.

So, by using the simple "Zip" export plugin in EPrints, I can get the files associated with any set of eprints and "Cover Flow" browse them on my laptop. See a video demonstration of what I'm talking about. Please excuse the cheesy voiceover!

Is this "quite cute" or is this "really useful"? Well, it's already really useful for some of the applications that we have at the moment - cover flow or slide shows or similar visualisations are good ways to show off our repository contents. Whether someone is trying to sell the repository to the faculty, or sell the faculty to the funders, or sell the funders to the government, or (as in the video example) sell their own educational achievements to their prospective employers then good presentations are essential.

But I think that this kind of visualisation might well prove useful for helping researchers interact with large collections of research material. Time (and experience) will tell. What is clear is that the user's desktop experience is going to become more multimedia and more interactive and that repositories will need to have a closer integration with the desktop, both for information upload and for information reuse.

Wednesday, 10 October 2007

Never Mind the Quality, Feel the Width

It's a new academic year and there's a New Head of School in charge. He's visiting all the research groups and so when he turns up to our group we want to have a good display of our research on show, and it seems that the recent posters presented at conferences and workshops would be good for this. We've decided to put them all in the repository (those that weren't already in there) and then they can be displayed on a plasma screen.

But he's also not had the chance to become au fait with the repository or with EPrints yet, and I wanted to try and give some impression of the size of the collection that it represents. So I have created a 'thumbnail wall' of all the files stored in it and I'll get it printed as big as I can for display. There's an interactive version linked to the image above - it's in PDF and each thumbnail is linked to the associated record in the repository. It's a bit big - there are over 4000 page images there! The background image between the documents is sky and clouds - for "blue sky research".

I'm going to contact some artists to see if they can help me develop a more interesting and perhaps practical way of looking at large collections!

Sunday, 30 September 2007

Self Deposit - It Jolly-Well Works You Know!

First some impartial statistics, and then some ballyhoo.

To follow up the previous posting about our School repository's apparent 100% success rate for capturing the School's research outputs, I have expanded the scope of the investigation. If you recall, I focused on a single society publisher (the ACM) as it provides an important publication venue for half of the research groups in our School. Thanks to Alma Swan's assistance, I have managed to get a more comprehensive (and more representative) report of 157 journal publications for our school from ISI's Web of Knowledge in 2006. Comparing this against the repository's holdings for that year I find that 128 items match deposits in the database and 29 do not - that makes a deposit rate of 82%. Of those 128 deposits, 118 have full texts and 105 are open access - making an overall OA success rate of 67% (105 of the 157 reported publications).

Running the same experiment on the WoS Conferences data I get 108 conference or workshop papers reported, of which 86 seem to be deposited in the repository (an 80% success rate, almost identical to the journal figures above). However, only 60 of those are full text and 54 are open access, meaning that only 50% of this source of material is being OAed. That seems to be a significant difference (2/3 above versus 1/2 here) which may be partly explained by the attitude of the electronics community towards conferences and workshops. [Note that the figure of 86 items deposited has yet to be carefully checked.]
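For anyone who wants to repeat the arithmetic, here is a minimal Python sketch. It uses only the counts quoted in this post; the function name and the layout of the result are my own, purely for illustration. Every rate is taken as a percentage of the items reported by the external source, not of the items deposited.

```python
def rates(reported, deposited, full_text, open_access):
    """Return deposit, full-text and OA rates as percentages of the reported items."""
    return {
        "deposit": 100 * deposited / reported,
        "full_text": 100 * full_text / reported,
        "open_access": 100 * open_access / reported,
    }


# Counts quoted in this post (2006 publications).
journals = rates(reported=157, deposited=128, full_text=118, open_access=105)
conferences = rates(reported=108, deposited=86, full_text=60, open_access=54)

for name, result in (("journals", journals), ("conferences", conferences)):
    # Round to whole percentages for display only.
    print(name, {key: round(value) for key, value in result.items()})
```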

Now the ballyhoo!

As I explained in one of my opening posts back in July, the ECS repository is a research school's working repository with minimum investment - having been established six years ago it now attracts about 2% FTE effort in management, editorial and technical support, and it is still running on the same old server (well, workstation). The policy is that the repository shouldn't put a noticeable load on the research staff - it is there to serve research, not vice versa.

So to find that self-archiving works even in such a neglectful (lax, sloppy) environment is very exciting indeed. We don't have complex processes, careful editorial QA procedures nor any extensive administrative oversight. It just happens, day by day. To find that the effectiveness is running at 80-100% without any management effort on your part, is just amazing.

So let's hear it for self-deposit!

Just How Long Does it Take to Establish a Successful Repository?

On Thursday I went to the Open University's official opening of Open Research Online, their Institutional Repository. It's currently the third biggest IR in the UK according to ROAR (if you ignore our ECS school repository which isn't run at institutional scales), and it looks to be running at a good sustainable rate of growth, with daily deposits coming from across the institution. It's been five years since the OU first started their eprints repository, and in that time it has gone through several iterations, and several management teams.

Like Southampton, the OU's repository is driven by the needs of the UK's Research Assessment Exercise - broadly speaking that means very high quality metadata and a preference for paper evidence with a distrust of electronic documents. So that's not helping their full text ambitions, but after November 2007 we will all be able to revert back to chasing content! Of course, research assessment and research management will continue to be a driver for repositories in the UK (and Australia, and sooner or later everyone else) but there are bigger fish to fry.

Brigid Heywood, the OU's Pro-Vice Chancellor for Research, spoke powerfully at the opening ceremony about the joint missions of a university and its repository as knowledge sharing environments and the need (through constant management support - and haranguing if necessary) to encourage and stimulate change in researchers' perceptions and actions.

Trish Heffernan is their repository manager and has done a fantastic job. I've seen her speak at a publisher's meeting where she told the delegates that they had to change or go out of business (while playing Bob Dylan's "The Times They Are A-Changin'"). The rest of their library team are just as highly charged!

It is so exciting to see an institution that has started to embed the repository into its institutional psyche. They've got a long way to go yet, but a well-informed and talented library team, an impressive set of contents and an energetic and supportive senior management sound like a winning formula!

Thursday, 27 September 2007

correction: self deposit rates - external calibration

I've just started to recheck those results from last night and it seems that the 1 missing deposit is actually in the repository after all. The ACM had changed the hyphenation on a key word in the title, which meant that the repository search didn't return any results when it ought to have. That means we have a 40/40 success rate from a sample in 2006. For the record, I ignored a handful of items in the ACM that aren't in the scope of our repository - edited conference or workshop proceedings, panel sessions and trip reports etc.
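A hyphenation mismatch like this one could probably be avoided by comparing normalised titles rather than searching on the title as typed. The following is only a sketch of that idea in Python, not what the repository search actually does; the function names and the example titles in the usage note are invented for illustration.

```python
import re
import unicodedata


def normalise_title(title):
    """Reduce a title to lower-case ASCII letters and digits."""
    # Decompose accented characters, then drop the combining accents.
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove hyphens, spaces and all other punctuation.
    return re.sub(r"[^a-z0-9]", "", text.lower())


def same_title(a, b):
    """True when two titles differ only in case, accents, spacing or punctuation."""
    return normalise_title(a) == normalise_title(b)
```

With this, same_title("Self-Archiving in Practice", "Self archiving in practice:") is true, while titles that differ by a real letter (my "ontology" versus "oncology", say) still count as different. It is deliberately crude: anything outside a-z and 0-9 is thrown away, so it would need more thought for titles in non-Latin scripts.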

I'm just starting to count up the number of full texts we have with those 40 items.

self deposit rates - external calibration

Southampton University, and our school in particular, has never had a CRIS or Research Management System in which to report all publications before the repository came along. Consequently we genuinely can't answer questions about the percentage of our research output that gets put into our repository, because we have no independent way of knowing what the size of our research output is! Consequently we have always reported a figure of "100%" in surveys, or admitted our ignorance in interviews.

My first posting listed "batch importing articles from publishers' web sites" as a summer task for me. It's not something that I got around to in a serious way - I did do a batch upload of several dozen articles and then got stuck when I realised that I would have to manually check them for duplicates.

Anyway, my colleague Stevan Harnad pushed me for a figure of the proportion of our research available in the repository as he is refining our methods for measuring the "OA Citation advantage". Since it's impossible to refuse one of Stevan's requests I manually checked a "representative sample" of ECS-affiliated publications in the ACM digital library from the year 2006 against our eprints.ecs.soton.ac.uk repository holdings. After allowing for trip reports, proceedings edited and oddities like people publishing a paper and immediately taking a post at another University, I could only detect 1 missing deposit from 40 publications - that's a success rate of 97.5%.

To be honest, I was stunned. I expected to find a lot of missing items. I still need to examine these 39 items closer and see what percentage have the full texts uploaded! I also ought to check a second sample from ISI's Web of Science, but these are both tasks for another day. Or perhaps later on today, when I take the train to the Open University in Milton Keynes where I have been invited to the official opening of their Repository and Newest Research Building. See http://oro.open.ac.uk/ for the former and http://www.ridge.co.uk/sectors_and_projects/education/open_university.aspx for the latter.

Monday, 24 September 2007

Outcomes of Light Touch QA attempt

Here are the results of my attempt at applying light-touch QA via emails. No-one complained about the emails, or whined about being asked to do something unreasonable. In fact, several people thanked me for keeping them on the ball and went on to request more quality alerts. (That was unexpected!)

A couple of people were still chary about putting up their full texts (or opening them up to public access). In fact, one professor sent me a list of 15 publications that he wouldn't put in the repository because "he had signed his copyright away". The startling thing was that 13 of those 15 publications (some journals, some conferences and some workshops) were published by "ROMEO green" publishers, i.e. publishers with repository-friendly policies. The other two items were book sections about which we have no deposit policy. So I had an excuse to email the whole school and remind them about our deposit policy and encourage them about their OA practices - a very useful opportunity indeed.

As to the actual effect - after a week (with no reminders and no followup) 16% of the errors that I reported had been dealt with. To be honest, I was hoping for more, but I think that these QA reminders need to be built into a proper process which includes reporting Quality statistics to the Research Committee. As a one-off it was a useful exercise, but on reflection I think that 16% is probably a realistic rate of returns for a voluntary, one-shot activity request.

Tuesday, 11 September 2007

Adding Quality Assurance

I guess the first big challenge for repositories is getting people to use them; the second big challenge is coping with what they put in them! As I said in one of my earlier postings, our researchers voted to have the editor roles removed, so that there would be no delay in material appearing in the repository. A consequence has been that there is no quality control of the metadata on deposit.

Although that means that the detail of the metadata entered for each item can be very variable (e.g. ambiguous or incomplete journal or conference information, missing page numbers, spelling mistakes in titles, creators' family name and given name swapped, no email or staff ids given for local authors) the sky hasn't fallen in. We occasionally get complaints (I got emails from colleagues to tell me that I mixed up the words "ontology" and "oncology" in the title of one of my own submissions) or criticism from library colleagues, but for our principal requirements the metadata given is good enough. Our principal requirements are (of course) visibility in Google and high profile on the Web for individuals and the School as a whole. So what if we don't manage to transcribe the full name of the journal - Google Scholar indexes the full text and seems to tie everything up just fine.

Even so, our bibliographies (automatically taken from the repository) do tend to look somewhat uneven. And sometimes it's really difficult to locate the conference website using the mangled conference name recorded in the metadata. As academics we insist that our students adopt proper bibliographic standards and then we go and flout them ourselves. So I think it's time to adopt some QA processes! I did do some experiments at the beginning of the summer that produced listings of the most common errors, but they convinced me that there were too many problems for the Repository Manager to deal with alone. The design principle for our school repository is "low cost/low impact", so how to do QA without an editorial team? Ultimately, it has to be by sticking with the "self deposit" model and forcing (encouraging? mandating?) the authors to fix their own deposits. The model that I am adopting is that the repository manager runs a program that identifies a variety of metadata problems and inconsistencies, it generates a set of emails that are sent to notify the authors of the problems that they need to address, and the authors go and fix them.
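To make that model concrete, here is a rough Python sketch of the idea. It is not the program I actually ran: the record layout (a dictionary with keys such as "documents", "open_access", "status" and "author_emails") and the function names are assumptions made up for this illustration, and the checks are only the kinds of problem mentioned in this post.

```python
from collections import defaultdict


def find_problems(record):
    """Return a list of human-readable problems for one metadata record."""
    problems = []
    if not record.get("documents"):
        problems.append("no full text uploaded")
    elif not record.get("open_access", False):
        problems.append("full text is not open access")
    status = record.get("status")
    if status in ("submitted", "in press"):
        problems.append("publication status is still '%s'" % status)
    if status == "published" and not (record.get("journal") or record.get("conference")):
        problems.append("missing journal or conference name")
    if not record.get("author_emails"):
        problems.append("no local authors identified by email")
    return problems


def build_emails(records):
    """Group problems by recipient and return {email address: message body}."""
    per_person = defaultdict(list)
    for record in records:
        problems = find_problems(record)
        if not problems:
            continue
        # Notify the depositor as well as any authors identified by email.
        recipients = set(record.get("author_emails", [])) | {record["depositor"]}
        for address in recipients:
            per_person[address].append((record["id"], record["title"], problems))
    emails = {}
    for address, items in per_person.items():
        lines = ["Please check the following deposits:", ""]
        for eprint_id, title, problems in sorted(items, key=lambda item: item[0]):
            lines.append("%s: %s" % (eprint_id, title))
            lines.extend("  - " + problem for problem in problems)
        emails[address] = "\n".join(lines)
    return emails
```

Each person gets a single message listing all of their problem items, rather than one email per problem, which seems more in keeping with "low cost/low impact". Actually sending the messages is left out of the sketch.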

It all sounds very easy, but the proof of the pudding will be in the eating. Will everyone ignore or overlook or "fail to prioritise" my requests? I have the chair of the Research Committee backing me up as a signatory to the emails, so hopefully that will add some weight. I am halfway through the first run-through - I write this entry while pausing before hitting the Enter key to send off all the emails.

Before I screw up the courage to hit "Enter", let me give you a breakdown of the situation. Firstly, I am concentrating on items deposited last year (2006) because they should have had time enough to go through any extant parts of the publication cycle. In 2006 there were 1128 items deposited in the repository. Of those, 475 don't have any full text associated with them and 111 have full texts but aren't set as Open Access. 56 items don't have any ECS authors identified by their email addresses. 21 are still marked as 'submitted' and 119 as 'in press'. None of the published items have completely missed off their conference or journal name. There are 782 entries in my "problems" file referring to 700 eprints (or 62% of the year's total). I am sending emails to 239 individuals who are depositors or authors of these items.

Just a word about why these problems exist. If this were a "preservation repository" then the items would be collected after publication. There would only be one version - the final version - and once submitted and approved there would be no changing of the metadata or data. By contrast, this is an open access repository of live research - items are deposited before publication, while still works in progress. Consequently the metadata and accompanying data/documents change throughout the publishing lifecycle. In particular, a piece may be submitted to one journal then another. Once accepted the version may become fixed, but the publication details (volume, issue, page numbers) may not be known for 6-12 months. Consequently, QA processes for an eprint need to be regularly revisited and not just performed as a one-off on ingest.

I know that I'm going to need to extend the range of my problem identification, but that can be done later. I can always send out other requests to fix eprints! And for those problems that are too subtle for a program to spot, I can add a button to each metadata page that says "Report problems of missing or incorrect bibliographic information".

I'm going to press the button now. Wish me luck!

Monday, 10 September 2007

Decommissioning Repositories

I had an interesting discussion with Chris (our repository technical guy) today. We host a couple of repositories in ECS that have been used by long-term projects which have now ended. A repository was appropriate to create for them because both were multi site projects - one UK project with six collaborating universities and one EU one with dozens and dozens of partners. Each repository formed a useful way of collecting project outputs and other publications that were relevant to the project's goals, and because each project was relatively long-lived (six years or so) then they were thought of as autonomous quasi-organisations in their own right. And for that very reason, the anticipated contents would not have fitted into a single institutional repository - the majority of course coming from the other institutions!

But now the party's over, there is no more funding, and none of the partner institutions has offered to keep the repository going in perpetuity. Not even the hosting institution or the ex-manager wants to keep their repositories going. We know that even if we don't turn them off their hosting hardware will fail in a few years. That sounds like very bad news because a repository is supposed to be forever! Was it irresponsible to create these repositories in the first place? Should it be forbidden to create a public repository whose life is guaranteed to be less than a decade? Or perhaps that should be factored into the original policy-making - "this repository and all its contents are guaranteed up to 31st December 2017 but not after". If that were machine readable then the community could have decided whether they want to mirror the collection, or selected bits of it.

However, an easier solution appears to be at hand (or at least for EPrints). A repository has two functions - (a) collection / management of information by registered users and editors and (b) dissemination of that information to all and sundry. Once a repository is decommissioned and its managers and depositors have ceased to use it then the former activity ceases, but the latter can go on in perpetuity. A static website is much easier to run than a repository - it is just a set of files, overseen by a web server instead of a database and a hundred active Perl / Java classes. The dissemination (public) part of the repository can be turned into a static website and simply grafted on to the hosting institution's static web space (using an Apache virtual host to keep the URLs identical).

To activate this change, the EPrints repository template needs to be edited to delete all reference to "logging in" or "dynamic site searching" and then all of the static pages need to be regenerated to use the new template. Once that has happened, the repository's 'html' and 'documents' subdirectories can just be transferred to a new web server. The URLs will all be retained intact, the metadata and documents will all be retained intact, the 'collections' will all be retained intact (e.g. view by research group, view by project, view by subject or view by year) and to an external user the repository will look and act much the same.
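Before copying the regenerated pages anywhere, it would be worth checking that no page still points at the dynamic parts of the site. Here is a small Python sketch of such a check. It assumes that the dynamic scripts all live under a "/cgi/" path, which is my assumption about the installation and should be adjusted to match it; the function name is also made up for the example.

```python
import os


def pages_with_dynamic_links(root_dir, markers=("/cgi/",)):
    """Return the static HTML pages under root_dir that still mention a marker."""
    hits = []
    for folder, _subdirs, files in os.walk(root_dir):
        for name in files:
            if not name.endswith((".html", ".htm")):
                continue
            path = os.path.join(folder, name)
            # Tolerate odd encodings: we only need a substring search.
            with open(path, encoding="utf-8", errors="replace") as handle:
                text = handle.read()
            if any(marker in text for marker in markers):
                hits.append(os.path.relpath(path, root_dir))
    return sorted(hits)
```

Running it over the 'html' directory should return an empty list once the template has been cleaned up and the pages regenerated; anything it does list is a page that would show a broken login or search link on the static copy.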

Only two extra considerations are left - firstly an OAI-PMH static file will have to be generated by the old repository for its holdings to still be usable by OAI services in their new location. But more importantly, the hosting institution should consider establishing some light-touch policies for this repository "fossil" - especially with regard to continuing preservation of access and preservation of the documents.

According to the Veterinary Dictionary on answers.com, 'senescence' is the "depression of body functions as part of the process of growing old." I think this accurately describes the process outlined above, so I shall start referring to repository senescence. Of course, this procedure could be applied to a live repository, in order to create a static copy for distribution by CD. This would be ideal for conference / workshop series or electronic journal publishers.

NB The procedure applied above is easy to achieve with EPrints precisely because it was designed to eliminate processing load on the server by making as much of the repository as possible servable as pre-generated static web files.

NNB An alternative approach might be to import all the holdings from the project repository into the institutional repository. But since each of these projects consisted of so many partners, most of the contents would fall outside the collections policy of the host IR. Very few IRs actually want a collection of material that is 95% created by other institutions, and nor do the 95% of authors want to see their work bolstering another University's profile! Various IRs have managed to square the circle when it comes to articles of journals that they publish in their presses, but it seems for now that project partnership is a different relationship.

Thursday, 6 September 2007

Metadata Planning

After a couple of weeks' holiday, I've been spending some time developing some programs to help visualise the metadata that is used by a repository. There are two requirements - one is just to see all the fields that users are being asked to enter for the various kinds of deposits. That is useful for planning a repository, or just keeping an eye on the usefulness of what you planned some years ago. For example, our repository has a "comments" field that is supposed to be used to provide feedback on any problems that were encountered depositing each article. I don't think that it has ever been used!

The second requirement is to see an overview of the process that depositors have to walk through. Because EPrints allows the deposit workflow to be customised according to the attributes of the deposited item or the attributes of the depositing user then understanding the interplay between the various conditions can be quite taxing! Especially when the official workflow document is expressed in XML.
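To give a flavour of the kind of program I mean, here is a minimal Python sketch that flattens a workflow description into a table of fields and the conditions under which they appear. The XML in SAMPLE is a simplified, made-up format that is only loosely inspired by a deposit workflow; it is not the real EPrints workflow schema, and the element and attribute names are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified workflow description (not the real EPrints schema).
SAMPLE = """
<workflow>
  <stage name="core">
    <field ref="title" required="yes"/>
    <field ref="creators" required="yes"/>
    <if test="type = 'article'">
      <field ref="publication"/>
      <field ref="volume"/>
    </if>
    <if test="type = 'conference_item'">
      <field ref="event_title"/>
    </if>
  </stage>
  <stage name="subjects">
    <field ref="subjects"/>
  </stage>
</workflow>
"""


def list_fields(xml_text):
    """Return rows of (stage, field, required, condition) for every field."""
    rows = []
    root = ET.fromstring(xml_text)

    def walk(node, stage, condition):
        for child in node:
            if child.tag == "field":
                required = child.get("required") == "yes"
                rows.append((stage, child.get("ref"), required, condition))
            elif child.tag == "if":
                # Nested conditions are joined with "and".
                test = child.get("test")
                walk(child, stage, test if condition is None else condition + " and " + test)
            else:
                walk(child, stage, condition)

    for stage in root.findall("stage"):
        walk(stage, stage.get("name"), None)
    return rows


for row in list_fields(SAMPLE):
    print(row)
```

Each row can then be dropped into an HTML table or fed to a diagramming tool. The real workflow files are more involved than this, so the sketch shows the approach rather than a working converter.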

You can see the first (tabular) listing at http://users.ecs.soton.ac.uk/lac/ecs.html and an extract of the second (diagram) at http://users.ecs.soton.ac.uk/lac/ecs.png

Thursday, 9 August 2007

Data in Repositories

Although I've been in several UK repository projects with bona fide hard scientists who have been investigating the use of repositories for storing data (JISC EBank UK, JISC R4L) I'm a bit of a newcomer to the practicalities of storing data in a repository. At one level it's an easy task - just upload a file and add some metadata - in other words it's a process indistinguishable from depositing a journal article. The difference is that humans can interpret the contents of "articles" whereas it is a lot more difficult to understand a spreadsheet or a data table, unless the creator has gone to considerable lengths to document it.

This was brought home to me when I wrote an article on evaluating repositories that was based on a huge spreadsheet of data that I had collected from a registry of repositories. I uploaded the spreadsheet to the repository, and then realised that it was almost useless because no-one else could interpret all the columns of data, let alone discern which columns were intermediate calculations and which were genuine "results". I have tried, on a number of occasions, to "document" spreadsheets so that there are different, self-explanatory regions, but it almost always comes down to the fact that I would be better off creating a new article that explains the spreadsheet.

So I am very interested to see that Apple have just released a new application that tackles exactly this issue - a spreadsheet that is constructed as a set of tables on a sheet of text and images. I have just ordered a copy, and I hope that it will make my job (as a repository user and manager) a bit easier!

Tuesday, 7 August 2007

Cobbling it Together, or How To Make a Slideshow from a Repository

Being a Computer Scientist, I tend to think of ways of achieving automated solutions to problems, but sometimes it just ain't the best way. When I began to think about ways of creating slideshows from PowerPoints stored in the repository (described a couple of entries ago), I imagined that I would get someone to write me a nice little program. But I realised that it's just as quick for me to do myself using the repository pages and Adobe Acrobat.

(a) identify all the relevant eprints with Search or Browse.
(b) drag each interesting PowerPoint link into a notebook (anything that will allow you to drag and drop a link - I used Google Notes)
(c) make sure that the page of notes containing links to PowerPoint files appears somewhere on the web (Google Notes creates a URL for your shared entries).
(d) Open Adobe Acrobat
(e) From the File Menu, choose "Create PDF from Web Page..."
(f) Type in the URL of the notes page
(g) Set the depth of the crawl to 2
(h) Click on the "Create" button
After a few minutes, Acrobat will have pulled in each of the PDF files into one long PDF file. It will also have created pages for each of the HTML links it followed.
(i) Use Document/Delete Pages... to get rid of the unwanted HTML pages.
(j) Set the Document's Initial View to "Full Screen" in the File/Properties... menu
(k) Set the Full Screen options to "Loop after last page" and "Advance every 5 seconds" in the Acrobat Preferences.

It involves a bit of messing around, but it is relatively quick while giving you lots of control. In a perfect world, EPrints would provide an "Export to PDF Slideshow plugin" I suppose!

Monday, 6 August 2007

Preserving the Past

This post is more about repository usage than repository management, but I think it allows me to reflect on my own usage and deposit practice and consider how I might need to support other lecturers similar to me.

Our repository has a fairly liberal accession policy - if you think it's a research output, it's in. This policy is flexible in all sorts of directions; for example, professors have recently taken to depositing clips of TV news programmes which mention their research. However, it has never been pushed towards the preservation agenda. Today I've been tidying my office - my first proper attempt since we moved to a new building with smaller offices in December - and I have finally got the chance to re-evaluate the contents of all those old box-files, last looked at 4 years ago during the previous move. I have discovered a set of CD-ROMs from one of my old PhD students, who has left me his thesis together with demonstrations and presentations of a handful of projects that he was working on. So I have come over all preservation-minded, and I'm wondering how to deposit this in the repository. A lot of it was never published, but it was demonstrated internally in a large multi-institutional project, and I am loath to forget it. He did put his thesis on the repository as soon as he graduated in 2002 (bless him!), but the rest needs examining. I think it's all screendumps, PowerPoints and web sites, and the dynamic websites that he worked on had published, static equivalents, so there is no issue about software emulation. Pity - I'd like to try out some of VMware's virtualisation mechanisms with EPrints.

Friday, 3 August 2007

More Mundane Work

Just so that you don't think that my life is all "wandering through labs" and "having lovely ideas about how to show off our research", I have got a list of edits to make to a professor's eprints. He has been complaining that his publications aren't being correctly categorised by the repository, and I assumed that there was some kind of bizarre bug that we were responsible for. But it turns out that he has just incorrectly filled out the "type" of each publication. So I promised to sort them all out for him - but to do that I'll have to try and find out all the proper conference details for each publication.

Why did I offer to do it for him? Why didn't I stick to my self-archiving principles? Did I mention that he was a professor and I'm not?

Creative Uses of a Repository

I was just walking through our lab this afternoon. The "lab" is the open plan area that the research staff and students inhabit in the Intelligence, Agents, Multimedia group. It is on the top floor of our new building, which enables me to say to visitors as they step out of the lift "Welcome to the IAM group - the highest level of abstraction in the School". (That kind of thing passes for wit in Computer Science circles.)

Back to the plot - as I was walking through the lab, I was struck by how many posters our researchers have accumulated from conference visits. They are stuck up all over the cubicle walls that separate out the different research areas. Some of these posters are looking a bit the worse for wear (they used to be up in the old building) but none the less they give a good impression of the research that this group has undertaken recently.

Now, as long as all these posters are actually stored in the repository, we could use them to provide a rotating display on a public plasma screen. There are an increasing number of these in various buildings all over the department. All it would take would be a little script that chooses a different PPT or PDF Poster Presentation from our eprints repository and displays it for 2 minutes before going on to the next poster. Each time it would choose a poster from a different author / discipline / research team. All I'd have to do is find a spare screen!
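The "little script" could be sketched roughly as follows. This is only an illustration, not something that exists in EPrints: it assumes the poster records have already been fetched from the repository as plain dictionaries, and the `team` key, the `rotation` and `run_display` names and the `show` callback are all made up for the example.

```python
import itertools
import random
import time
from collections import defaultdict

DISPLAY_SECONDS = 120  # the two minutes per poster suggested above


def rotation(posters, key="team", seed=None):
    """Yield posters endlessly, moving to a different team each time.

    `posters` is a list of dicts; `key` names the (hypothetical) field
    used to vary the display: author, discipline or research team.
    """
    rng = random.Random(seed)
    by_key = defaultdict(list)
    for poster in posters:
        by_key[poster[key]].append(poster)
    cycles = []
    for _, group in sorted(by_key.items()):
        rng.shuffle(group)                     # vary the order within a team
        cycles.append(itertools.cycle(group))
    # Round-robin over the teams, taking the next poster from each in turn.
    for cycle in itertools.cycle(cycles):
        yield next(cycle)


def run_display(posters, show, seconds=DISPLAY_SECONDS):
    """Call `show(poster)` for each poster, pausing between posters."""
    for poster in rotation(posters):
        show(poster)
        time.sleep(seconds)
```

Calling `run_display(posters, print)` would simply print each record in turn; a real display would replace `print` with whatever opens the PPT or PDF on the plasma screen. With only one team the rotation cannot alternate, so it just cycles through that team's posters.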

My list of things to do this summer isn't getting any shorter!

Monday, 30 July 2007

Importing Frustrations

It all sounds so easy: "just import it from BibTeX". But of course, the ACM's idea of what should go where in BibTeX doesn't fit with mine / my repository. So for example, my repository has an "Official URL" field to indicate where the "official publisher's version" (ahem) is to be found. The ACM (bless 'em) instead provide a "DOI" field. That's a straightforward enough mismatch of information and easy to work around, but to confuse matters they don't put a DOI in the DOI field, they put a URL there. The URL happens to be the URL of a DOI resolution service (their own) with the DOI stuck on the end. This (as it happens) is very easy for a human to use, but a bit of a pain for a service to interpret. Only a little bit of a pain, I hear you cry! But these import scripts are supposed to be little pieces of easy-to-write code that adapt a well-understood interop format to my database schema. Am I supposed to write a different BibTeX importer for each blooming publisher? Ick! Or am I to write a mega-disambiguation script that can understand what the data provider should have said?
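For the DOI case at least, the disambiguation is small. The sketch below is an assumption-laden illustration rather than the EPrints importer: the regular expression is a loose approximation of what a DOI looks like, `official_url` is a hypothetical repository field name, and the URL in the usage note is invented.

```python
import re

# Loose approximation of a DOI: "10.", a registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def extract_doi(value):
    """Return the bare DOI from a field holding a DOI or a resolver URL."""
    match = DOI_PATTERN.search(value.strip())
    return match.group(0) if match else None


def normalise_entry(entry):
    """Copy a BibTeX-style dict, tidying the DOI field.

    If the field held a resolver URL, keep that URL in a hypothetical
    `official_url` field and store only the bare DOI under `doi`.
    """
    record = dict(entry)
    raw = entry.get("doi", "").strip()
    doi = extract_doi(raw)
    if doi:
        record["doi"] = doi
        if raw.lower().startswith("http"):
            record.setdefault("official_url", raw)
    return record
```

Given an entry whose `doi` field is `http://doi.acm.org/10.1145/1234567.1234568`, `normalise_entry` would store `10.1145/1234567.1234568` as the DOI and keep the full URL as the official URL. It does nothing about the broader problem of each publisher having its own idea of BibTeX.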

Also, there's that little matter of the missing abstract, so I have to roll my own BibTeX by data scraping anyway. Roll on RDF! (But then of course you can make the same mistakes with RDF and all the hordes of Semantic Web technologists that you can with BibTeX.)

Or, do I just make do with whatever little scraps of help the importer does get right and manually enter the rest (using my army of self-archiving slaves)? What's the Zen thing?

Thursday, 26 July 2007

Getting Rid of Lots of Material

Sometimes an import from an external data source works technically, but you rather wish you hadn't done it. This happened to me yesterday when I tried to import details of all the new publications of our staff in the ACM digital library (ACM = scholarly and professional society for Computer Scientists and sundry technophiles). It can export each item to BibTeX, and EPrints imports happily from BibTeX. Yippee I thought! Unfortunately, the ACM do not include an article's abstract in their export, so this makes the mass deposit less useful than I thought.

But I didn't discover this until I had imported a batch of 20 items. Clicking each item, going to its Action page, pressing the "Delete" button and then *confirming* the delete left me without the will to live after dealing with only two items. (Very low pain threshold us academics - not like librarians who seem to be able to withstand banging their heads against the wall for years on end.)

Anyway, an attribute of Computer Science Geeks is that we would rather write a program capable of doing something 1000 times than actually do it 10 times. So I wrote a script called "BATCH" which allows me to delete arbitrary lists of eprints from any EPrints 3 repository - assuming that I have the correct login and password! In theory it would also allow me to do *anything* to that list of eprints, but I can't think of anything else that I would want to do. I'll sleep on it. Who knows, it might be useful to other repository managers.

Monday, 23 July 2007

Welcomed to the Community

I am proud to have been officially welcomed to the community of blogging repository managers by Caveat Lector. Although I don't think I'm up to saving anybody quite yet, I hope that we will see some more blogs from repository practitioners following in her footsteps.

So in response, let me thank her with these words/anagrams:
Dorothea Salo,
Solo Data Hero,
A Haloed Torso,
Has Loot - Adore!

Sunday, 22 July 2007

Bad News and Good News

The portfolio server went down sometime last week, and we realised on Thursday that it couldn't be resurrected. Luckily the disks (two RAID mirrored disks) were fine and so we could transplant them into another of our servers (Tim Brody's development server). Unfortunately he was away on holiday last week, but he had shared the root password with another EPrints developer, so everything worked out alright! Tim's going to have a bit of a shock tomorrow morning though.

The good news is that our exams officer says that the School policy on third and fourth year project reports and dissertations is that they are to be considered public material (after examination, of course). Hence, I am advised, we don't need to get individual permission from students if we want to host anything on a school repository. So I have spent some time this weekend uploading the highest scoring reports, presentations and posters onto portfolio. We will inform the students, of course, but not having to manage permission makes things a lot easier!

Of course, it's not all plain sailing. Try as I might, I can't turn the A1 PowerPoint posters into PDFs on my Mac. Goodness only knows what the problem is, so I am resorting to exporting them to PNG images instead.

Tuesday, 17 July 2007

Community Solutions!

I've just found out that my Institutional Repository counterparts (eprints.soton.ac.uk) have extended their EPrints installation to include a thing called a "Problem Buffer" that seems to do many of the QA things that I have been trying to do. I'm going to arrange for a demo!

I've known for a long time about their Problem Buffer, but I didn't realise that they had made it quite so sophisticated. I thought that it was just a 'dumping ground'! I'm always telling people to look around and learn from other repositories, and so I'm embarrassed to have been hoist by my own petard.

Saturday, 14 July 2007

Midnight Reflections

Like many repository managers, I have another job to do. Since the repository is only part of my work (and the School has certainly aimed to make sure that the repository is an important part of researchers' work without being a burden), I find myself working on it after hours or at weekends. There's just too much admin to do 9-5! This weekend my wife is away at the Larmertree Festival with our youngest daughter, so I have been able to devote some time to the repository this Saturday without guilt.

I thought I'd make a start on the QA (#2 on my list) and I've managed to put together some programs that address most of those topics. So I have some visual reports on potential duplicates, missing metadata fields and stalled publications. Chris has also run me up an EPrints plugin that allows me to embed an eprints metadata field input component into an ordinary, hand-generated web page (so that I can script up my own page designs that happen to include journal input boxes and the like). My original idea was that I would do all the metadata correction and editing manually. However, there are so many fields to correct in so many eprints that I really think that I need to go back to the self-archiving ideals and get the depositors to sort out their own mess.

So rather than clever batch editing, I think that I'll need to work on some methods for identifying specific problem records (e.g. missing journal titles) and then assigning them to the depositors/authors as tasks, and then getting the repository to track the users' progress against each of the tasks. A new kind of workflow - the user will see a message saying "please fix the following mistakes on this record" with the necessary input boxes embedded on the message. That'll make it nice and quick. And I will need to be able to track the status of all the 'repairs' that all the users have been asked to do. (Completed, in progress, outstanding, refused.)
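As a rough sketch of the bookkeeping this workflow would need (not an EPrints feature; the `RepairTask` class, the `assign_tasks` and `progress` helpers and the `eprintid` / `depositor` keys are all names invented for the illustration):

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    """The four repair states listed above."""
    OUTSTANDING = "outstanding"
    IN_PROGRESS = "in progress"
    COMPLETED = "completed"
    REFUSED = "refused"


@dataclass
class RepairTask:
    eprint_id: int
    depositor: str
    field_name: str                      # the metadata field to be fixed
    status: Status = Status.OUTSTANDING


def assign_tasks(records, field_name):
    """Create one task per record whose field is missing or blank."""
    return [
        RepairTask(r["eprintid"], r["depositor"], field_name)
        for r in records
        if not str(r.get(field_name) or "").strip()
    ]


def progress(tasks):
    """Count tasks by status, for a manager's overview."""
    return Counter(task.status for task in tasks)
```

A depositor's message page would then list their own outstanding tasks, and the manager's report would be `progress(tasks)` grouped however is useful (per depositor, per field). Where the tasks are stored and how the embedded input boxes update them is left open here.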

Some things will need to be handled by me. I have noticed, for example, that it is so common for abstracts to be cut and pasted with explicit line breaks (i.e. very short lines that don't reflow in a wider window) that it would be too onerous to expect the depositors to fix them properly.
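The line-break problem is one that a few lines of code can plausibly handle in bulk. A minimal sketch, assuming that a blank line in a pasted abstract marks a genuine paragraph break and that every other line break is an accident of the paste (the function name is made up):

```python
import re


def reflow_abstract(text):
    """Join hard-wrapped lines into paragraphs.

    Blank lines are kept as paragraph separators; all other runs of
    whitespace, including single newlines, collapse to one space.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)
```

This does not attempt to rejoin words that were hyphenated across a line break, so those would still need a human eye.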

Anyway, enough of this for now. It's midnight on a Saturday evening, I have the house to myself and I want to catch up on my unwatched sci-fi DVDs (Bicentennial Man, Battlestar Galactica and Revelation of the Daleks).

Tuesday, 10 July 2007

Some Background

Just for the record, I ought to explain a bit about the repository that I manage.

The repository contains about 10,000 records and gets about 600 new deposits per year. There are 2400 papers published since 2004, of which 1400 have open access full texts. I'll have more to say on this percentage later.

It started off life as a bibliography database and was migrated to EPrints in Spring 2000. Its use as a bibliographic record of all school output was already well established but a full text mandate was added in January 2003. The explicit aim has always been for 'light touch' repository management, with all eprints being self-deposited and no editorial workflow to check the metadata. For the last two years the role of 'repository manager' has been an official school administrative task, that is, one of those jobs that are assigned to academics to take on as part of their 30% admin. Up to now, the extent of my work has really been to generate termly reports for the Research Committee that summarise the deposits made by each individual in the school and their compliance with the mandate. This is half a day's effort, three times a year or 0.68% FTE.

I also work with the repository administrator, who is also our webmaster. He keeps an eye on the repository, managing backups and fixing occasional problems (estimated "a couple of hours per month" or 1.5% FTE). He also runs about six more repositories on the same basis (on the same server) for various of the school's EU and UK funded projects. The server in question is a five-year-old PC running Red Hat Linux 7.3 - it was not particularly powerful at the time.

In the steady state, therefore, once you have got your repository up and running and everyone is used to using it, you can see that the resource requirements for a school repository are not onerous!
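For anyone wanting to redo the sums for their own repository, here is one way the FTE figures can be reproduced. The working year of 220 days and the 7.5-hour day are my assumptions, chosen because they give the percentages quoted; they are not stated anywhere above.

```python
WORKING_DAYS_PER_YEAR = 220   # assumption, not stated in the post
HOURS_PER_DAY = 7.5           # assumption, not stated in the post


def fte_percent(hours_per_year):
    """Express a yearly effort in hours as a percentage of one full-time post."""
    return 100 * hours_per_year / (WORKING_DAYS_PER_YEAR * HOURS_PER_DAY)


# Repository manager: half a day's effort, three times a year.
manager = fte_percent(3 * 0.5 * HOURS_PER_DAY)

# Repository administrator: "a couple of hours per month".
administrator = fte_percent(2 * 12)

print(f"manager: {manager:.2f}% FTE, administrator: {administrator:.1f}% FTE")
```

Under those assumptions the two figures round to 0.68% and 1.5% respectively, matching the estimates given.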

Having said that, we have just bought (but not commissioned) a new server and we have just spent some significant time configuring the repository to handle our RAE returns in exactly the way that we wanted. Although EPrints has a module for RAE support, our Head of School (Research) had very specific requirements for handling the data. So I haven't included that as a repository expense per se.

Monday, 9 July 2007

And Another Thing

I expect that there will be lots of additions to the list while I try and get my brain (and office) in order.

Analyse use of the ECS repository: the most popular eprints in our repository are downloaded 200 times per day, whereas the more normal rate is 2 or 3 times per day. I would like to use the tools that we have developed for the JISC IRS project to investigate the reasons behind the recorded download profiles.

Sumer Is Icumen In

All the exam boards have passed and the summer dawns on me and my academic colleagues. I am now old enough and wise enough to know that the apparently enormous stretch of free space in my diary will only allow me to accomplish two or three things before the evenings start drawing in and freshers' week arrives. I'd really like to get some things done on our repository, and I'm trying to make a list.

1. Migrate the repository (now seven years old) into EPrints v3.

2. Set up some QA procedures for the repository. Since the staff don't want to have any editorial oversight, we need to do this post hoc. It can all be done manually, but I'd like to have some help for (a) identifying potential duplicates (b) checking for missing full texts (c) checking for items that have been 'submitted' or 'in press' for more than a year (d) looking for missing metadata (e.g. page ranges).

3. Set up an automatic alert / deposit startup from sources like the ACM and IEEE digital libraries, so that I can regularly find out what new things have been published in recent journal issues and conference proceedings that haven't been deposited into our repository.

4. Set up a new student repository as an e-portfolio for undergraduate and masters coursework and related activities.

5. Pre-deposit the best third and fourth year project reports and presentations into the students' individual work areas on the repository.

6. Have a big publicity push at Graduation, and get the students to sign permission forms for us to make the above work public.
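The four QA checks in the list above, (a) to (d), are simple enough to sketch. This is only an illustration over records held as plain dictionaries: the key names (`eprintid`, `title`, `status`, `deposited`, `pages`, `full_text`) are hypothetical rather than the EPrints schema, and matching duplicates on a normalised title is a deliberately crude stand-in for anything cleverer.

```python
import re
from collections import defaultdict
from datetime import date


def title_key(title):
    """Lower-case a title and strip punctuation so near-identical titles match."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", title.lower()).split())


def potential_duplicates(records):
    """(a) Group record ids that share a normalised title."""
    groups = defaultdict(list)
    for r in records:
        groups[title_key(r.get("title", ""))].append(r["eprintid"])
    return [ids for key, ids in groups.items() if key and len(ids) > 1]


def stalled(records, today, max_days=365):
    """(c) Records still 'submitted' or 'in press' more than a year after deposit."""
    return [
        r["eprintid"]
        for r in records
        if r.get("status") in ("submitted", "in press")
        and (today - r["deposited"]).days > max_days
    ]


def missing(records, field):
    """(b) and (d) Records where a field, e.g. full_text or pages, is empty."""
    return [r["eprintid"] for r in records if not r.get(field)]
```

Each function returns a list of record ids (or groups of ids) to review, for example `stalled(records, date.today())`, so the output could feed a report page or a list of repair requests equally well.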

I've already got a lot of the work for #1 done thanks to Chris Gutteridge's effort over the last couple of months, but hopefully we can finish this off soon. I've also done a lot of work on #4 myself, so that we had something to show the students when their results were published. (See portfolio.ecs.soton.ac.uk for the new repository and also the poster we put up next to their degree results.)