Monday 11 February 2008

New Requests: QA and Citation Counting

I am being pushed by the head of research committee to have the repository send out more QA alerts to all the self-depositing users. Yes, they really do want to be prompted about problems with their metadata! I'm meeting with Chris G to try and decide the best way to do this for everyone, but I think that some of the experiments we tried last summer (see previous blog postings on QA) will help us produce a sleek user interface for the end-users.

I am also being pushed into responding to the national obsession with research metrics by adding citation counting and tracking to EPrints. After Christmas I managed to produce some demo scripts to track the citations of repository holdings using Google Scholar, but they got wiped out in my January Laptop Disk Crash (not to be confused with the February one). I'm delegating the rewriting of the scripts (hey, I'm a senior lecturer!) but things are moving so fast in the UK that they will need to see prime time very quickly!

Saturday 9 February 2008

Let's Do the TimeWarp

One of the reasons I believe in the Preservation ideal is that as a mid-career researcher, I have become very aware of the temporary and unreliable nature of my own personal IT infrastructure. Both the hardware and organisational support offered to help manage my intellectual journey (pretentious? moi?) are totally inadequate. I've just gone through my third hard disk in three months, and each time I've ended up with a period of splintered emails, diary entries, papers, proposals in different folders, using different applications on borrowed machines while the "Support" team try to diagnose and fix my hardware.

I keep going through these processes every few years - stolen laptops, broken laptops, borrowed laptops, new computers that I don't quite have time to transfer all my old environment over to. It just takes so much time, effort AND CONCENTRATION. Juggling backups from various periods, trying to reconcile duplicate files and remember what is on which machine. You never discover you've failed until 6 months later when you look for a document that you wrote 3 years ago on "just this topic" and it's not there - the whole project is missing. Arrgh!

So I (and a whole bunch of my colleagues) have quite fallen in love with Apple's Time Machine software that just creates daily snapshots of your hard disk, and allows you to browse through your hard disk backwards through history. It's like the Wayback Machine, but with an interaction paradigm that someone has actually thought about. It's very effective. And now there's this new wireless hard disk (the Time Capsule) that allows your machine to be backed up, automatically, without even having to plug the backup drive in. Fantastic!

For the first time in history, I'm seeing my colleagues get excited by backups. It was always such a tedious obligation before, and most people didn't do it very often. Certainly not on their laptops, for which our Support team disclaim all responsibility. And now, it can just happen, without thinking about it.

So perhaps, this is how we should make repositories work. Don't ingest individual, exquisitely formed digital items, complete with metadata and licensing information. Just ingest the whole flipping hard disk, offering at least a backup service. As Caveat Lector recently pointed out, everyone wants backup so everyone would use the repository. I was dubious at the time, but as you can see, the idea is growing on me.

So perhaps we ought to augment the OAIS model of the repository which (paraphrased) says that the repository is like a digestive system: stuff goes in one end, gets stored in the middle and then goes out the other end. (They even use the term "ingest", so don't tell me that the metaphor wasn't on their minds.) I'd like to tweak this model to be more like a cow, with multiple stomachs, each of which has a different task in the overall digestion process.

A hard disk (i.e. a computer file system) goes in and gets stored for backup. Multiple versions are handled over time, so the whole history of the data contents is available. Only to the owner, of course: at this stage the contents are opaque to the repository staff. This level of privacy would need to be strictly enforced to make users feel happy about entrusting their files to a third party. At this stage, the benefit is all to the user - they have backup.

In the next stage, the user can "break down" the file system into important components - folders for projects, experiments, papers, proposals, lecture courses etc. Important individual documents can be identified. Metadata can be inferred by looking at the relationship between the low-level items (files) and the high level structures (folders and directories). File names, office document metadata, file system metadata, file contents, proximity to other files and their contents can all help to profile the individual items and ease the task of metadata entry.
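To make that a bit more concrete, here is a minimal sketch of the kind of inference I have in mind, using nothing but a file's name and the folders it sits in. Everything in it is an assumption for illustration: the `profile_file` function, the hint names and the extension-to-kind table are made up, and a real version would also look inside the files and at the file system timestamps.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to a rough kind of document.
KIND_BY_SUFFIX = {
    ".doc": "manuscript",
    ".ppt": "slides",
    ".pdf": "published copy",
}


def profile_file(path):
    """Guess metadata hints for one file from its name and enclosing folders."""
    p = PurePosixPath(path)
    return {
        "path": path,
        # The file name, minus extension and separators, is a candidate title.
        "title_hint": p.stem.replace("_", " ").replace("-", " "),
        # The extension suggests what sort of item this is.
        "kind_hint": KIND_BY_SUFFIX.get(p.suffix.lower(), "unknown"),
        # The nearest folder often names the project, journal or course.
        "collection_hint": p.parent.name,
        # Every enclosing folder gives a little more context.
        "context_hints": list(p.parent.parts),
    }


hints = profile_file("papers/JournalOfFoo/open_access-repositories.doc")
print(hints["title_hint"], "|", hints["kind_hint"], "|", hints["collection_hint"])
```

These are only hints to pre-fill a form; the user still gets the final say on every field.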

In the next stage, the user can organise or "map" the important components from the disk (above) into a set of entries in the repository. (e.g. all the doc/ppt and pdf files from this folder to go into a single eprint whose title comes from the name of the word file and whose journal name comes from the folder name).
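The example mapping in brackets could be sketched roughly like this. It is an illustration under assumptions, not EPrints code: `folder_to_eprint` and the `title`, `journal` and `documents` keys are hypothetical names, and the rule for picking the word file (first one alphabetically) is just a placeholder.

```python
from pathlib import PurePosixPath

# File types that the mapping rule pulls into the eprint.
DOCUMENT_SUFFIXES = {".doc", ".ppt", ".pdf"}


def folder_to_eprint(folder, file_names):
    """Map one folder of files to a single draft repository entry."""
    documents = sorted(
        name for name in file_names
        if PurePosixPath(name).suffix.lower() in DOCUMENT_SUFFIXES
    )
    word_files = [
        name for name in documents
        if PurePosixPath(name).suffix.lower() == ".doc"
    ]
    return {
        # Title comes from the name of the word file, if there is one.
        "title": PurePosixPath(word_files[0]).stem if word_files else None,
        # Journal name comes from the name of the folder.
        "journal": PurePosixPath(folder).name,
        # All the doc/ppt/pdf files travel together in the one entry.
        "documents": documents,
    }


draft = folder_to_eprint(
    "papers/Journal of Examples",
    ["talk.ppt", "My Paper.doc", "notes.txt", "My Paper.pdf"],
)
print(draft)
```

The result is only a draft entry: it still has to go through the normal ingest stage before anyone else sees it.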

Then we get to the normal ingest stage, where the metadata can be checked and improved and all the normal processes can go on.

Perhaps this is just the hysteria of the marking season (I've still got to mark 65 students' XML and XSLT files before the end of the weekend). Or perhaps it's a strange state of mind that comes from living on a borrowed iMac in the spare room until I get my proper laptop back and all its files restored. But it might satisfy the need to get the repository closer to the user, and encourage the greater use of the repository for preservation and open access.

Thursday 7 February 2008

Pride and Prejudice

Early on in my Open Access career I learned to Always Listen To Librarians. I was taught this lesson by the formidable but fabulous team of Pauline Simpson and Jessie Hey who ran the JISC TARDiS project that developed the Southampton Institutional Repository. As a computer scientist I originally thought that I knew everything about digital information management. Now it's not that I think that librarians are always right, or that they are always more right than other groups with a stake in OA, but they do have a lot of experience in managing lots of information sources on behalf of a disparate community of users. They have "form". Or "previous" as you used to hear in TV cop shows. And you ignore them at your peril!

So Caveat Lector is a daily read of mine. I understand where the writer is coming from - repository management is not yet a well funded, well supported or well understood profession, and few repositories have the luxury of a whole team of professionals to dance in attendance on them. Or a single professional, as it happens. As a repository developer and open access advocate I LIKE to hear praise about how good repositories are, but I NEED to hear criticism about how much they suck, and Caveat Lector isn't afraid of offering up some well-thought-out criticism on occasion.

Aside: if you talk to Chris Gutteridge in the bar at OR08 he'll tell you that he thinks that our slogan should be "EPrints: we suck less". It's one of those open source developer attitudes, but I'm not sure that I'll be committing it to any T-shirts just yet :-)


I read today's entry on "taking name entries that are obviously for the same person and making sure they have a single representation" and I'm taking it to heart. Look at any repository that has a "list by author" view, and you'll find that you don't need to go far down the first page before you see multiple entries for the same author. Not just DSpace repositories (no finger pointing here) but EPrints, Fez and Vital too. More about EPrints below, but back to the issues that this posting raises.

Firstly, Quality Assurance. All repository managers need to check over their repository, and it's not a task that has been made particularly easy by the repository software. Checking author names (or journal names, or conference names, or any entities) for consistency is a great example of something laborious where escaping from the user interface to the underlying storage layer may actually be a relief. Wow. Doesn't that say something about our software? And I have to put my hands up (another cop show thing) because it's as true in EPrints for that particular task as it is in DSpace. It's a feature that is on the EPrints v 3.1 list of things to do, so I hope to be able to announce some progress at OR08, but at the moment it's a fair cop guv.
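For the curious, the sort of check I mean looks something like the sketch below. It is a hedged illustration only: the names are invented, `name_key` and `suspect_duplicates` are hypothetical helpers rather than anything in EPrints or DSpace, and it assumes names have already been exported as "Family, Given" strings.

```python
from collections import defaultdict


def name_key(name):
    """Reduce 'Family, Given' to (family, first initial), ignoring case and dots."""
    family, _, given = name.partition(",")
    given = given.strip().replace(".", "")
    return (family.strip().lower(), given[:1].lower())


def suspect_duplicates(names):
    """Group name strings that may be variants of the same person."""
    groups = defaultdict(set)
    for name in names:
        groups[name_key(name)].add(name)
    # Only groups with more than one spelling are worth a human look.
    return {key: sorted(variants) for key, variants in groups.items()
            if len(variants) > 1}


names = ["Smith, John", "Smith, J.", "smith, john a.", "Jones, Mary"]
print(suspect_duplicates(names))
```

Note that this only produces candidates for a person to review. Two different people who share a family name and an initial end up in the same group, so it cannot do the merging by itself.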

But secondly, why bother? As CL puts it: When the rubber meets the road, libraries don’t think IRs are important enough to waste even a smidgen of authority-control effort on. CL puts this down to pride in the repository, but I'd like to suggest it's much more significant than that. In fact, this is a huge, enormous, 3-lane motorway pileup of an issue for an institutional repository - until it's dealt with, no-one's going to be going anywhere down that road. Why? Because your institutional repository becomes institutional when it is embedded into the institution, and that means making it useful to the institution, and that means making it do things that the institution (rather than just the faculty) wants. And the first thing that the institution (read managers, administrators, marketeers etc) wants is lists - what papers are attributable to this person, research group, department, school, project? And the most fundamental part of that is the ability to accurately and authoritatively deliver a list of items attributable to an individual (from there everything else is aggregation).

How do you do that? You can't escape the fact that every local author has to have an id. It can be an email or a staff number, or anything you like, but as well as being unique it has to be persistent to avoid problems with staff changes. When we were creating the Southampton IR with Pauline and Jessie, we got the pilot version wrong because we avoided adding staff ids - they just looked like too much hard work for the depositors. However, we quickly got back the message that what everyone wanted was up-to-date lists of publications - faculty wanted them for their CVs, departments wanted them for their web pages, and the admin staff wanted them for their never-ending form filling. If you just rely on author names as they are entered (or even as they appear in the published item) then each author appears as 4, 5 or 6 different names and, worse, multiple authors appear as the same individual.

That's why all person names in EPrints (whether authors, editors, lyricists, accompanists, experimenters or anything else) are now a compound of (a) title, (b) given name, (c) family name, (d) lineage (e.g. Jr. or III) and (e) id. And that's also why EPrints now has autocompletion and name authority lists, so that the id can be entered without imposing a burden on the depositor.
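Here is a small sketch of why the id matters, using the five components listed above. It is not the EPrints data model or API: the `PersonName` class, the `publications_by_id` function, the record layout and the sample ids are all assumptions made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PersonName:
    """A compound name: title, given name, family name, lineage and id."""
    family: str
    given: str
    title: Optional[str] = None
    lineage: Optional[str] = None   # e.g. "Jr." or "III"
    id: Optional[str] = None        # unique and persistent, e.g. a staff number


def publications_by_id(records):
    """List item titles per person id, however the name was typed."""
    by_id = defaultdict(list)
    for record in records:
        for creator in record["creators"]:
            if creator.id is not None:   # external authors may have no local id
                by_id[creator.id].append(record["title"])
    return dict(by_id)


records = [
    {"title": "Paper A",
     "creators": [PersonName("Smith", "John", id="staff-001")]},
    {"title": "Paper B",
     "creators": [PersonName("Smith", "J.", title="Dr", id="staff-001"),
                  PersonName("Smith", "Jane", id="staff-002")]},
]
print(publications_by_id(records))
```

"Smith, John" and "Smith, J." land in the same list because they share an id, while "Smith, Jane" stays separate even though a name-only comparison would struggle to tell her apart from "Smith, J.".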

Back to the title of this posting: in case you hadn't realised, "pride" was the title of CL's posting. And "prejudice" describes my original attitude towards librarians. But it is also a challenge to repository managers who are from the library community - are you prejudiced into seeing the Institutional Repository as Library Property? A Library Plaything? Or a core Institutional Service?