Thursday, 7 February 2008

Pride and Prejudice

Early on in my Open Access career I learned to Always Listen To Librarians. I was taught this lesson by the formidable but fabulous team of Pauline Simpson and Jessie Hey who ran the JISC TARDiS project that developed the Southampton Institutional Repository. As a computer scientist I originally thought that I knew everything about digital information management. Now it's not that I think that librarians are always right, or that they are always more right than other groups with a stake in OA, but they do have a lot of experience in managing lots of information sources on behalf of a disparate community of users. They have "form". Or "previous" as you used to hear in TV cop shows. And you ignore them at your peril!

So Caveat Lector is a daily read of mine. I understand where the writer is coming from - repository management is not yet a well-funded, well-supported or well-understood profession, and few repositories have the luxury of a whole team of professionals to dance in attendance on them. Or a single professional, as it happens. As a repository developer and open access advocate I LIKE to hear praise about how good repositories are, but I NEED to hear criticism about how much they suck, and Caveat Lector isn't afraid of offering up some well-thought-out criticism on occasion.

Aside: if you talk to Chris Gutteridge in the bar at OR08 he'll tell you that he thinks that our slogan should be "EPrints: we suck less". It's one of those open source developer attitudes, but I'm not sure that I'll be committing it to any T-shirts just yet :-)

I read today's entry on "taking name entries that are obviously for the same person and making sure they have a single representation" and I'm taking it to heart. Look at any repository that has a "list by author" view, and you'll find that you don't need to go far down the first page before you see multiple entries for the same author. Not just DSpace repositories (no finger pointing here) but EPrints, Fez and Vital too. More about EPrints below, but back to the issues that this posting raises.

Firstly, Quality Assurance. All repository managers need to check over their repository, and it's not a task that has been made particularly easy by the repository software. Checking author names (or journal names, or conference names, or any entities) for consistency is a great example of something laborious where escaping from the user interface to the underlying storage layer may actually be a relief. Wow. Doesn't that say something about our software? And I have to put my hands up (another cop show thing) because it's as true in EPrints for that particular task as it is in DSpace. It's a feature that is on the EPrints v 3.1 list of things to do, so I hope to be able to announce some progress at OR08, but at the moment it's a fair cop guv.

But secondly, why bother? As CL puts it: "When the rubber meets the road, libraries don’t think IRs are important enough to waste even a smidgen of authority-control effort on." CL puts this down to pride in the repository, but I'd like to suggest it's much more significant than that. In fact, this is a huge, enormous, 3-lane motorway pileup of an issue for an institutional repository - until it's dealt with, no-one's going to be going anywhere down that road. Why? Because your institutional repository becomes institutional when it is embedded into the institution, and that means making it useful to the institution, and that means making it do things that the institution (rather than just the faculty) wants. And the first thing that the institution (read managers, administrators, marketeers etc) wants is lists - what papers are attributable to this person, research group, department, school, project? And the most fundamental part of that is the ability to accurately and authoritatively deliver a list of items attributable to an individual (from there everything else is aggregation).

How do you do that? You can't escape the fact that every local author has to have an id. It can be an email or a staff number, or anything you like, but as well as being unique it has to be persistent to avoid problems with staff changes. When we were creating the Southampton IR with Pauline and Jessie, we got the pilot version wrong because we avoided adding staff ids - they just looked like too much hard work for the depositors. However, we quickly got back the message that what everyone wanted was up-to-date lists of publications - faculty wanted them for their CVs, departments wanted them for their web pages, and the admin staff wanted them for their never-ending form filling. If you just rely on author names as they are entered (or even as they appear in the published item) then each author appears as 4, 5 or 6 different names and, worse, multiple authors appear as the same individual.

That's why all person names in EPrints (whether authors, editors, lyricists, accompanists, experimenters or anything else) are now a compound of (a) title, (b) given name, (c) family name, (d) lineage (e.g. Jr. or III) and (e) id. And that's also why EPrints now has autocompletion and name authority lists, so that the id can be entered without imposing a burden on the depositor.
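To make the point about ids concrete, here is a minimal sketch in Python. It is not EPrints code and does not reflect the actual EPrints schema or API: the class name, field names, id format and sample records are all hypothetical, chosen only to mirror the five-part name described above. It shows name variants collapsing onto one person when they share an id, and identical names staying separate when the ids differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PersonName:
    # Hypothetical field names mirroring the five parts of a compound name.
    family: str
    given: str
    title: Optional[str] = None
    lineage: Optional[str] = None  # e.g. "Jr." or "III"
    id: Optional[str] = None       # persistent local identifier (assumed format)


def publications_by_person(records):
    """Group item titles by creator id, falling back to the name as entered."""
    grouped = defaultdict(list)
    for item_title, creators in records:
        for person in creators:
            if person.id is not None:
                key = person.id
            else:
                # Without an id, all we can do is trust the string as typed.
                key = f"name:{person.family}, {person.given}"
            grouped[key].append(item_title)
    return dict(grouped)


# Invented sample data: two spellings of one author, plus a namesake.
records = [
    ("Paper A", [PersonName("Smith", "John", id="staff-001")]),
    ("Paper B", [PersonName("Smith", "J.", id="staff-001")]),
    ("Paper C", [PersonName("Smith", "J.", id="staff-002")]),
    ("Paper D", [PersonName("Jones", "Ann")]),
]

by_person = publications_by_person(records)
for key, titles in by_person.items():
    print(key, titles)
```

Keeping the name exactly as entered alongside the id means the list is built from the id while the display name can still follow the published form, which is one way to read the policy choice between the two.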

Back to the title of this posting: in case you hadn't realised, "pride" was the title of CL's posting. And "prejudice" describes my original attitude towards librarians. But it is also a challenge to repository managers who are from the library community - are you prejudiced into seeing the Institutional Repository as Library Property? A Library Plaything? Or a core Institutional Service?


  1. Garrrrrh, you got me. I was about to go down the road of pointing out everyone's interest in consistent author naming.

    It matters to by-author statistics gathering, too.

  2. There's an argument for *not* having consistent author naming, especially if the name is inconsistent in published sources. Should your repository metadata represent the published data as it exists, or should it reflect the internal data acquisition of your institution? I think it's a policy thing. If you've got ids you can play it either way.