RepositoryMan: January 2009

Friday, 30 January 2009

Copyright? Got Chirpy

At Friday morning's IR steering meeting the topic was copyright policy, and I was prepared to bale out early using marking as an excuse. However, to my enormous surprise the 90-minute discussion was gripping, inspiring and useful. I think that we have found a way to talk positively about copyright within the institution's mission and business objectives! Hurrah!

Wednesday, 28 January 2009

Using the EPrints Commandline Toolbox

The so-called "EPrints Toolbox" (bin/toolbox) allows the administrator to read and change data in EPrints from the command line. It is useful for people like me who aren't Perl programmers (the shame!).

I haven't had a chance to use it much, because it turns out I can often do what I want through the batch editor, but I found it very useful this morning to update some publication data.

The background is that I am helping run a conference (WebSci09), for which all the submissions have been handled by a web service called "EasyChair". I scraped all the submission data for the accepted papers and posters from the EasyChair web pages and turned them into an EP3 XML file which I then imported into an existing, subject-based EPrints repository. A few days after having done that, I realised that it would have been nice to import the affiliation and location data that EasyChair maintains for each of the authors. So I added "location" and "affiliation" text subfields to the creators compound field in the eprints dataset, ran "epadmin update_database_structure" to make the database tables sync with the updated config definitions, and then used the scraped data to run a sequence of commands like the following:

/opt/eprints3/bin/toolbox devel modifyEprint --eprint 106 << \EOF
<eprint>
  <creators>
    <item>
      <name><given>Leslie</given><family>Carr</family></name>
      <id>lac@gmail.com</id>
      <affiliation>Department of Computer Science, Gadget University</affiliation>
      <location>Japan</location>
    </item>
  </creators>
</eprint>
EOF
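
A sequence of commands like this is easy to generate from the scraped data. The following is a minimal sketch rather than the actual script used: the helper names (xml_escape, creators_xml, update_creators) and the TOOLBOX and REPO variables are assumptions for illustration, and only the modifyEprint invocation itself comes from the command above. Note that a fragment with a single item describes a single creator, so a multi-author paper would presumably need all of its authors in one fragment.

```shell
#!/bin/sh
# Sketch only: TOOLBOX, REPO and the function names are illustrative
# assumptions, not part of EPrints.
TOOLBOX=${TOOLBOX:-/opt/eprints3/bin/toolbox}
REPO=${REPO:-devel}

# Escape the characters that are special in XML text content.
xml_escape() {
    printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# creators_xml GIVEN FAMILY ID AFFILIATION LOCATION
# Print an <eprint> fragment holding one creator.
creators_xml() {
    cat <<EOF
<eprint>
  <creators>
    <item>
      <name><given>$(xml_escape "$1")</given><family>$(xml_escape "$2")</family></name>
      <id>$(xml_escape "$3")</id>
      <affiliation>$(xml_escape "$4")</affiliation>
      <location>$(xml_escape "$5")</location>
    </item>
  </creators>
</eprint>
EOF
}

# update_creators EPRINTID GIVEN FAMILY ID AFFILIATION LOCATION
# Pipe the generated fragment into "toolbox ... modifyEprint".
update_creators() {
    eprintid=$1
    shift
    creators_xml "$@" | "$TOOLBOX" "$REPO" modifyEprint --eprint "$eprintid"
}
```

With those functions loaded, a call such as `update_creators 106 Leslie Carr lac@gmail.com "Department of Computer Science, Gadget University" Japan` should be equivalent to the hand-written command. The escaping step matters because scraped affiliations often contain ampersands.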

That works for me because I am a dyed-in-the-wool shell programmer, but you can also invoke toolbox functionality from the web via CGI scripts, provided you set up the appropriate security regime first. (The CGI toolbox is disabled by default because it is very dangerous to let all and sundry on the web have edit access to the database!)

I see that toolbox isn't documented in the wiki yet, and we've not had that much experience using it here at Southampton, but the range of facilities is shown below. Note that when you modify an eprint, the modification happens field by field, so you don't need to supply a full eprint record, but you do have to provide the entire contents of each field that you change.

    toolbox *repository_id* [options] getEprint --eprint eprintid
    toolbox *repository_id* [options] getEprintField --eprint eprintid --field fieldname
    toolbox *repository_id* [options] createEprint < data
    toolbox *repository_id* [options] modifyEprint --eprint eprintid < data
    toolbox *repository_id* [options] removeEprint --eprint eprintid
    toolbox *repository_id* [options] addDocument --eprint eprintid < data
    toolbox *repository_id* [options] modifyDocument --document documentid < data
    toolbox *repository_id* [options] removeDocument --document documentid
    toolbox *repository_id* [options] getFile --document documentid --filename filename
    toolbox *repository_id* [options] addFile --document documentid --filename filename < data
    toolbox *repository_id* [options] removeFile --document documentid --filename filename
    toolbox *repository_id* [options] idSearchEprints < data
    toolbox *repository_id* [options] xmlSearchEprints < data
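
Because a modification replaces the entire contents of a field, it seems prudent to save the current value before overwriting it. The sketch below does that with getEprintField from the list above. The function name save_field, the file naming scheme and the TOOLBOX and REPO variables are assumptions for illustration, and no assumption is made about the format of the output beyond writing it to a file.

```shell
#!/bin/sh
# Sketch only: TOOLBOX, REPO, save_field and the file naming are
# illustrative assumptions.
TOOLBOX=${TOOLBOX:-/opt/eprints3/bin/toolbox}
REPO=${REPO:-devel}

# save_field EPRINTID FIELDNAME [DIR]
# Write the current value of one field to a file and print the file name.
save_field() {
    dir=${3:-.}
    out="$dir/eprint-$1-$2.saved"
    "$TOOLBOX" "$REPO" getEprintField --eprint "$1" --field "$2" > "$out" || return 1
    printf '%s\n' "$out"
}
```

Running `save_field 106 creators` before a modifyEprint on the creators field leaves a copy of the old value in the current directory.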

If you prefer to do this via the Web, I did successfully access the toolbox functionality in JavaScript (well, the jQuery library) like so:

jQuery.post("http://repository/cgi/toolbox",
  {verb: "getEprint", username: "admin", password: "whatever", eprint: 358},
  function(xml){ alert(xml); }
);
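
For shell users, the same request can presumably be made with curl. This is a sketch under the assumption that the CGI toolbox accepts an ordinary form-encoded POST, which is what jQuery.post sends. The URL and the parameter names (verb, username, password, eprint) are taken from the jQuery call above, while the function name cgi_get_eprint and the TOOLBOX_URL and CURL variables are made up for illustration.

```shell
#!/bin/sh
# Sketch only: cgi_get_eprint, TOOLBOX_URL and CURL are illustrative
# assumptions. The CGI toolbox has to be enabled and secured first.
TOOLBOX_URL=${TOOLBOX_URL:-http://repository/cgi/toolbox}
CURL=${CURL:-curl}

# cgi_get_eprint EPRINTID USERNAME PASSWORD
# POST a getEprint request and print the response body.
cgi_get_eprint() {
    "$CURL" --silent --show-error --fail \
        --data-urlencode "verb=getEprint" \
        --data-urlencode "username=$2" \
        --data-urlencode "password=$3" \
        --data-urlencode "eprint=$1" \
        "$TOOLBOX_URL"
}
```

Something like `cgi_get_eprint 358 admin whatever` would then print whatever the server returns. Passing a password on the command line exposes it to other users on the same machine, so this is only suitable for experimenting.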

Tuesday, 27 January 2009

Repositories vs Learning Object Repositories

I got into a bit of an argument on the JISC-REPOSITORIES list yesterday, about whether general repositories (EPrints, DSpace, Fez etc) could take on the functions of a bespoke learning object repository (e.g. Intralibrary). My position is that a general repository is made to be adapted - you should be able to change the schema and the services to adapt to local requirements, but the contrary position is that a learning object repository is just too different and specialised.

We'll see. The EdSpace project at Southampton is running a learning resources repository based on EPrints, but they are experimenting with the nature of a learning object repository by introducing open access practices and sensibilities rather than keeping learning behind institutional firewalls. They are building something interesting (which shows signs of being effective as well) but they certainly wouldn't claim to be trying to replicate a learning object repository.

However, the discussion got me thinking about the limits of plasticity inherent in an open source repository such as EPrints (or DSpace etc).

The out-of-the-box, vanilla repository provides various services to support certain agendas (say open access, preservation and scholarly collections). However, it comes with lots of configuration options and customisation opportunities to extend that basic functionality. You can change the look and feel of the user interface, or the schema for the metadata or the services that are applied to the repository holdings. There are configuration options, APIs and plugins that you can use to adapt the repository to your local requirements, and every institution has its own list of extras that the repository just has to handle - whether it is journal workflow management, scientific data archiving or RAE evidence gathering. You can do any of these things as long as you have sufficient technical assistance to hand. And sufficient time. Otherwise, you just have to live with the generic experience. The diagram (above and to the left) makes it plain that the more you want to extend the boundary of your repository, the more effort you are going to have to put in.

In theory you can adapt your repository so far in the direction of any particular agenda that you could encompass all the needs and requirements of users concerned with that agenda. However, that may require an awful lot of effort - or just more understanding and insight than you have time to achieve. Bigger institutions will obviously be at an advantage here!

That may be where you turn to the open source community, so that others may help you to add the facilities that you want (see diagram to the left). But what has tended to happen in the repository community is that these out-sourced, open-source developments proceed independently of each other, so that it can be difficult to have a Basic Repository + Education module + Research Management module that work happily together.

It's hardly controversial to conclude that the facilities that you can add to a repository are always going to be constrained by the amount of technical resource available to you, and that in turn limits how much of the terrain (agendas and services) your repository's perimeter can encompass. So perhaps a way forward is to cheat by redefining the problem in terms of something that the repository can already do. I've already mentioned that EdSpace are getting results by making the "educational resources" problem look more like "Open Access + preservation". This approach seems to be working in other areas as well: scientific data (eCrystals) and fine arts archiving (KULTUR). Still, there remain interesting challenges in making a single repository "all things to all men" - whether they are physicists, chemists, engineers, social scientists or sculptors.

"Some things to all men" we can obviously do straight out of the box.