Saturday 21 February 2009

The Cloud, the Researcher and the Repository

There's currently a lot of buzz about DuraSpace, the DSpace and Fedora project to incorporate cloud storage into repositories. I wasn't able to catch their webinar on Thursday, but I'm keeping my ear to the ground because it sounds like a very positive agenda for repositories in general to adopt. I hope this is a good opportunity to make a few remarks about the work that EPrints is doing that might also make cloud services accessible to repositories and users of repositories.

Moving your data into the cloud is a bit like moving your stuff into an unfurnished apartment. You get an awful lot of space to put things, once a month you have to pay the landlord, and you end up with absolutely nothing available to help you to organise and look after your things. You have to put your clothes, DVDs and crockery in a big pile on the floor unless you get some furniture in. But cloud 'furniture' comes as downloadable instructions on how to take three planks of wood and craft something that functions almost the same as a coffee table. In short, it's a great place for highly competent DIY enthusiasts with time on their hands. The EPrints team have been working on projects that might help researchers looking to take advantage of the cloud's benefits, without being put off by its lack of home comforts.

We've previously announced that Dave Tarrant has extended EPrints to use cloud storage services as part of JISC's PRESERV2 project (preserv.eprints.org). The new EPrints storage controller (debuting in EPrints v3.2) allows the repository to offload the storage of its files to any external service - cloud storage, local storage area networks or even national archiving services. The repository can mix and match these services according to the characteristics of each deposited object - even storing each item in several places for redundancy or performance improvement.
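To make the "mix and match" idea a little more concrete, here is a minimal sketch in Python of how a per-object storage policy could be expressed. It is not the EPrints storage controller or its configuration format; the backend names, the 1 GiB threshold and the rule that media files get an extra archive copy are all assumptions invented for illustration.

```python
from dataclasses import dataclass

GIGABYTE = 1024 ** 3  # assumed threshold for "too big for local storage"


@dataclass(frozen=True)
class DepositedObject:
    """Hypothetical description of one deposited file."""
    filename: str
    size_bytes: int
    mime_type: str


def choose_backends(obj):
    """Return the (hypothetical) storage services that should hold a copy.

    Illustrative rules only:
    - files of 1 GiB or more go to cloud storage, everything else to the local SAN;
    - images and video also get a second copy in a national archive for redundancy.
    """
    backends = ["cloud"] if obj.size_bytes >= GIGABYTE else ["local_san"]
    if obj.mime_type.startswith(("image/", "video/")):
        backends.append("national_archive")
    return backends


# A small PDF stays local; a large video goes to the cloud plus the archive.
paper = DepositedObject("thesis.pdf", 2_000_000, "application/pdf")
video = DepositedObject("installation.mov", 3 * GIGABYTE, "video/quicktime")
print(choose_backends(paper))
print(choose_backends(video))
```

The point of the sketch is only that the decision is made per object from its characteristics, and that the answer can be more than one place.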

That tackles the technical part of the problem - how to join up repositories with the cloud - but it doesn't have much to say about how to better engage data-rich users with the cloud (or with the repository, come to that). As part of the JISC KULTUR project (kultur.eprints.org), Tim Brody has been looking at the problem of user deposit for lots of large media files. Not petabyte large, but gigabyte large. Even at that scale, the normal web infrastructure fails to deliver a reliable service - connections between a web browser and server just time out unexpectedly and silently - which makes it unpleasant for an artist who is trying to archive their career's worth of video installations to the institutional repository. It's also really tedious to upload even 100 small image files to the repository through the web deposit interface.

The solution that Tim has come up with is to allow the researcher's desktop environment to directly use EPrints as a file system - you can 'mount' the repository as a network drive on your Windows/Mac/Linux desktop using services like WebDAV or FTP. As far as the user is concerned, they can just drag and drop a whole bunch of files from their documents folders, home directories or DVD-ROMs onto the repository disk, and EPrints will automatically deposit them into a new entry or entries. Of course, you can also do the reverse - copy documents from the repository back onto your desktop, open them directly in applications, or attach them to an email. And once you have opened a repository file directly in Microsoft Word (say) then why not save the changes back into the repository, with the repository either updating the original document or making a new version of it according to local policy? Or for UNIX admins, you can just set up a command-line FTP connection to the repository and relive the glory days of the pre-Web internet. And who knows, perhaps there will be demand for a gopher interface too?
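The "update the original or make a new version, according to local policy" choice can be sketched in a few lines of Python. This is only an illustration of the decision, not how EPrints implements it; the function name, the policy names and the idea of modelling a document's history as a simple list are all assumptions.

```python
def save_back(versions, new_content, policy):
    """Return the document's version history after a save from the desktop.

    versions    -- existing contents, oldest first (hypothetical model)
    new_content -- what the application has just written back
    policy      -- "overwrite" replaces the latest version in place;
                   "new_version" keeps the old one and appends the new one
    """
    if policy == "overwrite":
        return versions[:-1] + [new_content]
    if policy == "new_version":
        return versions + [new_content]
    raise ValueError(f"unknown policy: {policy!r}")


history = ["draft 1"]
print(save_back(history, "draft 2", "overwrite"))
print(save_back(history, "draft 2", "new_version"))
```

Either way the user just presses Save; which branch runs is the repository's decision, not theirs.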

Now perhaps if you put the desktop front-end together with the cloud back-end, the repository might be able to offer institutional researchers a realistic path to cloud storage. For the researcher who is tempted by the expansion capacity that the cloud's metaphorical unfurnished apartment offers them, the repository could offer a removal van, a concierge, a security guard, a cleaner and an expandable set of prefabricated cupboards and walk-in wardrobes. Not naked cloud storage, but storage that is mediated, managed and moderated on the researcher's behalf by the institution, so that they have the assurance that their data is not stranded and susceptible to the irregularities of cloud service provider SLAs. In other words, a cloud you can depend on!

The above paragraph sounds a bit hand-wavy, and to be honest we need to get some proper experience of this with real researchers before we can be confident that it is a viable approach. Desktop services have already been built on top of cloud storage - JungleDisk, for example, is a desktop backup and archiving service, but it still requires the user to have their own cloud account. Hopefully, a repository can take away the need for special accounts, passwords and storage management from the user and provide them with a whole host of extra, valuable services.

Perhaps that's where the challenge lies. Repositories need to commit to providing really useful services to all their users - cloud users (or potential cloud users) are not a new breed, even if they do have exacting requirements. So having taken care of the infrastructure that seamlessly connects repositories and clouds, let's make sure that we keep on innovating in the user space. Backup, archiving, preservation and access are a good foundation, but they are only the start.

There will be a demonstration of this work and other features of EPrints 3.2 at Open Repositories 2009 in Atlanta, Georgia on May 18th-21st. Make sure you come along because it's going to be a really exciting conference, whether or not it is cloudy :-)

4 comments:

  1. Hi Les. Look forward to seeing the cloud storage controller in action sometime - certainly addresses a number of teeth-grinding issues we've encountered (or side-stepped, for now). The FTP-like filesystem extension idea sounds rather like the RepoMMan system at Hull, which I recently heard Richard Green describe - I presume you've checked that out? It got me itching to give Fedora another look - if I ever find the time - but something similar that existing EPrints installations can bolt on in a future upgrade would be very attractive (and spare us another learning curve ;)

  2. RepoMMan produced a UI for the repository that mimicked the WS_FTP Windows application - i.e. a window with separate source and destination panes. It gives an FTP user a familiar experience - no bad thing for a repository!

    The EPrints facility described above implements a genuine FTP server (and WebDAV server) so that your O/S can mount it as a real file system and all your applications will be able to use it as a source or destination for opening and saving their documents.

  3. Chris Rusbridge says:

    ... and you didn't mention Fedorazon, the small innovation project funded in an earlier JISC Repositories round. See http://www.ukoln.ac.uk/repositories/digirep/index/Fedorazon

    How could I forget Fedorazon, run by the fantastic and recently poached Dave Flanders.
    The project literally put repositories in the cloud by creating Amazon EC2 images of running repositories so that you could run up a new repository in 30 seconds.

    The real challenge will be in creating scalable repository farms that can handle a whole planet's data, just like Google. But at the moment most repositories can be adequately housed on a small iPod, so there's little incentive to do too much investigation I suppose.

  4. Leslie Carr said:
    ... housed on a small iPod, so there's little incentive ...

    at the moment most repositories for publications may be of manageable size, but forthcoming repository-based research environments are pretty substantial. eSciDoc, Perseus/Scaife, TextGrid, dariah.eu, ... - those initiatives deal with massive amounts of data, and they are no doubt all very excited about the discussions here :-)
