Saturday, 9 February 2008

Let's Do the TimeWarp

One of the reasons I believe in the Preservation ideal is that as a mid-career researcher, I have become very aware of the temporary and unreliable nature of my own personal IT infrastructure. Both the hardware and organisational support offered to help manage my intellectual journey (pretentious? moi?) are totally inadequate. I've just gone through my third hard disk in three months, and each time I've ended up with a period of splintered emails, diary entries, papers, proposals in different folders, using different applications on borrowed machines while the "Support" team try to diagnose and fix my hardware.

I keep going through these processes every few years - stolen laptops, broken laptops, borrowed laptops, new computers that I don't quite have time to transfer all my old environment over to. It just takes so much time, effort AND CONCENTRATION. Juggling backups from various periods, trying to reconcile duplicate files and remember what is on which machine. You never discover you've failed until 6 months later when you look for a document that you wrote 3 years ago on "just this topic" and it's not there - the whole project is missing. Arrgh!

So I (and a whole bunch of my colleagues) have quite fallen in love with Apple's Time Machine software that just creates hourly snapshots of your hard disk, and allows you to browse backwards through the history of your hard disk. It's like the Wayback Machine, but with an interaction paradigm that someone has actually thought about. It's very effective. And now there's this new wireless hard disk (the Time Capsule) that allows your machine to be backed up, automatically, without even having to plug the backup drive in. Fantastic!

For the first time in history, I'm seeing my colleagues get excited by backups. It was always such a tedious obligation before, and most people didn't do it very often. Certainly not on their laptops, for which our Support team disclaim all responsibility. And now, it can just happen, without thinking about it.

So perhaps, this is how we should make repositories work. Don't ingest individual, exquisitely formed digital items, complete with metadata and licensing information. Just ingest the whole flipping hard disk, offering at least a backup service. As Caveat Lector recently pointed out, everyone wants backup so everyone would use the repository. I was dubious at the time, but as you can see, the idea is growing on me.

So perhaps we ought to augment the OAIS model of the repository which (paraphrased) says that the repository is like a digestive system: stuff goes in one end, gets stored in the middle and then goes out the other end. (They even use the term "ingest", so don't tell me that the metaphor wasn't on their minds.) I'd like to tweak this model to be more like a cow, with multiple stomachs, each of which has a different task in the overall digestion process.

A hard disk (i.e. a computer file system) goes in and gets stored for backup. Multiple versions are handled over time, so the whole history of the data contents is available. Only to the owner, of course: at this stage the contents are opaque to the repository staff. This level of privacy would need to be strictly enforced to make users feel happy about entrusting their files to a third party. At this stage, the benefit is all to the user - they have backup.
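A minimal sketch of what this first stomach might do, assuming a snapshot is nothing more than a list of file paths with a hash of each file's contents. The names `snapshot`, `history` and `blobs` are my own inventions for illustration, not part of Time Machine or of any repository software, and the privacy side (encrypting contents so that staff cannot read them) is left out entirely.

```python
import hashlib
from pathlib import Path


def snapshot(root: Path, blobs: dict[str, bytes]) -> dict[str, str]:
    """Record one snapshot of `root` as {relative path: content hash}.

    File contents are kept in `blobs`, keyed by hash, so a file that has
    not changed since the last snapshot is not stored a second time.
    """
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            data = path.read_bytes()
            digest = hashlib.sha256(data).hexdigest()
            blobs.setdefault(digest, data)  # store each distinct content once
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest


def history(snapshots: list[dict[str, str]], rel_path: str) -> list[str]:
    """Return the distinct versions (hashes) of one file, oldest first."""
    versions = []
    for manifest in snapshots:
        digest = manifest.get(rel_path)
        # Skip snapshots where the file is absent or unchanged.
        if digest is not None and (not versions or versions[-1] != digest):
            versions.append(digest)
    return versions
```

Calling `snapshot` at intervals and keeping the returned manifests in a list gives the "whole history" view: `history(manifests, "papers/draft.doc")` lists every distinct version of that one file.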

In the next stage, the user can "break down" the file system into important components - folders for projects, experiments, papers, proposals, lecture courses etc. Important individual documents can be identified. Metadata can be inferred by looking at the relationship between the low-level items (files) and the high-level structures (folders and directories). File names, office document metadata, file system metadata, file contents, proximity to other files and their contents can all help to profile the individual items and ease the task of metadata entry.
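To make the inference idea concrete, here is a rough sketch that guesses a few fields from nothing but the file name, the extension, the top-level folder, the modification time and the neighbouring files. The `KINDS` table and every field name are assumptions made up for this example; a real profiler would also look inside the documents.

```python
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical mapping from file extension to a rough document kind.
KINDS = {".doc": "text", ".pdf": "text", ".ppt": "slides", ".xml": "data"}


def guess_metadata(path: Path, root: Path) -> dict:
    """Guess metadata for one file from its name and its surroundings."""
    rel = path.relative_to(root)
    modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
    return {
        # File name with separators turned into spaces, as a candidate title.
        "title_guess": path.stem.replace("_", " ").replace("-", " ").strip(),
        "kind": KINDS.get(path.suffix.lower(), "unknown"),
        # The top-level folder is treated as the project the file belongs to.
        "project_guess": rel.parts[0] if len(rel.parts) > 1 else None,
        "modified": modified.isoformat(),
        # Files in the same folder are probably related to this one.
        "siblings": sorted(
            p.name for p in path.parent.iterdir()
            if p.is_file() and p.name != path.name
        ),
    }
```

These are only guesses for the user to confirm or correct, which is the point: the form arrives half filled in.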

In the next stage, the user can organise or "map" the important components from the disk (above) into a set of entries in the repository. For example, all the doc, ppt and pdf files from one folder could go into a single eprint whose title comes from the name of the Word file and whose journal name comes from the folder name.
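That particular mapping rule is simple enough to sketch. The record below is a plain dictionary with field names I have chosen for illustration; it is not the EPrints schema or its import format, and `folder_to_eprint` is a hypothetical helper.

```python
from pathlib import Path

# The file types the rule cares about; anything else in the folder is ignored.
DOCUMENT_SUFFIXES = {".doc", ".ppt", ".pdf"}


def folder_to_eprint(folder: Path) -> dict:
    """Map one folder to a single draft eprint record."""
    files = sorted(
        (p for p in folder.iterdir()
         if p.is_file() and p.suffix.lower() in DOCUMENT_SUFFIXES),
        key=lambda p: p.name,
    )
    word_files = [p for p in files if p.suffix.lower() == ".doc"]
    return {
        # Title comes from the name of the (first) Word file, if there is one.
        "title": word_files[0].stem if word_files else None,
        # Journal name comes from the folder name.
        "journal": folder.name,
        "documents": [p.name for p in files],
    }
```

A folder with no Word file still produces a record, just with an empty title, which leaves something for the normal ingest stage to check and improve.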

Then we get to the normal ingest stage, where the metadata can be checked and improved and all the normal processes can go on.

Perhaps this is just the hysteria of the marking season (I've still got to mark 65 students' XML and XSLT files before the end of the weekend). Or perhaps it's a strange state of mind that comes from living on a borrowed iMac in the spare room until I get my proper laptop back and all its files restored. But it might satisfy the need to get the repository closer to the user, and encourage the greater use of the repository for preservation and open access.

