Sunday, 29 March 2009

Repository as a Trusted Intermediary

The idea of a trusted intermediary that makes content both durable and usable with a "chinese menu" of added-value services is my new favourite definition of repository. These words come from the DuraSpace project's midterm report, and although they were not penned with repositories per se in mind, I believe that they provide an excellent description of their rationale, i.e. to increase trust in material created in:
  • a random place on the Web
  • my rented niche in the Cloud
  • my departmental filestore
  • my own desktop.
So I am particularly pleased to congratulate the JISC EdSpace team on their recent upgrade to the EdShare learning resource repository at Southampton, because they have helped deliver on the first bullet point - adding trust to web resources.

I have been using EdShare to distribute material from the modules that I teach. Much of this material consists of PowerPoint lecture slides that I have created, but a significant proportion of it is material available on the open Web - perhaps other people's slides, papers or reports from their own web sites.

In the past I have had two choices: either deposit a link to the web page or deposit a copy of the web page. The former is a lightweight solution and obviously the right "Web thing" to do when you just want to provide a URL pointer to someone else's resource. But the latter is the right "repository thing" to do in terms of making a safe and durable copy. Except that I don't automatically have the right to clutter up Google space with ad hoc copies of the same material and reduce their PageRank. So most of the time I have settled for "just linking", at the price of accepting that some of this material will move or disappear before I teach the topic again. In the words of Humphrey Bogart, I know that I'll regret it -- maybe not today, maybe not tomorrow, but soon, and for the rest of my course.

Now EdShare lets me have my cake and eat it. I can deposit and disseminate a link to the external material (as before) but the repository will make a dark copy and start serving that if the original disappears. Essentially they treat important material that I find on the Web in the same way that they treat important material that I move into the repository. Both get managed, indexed, thumbnailed and subjected to the normal range of repository services.
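The fallback behaviour described above can be sketched in a few lines. This is purely an illustration of the idea, not EdShare's actual implementation; the function names and the dark-store interface are my own invention.

```python
# Illustrative sketch of the "dark copy" fallback: serve the live web
# resource while it exists, and fall back to the archived copy made at
# deposit time when the original disappears.

def resolve(url, fetch, dark_store):
    """Return (source, content) for a deposited external link.

    fetch(url)      -> content, or raises LookupError if the resource is gone
    dark_store[url] -> the archived copy made at deposit time
    """
    try:
        return ("original", fetch(url))
    except LookupError:
        # Original has moved or vanished: serve the preserved dark copy.
        return ("dark-copy", dark_store[url])
```

The point of the design is that the user always deposits a link, and the repository quietly takes on the preservation duty behind it.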

So I'm delighted that I can now do the right Web thing and the right repository thing at the same time.

Thursday, 26 February 2009

Why I Need Trusted Storage

Yesterday I went to the Apple Store to get my laptop hard disk repaired for the FIFTH time. Each time the data has been unrecoverable - or worse - partially recoverable. Each time I have lost material that was not backed up and each time my different sets of historical backups were recovered but only partially integrated with each other. After all, each fatal disk crash knocks a week out of your working life, what with trying to extract anything important from the smoking remains of the still-spinning disk, taking the machine to the repair shop, waiting for them to repair it, working on another system, getting email sorted, getting your old machine back, restoring the operating system and applications and then trying to copy all your old backups onto the new disk. The effect is that each subsequent crash has confused my backups / restores to the point where I have folders within folders within folders of different sets of partially restored material and I no longer know what has been restored where.

How could I possibly get into such a lunatic state, I hear you ask. Is this man the worst kind of professional incompetent? Everyone knows you have to back your stuff up. Why doesn't he just buy a big disk and use Time Machine? These are good questions. I ask them of myself all the time.
  • Our school systems team disavow responsibility for all laptops. We are literally on our own if we dare to have mobile machines.
  • When I started on this voyage of data loss some three years ago, Time Machine wasn't invented.
  • Disks that you buy for backup are just as likely to go foom as your own personal laptop disk. My main coherent, level 0 backup on a LaCie Terabyte disk just stopped working one day, just when I tried to restore my work.
  • Large disks are forever being used for other urgent purposes. Students need some space for something. A project needs some temporary storage. You need to be able to transfer a large amount of data from one machine to another. It gets difficult to manage the various assortments of undistinguished grey bricks that build up in your office. Which one has the old duplicate backup on it that is no longer necessary?


There are lots of other mitigating circumstances with which I won't bore you, but what I would like to lay down are my beliefs that (a) backup management is a complex task that requires serious attention and preferably support from professionals who can devote some attention to it and (b) it is never urgent enough to displace any of the truly important and terribly overdue academic tasks that you are trying to accomplish TODAY so you don't get sacked.

I've had a lot of time to reflect on this since my laptop started plunging me into regular data hell, and the idea of trusted storage for me isn't just about having files that don't disappear. It's about having an organised, stable, useful, authoritative picture of my professional life - research and teaching - that grows and tells an emerging story as my career develops. That's mainly what has been disrupted - I can pretty much find any specific thing that I want by grep/find or desktop search. But the overall understanding of what I had and what I had been working on has been disrupted and damaged and fragmented.

So an intelligent store should help me understand what I have - a bit like the way that user tools like iPhoto help you understand and organise thousands of images. It should be possible to get a highly distilled overview/representation/summary/visualisation of all my intellectual content/property/achievements as well as a detailed and comprehensive store of all my individual documents and files.

I guess you can see where I'm going with this. I've gone and got the ideal desktop storage and the dream repository all mixed up. Well perhaps I have - but why not?

Anyway, all's well that ends well. My colleagues all clubbed together and got a terabyte Time Capsule for work, that is run by a sympathetic member of the systems team. And Apple just phoned up to offer me a brand new 17" MacBook Pro in exchange for my broken old one.

Still, I'd really like to make my data store intelligible as well as safe!

Wednesday, 25 February 2009

DuraSpace: High Hopes or Crying Wolf

I promised that I would try to keep informed about DuraSpace, and so I was pleased to read the DuraSpace midterm report to Mellon. (Note to Mellon staff: please don't scan these reports without OCR'ing them. It's frustrating not being able to Google them!)

As I said previously, I'm a big fan of the DuraSpace agenda. My distillation of DuraSpace goals from the report's opening paragraph is to provide a trusted intermediary that makes content both durable and usable with a "chinese menu" of added-value services. Now this isn't really specific to the cloud - but that seems in keeping with the report because it frequently refers to "third party storage solutions" rather than "the cloud".

So the DuraSpace agenda could apply as much to the Web, or any other information environment, as it does to the cloud. Which in itself seems to be a good thing, and proves the worth of the open repositories community (go repositories!)

Except that we're still trying to consolidate and prove our worth in the web environment. Have we got a huge community of end-users who are all cheering for repositories and swear by their functionality? Exactly how long is our chinese menu of appealing and valuable services? It may be a bit of a hobbyhorse of mine (sorry about that) but let's make sure that we deliver on repository value and usefulness in the Web, on the desktop and also in the cloud.

Otherwise someone is going to accuse us of crying wolf - quick! come and look at the value proposition of repositories in the cloud! We've already alerted people about value and the web till we're blue in the face. Can we really tick that one off? Have we delivered? Do people trust us? (Have people heard of us?)

I haven't suddenly gone all anti-repository - I believe that we are genuinely seeing some really interesting repository services starting to emerge from a variety of projects. But they are not mainstream yet, and they are not common experience. We still need to work harder on creating value for end users as well as repository managers and repository developers.

Let's do it in the cloud - but let's work really hard at articulating the benefits that the cloud end user will enjoy, and stop relying on general talk about value-added services. We need to Think. Specifically. Make a clear offering to our users - or would-be users. I think researchers/end-users will forgive us for not having finished implementing something yet, but they won't forgive us for a lack of imagination.

Tuesday, 24 February 2009

Fifteen Years After The Fact

Thanks to Colin Smith for pointing out this new discussion from The Council of Editors of Learned Journals on the future of the journal in which they propose the following four principles.

  1. Journals must pursue interoperability with the other online tools that are shaping the techne of scholarly practice

  2. Journals have opportunity to reframe their role in the academy as curators of the noise of the web.

  3. Electronic journals will have the opportunity to expand their curatorial mandate to include different forms of publication.

  4. Broadening the community of participation.


I was expecting to be disappointed - this set of blogged responses from journals to a web-based future expends 3,400 words without mentioning open access or repositories. But then in principle #3 they went and completely exceeded my expectations by proposing a model of scholarly publication that genuinely fits in with the web.

It is contrary to utility, in the world of web 2.0, to maintain exclusive publication rights on an article. Exclusivity of publication places a text in only one domain. Yet non-exclusive text gets reproduced and recopied, circulated around the internet, and rapidly floats onward to mimetic influence in other cultures, excerpted and referenced. For every web 2.0 author, non-exclusivity and easy republication is ideal. For every would-be-idea-of-influence in the age of web 2.0, easy reduplication is crucial.
Exclusivity has been the format followed by most online journals, which seek to mimic in form the traditional journal: one essay, neatly formatted, looking as professional as possible. Exclusive re-publication suggests the old model of authority, and is superficially reassuring to editors without actually promoting the real functions of the journal: disseminating ideas and establishing the authority of the journal-as-canon and disciplinary metric.
Significantly more desirable would be setting a different precedent: for all disseminated forms of the text to advertise the article's accreditation as having been curated by inclusion in the journal-as-stream. (the text might end with, for instance, "please recirculate with this citation: by-Professor-Bonnie-Wheeler, SMU, 2009; officially tagged in 'Arthuriana,' [link] May 2010") Advertising the link between article and journal in many reproduced/cross-referenced copies would function both to the benefit of the article and the prestige of the journal.
Again, if the dissemination model is followed, the journal homepage need not include reprints of the articles themselves: merely links to the original blogspace or university-housed-pdf or slideshow where the material was originally posted, with all of its links, illustrations, video, and wallpaper as the author originally presented it. The journal's role is reduced to curation, not to presentation. Not having a use for a graphic designer, typesetter, or illustrations layout person, the journal's workflow will be considerably reduced.


This isn't exactly a new model - syndicated scholarly dissemination based on links and certification - but I didn't expect to hear a council of learned journal editors proposing it in my lifetime! The times they are a-changin! Or perhaps, the web is finally changing us all.

Saturday, 21 February 2009

The Cloud, the Researcher and the Repository

There's currently a lot of buzz about DuraSpace, the DSpace and Fedora project to incorporate cloud storage into repositories. I wasn't able to catch their webinar on Thursday, but I'm keeping my ear to the ground because it sounds like a very positive agenda for repositories in general to adopt. I hope this is a good opportunity to make a few remarks about the work that EPrints is doing that also might make cloud services accessible to repositories and users of repositories.

Moving your data into the cloud is a bit like moving your stuff into an unfurnished apartment. You get an awful lot of space to put things, once a month you have to pay the landlord, and you end up with absolutely nothing available to help you to organise and look after your things. You have to put your clothes, DVDs and crockery in a big pile on the floor unless you get some furniture in. But cloud 'furniture' comes as downloadable instructions on how to take three planks of wood and craft something that functions almost the same as a coffee table. In short, it's a great place for highly competent DIY enthusiasts with time on their hands. The EPrints team have been working on projects that might help researchers looking to take advantage of the cloud's benefits, without being put off by its lack of home comforts.

We've previously announced that Dave Tarrant has extended EPrints to use cloud storage services as part of JISC's PRESERV2 project (preserv.eprints.org). The new EPrints storage controller (debuting in EPrints v3.2) allows the repository to offload the storage of its files to any external service - cloud storage, local storage area networks or even national archiving services. The repository can mix and match these services according to the characteristics of each deposited object - even storing each item in several places for redundancy or performance improvement.
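To make the mix-and-match idea concrete, here is a minimal sketch of a storage controller that routes each deposited object to one or more back-ends according to its characteristics. The class and method names are illustrative assumptions of mine, not the EPrints v3.2 API (which is written in Perl).

```python
# Toy storage controller: each deposited object is routed to one or more
# storage back-ends (cloud, local SAN, national archive, ...) according
# to rules over its metadata, so an item can be stored in several places
# for redundancy or performance.

class StorageController:
    def __init__(self):
        self.backends = {}   # name -> dict standing in for a storage service
        self.rules = []      # (predicate over metadata, [backend names])

    def register(self, name):
        self.backends[name] = {}

    def add_rule(self, predicate, targets):
        self.rules.append((predicate, targets))

    def store(self, object_id, metadata, data):
        """Write the object to every backend whose rule matches; return the copies made."""
        placed = []
        for predicate, targets in self.rules:
            if predicate(metadata):
                for name in targets:
                    self.backends[name][object_id] = data
                    placed.append(name)
        return placed
```

For example, a repository might send anything over a gigabyte to cloud storage, while replicating theses to both the cloud and a local SAN.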

That tackles the technical part of the problem - how to join up repositories with the cloud, but it doesn't have much to say about how to better engage data-rich users with the cloud (or with the repository come to that). As part of the JISC KULTUR project (kultur.eprints.org), Tim Brody has been looking at the problem of user deposit for lots of large media files. Not petabyte large, but gigabyte large. Even at that scale, the normal web infrastructure fails to deliver a reliable service - connections between a web browser and server just time out unexpectedly and silently - which makes it unpleasant for an artist who is trying to archive their career's-worth of video installations to the institutional repository. It's also really tedious even to upload 100 small image files to the repository through the web deposit interface.

The solution that Tim has come up with is to allow the researcher's desktop environment to directly use EPrints as a file system - you can 'mount' the repository as a network drive on your Windows/Mac/Linux desktop using services like WebDAV or FTP. As far as the user is concerned, they can just drag and drop a whole bunch of files from their documents folders, home directories or DVD-ROMs onto the repository disk, and EPrints will automatically deposit them into a new entry or entries. Of course, you can also do the reverse - copy documents from the repository back onto your desktop, open them directly in applications, or attach them to an email. And once you have opened a repository file directly in Microsoft Word (say) then why not save the changes back into the repository, with the repository either updating the original document or making a new version of it according to local policy? Or for UNIX admins, you can just set up a command-line FTP connection to the repository and relive the glory days of the pre-Web internet. And who knows, perhaps there will be demand for a gopher interface too?
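The deposit side of that drag-and-drop behaviour can be sketched as a simple grouping rule: files copied onto the mounted repository drive are gathered into new eprint entries, say one entry per top-level folder dropped, with loose files each becoming their own entry. This is my own illustration of the idea, not Tim Brody's actual code, and the one-entry-per-folder policy is an assumption.

```python
# Sketch: map files dragged onto the mounted repository drive into
# deposit entries, keyed by the top-level folder they arrived in.

from collections import defaultdict
from pathlib import PurePosixPath

def group_into_deposits(dropped_paths):
    """Group dropped file paths into deposit entries.

    Files inside a folder share one entry (keyed by the folder name);
    loose files each become a single-file entry of their own.
    """
    entries = defaultdict(list)
    for p in dropped_paths:
        parts = PurePosixPath(p).parts
        key = parts[0] if len(parts) > 1 else p
        entries[key].append(p)
    return dict(entries)
```

The repository would then run its normal deposit workflow (metadata capture, thumbnailing, versioning) over each entry.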

Now perhaps if you put the desktop front-end together with the cloud back-end, the repository might be able to offer institutional researchers a realistic path to cloud storage. For the researcher who is tempted by the expansion capacity that the cloud's metaphorical unfurnished apartment offers them, the repository could offer a removal van, a concierge, a security guard, a cleaner and an expandable set of prefabricated cupboards and walk-in wardrobes. Not naked cloud storage, but storage that is mediated, managed and moderated on the researcher's behalf by the institution, so that they have the assurance that their data is not stranded and susceptible to the irregularities of cloud service provider SLAs. In other words, a cloud you can depend on!

The above paragraph sounds a bit hand-wavy, and to be honest we need to get some proper experience of this with real researchers before we can be confident that it is a viable approach. Desktop services have already been built on top of cloud storage - JungleDisk for example is a desktop backup and archiving service, but it still requires the user to have their own cloud account. Hopefully, a repository can take away all the necessity for special accounts, passwords and storage management from the user and provide them with a whole host of extra, valuable services.

Perhaps that's where the challenge lies. Repositories need to commit to providing really useful services to all their users - cloud users (or potential cloud users) are not a new breed, even if they do have exacting requirements. So having taken care of the infrastructure that seamlessly connects repositories and clouds, let's make sure that we keep on innovating in the user space. Backup, archiving, preservation and access are a good foundation, but they are only the start.

There will be a demonstration of this work and other features of EPrints 3.2 at Open Repositories 2009 in Atlanta, Georgia on May 18th-21st. Make sure you come along because it's going to be a really exciting conference, whether or not it is cloudy :-)

Wednesday, 18 February 2009

EPrints Evaluation - assessing the usability

Andy Powell has been looking at repository usability on his blog, and his latest posting uses a paper in the ECS school repository as an example. He makes some very good points to which I ought to respond.

First, Andy comments on the URL of the splash page. 

The jump-off page for the article in the repository is at http://eprints.ecs.soton.ac.uk/14352/, a URL that, while it isn't too bad, could probably be better. How about replacing 'eprints.ecs' by 'research' for example to mitigate against changes in repository content (things other than eprints) and organisational structure (the day Computer Science becomes a separate school).
This is certainly a point to consider, but there are two approaches to unchanging cool URIs. One is to try to make the URI as independent of any implementation or specific service as possible - "research" instead of eprints.ecs. Unfortunately, there are some things that are givens - we are the School of Electronics and Computer Science and ecs.soton.ac.uk is our subdomain. We can't pretend otherwise - and we do not have the leeway to invent a new soton.ac.uk subdomain. We are very aware of the impermanence of the university structure (and hence the URL and domain name structure). Five years ago the whole University was re-arranged from departments into new amalgamated schools. Luckily, we and our URLs survived unscathed. Even worse, last year the University's marketing office almost rebranded every official URL and mail address from soton.ac.uk to southampton.ac.uk!

The ultimate in insulating yourself from organisational changes is to adopt an entirely opaque URI such as a handle. The alternative is to admit that you can't choose a 100% safe name, and have policies and procedures in place to support old names whatever changes and upheavals come to pass. For example, our eprints service itself replaces an older bibliographic database called jerome, whose legacy URIs are redirected to the corresponding pages in eprints. That is the approach that we take with our services - adapt, change, move and rename if necessary, but always provide a continuity of service for already published URIs.
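The jerome-to-eprints continuity policy amounts to a table of permanent redirects. A sketch of the idea is below; the legacy path and target ID are invented examples, not real records.

```python
# Sketch of the continuity-of-service policy: requests for legacy jerome
# URIs are answered with a 301 permanent redirect to the corresponding
# eprints page, so already-published URIs keep working after the service
# is replaced.

LEGACY_MAP = {
    "/jerome/record/1234": "/14352/",   # invented example mapping
}

def redirect_legacy(path):
    """Return (status, location) for a legacy path, or (404, None) if unknown."""
    target = LEGACY_MAP.get(path)
    if target is None:
        return (404, None)
    return (301, "http://eprints.ecs.soton.ac.uk" + target)
```

The same table-plus-redirect pattern copes with future renames too: whatever upheaval comes, the old names stay resolvable.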

The jump-off page itself is significantly better in usability terms than the one I looked at yesterday. The page <title> is set correctly for a start. Hurrah! Further, the link to the PDF of the paper is near the top of the page and a mouse-over pop-up shows clearly what you are going to get when you follow the link. I've heard people bemoaning the use of pop-ups like this in usability terms in the past but I have to say, in this case, I think it works quite well. On the downside, the link text is just 'PDF' which is less informative than it should be.
The link text is a bit curt: it should say "View/Open/Download PDF document"

Following the abstract a short list of information about the paper is presented. Author names are linked (good) though for some reason keywords are not (bad). I have no idea what a 'Performance indicator' is in this context, even less so the value "EZ~05~05~11". Similarly I don't see what use the ID Code is and I don't know if Last Modified refers to the paper or the information about the paper. On that basis, I would suggest some mouse-over help text to explain these terms to end-users like myself.

Author names are linked to the official school portal pages externally to the repository. Presumably keywords should be linked to a keyword cloud that groups all similarly-themed papers. The ID Code and Performance indicators are internal, and should be low-lighted in some way. The last-modified information refers to the eprint record itself, and so the label should be made more informative.
The 'Look up in Google Scholar' link fails to deliver any useful results, though I'm not sure if that is a fault on the part of Google Scholar or the repository? In any case, a bit of Ajax that indicated how many results that link was going to return would be nice (note: I have no idea off the top of my head if it is possible to do that or not).

The Google Scholar link failed to work on this item because the author/depositor changed the title of the article in its final version and left the original title in the eprint record. I have revised the record and now the Google Scholar link works properly. (The link initiates a search on the title and first named author.)
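Since the link is just a search on title and first named author, it can be built with straightforward URL encoding. The sketch below is how such a link can be constructed; the actual EPrints template may differ in details.

```python
# Build a "Look up in Google Scholar" link from an eprint's title and
# author list: an exact-phrase search on the title, restricted to the
# first named author's surname.

from urllib.parse import urlencode

def scholar_link(title, authors):
    surname = authors[0].split()[-1]
    query = {"q": '"%s" author:%s' % (title, surname)}
    return "http://scholar.google.com/scholar?" + urlencode(query)
```

This also explains the failure mode described above: if the deposited title differs from the published title, the exact-phrase search finds nothing.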



Each of the references towards the bottom of the page has a 'SEEK' button next to them (why uppercase?). As with my comments yesterday, this is a button that acts like a link (from my perspective as the end-user) so it is not clear to me why it has been implemented in the way it has (though I'm guessing that it is to do with limitations in the way Paracite (the target of the link) has been implemented). My gut feeling is that there is something unRESTful in the way this is working, though I could be wrong. In any case, it seems to be using an HTTP POST request where an HTTP GET would be more appropriate?

You are right to pull us up on this. ParaCite is a piece of legacy technology that we should probably either revise or remove. The author left the University several years ago. I will check to see what genuine usage it is getting.


There is no shortage of embedded metadata in the page, at least in terms of volume, though it is interesting that <meta name="DC.subject" ... > is provided whereas the far more useful <meta name="keywords" ... > is not.
The page also contains a large number of <link rel="alternate" ... > tags in the page header - matching the wide range of metadata formats available for manual export from the page (are end-users really interested in all this stuff?) - so many in fact, that I question how useful these could possibly be in any real-world machine-to-machine scenario.

We should definitely add a meta entry with the unqualified keyword tag. The large number of exports are for bibliographic databases (BibTeX, EndNote and the like), metadata standards (METS, MODS, DIDL, Dublin Core) and various services (e.g. visualisations or mashups). The problem with that list is that it is undifferentiated and unexplained - it should at least have categories of export functionality to inform users. As for the large number of <link rel="alternate" ...> tags, I'm not sure I understand the criticism - will they bore the HTML parser?


Overall then, I think this is a pretty good HTML page in usability terms. I don't know how far this is an "out of the box" ePrints.org installation or how much it has been customised but I suggest that it is something that other repository managers could usefully take a look at.

I am happy to have received a "Good, but could do better" assessment for the repository. This is a configured EPrints repository, but not a heavily customised installation, so it shouldn't be too different an experience for other EPrints 3 repositories.


Usability and SEO don't centre around individual pages of course, so the kind of analysis that I've done here needs to be much broader in its reach, considering how the repository functions as a whole site and, ultimately, how the network of institutional repositories and related services (since that seems to be the architectural approach we have settled on) function in usability terms.

I agree with this - repositories are about more than individual files, they are about services over aggregations of documents - searches, browse views of collections. We need critiques of more of a repository's features - perhaps something that SHERPA/DOAR could implement in future?


It's not enough that individual repositories think about the issues, even if some or most make good decisions, because most end-users (i.e. researchers) need to work across multiple repositories (typically globally) and therefore we need the usability of the system as a whole to function correctly. We therefore need to think about these issues as a community.

This is a key point that we need to bear in mind. So is it possible to be more specific about the external players in the Web? What services exactly do we need to play happily with? Various commentators have mentioned different aspects of the Web ecology - Google, Google Scholar, RSS, Web 2.0 services, Zotero etc. Can we bring together a catalogue of the important external players with which a repository must interoperate and against which a repository should expect to be judged/accredited?

Tuesday, 17 February 2009

Repository as Blog?

As ever, it's the people who are closest to you that you get to talk to the least.

Simon Coles, a chemistry researcher with his own EPrints development team, works in an adjacent building to me, and today I managed to have a technical discussion with him and his team for the first time in about a year!

Simon does a lot of really interesting innovation in the area of the scientific information environment, and he runs a national scientific information service which gives him a really pragmatic attitude to what actually works in practice and what is simply a good idea. He runs the eCrystals data repository that shares scientific metadata and data on crystallography experiments. 

One of his team's recent developments has been the scientific data blog - the use of blogging software to act as a laboratory notebook, describing experimental procedure with attached data files. As he described the ideas, and their implementation as a piece of blogging software, it occurred to me that a repository could appropriately provide this kind of service; after all, a daily posting of text and data files sounds very like an eprint consisting of an extended abstract with uploaded documents.

Of course, "To a man with a repository, everything looks like a deposit", but which sounds most likely? A blog environment that curates scientific data files? Or a data and document curation environment that provides a blog style-interface?

What you'd need to add to a repository (apart from some bespoke deposit interfaces) is the ability to create the document that you are going to deposit within the workflow itself. In the page where you are invited to upload a document, you should also have a Rich Text Editor (like blogger) so that you can type the document in.

That's a project for another day. After a truly exhausting time at #dev8D my repository developers need to recharge their batteries :-)