RepositoryMan: February 2009

Thursday, 26 February 2009

Why I Need Trusted Storage

Yesterday I went to the Apple Store to get my laptop hard disk repaired for the FIFTH time. Each time the data has been unrecoverable - or worse - partially recoverable. Each time I have lost material that was not backed up and each time my different sets of historical backups were recovered but only partially integrated with each other. After all, each fatal disk crash knocks a week out of your working life, what with trying to extract anything important from the smoking remains of the still-spinning disk, taking the machine to the repair shop, waiting for them to repair it, working on another system, getting email sorted, getting your old machine back, restoring the operating system and applications and then trying to copy all your old backups onto the new disk. The effect is that each subsequent crash has confused my backups / restores to the point where I have folders within folders within folders of different sets of partially restored material and I no longer know what has been restored where.

How could I possibly get into such a lunatic state, I hear you ask. Is this man the worst kind of professional incompetent? Everyone knows you have to back your stuff up. Why doesn't he just buy a big disk and use Time Machine? These are good questions. I ask them of myself all the time.

Our school systems team disavow responsibility for all laptops. We are literally on our own if we dare to have mobile machines.
When I started on this voyage of data loss some three years ago, Time Machine wasn't invented.
Disks that you buy for backup are just as likely to go foom as your own personal laptop disk. My main coherent, level 0 backup on a LaCie Terabyte disk just stopped working one day, just when I tried to restore my work.
Large disks are forever being used for other urgent purposes. Students need some space for something. A project needs some temporary storage. You need to be able to transfer a large amount of data from one machine to another. It gets difficult to manage the various assortments of undistinguished grey bricks that build up in your office. Which one has the old duplicate backup on it that is no longer necessary?

There are lots of other mitigating circumstances with which I won't bore you, but what I would like to lay down are my beliefs that (a) backup management is a complex task that requires serious attention and preferably support from professionals who can devote some attention to it and (b) it is never urgent enough to displace any of the truly important and terribly overdue academic tasks that you are trying to accomplish TODAY so you don't get sacked.

I've had a lot of time to reflect on this since my laptop started plunging me into regular data hell, and the idea of trusted storage for me isn't just about having files that don't disappear. It's about having an organised, stable, useful, authoritative picture of my professional life - research and teaching - that grows and tells an emerging story as my career develops. That's mainly what has been disrupted - I can pretty much find any specific thing that I want by grep/find or desktop search. But the overall understanding of what I had and what I had been working on has been disrupted and damaged and fragmented.

So an intelligent store should help me understand what I have - a bit like the way that user tools like iPhoto help you understand and organise thousands of images. It should be possible to get a highly distilled overview/representation/summary/visualisation of all my intellectual content/property/achievements as well as a detailed and comprehensive store of all my individual documents and files.

I guess you can see where I'm going with this. I've gone and got the ideal desktop storage and the dream repository all mixed up. Well perhaps I have - but why not?

Anyway, all's well that ends well. My colleagues all clubbed together and got a terabyte Time Capsule for work, that is run by a sympathetic member of the systems team. And Apple just phoned up to offer me a brand new 17" MacBook Pro in exchange for my broken old one.

Still, I'd really like to make my data store intelligible as well as safe!

Wednesday, 25 February 2009

DuraSpace: High Hopes or Crying Wolf

I promised that I would try to keep informed about DuraSpace, and so I was pleased to read the DuraSpace midterm report to Mellon. (Note to Mellon staff: please don't scan these reports without OCR'ing them. It's frustrating not being able to Google them!)

As I said previously, I'm a big fan of the DuraSpace agenda. My distillation of DuraSpace goals from the report's opening paragraph is to provide a trusted intermediary that makes content both durable and usable with a "chinese menu" of added-value services. Now this isn't really specific to the cloud - but that seems in keeping with the report because it frequently refers to "third party storage solutions" rather than "the cloud".

So the DuraSpace agenda could apply as much to the Web, or any other information environment, as it does to the cloud. Which in itself seems to be a good thing, and proves the worth of the open repositories community (go repositories!)

Except that we're still trying to consolidate and prove our worth in the web environment. Have we got a huge community of end-users who are all cheering for repositories and swear by their functionality? Exactly how long is our chinese menu of appealing and valuable services? It may be a bit of a hobbyhorse of mine (sorry about that) but let's make sure that we deliver on repository value and usefulness in the Web, on the desktop and also in the cloud.

Otherwise someone is going to accuse us of crying wolf - quick! come and look at the value proposition of repositories in the cloud! We've already alerted people about value and the web till we're blue in the face. Can we really tick that one off? Have we delivered? Do people trust us? (Have people heard of us?)

I haven't suddenly gone all anti-repository - I believe that we are genuinely seeing some really interesting repository services starting to emerge from a variety of projects. But they are not mainstream yet, and they are not common experience. We still need to work harder on creating value for end users as well as repository managers and repository developers.

Let's do it in the cloud - but let's work really hard at articulating the benefits that the cloud end user will enjoy, and stop relying on general talk about value-added services. We need to Think. Specifically. Make a clear offering to our users - or would-be users. I think researchers/end-users will forgive us for not having finished implementing something yet, but they won't forgive us for a lack of imagination.

Tuesday, 24 February 2009

Fifteen Years After The Fact

Thanks to Colin Smith for pointing out this new discussion from The Council of Editors of Learned Journals on the future of the journal in which they propose the following four principles.

Journals must pursue interoperability with the other online tools that are shaping the techne of scholarly practice

Journals have opportunity to reframe their role in the academy as curators of the noise of the web.

Electronic journals will have the opportunity to expand their curatorial mandate include different forms of publication.

Broadening the community of participation.

I was expecting to be disappointed - this set of blogged responses of journals to a web-based future expends 3400 words failing to mention open access or repositories. But then in principle #3 they went and completely exceeded my expectations by proposing a model of scholarly publication that genuinely fits in with the web.

It is contrary to utility, in the world of web 2.0, to maintain exclusive publication rights on an article. Exclusivity of publication places a text in only one domain. Yet non-exclusive text gets reproduced and recopied, circulated around the internet, and rapidly floats onward to mimetic influence in other cultures, excerpted and referenced. For every web 2.0 author, non-exclusivity and easy republication is ideal. For every would-be-idea-of-influence in the age of web 2.0, easy reduplication is crucial.
Exclusivity has been the format followed by most online journals, which seek to mimic in form the traditional journal: one essay, neatly formatted, looking as professional as possible. Exclusive re-publication suggests the old model of authority, and is superficially reassuring to editors without actually promoting the real functions of the journal: disseminating ideas and establishing the authority of the journal-as-canon and disciplinary metric.
Significantly more desirable would be setting a different precedent: for all disseminated forms of the text to advertise the article's accreditation as having been curated by inclusion in the journal-as-stream. (the text might end with, for instance, "please recirculate with this citation: by-Professor-Bonnie-Wheeler, SMU, 2009; officially tagged in 'Arthuriana,' [link] May 2010") Advertising the link between article and journal in many reproduced/cross-referenced copies would function both to the benefit of the article and the prestige of the journal.
Again, if the dissemination model is followed, the journal homepage need not include reprints of the articles themselves: merely links to the original blogspace or university-housed-pdf or slideshow where the material was originally posted, with all of its links, illustrations, video, and wallpaper as the author originally presented it. The journal's role is reduced to curation, not to presentaiton. Not having a use for a graphic designer, typesetter, or illustrations layout person, the journal's workflow will be considerably reduced.

This isn't exactly a new model - syndicated scholarly dissemination based on links and certification - but I didn't expect to hear a council of learned journal editors proposing it in my lifetime! The times they are a-changin! Or perhaps, the web is finally changing us all.

Saturday, 21 February 2009

The Cloud, the Researcher and the Repository

There's currently a lot of buzz about DuraSpace, the DSpace and Fedora project to incorporate cloud storage into repositories. I wasn't able to catch their webinar on Thursday, but I'm keeping my ear to the ground because it sounds like a very positive agenda for repositories in general to adopt. I hope this is a good opportunity to make a few remarks about the work that EPrints is doing that also might make cloud services accessible to repositories and users of repositories.

Moving your data into the cloud is a bit like moving your stuff into an unfurnished apartment. You get an awful lot of space to put things, once a month you have to pay the landlord, and you end up with absolutely nothing available to help you to organise and look after your things. You have to put your clothes, DVDs and crockery in a big pile on the floor unless you get some furniture in. But cloud 'furniture' comes as downloadable instructions on how to take three planks of wood and craft something that functions almost the same as a coffee table. In short, it's a great place for highly competent DIY enthusiasts with time on their hands. The EPrints team have been working on projects that might help researchers looking to take advantage of the cloud's benefits, without being put off by its lack of home comforts.

We've previously announced that Dave Tarrant has extended EPrints to use cloud storage services as part of JISC's PRESERV2 project (preserv.eprints.org). The new EPrints storage controller (debuting in EPrints v3.2) allows the repository to offload the storage of its files to any external service - cloud storage, local storage area networks or even national archiving services. The repository can mix and match these services according to the characteristics of each deposited object - even storing each item in several places for redundancy or performance improvement.
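To make the "mix and match" idea a little more concrete, here is a minimal sketch in Python of a policy that picks storage services from the characteristics of a deposited object. It is not the EPrints storage controller or its configuration format; the backend names, the 100 MiB threshold and the rule that PDFs get an extra archive copy are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DepositedObject:
    filename: str
    size_bytes: int
    mime_type: str


# Hypothetical backend names; a real repository would map these to actual services.
LOCAL = "local_disk"
CLOUD = "cloud_store"
ARCHIVE = "national_archive"

# Assumed threshold above which files are offloaded to cloud storage.
LARGE_FILE_BYTES = 100 * 1024 * 1024


def choose_backends(obj: DepositedObject) -> List[str]:
    """Return the storage backends for one object (illustrative policy only)."""
    # Large media files go to the cloud; everything else stays on local disk.
    backends = [CLOUD] if obj.size_bytes > LARGE_FILE_BYTES else [LOCAL]
    # Assumed rule: papers get a second copy in an archive for redundancy.
    if obj.mime_type == "application/pdf":
        backends.append(ARCHIVE)
    return backends
```

Under these made-up rules a 2 MB PDF ends up on local disk and in the archive, while a multi-gigabyte video goes to the cloud only.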

That tackles the technical part of the problem - how to join up repositories with the cloud - but it doesn't have much to say about how to better engage data-rich users with the cloud (or with the repository, come to that). As part of the JISC KULTUR project (kultur.eprints.org), Tim Brody has been looking at the problem of user deposit for lots of large media files. Not petabyte large, but gigabyte large. Even at that scale, the normal web infrastructure fails to deliver a reliable service - connections between a web browser and server just time out unexpectedly and silently - which makes it unpleasant for an artist who is trying to archive their career's-worth of video installations to the institutional repository. It's also really tedious even if you try to upload 100 small image files to the repository through the web deposit interface.

The solution that Tim has come up with is to allow the researcher's desktop environment to directly use EPrints as a file system - you can 'mount' the repository as a network drive on your Windows/Mac/Linux desktop using services like WebDAV or FTP. As far as the user is concerned, they can just drag and drop a whole bunch of files from their documents folders, home directories or DVD-ROMs onto the repository disk, and EPrints will automatically deposit them into a new entry or entries. Of course, you can also do the reverse - copy documents from the repository back onto your desktop, open them directly in applications, or attach them to an email. And once you have opened a repository file directly in Microsoft Word (say) then why not save the changes back into the repository, with the repository either updating the original document or making a new version of it according to local policy? Or for UNIX admins, you can just set up a command-line FTP connection to the repository and relive the glory days of the pre-Web internet. And who knows, perhaps there will be demand for a gopher interface too?

Now perhaps if you put the desktop front-end together with the cloud back-end, the repository might be able to offer institutional researchers a realistic path to cloud storage. For the researcher who is tempted by the expansion capacity that the cloud's metaphorical unfurnished apartment offers them, the repository could offer a removal van, a concierge, a security guard, a cleaner and an expandable set of prefabricated cupboards and walk-in wardrobes. Not naked cloud storage, but storage that is mediated, managed and moderated on the researcher's behalf by the institution, so that they have the assurance that their data is not stranded and susceptible to the irregularities of cloud service provider SLAs. In other words, a cloud you can depend on!

The above paragraph sounds a bit hand-wavy, and to be honest we need to get some proper experience of this with real researchers before we can be confident that it is a viable approach. Desktop services have already been built on top of cloud storage - JungleDisk for example is a desktop backup and archiving service, but it still requires the user to have their own cloud account. Hopefully, a repository can take away all the necessity for special accounts, passwords and storage management from the user and provide them with a whole host of extra, valuable services.

Perhaps that's where the challenge lies. Repositories need to commit to providing really useful services to all their users - cloud users (or potential cloud users) are not a new breed, even if they do have exacting requirements. So having taken care of the infrastructure that seamlessly connects repositories and clouds, let's make sure that we keep on innovating in the user space. Backup, archiving, preservation and access are a good foundation, but they are only the start.

There will be a demonstration of this work and other features of EPrints 3.2 at Open Repositories 2009 in Atlanta, Georgia on May 18th-21st. Make sure you come along because it's going to be a really exciting conference, whether or not it is cloudy :-)

Wednesday, 18 February 2009

EPrints Evaluation - assessing the usability

Andy Powell has been looking at repository usability on his blog, and his latest posting uses a paper in the ECS school repository as an example. He makes some very good points to which I ought to respond.

First, Andy comments on the URL of the splash page.

The jump-off page for the article in the repository is at http://eprints.ecs.soton.ac.uk/14352/, a URL that, while it isn't too bad, could probably be better. How about replacing 'eprints.ecs' by 'research' for example to mitigate against changes in repository content (things other than eprints) and organisational structure (the day Computer Science becomes a separate school).

This is certainly a point to consider, but there are two approaches to unchanging cool URIs. One is to try to make the URI as independent of any implementation or specific service as possible - "research" instead of eprints.ecs. Unfortunately, there are some things that are givens - we are the School of Electronics and Computer Science and ecs.soton.ac.uk is our subdomain. We can't pretend otherwise - and we do not have the leeway to invent a new soton.ac.uk subdomain. We are very aware of the impermanence of the university structure (and hence the URL and domain name structure). Five years ago the whole University was re-arranged from departments into new amalgamated schools. Luckily, we and our URLs survived unscathed. Even worse, last year the University's marketing office almost rebranded every official URL and mail address from soton.ac.uk to southampton.ac.uk!

The ultimate in insulating yourself from organisational changes is to adopt an entirely opaque URI such as a handle. The alternative is to admit that you can't choose a 100% safe name, and have policies and procedures in place to support old names whatever changes and upheavals come to pass. For example, our eprints service itself replaces an older bibliographic database called jerome, whose legacy URIs are redirected to the corresponding pages in eprints. That is the approach that we take with our services - adapt, change, move and rename if necessary, but always provide a continuity of service for already published URIs.
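As a rough illustration of the "support old names" policy, the sketch below (Python) resolves a legacy path to the current eprint URL, which a web server could then answer with a permanent redirect. The legacy path layout, the identifier "jerome-0042" and its pairing with eprint 14352 are invented for the example; our real redirect tables are not reproduced here.

```python
from typing import Optional

# Hypothetical mapping from legacy record identifiers to current eprint IDs.
LEGACY_TO_EPRINT = {"jerome-0042": 14352}


def resolve_legacy(path: str, base: str = "http://eprints.ecs.soton.ac.uk") -> Optional[str]:
    """Return the current URL for a legacy path, or None if it is unknown."""
    # Take the last path segment as the legacy identifier (assumed layout).
    key = path.strip("/").split("/")[-1]
    eprint_id = LEGACY_TO_EPRINT.get(key)
    if eprint_id is None:
        return None
    return f"{base}/{eprint_id}/"
```

The point of the design is that the mapping is data, not code: a rename or move only means adding rows to the table, and already published URIs keep working.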

The jump-off page itself is significantly better in usability terms than the one I looked at yesterday. The page <title> is set correctly for a start. Hurrah! Further, the link to the PDF of the paper is near the top of the page and a mouse-over pop-up shows clearly what you are going to get when you follow the link. I've heard people bemoaning the use of pop-ups like this in usability terms in the past but I have to say, in this case, I think it works quite well. On the downside, the link text is just 'PDF' which is less informative than it should be.

The link text is a bit curt: it should say "View/Open/Download PDF document"

Following the abstract a short list of information about the paper is presented. Author names are linked (good) though for some reason keywords are not (bad). I have no idea what a 'Performance indicator' is in this context, even less so the value "EZ~05~05~11". Similarly I don't see what use the ID Code is and I don't know if Last Modified refers to the paper or the information about the paper. On that basis, I would suggest some mouse-over help text to explain these terms to end-users like myself.

Author names are linked to the official school portal pages externally to the repository. Presumably keywords should be linked to a keyword cloud that groups all similarly-themed papers. The ID Code and Performance indicators are internal, and should be low-lighted in some way. The last-modified information refers to the eprint record itself, and so the label should be made more informative.

The 'Look up in Google Scholar' link fails to deliver any useful results, though I'm not sure if that is a fault on the part of Google Scholar or the repository? In any case, a bit of Ajax that indicated how many results that link was going to return would be nice (note: I have no idea off the top of my head if it is possible to do that or not).

The Google Scholar link failed to work on this item because the author/depositor changed the title of the article in its final version and left the original title in the eprint record. I have revised the record and now the Google Scholar link works properly. (The link initiates a search on the title and first named author.)
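For the curious, a link of this kind can be built along the following lines. This is a sketch in Python, not the code that EPrints actually runs; the exact query syntax (a quoted title plus Google Scholar's author: operator) is my assumption about one reasonable way to search on the title and first named author.

```python
from urllib.parse import urlencode


def scholar_search_url(title: str, first_author: str) -> str:
    """Build a Google Scholar search URL from a title and first author (assumed query form)."""
    query = f'"{title}" author:"{first_author}"'
    return "https://scholar.google.com/scholar?" + urlencode({"q": query})
```

It also shows why a stale title in the record breaks the link: the search is only as good as the metadata that feeds it.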

Each of the references towards the bottom of the page has a 'SEEK' button next to them (why uppercase?). As with my comments yesterday, this is a button that acts like a link (from my perspective as the end-user) so it is not clear to me why it has been implemented in the way it has (though I'm guessing that it is to do with limitations in the way Paracite (the target of the link) has been implemented. My gut feeling is that there is something unRESTful in the way this is working, though I could be wrong. In any case, it seems to be using an HTTP POST request where a HTTP GET would be more appropriate?

You are right to pull us up on this. ParaCite is a piece of legacy technology that we should probably either revise or remove. The author left the University several years ago. I will check to see what genuine usage it is getting.

There is no shortage of embedded metadata in the page, at least in terms of volume, though it is interesting that <meta name="DC.subject" ... > is provided whereas the far more useful <meta name="keywords" ... > is not.
The page also contains a large number of <link rel="alternate" ... > tags in the page header - matching the wide range of metadata formats available for manual export from the page (are end-users really interested in all this stuff?) - so many in fact, that I question how useful these could possibly be in any real-world machine-to-machine scenario.

We should definitely add a meta entry with the unqualified keyword tag. The large number of exports are for bibliographic databases (BibTeX, EndNote and the like), metadata standards (METS, MODS, DIDL, Dublin Core) and various services (e.g. visualisations or mashups). The problem with that list is that it is undifferentiated and unexplained - it should at least have categories of export functionality to inform users. As for the large number of <link rel="alternate" ...> tags, I'm not sure I understand the criticism - will they bore the HTML parser?

Overall then, I think this is a pretty good HTML page in usability terms. I don't know how far this is an "out of the box" ePrints.org installation or how much it has been customised but I suggest that it is something that other repository managers could usefully take a look at.

I am happy to have received a "Good, but could do better" assessment for the repository. This is a configured EPrints repository, but not a heavily customised installation, so it shouldn't be too different an experience for other EPrints 3 repositories.

Usability and SEO don't centre around individual pages of course, so the kind of analysis that I've done here needs to be much broader in its reach, considering how the repository functions as a whole site and, ultimately, how the network of institutional repositories and related services (since that seems to be the architectural approach we have settled on) function in usability terms.

I agree with this - repositories are about more than individual files, they are about services over aggregations of documents - searches, browse views of collections. We need critiques of more of a repository's features - perhaps something that SHERPA/DOAR could implement in future?

It's not enough that individual repositories think about the issues, even if some or most make good decisions, because most end-users (i.e. researchers) need to work across multiple repositories (typically globally) and therefore we need the usability of the system as a whole to function correctly. We therefore need to think about these issues as a community.

This is a key point that we need to bear in mind. So is it possible to be more specific about the external players in the Web? What services exactly do we need to play happily with? Various commentators have mentioned different aspects of the Web ecology - Google, Google Scholar, RSS, Web 2.0 services, Zotero etc. Can we bring together a catalogue of the important external players with which a repository must interoperate and against which a repository should expect to be judged/accredited?

Tuesday, 17 February 2009

Repository as Blog?

As ever, it's the people who are closest in proximity to you that get to talk to you the least.

Simon Coles, a chemistry researcher with his own EPrints development team, works in an adjacent building to me, and today I managed to have a technical discussion with him and his team for the first time in about a year!

Simon does a lot of really interesting innovation in the area of the scientific information environment, and he runs a national scientific information service which gives him a really pragmatic attitude to what actually works in practice and what is simply a good idea. He runs the eCrystals data repository that shares scientific metadata and data on crystallography experiments.

One of his team's recent developments has been the scientific data blog - the use of blogging software to act as a laboratory notebook, describing experimental procedure with attached data files. As he described the ideas, and their implementation as a piece of blogging software, it occurred to me that a repository could appropriately provide this kind of service. After all, a daily posting of text and data files sounds very like an eprint consisting of an extended abstract with uploaded documents.

Of course, "To a man with a repository, everything looks like a deposit", but which sounds most likely? A blog environment that curates scientific data files? Or a data and document curation environment that provides a blog-style interface?

What you'd need to add to a repository (apart from some bespoke deposit interfaces) is the ability to create the document that you are going to deposit within the workflow itself. In the page where you are invited to upload a document, you should also have a Rich Text Editor (like blogger) so that you can type the document in.

That's a project for another day. After a truly exhausting time at #dev8D my repository developers need to charge their batteries :-)

Sunday, 15 February 2009

How EPrints Might Support Copyright for Teaching Materials

In yesterday's posting I mentioned the notion of a copyright audit for teaching materials that incorporate images and content produced by other people. It's a very current topic for the teaching and learning community, and one that has been discussed extensively in the context of the EdSpace project.

So I thought I'd knock up some extra document metadata fields to allow sufficient information to run with this idea. I've just added a text field to record the identity of the copyright holder, a URL field to store a reference to where the media had been sourced from, and a pick list for the author to declare whether it was his/her own material, whether rights had been bought, or permission obtained, whether the item was in the public domain, or whether the copyright status was 'uncleared' or 'unknown'.

To recap: this is in the context of a set of lecture notes that have incorporated third party pictures and images. The repository has "burst out" all these images from the slideshow and recorded them as separate (but related) documents, with their own metadata for cataloguing purposes.

Now that the metadata is recorded, this can be used as the basis for a workflow, so that the public access status of the slides can be made contingent on the correct permissions being obtained for all the embedded items. Or alternatively, in the case of problematic items, the repository can create a 'copyright free' version of the slideshow by pixelating, graying out, or simply removing the original image. Or, the author can be allowed to deposit the slideshow into the repository, but the EPrints QA audit system can be used instead to provide warnings and reminders to get missing permissions.
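Here is a minimal sketch in Python of how the three fields and the audit rule might hang together. It is not the EPrints field configuration or its QA audit system; the class names, the enum values and the rule that only 'uncleared' and 'unknown' items block publication are my own assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Rights(Enum):
    """Pick list for the declared copyright status of an embedded item."""
    OWN = "own"
    BOUGHT = "bought"
    PERMISSION = "permission"
    PUBLIC_DOMAIN = "public_domain"
    UNCLEARED = "uncleared"
    UNKNOWN = "unknown"


# Statuses assumed to be safe for public release.
CLEARED = {Rights.OWN, Rights.BOUGHT, Rights.PERMISSION, Rights.PUBLIC_DOMAIN}


@dataclass
class EmbeddedItem:
    name: str
    copyright_holder: str  # text field
    source_url: str        # where the media was sourced from
    rights: Rights         # pick-list value


def audit(items: List[EmbeddedItem]) -> List[str]:
    """Return the names of items that still need permission to be sorted out."""
    return [item.name for item in items if item.rights not in CLEARED]


def may_publish(items: List[EmbeddedItem]) -> bool:
    """A slideshow can go public only when no embedded item fails the audit."""
    return not audit(items)
```

The list returned by audit() is the sort of thing that could drive the warnings and reminders, while may_publish() is the stricter option of gating public access.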

To re-iterate. This is not about Open Access to self-authored, self-deposited research materials! This is about Open Educational Resources, which may incorporate third party materials and which the authors may worry about making public.

Repository managers may be looking at this and thinking that I've just made the process of sharing information much more complex. But there are lots of ways of simplifying this - you could just resort to a tick box for the PowerPoint that says "I have checked all the resources below and declare that copyright permission has been obtained for all of them".

Friday, 13 February 2009

Microsoft Office at #dev8D

I joined my colleagues Chris Gutteridge and Dave Tarrant (aka BitTarrant) at the JISC Developer Happiness (#dev8D) event in London this week. At least, I came to the tail end of the event after I had dispatched some JISC bids! It was a great time, with lots of food for thought. During the closing Repositories session the discussion touched on the role of the repository in mediating between desktop documents and the world of the web (1.0, 2.0 and the cloud). In particular, one of the Fedora developers suggested that the repository could expose new "endpoints" (i.e. points of access) for the kinds of complex documents that were normally encountered as a take-it-or-leave-it package. Documents like Microsoft Word files, which are now stored as explicit bundles of text, media, metadata and relationships.

This fits into so many of my soap boxes - providing more value for end users, supporting desktop activities, taking advantage of the new Office openness - so I got really excited about the possibilities. At the end of the session I sat down with Chris and we (i.e. he) started implementing an EPrints service to do that. If the wireless network hadn't gone down, he would have finished before the conference dinner. However, he did finish and refine it the next day during the repository briefing sessions.

The image on the left shows what happens when you upload a Word 2007 document. Firstly, the Dublin Core metadata (author, title etc) of the document is applied to the eprint record itself (aka automatic metadata extraction). This has obvious advantages because it means that if you want to create a sensible, standalone record then you might be able to get away with just uploading the document and not filling out any extra metadata. If you can then look up the author and title in Web of Science you really might not need to fill out any extra metadata at all. That would be nice!

Secondly, each of the images is extracted as a separate document in its own right. That means they get their own metadata and URLs and you could download and reuse individual figures without downloading the whole document. (In the image I have shown the figure captions as part of the metadata, but I cheated by cutting and pasting them from the original.)
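The two steps above can be sketched in a few lines of Python, because a Word 2007 file is a zip package: the Dublin Core properties normally live in docProps/core.xml and embedded images under word/media/ (ppt/media/ for PowerPoint). This is not Chris's EPrints code, just an illustration of the idea; the function names are mine, and make_sample_docx() builds a tiny stand-in package (not a valid Word file) so the sketch can be run without a real document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core properties part of an Office package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_core_metadata(docx) -> dict:
    """Return the Dublin Core title and creator stored in docProps/core.xml."""
    with zipfile.ZipFile(docx) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))
    return {
        "title": root.findtext("dc:title", default="", namespaces=NS),
        "creator": root.findtext("dc:creator", default="", namespaces=NS),
    }


def list_media(docx, prefix: str = "word/media/") -> list:
    """List the embedded media parts that could become documents of their own."""
    with zipfile.ZipFile(docx) as package:
        return sorted(name for name in package.namelist() if name.startswith(prefix))


def make_sample_docx() -> io.BytesIO:
    """Build a minimal in-memory package for demonstration purposes only."""
    core = (
        '<cp:coreProperties xmlns:cp="%s" xmlns:dc="%s">'
        "<dc:title>Sample paper</dc:title><dc:creator>A. Author</dc:creator>"
        "</cp:coreProperties>" % (NS["cp"], NS["dc"])
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("docProps/core.xml", core)
        package.writestr("word/media/image1.png", b"not a real image")
    buffer.seek(0)
    return buffer
```

Both functions accept either a filename or a file-like object, so read_core_metadata("paper.docx") would work on a real upload in the same way.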

Another example is a record for a PowerPoint document shown here. By bursting out all the images used in the slideshow, the repository has automatically created a catalogue of media resources which could be used in a copyright audit to check that it is safe to make this teaching resource Open Access.

Each media resource is a separate entity (and it's not just limited to pictures and videos; it could be embedded spreadsheets and other complex documents) that is linked internally to a specific slide entity, so it would be easy to make a rather more sophisticated table of slides and resources.

And once you have all the slides listed for an individual slideshow, then the repository can make a page that views all of the slides from all of the individual PowerPoints. Or just the ones from a particular project. Or just the ones from a particular research group.

So I think that there's a lot of mileage in this approach, especially when you combine it with SWORD and allow the Office application to automatically save the Office document into the repository in the first place.

Sunday, 1 February 2009

Repository meets the Semantic Web. Semantic Web Acquits Itself Well

I am helping to organise the Web Science 2009 conference in Athens in March, and I am putting the conference papers in a repository to generate all the paper lists for the schedules etc.

This is a new conference for an emerging discipline and it seems particularly important to be able to give an impression of the breadth of the community contribution. An obvious way to show international contribution is to plot the conference contributors on Google Earth. That's a fairly standard EPrints export demo, but I just need to know where all the contributors are (their latitude and longitude).

It's easy enough to import the affiliation, country and email for each author from the conference submission system (EasyChair) into the repository (EPrints), but there's no central service that will give me the lat/long of a university. Wikipedia has them, but there's no easy way to go reliably from the entered Affiliation data to a Wikipedia University entry. The best way that I found was to use each author's email address (or part of it!) to do a semantic web search of DBpedia for matching Universities, look up the city that the University is located in and find out the latitude and longitude of that city. It's all automated in SPARQL, so it's pretty efficient (now that I've learned about DBpedia and SPARQL that is!) It may have been just as quick to do it by hand from Wikipedia, but where's the fun in that?
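For anyone who wants to try something similar, here is a sketch in Python of the shape of the lookup: take the domain part of the email address and build a SPARQL query that goes from university to city to coordinates. It is not the query I actually ran. The property names (dbo:University, foaf:homepage, dbo:city, geo:lat, geo:long) are assumptions about DBpedia's vocabulary, CONTAINS needs a SPARQL 1.1 endpoint, and the sketch only builds the query text without sending it anywhere.

```python
import re


def email_domain(email: str) -> str:
    """Return the lower-cased domain part of an email address."""
    return email.rsplit("@", 1)[-1].lower()


def build_query(domain: str) -> str:
    """Build a SPARQL query matching universities whose homepage contains the domain."""
    # Refuse anything that is not a plain domain, to keep the query text safe.
    if not re.fullmatch(r"[a-z0-9.-]+", domain):
        raise ValueError("unexpected characters in domain: %r" % domain)
    return f"""
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?university ?city ?lat ?long WHERE {{
  ?university a dbo:University ;
              foaf:homepage ?homepage ;
              dbo:city ?city .
  ?city geo:lat ?lat ;
        geo:long ?long .
  FILTER(CONTAINS(LCASE(STR(?homepage)), "{domain}"))
}}
LIMIT 5
"""
```

In practice the full domain is often too specific (a school subdomain rather than the university's own), which is why using only part of the address can give better matches.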