RepositoryMan: April 2009

Tuesday, 28 April 2009

Radio EPrints

I have just found out that UMAP 2009 is publishing MP3s of all its conference abstracts before the conference starts. I'm going to keep a weather eye on how this goes, because the organisers clearly think that this is a way of getting the message out about their work.

This isn't particularly new: in the 1980s the cool Mac crowd were using HyperCard to record Usenet messages to listen to in the car. But when proper conference organisers (rather than maverick geeks) start to invest the effort to do it for their conference, then it is more serious. After all, these guys have a conference to organise!

Is this the right time to take the step and do something similar for our repositories? Record summaries, commentaries or adverts for each of our papers / lectures / reports? Why not have an "elevator pitch" to advertise important work? There must be all sorts of opportunities for syndicating audio content, and mixing it up into academic playlists.

Wednesday, 22 April 2009

There's an app for that

I bought an iPhone last year to replace my six-year-old cellphone because (a) I needed to use it for work and (b) all the other mobile phones I looked at were so complicated that I had to ask my teenage children how to work them.

But the iPhone isn't being advertised on the basis of its design or on its ease of use. Instead Apple are churning out advert after advert about all the applications that people have written for the device. The phone-as-a-platform for useful little 'apps' seems to be a winning story, and I have seen rabid anti-Apple friends and colleagues being won around to buying one on this basis.

Apple's slogan is "there's an app for that" - whether you're trying to find a local restaurant, go bird watching, get directions, authorise a visa transaction, print shipping labels, rent an apartment or buy a textbook. And they don't even mention the word "phone" in their ads any more.

You can probably see where I'm going with this - the repository as a platform, a locus for useful services, an environment for innovation. I've said it before - we need to cram a lot more interesting and useful services into our repositories - stuff that will pique the interest of researchers and users, not just digital librarians and developers. Apparently not everyone goes faint at the thought of preservation, APIs and service oriented architectures.

So I was delighted that Adam Field came up with a new EPrints export plugin today that helps me to show off the contents of my repository. Called the DocumentGrid, it simply displays a linked table of thumbnails for the records in a collection or in the results of a search. It is really simple and it took him about twenty minutes to write, but it is one of those really useful functions that I find myself needing. It answers the question "show me what's in your repository", or "let me see what's in that collection". It's useful, and it's eye candy at the same time.

Let's have some more of these! Half the stuff for the iPhone is eye candy, or only useful in really specific circumstances, so I don't think we should be afraid of making things that aren't profound and respectable and adaptable. Of course people want to show off their work - and not just in CVs. There should be tons more apps that help researchers to look good and show off their stuff.

This one is available for download at the EPrints distribution repository: http://files.eprints.org/443/
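To give a flavour of how little is involved in a "linked table of thumbnails", here is a rough sketch in Python. It is not the DocumentGrid plugin itself (that is an EPrints export plugin, and none of the EPrints API is used here); the function name and the record fields `title`, `url` and `thumbnail` are made up for illustration.

```python
from html import escape


def document_grid(records, columns=4):
    """Render records as an HTML table of linked thumbnails.

    `records` is a list of dicts with hypothetical keys
    'title', 'url' and 'thumbnail' (assumed, not an EPrints schema).
    """
    rows = []
    # Walk through the records one table row (i.e. `columns` items) at a time.
    for i in range(0, len(records), columns):
        cells = []
        for rec in records[i:i + columns]:
            title = escape(rec["title"])
            url = escape(rec["url"], quote=True)
            thumb = escape(rec["thumbnail"], quote=True)
            # Each cell is a thumbnail and a title, both linking to the record.
            cells.append(
                f'<td><a href="{url}"><img src="{thumb}" alt="{title}">'
                f"<br>{title}</a></td>"
            )
        rows.append("<tr>" + "".join(cells) + "</tr>")
    return '<table class="document-grid">\n' + "\n".join(rows) + "\n</table>"


if __name__ == "__main__":
    # Invented example records; a real export would take them from a search result.
    demo = [
        {"title": f"Paper {n}",
         "url": f"http://repository.example.org/{n}/",
         "thumbnail": f"http://repository.example.org/{n}/thumb.png"}
        for n in range(1, 7)
    ]
    print(document_grid(demo, columns=3))
```

Most of the value is in having thumbnails and stable record URLs to point at; the grid itself is only a few lines of templating.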

Friday, 17 April 2009

EPrints and its Development

I'm in the process of writing a paper about the first ten years of EPrints (yes, it'll be 10 years old at the end of October 2009), and I've been trying to put together a comprehensive overview of the internal construction of EPrints as it stands in 2009. What you might call an "architecture diagram for users".

Stung into action by John Robertson's recent blog entry on repository developments which mentions only a few of the ideas that we are working on, I thought it might be a good idea to share a draft version of this diagram.

The PDF linked from this posting shows my understanding of the internals of EPrints, highlighting the bits that we are working on at the moment in the version 3.2 development track.

Some more details about the 24 new features planned for the next release of EPrints can be found on the EPrints Wiki. Presentations and demos will be forthcoming at Open Repositories 2009, ECDL, OAI6, Sun PASIG and all good repository workshops in your area :-)

Cloud, Web, Intranet and Desktop Connectivity - repository data can now be stored in the cloud, on the web, on an intranet storage service, on a local disk or on any combination of the above. Also, the contents of the repository can be mounted on the user's desktop as a 'virtual file system'.

Desktop Document Support - thumbnails and embedded metadata extraction are provided for Microsoft Office documents. Media copyright checklists are generated for PowerPoint slideshows to assist Open Access clearance for lecture slides. Complex thumbnails are now supported, such as multi-image thumbnails for a slideshow or an embedded FLV clip of a video.

Research Management - Support for new kinds of administrator-defined data objects with project, organisation and people datasets as standard to provide compatibility with Current Research Information Systems (CRIS). Citation reporting will use ISI's Web of Science as well as Google Scholar.

Preservation Support - Preservation Planning Capabilities embedded in the repository using PRONOM and DROID.

Improved EPrints Data Model - as well as eprints, now files, documents, users and all data objects have persistent URIs and arbitrary relationships between them. RDF export plugin provides linked data capabilities, and a new REST interface provides an API to all EPrints data.

Improved Interoperability and Standards - SWORD 2 (v1.3 Specification), new OAI-ORE Import and Export Plug-ins, RDF plugins improved to provide better support for W3C Linked Data, CERIF support for Current Research Information Systems and enhanced compatibility for DRIVER project systems.

Miscellaneous Improvements - there are more enhancements to repository administration and improvements to the way that abstract pages are generated. IRStats/EPStats are better integrated with EPrints distribution. Autocompletion/Name Authorities have been added for Institutions and Geographical Places (both with geolocation data). Enhanced User Profiles allow for more CV-relevant information than just publication lists. User-defined collections provide "shopping trolley" functionality for ephemeral compilations as well as persistent collections. A Scheduler / Calendar for planning for embargoes, licenses, preservation activities, periodic maintenance activities etc. Quality Assurance Issues can be manually raised and resolved. PDF coverpage capabilities will be provided as standard.

Google and Repositories

Continuing yesterday's comments on the effect of Google PageRanking in resource discovery, there is an added Google effect that compounds the problem of discovering resources in repositories. Google doesn't treat each resource separately, but instead it aggregates all the resources from a single site, showing only the top two resources from that site no matter how many should appear.

For example, if I search for the terms "ontology" and "hypertext" directly in our school repository, 8 articles are returned. If I do the same search in Google, then our repository appears gratifyingly at the top of the list of results, but only TWO of those items are listed together with a discrete link to more items from this site.

So, not only is your article in competition with all other web pages on the planet, it is doubly in competition with other articles in your repository which could deprive it of its rightful place in the rankings.

This means that we need to think about redesigning our repository pages to link to "other related work" that the visitor may not have seen represented in Google.

Thursday, 16 April 2009

PageRank and Repositories

I commented before on the big impact of Google on repositories, and the way that it overwhelms all other forms of access to repository contents. Today I've had a look at the log files for our EPrints server to find all the web requests referred by Google (for any kind of page - abstract or full text or collection list). As a result of a conversation with someone on the topic of search, I wanted to check the "tenacity" of the Google enquirers. Since before the advent of the Web it's been common knowledge in the Hypertext research community that people tend not to scroll and click more than they have to when navigating an information system.

It would be nice to think that repository users (whoever they are) carefully looked through all the relevant and useful results returned by Google; but practical considerations mean that their investigations are more limited. In fact, 78% of our Google referrals came from the FIRST results page of a Google query.

This means that it is really important to make sure that your repository pages get a good PageRank - there are only ten opportunities for your content to appear in front of most Google users. If your paper happens to fall in at position 11 you have a much reduced chance of being found.
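For anyone wanting to repeat this kind of measurement on their own server, the following is a minimal sketch in Python, not the script used for the figures above. It assumes the logs are in Apache "combined" format and that a Google referrer carries the query in a `q` parameter and the result offset in a `start` parameter (absent on the first page, 10 on the second, and so on); both are assumptions about the logs, and the function names are invented for the example.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumption: Apache "combined" log format, i.e. the line contains
# "request" status bytes "referrer" "user-agent".
LOG_LINE = re.compile(
    r'"(?P<request>[^"]*)" \d{3} \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)
GOOGLE_HOST = re.compile(r"(^|\.)google\.")


def google_result_page(referrer, per_page=10):
    """Return the 1-based Google results page for a referrer URL, or None."""
    parts = urlparse(referrer)
    host = parts.hostname or ""
    if not GOOGLE_HOST.search(host):
        return None
    query = parse_qs(parts.query)
    if "q" not in query:
        return None  # a Google page, but not a search results page
    try:
        start = int(query.get("start", ["0"])[0])
    except ValueError:
        return None
    # start=0..9 is page 1, start=10..19 is page 2, and so on.
    return max(start, 0) // per_page + 1


def first_page_share(log_lines):
    """Fraction of Google-referred requests that came from results page 1."""
    pages = []
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue
        page = google_result_page(match.group("referrer"))
        if page is not None:
            pages.append(page)
    if not pages:
        return None
    return sum(1 for p in pages if p == 1) / len(pages)


if __name__ == "__main__":
    import sys
    # Usage: python google_referrals.py access.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        print(first_page_share(log))
```

The same list of page numbers could just as easily be turned into a histogram, to see how quickly referrals tail off after the first results page.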