Friday 26 June 2009

Hardworking Repositories: The Global Picture

To round off the picture of hardworking repositories (i.e. repositories which receive regular daily deposits), here are the global top ten repositories, listed with the number of days in the last year on which deposits were made. The data is obtained from the Registry of Open Access Repositories.

ORBi (University of Liege, Belgium)  311
IR of the University of Groningen (Netherlands)  301
KAR - Kent Academic Repository (UK)  286
University of Southampton: School of Electronics and Computer Science
UBC cIRcle (University of British Columbia, Canada)  269
LSE Research Online (London School of Economics, UK)  260
EEMCS EPrints Service (School of Electronics and Computer Science, University of Twente, Netherlands)
LUP: Lund University Publications (Sweden)  259
UPSpace at the University of Pretoria (South Africa)  257
University of Tilburg (Netherlands)  256

There are all sorts of caveats attached to this list! Firstly, I removed two entries because they were not "institutional" but "national" in scope. Secondly, I left in two "departmental" repositories (ECS and EEMCS) because - dammit, if a department can achieve regular deposits then so should a whole institution! Thirdly, this table depends on OAI harvested data from ROAR - if there are any problems with the OAI feed then it will affect the analysis. And perhaps most importantly, this table does not take into account the types of deposit that were made on the days in question. They could be research articles, research data, teaching material, holiday photographs, or bibliographic records sans open access full text. So for example, the UBC repository is mainly composed of student theses and dissertations.

As I have said in the last two postings in this blog, this list simply reflects how much deposit usage the repository is getting on a daily basis and it deliberately factors out the number of deposits in order to smooth over the effect of batch imports from external data sources. The emphasis is on finding a simple metric to highlight embedded usage of a repository across a whole institution.
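As a rough sketch of the metric (in Python, with made-up deposit dates; this is not the code that ROAR or I actually ran), counting distinct deposit days rather than deposits is what flattens a batch import:

```python
from datetime import date

def deposit_days(deposit_dates):
    """Count the distinct calendar days on which at least one deposit was made."""
    return len(set(deposit_dates))

# Hypothetical data: a 500-item batch import on one day,
# followed by three ordinary deposits spread over two days.
deposits = [date(2009, 3, 2)] * 500 + [
    date(2009, 3, 3),
    date(2009, 3, 4),
    date(2009, 3, 4),
]

print("deposits:", len(deposits))
print("deposit days:", deposit_days(deposits))
```

The batch import contributes 500 deposits but only one deposit day, so it counts for no more than a single researcher depositing a single paper.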

Wednesday 24 June 2009

Hardworking Repositories: Comparing UK & US

To go with the list of UK repositories, here are the top 10 most hardworking US repositories, based on the number of days of deposit activity that they achieved in the last year according to ROAR.

RIT Digital Media Library  253
Georgia Tech's Institutional Repository: SMARTech  252
ScholarSpace at University of Hawaii at Manoa  248
NITLE DSpace Service: Middlebury College  245
Trinity University  239
AgSpace: Home  234
Florida State University D-Scholarship Repository  231
DigitalCommons@Florida Atlantic University  230

Once again, congratulations to those on the list. The methodology for drawing up this list was deliberately devised to promote daily engagement rather than numbers of deposits, in order to try and factor out bulk imports from external data services.

(I am slightly hesitant about publishing this list, because I am less familiar with the US repository scene than with that in the UK. That means that I have difficulty sanity-checking the list - in particular, the Middlebury College/Trinity services seem to be registered with the same host, even though their front ends are delivered from different host names. Do they genuinely count as separate repositories?)

These two lists (US/UK) do show some apparent differences in practice. If the headline numbers (days on which deposits are made) are subdivided into three categories (few deposits 1-9, medium 10-99 and high 100+) then it appears that the UK repositories are dominated by medium deposit days, and the US repositories by few deposit days.

Is this difference significant? Is it an artefact of the workflows and processes of the repository software platforms (the UK table is dominated by EPrints, the US table by DSpace)? Is it due to the different sizes of the host institutions? Or does it show a genuine difference in practice in terms of individual self-archiving vs proxy deposit? There needs to be some more analysis.
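For anyone who wants to repeat the three-way breakdown described above, here is a minimal Python sketch. The function names and the sample counts are mine and purely illustrative, and I am assuming the medium band starts at 10 so that every deposit day falls into exactly one band.

```python
from collections import Counter

def band(deposits_on_day):
    """Classify one deposit day by how many items were deposited on it."""
    if deposits_on_day >= 100:
        return "high"
    if deposits_on_day >= 10:  # assumption: medium runs from 10 to 99
        return "medium"
    return "few"

def profile(daily_counts):
    """Tally deposit days per band.

    daily_counts holds the number of deposits made on each day
    that had at least one deposit.
    """
    return Counter(band(n) for n in daily_counts)

# Made-up daily counts for one repository, not real ROAR data.
sample = [1, 3, 9, 10, 42, 99, 100, 250]
print(profile(sample))
```

Comparing the resulting profiles for two repositories shows whether their deposit days are dominated by a trickle of individual deposits or by larger batches.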

Tuesday 23 June 2009

Hard Working Repositories

There are lots of ways to measure the productivity of a repository, but in Size Isn't Everything: Sustainable Repositories as Evidenced by Sustainable Deposit Profiles I argued for counting the number of days per year that deposits had been made into the repository as a way of capturing its 'vitality' and 'embeddedness' and so highlighting repositories with broad-based researcher adoption.

Based on that metric, here is a top 10 list of the hardest working institutional repositories in the UK (data taken from ROAR).
If you factor out weekends, Christmas/Easter breaks and other public holidays there are about 233 days that a UK University is open for business. So congratulations particularly to Kent, the LSE, and my colleagues in the library at Southampton whose repositories are working unpaid overtime!

Friday 19 June 2009

Getting Metadata from the Semantic Desktop

In my last post, I discussed the metadata infrastructure that underpins the Macintosh desktop environment. In addition, thanks to some handholding from Chris Gutteridge, I've just configured the built-in Web server to download the documents themselves or metadata about those documents (in RDF, generated dynamically from the mdls command).
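To give a flavour of the metadata-to-RDF step, here is a minimal Python sketch. It is not the script running on my machine: it assumes the mdls output has already been parsed into a dictionary, it emits N-Triples with every value as a plain string literal, and the predicate namespace is a hypothetical placeholder.

```python
# Hypothetical namespace for predicate URIs; the operating system defines no such thing.
NS = "http://example.org/spotlight#"

def escape_literal(text):
    """Escape a string for use inside an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))

def to_ntriples(file_uri, metadata):
    """Turn a dict of metadata (already parsed from mdls) into N-Triples lines."""
    lines = []
    for key, value in sorted(metadata.items()):
        # Multi-valued attributes (e.g. authors) become one triple per value.
        values = value if isinstance(value, list) else [value]
        for v in values:
            lines.append('<%s> <%s%s> "%s" .'
                         % (file_uri, NS, key, escape_literal(str(v))))
    return "\n".join(lines)

# Illustrative values only; the file path and names are made up.
meta = {
    "kMDItemTitle": "Hardworking Repositories",
    "kMDItemAuthors": ["A. Author", "B. Author"],
}
print(to_ntriples("file:///Users/me/talk.ppt", meta))
```

A fuller version would give dates and numbers typed literals rather than flattening everything to strings.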

I've now got a pseudo repository on the desktop that contains all the source versions of my PowerPoint and Office documents, together with metadata about them. There are visualisation, search and editing services provided by the desktop and a Web dissemination system cobbled on the side.

I've also got a real repository on a server that contains the source and preprocessed versions of my PowerPoint documents, together with some metadata about them. There are visualisation, search and (planned) editing services provided by the repository's Web dissemination system.

Can these two work efficiently together, so that the conjunction of the desktop and the repository is greater than the sum of its components? Or is this an exercise in reinventing the wheel just to make a point? I hope the former...

Sunday 14 June 2009

The Desktop Repository that's Already There

It's really time I acknowledged Peter Sefton, who's doing a lot of work on PowerPoint and slide bursting for the Fascinator Desktop, part of his project to bring open HTML formats to the desktop. Peter visited Southampton earlier this year, and inspired me on the topic. I'd just got knocked back on a JISC proposal for looking at repository - desktop integration, so it was great to talk to someone else who wanted to do something in the area. We both seem to be goading each other on at the moment and we've been tweeting and emailing each other, but I've not given him his due credit in this blog so far.

I've been surprised to see how much of the infrastructure for a desktop repository is already in place in the operating system that he and I use (Mac OS X). The Mac already has a process that extracts metadata and data contents from each file into a central database (see mds(8) in the Unix manual pages); this process is alerted to update the database every time a new file is created or an old file is changed. There is an interface for querying the database (Spotlight), either looking just for matches of the contents, or for complex boolean queries based on the metadata and contents. There is also a sophisticated framework for generating and caching previews and thumbnails (QuickLook). A system that provides data and metadata handling in a centralised database with querying and visualisation facilities all sounds very repository-like to me. And in case you think that I'm overegging this pudding, here's a list of some of the common metadata that OS X will allow you to query (not including media-specific metadata):

AudiencesThe intended audience of the file.
AuthorsThe authors of the document.
CityThe document’s city of origin.
CommentComments regarding the document.
ContactKeywordsA list of contacts associated with the document.
ContentCreationDateThe document’s creation date.
ContentModificationDateLast modification date of the document.
ContributorsContributors to this document.
CopyrightThe copyright owner.
CountryThe document’s country of origin.
CoverageThe scope of the document, such as a geographical location or a period of time.
CreatorThe application that created the document.
DescriptionA description of the document.
DueDateDue date for the item represented by the document.
DurationSecondsDuration (in seconds) of the document.
EmailAddressesEmail addresses associated with this document.
EncodingApplicationsThe name of the application (such as “Acrobat Distiller”) that was responsible for converting the document in its current form.
FinderCommentThis contains any Finder comments for the document.
FontsFonts used in the document.
HeadlineA headline-style synopsis of the document.
InstantMessageAddressesIM addresses/screen names associated with the document.
InstructionsSpecial instructions or warnings associated with this document.
KeywordsKeywords associated with the document.
KindDescribes the kind of document, such as “iCal Event.”
LanguagesLanguage of the document.
LastUsedDateThe date and time the document was last opened.
NumberOfPagesPage count of this document.
OrganizationsThe organization that created the document.
PageHeightHeight of the document’s page layout in points.
PageWidthWidth of the document’s page layout in points.
PhoneNumbersPhone numbers associated with the document.
ProjectsNames of projects (other documents such as an iMovie project) that this document is associated with.
PublishersThe publisher of the document
RecipientsThe recipient of the document
RightsA link to the statement of rights (such as a Creative Commons or old-school copyright license) that govern the use of the document.
SecurityMethodEncryption method used on the document.
StarRatingRating of the document (as in the iTunes “star” rating).
StateOrProvinceThe document’s state or province of origin.
TitleThe title.
VersionThe version number.
WhereFromsWhere the document came from, such as a URI or email address.

That's a pretty impressive list, and it is fully typed as well, so dates are dates and numbers are numeric, meaning that you can do proper range searches, not just text matches. Still, the Mac implementation has enough limitations to mean that we haven't yet thought of it as a repository:
  1. it's a proprietary system. You can't access the thumbnails or export the metadata.
  2. there isn't any way of manually entering or editing the metadata - it's all automatically extracted from the file contents by the ingesters/importers
  3. there isn't any particularly useful way of displaying the metadata, apart from in the Finder's "Get Info" box or on the commandline (using the mdls program).
Issues (1) and (3) just reduce to coding better applications. There are a number of Finder replacements, but none of them really take the metadata seriously. There are also a number of tagging applications that have emerged in the last year or so, but they use a very narrow range of metadata. Someone could add a faceted browser interface to the Finder, or integrate some more explicitly bibliographic metadata into the Apple infrastructure.

Further reading around shows that issue (2) is also surmountable; extra metadata can be attached to a file through the use of the Mac filesystem's extended attributes. As well as the Title and Author information that the Microsoft Office importer produces, extended attributes with names like are inspected when the file is indexed. The value of that attribute is an "OS X Property List value" i.e. a number, boolean, date, string or array stored as binary or XML.
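To see what a "Property List value" looks like in practice, here is a small Python sketch using the standard plistlib module. It only shows the encoding; attaching the result to a file as an extended attribute is a separate, Mac-specific step that I have left out, and the sample values are made up.

```python
import plistlib
from datetime import datetime

def round_trips(value):
    """Check that a value survives encoding as both binary and XML property lists."""
    as_binary = plistlib.dumps(value, fmt=plistlib.FMT_BINARY)
    as_xml = plistlib.dumps(value, fmt=plistlib.FMT_XML)
    return plistlib.loads(as_binary) == value and plistlib.loads(as_xml) == value

# One example of each kind of value mentioned above.
samples = [
    42,                                 # number
    True,                               # boolean
    datetime(2009, 6, 14, 12, 0, 0),    # date
    "open access",                      # string
    ["repositories", "metadata"],       # array
]
for value in samples:
    print(type(value).__name__, round_trips(value))

# The binary form is what would be stored as the attribute's value.
keywords_blob = plistlib.dumps(["repositories", "metadata"], fmt=plistlib.FMT_BINARY)
print(keywords_blob[:8])
```

Because the values keep their types, a date stored this way is still a date when it is read back, which is what makes range searches possible.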

This looks like a very useful platform on which to build the researcher's desktop repository; a few added user-centric applications for browsing and editing metadata, together with some software to synchronise the desktop repository with the institutional repository (something like Time Machine) and we would have a very powerful system indeed.

Now I really do have to get on with that marking!

Friday 12 June 2009

More on the Desktop Repository

I've done some more experimentation on the Desktop Repository idea - strangely coinciding with another 100 exam scripts appearing on my desk to be marked.

Firstly, I've had a go at moving the PowerPoint image data back to an EPrints repository. Each slideshow appears as a separate eprint record, with each of the individual slide images appearing as a separate subdocument, with its own metadata (title/caption etc). A document search allows individual slides to be selected on a specific topic from across all the slideshows. They can then be viewed or exported, and my previous comments about creating new slideshows apply as before.
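The shape of that arrangement can be sketched in a few lines of Python. To be clear, this is not the EPrints data model or its search API: the class names, fields and sample slides are all hypothetical, and the search is a naive substring match.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Slide:
    """One slide image, stored as a subdocument with its own metadata."""
    number: int
    title: str
    caption: str = ""

@dataclass
class Slideshow:
    """One record per slideshow, holding its slides as subdocuments."""
    title: str
    slides: List[Slide] = field(default_factory=list)

def find_slides(slideshows, phrase):
    """Return (slideshow title, slide number) for every slide mentioning the phrase."""
    phrase = phrase.lower()
    return [
        (show.title, slide.number)
        for show in slideshows
        for slide in show.slides
        if phrase in slide.title.lower() or phrase in slide.caption.lower()
    ]

# Made-up records for illustration.
shows = [
    Slideshow("Repositories 101", [
        Slide(1, "What is a repository?"),
        Slide(2, "Open Access mandates"),
    ]),
    Slideshow("Web Science", [
        Slide(1, "The Web as infrastructure"),
        Slide(2, "Metrics", caption="Usage and open access impact"),
    ]),
]
print(find_slides(shows, "open access"))
```

The point is that the search result is a set of individual slides drawn from several slideshows, each still carrying a pointer back to where it came from.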

Secondly, I've been thinking about how to manage individual slides out of the context of the PowerPoint slideshow wrapper that they were created in. Either a new document format has to be created, or I just use a singleton slideshow object (i.e. a PPTX file with just one slide in it). I think that the latter will be easier to handle, because the problem of how to discriminate between an n-slide slideshow and a 1-slide slideshow is easier to solve than the problem of how to manage a whole new document format!
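Telling the two cases apart really is the easy part. A PPTX file is a zip package whose slides live in parts named ppt/slides/slide1.xml, ppt/slides/slide2.xml and so on, so a rough Python check can simply count them. This is a sketch under that assumption; a careful implementation would read the slide list in ppt/presentation.xml instead, and the helper that builds a fake package exists only to make the example self-contained.

```python
import io
import re
import zipfile

# Slide parts inside a PPTX package; relationship files and layouts do not match.
SLIDE_PART = re.compile(r"^ppt/slides/slide\d+\.xml$")

def slide_count(pptx):
    """Count the slide parts in a PPTX package (a path or a file-like object)."""
    with zipfile.ZipFile(pptx) as package:
        return sum(1 for name in package.namelist() if SLIDE_PART.match(name))

def is_singleton(pptx):
    """True if the package holds exactly one slide."""
    return slide_count(pptx) == 1

def fake_package(n):
    """Build an in-memory zip that mimics the slide layout of an n-slide PPTX."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        for i in range(1, n + 1):
            package.writestr("ppt/slides/slide%d.xml" % i, "<sld/>")
        package.writestr("ppt/slides/_rels/slide1.xml.rels", "<Relationships/>")
    buffer.seek(0)
    return buffer

print(slide_count(fake_package(12)), is_singleton(fake_package(12)))
print(slide_count(fake_package(1)), is_singleton(fake_package(1)))
```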

Thirdly, a colleague of mine (Dave Challis, the webmaster here at Southampton) is creating some software for manipulating Office Open XML files so that a repository (such as EPrints) can use PowerPoint packages much more easily. The aim is to have Perl and Java modules that will enable collections and sets of repository items to be easily rewritten as slideshows; and if those items are individual slides in the first place (see above) then the ability to conjure slides between slideshows is guaranteed.

This is all a bit of a step back from the truly desktop repository, but EPrints does give me a framework to deal with structured data and metadata. The desktop itself is great at dealing with files, but delegates all of the complexity of those files to applications. The file system has facilities for storing metadata (see the BSD xattr command), but very few commandline tools for managing and manipulating it. So I'll use EPrints to give me some experience with handling large collections of personal data, and then see how far I can push those capabilities back to the desktop.

Must dash, I have some marking to do.

Thursday 11 June 2009

Special Issue of the New Review on Information Networking

The New Review on Information Networking seeks original manuscripts for a special issue on Repository Architectures, Infrastructures and Services to appear in Autumn 2009.

The aim of this issue is to further our understanding of how repositories are delivering services and capability to the scholarly and scientific community by marshalling resources at the institutional scale and delivering at the global scale.

Considerable progress in this area has been achieved under the "Open Access" banner and this special issue aims to explore the technical aspects of facilitating the scientific and scholarly commons: open access to research literature, research data, scholarly materials and teaching resources.

Topics for this special issue include (but are not limited to):

  • Repository architecture, infrastructure and services
  • Repositories supporting scholarly communications
  • Repositories supporting e-research and e-researchers
  • Integrating with publishing and publishing platforms
  • Repositories and research information systems
  • Integrating with other infrastructure platforms e.g., cloud, Web2
  • Integrating with other data sources, linked data and the Semantic Web
  • Scaling repositories for extreme requirements
  • Computational services and interfaces across distributed repositories
  • Content & metadata standards
  • OAI services
  • Web services, Web 2.0 services, mashups
  • Social networking, annotation / tagging, personalization
  • Searching and information discovery
  • Reference, reuse, reanalysis, re-interpretation, and repurposing of content
  • Persistent and unambiguous citation and referencing for entities: individuals, institutions, data, learning objects
  • Repository metrics and bibliometrics: usage and impact of scholarly and scientific knowledge

Scope of the New Review on Information Networking

A huge number of reports has been published in recent years on the changing nature of users; on the changing nature of information; on the relevance of current organisational structures to generations apparently weaned on social networks. Reading this mass of literature, far less digesting it, then assimilating it into future strategy is a Sisyphean task, but one ideally suited to this journal. Individual services from Second Life to Twitter will no doubt wax and wane but we shall seek to publish those papers which address the fundamental underlying principles of the increasingly complex information landscape which organisations inhabit.

Important dates:

Submission of full paper: 31st July 2009

Notification deadline: 1st September 2009

Re-submission of revised papers: 15th September 2009

Publication: Autumn 2009

Submissions and Enquiries

Papers submitted to this special issue must not have been previously published or be currently submitted for journal publication elsewhere.

Submissions should ideally be in the range of 3,500 - 4,000 words.

Submissions and enquiries should be made by email to the editor of this special issue: Leslie Carr, University of Southampton, UK (

Tuesday 9 June 2009

A Desktop Repository

You can tell that it's exam marking season, because I am obsessed by displacement activities. Further to my last post, I've managed to create a kind of pseudo-repository on my desktop (DeskSpace? EDesk? Deskora?)

iPhoto is managing collections of PowerPoint slides (actually 2549 slides from 109 slideshows which represents about 10% of the total number of slideshows on my laptop). Every slide is of course just an image of its original self (iPhoto is a photo application after all!) but courtesy of each image's embedded EXIF metadata I can search for slides that contained a particular phrase, regardless of the presentation in which they were originally stored. Then I can export that collection of individual images to an external program that uses the provenance metadata stored in the images to construct a new slideshow from the source components of the original PowerPoint files.

At the moment it's the kind of repository that Heath Robinson would sell you (a set of scripts more than a set of services :-), but I think that it ticks most of the boxes: there is an ingest procedure, collection management, browsing, searching, metadata, packaging formats and dissemination processes. And to accomplish some form of preservation I could even print all the slides into a very desirable coffee-table book or burn a DVD slideshow.

(The top image is a screendump from iPhoto showing slides from four presentations, the bottom image shows a new PowerPoint presentation made from slides containing the term "Open Access". The slides were identified in iPhoto but created from PowerPoint source files.)

This brings up some nice repository challenges:
  • managing packages and components simultaneously, even when the components can't have an independent existence. Slides can't exist outside a presentation in the same way that paragraphs can't exist outside a document or cells outside a spreadsheet.
  • visualising huge amounts of data. Being able to scroll through dozens of presentations at once is incredibly liberating, compared to opening them individually and watching PowerPoint draw the slide sorter previews v..e..r..y.....s..l..o..w...l....y at a choice of three sizes.
  • PowerPoint, like RSS, is a rather nice packaging format that could be used much more often by repositories. How about saving your search results as a PowerPoint presentation?

Tuesday 2 June 2009

Managing PowerPoint? Repositories and the Office Desktop

It turns out that I have 1009 PowerPoint files on my laptop and I don't know what most of them contain, let alone what I can reuse for any future presentations that I am planning.

I'd at least like an overview of all the slides in all those presentations, so that I can organise them. Then I'd like to compare all these slideshows, delete the duplicates, note the variations and evolutionary history between different versions of the same presentation, and between different presentations on the same subject. I'd like to trace the cross-pollination of slides between different subjects. Microsoft SharePoint has the concept of a Slide Library ("a secure, online repository in which PowerPoint presentations can be stored, worked on and shared") but expects you to do all the organisational work, whereas I want something that will help to apply some organisation.

Should I do this on my laptop? Or should I try and do this on (shudder) an environment that sells itself as providing content curation and management services? Oh all right then, I'll do it in a repository. But I don't think it's going to be easy - for a start we're talking about efficient user tools for ingesting, comparing, contrasting and refining 1,000 items.

Still, there's a basis to build from: SWORD and Microsoft Office Repository tools should help me to at least get all these items into the repository. Once we're there we can take stock of any low-hanging fruit (searching, reporting, cataloguing, thumbnail previews, exporting collections). I've already done some of the preparatory work on the laptop - using AppleScript to create preview images and textual contents of every slide of every presentation. Now I can package up all these things appropriately and see whether a repository actually gives me any added value.