Thursday, 2 July 2009

Institutional Visualisation

I've been working on the problem of showing the spread of research on a particular topic across the institution. The aim is to enable the repository to show the contribution of the various schools, groups and individuals in areas of strategic interest, and to allow the repository to play an active part in research management.

There are many standard techniques for plotting the magnitude of the contributions of individual authors, the relationships between co-authors (social network) and the patterns of co-operation between departments. Many of these visualisations are in the form of networks of nodes and arcs, produced by sophisticated layout algorithms which are difficult to control and difficult to interpret.

What I need to show my managers is a simple diagram that allows them to see the familiar structure of the university together with the dynamic and changing nature of the contributions in question. The image on the right shows an example of one such diagram that I am trying out. It shows the different layers of the university (the mega faculties in the centre, the 21 schools in the middle layer, and the various research groups as small stamps in the outermost layer). This diagram actually shows the relative research contribution of different schools and research groups to the topic "Renewable Energy", where dark colours mean more relevant outputs in the repository. (For the curious amongst you, "FESM" is the Faculty of Engineering, Science and Maths, so it is hardly surprising that it has the lion's share of contribution to the topic. But the value of the diagram is in its ability to show up activity where we hadn't expected it - in this case in the School of Biological Sciences.)

What surprised me was that I had to create this diagram by myself. There are no models, maps or diagrams of our institutional structure - even the so-called "org chart" is just a table in a Word document. Looking around other universities, I can't see any charts or diagrams that are meant to act as a model of the organisation. I can't believe that they don't exist. Can anyone point me to some?

(Technical background: I created the basic diagram in Excel using an "Exploded Doughnut" chart. I then saved that to PDF, imported the PDF into Illustrator and exported that into SVG, where I added some JavaScript to allow the diagram to shade itself according to the list of schools and research groups passed in as a CGI parameter. A repository export plugin passes the organisational affiliation data from a set of eprints to the SVG diagram.)
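The shading step can be sketched outside the browser as well. This is only an illustration in Python rather than the JavaScript embedded in the real diagram, and it assumes that each school or group is a shape with an id attribute (the ids, the sample SVG and the grey scale below are all invented for the example).

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# Hypothetical miniature diagram: one shape per organisational unit.
# The ids are invented; the real diagram's ids are not given in this post.
SAMPLE_SVG = f"""<svg xmlns="{SVG_NS}">
  <path id="faculty-fesm" d="M0 0"/>
  <path id="school-biology" d="M0 0"/>
  <path id="school-law" d="M0 0"/>
</svg>"""


def shade(svg_text, counts):
    """Return the SVG with each listed unit filled; darker means more outputs."""
    ET.register_namespace("", SVG_NS)
    root = ET.fromstring(svg_text)
    highest = max(counts.values(), default=0)
    for element in root.iter():
        unit = element.get("id")
        if unit in counts and highest > 0:
            # Scale the grey level from 230 (few outputs) down to 40 (most outputs).
            level = 230 - round(190 * counts[unit] / highest)
            element.set("fill", f"rgb({level},{level},{level})")
    return ET.tostring(root, encoding="unicode")


# Invented counts of relevant outputs per unit.
shaded = shade(SAMPLE_SVG, {"faculty-fesm": 40, "school-biology": 20})
print(shaded)
```

Units that are not in the list are left untouched, so unexpected activity stands out against an otherwise plain diagram.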

Friday, 26 June 2009

Hardworking Repositories: The Global Picture

To round off the picture of hardworking repositories (i.e. repositories which receive regular daily deposits), here are the global top ten repositories, listed with the number of days in the last year on which deposits were made. The data is obtained from the Registry of Open Access Repositories.

Days  Repository
311   ORBi (University of Liege, Belgium)
301   IR of the University of Groningen (Netherlands)
286   KAR - Kent Academic Repository (UK)
271   University of Southampton: School of Electronics and Computer Science (UK)
269   UBC cIRcle (University of British Columbia, Canada)
260   LSE Research Online (London School of Economics, UK)
260   EEMCS EPrints Service (School of Electronics and Computer Science, University of Twente, Netherlands)
259   LUP: Lund University Publications (Sweden)
257   UPSpace at the University of Pretoria (South Africa)
256   University of Tilburg (Netherlands)

There are all sorts of caveats attached to this list! Firstly, I removed two entries because they were not "institutional" but "national" in scope. Secondly, I left in two "departmental" repositories (ECS and EEMCS) because - dammit, if a department can achieve regular deposits then so should a whole institution! Thirdly, this table depends on OAI harvested data from ROAR - if there are any problems with the OAI feed then it will affect the analysis. And perhaps most importantly, this table does not take into account the types of deposit that were made on the days in question. They could be research articles, research data, teaching material, holiday photographs, or bibliographic records sans open access full text. So for example, the UBC repository is mainly composed of student theses and dissertations.

As I have said in the last two postings in this blog, this list simply reflects how much deposit usage the repository is getting on a daily basis and it deliberately factors out the number of deposits in order to smooth over the effect of batch imports from external data sources. The emphasis is on finding a simple metric to highlight embedded usage of a repository across a whole institution.
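A minimal sketch of the metric, assuming the harvested data has already been reduced to one date per deposited record (the function name and the sample dates are invented for illustration; this is not the ROAR code):

```python
from datetime import date, timedelta


def deposit_days(deposit_dates, today, window_days=365):
    """Count the distinct days in the window on which at least one deposit was made."""
    start = today - timedelta(days=window_days)
    return len({d for d in deposit_dates if start < d <= today})


# Invented sample: a 500-record batch import on one day, plus two single deposits.
today = date(2009, 6, 26)
dates = [date(2009, 3, 1)] * 500 + [date(2009, 3, 2), date(2009, 6, 25)]
active_days = deposit_days(dates, today)
print(active_days)  # 3
```

The batch import of 500 records counts as a single day, which is exactly the smoothing effect described above.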

Wednesday, 24 June 2009

Hardworking Repositories: Comparing UK & US

To go with the list of UK Repositories, here are the top 10 most hardworking US repositories, based on the number of days of deposit activity that they achieved in the last year according to ROAR.

Days  Repository
253   RIT Digital Media Library
252   Georgia Tech's Institutional Repository: SMARTech
248   ScholarSpace at University of Hawaii at Manoa
245   NITLE DSpace Service: Middlebury College
239   Trinity University
234   AgSpace: Home
231   Florida State University D-Scholarship Repository
231   eScholarship@Amherst
230   DigitalCommons@Florida Atlantic University
227   eCommons@Cornell

Once again, congratulations to those on the list. The methodology for drawing up this list was deliberately devised to promote daily engagement rather than numbers of deposits, in order to try and factor out bulk imports from external data services.

(I am slightly hesitant about publishing this list, because I am less familiar with the US repository scene than with that in the UK. That means that I have difficulty sanity-checking the list - in particular, the Middlebury College/Trinity services seem to be registered with the same host, even though their front ends are delivered from different host names. Do they genuinely count as separate repositories?)

These two lists (US/UK) do show some apparent differences in practice. If the headline numbers (days on which deposits are made) are subdivided into three categories according to the number of deposits made that day (few 1-9, medium 10-99 and high 100+) then it appears that the UK repositories are dominated by medium deposit days, and the US repositories by few deposit days.
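The subdivision can be sketched as follows, assuming the three bands are 1-9, 10-99 and 100+ deposits per day and that the input is simply a list of daily deposit counts (the function names and the sample figures are invented):

```python
from collections import Counter


def band(deposits_on_day):
    """Classify one day by the number of deposits made on it."""
    if deposits_on_day >= 100:
        return "high"
    if deposits_on_day >= 10:
        return "medium"
    return "few"


def deposit_profile(deposits_per_day):
    """Count the few/medium/high deposit days, ignoring days with no deposits."""
    return Counter(band(n) for n in deposits_per_day if n > 0)


# Invented daily counts for one repository.
profile = deposit_profile([1, 3, 12, 50, 250, 0])
print(profile["few"], profile["medium"], profile["high"])  # 2 2 1
```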



Is this difference significant? Is it an artefact of the workflows and processes of the repository software platforms (the UK table is dominated by EPrints, the US table by DSpace)? Is it due to the different sizes of the host institutions? Or does it show a genuine difference in practice in terms of individual self-archiving vs proxy deposit? There needs to be some more analysis.

Tuesday, 23 June 2009

Hard Working Repositories

There are lots of ways to measure the productivity of a repository, but in "Size Isn't Everything: Sustainable Repositories as Evidenced by Sustainable Deposit Profiles" I argued for counting the number of days per year on which deposits had been made into the repository as a way of capturing its 'vitality' and 'embeddedness' and so highlighting repositories with broad-based researcher adoption.

Based on that metric, here is a top 10 list of the hardest working institutional repositories in the UK (data taken from ROAR).
If you factor out weekends, Christmas/Easter breaks and other public holidays there are about 233 days that a UK University is open for business. So congratulations particularly to Kent, the LSE, and my colleagues in the library at Southampton whose repositories are working unpaid overtime!

Friday, 19 June 2009

Getting Metadata from the Semantic Desktop

In my last post, I discussed the metadata infrastructure that underpins the Macintosh desktop environment. In addition, thanks to some handholding from Chris Gutteridge, I've just configured the built-in Web server to deliver either the documents themselves or metadata about those documents (in RDF, generated dynamically from the mdls command).
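As an illustration of the metadata half, here is a rough sketch that turns mdls-style text into simple N-Triples lines. It is not the script running on my machine: the sample text only imitates the usual "name = value" layout of mdls output, and the namespace http://example.org/mdls# and both function names are invented.

```python
import re

# Imitation of mdls output for one file (values invented).
SAMPLE_MDLS = '''kMDItemTitle         = "Repositories talk"
kMDItemNumberOfPages = 12
kMDItemAuthors       = (
    "A. Author",
    "B. Author"
)'''


def parse_mdls(text):
    """Parse mdls-style text into {attribute: value or list of values}."""
    attributes, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if current is not None:
            # Inside a parenthesised list: collect items until the closing ")".
            if line == ")":
                current = None
            else:
                attributes[current].append(line.rstrip(",").strip('"'))
            continue
        match = re.match(r"(\w+)\s*=\s*(.*)", line)
        if not match:
            continue
        name, value = match.groups()
        if value == "(":
            attributes[name] = []
            current = name
        else:
            attributes[name] = value.strip('"')
    return attributes


def to_ntriples(subject_uri, attributes, namespace="http://example.org/mdls#"):
    """Emit one plain-literal triple per value, using a made-up namespace."""
    lines = []
    for name, value in attributes.items():
        values = value if isinstance(value, list) else [value]
        for item in values:
            literal = item.replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'<{subject_uri}> <{namespace}{name}> "{literal}" .')
    return lines


metadata = parse_mdls(SAMPLE_MDLS)
triples = to_ntriples("http://example.org/docs/talk.pptx", metadata)
print("\n".join(triples))
```

On a Mac the text would come from running mdls on the requested file; everything is kept as a plain string here, so typed dates and numbers would need extra handling.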

I've now got a pseudo repository on the desktop that contains all the source versions of my PowerPoint and Office documents, together with metadata about them. There are visualisation, search and editing services provided by the desktop and a Web dissemination system cobbled on the side.

I've also got a real repository on a server that contains the source and preprocessed versions of my PowerPoint documents, together with some metadata about them. There are visualisation, search and (planned) editing services provided by the repository's Web dissemination system.

Can these two work efficiently together, so that the combination of the desktop and the repository is greater than the sum of its components? Or is this an exercise in reinventing the wheel just to make a point? I hope the former...

Sunday, 14 June 2009

The Desktop Repository that's Already There

It's really time I acknowledged Peter Sefton, who's doing a lot of work on PowerPoint and slide bursting for the Fascinator Desktop, part of his project to bring open HTML formats to the desktop. Peter visited Southampton earlier this year, and inspired me on the topic. I'd just got knocked back on a JISC proposal for looking at repository-desktop integration, so it was great to talk to someone else who wanted to do something in the area. We both seem to be goading each other on at the moment and we've been tweeting and emailing each other, but I've not given him his due credit in this blog so far.

I've been surprised to see how much of the infrastructure for a desktop repository is already in place in the operating system that he and I use (Mac OS X). The Mac already has a process that extracts metadata and data contents from each file into a central database (see mds(8) in the Unix manual pages); this process is alerted to update the database every time a new file is created or an old file is changed. There is an interface for querying the database (Spotlight), either looking just for matches of the contents, or for complex boolean queries based on the metadata and contents. There is also a sophisticated framework for generating and caching previews and thumbnails (QuickLook). A system that provides data and metadata handling in a centralised database with querying and visualisation facilities all sounds very repository-like to me. And in case you think that I'm overegging this pudding, here's a list of some of the common metadata that OS X will allow you to query (not including media-specific metadata):

Audiences: The intended audience of the file.
Authors: The authors of the document.
City: The document’s city of origin.
Comment: Comments regarding the document.
ContactKeywords: A list of contacts associated with the document.
ContentCreationDate: The document’s creation date.
ContentModificationDate: Last modification date of the document.
Contributors: Contributors to this document.
Copyright: The copyright owner.
Country: The document’s country of origin.
Coverage: The scope of the document, such as a geographical location or a period of time.
Creator: The application that created the document.
Description: A description of the document.
DueDate: Due date for the item represented by the document.
DurationSeconds: Duration (in seconds) of the document.
EmailAddresses: Email addresses associated with this document.
EncodingApplications: The name of the application (such as “Acrobat Distiller”) that was responsible for converting the document in its current form.
FinderComment: This contains any Finder comments for the document.
Fonts: Fonts used in the document.
Headline: A headline-style synopsis of the document.
InstantMessageAddresses: IM addresses/screen names associated with the document.
Instructions: Special instructions or warnings associated with this document.
Keywords: Keywords associated with the document.
Kind: Describes the kind of document, such as “iCal Event.”
Languages: Language of the document.
LastUsedDate: The date and time the document was last opened.
NumberOfPages: Page count of this document.
Organizations: The organization that created the document.
PageHeight: Height of the document’s page layout in points.
PageWidth: Width of the document’s page layout in points.
PhoneNumbers: Phone numbers associated with the document.
Projects: Names of projects (other documents such as an iMovie project) that this document is associated with.
Publishers: The publisher of the document.
Recipients: The recipient of the document.
Rights: A link to the statement of rights (such as a Creative Commons or old-school copyright license) that govern the use of the document.
SecurityMethod: Encryption method used on the document.
StarRating: Rating of the document (as in the iTunes “star” rating).
StateOrProvince: The document’s state or province of origin.
Title: The title.
Version: The version number.
WhereFroms: Where the document came from, such as a URI or email address.

That's a pretty impressive list, and it is fully typed as well, so dates are dates and numbers are numeric, meaning that you can do proper range searches and not just text matches. Still, the Mac implementation has enough limitations to mean that we haven't yet thought of it as a repository:
  1. it's a proprietary system. You can't access the thumbnails or export the metadata.
  2. there isn't any way of manually entering or editing the metadata - it's all automatically extracted from the file contents by the ingesters/importers.
  3. there isn't any particularly useful way of displaying the metadata, apart from in the Finder's "Get Info" box or on the command line (using the mdls program).
Issues (1) and (3) just reduce to coding better applications. There are a number of Finder replacements, but none of them really take the metadata seriously. There are also a number of tagging applications that have emerged in the last year or so, but they use a very narrow range of metadata. Someone could add a faceted browser interface to the Finder, or integrate some more explicitly bibliographic metadata into the Apple infrastructure.
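As a small taste of what such an application could build on, here is a sketch of a typed range search handed to the mdfind command-line tool. It only constructs the query and the command; the helper names and the example folder are invented, and the query follows my understanding of the Spotlight query syntax rather than anything tested exhaustively.

```python
def page_range_query(min_pages, max_pages):
    """Build a Spotlight query for documents whose page count lies in a range."""
    return (f"kMDItemNumberOfPages >= {int(min_pages)} && "
            f"kMDItemNumberOfPages <= {int(max_pages)}")


def mdfind_command(query, folder=None):
    """Return the argument list for mdfind, optionally limited to one folder."""
    command = ["mdfind"]
    if folder:
        command += ["-onlyin", folder]
    return command + [query]


query = page_range_query(10, 50)
command = mdfind_command(query, folder="/Users/me/Documents")  # hypothetical folder
print(command)
# On a Mac the list could be passed to subprocess.run(command, capture_output=True, text=True)
```

Because the page count is stored as a number, the comparison is numeric rather than a text match.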

Further reading around shows that issue (2) is also surmountable; extra metadata can be attached to a file through the use of the Mac filesystem's extended attributes. As well as the Title and Author information that the Microsoft Office importer produces, extended attributes with names like com.apple.metadata:kMDItemPhoneNumbers are inspected when the file is indexed. The value of that attribute is an "OS X Property List value" i.e. a number, boolean, date, string or array stored as binary or XML.
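A sketch of what preparing such a value might look like, using Python's plistlib to build a binary property list and the hex option of the xattr tool to attach it. The helper names, the file path and the phone numbers are invented, and I have not verified how Spotlight treats every value type written this way.

```python
import plistlib

ATTRIBUTE = "com.apple.metadata:kMDItemPhoneNumbers"


def encode_value(value):
    """Serialise a number, boolean, date, string or list as a binary property list."""
    return plistlib.dumps(value, fmt=plistlib.FMT_BINARY)


def xattr_command(path, name, value):
    """Return the argument list for the Mac's xattr tool (not run here).

    With -w and -x, xattr takes the attribute value as a hexadecimal string.
    """
    return ["xattr", "-w", "-x", name, encode_value(value).hex(), path]


numbers = ["+44 23 8059 0000", "+44 23 8059 0001"]  # invented phone numbers
encoded = encode_value(numbers)
command = xattr_command("talk.pptx", ATTRIBUTE, numbers)  # hypothetical file
print(plistlib.loads(encoded))
```

Reading the attribute back and passing it through plistlib.loads should return the original list, which makes a round trip easy to check.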

This looks like a very useful platform on which to build the researcher's desktop repository; a few added user-centric applications for browsing and editing metadata, together with some software to synchronise the desktop repository with the institutional repository (something like Time Machine) and we would have a very powerful system indeed.

Now I really do have to get on with that marking!

Friday, 12 June 2009

More on the Desktop Repository

I've done some more experimentation on the Desktop Repository idea - strangely coinciding with another 100 exam scripts appearing on my desk to be marked.

Firstly, I've had a go at moving the PowerPoint image data back to an EPrints repository. Each slideshow appears as a separate eprint record, with each of the individual slide images appearing as a separate subdocument, with its own metadata (title/caption etc). A document search allows individual slides to be selected on a specific topic from across all the slideshows. They can then be viewed or exported, and my previous comments about creating new slideshows apply as before.

Secondly, I've been thinking about how to manage individual slides out of the context of the PowerPoint slideshow wrapper that they were created in. Either a new document format has to be created, or I just use a singleton slideshow object (i.e. a PPTX file with just one slide in it). I think that the latter will be easier to handle, because the problem of how to discriminate between an n-slide slideshow and a 1-slide slideshow is easier to solve than the problem of how to manage a whole new document format!
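Telling the two apart could be as simple as counting the slide parts inside the PPTX package, which is a zip file. The sketch below assumes the usual ppt/slides/slideN.xml naming; a more careful version would read the slide list in ppt/presentation.xml instead. The function names are invented, and the fake package exists only to make the example self-contained.

```python
import io
import re
import zipfile

SLIDE_PART = re.compile(r"ppt/slides/slide\d+\.xml")


def slide_count(pptx):
    """Count the slide parts in a PPTX package (given as a path or file object)."""
    with zipfile.ZipFile(pptx) as package:
        return sum(1 for name in package.namelist() if SLIDE_PART.fullmatch(name))


def is_singleton(pptx):
    """True if the package holds exactly one slide."""
    return slide_count(pptx) == 1


def fake_package(slides):
    """Build an in-memory stand-in for a PPTX file (not a valid presentation)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        for number in range(1, slides + 1):
            package.writestr(f"ppt/slides/slide{number}.xml", "<sld/>")
        package.writestr("ppt/slides/_rels/slide1.xml.rels", "<Relationships/>")
    buffer.seek(0)
    return buffer


print(is_singleton(fake_package(1)), is_singleton(fake_package(3)))  # True False
```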

Thirdly, a colleague of mine (Dave Challis, the webmaster here at Southampton) is creating some software for manipulating Office Open XML files so that a repository (such as EPrints) can use PowerPoint packages much more easily. The aim is to have Perl and Java modules that will enable collections and sets of repository items to be easily rewritten as slideshows; and if those items are individual slides in the first place (see above) then the ability to conjure slides between slideshows is guaranteed.

This is all a bit of a step back from the truly desktop repository, but EPrints does give me a framework to deal with structured data and metadata. The desktop itself is great at dealing with files, but delegates all of the complexity of those files to applications. The file system has facilities for storing metadata (see the BSD xattr command), but very few command-line tools for managing and manipulating it. So I'll use EPrints to give me some experience with handling large collections of personal data, and then see how far I can push those capabilities back to the desktop.

Must dash, I have some marking to do.