Thursday, 17 December 2009

Starting out with Hypertext

Hypertext taught me that learning is really fulfilling; it's what helped me to understand algebra and generally gave me the confidence to succeed at school. And not a computer in sight!

I was about seven (it was 1971) when I discovered "Tutortext: Basic Mathematics" in the village library. It was a volume from an American series of popular education materials aimed at those who "wish to learn ... and yet have to teach themselves". The educational method was devised by Norman Crowder, from the Educational Science division of US Industries Incorporated - an outfit that in retrospect sounds like a front for James Bond!

And well it might have been, because its unusual style seemed just as futuristic as one of Q's gadgets. Each chapter started with a page of explanation and a question with multiple choice answers leading to other pages; some of them explained where you had gone wrong, and one of them congratulated you on your progress and took you on to the next step. It all seems rather pedestrian now with our history of computer-assisted learning and personalised and adaptive hypermedia, but to me at the time it was just magic. It forced me to engage with the problems and consider my solutions and to seek the praise that the text meted out!

I have just managed to track down a copy of the book (from an Amazon reseller!) and it has provided me with a tremendous dose of nostalgia. Still, I might just give this book (copyright dating from the year before I was born) to my youngest daughter to see if it helps her.

Monday, 30 November 2009

HyperCard is Dead. Long Live HyperCard!

I cut my professional teeth on HyperCard, writing VideoDisk XCMDs to allow a HyperCard stack to control a video presentation, back in 1986. Although I was a UNIX system programmer (cut me and I bleed regexp), it was Apple's HyperCard which best let me manipulate data for users.

And now it's back in the form of TileStack, a kind of re-imagination of HyperCard for a Web 2.0 environment. There have been other contenders (e.g. Runtime Revolution) but they didn't have proper integration with the Web. Now I can write stacks (in a HyperTalk-like language) that use AJAX Web Services - XML, JSON, the lot. I'm as happy as Larry!

The following embedded stack uses an idle handler to periodically make a Flickr API call and then set the icon of button n to media of item 1 of the items of JSONdata. I'd forgotten how simple this stuff was - come back Bill Atkinson, the Web needs you!
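For anyone who doesn't speak HyperTalk, the handler's logic amounts to "poll a feed, take the first item's media URL, use it as the icon". Here is a rough Python sketch of that step, run against a canned response; the field names (items, media, m) mirror the shape of Flickr's JSON feeds but should be treated as illustrative rather than the exact API schema.

```python
import json

# A canned response shaped like the JSON feed the stack polls; the field
# names (items, media, m) mirror Flickr's feed layout but are illustrative.
sample_response = '''
{"items": [
    {"title": "First photo",  "media": {"m": "http://example.org/photo1.jpg"}},
    {"title": "Second photo", "media": {"m": "http://example.org/photo2.jpg"}}
]}
'''

def icon_url_from_feed(feed_text):
    """Mimic 'set the icon of button n to media of item 1 of the items
    of JSONdata': return the media URL of the first item in the feed."""
    data = json.loads(feed_text)
    return data["items"][0]["media"]["m"]

print(icon_url_from_feed(sample_response))  # -> http://example.org/photo1.jpg
```

In the live stack, an idle handler would re-fetch the feed every few seconds and push the result into the button's icon property.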

Monday, 23 November 2009

Evaluating Expertise Promotion

I thought I'd look further at the issue of effective communication of research impact and expertise. The University of Southampton Communications team issued a press release on the subject of "Brain-Computer Interfacing" earlier this term - obviously because they believe that we as an institution are good at it and that we have something to promote.

I thought I'd take a look at how effective our communication on the subject is, and as you can probably guess, this equates to how high up the Google rankings we feature compared to other universities. This is a pretty good measure of the effectiveness of our research expertise promotion, because anyone who wants to find an expert on a topic is going to start by looking on Google. I knocked together some scripts to look at how our institution fares in the competition for Google eyeballs (basic web analytics).

The screendump on the left shows the results of a Google query for "Brain-Computer Interfacing". All the results from universities are coloured in red, all those from publishers in green, those from news sources, magazines and blogs in blue, and unclassified resources are grey. You can quickly see that there are a couple of university-contributed results right at the top, and then an increasing number further down. Those results with a silver background are from Southampton (yay!)
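The colour-coding is just pattern-matching on each result's hostname. A minimal Python sketch of that classification step (the hint lists here are my own invented examples, not the ones the script actually uses):

```python
from urllib.parse import urlparse

# Rough domain-based classification, similar in spirit to the colouring in
# the annotated results page; the pattern lists are illustrative only.
UNIVERSITY_HINTS = (".ac.uk", ".edu", "uni-", ".tugraz.at")
PUBLISHER_HINTS = ("sciencemag.org", "springer", "elsevier", "ieee.org")
NEWS_HINTS = ("bbc.co.uk", "blogspot", "wordpress", "newscientist")

def classify(url):
    host = urlparse(url).hostname or ""
    if any(h in host for h in UNIVERSITY_HINTS):
        return "university"      # red in the screendump
    if any(h in host for h in PUBLISHER_HINTS):
        return "publisher"       # green
    if any(h in host for h in NEWS_HINTS):
        return "news/blog"       # blue
    return "unclassified"        # grey

results = [
    "http://www.ecs.soton.ac.uk/research/bci",
    "http://www.sciencemag.org/content/brief",
    "http://news.bbc.co.uk/technology/story",
    "http://example.com/somewhere",
]
for url in results:
    print(url, "->", classify(url))
```

A real classifier needs a much longer (and curated) hint list, but the principle is no deeper than this.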

Clearly, Google looks for resources which are about "Brain-Computer Interfacing" and then ranks them according to their "impact" or "importance". Exactly how it does that (PageRank or Black Magic) isn't really my concern here; however it happens, Google controls the order in which these results are presented, and the effect is that if you appear near the top you are more likely to get visited. The script that I use to generate these annotated pages actually gets 500 results, but most people get 10 results per page and don't bother to ask for more than a single page.

To better compare institutions' effectiveness at promoting their research expertise, I distilled this page to a spreadsheet. Once it's in a data form, then the sky's the limit and I can visualise it in different ways (such as this map of global expertise in the area).

The spreadsheet (reproduced below as a table of institutions) shows me that Southampton comes out rather well in the area - we are the third institution named, after Oxford and Groningen, and slightly further down the list come a couple of papers from our Institutional Repository. This seems to be a good result - we can claim to be doing alright on this topic. But what about all the other areas in which we think we have some expertise? There are hundreds or thousands of keywords that we need to analyse to see the effectiveness of our communications overall. I think it's time to scale up my scripts to get a bigger picture!

Institution (Country), server: result title
Computer Interfacing Project Thoughts - A Brain-Computer Interfacing Project at RuG
Southampton BCI Research Programme | Brain Computer Interfacing and Assistive Technologies Team
Lausanne (Switzerland), infoscience.epfl.ch: Anticipation Based Brain-Computer Interfacing (aBCI) - Infoscience
U Twente (Netherlands), eprints.eemcs.utwente.nl: EEMCS EPrints Service - 11091 Brain-Computer Interfacing for ... Computer Music Interfacing Demo Computer Interfacing - Systems & Control Engineering ...
Carnegie Mellon (US), www-2.cs.cmu.edu: Classifying Single Trial EEG: Towards Brain Computer Interfacing Soton - Brain-computer interfacing in rehabilitation tasks for driving a brain computer interfacing system: a ... Computer Interfacing
Rhode Island (US), www.ele.uri.edu: Brain - Computer Interfacing Mason P. Wilson IV URI department of ... Interfacing in Tetraplegic Patients with High ...
Colorado (US), www.cs.colostate.edu: Temporal and Spatial Complexity measures for EEG-based Brain ...
UC San "Gerwin Schalk, Ph.D."
Washington (US), www.cs.washington.edu: Dynamic Bayesian Networks for Brain-Computer Interfaces Nazarpour Home Page Note on Brain Actuated Spelling with the Berlin Brain-Computer ...
Uni Saarland (Germany), psydok.sulb.uni-saarland.de: PsyDok - Brain Computer Interfaces for Communication in Paralysis ... Selection and Classification in Brain Computer Interfaces ...
U Freiburg (Germany), www.bmi.uni-freiburg.de: BMII: Ferran Galan
Kansas City (US), www.csee.umkc.edu: CIBIT Laboratory PowerPoint - LeslieSmith Southampton Brain-Computer Interfacing Research Programme - Aims
Brown (US), www.cs.brown.edu: Michael J. Black: Neural Prosthesis Research Projects
U fur Medizinische Psychologie und Verhaltensneurobiologie ... Speakers
North Florida (US), www.unf.edu: UNF Webpage "Presence: Research Encompassing Sensory Enhancement, Neuroscience ..."
Washington (US), www.cs.washington.edu: Pradeep Shenoy University > School of Engineering > Research Groups > PhD ...
Carnegie Mellon (US), www.cs.cmu.edu: Automated EEG feature selection for brain computer interfaces ... the Group - ISEL - Intelligent Systems Engineering Laboratory
Tufts (US), www.cs.tufts.edu: COMP 250-BCI Syllabus Publications of a Robot as Embodied Interface for Brain Computer ... Leslie S. Smith: Research Home Page
New Jersey (US), embc2006.njit.edu: Blind Source Separation in single-channel EEG analysis: An ... Papers
Bielefeld (Germany), ni.www.techfak.uni-bielefeld.de: Publications | Neuroinformatics Group
Washington (US), ese.wustl.edu: EIT - Sample Web Page Template
Colorado (US), www.cs.colostate.edu: EEG Pattern Analysis Group @ Columbia University Curran - Kent Law School - University of Kent
Georgia State (US), www.cis.gsu.edu: EEG-based communication: a pattern recognition approach ...
TU Graz (Austria), bci.tugraz.at: Publications - Laboratory of Brain-Computer Interfaces
Northeastern (US), nuweb1.neu.edu: News University > School of Engineering > Contacts and People ... Music Research Articles Year Undergraduate Projects 2008-2009
UC San "IN RECENT years, brain-computer interface (BCI) systems" 1 - Faculty of Social Sciences
Carnegie Mellon (US), www.cs.cmu.edu: Linear and nonlinear methods for brain-computer interfaces ...
New Jersey (US), embc2006.njit.edu: On-line Differentiation Of Neuroelectric Activities: Algorithms ...
Bielefeld (Germany), bieson.ub.uni-bielefeld.de: BieSOn - P300-based brain-computer interfacing
TU Graz (Austria), hci.tugraz.at: Publication list of Alois Schloegl
Uni Saarland (Germany), psydok.sulb.uni-saarland.de: Brain Computer Interfaces for Communication in Paralysis - Eingang ... Publications
Brown (US), www.cs.brown.edu: Michael J. Black: Neural Prosthesis Research Projects Subspace Analysis
Colorado (US), www.math.colostate.edu: Curriculum Vitae: Michael Kirby (Professor) Co-Director Pattern ...'s Homepage : Resume | VGandhi
TU "Seminar Computational Intelligence E, SS 2007"
North Carolina (US), catalog.lib.ncsu.edu: NCSU Libraries - Toward brain-computer interfacing / edited by ...

PS It did occur to me after I had published an earlier draft of this post that I should also have checked the results for "Brain-Computer Interface". It turns out we come further down the league table for this variation on the phrase (7th institutional position rather than 3rd), but that this phrase is much less commonly used. As long as potential funders, students and media researchers know which phrase to use we should be alright. Otherwise, we will have to become a bit more canny about our use of synonyms. (I'm not sure whether it's significant that an eBay sponsored link appears only on "Brain-Computer Interface"!)

Monday, 16 November 2009

Life is a Conference (Oh Chum)

Since EPrints has now celebrated its 10th birthday** I have been chewing over where this decade of repository activity is leading us. Bigger repositories? More repositories? Faster repositories? Better repositories? Well, yes to all the above, but collecting, curating and sharing data/documents seems to be only part of the picture.

At the same time, I have become a director of the Web Science Doctoral Training Centre at the University of Southampton. Its five year mission (no, really, it's there in the EPSRC grant letter) is to build up a cohort of interdisciplinary scientists who can understand the impact of the Web on our society - its economic activity, political exchange, social interactions, scientific knowledge transfer - and predict the future benefits and downsides of different kinds of Web technology.

For the last few days I have been trying to pull some of these pieces together: the Web, the Social Web, the Data Web, repositories, open access and open science. In recent years, the community has built a Web infrastructure for e-research that handles research outputs, research data, research process and workflows. This infrastructure has many desirable properties - it is dynamic and persistent and supports managed curation and auditable provenance.

One thing I believe is missing from the picture at the moment is research people, research careers and research meetings: the human-oriented research activities, rather than research artefacts and research experiments. Research, after all, isn't just about individual scientists turning dials on a piece of laboratory equipment, but about many individuals debating and evaluating their ideas in scientific discourse and scientific debate. Part of that discourse and debate happens through journal publications, but much of it happens in conferences and workshops, through face-to-face interactions. The proceedings of these meetings become part of the literature, and so part of the personal, dynamic, face-to-face engagement is captured for posterity; but the questions and answers, the ad hoc discussions, the birds-of-a-feather sessions and the arguments over dinner - all the normal human interactions that generate inspiration as well as larger-scale knowledge transfer - have not been captured.

Except that they are starting to be exposed beyond the boundaries of the conference meetings by microblogging services. The low barrier to communication afforded by a Twitter client on a smart phone means that ideas, controversies and emerging consensuses are broadcast beyond the immediately present delegates in a meeting. These communications are not edited, published and catalogued for posterity (and are only searchable for a short time), but they do (potentially) increase the efficiency of the meeting.

A decade ago, the only way to facilitate social networking in the research world was by face-to-face meetings: flying hundreds or thousands of people half-way across the world for a week in order to be able to talk to each other (or perhaps even to listen to each other). This is still the way much research business is conducted, despite our growing awareness of the environmental consequences of our conferences. There must be a better way.

Twitter, blogs, web, phone, email, papers, workshops, meetings, projects, texts (SMSes) are all ways of mediating engagement between knowledge generating people. In point of fact, conferences are not very efficient engagement mechanisms - most sessions are full of people doing email. Virtual conferences (whether held in Second Life or a rather more prosaic video/audio conferencing environment) also have shortcomings in fostering participation and engagement.

We need to redesign our social interactions to make them more pro-human, pro-diary, pro-budget and pro-environment. We need to use technology not to ape our large-scale face-to-face meetings (using enormous video walls of dozens and hundreds of virtual delegates), but to support us as we try to achieve scientific debate and argument with loose synchronisation across a dynamic community of individuals spread over a number of years.

It's ironic that a university is supposed to be a community of researchers, but none of us know what our neighbours do until we accidentally meet them at a conference on the other side of the world. This is no longer acceptable as the importance of interdisciplinary research increases! Let's instead use technology to improve the social transfer of our knowledge capital with our international research community and our institutional research community too!

I believe that's where our infrastructure needs to grow - supporting our research engagement as well as managing our research artefacts.

** Technically, it is 10 years since Stevan proposed EPrints at the first OAI meeting in Santa Fe at the end of October 1999. We will have a more tangible anniversary in June 2010, celebrating 10 years since the first release of the software at the second OAI meeting.

Friday, 13 November 2009

"Getting" Twitter

Recently I sat in on a demonstration of Twitter to a University research group that included our PVC for research. Because of his presence I was quite self-conscious about justifying the Web tools I normally take for granted, and although the demo itself was fine, it didn't seem to answer the question "is this really useful or just some gratuitous teenager technology?" I have always claimed that Twitter is a fantastic tool for keeping up to date with the spread of ideas and debate in the community - lots of micro-comments keep me in the loop about which speakers have raised what issues at which conferences, even when I can't travel and engage directly. However, I have been worried recently that the Twitter output that I see has been less technical/academic/professional and more personal/informal/gossipy. So I thought I would do a quick investigation to see if there is any evidence to support my positive experience of Twitter. I chose to look at Twitter activity surrounding CETIS 2009, as several of my Twitter contacts had mentioned it in the run-up to the conference.

The CETIS conference is run by the JISC Centre For Educational Technology and Interoperability Standards, and attracts many people from the e-learning community. It took place at Aston on 10th and 11th November 2009, attracting 146 delegates according to the open list on the conference website.

Over the period that the conference had been mentioned (from the afternoon of Nov 5th up till midnight on Nov 12th) 566 tweets were sent by 89 separate contributors. The large majority of these (440, 78%) were sent during the conference sessions, with 255 (45%) on the first day and 185 (33%) on the second day. Outside the conference hours, 32 (5%) were sent in the break between the two days of the conference, 57 (10%) were sent before the start of the conference and 37 (7%) were sent after the end of the last session.

How many of these tweets are merely "backchat" or "electronic gossip", and how many of them are broadcasting helpful information? I used the Twitter API to download all the tweets and individually categorised them as "informational" or not. An informational tweet contains some information about the conference that is useful to an external viewer (a non-delegate such as myself): it may contain a quote from a speaker, a URL to a relevant resource, or a brief micro-summary of an issue raised. By contrast, a non-informational tweet might be a complaint about the wireless network, a comment about the quality of the food or a message of thanks to the organisers. This categorisation requires some judgement on my behalf, but the criteria are reasonably straightforward and repeatable.

The distribution of tweets over time can be seen in the following figure (click to see a bigger version), which also shows how the number of "informational" tweets (red) compares to the total number of tweets (blue). In total, 324 tweets (57%) were in the informational category.

During the conference sessions, the informational tweets account for most of the Twitter activity (307, 70%). In other words, the effort expended in twittering during conference sessions is not wasted effort that distracts from engagement with the conference agenda; it is mainly of value to an outside observer - which I would claim extends the impact and influence of the conference beyond the cohort of local delegates. Of course, this works best if the tweets can refer (and link) to a rich set of online resources to direct observers to.

Back to my obsession with showing that Twitter isn't just an electronic stream of gossip - the figure on the left shows how people break down into different Twitter categories: those who only twitter useful information (or did on this occasion), those who never twitter useful information (not useful to me, anyway) and those who mainly or partly twitter useful information (those whose information ratings were more or less than 50%).
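The bucketing behind this breakdown is simple arithmetic on the hand-labelled tweets. A toy Python sketch (the authors, tweet counts and flags are all invented; the real data came from the Twitter API):

```python
from collections import defaultdict

# Each tweet is (author, informational?), where the flag records a manual
# judgement. Authors and flags here are invented for illustration.
tweets = [
    ("alice", True), ("alice", True), ("alice", False),
    ("bob", False), ("bob", False),
    ("carol", True), ("carol", True),
]

per_author = defaultdict(lambda: [0, 0])   # author -> [informational, total]
for author, useful in tweets:
    per_author[author][1] += 1
    if useful:
        per_author[author][0] += 1

def category(useful, total):
    """Assign a twitterer to only/never/mainly-information/mainly-comment."""
    rate = useful / total
    if rate == 1.0:
        return "only information"
    if rate == 0.0:
        return "never information"
    return "mainly information" if rate > 0.5 else "mainly comment"

for author, (useful, total) in sorted(per_author.items()):
    print(author, category(useful, total))
```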

But those who stick strictly to the facts don't provide the biggest chunk of information. The figure on the left shows the contribution of the various groups of twitterers to the total information content of the tweets: most of the useful Twitter information is provided by people who mix "information" and "comment".

So perhaps I shouldn't get too worried by the criticism that Twitter is full of people telling us what they had for breakfast and what happened on their trip to work. Perhaps it is precisely those kinds of people who are most likely to let us all know which key themes are emerging from that high-profile conference that we couldn't attend.

Friday, 4 September 2009

Taking Communication Seriously

Today, the BBC News website put a nice research story on its front page: "Quantum computer slips onto chips". I followed it up because it is intrinsically interesting to me (quantum computing), but I am mentioning it here because it is an example of how the academic community fails to tell its story, and how the many contributors to an institution's web presence can fail to produce a coherent information resource.

The BBC Web page is based on a brief article in the current issue of Science, written by a researcher at Bristol University. The BBC page links to the front pages of Science and Bristol University, and it mentions the author's name in the full text. It doesn't link to the researcher's home page, mention the department or research group where this work happened, or mention any of the other authors.

The Science homepage doesn't mention this article, so an interested person (potential benefactor) has to click on the link to "Current Issue" and then search through the rather long table of contents to look for the author's name.

The Bristol homepage has a rather prominent link to its press release about the work (hoorah for the Bristol Marketing Unit!). It also links to the Science homepage, rather than directly to the article, but it does link to the department where the work was undertaken. Unfortunately, that departmental page doesn't mention the research directly, and strangely spends most of its content talking about the new buildings that it occupies rather than the research that it performs. Hmm.

Returning to the Bristol home page, clicking on the "Contacting people" link allows me to search for the author's surname. This takes me (indirectly, through two further links) to the author's home page, which lists his contact details, some currently funded projects (unlinked) and a metalist of "Selected publications", i.e. query links to four external digital libraries. There is a prominent link to his research group page, which totally fails to mention the research that took me there in the first place.

So I haven't found out any more about the research and the work that the research team is undertaking. I am going to have to take the old fashioned route and email or even phone the corresponding author and ask him my questions.

BTW, the author has no publications deposited in the Bristol Repository, but he is a physicist depositing in arXiv, so I shall be alright if my questions are entirely academic and answerable from the literature. What exercises me is the enquirers who aren't academics - potential students, funders, benefactors, industrial contacts, journalists - all of those whom we are looking to impact with our work.

BTW (2): I am not looking to bash Bristol about this. Exactly the same is true of Southampton and of my own department. It's difficult to get right unless everyone involved (especially the academics) is aware of the problem.

Thursday, 2 July 2009

Institutional Visualisation

I've been working on the problem of showing the spread of research on a particular topic across the institution. The aim is to enable the repository to show the contribution of the various schools, groups and individuals in areas of strategic interest, and to allow the repository to play an active part in research management.

There are many standard techniques for plotting the magnitude of the contributions of individual authors, the relationships between co-authors (social networks) and the patterns of co-operation between departments. Many of these visualisations are in the form of networks of nodes and arcs, produced by sophisticated layout algorithms which are difficult to control and difficult to interpret.

What I need to show my managers is a simple diagram that allows them to see the familiar structure of the university together with the dynamic and changing nature of the contributions in question. The image on the right shows an example of one such diagram that I am trying out. It shows the different layers of the university (the mega-faculties in the centre, the 21 schools in the middle layer, and the various research groups as small stamps in the outermost layer). This diagram actually shows the relative research contribution of different schools and research groups to the topic "Renewable Energy", where dark colours mean more relevant outputs in the repository. (For the curious amongst you, "FESM" is the Faculty of Engineering, Science and Maths, so it is hardly surprising that it has the lion's share of contribution to the topic. But the value of the diagram is in its ability to show up activity where we hadn't expected it - in this case in the School of Biological Sciences.)

What surprised me was that I had to create this diagram by myself. There are no models, maps or diagrams of our institutional structure - even the so-called "org chart" is just a table in a Word document. Looking around other universities, I can't see any charts or diagrams that are meant to act as a model of the organisation. I can't believe that they don't exist. Can anyone point me to some?

(Technical background: I created the basic diagram in Excel using an "Exploded Doughnut" chart. I then saved that to PDF, imported the PDF into Illustrator and exported that into SVG, where I added some JavaScript to allow the diagram to shade itself according to the list of schools and research groups passed in as a CGI parameter. A repository export plugin passes the organisational affiliation data from a set of eprints to the SVG diagram.)
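The shading step in the embedded JavaScript boils down to mapping each group's output count onto a grey-scale fill colour, dark for more outputs. This Python sketch shows one such mapping and prints the kind of assignments the SVG script makes; the group names, counts and element ids are invented for illustration.

```python
# Map each group's repository output count to a grey fill; dark = more
# outputs. Groups and counts below are invented for illustration.
counts = {"FESM": 40, "Biological Sciences": 12, "Law": 0}

def fill_colour(count, max_count):
    """Linear scale from white (0) to near-black (max_count)."""
    if max_count == 0:
        return "#ffffff"
    level = 255 - int(215 * count / max_count)   # keep a little off pure black
    return f"#{level:02x}{level:02x}{level:02x}"

top = max(counts.values())
for group, n in counts.items():
    # The printed lines mimic what the SVG's JavaScript would execute,
    # assuming each group's segment has an id matching its name.
    print(f'document.getElementById("{group}").style.fill = "{fill_colour(n, top)}";')
```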

Friday, 26 June 2009

Hardworking Repositories: The Global Picture

To round off the picture of hardworking repositories (i.e. repositories which receive regular daily deposits), here are the global top ten repositories, listed with the number of days in the last year on which deposits were made. The data is obtained from the Registry of Open Access Repositories (ROAR).

ORBi (University of Liege, Belgium): 311 days
IR of the University of Groningen (Netherlands): 301 days
KAR - Kent Academic Repository (UK): 286 days
University of Southampton: School of Electronics and Computer Science
UBC cIRcle (University of British Columbia, Canada): 269 days
LSE Research Online (London School of Economics, UK): 260 days
EEMCS EPrints Service (School of Electronics and Computer Science, University of Twente, Netherlands)
LUP: Lund University Publications (Sweden): 259 days
UPSpace at the University of Pretoria (South Africa): 257 days
University of Tilburg (Netherlands): 256 days

There are all sorts of caveats attached to this list! Firstly, I removed two entries because they were not "institutional" but "national" in scope. Secondly, I left in two "departmental" repositories (ECS and EEMCS) because - dammit, if a department can achieve regular deposits then so should a whole institution! Thirdly, this table depends on OAI harvested data from ROAR - if there are any problems with the OAI feed then it will affect the analysis. And perhaps most importantly, this table does not take into account the types of deposit that were made on the days in question. They could be research articles, research data, teaching material, holiday photographs, or bibliographic records sans open access full text. So for example, the UBC repository is mainly composed of student theses and dissertations.

As I have said in the last two postings in this blog, this list simply reflects how much deposit usage the repository is getting on a daily basis and it deliberately factors out the number of deposits in order to smooth over the effect of batch imports from external data sources. The emphasis is on finding a simple metric to highlight embedded usage of a repository across a whole institution.

Wednesday, 24 June 2009

Hardworking Repositories: Comparing UK & US

To go with the list of UK repositories, here are the top 10 most hardworking US repositories, based on the number of days of deposit activity that they achieved in the last year according to ROAR.

RIT Digital Media Library: 253 days
Georgia Tech's Institutional Repository: SMARTech: 252 days
ScholarSpace at University of Hawaii at Manoa: 248 days
NITLE DSpace Service: Middlebury College: 245 days
Trinity University: 239 days
AgSpace: Home: 234 days
Florida State University D-Scholarship Repository: 231 days
DigitalCommons@Florida Atlantic University: 230 days

Once again, congratulations to those on the list. The methodology for drawing up this list was deliberately devised to promote daily engagement rather than numbers of deposits, in order to try and factor out bulk imports from external data services.

(I am slightly hesitant about publishing this list, because I am less familiar with the US repository scene than with that in the UK. That means that I have difficulties in sanity-checking the list - in particular, the Middlebury College/Trinity services seem to be registered with the same host, even though their front ends are delivered from different host names. Do they genuinely count as separate repositories?)

These two lists (US/UK) do show some apparent differences in practice. If the headline numbers (days on which deposits are made) are subdivided into three categories (few deposits 1-9, medium 10-99 and high 100+), then it appears that the UK repositories are dominated by medium-deposit days, and the US repositories by few-deposit days.
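The subdivision itself is just a banding of each repository's daily deposit counts. A minimal Python sketch (treating the bands as 1-9, 10-99 and 100+; the daily counts below are invented):

```python
# Count, for each repository, how many of its deposit days fall into each
# band of daily deposit volume. The daily counts below are invented.
BANDS = [("few (1-9)", 1, 9), ("medium (10-99)", 10, 99), ("high (100+)", 100, None)]

def band_profile(daily_deposit_counts):
    profile = {name: 0 for name, _, _ in BANDS}
    for n in daily_deposit_counts:
        for name, lo, hi in BANDS:
            if n >= lo and (hi is None or n <= hi):
                profile[name] += 1
    return profile

uk_style = [12, 30, 45, 8, 22]      # mostly medium-sized deposit days
us_style = [2, 3, 1, 150, 4]        # mostly few, with one bulk-import day
print(band_profile(uk_style))
print(band_profile(us_style))
```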

Is this difference significant? Is it an artefact of the workflows and processes of the repository software platforms (the UK table is dominated by EPrints, the US table by DSpace)? Is it due to the different sizes of the host institutions? Or does it show a genuine difference in practice in terms of individual self-archiving vs proxy deposit? There needs to be some more analysis.

Tuesday, 23 June 2009

Hard Working Repositories

There are lots of ways to measure the productivity of a repository, but in Size Isn't Everything: Sustainable Repositories as Evidenced by Sustainable Deposit Profiles I argued for counting the number of days per year on which deposits had been made into the repository as a way of capturing its 'vitality' and 'embeddedness', and so highlighting repositories with broad-based researcher adoption.
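Computed from an OAI-PMH harvest, the metric is simply the number of distinct dates among the records' datestamps. A minimal sketch (the datestamps here are invented):

```python
from datetime import date

# The 'deposit days' metric: count the distinct days on which at least one
# deposit was made, from a list of OAI-PMH style datestamps (invented here).
datestamps = [
    "2009-03-02T09:15:00Z", "2009-03-02T16:40:00Z",   # two deposits, one day
    "2009-03-03T11:05:00Z",
    "2009-06-17T08:30:00Z",
]

def deposit_days(stamps):
    """Number of distinct calendar days with at least one deposit."""
    return len({date.fromisoformat(s[:10]) for s in stamps})

print(deposit_days(datestamps))   # -> 3
```

Because only the count of distinct days matters, a bulk import of a thousand records on one afternoon scores the same as a single deposit.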

Based on that metric, here is a top 10 list of the hardest working institutional repositories in the UK (data taken from ROAR).
If you factor out weekends, Christmas/Easter breaks and other public holidays there are about 233 days that a UK University is open for business. So congratulations particularly to Kent, the LSE, and my colleagues in the library at Southampton whose repositories are working unpaid overtime!

Friday, 19 June 2009

Getting Metadata from the Semantic Desktop

In my last post, I discussed the metadata infrastructure that underpins the Macintosh desktop environment. In addition, thanks to some handholding from Chris Gutteridge, I've just configured the built-in Web server to serve up the documents themselves, or metadata about those documents (in RDF, generated dynamically from the mdls command).
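The dynamic RDF generation starts by scraping mdls' "key = value" output into a dictionary. Here's a rough Python sketch of that first step, run against a canned sample of mdls-style output (the file and values are invented, and a real converter would go on to serialise the dictionary as RDF):

```python
# Turn `mdls`-style output into a metadata dictionary, ready for
# serialisation as RDF. The sample text mimics mdls' "key = value" layout;
# the values themselves are invented.
sample_mdls_output = '''\
kMDItemAuthors         = (
    "Les Carr"
)
kMDItemTitle           = "Repository Slides"
kMDItemContentType     = "com.microsoft.powerpoint.ppt"
'''

def parse_mdls(text):
    metadata = {}
    key = None
    for line in text.splitlines():
        if "=" in line and line.split("=")[0].strip().startswith("kMDItem"):
            key, _, value = line.partition("=")
            key, value = key.strip(), value.strip()
            if value == "(":               # a multi-line list value follows
                metadata[key] = []
            else:
                metadata[key] = value.strip('"')
        elif key is not None and isinstance(metadata.get(key), list):
            item = line.strip().rstrip(",").strip('"')
            if item != ")":                # ")" closes the list
                metadata[key].append(item)
    return metadata

print(parse_mdls(sample_mdls_output))
```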

I've now got a pseudo repository on the desktop that contains all the source versions of my PowerPoint and Office documents, together with metadata about them. There are visualisation, search and editing services provided by the desktop and a Web dissemination system cobbled on the side.

I've also got a real repository on a server that contains the source and preprocessed versions of my Powerpoint documents, together with some metadata about them. There are visualisation, search and (planned) editing services provided by the repository's Web dissemination system.

Can these two work efficiently together, so that the conjunction of the desktop and the repository are greater than the sum of the components? Or is this just an exercise in reinventing the wheel just to make a point? I hope the former...

Sunday, 14 June 2009

The Desktop Repository that's Already There

It's really time I acknowledged Peter Sefton, who's doing a lot of work on Powerpoint and slide bursting for the Fascinator Desktop, part of his project to bring open HTML formats to the desktop. Peter visited Southampton earlier this year, and inspired me on the topic. I'd just got knocked back on a JISC proposal for looking at repository-desktop integration, so it was great to talk to someone else who wanted to do something in the area. We both seem to be goading each other on at the moment and we've been tweeting and emailing each other, but I've not given him his due credit in this blog so far.

I've been surprised to see how much of the infrastructure for a desktop repository is already in place in the operating system that he and I use (Mac OS X). The Mac already has a process that extracts metadata and data contents from each file into a central database (see mds(8) in the Unix manual pages); this process is alerted to update the database every time a new file is created or an old file is changed. There is an interface for querying the database (Spotlight), either looking just for matches of the contents, or for complex boolean queries based on the metadata and contents. There is also a sophisticated framework for generating and caching previews and thumbnails (QuickLook). A system that provides data and metadata handling in a centralised database with querying and visualisation facilities all sounds very repository-like to me. And in case you think that I'm overegging this pudding, here's a list of some of the common metadata that OS X will allow you to query (not including media-specific metadata):

Audiences: The intended audience of the file.
Authors: The authors of the document.
City: The document’s city of origin.
Comment: Comments regarding the document.
ContactKeywords: A list of contacts associated with the document.
ContentCreationDate: The document’s creation date.
ContentModificationDate: Last modification date of the document.
Contributors: Contributors to this document.
Copyright: The copyright owner.
Country: The document’s country of origin.
Coverage: The scope of the document, such as a geographical location or a period of time.
Creator: The application that created the document.
Description: A description of the document.
DueDate: Due date for the item represented by the document.
DurationSeconds: Duration (in seconds) of the document.
EmailAddresses: Email addresses associated with this document.
EncodingApplications: The name of the application (such as “Acrobat Distiller”) that was responsible for converting the document into its current form.
FinderComment: Any Finder comments for the document.
Fonts: Fonts used in the document.
Headline: A headline-style synopsis of the document.
InstantMessageAddresses: IM addresses/screen names associated with the document.
Instructions: Special instructions or warnings associated with this document.
Keywords: Keywords associated with the document.
Kind: The kind of document, such as “iCal Event”.
Languages: Language of the document.
LastUsedDate: The date and time the document was last opened.
NumberOfPages: Page count of this document.
Organizations: The organization that created the document.
PageHeight: Height of the document’s page layout in points.
PageWidth: Width of the document’s page layout in points.
PhoneNumbers: Phone numbers associated with the document.
Projects: Names of projects (other documents such as an iMovie project) that this document is associated with.
Publishers: The publisher of the document.
Recipients: The recipient of the document.
Rights: A link to the statement of rights (such as a Creative Commons or old-school copyright license) that governs the use of the document.
SecurityMethod: Encryption method used on the document.
StarRating: Rating of the document (as in the iTunes “star” rating).
StateOrProvince: The document’s state or province of origin.
Title: The title.
Version: The version number.
WhereFroms: Where the document came from, such as a URI or email address.

That's a pretty impressive list, and it is fully typed as well, so dates are dates and numbers are numeric, meaning that you can do proper range searches, not just text matches. Still, the Mac implementation has enough limitations to mean that we haven't yet thought of it as a repository:
  1. It's a proprietary system: you can't access the thumbnails or export the metadata.
  2. There isn't any way of manually entering or editing the metadata - it's all automatically extracted from the file contents by the ingesters/importers.
  3. There isn't any particularly useful way of displaying the metadata, apart from in the Finder's "Get Info" box or on the command line (using the mdls program).
Issues (1) and (3) just reduce to coding better applications. There are a number of Finder replacements, but none of them really take the metadata seriously. There are also a number of tagging applications that have emerged in the last year or so, but they use a very narrow range of metadata. Someone could add a faceted browser interface to the Finder, or integrate some more explicitly bibliographic metadata into the Apple infrastructure.
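To give a flavour of those typed range searches, here is a sketch of building the kind of boolean query string that Spotlight accepts (via mdfind on the command line). The kMDItem-prefixed attribute names and the `cd` / `$time` syntax follow my reading of Apple's query-expression documentation; the helper function itself is entirely hypothetical:

```python
# Hypothetical helper that assembles a Spotlight query string
# combining a text match with a typed date comparison.
def spotlight_query(author=None, created_after=None, kind=None):
    clauses = []
    if author:
        # trailing "cd" = case- and diacritic-insensitive match
        clauses.append(f'kMDItemAuthors == "*{author}*"cd')
    if created_after:
        clauses.append(f'kMDItemContentCreationDate >= $time.iso({created_after})')
    if kind:
        clauses.append(f'kMDItemKind == "{kind}"')
    return " && ".join(clauses)

q = spotlight_query(author="Carr", created_after="2009-01-01")
print(q)   # would be passed to: mdfind 'QUERY'
```

Because the attributes are typed, the date clause is a real chronological comparison rather than a string match.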

Further reading shows that issue (2) is also surmountable: extra metadata can be attached to a file through the use of the Mac filesystem's extended attributes. As well as the Title and Author information that the Microsoft Office importer produces, suitably-named extended attributes are inspected when the file is indexed. The value of such an attribute is an "OS X Property List value", i.e. a number, boolean, date, string or array, stored as binary or XML.
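Python's standard library can produce exactly this kind of property-list value via plistlib; a minimal round-trip sketch of encoding a keywords array as a binary plist (the blob is what you would then attach as the extended attribute's value):

```python
# Encode a keywords array as a binary property list and check that
# it decodes back to the same value. plistlib is in the standard library.
import plistlib

keywords = ["repository", "desktop", "metadata"]
blob = plistlib.dumps(keywords, fmt=plistlib.FMT_BINARY)
assert plistlib.loads(blob) == keywords
```

The same call with `fmt=plistlib.FMT_XML` gives the XML flavour mentioned above.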

This looks like a very useful platform on which to build the researcher's desktop repository; a few added user-centric applications for browsing and editing metadata, together with some software to synchronise the desktop repository with the institutional repository (something like Time Machine) and we would have a very powerful system indeed.

Now I really do have to get on with that marking!

Friday, 12 June 2009

More on the Desktop Repository

I've done some more experimentation on the Desktop Repository idea - strangely coinciding with another 100 exam scripts appearing on my desk to be marked.

Firstly, I've tried to have a go with moving the PowerPoint image data back to an EPrints repository. Each slideshow appears as a separate eprint record, with each of the individual slide images appearing as a separate subdocument, with its own metadata (title/caption etc). A document search allows individual slides to be selected on a specific topic from across all the slideshows. They can then be viewed or exported, and my previous comments about creating new slideshows apply as before.

Secondly, I've been thinking about how to manage individual slides out of the context of the PowerPoint slideshow wrapper that they were created in. Either a new document format has to be created, or I just use a singleton slideshow object (i.e. a PPTX file with just one slide in it). I think that the latter will be easier to handle, because the problem of how to discriminate between an n-slide slideshow and a 1-slide slideshow is easier to solve than the problem of how to manage a whole new document format!
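Since a PPTX package is just a zip archive, discriminating a singleton slideshow from an n-slide one is cheap: count the slide parts. A sketch, with an in-memory fake package standing in for a real file (a genuine PPTX has much more structure than this):

```python
# Count the ppt/slides/slideN.xml entries in a PPTX zip package.
import io
import re
import zipfile

def slide_count(pptx_bytes):
    """Number of slides in a PPTX package given as bytes."""
    with zipfile.ZipFile(io.BytesIO(pptx_bytes)) as z:
        return sum(1 for name in z.namelist()
                   if re.fullmatch(r"ppt/slides/slide\d+\.xml", name))

# Fake two-slide package, just enough to exercise the counter:
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("ppt/slides/slide1.xml", "<sld/>")
    z.writestr("ppt/slides/slide2.xml", "<sld/>")
    z.writestr("ppt/presentation.xml", "<pres/>")
print(slide_count(buf.getvalue()))  # → 2
```

A repository could use a test like this at ingest time to route singletons and full slideshows through different workflows.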

Thirdly, a colleague of mine (Dave Challis, the webmaster here at Southampton) is creating some software for manipulating Office Open XML files so that a repository (such as EPrints) can use PowerPoint packages much more easily. The aim is to have Perl and Java modules that will enable collections and sets of repository items to be easily rewritten as slideshows; and if those items are individual slides in the first place (see above) then the ability to conjure slides between slideshows is guaranteed.

This is all a bit of a step back from a truly desktop-based repository, but EPrints does give me a framework to deal with structured data and metadata. The desktop itself is great at dealing with files, but delegates all of the complexity of those files to applications. The file system has facilities for storing metadata (see the BSD xattr command), but very few commandline tools for managing and manipulating it. So I'll use EPrints to give me some experience with handling large collections of personal data, and then see how far I can push those capabilities back to the desktop.

Must dash, I have some marking to do.

Thursday, 11 June 2009

Special Issue of the New Review on Information Networking

The New Review on Information Networking seeks original manuscripts for a special issue on Repository Architectures, Infrastructures and Services to appear in Autumn 2009.

The aim of this issue is to further our understanding of how repositories are delivering services and capability to the scholarly and scientific community by marshalling resources at the institutional scale and delivering at the global scale.

Considerable progress in this area has been achieved under the "Open Access" banner and this special issue aims to explore the technical aspects of facilitating the scientific and scholarly commons: open access to research literature, research data, scholarly materials and teaching resources.

Topics for this special issue include (but are not limited to):

  • Repository architecture, infrastructure and services
  • Repositories supporting scholarly communications
  • Repositories supporting e-research and e-researchers
  • Integrating with publishing and publishing platforms
  • Repositories and research information systems
  • Integrating with other infrastructure platforms e.g. cloud, Web 2.0
  • Integrating with other data sources, linked data and the Semantic Web
  • Scaling repositories for extreme requirements
  • Computational services and interfaces across distributed repositories
  • Content & metadata standards
  • OAI services
  • Web services, Web 2.0 services, mashups
  • Social networking, annotation / tagging, personalization
  • Searching and information discovery
  • Reference, reuse, reanalysis, re-interpretation, and repurposing of content
  • Persistent and unambiguous citation and referencing for entities: individuals, institutions, data, learning objects
  • Repository metrics and bibliometrics: usage and impact of scholarly and scientific knowledge

Scope of the New Review on Information Networking

A huge number of reports have been published in recent years on the changing nature of users; on the changing nature of information; on the relevance of current organisational structures to generations apparently weaned on social networks. Reading this mass of literature, far less digesting it, then assimilating it into future strategy is a Sisyphean task, but one ideally suited to this journal. Individual services from Second Life to Twitter will no doubt wax and wane but we shall seek to publish those papers which address the fundamental underlying principles of the increasingly complex information landscape which organisations inhabit.

Important dates:

Submission of full paper: 31st July 2009

Notification deadline: 1st September 2009

Re-submission of revised papers: 15th September 2009

Publication: Autumn 2009

Submissions and Enquiries

Papers submitted to this special issue must not have been previously published or be currently submitted for journal publication elsewhere.

Submissions should ideally be in the range of 3,500 - 4,000 words.

Submissions and enquiries should be made by email to the editor of this special issue: Leslie Carr, University of Southampton, UK (

Tuesday, 9 June 2009

A Desktop Repository

You can tell that it's exam marking season, because I am obsessed by displacement activities. Further to my last post, I've managed to create a kind of pseudo-repository on my desktop (DeskSpace? EDesk? Deskora?).

iPhoto is managing collections of PowerPoint slides (actually 2549 slides from 109 slideshows which represents about 10% of the total number of slideshows on my laptop). Every slide is of course just an image of its original self (iPhoto is a photo application after all!) but courtesy of each image's embedded EXIF metadata I can search for slides that contained a particular phrase, regardless of the presentation in which they were originally stored. Then I can export that collection of individual images to an external program that uses the provenance metadata stored in the images to construct a new slideshow from the source components of the original PowerPoint files.

At the moment it's the kind of repository that Heath Robinson would sell you (a set of scripts more than a set of services :-), but I think that it ticks most of the boxes: there is an ingest procedure, collection management, browsing, searching, metadata, packaging formats and dissemination processes. And to accomplish some form of preservation I could even print all the slides into a very desirable coffee-table book or burn a DVD slideshow.

(The top image is a screendump from iPhoto showing slides from four presentations, the bottom image shows a new PowerPoint presentation made from slides containing the term "Open Access". The slides were identified in iPhoto but created from PowerPoint source files.)

This brings up some nice repository challenges:
  • managing packages and components simultaneously, even when the components can't have an independent existence. Slides can't exist outside a presentation in the same way that paragraphs can't exist outside a document or cells outside a spreadsheet.
  • visualising huge amounts of data. Being able to scroll through dozens of presentations at once is incredibly liberating, compared to opening them individually and watching PowerPoint draw the slide sorter previews v..e..r..y.....s..l..o..w...l....y at a choice of three sizes.
  • PowerPoint, like RSS, is a rather nice packaging format that could be used much more often by repositories. How about saving your search results as a PowerPoint presentation?

Tuesday, 2 June 2009

Managing PowerPoint? Repositories and the Office Desktop

It turns out that I have 1009 PowerPoint files on my laptop, and I don't know what most of them contain, let alone what I can reuse for any future presentations that I am planning.

I'd at least like an overview of all the slides in all those presentations, so that I can organise them. Then I'd like to compare all these slideshows, delete the duplicates, note the variations and evolutionary history between different versions of the same presentation, and between different presentations on the same subject. I'd like to trace the cross-pollination of slides between different subjects. Microsoft SharePoint has the concept of a Slide Library ("a secure, online repository in which PowerPoint presentations can be stored, worked on and shared") but expects you to do all the organisational work, whereas I want something that will help to apply some organisation.

Should I do this on my laptop? Or should I try to do this in (shudder) an environment that sells itself as providing content curation and management services? Oh all right then, I'll do it in a repository. But I don't think it's going to be easy - for a start we're talking about efficient user tools for ingesting, comparing, contrasting and refining 1,000 items.

Still, there's a basis to build from: SWORD and Microsoft Office Repository tools should help me to at least get all these items into the repository. Once we're there we can take stock of any low-hanging fruit (searching, reporting, cataloguing, thumbnail previews, exporting collections). I've already done some of the preparatory work on the laptop - using AppleScript to create preview images and textual contents of every slide of every presentation. Now I can package up all these things appropriately and see whether a repository actually gives me any added value.

Friday, 29 May 2009

Google Wave

There's an urgent need to develop preservation / e-research / e-learning / rights management strategies for Google Wave.

There. That's my bid for some inevitable digital library memes.

Wednesday, 27 May 2009

Don't ever stop adding to your body of work

I've just returned from the high octane, tech-frenzied social whirl that is Open Repositories 2009 (or #or09 to its delegates). It's a week full of diverse and diverging agendas (cloud this, desktop that, policy the-other) that make your head spin. There are new product announcements (EPrints 3.2 / DSpace 1.5 / Zentity) and new initiatives being explained (DuraSpace). And new demos of new features. It's normal to go to conferences to show off products that you've only just finished, hoping that the demos hang together. Now the Developer Challenge means that we're all there showing off things that we hadn't even started! It's mad, completely mad, and I wouldn't miss it for the world.

So I came back with a kind of tech-hangover - and spent a couple of days feeling the backlash response of "what does it all mean?" and "what is the point?" It's all very exciting, but are we actually going anywhere that we all want to be?

Surprisingly, the cure came in the form of a Presidential address reported in the Washington Post. Under the headline "Don't ever stop adding to your body of work" Barack Obama talked about the need to keep on contributing to a lifetime of achievement. I'm a sucker for a good metaphor, and I read this as a message to institutions and faculty about using a repository to reify their contribution to science and scholarship, to manifest their body of work. 
That is what building a body of work is all about - it's about the daily labor, the many individual acts, the choices large and small that add up to a lasting legacy. It's about not being satisfied with the latest achievement, the latest gold star - because one thing I know about a body of work is that it's never finished. It's cumulative; it deepens and expands with each day that you give your best, and give back, and contribute to the life of this nation. (Barack Obama delivering the commencement address at Arizona State University.)
This is what repositories are really about: making the abstract concrete and fleshing out CVs. Collecting evidence of intellectual creativity, supporting research activities and profiling the emergence of innovative individuals, collaborations and communities. Evidence that spans whole careers and beyond. 

This was also the message of David Schulenberger's closing keynote at the SPARC Digital Repositories meeting in November 2008: the job of the institutional repository is to tell the story of "what we've achieved" to its faculty and its institution's funders and supporters.

Back at home, this is why we keep doing what we're doing. Not just so that we can play with new development features, but so that we can get a job done: so that we can build the infrastructure of our institutional memory, tell our institutional story and provide a platform for our future institutional success.

That's me done. I'm back to hacking shell scripts and XML.

Friday, 22 May 2009

A Distilled Guide to EPrints v3.2

Having spent an entire morning talking about new EPrints features at OR09, I thought that it would be great to have a really (really) condensed version of the talk as a public guide to how EPrints is evolving. I spent my last day in Atlanta reducing the presentation to just 9 pages - if you don't include the title and acknowledgements. The result is a brief account of all the features that make EPrints a serious repository platform: effective data model, flexible storage options, choice of APIs, support for the researcher's tasks, and reporting usage, impact and research information.

I'll try and update this as v3.2 develops; please let me know what other information you would like to see!

Thanks for a great Open Repositories experience in Atlanta - see y'all again soon!

Friday, 15 May 2009

PhD studentship in Digital Rights and Digital Scholarship

EPrints Services are funding a PhD studentship in Digital Rights and Digital Scholarship at the EPSRC Web Science Doctoral Training Centre at the University of Southampton.

The Web has had a huge impact on society and on the scientific and scholarly communications process. As more attention is paid to new e-research and e-learning methodologies it is time to stand back and investigate how rights and responsibilities are understood when "copying", "publishing" and "syndicating" are fundamental activities of the interconnected digital world.

Applicants with a technical background (a good Bachelor's degree in Computer Science, Information Science, Information Technology or similar) are invited to apply for this 4-year research programme, which begins in October 2009 with a 1-year taught MSc in Web Science, followed by a three-year PhD supervised jointly by the School of Law and the School of Electronics and Computer Science. The full four-year scholarship (including stipend) is available to UK residents.

EPrints Services provide repository hosting, training and bespoke development for the research community and are funding this research opportunity to promote understanding of the context of the future scholarly environment.

Further information:
EPSRC Web Science Doctoral Training:
EPrints Services:
Enquiries should be addressed to Dr Leslie Carr ( in the first instance.

Thursday, 14 May 2009

Repositories and Research information

I've just spent three days in Athens at the euroCRIS meeting, discussing the relationship between repositories and Current Research Information Systems. The idea behind a CRIS (plural CRIS, not CRISes) is that it forms a cross-institutional information layer that aggregates information from the library (publications), human resources (personnel and organisational structure), finance department (projects and grants), estates management (facilities and equipment) and external sources (funding programmes, citation data), and so integrates at some level with the set of services provided by a repository.

The CRIS initiative comes out of an administrative background (starting in 1991) and so predates repositories and exists tangentially to them. A CRIS is typically concerned with repository metadata (how many papers? which publishers? written by whom?) but not its data contents. So my concern was that the repository should not be sidelined or marginalised, but instead the repository should be seen as a mature partner in the aggregate of information services provided across the institution. The experience gained in the UK's recent research assessment exercise (documented in Institutional Repository Checklist for Serving Institutional Management) has very clearly been that the library, through the repository, provides enormous experience in dealing with bibliographic information, ensuring quality and basic auditing capability on claims of authorship and publication. Treating the repository as a superfluous adjunct to an administrative catalogue is to miss the benefit that a managed repository has to offer.

At the meeting many universities from across Europe spoke of how they were trying to make the two systems work together in one form or another. In some ways, the innovation is not technical, but simply in the concept that institutional information should not be siloed, but that it can be shared between administrative domains for the benefit of the whole institution.

On the technical side, CERIF (Common European Research Information Format) is the data sharing and interoperability standard that euroCRIS are promoting. Now on its third major iteration since 1991, it models many of the entities found in the research environment, particularly people, institutions, projects and research publications, patents and products. The standard is expressed in the language of the relational database, with individual tables defined for each kind of entity. Its particular novelty is that roles like "author" or "project manager" are relationships between independent entities (people, publications or projects) rather than attributes of those entities, and that all relationships are constrained to an explicit time-period.
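That central idea - a role as a dated relationship between independent entities rather than an attribute of one of them - can be sketched in a few lines. The class and field names below are mine, not CERIF's:

```python
# Illustrative sketch of a CERIF-style role: a dated link between a
# person entity and some other entity (publication, project, ...).
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Link:
    person_id: str
    entity_id: str
    role: str          # e.g. "author", "project manager"
    start: date
    end: date

    def active_on(self, when):
        """Is this relationship in force on the given date?"""
        return self.start <= when <= self.end

links = [
    Link("p1", "pub9", "author", date(2008, 1, 1), date(2008, 12, 31)),
    Link("p1", "proj3", "project manager", date(2009, 1, 1), date(2011, 12, 31)),
]
roles_in_2009 = [l.role for l in links if l.active_on(date(2009, 6, 1))]
print(roles_in_2009)  # → ['project manager']
```

Because the time-period lives on the relationship, the same person can hold different roles over their career without any entity record being rewritten.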

These requirements are straightforward to satisfy in EPrints - each new entity type (e.g. project) is just an extra dataset with an independent metadata schema and its own workflow and display rules. So an EPrints repository should be able to take on a useful role within a CRIS environment, deploying its comprehensive set of services for ingesting and managing project and personnel data, as well as research publication data. What is not yet clear is whether EPrints should be a helpful adjunct to, a useful component of, or a competent replacement for a CRIS.

That dilemma will be partly solved by the new JISC R4R (Ready for REF) project, whose aim is to investigate the use of CERIF as a mechanism for exchanging research information between universities (e.g. supporting the movement of staff throughout their careers). R4R, a joint activity between King's College London and the University of Southampton, is focusing on the transfer of research information in the context of the forthcoming UK Research Excellence Framework (REF) activities.

In the meantime, there is a lot of interest in this area: the report on Serving Institutional Management that I mentioned above was the most-downloaded item of the OR08 conference.

Thursday, 7 May 2009

Batch Updates

I've been taking advantage of the new ISI license to import citation counts into our school repository.

Now that we have Web of Science and Google Scholar citation counts listed for matching eprint records, you can search for eprints that fall into a citation range (e.g. 10 or more) and order search results by either type of citation count.

Now I'm being asked to provide reports of h-factors and citation averages and community normalised bibliometrics. What larks! I've had to draft in Perl assistance to write the necessary scripts.
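For reference, the h-factor (h-index) those reports need is simple once you have per-record citation counts: it's the largest h such that h papers each have at least h citations. A sketch in Python of the kind of thing the Perl scripts compute:

```python
# Compute an h-index from a list of per-paper citation counts.
def h_index(citations):
    """Largest h such that h papers have >= h citations each."""
    counts = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3]))  # → 4
print(h_index([0, 0]))            # → 0
```

Citation averages fall out of the same data, and "community normalised" versions just divide by a field baseline.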

But what it's taught me is that we're still missing out on an awfully big proportion of our school's research outputs - and we're an engineering school, not a humanities school. So I'm looking to add a THIRD source of citation data - the ACM Digital Library. The ACM run many of the journals and conferences that our researchers publish in - journals and conferences that ISI don't index. And then there's Scopus - that would potentially be a FOURTH citation data source. It looks like we'll need to have a separate "evidence of impact" dataset in the repository.

Integrating all this extra data has been made very easy by some developments from Chris Gutteridge and Tim Brody. Firstly, the EPrints import framework now supports an update option that allows you to merge new data with existing records. Secondly, the Microsoft Excel exporter (which is so useful for generating complex reports and charts) now has a matching importer. Combine these two features together and you can use all the user interface features of a spreadsheet to do large-scale, batch data amendments outside the repository environment and then commit the updates to the repository. This is great for spotting and fixing metadata errors.
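The shape of that round-trip is a merge-by-identifier overlay: export records, amend them in the spreadsheet, then fold the amendments back onto the existing records. This toy stands in for the EPrints update option (record and field names are illustrative, not the EPrints schema):

```python
# Overlay edited fields onto existing records, matching on an id field.
def merge_updates(records, updates):
    """Return records with any matching updates merged over them."""
    by_id = {u["eprintid"]: u for u in updates}
    merged = []
    for rec in records:
        edit = by_id.get(rec["eprintid"], {})
        merged.append({**rec, **edit})   # update fields win
    return merged

repo = [{"eprintid": 1, "title": "Teh Web", "citations": 0},
        {"eprintid": 2, "title": "CRIS", "citations": 3}]
edits = [{"eprintid": 1, "title": "The Web", "citations": 12}]
print(merge_updates(repo, edits))
```

Records with no matching row in the spreadsheet pass through untouched, which is what makes batch amendment safe.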

Tuesday, 5 May 2009

Repository as Platform? Or Product?

Is repository software (DSpace, EPrints, Fedora) a platform to build on, or a shrinkwrapped product to unpack and use? There are at least two answers to this question, and each package has to try to strike the right balance for its intended community.

I note the results of the recent DSpace community survey, which show that 80% of repositories use the default metadata configuration, 78% have made at most "minor cosmetic" changes to the configuration, and 62% use no add-ons beyond the distributed core code (stats, SWORD, Google indexing etc).

This seems to support the view that if you come up with a new feature but it isn't a standard part of the core repository then it won't be used. It's a challenge for repository software designers, and for repository projects. For example, how do you make the repository user interface pleasing and useful to artists, engineers, teachers and researchers all at the same time?

Answers on a postcard, please!

Tuesday, 28 April 2009

Radio EPrints

I have just found out that UMAP 2009 is publishing MP3s of all its conference abstracts before the conference starts. I'm going to keep a weather eye on how this goes, because the organisers clearly think that this is a way of getting the message out about their work.

This isn't particularly new: in the 1980s the cool Mac crowd were using HyperCard to record Usenet messages to listen to in the car. But when proper conference organisers (rather than maverick geeks) start to invest the effort to do it for their conference, then it is more serious. After all, these guys have a conference to organise!

Is this the right time to take the step and do similar for our repositories? Record summaries, commentaries or adverts for each of our papers / lectures / reports ? Why not have an "elevator pitch" to advertise important work? There must be all sorts of opportunities for syndicating audio content, and mixing it up into academic playlists.

Wednesday, 22 April 2009

There's an app for that

I bought an iPhone last year to replace my six-year-old cellphone because (a) I needed to use it for work and (b) all the other mobile phones I looked at were so complicated that I had to ask my teenage children how to work them.

But the iPhone isn't being advertised on the basis of its design or on its ease of use. Instead Apple are churning out advert after advert about all the applications that people have written for the device. The phone-as-a-platform for useful little 'apps' seems to be a winning story, and I have seen rabid anti-Apple friends and colleagues being won around to buying one on this basis.

Apple's slogan is "there's an app for that" - whether you're trying to find a local restaurant, go bird watching, get directions, authorise a visa transaction, print shipping labels, rent an apartment or buy a textbook. And they don't even mention the word "phone" in their ads any more.

You can probably see where I'm going with this - the repository as a platform, a locus for useful services, an environment for innovation. I've said it before - we need to cram a lot more interesting and useful services into our repositories - stuff that will pique the interest of researchers and users, not just digital librarians and developers. Apparently not everyone goes faint at the thought of preservation, APIs and service oriented architectures.

So I was delighted that Adam Field came up with a new EPrints export plugin today that helps me to show off the contents of my repository. Called the DocumentGrid, it simply displays a linked table of thumbnails for the records in a collection or in the results of a search. It is really simple - it took him about twenty minutes to write - but it is one of those really useful functions that I find myself needing. It answers the question "show me what's in your repository", or "let me see what's in that collection". It's useful, and it's eye candy at the same time.

Let's have some more of these! Half the stuff for the iPhone is eye candy, or only useful in really specific circumstances, so I don't think we should be afraid of making things that aren't profound and respectable and adaptable. Of course people want to show off their work - and not just in CVs. There should be tons more apps that help researchers look good and show off their stuff.

This one is available for download at the EPrints distribution repository:

Friday, 17 April 2009

EPrints and its Development

I'm in the process of writing a paper about the first ten years of EPrints (yes, it'll be 10 years old at the end of October 2009), and I've been trying to put together a comprehensive overview of the internal construction of EPrints as it stands in 2009. What you might call an "architecture diagram for users".

Stung into action by John Robertson's recent blog entry on repository developments which mentions only a few of the ideas that we are working on, I thought it might be a good idea to share a draft version of this diagram.

The PDF linked from this posting shows my understanding of the internals of EPrints, highlighting the bits that we are working on at the moment in the version 3.2 development track.

Some more details about the 24 new features planned for the next release of EPrints can be found on the EPrints Wiki. Presentations and demos will be forthcoming at Open Repositories 2009, ECDL, OAI6, Sun PASIG and all good repository workshops in your area :-)

Cloud, Web, Intranet and Desktop Connectivity - repository data can now be stored in the cloud, on the web, on an intranet storage service, on a local disk or on any combination of the above. Also, the contents of the repository can be mounted on the user's desktop as a 'virtual file system'.

Desktop Document Support - thumbnails and embedded metadata extraction are provided for Microsoft Office documents. Media copyright checklists are generated for PowerPoint slideshows to assist Open Access clearance for lecture slides. Complex thumbnails are now supported, such as multi-image thumbnails for a slideshow or an embedded FLV clip of a video.

Research Management - Support for new kinds of administrator-defined data objects with project, organisation and people datasets as standard to provide compatibility with Current Research Information Systems (CRIS). Citation reporting will use ISI's Web of Science as well as Google Scholar.

Preservation Support - Preservation Planning Capabilities embedded in the repository using PRONOM and DROID.

Improved EPrints Data Model - files, documents, users and all other data objects (not just eprints) now have persistent URIs and can have arbitrary relationships between them. An RDF export plugin provides linked data capabilities, and a new REST interface provides an API to all EPrints data.

Improved Interoperability and Standards - SWORD 2 (v1.3 Specification), new OAI-ORE import and export plugins, improved RDF plugins for better support of W3C Linked Data, CERIF support for Current Research Information Systems, and enhanced compatibility with DRIVER project systems.

Miscellaneous Improvements - there are more enhancements to repository administration and improvements to the way that abstract pages are generated, plus:
  • IRStats/EPStats are better integrated with the EPrints distribution.
  • Autocompletion/name authorities have been added for institutions and geographical places (both with geolocation data).
  • Enhanced user profiles allow for more CV-relevant information than just publication lists.
  • User-defined collections provide "shopping trolley" functionality for ephemeral compilations as well as persistent collections.
  • A scheduler/calendar supports planning embargoes, licenses, preservation activities, periodic maintenance activities etc.
  • Quality assurance issues can be manually raised and resolved.
  • PDF coverpage capabilities will be provided as standard.

Google and Repositories

Continuing yesterday's comments on the effect of Google PageRanking in resource discovery, there is an added Google effect that compounds the problem of discovering resources in repositories. Google doesn't treat each resource separately, but instead it aggregates all the resources from a single site, showing only the top two resources from that site no matter how many should appear.

For example, if I search for the terms "ontology" and "hypertext" directly in our school repository, 8 articles are returned. If I do the same search in Google, then our repository appears gratifyingly at the top of the list of results, but only TWO of those items are listed together with a discrete link to more items from this site.

So, not only is your article in competition with all other web pages on the planet, it is doubly in competition with other articles in your repository which could deprive it of its rightful place in the rankings.
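The "top two per site" behaviour described above is easy to picture as a filter over a ranked result list. Here is a minimal sketch in Python (the site names and paper titles are made up for illustration; this is my reading of the behaviour, not Google's actual algorithm):

```python
def host_crowd(results, per_site=2):
    """Keep at most `per_site` results from each site, in rank order,
    mimicking the host-crowding behaviour described above."""
    shown, counts = [], {}
    for url, site in results:
        if counts.get(site, 0) < per_site:
            shown.append(url)
            counts[site] = counts.get(site, 0) + 1
    return shown

# Four ranked hits: three repository papers compete with one external page.
ranked = [("paper1", "eprints.example.org"),
          ("paper2", "eprints.example.org"),
          ("webpage", "other.example.com"),
          ("paper3", "eprints.example.org")]
print(host_crowd(ranked))  # paper3 is crowded out by its own repository
```

Note that paper3 loses its slot not to a competitor but to its repository siblings, which is exactly the double competition described above.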

This means that we need to think about redesigning our repository pages to link to "other related work" that the visitor may not have seen represented in Google.

Thursday, 16 April 2009

PageRank and Repositories

I commented before on the big impact of Google on repositories, and the way that it overwhelms all other forms of access to repository contents. Today I've had a look at the log files for our EPrints server to find all the web requests referred by Google (for any kind of page - abstract or full text or collection list). As a result of a conversation with someone on the topic of search, I wanted to check the "tenacity" of the Google enquirers. Since before the advent of the Web it's been common knowledge in the Hypertext research community that people tend not to scroll and click more than they have to when navigating an information system.

It would be nice to think that repository users (whoever they are) carefully looked through all the relevant and useful results returned by Google; but practical considerations mean that their investigations are more limited. In fact, 78% of our Google referrals came from the FIRST results page of a Google query.

This means that it is really important to make sure that your repository pages get a good PageRank - there are only ten opportunities for your content to appear in front of most Google users. If your paper happens to fall at position 11, you have a much reduced chance of being found.
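The log analysis itself is straightforward, since Google encodes later result pages in the referrer URL with a `start` parameter (page two is `start=10`, page three `start=20`, and so on). A rough sketch of the check I ran, with made-up referrer strings standing in for our real log entries:

```python
from urllib.parse import urlparse, parse_qs

def from_first_page(referrer):
    """True if a Google referrer URL appears to come from the first
    results page: no 'start' parameter, or start=0."""
    query = parse_qs(urlparse(referrer).query)
    return query.get("start", ["0"])[0] in ("", "0")

referrers = [
    "http://www.google.com/search?q=ontology+hypertext",
    "http://www.google.com/search?q=ontology+hypertext&start=10",
    "http://www.google.com/search?q=eprints&start=0",
]
first = sum(from_first_page(r) for r in referrers)
print("%.0f%% of referrals came from page one" % (100.0 * first / len(referrers)))
```

Run over a real Apache log you would first grep out the Referer field for google.com hits; it was this kind of tally that produced the 78% figure above.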

Sunday, 29 March 2009

Repository as a Trusted Intermediary

The idea of a trusted intermediary that makes content both durable and usable with a "Chinese menu" of added-value services is my new favourite definition of repository. These words come from the DuraSpace project's midterm report, and although they were not penned with repositories per se in mind, I believe that they provide an excellent description of a repository's rationale, i.e. to increase trust in material created in
  • a random place on the Web,
  • my rented niche in the Cloud,
  • my departmental filestore, or
  • my own desktop.
So I am particularly pleased to congratulate the JISC EdSpace team on their recent upgrade to the EdShare learning resource repository at Southampton, because they have helped deliver on the first bullet point - adding trust to web resources.

I have been using EdShare to distribute material from the modules that I teach. Much of this material consists of PowerPoint lecture slides that I have created, but a significant proportion of it is material available on the open Web - perhaps other people's slides, papers or reports from their own web sites.

In the past I have had two choices: either deposit a link to the web page or deposit a copy of the web page. The former is a lightweight solution and obviously the right "Web thing" to do when you just want to provide a URL pointer to someone else's resource. But the latter is the right "repository thing" to do in terms of making a safe and durable copy. Except that I don't automatically have the right to clutter up Google space with ad hoc copies of the same material, reducing their PageRank. So most of the time I have settled for "just linking", at the price of accepting that some of this material will move or disappear before I teach the topic again. In the words of Humphrey Bogart, I know that I'll regret it - maybe not today, maybe not tomorrow, but soon, and for the rest of my course.

Now EdShare lets me have my cake and eat it. I can deposit and disseminate a link to the external material (as before) but the repository will make a dark copy and start serving that if the original disappears. Essentially they treat important material that I find on the Web in the same way that they treat important material that I move into the repository. Both get managed, indexed, thumbnailed and subjected to the normal range of repository services.
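The serving logic behind this "link with a safety net" is, in essence, a fallback resolver. A minimal sketch, assuming nothing about EdShare's actual implementation (the function and fetcher names here are entirely hypothetical):

```python
def resolve(url, fetch_live, fetch_dark):
    """Serve the live web page if it is still reachable; otherwise fall
    back to the repository's dark copy. The two fetchers are callables
    that return page content or raise on failure."""
    try:
        return fetch_live(url)
    except Exception:
        return fetch_dark(url)

# Stub fetchers standing in for a real HTTP client and an archive store.
def live_ok(url):
    return "live content"

def live_gone(url):
    raise IOError("404 Not Found")

def dark_copy(url):
    return "archived copy"

print(resolve("http://example.org/slides", live_ok, dark_copy))    # live content
print(resolve("http://example.org/slides", live_gone, dark_copy))  # archived copy
```

The point of the design is that depositors only ever handle the link; the dark copy is invisible until the original breaks.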

So I'm delighted that I can now do the right Web thing and the right repository thing at the same time.

Thursday, 26 February 2009

Why I Need Trusted Storage

Yesterday I went to the Apple Store to get my laptop hard disk repaired for the FIFTH time. Each time the data has been unrecoverable - or worse - partially recoverable. Each time I have lost material that was not backed up, and each time my different sets of historical backups were recovered but only partially integrated with each other. Each fatal disk crash knocks a week out of your working life, what with trying to extract anything important from the smoking remains of the still-spinning disk, taking the machine to the repair shop, waiting for them to repair it, working on another system, getting email sorted, getting your old machine back, restoring the operating system and applications and then trying to copy all your old backups onto the new disk. The effect is that each subsequent crash has confused my backups and restores to the point where I have folders within folders within folders of different sets of partially restored material and I no longer know what has been restored where.

How could I possibly get into such a lunatic state, I hear you ask. Is this man the worst kind of professional incompetent? Everyone knows you have to back your stuff up. Why doesn't he just buy a big disk and use Time Machine? These are good questions. I ask them of myself all the time.
  • Our school systems team disavow responsibility for all laptops. We are literally on our own if we dare to have mobile machines.
  • When I started on this voyage of data loss some three years ago, Time Machine wasn't invented.
  • Disks that you buy for backup are just as likely to go foom as your own personal laptop disk. My main coherent, level 0 backup on a LaCie Terabyte disk just stopped working one day, just when I tried to restore my work.
  • Large disks are forever being used for other urgent purposes. Students need some space for something. A project needs some temporary storage. You need to be able to transfer a large amount of data from one machine to another. It gets difficult to manage the various assortments of undistinguished grey bricks that build up in your office. Which one has the old duplicate backup on it that is no longer necessary?

There are lots of other mitigating circumstances with which I won't bore you, but what I would like to lay down are my beliefs that (a) backup management is a complex task that requires serious attention and preferably support from professionals who can devote some attention to it and (b) it is never urgent enough to displace any of the truly important and terribly overdue academic tasks that you are trying to accomplish TODAY so you don't get sacked.

I've had a lot of time to reflect on this since my laptop started plunging me into regular data hell, and the idea of trusted storage for me isn't just about having files that don't disappear. It's about having an organised, stable, useful, authoritative picture of my professional life - research and teaching - that grows and tells an emerging story as my career develops. That's mainly what has been disrupted - I can pretty much find any specific thing that I want by grep/find or desktop search. But the overall understanding of what I had and what I had been working on has been disrupted and damaged and fragmented.

So an intelligent store should help me understand what I have - a bit like the way that user tools like iPhoto help you understand and organise thousands of images. It should be possible to get a highly distilled overview/representation/summary/visualisation of all my intellectual content/property/achievements as well as a detailed and comprehensive store of all my individual documents and files.
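Even the crudest "distilled overview" would be a start. For instance, a sketch of a first pass over a personal file store, tallying documents by extension across a directory tree (nothing iPhoto-grade, just an illustration of the kind of summary I mean):

```python
import os
from collections import Counter

def overview(root):
    """A crude distilled overview of a file store: count files by
    extension across the whole directory tree under `root`."""
    tally = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            ext = os.path.splitext(name)[1].lower() or "(none)"
            tally[ext] += 1
    return tally

# Show the five most common file types under the current directory.
for ext, count in overview(".").most_common(5):
    print(ext, count)
```

A real intelligent store would go much further - clustering by project, by date, by topic - but even this one-screen summary is more than my folders-within-folders of restored backups currently give me.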

I guess you can see where I'm going with this. I've gone and got the ideal desktop storage and the dream repository all mixed up. Well perhaps I have - but why not?

Anyway, all's well that ends well. My colleagues all clubbed together and got a terabyte Time Capsule for work, that is run by a sympathetic member of the systems team. And Apple just phoned up to offer me a brand new 17" MacBook Pro in exchange for my broken old one.

Still, I'd really like to make my data store intelligible as well as safe!