Sunday 30 November 2008

More Repository Value for users - making PowerPoint files from RSS feeds

Executive Overview: I tried, and failed, to find existing desktop applications that could work with repository collections; in the end I had to write some simple programs to demonstrate the kind of thing that can be achieved with a repository, some content and Microsoft Office. The screendump on the left is an example. Scroll down for some more examples from EPrints and DSpace.

Since my previous posting about Repository Value, I have been very aware that all my examples were focused on the web, and that the services and widgets that I described were for web page authors to use on web pages. In fact, many of them might be better suited to web admins and web developers than to academic users and researchers themselves.

Although we live in a Web-dominated world, most of my colleagues do not use the Web to originate material. Work is found on the web, but created on the desktop (or laptop), and consequently the tools and services that people are mainly familiar with are those of the desktop, rather than webtop. Cloud computing may be the way of the future, but currently services like Google Docs gain a lot of attention but few users according to a recent report by ClickStream Technologies.

So I particularly wanted to find some good old-fashioned desktop applications that could do useful work with material in repositories. Something that could take a collection of material from a repository and synthesise a useful overview document that describes it, or display it in a slideshow. And most importantly, it had to be something that could be used by a highly educated but non-technical researcher. Or even an extremely clever but technically incompetent professor. Not a software developer!

I looked at different ways of using Microsoft Office, but it doesn't schlep material from the Web very well. Word can read in individual web pages, but collections are spread across a group of pages, which are difficult to handle without programming. I had some previous experience of forcing Adobe Acrobat to create a slideshow from a repository, but that was too fiddly. I had some success playing around with the Firefox "Scrapbook" extension to grab all the linked pages from a collection as a single (humongous) web page, but I couldn't find a way of doing anything productive with that page.

The fundamental problem is that repository web pages (as displayed by DSpace, EPrints, Fedora, Digital Commons or anyone else) are too complex to interpret without creating a bespoke application that is tailored to data-scraping each kind of repository (and each version of each kind of repository). So to have any chance of broad-spectrum success, I needed to find a simpler format to work with that is supported by all repositories.

Luckily, RSS is just such a format. Repositories use it to describe lists of recent deposits, or lists of items in their collections. Previously I was investigating it for Web-based services, but now I want to use it for a desktop-based service. Which, to be honest, pretty much means Microsoft Office. Other productivity suites exist, but if you want to target as many repository users as possible then Microsoft Office is a pragmatic first step.

So, does Office have any built-in facility for reading RSS files? Unfortunately it doesn't. I seem to be out of luck with finding an off-the-shelf application that everyone has access to. I'm just going to have to make something myself.

But despite not having an RSS capability, Office has something almost as good - an open data format. I might not be able to get PowerPoint to make a slideshow directly from an RSS feed, but I can knock up an application to make the slideshow for me. And I don't even have to be a Microsoft developer to do it - because it's an open format I can just use whatever tools I am most familiar with. 

A Microsoft Office 2007 file (or Office 2008 if you are on a Mac) is actually a Zip file containing a collection of XML files. As with a web page, some of those files are content (each slide in a PowerPoint slideshow is a separate XML file), some are stylesheets (each slide layout is a separate XML file) and some are media files (images and videos). I wrote two shell scripts to create the appropriate XML and media files to (a) create a new slideshow with just a title slide and (b) add a new slide onto the end of a slideshow. I then wrote some XSLT to turn an RSS feed into a shell script that successively calls those two commands. Each slide in the slideshow has a title, description and image taken from the corresponding item in the RSS feed. The slideshows are created with no styling, but opening them in PowerPoint allows a theme to be chosen for them with a single click.
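
For anyone who wants to try the same trick without writing shell scripts and XSLT, here is a minimal sketch of the idea in Python. It uses the feedparser and python-pptx libraries (neither of which my pipeline actually used), the feed URL is a placeholder, and it skips the per-item images for brevity:

```python
# Sketch: build a PowerPoint slideshow from an RSS feed.
# One slide per feed item (title + description); images omitted for brevity.
import feedparser
from pptx import Presentation

feed = feedparser.parse("http://eprints.example.org/cgi/latest_tool?output=RSS2")

prs = Presentation()
title = prs.slides.add_slide(prs.slide_layouts[0])      # title slide layout
title.shapes.title.text = feed.feed.get("title", "Repository slideshow")

for item in feed.entries:
    slide = prs.slides.add_slide(prs.slide_layouts[1])  # title + body layout
    slide.shapes.title.text = item.get("title", "")
    slide.placeholders[1].text_frame.text = item.get("summary", "")

prs.save("slideshow.pptx")  # unstyled, exactly like my scripts' output
```

The resulting file is unstyled, just as mine were, so the one-click theming step in PowerPoint still applies.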

The following examples show some of the slideshows I have made from repository material.


These are my latest papers and posters from the ECS EPrints repository at Southampton University. Note that this example (like the ones below, but unlike the example at the head of this post) has had a "Theme" applied by PowerPoint.

These are recent items deposited in the JISC KULTUR project repository.

This is from a photo collection at Demetrius, the DSpace repository at the Australian National University.

These items aren't from a repository at all - they are from the ECS Press Release feed. After all, there's more to the web than repositories, and there's more to a lecturer's source material too!


This is a partial success - I have had to write some programs rather than use readily-available applications. On the other hand, these rough-and-ready demos could be turned into a web service quite easily - go to a web form, paste in the RSS feed URL and download a PPTX file. (Of course, I'd have to debug them first. And re-implement them in a rather more robust environment.)

But there's a lot more that could be done. For an example of the sort of well-thought-out service that operates in this space, have a look at the New England Journal of Medicine's Image Library service. You provide some query terms, and it returns all the figures from recent NEJM articles that match your query. You then choose the relevant ones, press a button and it turns them into a PowerPoint slideshow for you, with all the necessary citation information (and licenses) attached. Imagine if OAIster could provide that for you, based on all the articles on any subject in all the repositories in the world...

Here are a few other ideas off the top of my head.
  • Most PowerPoint presentations that are created are variants on a previous presentation. Gradually a presenter's message evolves over time, and their presentations reflect that. A repository not only allows the author to keep track of his/her previous presentations, it could track all the variants of each slide and allow the author to recombine them in new ways. It's very common for me to search through a dozen PowerPoint files on my laptop, to find a particular version of a particular slide that explained a point with just the right emphasis to a particular audience. Let the repository find all the individual slides that mention a particular phrase and allow me to choose le slide juste.

  • Or let the repository ingest process extract all the images in a presentation and determine the source and verify the rights clearance.

  • Or the next time I have to do a presentation to potential students, or potential funders, let the repository copy my basic set of core slides, and update it with new slides from the latest research repository feeds and the latest press release feeds.
Watch this space!

Thursday 20 November 2008

Addendum to Repository Rats and other wildlife

Another thing that characterises me is that I'm located in a single department and deal with only a pair of focal disciplines - Electronic Engineering and Computer Science. Most repository managers cover an entire institution, or in the case of a subject repository, the whole world. By contrast I'm very parochial and limited in scope. (I'm thinking of herds of majestic wildebeest sweeping across the Serengeti vs territorial animals like bears and robins.)

           Global  General
ECS        -       -
Soton      x       X
arXiv      X       -
Wisconsin  x       X

Repository Rats and Other Wildlife

Dorothea (cavlec) is well-known for coining the term "repository rat", and documenting the frustrations that go with that role. Kudos to her for identifying and speaking out about the experiences of the role, and for further characterising the activities and limitations of a repository manager in her blog piece Meet Ulysses Acqua.

I don't think that Dorothea's intention was to impose a one-size-fits-all characterization of all repository-workers - rather to see a fair representation of her own experience when others seemed to be ignoring it. I don't feel as if I am a repository rat exactly because my activities seem to have few rat-like characteristics, and many of my repository colleagues don't seem to be rats either. But my colleagues aren't exactly like me either. I seem to occupy a particular niche in the repository ecology, and they too are adapted to their own environments. So on the plane back from SPARC DR2008 I started to wonder about the different characteristics of that ecology, and the kinds of animals - rat and non-rat - that had evolved within it.

  • Solitary or social? Some animals hunt in packs, co-ordinating their activities (lions, raptors), whereas others operate by themselves (rats, squirrels). Do you work with a team of subject librarians, or are you left to work by yourself?
  • Wild or domesticated? Some animals fit into and contribute to human organisations (dogs, horses), whereas others operate on the edge (cats) or totally outside the structure (foxes, wolves). Are you involved in institutional committees, consulted by management, and is your repository a core service?
  • Hunter or scavenger? Some animals actively search out sources of food for the kill, whereas others eat whatever is left around, or whatever is offered to them. Do you search out material that is suitable for deposit by looking through Web of Science or by interviewing faculty?
  • Preener or sloven? Some animals preen themselves (birds) or others (chimpanzees) obsessively. Do you have strict QA standards that you apply to depositors' material?
  • Valued or vermin? Many animals are known as hoarders or collectors of food. Some are pampered (hamsters), persecuted (rats) or tolerated (squirrels). How do you feel about your role? This is more about professional esteem than observable traits.

(I don't claim that this is a comprehensive view of the repository world - to be more general I could have started out with the essential question "Do you have a backbone?")

How do I fit in this taxonomy? I operate alone with no assistance in managing the repository content or policies. Thanks to our de facto mandate I don't need to actively chase material, I just wait for it to be deposited into the repository. Up to this point I haven't done much QA, I have let our library colleagues do this for the important records. I am involved in the departmental infrastructure. So that makes me solitary/scavenger/domesticated/sloven. My colleagues in the library who run the institutional repository are social/hunter/domesticated/preeners - perhaps? And Dorothea - well, I think that she is social (she works with a team of librarians on different campuses) / hunter / preener / wild*. Expressed in tabular form, this looks as follows (I have thrown arXiv in for good measure):


           Solitary  Wild  Hunter  Preener
ECS        X         -     -       -
Soton      -         -     X       X
arXiv      X         X     -       x
Wisconsin  -         X     X       X

I will leave others to come up with animals with the appropriate characteristics. The lesson that I want to draw out is that there is variety in the repository kingdom. Even in the four repositories above we have all possible combinations of solitary/wild. The hunter vs preener columns look more correlated - perhaps you don't go to the effort of hunting for material and then not bother doing QA on it. And conversely, if you wait for material to drop into your lap you are less likely to care about its condition. (arXiv scores a 'half' on the preener scale because SPIRES provides a feed of "corrected" references for published material.)

So in conferences and community activities, messages that go down well with repository managers who have ticked the 'domesticated' box are going to irritate the 'wild' ones and vice versa. Those giving authoritative conference messages need to realise that they aren't speaking to a monoculture! That may be a lot to ask at the moment - they are still coming to terms with the existence and possibilities of repositories and the role of repository managers. A finely nuanced appreciation of the variation of the species is some way off yet.

So I hope Dorothea will forgive me for avoiding the "rat" identity - it is not out of a lack of solidarity with her difficult position or an appreciation of the hard work that she does. I'm just a manager in a different set of circumstances. 

And the real reason that I avoided matching animals to the different repository roles? The only domesticated, non-social, non-hunting, non-preening animal I could think of while on the plane was a rabbit. And I just can't abide the thought of being known as a "repository bunny".

Tuesday 18 November 2008

The Value that Repositories Add

One of the things I failed to do during the Evidence of Researcher Engagement meeting was to give a presentation that I had been working on for over a week. The discussion just ran away with me! So I have been persuaded to post it to the Web to try and get the message "out there".

A repository should be able to provide lots of benefits to its users. In particular, it should make things more valuable when they are deposited in a repository than when they are just files on a laptop or on a web server. This presentation is written to inform researchers of the kinds of things that they should be able to do with their material in repositories. It starts off with the basic functions that are provided FOR THEM (wide access, persistence, backups, bibliography pages, administrative reports etc) and then tackles the kinds of ways that researchers can take advantage of the material FOR THEMSELVES.

This is not a complete list - I would love to have lots more suggestions and examples - and in some ways it is a bit optimistic. No repository will do all the things that I have listed - but it shouldn't be too hard for any repository to provide some of these services.



The other thing that I failed to do was to attract many visitors to the EPrints table during the sponsored breakfast. I was convinced that my brilliant marketing idea of a platter of Apple and Raisin fritters would get people lining up to read my leaflets, but unfortunately the quality of the rest of the breakfast buffet was just so great that I couldn't compete. Oh well, onwards and upwards!

Evidence of Researcher Engagement - stories, narratives and anecdotes

On the evening before the SPARC Digital Repositories conference I hosted a meeting to discuss the evidence of researcher engagement from individual testimonials and anecdotes. That seems to have been a bit of a theme throughout the conference: Jennifer Campbell-Meier spoke about gathering stories about repositories as a means of advocacy in the new horizons panel and Bob Witeck spoke about the importance of stories in marketing open access to faculty and management. I hope that we're going to be able to start up a central place for collecting personal testimonials about repository benefits under SPARC's auspices. The idea is that a repository manager or subject librarian can have somewhere to go and look for success stories as told by faculty and researchers from particular disciplines. (Do you wake up in the morning, dreading an advocacy meeting with the Chemistry Department? Why not download a couple of repository testimonials from the chemistry page on repositoryluv.com!) More on this later!

Sunday 16 November 2008

Unlikely Heroes?

On the shuttle bus from Dulles to Baltimore yesterday there were a load of people heading for a large (30,000 delegate) neuroscience conference. They all introduce themselves and their research to each other, and then they turn to me. I hate that kind of situation - being confronted with hard scientists. You see, there's the researcher pecking order that has to be upheld, and it roughly tallies with the Impact Factor of your discipline's major journals. So biomedicine is up there at the top, and computer science is, well, you see, we are a conference discipline. That doesn't even register.
So I tell them that I'm heading to a small workshop in Baltimore on Digital Libraries (SPARC DR2008). "Really?" is the polite response. "Yes," I venture "it's all about ways of providing open access to your research." Instant kudos. "Wow, that's brilliant. We so need that." And then come the stories of how they still have to fall back on their grad school library facilities when they are now independent researchers in other institutions with their own students.
And I am the hero on the bus! Still, it's been a long flight and there's two hours to go until I get to my hotel and so I fall asleep. When I wake up it all seems like a dream.

Tuesday 11 November 2008

Someone Stop Me!

I had a meeting with some representatives from other Schools last week - they wanted to deposit some Masters theses in a repository but they were hindered from doing so by the policies of the respective services. The long and short of it was that I volunteered to set up a demo repository to allow them to get their documents housed somewhere safe, but also because I know that we need somewhere to store four years of our school's masters and undergraduate dissertations. We'll use the demo to make a business case to the university to extend the "institutional repository umbrella" while we're getting some experience with the issues.

Anyway, I set up the repository over the weekend and deposited the first batch of 100 dissertations, and - this is my point - it just feels so GOOD to be in control. I don't know if other repository managers get that feeling too, but when you get to make all the metadata decisions and press all the import buttons and BANG you've got a new batch of stuff all sitting pretty, I get a warm glow. Is this wrong? It's the same feeling as home baking, except that the cakes disappear by the morning whereas the dissertations are still there!

It's certainly more satisfying than setting exams, which is what I was supposed to be doing.

Friday 7 November 2008

Repositories Making Life Easier For Faculty?

Could it be that repositories will help make life easier for faculty? "Pull the other one" I hear the repository-weary skeptics cry. "We've heard it before!"

Well, if there's one chore that academics are bad at - aside from depositing items into repositories - it's keeping our web pages up to date. About 1/3 of our lecturers at ECS don't have working home pages - and neither did 20% of CS professors at MIT the last time I checked against the internal staff list. And those who do have working pages seem to keep them several years out of date. Certainly mine had its last major update three summers ago!

Now the school provides me with a set of official portal pages which are generated by its internal databases, but they are a bit, well, impersonal. If only there was a way to keep my personal pages updated as effortlessly, but in a way that didn't look too corporate and databasey. I'm caught between regularly-updated/dull and individual/bespoke/stale.

I think that the answer (or something like it) may be found at PageFlakes. It's a personalised content aggregator that is typically used for pulling together news feeds from a variety of sources (CNN, Yahoo, Youtube etc) but with luck, if you can find the right set of information feeds about YOU and YOUR SCHOOL then PageFlakes can do a very passable job at creating a home page about you.

The example page that is illustrated above (the actual URL is http://www.pageflakes.com/lescarr/25235060) is formed from an RSS feed of my school's press releases, a feed from our student bloggers and two feeds from two repositories - the researchy repository, which gives the latest set of papers/presentations that I have written, and the teaching repository, which gives the materials that I have most recently made available for my teaching. It also has a short description and photo that I put in by hand. All in all that makes a good current description of my status - research outputs, teaching outputs, student activity and school activity.

Because the repository has created preview images of all the documents it holds, that makes the RSS feeds much more visual and interesting (and personal) than a simple table of contents. It feels like a home page, rather than an aggregation of syndicated content.

I know that all the cool dudes discovered PageFlakes 18 months ago, but I'm quite jazzed about it as a vehicle for personalised repository content. And I do think that as institutions get to grips with marketing themselves through the web, the repository can have a role as a content provider for building rich media Web content for widgets, mashups and all kinds of social network applications.

Without getting too carried away, the repository will start to make my online life easier by managing all my research and teaching material, so that I can use it to create a bespoke web presence - my home page.

Wednesday 5 November 2008

More Things to Do With a Repository Feed

Yesterday I blogged about using an RSS feed to create an interesting visualisation of a collection of items from a repository.

The image on the right is taken from another demo page that was put together using the Widgetbox service. You simply have to paste in the URL of the RSS feed that you want to use, make a couple of selections about the colour and size of the widget and then it provides you with a bunch of HTML that you can copy and paste into your web page. It will even add the widget directly to your blog or Facebook page automatically.

Other services like Yahoo Pipes are good at combining, filtering and generally tweaking RSS feeds, so you could even create a federated widget.
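
If you'd rather not rely on an external service, the core of that "federation" idea is tiny. Here is a sketch that merges two repository feeds into one combined list, newest first, using the feedparser library (the feed URLs are placeholders):

```python
# Sketch: merge two repository RSS feeds into one "federated" list,
# sorted with the newest items first. Feed URLs are placeholders.
import feedparser

feeds = ["http://eprints.example.org/cgi/latest_tool?output=RSS2",
         "http://dspace.example.edu/feed/rss_2.0/site"]

entries = [e for url in feeds for e in feedparser.parse(url).entries]
entries.sort(key=lambda e: e.get("published_parsed") or (), reverse=True)

for e in entries[:20]:
    print(e.get("title", "(untitled)"), "-", e.get("link", ""))
```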

Tuesday 4 November 2008

Visualising Repository Contents

Those who have followed this blog will know that I'm a sucker for a good visualisation that provides a helpful way of displaying and accessing the contents of a collection or a whole repository.


So I read with interest about cooliris, a convincing and polished implementation of the displaywall metaphor that works on media resources described in RSS feeds. Using XSLT I turned the XML export of an EPrints search result into the required MediaRSS format (making use of the eprint item thumbnails) and embedded it into a web page as a demo. The results are best viewed in their installable full-screen viewer rather than the web page-embedded Flash program, especially if the feed extends to thousands of objects!
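
For reference, the target of my XSLT was ordinary RSS plus the extra media: elements that cooliris looks for. Here is a sketch of generating that Media RSS in Python rather than XSLT (the item values are placeholders; the namespace is the standard http://search.yahoo.com/mrss/):

```python
# Sketch: wrap repository items as a Media RSS feed for a cooliris-style wall.
# Item values are placeholders; my real version was XSLT over the EPrints export.
from xml.etree.ElementTree import Element, SubElement, tostring, register_namespace

MEDIA = "http://search.yahoo.com/mrss/"
register_namespace("media", MEDIA)

items = [
    {"title": "A paper", "link": "http://eprints.example.org/123/",
     "thumb": "http://eprints.example.org/123/thumbnails/preview.png"},
]

rss = Element("rss", version="2.0")
channel = SubElement(rss, "channel")
SubElement(channel, "title").text = "Repository wall"
for it in items:
    item = SubElement(channel, "item")
    SubElement(item, "title").text = it["title"]
    SubElement(item, "link").text = it["link"]
    SubElement(item, "{%s}thumbnail" % MEDIA, url=it["thumb"])
    SubElement(item, "{%s}content" % MEDIA, url=it["thumb"], type="image/png")

print(tostring(rss, encoding="unicode"))
```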

This technique is obviously best for visually attractive items, rather than a wall full of text-based journal articles, and would probably form an accompaniment to a collection listing, rather than replacing it.

Sunday 26 October 2008

Patterns in Repository Access

The clocks have gone back this morning, and I was looking for something to do with my extra hour. Having tidied the kitchen cupboards, I thought I'd have a play with the Google Analytics results for our school repository.

I've only ever reported summaries of download data to our research committee - and that data is pretty constant at 30,000 full-text downloads per month, or a million papers every three years. So I was interested to see how the daily pattern of repository accesses varies over the academic year, and how that variance itself seems to repeat every year. The image attached to this posting shows the daily downloads (recorded by Google Analytics) plotted over the last year (October 27 2007 - October 26 2008) in blue, with the previous year's data also plotted in green.

The rapid oscillations are the weekly rise and fall - a peak on Mon/Tues followed by a gradual, slight decline over the week and a slump on Saturday (to around 1/3 of peak levels) with a slight rise on Sunday. Invoking Excel on the Google Analytics results, and ignoring weeks with public holidays or traditional staff vacations (where access levels are significantly lower and patterns of attendance are less predictable) the general pattern for the remaining 58 high-activity weeks' access is Monday 18%, Tuesday 18%, Wednesday 17%, Thursday 17%, Friday 16%, Saturday 7%, Sunday 8%.
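
For anyone wanting to repeat this on their own repository, the weekday breakdown is a one-screen job. A sketch, assuming a two-column "date,count" CSV exported from Google Analytics with ISO dates, and leaving out my filtering of holiday weeks:

```python
# Sketch: weekday profile of repository downloads from a daily CSV
# of "YYYY-MM-DD,count" rows (format assumed; holiday weeks not excluded).
import csv
from collections import defaultdict
from datetime import datetime

totals = defaultdict(int)
with open("daily_downloads.csv") as f:
    for date_str, count in csv.reader(f):
        weekday = datetime.strptime(date_str, "%Y-%m-%d").strftime("%A")
        totals[weekday] += int(count)

grand_total = sum(totals.values())
for day, n in sorted(totals.items()):
    print(f"{day}: {100 * n / grand_total:.0f}%")
```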

What surprised me was how much the gentle falls and rises over the academic year seem so similar on both curves. The places where the match is less than exact correspond to the start of the graph (there is no data for Oct-Nov 2006) and to Easter in each year (mid March in 2007 and early April in 2008). 

I'm not sure that there's a moral to this posting, apart from the fact that there seems to be a hidden regularity in the repository downloads. I must set a student to investigate!


Tuesday 21 October 2008

Data Access in Repositories - Don't Overlook What We Already Have!

Dorothea Salo's latest blog entry takes EPrints and DSpace to task for not being able to help users analyse (query, slice-and-dice, facet, analyse, number-crunch, mash-up) data files.

You can already do that - at least, you can in Microsoft Excel. As an example, I chose a data file that is already in the MINDS repository (DSpace) and one that is in my school repository (EPrints) and created a new spreadsheet on my desktop that referenced data ranges in both of the archived data sets. I have put it on the Web so that you can check it out yourselves.

The screen shot shows the new spreadsheet that calculates the average publication date of the 2900 records in the ACRL WSS dataset, and the count of the number of data points in A Longitudinal Study of Self Archiving.



The Excel cell reference syntax isn't very pretty - it is a backward compatible munging (that's a technical term) of a URL into a UNC syntax. (And by the way, the munging was done automatically by Excel 2008 on a Mac.)
=COUNT('http:[//eprints.ecs.soton.ac.uk/13906/3/TIMINGS.xls]a.txt'!B2:J1617)
=AVERAGE('http:[//minds.wisconsin.edu/bitstream/handle/1793/23529/ACRLWSS.Resource.2007.xls]ACRLWSS.Resource.2007'!$H$2:$H$2940)
It is an interesting issue to think about what data-oriented functions a repository could provide. However, we should not overlook the functions that we already have! And in the future, I would hope that URI-based data referencing will become commonplace in all our desktop applications.

Wednesday 15 October 2008

Repository Benefits - Expertise Finding

The UK's continuing focus on research assessment has led some repository managers to offer the repository as the key means of gathering evidence of research outputs for their institutions. The experience of those repository managers has been distilled into a set of recommendations for repository management.

A notable consequence of our obsession with research assessment is an enhanced role for research management within the institution. Suddenly all the senior managers want to know how best to capitalize on our existing strengths to make the most of future funding and publishing opportunities. And that means knowing what our strengths are. And that means knowing what our researchers do. And how they work together to do it best. And that's where the repository comes in - capturing our institution's intellectual outputs and providing services over them.

So my boss has asked for our repository to provide an Expertise Finder - for him to be able to find out what groups of people are working together in any particular area.

As it turns out, that was quite easy to do, because the repository already creates "communities of practice" focused around each person - the screendump on the left is taken from my school publication page. The cloud of names shows all of my co-authors, and the size of each name is related to the number of times they have written a paper with me.

All we had to do was put that functionality into an export plugin so that the authors from any set of papers can be visualised in the same way. That way you can find out who is involved in a specific topic like "Web Science" by doing a search for "Web science" and exporting the results as an "author cloud". You can try it out on our repository.
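
The idea behind the export plugin is very simple: count each author's appearances across the selected records and scale the displayed name accordingly. A sketch (the record structure here is illustrative, not the actual EPrints export format):

```python
# Sketch of the "author cloud" idea: count author occurrences over any
# set of records and scale each name's font size by frequency.
from collections import Counter

records = [  # illustrative records, not the EPrints export format
    {"title": "Paper A", "authors": ["Carr, L", "Harnad, S"]},
    {"title": "Paper B", "authors": ["Carr, L", "Hall, W"]},
]

counts = Counter(a for r in records for a in r["authors"])
biggest = max(counts.values())
for author, n in counts.most_common():
    size = 100 + int(100 * n / biggest)  # font size as a % of normal
    print(f'<span style="font-size:{size}%">{author}</span>')
```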

Now he wants this as a network diagram so he can see the relationships between the named authors, how they fall into subgroups who work together, and which people link up the different groups. I think we'll have something developed soon, and I hope that it'll be useful to other repository managers!

Tuesday 14 October 2008

A Present for Open Access Day!

Here's a present for Open Access Day 2008 - a handy patchwork quilt made from the top 150 Open Access resources on the Web!

Well, it's not really a quilt - it's a web page. But it is lovingly stitched together from thumbnails of the highest ranked web pages that Google returns on the subject of Open Access. However involved you are with open access and institutional repositories, I bet you haven't seen a lot of this material.

Click on the image to the left (a thumbnail of the whole quilt) and it will take you to the quilt page. There, each resource is represented by a clickable thumbnail that will take you to the real page. Of course, you can get much the same result by doing a Google search for Open Access, but it's not as jolly and cheerful.

Ho ho ho! Happy Open Access Day!


Monday 21 July 2008

Top Gear, Top Blokes

Fans of the BBC's Top Gear show are having to wait 21 years for studio tickets as the waiting list is now over 336,000 people long, according to Autoblog.

Mind you that's nothing compared to the 100 year wait that institutional repository fans might have to endure to reap the benefits of the ROAD project's latest experiment. In a stunt very reminiscent of the Top Gear program, Stuart Lewis and his team of repository torturers are going to stuff a million items into the ingest interfaces of DSpace, EPrints and Fedora repositories. If this really were "Top Gear", two repositories would explode and the winner would be Stuart Lewis with a wallet of rewritable DVDs. Since this isn't "Top Gear" all that will happen is that some of the repositories might slow down unacceptably and will need to have their storage or metadata modules re-engineered to work efficiently at this scale.

But what's the 100 year wait about? That's how long it would take for an Institutional Repository working at full efficiency to accumulate a million items, given that the average institution has about 1000 academics who each deposit a research or teaching output around once a month (or 10 times a year given time off for vacations and admin). That makes about 10K items per year, 100K items per decade or a million items per century accruing to your repository. And given that most IRs aren't operating at that level of efficiency yet, the Repository Managers of the next century can safely drink a toast to the ROAD team for setting their minds at ease.

Thursday 10 July 2008

Open Access: Nurture? Or Nature?

In the aftermath of the announcement about Nature depositing author postprints into PubMedCentral, I tried to use papers from Nature as an example for some EPrints sessions I am running at an Open Access workshop at the International Centre for Theoretical Physics (ICTP) in Trieste. This morning I was trying to find some papers for the delegates to practise depositing into an EPrints repository, and I have discovered that you need an ICTP library password to be able to download Nature PDFs - there isn't a blanket IP subscription. Fair enough, I have no problem with how they manage their subscription. However, it turns out that if I go to Nature.com's front page, I am forbidden from seeing the picture of Nature magazine's current front page! Literally I get the normal web page with a hole in it - and a request to type in my subscription password.

This is the difference between what I see from my home institute and at ICTP.
(Left: with subscription. Right: without subscription.)
Now, I don't really think that Nature is trying to withhold a commercially valuable image from dirty-rotten-internet-freeloading-scoundrels. I am sure that it was just a mistake in translating company policy into HTML code. But the fact that such a mistake is possible is evidence that Nature is genuinely conflicted between subscription access thinking and open access thinking. This is what my friend Stevan Harnad has recently pointed out - on the one hand Nature is offering something positive for OA, but on the other hand they are still restricting OA. On reflection and on balance, it would be better for them to give us what we cannot take for ourselves (permission for immediate OA) rather than giving us what we could have done anyway (deposits in PMC and repositories).

Some have suggested it is rather churlish or ill-mannered of Stevan to point this out, and that we should be grateful and just shut up. I don't agree. We still want Open Access to Research Outputs, not a 6-month intellectual headstart for paying customers.

Thursday 26 June 2008

Inspirational Teachers

I listened to John Willinsky give an inspirational keynote at ELPub 2008 this morning. He banged the drum for Open Access and announced an OA mandate for the Stanford School of Education. According to the story, he was describing the Harvard mandate to his colleagues in a meeting and they instantly voted to adopt a similar mandate themselves. Way to go!

However, the message that I shall take home was his discussion of the connection between "public" forms of knowledge and "highly authoritative" forms of knowledge. He gave the specific example of the links made between Wikipedia and the Stanford Encyclopedia of Philosophy, i.e. opportunities where a general and democratic information resource links back to a resource which is written and governed by domain experts. A really very good thing, according to Willinsky, who believes that the sustainability of the entire research infrastructure is based on its perception as a Public Good, one that is open and encourages the participation and engagement of its sustaining community.

In other words, the fact that many non-researchers seem to be downloading papers from our repositories shouldn't be seen as a suspicious thing. "Things on the Web are just downloaded by teenagers and pornographers" according to some colleagues who are less than Web-friendly! "If a download isn't attributable to someone in a University then it shouldn't count - it's obviously a mistake or being read by someone who can't possibly understand it." That's the attitude.

But perhaps not. According to Willinsky, our (Higher Education's) ongoing existence as a part of society depends on us acknowledging that less esoteric forms of debate and knowledge do exist (public forums and websites) and that we should expect and encourage the public to refer to our work, and link to our work and even read our work.

And I think that if repositories have a role in making collections of research material accessible, then perhaps we should be thinking about how to make them a bit more accessible to the public, in helping us become inspirational teachers with half an eye to the rest of society.

Wednesday 25 June 2008

Repositories Should be More Like Email (apparently)

See below for a summary of an interesting JCDL 2008 paper that adds to the "repositories - they're all wrong" debate. Cathy Marshall is well-known (and, I think, well-loved) in the hypertext community for her ethnographic studies of information handling, and here she reports on a small scale study of the information management practices of research authors as they go about the task of writing papers, and the implications for repositories. The paper is noteworthy because it highlights the role of email as a personal archiving solution and argues that any repository platform will need to do better than email on a range of criteria to gain user acceptance.

Well, it's a new target for repository developers, and perhaps a new marketing slogan to look forward to (EPrints: Sucks Less Than Hotmail).

From my experience, the paper rings true in its description of ad-hoc and distributed author processes, but it is focused on a small group of Computer Scientists all of whom use LaTeX and BibTeX, so I don't know exactly how applicable its message is across the whole institution.


Marshall, C. C. 2008. From writing and analysis to the repository: taking the scholars' perspective on scholarly archiving. In Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries (Pittsburgh, PA, USA, June 16-20, 2008). JCDL '08. ACM, New York, NY, 251-260. doi:10.1145/1378889.1378930.

(For those without subscriptions for the ACM Digital Library, Google Scholar will point you at a preprint available at tamu.edu.)

ABSTRACT: This paper reports the results of a qualitative field study of the scholarly writing, collaboration, information management, and long-term archiving practices of researchers in five related subdisciplines. The study focuses on the kinds of artifacts the researchers create in the process of writing a paper, how they exchange and store materials over the short term, how they handle references and bibliographic resources, and the strategies they use to guarantee the long term safety of their scholarly materials. The findings reveal: (1) the adoption of a new CIM infrastructure relies crucially on whether it compares favorably to email along six critical dimensions; (2) personal scholarly archives should be maintained as a side-effect of collaboration and the role of ancillary material such as datasets remains to be worked out; and (3) it is vital to consider agency when we talk about depositing new types of scholarly materials into disciplinary repositories.

The Bits I Underlined


Furthermore, from the point of view of the researchers and scientists themselves, institutional archiving arrives on the scene late in the process; the deposit of publications and datasets is an afterthought to the actual work, the research and writing. What would make archiving more integral to the entire process? What does scholarly archiving look like today from the scholar's perspective? How can normal collaborative interactions be used to improve repository quality?

I make an effort to focus closely on the practices and artifacts relevant to maintaining personal archives and contributing to institutional repositories.

Second, participants feel that versions record the development of ideas, a trail that may prove important. But how important? Much of the history and provenance of an idea can be reconstructed from communications media like email, especially when it is combined with intrinsic metadata such as file dates. Thus benign neglect coupled with imaginative interpretation will get you pretty far in reconstructing a publication's history.

What is most apparent throughout this discussion is that personal archiving is a side effect of collaboration and publication: for example, if email is used as the mechanism for sharing files, it also becomes the nexus for archiving files. If one's CV is the means by which a public list of publications is maintained, it is also used as a pointer for oneself to the most authoritative version of a publication. Personal archiving can be both opportunistic and social: participants talked about tracking down public versions of their own publications to reclaim copies of lost work.

Email is cited as a good permanent store for three reasons: (1) it is easy to browse chronologically, which makes retrieval easy and lifts the filing and organizing burden; (2) intrinsic metadata supports the reconstruction of context (for example, who made particular revisions and why); and (3) email is usually accessible from any web browser. If email is used as an archive, some care must be taken to ensure everything that is important is actually in email. Some archival material is normally in email (reviews, for example) and no extra effort needs to be expended to make it part of the record. Other types of artifacts (run output, for example) must be put into email deliberately. Email is a sufficiently good archive that some participants made the effort...

It is easy to see how email provides just enough mechanism to fulfill the minimal version of these requirements. Any CIM infrastructure must beat email along all of those dimensions if it is to be adopted in email's stead

Tuesday 24 June 2008

Publishing - A One-Word Oxymoron?

Why do they call it "publishing"? Wouldn't it be much more accurate to say "I've just had a paper privatised?"

Just thinking aloud.

Wednesday 11 June 2008

Negative Click Repositories

The topic of "negative cost" repositories has been doing the rounds in the blogosphere. Chris Rusbridge has rebadged it as the negative click repository on the grounds that there is a positive cost associated with setting up and using a repository. I think I would rather talk about value or profit - the final outcome when you take the costs and benefits into consideration. Do you run a positive value repository? Is it frankly worth the effort? Are your users in scholarly profit, or are you a burden on their already overtaxed resources?

Chris quotes from Cavlec's (imaginary) repository apologist, who attempts to defend a very high-cost, low-benefit repository. But he then goes on to treat that passage as if it were a factual evaluation of a real repository, a damning piece of evidence on the fundamental uselessness of repositories. It isn't! Ulysses Acqua is a straw man and his repository is a caricature of a real repository. I certainly don't accept that he describes my repository, and I can easily answer yes to many of those questions. So while I'm not complacent, and I recognise that there are many new services I want my repository to offer, I think we're not doing too badly on the value scale already, thank you very much.

Negative click/positive value. It's a nice rhetorical stance and a useful banner to rally the troops to, but let's not flagellate ourselves unduly. Let's recognise where good value exists and promote it! Let's foster new services around the material that repositories capture, manage and expose. Otherwise we'll just give up and run to the next bandwagon, which will always sound more enticing because it hasn't yet had to deal with real practice!

Anyway, I think that I am in violent agreement with Chris, so to show solidarity I will do what he asked and list some positive value generators: publicity and profile (CVs, Web pages, displays, adverts for MScs/PhDs/staff), community discovery, laptop backup and asset management.

Sunday 8 June 2008

Even Simple Services can be Annoyingly Complicated

I know I've mentioned this before (on this blog and elsewhere) but I think that displays are important. Whether it's bibliographic displays of papers for CVs and online bibliographies, or whether it's very visual galleries, slideshows and montages of papers, presentations, posters, images and videos, a core part of the academic life is "showing off" the things you have done, and telling stories about them. Hopefully, your repository can help you with these displays.

In a previous blog (Cobbling it Together) I described a quick and dirty way of making a slideshow from a set of documents in a repository by using Acrobat to do all the heavy lifting. Several months later I created a slideshow a different way with the EPrints ZIP exporter and the Mac's iPhoto slideshow software. Both of these methods required quite a lot of manual labour to provide an end-to-end service. Now mediated deposit is one thing, but mediated use is quite another, and so I have been trying to find an easier way to produce a good looking slideshow with EPrints.

We have done various experiments with the display side, and that turns out to be quite easy. Whether it is with Flash, or a 3D renderer, or an external graphics environment, you can build interesting displays, assuming you have the right data files. The problem is getting the right data files! Quite often the items you most wish to show off are the most visual ones - posters or presentations, rather than papers. This means rich media, which almost certainly means commercial desktop presentation programs like Microsoft PowerPoint or Apple Keynote. Now, repositories make every effort to create previews and thumbnails of ingested documents, but this is mainly limited to images, videos and PDFs. Office documents have to be downloaded and opened in their native application to be able to view their contents.

When I was managing the OR08 repository, I encouraged authors to send in the source of their presentations and the poster artwork. Many sent me PDFs only, some sent me PDFs with the original PowerPoint, and some sent me the PowerPoint files only. For that last group I made sure that I manually converted the document into PDF (using PowerPoint 2008 on the Mac). The repository automatically created preview files for the first page of the PDF files, so each presentation ended up with a preview image of sorts. (Previews are actually created in three sizes by default by the open source Ghostscript interpreter.)

The problems that I have are that (a) previews aren't made of PowerPoint documents unless a derived PDF version is supplied, (b) the previews are relatively low-resolution and (c) the creation process is not reliable. Of the 144 PDF documents in the OR08 repository, about 10% have no preview because the conversion process failed. Of the remainder, a further 10% have an inaccurate preview (missing fonts, incorrect geometry, badly positioned graphic components). To produce a full-screen slideshow of posters, what is really needed is high fidelity, high quality (200dpi) images of the documents. Even when the preview-generation software is changed to increase the resolution, this does not fix the 20% of previews that do not pass the expected quality threshold. And, of course, it still requires the human depositor to submit a PDF document to make the Office original 'previewable'.

The best results I have obtained for generating previews have been from using bona-fide PowerPoint to create PDFs and Acrobat (or the Mac's "Preview" program) to create images from those PDFs. Since Office is a piece of personal desktop software, it can't really be used in the context of a server, and the Microsoft documentation advises against it, since various functions try to communicate with the user's desktop environment. This seems to preclude having a preview generation service tucked away in the repository software.

I've been experimenting with Chris G to find a manageable way to handle the production of high quality viewable PDF and preview images for Microsoft Office documents. It's not finished yet, but we're getting there. Parts of the process are in place, but they're all in pieces on the kitchen floor (metaphorically speaking). What I'm trying to do is document our experiences here, and in particular, why the whole charade is looking very different to the neat little repository service I imagined at the start.

Basically, whatever we try, it involves running Office on someone's desktop, and that means transferring files from the repository and back again. We looked at file sharing technologies (e.g. SAMBA or WebDAV) and file transfer protocols (e.g. FTP or email), but we had problems because our server is behind a firewall and none of our user desktop machines are allowed inside it. Our initial expectation was that the repository would initiate and control the production of previews using some kind of push technology (e.g. messages or drop folders). In the end we settled on the new EPrints REST API, allowing the user/client/service to control the choice of items which require previews.
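
In outline, the client-driven loop looks like the sketch below. The URL scheme, credentials and filenames here are illustrative, not the exact EPrints REST layout:

```python
# Sketch of the client-driven conversion flow: pull an Office document
# over the repository's REST API, convert it locally, push the preview back.
# Paths, auth and document numbers are illustrative placeholders.
import requests

BASE = "http://eprints.example.org/rest"
AUTH = ("converter", "secret")

# 1. Fetch the original PowerPoint file from the repository.
doc = requests.get(f"{BASE}/eprint/13906/documents/1/slides.ppt", auth=AUTH)
with open("slides.ppt", "wb") as f:
    f.write(doc.content)

# 2. ...convert to PDF/PNG on the desktop (Automator, .NET, etc.)...

# 3. Push the derived preview back as a new document on the same eprint.
with open("slides.png", "rb") as f:
    requests.put(f"{BASE}/eprint/13906/documents/2/slides.png",
                 data=f, auth=AUTH,
                 headers={"Content-Type": "image/png"})
```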

I've ended up using Automator on a Mac to control the use of Word, PowerPoint and Preview to process Office and PDF documents. Later I will investigate using .NET to control Office and Acrobat on a Windows desktop. At the moment it is happening on a physically separate computer, although we might look at virtualising the process and running it on the same machine as the repository.

This process is Automate-d but not completely automatic, because user tools are involved. Every so often when a Word document opens up, Word puts up a dialog box to ask me whether I want to trust its macros. Or when it tries to save to PDF I get a message warning me that a footer is outside the printable region. Also, Automator is wont to crash after processing about 50 documents. This means that it took about 4 days elapsed time to convert 60 PowerPoint and 100 Word documents into PDFs and thence to 200dpi PNG files. If I had been constantly in the office to give it the total 10 minutes of attention it needed, the whole process would have taken about 2 hours. Still, a "slightly hands on" process is better than "no process at all" because I need those files. And I hope that I'm only going to have to process this backlog once anyway!

Now that I've got the new files on my desktop computer, I can put them back using the REST API, but how does the repository know that they are preview files? The automatically generated previews are stored in a separate place in the repository; third party services can access them but there is no public API to create or update them. Also, only preview images are handled, not PDFs. So the files that I have created aren't stored as 'repository backstage previews' but as independent documents within the original eprint that have a specific relationship (mechanicallyDerivedVersion) to the original Office document.

This means that any eprint which contains an Office document may now acquire a number of service-generated additional documents. At the moment, EPrints doesn't use them in its internal processes (for providing thumbnails on abstract pages), but export plugins can treat them how they like. The only thing that EPrints understands is that if a document changes (is updated, or deleted) then all of the documents that were mechanicallyDerived from it must be deleted. The assumption is that the 'preview' has been rendered obsolete or out-of-date by the changes/deletions and it is the responsibility of whatever third-party service that created the documents to recreate them.

My 'slideshow' exporter can now look for all the PowerPoint documents it wants, and then use any image file that is mechanicallyDerived from them. Job done! The "preview" semantics of the derived files are only understood by the third party plugins that create and use them, but their temporary and dependent relationship to the other documents are understood by the repository core. As we refine this process we will doubtlessly add something like a derivationRelationship to the mix, so that we can tell our thumbnails from our MD5 signatures.

The main shift in expectations remains that "preview generation" (something that was an automatic, internal service) is being supplemented by an external, partially manual (handheld) service. Sometimes, it seems, the services that a repository takes credit for are actually provided by a human cranking the handle on other pieces of software! I've just had exactly this discussion with Bill Hubbard, and it's making the boundary of my repository architecture diagram look very fuzzy!

Friday 6 June 2008

Friday Afternoon Features

I can now keep up with the latest repository submissions from my mobile phone! Yes, in a fit of Web 2.0 experimentation, Chris and I connected eprints.ecs.soton.ac.uk to Twitter so that every time a new publication goes live it sends out a tweet to the eprintsecs user, which I am now following.

Thursday 15 May 2008

Repository Deposits Double in the UK

The graph shows how monthly UK institutional repository deposits have doubled in the last 18 months. Each repository was receiving an average of 40 deposits per month in October 2006 and is now receiving about 80 deposits per month in April 2008.

The data is taken from ROAR and is corrected for some obviously anomalous activity and for some missing deposit data. Further investigation is required to check whether the trend applies to all repositories or whether it is driven by a small number of better performing repositories. Work is also required to determine seasonal variations and to understand longer-term trends.


Sunday 11 May 2008

Institutional Policies and Institutional Managers

It's not often you get to high-five your institution's chief librarian, but the moment seemed to demand it on Friday morning.

Attending the Repository Steering Group meeting was Bernadette Kelly, the Service Manager of the business unit (ISS) that is responsible for all computing, data and communications facilities for the University. Her responsibility is the delivery and sustainability of core information services that are required for the University's business - particularly the finance and management administration systems. She has been responsible for the repository for a while, but it has rather been eclipsed by other more mainstream applications. In fact, there was supposed to be a technical support team being built up for the repository, but several years ago all bar one of its members were assigned to the rollout of an important Management Information System, and they have never been assigned back! (This singular and wonderful trooper is Adam White, who joined the EPrints Southampton team straight after graduating with a degree in Computing.)

Bernadette came along because of the tension between the use and the resourcing of the repository. It has gained a very high profile internally because of its role in generating the University's submission in the national Research Assessment Exercise, and it looks increasingly likely that the repository will form a platform for the University's ongoing management of research intelligence. The trouble is that this role is just not in keeping with a 1-man support team and we have all been worried about what would happen if Adam fell under a bus, or worse, got a better paid job with a competitor institution!

I think that it's fair to say that ISS have been supportive of the repository from the get-go and in fact the initial rollout was commended by the Vice Chancellor as an excellent example of inter-service collaboration between ISS, the Library and my school ECS. But even so, their natural and professional instincts are not in favour of home-grown, open source solutions. So there has tended to be somewhat less than a wholehearted enthusiasm for its future.

I think that a historical problem has been that ISS have listened to us talking about Open Access, scholarly communications and research quality assessment to other audiences, and they have never seen our ambitions as part of their core mission. So on this occasion we had the opportunity to talk business to them. We talked about the role that the repository has been adopting in gathering and marshalling intelligence about the University's main business products (research outputs and teaching interactions) and about our vision for addressing key current business concerns by enhancing our international profile (Google), increasing revenue (by advertising courses to potential Masters and Postgrad students on repository pages), addressing engagement with industry (similarly, by focused advertising in the repository) and delivering research intelligence (on citation impact and PageRank) to staff and their managers.

There was definitely a lot more enthusiasm as we talked and I think that ISS have moved from seeing the repository as Yet Another Service That Needs Resourcing From Their Overburdened Budget to An Important and Productive Business System That is A Nett Contributor to the University. This is definitely a good result, so as the meeting broke up I took the opportunity to register my excitement with the aforementioned high five.

And it's another example of the need to be able to talk different languages to different people if you really want to offer "a set of services" to the whole institution. Management and admin are a hard-to-ignore part of an institution, but there's a certain amount of worry in the repository community that we will lose our way if we become a "tool of the management" rather than focusing exclusively on supporting grassroots researchers. I take the opposite view - an institutional repository will never be more than a localized library contrivance unless it seeks to serve the concerns of the institutional managers as well as the institutional researchers.

I believe that this position is supported by the experience of a handful of repositories that have participated in research assessment in the UK. All the presenters at the OR08 discussion on Research Assessment Experience came to the conclusion that the hard work involved was more than compensated for by the increase in profile and respect that the repository and the library achieved. I gave the same message in a talk on using a repository for research assessment at the Beyond the RAE 2008 meeting at King's College London in April 2008.

We will explore these issues further at the forthcoming ELPub workshop on Repositories that Support Research Management at ELPub 2008 in June. Part of the workshop will be spent commenting on the checklist for serving institutional management that came out of the OR08 discussions.

Wednesday 23 April 2008

Cloud Computing and Cloud Thinking

Hello from the 2008 Web Conference in Beijing! Yesterday I took part in the Web Science workshop on Web Evolution and spent my evening uploading all the presentations to the Web Science EPrints repository and feeling a bit like Cinderella while my senior colleagues from Southampton went to a reception hosted by Microsoft. While I was uploading a 15MB PDF over a very slow connection I took the opportunity to have dinner in the hotel's Brazilian restaurant. Several Caipirinhas later I returned to finish off the repository management tasks in much improved humour :-)

Today is the first day of the main conference, and the keynote speech was given by a Chinese VP from Google on "Cloud Computing". He covered all the basics about Cloud Computing, and particularly about Google's internal cloud infrastructure and their cloud-based user applications. Now I'm very interested in Cloud Computing as a Computer Science researcher and lecturer, and I'm looking at including it in my teaching and in my work. Hurrah for David Flanders and his Fedorazon project, which is giving us advice about running EPrints in the Amazon cloud.

However, it also seems to me that all this hard work and infrastructure is just moving our current working practices from our laptops and workstations to yet another exciting new platform. Instead of having my files stored on an identifiable piece of hardware in a known location, they are now stored somewhere unknown and unknowable, but invisibly managed, replicated and always available. This might offer various advantages, but it is a fairly superficial change in my working life.

What I'm really interested in is not a shift in technology, but a shift in human behaviour. Not cloud computing but cloud thinking. Encouraging researchers and scholars to move their ideas from the private and inaccessible domain of their laptops or workstations or manuscripts or CD-Rs into the public domain of the Web, to increase the efficiency of the research process and to improve the sum total of human knowledge. Just putting documents or data in the cloud doesn't make them any less private. Moving all of research into the cloud wouldn't increase the sum total of disclosed human knowledge - and that's what I think is really important.

It's all part of the Open Access ideal - don't withhold your intellectual capital unnecessarily. And cloud computing (like service oriented architectures and any other platform infrastructure) may be a useful step in the right direction, or it may be a complete red herring.

Saturday 19 April 2008

Beware What You Wish For

Now that our repository has been upgraded to EPrints 3.1, the repository technical support team (that's Chris) has agreed that the repository management (that's me) should be allowed to have control over the new web-based management tools. In theory, I had the right to this level of control before, but in practice it meant logging into the command line of an infrastructure machine for which I wasn't supposed to have login access. This was part of the management/technician rift that made us put as much repository administration as possible into the web interface of 3.1.

Still, now it's actually arrived, I've realised that all the excuse making and prevaricating that I did before just won't work. The magic words "Oh yes, I need to get the web programming team to look at that" have saved me a lot of work in the past. Now the game is up, my cover is blown and I'll just have to do it myself.

The first thing is to fix the citation styles (we have the italics in the wrong place, and book sections aren't flagged as such). I've got a nice email from Pauline Simpson on the topic somewhere. Then alter the QA audit to ferret out never-published papers. Then update the by-group view pages and their sub-orderings. Then I can take a look at the new tagcloud view-by-keyword styles and the new community of practice co-author listings.

Wish me luck. I hope I don't press the wrong buttons!

Wednesday 16 April 2008

Georgia State vs The Publishers

Apparently Georgia State University has been providing teaching materials to its students without getting the necessary copyright clearance. See the publishers' press release for one side of this story.

I really shouldn't raise my head in public about this lawsuit, because I try to keep quiet about non-OA matters in case I confuse the issues. However, what stands out to me in the above document is something commonly seen in the Open Access debate: publishers glorifying their role. Here's a quote:

“University presses are integral to the academic environment, providing scholarly publications that fit the needs of students and professors and serving as a launch pad from which academic ideas influence debate in the public sphere,” said Niko Pfund, Vice-President of Oxford University Press. “Without copyright protections, it would be impossible for us to meet these needs and provide this service.”

The inference to be drawn from the above paragraph is the obviously false "without copyright protections there would be no scholarship". I suggest the following translation into more grounded reality (copy editing services provided free on this occasion):

“University-based publishing companies are part of the academic food-chain, selling scholar's publications to needy students and professors and serving as one of the channels from which academics' ideas influence debate in the public sphere,” said Niko Pfund, Vice-President of Oxford University Press. “Without copyright protections, it would be impossible for us to meet our needs and provide this business.”

Sunday 13 April 2008

Cow Tipping and All That Jazz

Last week (being the week after That Conference) I was able to escape the country and visit fellow blogging repositarian Dorothea Salo in Wisconsin. Despite warnings of freezing weather and record-breaking snowfalls, I arrived at Dane County airport to the very English sight of grey clouds and heavy drizzle. Dorothea introduced me to Kristin Eschenfelder, who is a researcher in social informatics, and we all spent a very pleasant evening talking social epistemology and information flows in open source software networks at an Indian restaurant.

The following day I had the pleasure of sitting in on a MINDS management meeting (MINDS is the DSpace institutional repository of the University of Wisconsin). Despite the fact that Southampton and Wisconsin have different educational and funding contexts at the national level, and different university structures and management at the institutional level, it was very clear that the challenges and activities of repository management are identical for host and guest. There really ought to be an international repository managers' organisation, independent of the software platforms and the agendas. Neither of us was able to be at the Repository Managers session at OR08 (Dorothea didn't have the funding to attend the conference and I was too involved in conference administration during the event) but I hope that there might be some movement towards that in the aftermath of the conference.

Then it was on to Chicago (even more rain) where I had been invited to speak about EPrints at a CARLI meeting (Consortium of Academic and Research Libraries in Illinois), alongside Tim Donohue (DSpace Committer) and Sarah Shreeve (IDEALS repository manager). Together with Dorothea, Tim and Sarah have been developing BibApp - a bibliography managing application that works alongside repositories. BibApp was one of the finalists in the OR08 Developer Challenge, but this was my first chance to get a close-up look at the software. Previously it had been DSpace-specific software, but in its latest version it integrates with EPrints via SWORD. It contains some potentially very useful functionality for librarians - it extracts lists of publishers from authors' bibliographies and alerts them to those that have the most permissive Open Access policies as stated in the ROMEO database. The intriguing thing from my POV is that BibApp is deliberately implemented as a separate application that works alongside repositories, but how much of it can be achieved inside a repository? What is the best location for repository-enhancing functionality? Where are services located, and who takes responsibility for them? More of this later I think!

PS If you're wondering about the title of this post, Cow Tipping is a rural Wisconsin pastime and All That Jazz is a song from the musical "Chicago".

Upgrading Repositories

Repository upgrades are a blessing to their users (better interface, better services, fewer bugs) but can be a worry to the technical support staff. The key issue is that while Version(n) = Version(n-1) + Upgrade + an hour or less, it may be the case that LocalizedRepositoryVersion(n) = LocalizedRepositoryVersion(n-1) + Upgrade + a month or more.

When we released EPrints v3 last year, we knew that the fundamental rewrite needed to achieve such a big jump in terms of repository functionality was going to lead to a bigger upgrade effort. Although anyone starting off with an EPrints v3 repository found it easy to install, upgrading required a migration wizard to assist the process.

Having gone through all that, it was always our ambition that EPrints 3.1 would be a "trivial" upgrade process, and in fact that was part of the design objective for EPrints 3.0. Still, as the list of new features in EPrints 3.1 grew and grew, I began to worry about what this would mean for people who had to install it. But good news - we installed it on our main server last week and it took "less than an hour". Bear in mind that our main server runs EIGHT repositories from the same installation code, and so required eight sets of configuration checks and tweaks.

(In case you're wondering, those eight repositories consist of four major repositories - the ECS school repository, the public EPrints demo repository, the public EPrints software distribution repository and the Cogprints research repository - and four experimental repositories used by minor projects and workshops.)

Based on this experience, we can say with some confidence that a single repository can be upgraded to version 3.1 in less than ten minutes. Of course, once you've upgraded you'll probably want to spend some considerable time playing around with the new facilities and configuration options, but that won't be the technical support guy's job. In EPrints 3.1 the repository configuration is all done by the repository manager, through the web interface.

What a Long, Strange Trip It's Been

The University's Easter Vacation is just coming to an end, and things are returning to normal after the week-long international festival of repository vitality that was OR2008. I've still got to sort out the financials and finalise the web site, but I've been spending most of my time on the conference repository (http://pubs.or08.ecs.soton.ac.uk/) in the last week.

The thing that no-one tells you about repositories is that they are a lot like children. They end up being wonderfully satisfying, but they take an awful lot out of you and they go through phases of being messy and uncontrollable. This has dawned on me over the last few weeks in dealing with the OR08 repository, which is just emerging into the phase where I'm feeling really proud of it.

It started off only a few days before the conference, when I realised that it was going to be easier to put all the presentations into a repository than manage them all on a website. For the last conference I ran (WWW2006), we put all the presentations in a directory on a webserver and generated all the pages and links from a flat-file database using PHP. I didn't seriously consider using a repository for this conference for a couple of reasons: (a) politics - choosing a specific repository platform (like EPrints) didn't seem very much in keeping with the non-partisan nature of the conference series, and (b) policy - I have no perpetual mandate for launching a repository for the conference series, and making one for the single event seems a bit profligate given the rhetoric of persistence that repositories are couched in. In the end practicality won out over politics and policy, because repositories have moved on so much in the last couple of years that they have become genuinely useful tools for large-scale information acquisition, processing and dissemination for the web. Sure, if you have a small workshop with a dozen papers to publicise, just bung up a website, but with 30 papers+presentations and 50 posters+artwork and user groups providing another 40 presentations (plus a couple of BOFs), a repository becomes an invaluable infrastructure for collecting and displaying material.

I touch on this dichotomy in my own paper at the conference (End-of-Life Scenarios for Virtual Organisation's Repositories) which is all about balancing the immediate usefulness of a repository with the responsibility for sustaining it into the future. In some ways it's an argument not about repositories in particular, but about web resources in general. And perhaps the analogy with children is apt once more - there's a certain excitement in making them, but then someone has to stick around and pay the bills.

Monday 11 February 2008

New Requests: QA and Citation Counting

I am being pushed by the head of the research committee to have the repository send out more QA alerts to all the self-depositing users. Yes, they really do want to be prompted about problems with their metadata! I'm meeting with Chris G to try and decide the best way to do this for everyone, but I think that some of the experiments we tried last summer (see previous blog postings on QA) will help us produce a sleek user interface for the end-users.
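
To give a flavour of what such an alert run might look like, here is a sketch of the general idea - emphatically not the design Chris and I will settle on. The QA rules, the field names and the fetch_records stub are all hypothetical stand-ins:

```python
import smtplib
from email.message import EmailMessage

def fetch_records():
    # Stub standing in for however records actually get exported from EPrints.
    return []

def problems(record):
    """Hypothetical QA rules - the real ones would come from last summer's experiments."""
    found = []
    if not record.get("journal"):
        found.append("missing journal title")
    if record.get("ispublished") == "unpub":
        found.append("still flagged as unpublished")
    return found

# Build one digest of problem items per self-depositing user.
digests = {}
for rec in fetch_records():
    errs = problems(rec)
    if errs:
        digests.setdefault(rec["depositor_email"], []).append((rec["title"], errs))

for email_addr, items in digests.items():
    msg = EmailMessage()
    msg["From"] = "repository@example.ac.uk"
    msg["To"] = email_addr
    msg["Subject"] = "Your repository deposits need attention"
    msg.set_content("\n".join(f"- {t}: {', '.join(e)}" for t, e in items))
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)
```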

I am also being pushed into responding to the national obsession with research metrics by adding citation counting and tracking to EPrints. After Christmas I managed to produce some demo scripts to track the citations of repository holdings using Google Scholar, but they got wiped out in my January Laptop Disk Crash (not to be confused with the February one). I'm delegating the rewriting of the scripts (hey, I'm a senior lecturer!) but things are moving so fast in the UK that they will need to see prime time very quickly!
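
For the record (and for whoever inherits the rewrite), the shape of the thing was roughly as below. I stress that this is a sketch: Google Scholar has no official API, the "Cited by" text it emits is an assumption that can change or be blocked at any time, and the titles are placeholders:

```python
import re
import time
import urllib.parse
import urllib.request

def cited_by(title):
    """Scrape the first 'Cited by N' for a title from a Scholar results page.

    Deliberately fragile: there is no official API, and the 'Cited by'
    markup is an assumption that Google can change or block at any time.
    """
    url = ("https://scholar.google.com/scholar?q="
           + urllib.parse.quote('"' + title + '"'))
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    html = urllib.request.urlopen(req).read().decode("utf-8", "replace")
    match = re.search(r"Cited by (\d+)", html)
    return int(match.group(1)) if match else 0

# Placeholder titles standing in for the repository's holdings.
for title in ["Paper one", "Paper two"]:
    print(title, "->", cited_by(title))
    time.sleep(10)  # be polite: rapid-fire requests get blocked
```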

Saturday 9 February 2008

Let's Do the TimeWarp

One of the reasons I believe in the Preservation ideal is that as a mid-career researcher, I have become very aware of the temporary and unreliable nature of my own personal IT infrastructure. Both the hardware and organisational support offered to help manage my intellectual journey (pretentious? moi?) are totally inadequate. I've just gone through my third hard disk in three months, and each time I've ended up with a period of splintered emails, diary entries, papers and proposals in different folders, using different applications on borrowed machines while the "Support" team try to diagnose and fix my hardware.

I keep going through these processes every few years - stolen laptops, broken laptops, borrowed laptops, new computers that I don't quite have time to transfer all my old environment over to. It just takes so much time, effort AND CONCENTRATION. Juggling backups from various periods, trying to reconcile duplicate files and remember what is on which machine. You never discover you've failed until 6 months later when you look for a document that you wrote 3 years ago on "just this topic" and it's not there - the whole project is missing. Arrgh!

So I (and a whole bunch of my colleagues) have quite fallen in love with Apple's Time Machine software, which creates daily snapshots of your hard disk and allows you to browse backwards through its history. It's like the Wayback Machine, but with an interaction paradigm that someone has actually thought about. It's very effective. And now there's this new wireless hard disk (the Time Capsule) that allows your machine to be backed up, automatically, without even having to plug the backup drive in. Fantastic!

For the first time in history, I'm seeing my colleagues get excited by backups. It was always such a tedious obligation before, and most people didn't do it very often. Certainly not on their laptops, for which our Support team disclaim all responsibility. And now, it can just happen, without thinking about it.

So perhaps, this is how we should make repositories work. Don't ingest individual, exquisitely formed digital items, complete with metadata and licensing information. Just ingest the whole flipping hard disk, offering at least a backup service. As Caveat Lector recently pointed out, everyone wants backup so everyone would use the repository. I was dubious at the time, but as you can see, the idea is growing on me.

So perhaps we ought to augment the OAIS model of the repository which (paraphrased) says that the repository is like a digestive system: stuff goes in one end, gets stored in the middle and then goes out the other end. (They even use the term "ingest", so don't tell me that the metaphor wasn't on their minds.) I'd like to tweak this model to be more like a cow, with multiple stomachs, each of which has a different task in the overall digestion process.

A hard disk (i.e. a computer file system) goes in and gets stored for backup. Multiple versions are handled over time, so the whole history of the data contents is available - only to the owner, of course; at this stage the contents are opaque to the repository staff. This level of privacy would need to be strictly enforced to make users feel happy about entrusting their files to a third party. At this stage, the benefit is all to the user - they have backup.

In the next stage, the user can "break down" the file system into important components - folders for projects, experiments, papers, proposals, lecture courses etc. Important individual documents can be identified. Metadata can be inferred by looking at the relationship between the low-level items (files) and the high level structures (folders and directories). File names, office document metadata, file system metadata, file contents, proximity to other files and their contents can all help to profile the individual items and ease the task of metadata entry.

In the next stage, the user can organise or "map" the important components from the disk (above) into a set of entries in the repository (e.g. all the DOC/PPT and PDF files from this folder go into a single eprint whose title comes from the name of the Word file and whose journal name comes from the folder name).
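
A toy sketch of that mapping stage, using exactly the naming conventions from the example above (one folder per eprint, title from the Word file, journal from the folder name). The conventions and the backup path are illustrative, not a proposal for a real heuristic set:

```python
from pathlib import Path

def infer_eprint(folder: Path):
    """Map one folder of files onto a single draft eprint record."""
    docs = [p for p in folder.iterdir()
            if p.suffix.lower() in {".doc", ".ppt", ".pdf"}]
    word_files = [p for p in docs if p.suffix.lower() == ".doc"]
    return {
        # Title from the Word file's name, journal from the folder name,
        # exactly as in the example above.
        "title": word_files[0].stem if word_files else folder.name,
        "journal": folder.name,
        "documents": [p.name for p in docs],
    }

root = Path("/backup/home/papers")  # hypothetical backed-up home directory
if root.exists():
    for folder in sorted(root.iterdir()):
        if folder.is_dir():
            print(infer_eprint(folder))
```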

Then we get to the normal ingest stage, where the metadata can be checked and improved and all the normal processes can go on.

Perhaps this is just the hysteria of the marking season (I've still got to mark 65 students' XML and XSLT files before the end of the weekend). Or perhaps it's a strange state of mind that comes from living on a borrowed iMac in the spare room until I get my proper laptop back and all its files restored. But it might satisfy the need to get the repository closer to the user, and encourage the greater use of the repository for preservation and open access.

Thursday 7 February 2008

Pride and Prejudice

Early on in my Open Access career I learned to Always Listen To Librarians. I was taught this lesson by the formidable but fabulous team of Pauline Simpson and Jessie Hey who ran the JISC TARDiS project that developed the Southampton Institutional Repository. As a computer scientist I originally thought that I knew everything about digital information management. Now it's not that I think that librarians are always right, or that they are always more right than other groups with a stake in OA, but they do have a lot of experience in managing lots of information sources on behalf of a disparate community of users. They have "form". Or "previous" as you used to hear in TV cop shows. And you ignore them at your peril!

So Caveat Lector is a daily read of mine. I understand where the writer is coming from - repository management is not yet a well funded, well supported or well understood profession, and few repositories have the luxury of a whole team of professionals to dance in attendance on it. Or a single professional, as it happens. As a repository developer and open access advocate I LIKE to hear praise about how good repositories are, but I NEED to hear criticism about how much they suck and Caveat Lector isn't afraid of offering up some well-thought-out criticism on occasions.

Aside: if you talk to Chris Gutteridge in the bar at OR08 he'll tell you that he thinks that our slogan should be "EPrints: we suck less". It's one of those open source developer attitudes, but I'm not sure that I'll be committing it to any T-shirts just yet :-)

I read today's entry on "taking name entries that are obviously for the same person and making sure they have a single representation" and I'm taking it to heart. Look at any repository that has a "list by author" view, and you'll find that you don't need to go far down the first page before you see multiple entries for the same author. Not just DSpace repositories (no finger pointing here) but EPrints, Fez and Vital too. More about EPrints below, but back to the issues that this posting raises.

Firstly, Quality Assurance. All repository managers need to check over their repository, and it's not a task that has been made particularly easy by the repository software. Checking author names (or journal names, or conference names, or any entities) for consistency is a great example of something laborious where escaping from the user interface to the underlying storage layer may actually be a relief. Wow. Doesn't that say something about our software? And I have to put my hands up (another cop show thing) because it's as true in EPrints for that particular task as it is in DSpace. It's a feature that is on the EPrints v3.1 list of things to do, so I hope to be able to announce some progress at OR08, but at the moment it's a fair cop guv.
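
The check itself doesn't need to be clever, which makes its absence from the user interface all the more galling. A first pass can simply normalise every name string and flag the near-collisions for a human to adjudicate - something like this sketch, where the toy name list is the only input (wiring it to a real repository's storage layer is the laborious part):

```python
import difflib
from collections import defaultdict

# Toy input: in real life these come out of the repository's storage layer.
names = ["Smith, J.", "Smith, John", "SMITH J", "Jones, A.", "Jones, A"]

def normalise(name):
    return "".join(c for c in name.lower() if c.isalpha())

clusters = defaultdict(list)
for name in names:
    key = normalise(name)
    # Attach to an existing cluster if the normalised form is a close match.
    hit = difflib.get_close_matches(key, list(clusters), n=1, cutoff=0.8)
    clusters[hit[0] if hit else key].append(name)

for variants in clusters.values():
    if len(variants) > 1:
        print("Possibly the same author:", variants)
```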

But secondly, why bother? As CL puts it: "When the rubber meets the road, libraries don't think IRs are important enough to waste even a smidgen of authority-control effort on." CL puts this down to pride in the repository, but I'd like to suggest it's much more significant than that. In fact, this is a huge, enormous, 3-lane motorway pileup of an issue for an institutional repository - until it's dealt with, no-one's going to be going anywhere down that road. Why? Because your institutional repository becomes institutional when it is embedded into the institution, and that means making it useful to the institution, and that means making it do things that the institution (rather than just the faculty) wants. And the first thing that the institution (read managers, administrators, marketeers etc) wants is lists - what papers are attributable to this person, research group, department, school, project? And the most fundamental part of that is the ability to accurately and authoritatively deliver a list of items attributable to an individual (from there everything else is aggregation).

How do you do that? You can't escape the fact that every local author has to have an id. It can be an email address or a staff number, or anything you like, but as well as being unique it has to be persistent to avoid problems with staff changes. When we were creating the Southampton IR with Pauline and Jessie, we got the pilot version wrong because we avoided adding staff ids - they just looked like too much hard work for the depositors. However, we quickly got back the message that what everyone wanted was up-to-date lists of publications - faculty wanted them for their CVs, departments wanted them for their web pages, and the admin staff wanted them for their never-ending form filling. If you just rely on author names as they are entered (or even as they appear in the published item) then each author appears under 4, 5 or 6 different names and, worse, multiple authors appear as the same individual.

That's why all person names in EPrints (whether authors, editors, lyricists, accompanists, experimenters or anything else) are now a compound of (a) title, (b) given name, (c) family name, (d) lineage (e.g. Jr. or III) and (e) id. And that's also why EPrints now has autocompletion and name authority lists, so that the id can be entered without imposing a burden on the depositor.
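
In other words, the unit of identity is the id, not any particular rendering of the name. A minimal model of the idea (the field names and emails are mine, for illustration - not EPrints' internal schema):

```python
from dataclasses import dataclass

@dataclass
class PersonName:
    family: str
    given: str = ""
    title: str = ""    # e.g. "Dr"
    lineage: str = ""  # e.g. "Jr." or "III"
    id: str = ""       # the persistent bit: email, staff number, whatever

# Three renderings but two people - aggregation follows the id, not the text.
deposits = [
    ("Some paper",  PersonName("Smith", "J.",   id="js5@soton.ac.uk")),
    ("Another one", PersonName("Smith", "John", id="js5@soton.ac.uk")),
    ("A third",     PersonName("Smith", "J.",   id="jsmith@soton.ac.uk")),
]

publications = {}
for paper, author in deposits:
    publications.setdefault(author.id, []).append(paper)

for person_id, papers in publications.items():
    print(person_id, "->", papers)
```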

Back to the title of this posting: in case you hadn't realised, "pride" was the title of CL's posting. And "prejudice" describes my original attitude towards librarians. But it is also a challenge to repository managers who are from the library community - are you prejudiced into seeing the Institutional Repository as Library Property? A Library Plaything? Or a core Institutional Service?

Tuesday 15 January 2008

The Myth of Complex Objects II

Following on from my previous posting, I'd like to say a few more things about complexity. In particular, I'd like to acknowledge that complexity does exist while at the same time standing by my assertion that repository users themselves aren't creating "complex objects". It's the act of putting things in a repository that creates complexity, and that has to be managed in as straightforward a way as possible.

First of all, a definition. Something is complex if it "consists of interconnected or interwoven parts" (according to the American Heritage dictionary at Answers.com).

What authors and researchers in general are doing is creating lots of simple things - a paper, a database, a presentation. They're creating them as files and directories on their computers (laptops, workstations, servers). What authors and researchers need from us are repositories to capture these simple things and manage them (for preservation and access purposes).

Complexity appears when we as repository designers notice that "many things" are being deposited by a content creator, and that these things are not entirely independent of each other. The original source of a paper, its PDF, the presentation that was created to discuss it at a conference, the video of the presentation. These things are all implicitly interconnected in the cataloguer's mind, even though personal experience says that they are probably not explicitly grouped together on the author's hard disk. We want to capture this implicit interconnection and turn it into benefit for the author or the reader.

It may be that the connection is even stronger - a group of papers in a reading list, a set of questions for an exam or a collection of photos of a single event. These examples are more likely to be stored together by the author, simply because they are naturally used together. Even so, they are still created and managed as single files in a directory structure, because those are the day-to-day tools that content creators are familiar with.

So, as responsible information designers we have to decide how to treat this complexity - this implicit relationship between files.

The easiest thing to do is to ignore interconnectedness. We can achieve that end in two ways, either by forcing users to deposit individual files in individual records (leaving the user ignorant of the relationships between the records) or else by allowing users to deposit undistinguished clumps of files in a single record (leaving the user ignorant of the nature of the relationships between the files inside a record). In both cases, subversive use of the record's metadata by the depositor may help overcome the shortcomings of the repository design and reassemble some ad hoc sense of relationship.

The most natural way for a repository to support interconnectedness and relationship amongst the files it holds is to model them in a way that its users will recognise. Hence EPrints allows each record to have many 'documents', where each document has its own metadata to describe its role and purpose. That allows a preprint, a postprint, a poster and a presentation to co-exist within the same record. Even though they may all be PDF files, there is no danger of confusing them, because they all have their own metadata descriptions. To cope with the cases where one of the documents is really a collection of things (like a photo album or a web page), each of the documents is allowed to consist of many separate files.

That means there is more than one way to store a group of inter-related files like a photo collection in an EPrints repository: (a) store each image as a separate eprint record with its own metadata, and perhaps even create a top-level repository 'view' for it; (b) store the collection as a single eprint record, and store each image as a separate document; (c) store the collection as a single document made up of all of the images; or (d) turn the collection into a single ZIP/TAR/METS file and store it as a single item. Which of those choices you take really depends on the significance of the collection and the use to which you wish to put it.
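
A toy model of the record/document/file hierarchy (my own illustrative types, not EPrints internals) makes options (a)-(d) easier to compare:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Document:
    role: str                  # e.g. "preprint", "poster", "photo album"
    files: List[str] = field(default_factory=list)

@dataclass
class EPrint:
    title: str
    documents: List[Document] = field(default_factory=list)

photos = ["img01.jpg", "img02.jpg", "img03.jpg"]

# (a) would simply be one EPrint record per image.
# (b) one record, one document per image:
option_b = EPrint("Field trip 2008",
                  [Document("photo", [f]) for f in photos])
# (c) one record, one document holding all the images:
option_c = EPrint("Field trip 2008", [Document("photo album", photos)])
# (d) one record, one document, one packaged file:
option_d = EPrint("Field trip 2008", [Document("archive", ["fieldtrip.zip"])])

print(option_b, option_c, option_d, sep="\n")
```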

So despite the fact that authors aren't themselves engaged in creating complex objects outside the repository, an EPrints repository supports sufficient complexity to allow the implicit connections and relationships between authored items to be made explicit, and to let users and software take advantage of them.