Sunday 30 November 2008

More Repository Value for users - making PowerPoint files from RSS feeds

Executive Overview: I tried and failed to find some existing user desktop applications that could work with repository collections; in the end I had to write some simple programs to demonstrate the kind of thing that can be achieved with a repository, some content and Microsoft Office. The screendump on the left is an example. Scroll down for some more examples from EPrints and DSpace.

Since my previous posting about Repository Value, I have been very aware that all my examples were focused on the web, and that the services and widgets that I described were for web page authors to use on web pages. In fact, many of them might be better used by web admins and web developers rather than by academic users and researchers themselves.

Although we live in a Web-dominated world, most of my colleagues do not use the Web to originate material. Work is found on the web, but created on the desktop (or laptop), and consequently the tools and services that people are mainly familiar with are those of the desktop, rather than the webtop. Cloud computing may be the way of the future, but according to a recent report by ClickStream Technologies, services like Google Docs currently gain a lot of attention but few users.

So I particularly wanted to find some good old-fashioned desktop applications that could do useful work with material in repositories. Something that could take a collection of material from a repository and synthesise a useful overview document that describes it, or display it in a slideshow. And most importantly, it had to be something that could be used by a highly educated but non-technical researcher. Or even an extremely clever but technically incompetent professor. Not a software developer!

I looked at different ways of using Microsoft Office, but it doesn't schlep material from the Web very well. Word can read in individual web pages, but collections are represented in a group of pages which are difficult to handle without programming. I had some previous experience with forcing Adobe Acrobat to create a slideshow from a repository, but that was too fiddly. I had some success playing around with the Firefox "Scrapbook" extension to grab all the linked pages from a collection as a single (humongous) web page, but I couldn't find a way of doing anything productive with that page.

The fundamental problem is that repository web pages (as displayed by DSpace, EPrints, Fedora, Digital Commons or anyone else) are too complex to interpret without creating a bespoke application that is tailored to datascraping each kind of repository (and each version of each kind of repository). So to have any chance of broad-spectrum success, I needed to find a simpler format to work with that is supported by all repositories.

Luckily, RSS is just such a format. Repositories use it to describe lists of recent deposits, or lists of items in their collections. Previously I was investigating it for Web-based services, but now I want to use it for a desktop-based service. Which, to be honest, pretty much means Microsoft Office. Other productivity suites exist, but if you want to target as many repository users as possible then Microsoft Office is a pragmatic first step.

So, does Office have any built-in facility for reading RSS files? Unfortunately it doesn't. I seem to be out of luck with finding an off-the-shelf application that everyone has access to. I'm just going to have to make something myself.

But despite not having an RSS capability, Office has something almost as good - an open data format. I might not be able to get PowerPoint to make a slideshow directly from an RSS feed, but I can knock up an application to make the slideshow for me. And I don't even have to be a Microsoft developer to do it - because it's an open format I can just use whatever tools I am most familiar with. 

A Microsoft Office 2007 file (or Office 2008 if you are on a Mac) is actually a Zip file containing a collection of XML files. Like a "web page" some of those files are content files (each slide in a PowerPoint slideshow is a separate XML file) and some are stylesheets (each slide layout is a separate XML file) and some are media files (images and videos). I wrote two shell scripts to create the appropriate XML and media files to (a) create a new slideshow with just a title slide and (b) add a new slide onto the end of a slideshow. I then wrote some XSLT to turn an RSS feed into a shell script that successively calls those two commands. Each slide in the slideshow has a title, description and image taken from the corresponding item in the RSS feed. The slideshows are created with no styling, but opening them in PowerPoint allows a theme to be chosen for them with a single click.
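
As a rough illustration of the first half of that pipeline, here is a minimal sketch in Python (not the shell scripts and XSLT described above) that pulls a title, description and image out of each item in a feed. It assumes a plain RSS 2.0 feed with un-namespaced item elements, and it assumes the image is carried as a Media RSS thumbnail or, failing that, an enclosure; real repository feeds may differ on both counts. The function name slides_from_rss and the sample feed are made up for the example.

```python
import xml.etree.ElementTree as ET

# Media RSS namespace, in ElementTree's {uri}tag notation.
MEDIA_NS = "{http://search.yahoo.com/mrss/}"


def slides_from_rss(rss_text):
    """Return one dict per RSS item: the raw material for one slide."""
    root = ET.fromstring(rss_text)
    slides = []
    for item in root.iter("item"):
        # Assumption: the image is a media:thumbnail or, failing that,
        # an enclosure. Compare with None because childless elements are falsy.
        image = item.find(MEDIA_NS + "thumbnail")
        if image is None:
            image = item.find("enclosure")
        slides.append({
            "title": (item.findtext("title") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
            "image": image.get("url") if image is not None else None,
        })
    return slides


# Hypothetical feed, standing in for a repository's "latest deposits" feed.
SAMPLE = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title>Recent deposits</title>
    <item>
      <title>First paper</title>
      <description>An abstract.</description>
      <media:thumbnail url="http://example.org/1/thumb.png"/>
    </item>
    <item>
      <title>Second paper</title>
      <description>Another abstract.</description>
    </item>
  </channel>
</rss>"""

if __name__ == "__main__":
    for slide in slides_from_rss(SAMPLE):
        print(slide["title"], "|", slide["image"])
```

Each dict would then feed whatever step writes the slide's XML and media files into the Zip package; that step is not attempted here.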

The following examples show some of the slideshows I have made from repository material.


These are my latest papers and posters from the ECS EPrints repository at Southampton University. Note that this example (like the ones below, but unlike the example at the head of this post) has had a "Theme" applied by PowerPoint.

These are recent items deposited in the JISC KULTUR project repository.

This is from a photo collection at Demetrius, the DSpace repository at the Australian National University.

These items aren't from a repository at all - they are from the ECS Press Release feed. After all, there's more to the web than repositories, and there's also more to lecturers' sources!


This is some partial success - I have had to write some programs rather than use readily-available applications. On the other hand, these rough-and-ready demos could be turned into a web service quite easily - go to a web form, paste in the RSS feed URL and download a PPTX file. (Of course, I'd have to debug them first. And re-implement them in a rather more robust environment.)

But there's a lot more that could be done. For an example of the sort of well-thought-out service that operates in this space, have a look at the New England Journal of Medicine's Image Library service. You provide some query terms, and it returns all the figures from recent NEJM articles that match your query. You then choose the relevant ones, press a button and it turns them into a PowerPoint slideshow for you, with all the necessary citation information (and licenses) attached. Imagine if OAIster could provide that for you, based on all the articles on any subject in all the repositories in the world...

Here are a few other ideas off the top of my head.
  • Most PowerPoint presentations that are created are variants on a previous presentation. Gradually a presenter's message evolves over time, and their presentations reflect that. A repository not only allows the author to keep track of his/her previous presentations, it could track all the variants of each slide and allow the author to recombine them in new ways. It's very common for me to search through a dozen PowerPoint files on my laptop, to find a particular version of a particular slide that explained a point with just the right emphasis to a particular audience. Let the repository find all the individual slides that mention a particular phrase and allow me to choose le slide juste.

  • Or let the repository ingest process extract all the images in a presentation and determine the source and verify the rights clearance.

  • Or the next time I have to do a presentation to potential students, or potential funders, let the repository copy my basic set of core slides, and update it with new slides from the latest research repository feeds and the latest press release feeds.
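
The first of those ideas, finding the slides that mention a phrase, leans directly on the Zip-of-XML format. The following is only a sketch of how such a search might work, not anything a repository does today. It assumes slides are stored as ppt/slides/slideN.xml parts and that their text sits in DrawingML a:t elements. The helper names and the two-slide demo package are invented for the example, and the demo package is far too skeletal for PowerPoint to open.

```python
import io
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML namespace; the text runs of a slide live in <a:t> elements.
A_NS = "{http://schemas.openxmlformats.org/drawingml/2006/main}"
SLIDE_PART = re.compile(r"ppt/slides/slide(\d+)\.xml$")

# Bare-bones slide XML, just enough structure for the search to work on.
SLIDE_XML = (
    '<p:sld xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main" '
    'xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">'
    "<a:p><a:r><a:t>%s</a:t></a:r></a:p></p:sld>"
)


def slides_mentioning(pptx_file, phrase):
    """Return the part numbers of the slides whose text contains phrase."""
    hits = []
    with zipfile.ZipFile(pptx_file) as pptx:
        for name in pptx.namelist():
            match = SLIDE_PART.match(name)
            if not match:
                continue
            root = ET.fromstring(pptx.read(name))
            # Join the runs with spaces; a phrase split across runs in
            # mid-word would be missed by this simple approach.
            text = " ".join(t.text or "" for t in root.iter(A_NS + "t"))
            if phrase.lower() in text.lower():
                hits.append(int(match.group(1)))
    return sorted(hits)


def build_demo_package():
    """Build a hypothetical two-slide package in memory (not a valid PPTX)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("ppt/slides/slide1.xml", SLIDE_XML % "Why open access matters")
        package.writestr("ppt/slides/slide2.xml", SLIDE_XML % "Budget for next year")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(slides_mentioning(build_demo_package(), "open access"))
```

Note that the number in a slide part's file name is not guaranteed to match the slide's position in the running order, so a real service would also have to consult the presentation's own ordering information.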
Watch this space!

Thursday 20 November 2008

Addendum to Repository Rats and other wildlife

Another thing that characterises me is that I'm located in a single department and deal with only a pair of focal disciplines - Electronic Engineering and Computer Science. Most repository managers cover an entire institution, or in the case of a subject repository, the whole world. By contrast I'm very parochial and limited in scope. (I'm thinking of herds of majestic wildebeest sweeping across the Serengeti vs territorial animals like bears and robins.)

            Global  General
ECS
Soton       x       X
arXiv       X
Wisconsin   x       X

Repository Rats and Other Wildlife

Dorothea (cavlec) is well-known for coining the term "repository rat", and documenting the frustrations that go with that role. Kudos to her for identifying and speaking out about the experiences of the role, and for further characterising the activities and limitations of a repository manager in her blog piece Meet Ulysses Acqua.

I don't think that Dorothea's intention was to impose a one-size-fits-all characterization of all repository-workers - rather to see a fair representation of her own experience when others seemed to be ignoring it. I don't feel as if I am a repository rat exactly because my activities seem to have few rat-like characteristics, and many of my repository colleagues don't seem to be rats either. But my colleagues aren't exactly like me either. I seem to occupy a particular niche in the repository ecology, and they too are adapted to their own environments. So on the plane back from SPARC DR2008 I started to wonder about the different characteristics of that ecology, and the kinds of animals - rat and non-rat - that had evolved within it.

Solitary or social: Some animals hunt in packs, co-ordinating their activities (lions, raptors), whereas others operate by themselves (rats, squirrels). Do you work with a team of subject librarians, or are you left to work by yourself?
Wild or domesticated: Some animals fit into and contribute to human organisations (dogs, horses), whereas others operate on the edge (cats) or totally outside the structure (foxes, wolves). Are you involved in institutional committees, consulted by management and is your repository a core service?
Hunter or scavenger: Some animals actively search out sources of food for the kill, whereas others eat whatever is left around, or whatever is offered to them. Do you search out material that is suitable for deposit by looking through Web of Science or by interviewing faculty?
Preener or sloven? Some animals preen themselves (birds) or others (chimpanzees) obsessively. Do you have strict QA standards that you apply to depositors' material?
Valued or vermin: Many animals are known as hoarders or collectors of food. Some are pampered (hamster), persecuted (rat) or tolerated (squirrel). How do you feel about your role? This is more about professional esteem than observable traits.

(I don't claim that this is a comprehensive view of the repository world - to be more general I could have started out with the essential question "Do you have a backbone?")

How do I fit in this taxonomy? I operate alone with no assistance in managing the repository content or policies. Thanks to our de facto mandate I don't need to actively chase material; I just wait for it to be deposited into the repository. Up to this point I haven't done much QA; I have let our library colleagues do this for the important records. I am involved in the departmental infrastructure. So that makes me solitary/scavenger/domesticated/sloven. My colleagues in the library who run the institutional repository are social/hunter/domesticated/preeners - perhaps? And Dorothea - well, I think that she is social (she works with a team of librarians on different campuses) / hunter / preener / wild*. Expressed in tabular form, this looks as follows (I have thrown arXiv in for good measure):


            Solitary  Wild  Hunter  Preener
ECS         X         -     -       -
Soton       -         -     X       X
arXiv       X         X     -       x
Wisconsin   -         X     X       X

I will leave others to come up with animals with the appropriate characteristics. The lesson that I want to draw out is that there is variety in the repository kingdom. Even in the four repositories above we have all possible combinations of solitary/wild. The hunter vs preener columns look more correlated - perhaps you don't go to the effort of hunting for material and then not bother doing QA on it. And conversely, if you wait for material to drop into your lap you are less likely to care about its condition. (arXiv scores a 'half' on the preener scale because SPIRES provides a feed of "corrected" references for published material.)

So in conferences and community activities, messages that go down well with repository managers who have ticked the 'domesticated' box are going to irritate the 'wild' ones and vice versa. Those giving authoritative conference messages need to realise that they aren't speaking to a monoculture! That may be a lot to ask at the moment - they are still coming to terms with the existence and possibilities of repositories and the role of repository managers. A finely nuanced appreciation of the variation of the species is some way off yet.

So I hope Dorothea will forgive me for avoiding the "rat" identity - it is not out of any lack of solidarity with her difficult position, nor any lack of appreciation for the hard work that she does. I'm just a manager in a different set of circumstances.

And the real reason that I avoided matching animals to the different repository roles? The only domesticated, non-social, non-hunting, non-preening animal I could think of while on the plane was a rabbit. And I just can't abide the thought of being known as a "repository bunny".

Tuesday 18 November 2008

The Value that Repositories Add

One of the things I failed to do during the Evidence of Researcher Engagement meeting was to give a presentation that I had been working on for over a week. The discussion just ran away with me! So I have been persuaded to post it to the Web to try and get the message "out there".

A repository should be able to provide lots of benefits to its users. In particular, it should make things more valuable when they are deposits than when they are just files on a laptop or on a web server. This presentation is written to inform researchers of the kinds of things that they should be able to do with their material in repositories. It starts off with the basic functions that are provided FOR THEM (wide access, persistence, backups, bibliography pages, administrative reports etc) and then tackles the kinds of ways that researchers can take advantage of the material FOR THEMSELVES.

This is not a complete list - I would love to have lots more suggestions and examples - and in some ways it is a bit optimistic. No repository will do all the things that I have listed - but it shouldn't be too hard for any repository to provide some of these services.



The other thing that I failed to do was to attract many visitors to the EPrints table during the sponsored breakfast. I was convinced that my brilliant marketing idea of a platter of Apple and Raisin fritters would get people lining up to read my leaflets, but unfortunately the quality of the rest of the breakfast buffet was just so great that I couldn't compete. Oh well, onwards and upwards!

Evidence of Researcher Engagement - stories, narratives and anecdotes

On the evening before the SPARC Digital Repositories conference I hosted a meeting to discuss the evidence of researcher engagement from individual testimonials and anecdotes. That seems to have been a bit of a theme throughout the conference: Jennifer Campbell-Meier spoke about gathering stories about repositories as a means of advocacy in the new horizons panel and Bob Witeck spoke about the importance of stories in marketing open access to faculty and management. I hope that we're going to be able to start up a central place for collecting personal testimonials about repository benefits under SPARC's auspices. The idea is that a repository manager or subject librarian can have somewhere to go and look for success stories as told by faculty and researchers from particular disciplines. (Do you wake up in the morning, dreading an advocacy meeting with the Chemistry Department? Why not download a couple of repository testimonials from the chemistry page on repositoryluv.com!) More on this later!

Sunday 16 November 2008

Unlikely Heroes?

On the shuttle bus from Dulles to Baltimore yesterday there were a load of people heading for a large (30,000 delegate) neuroscience conference. They all introduce themselves and their research to each other, and then they turn to me. I hate that kind of situation - being confronted with hard scientists. You see, there's a researcher pecking order that has to be upheld and it roughly tallies with the Impact Factor of your discipline's major journals. So biomedicine is up there at the top, and computer science is, well, you see, we are a conference discipline. That doesn't even register.
So I tell them that I'm heading to a small workshop in Baltimore on Digital Libraries (SPARC DR2008). "Really?" is the polite response. "Yes," I venture "it's all about ways of providing open access to your research." Instant kudos. "Wow, that's brilliant. We so need that." And then come the stories of how they still have to fall back on their grad school library facilities when they are now independent researchers in other institutions with their own students.
And I am the hero on the bus! Still, it's been a long flight and there's two hours to go until I get to my hotel and so I fall asleep. When I wake up it all seems like a dream.

Tuesday 11 November 2008

Someone Stop Me!

I had a meeting with some representatives from other Schools last week - they wanted to deposit some Masters theses in a repository but they were hindered from doing so by the policies of the respective services. The long and short of it was that I volunteered to set up a demo repository to allow them to get their documents housed somewhere safe, but also because I know that we need somewhere to store four years of our school's masters and undergraduate dissertations. We'll use the demo to make a business case to the university to extend the "institutional repository umbrella" while we're getting some experience with the issues.

Anyway, I set up the repository over the weekend and deposited the first batch of 100 dissertations, and - this is my point - it just feels so GOOD to be in control. I don't know if other repository managers get that feeling too, but when I get to make all the metadata decisions and press all the import buttons and BANG I've got a new batch of stuff all sitting pretty, then I get a warm glow. Is this wrong? It's the same feeling as home baking, except that the cakes disappear by the morning whereas the dissertations are still there!

It's certainly more satisfying than setting exams, which is what I was supposed to be doing.

Friday 7 November 2008

Repositories Making Life Easier For Faculty?

Could it be that repositories will help make life easier for faculty? "Pull the other one" I hear the repository-weary skeptics cry. "We've heard it before!"

Well, if there's one chore that academics are bad at - aside from depositing items into repositories - it's keeping our web pages up to date. About 1/3 of our lecturers at ECS don't have working home pages - and neither did 20% of CS professors at MIT the last time I checked against the internal staff list. And those who do have working pages seem to keep them several years out of date. Certainly mine had its last major update three summers ago!

Now the school provides me with a set of official portal pages which are generated by its internal databases, but they are a bit, well, impersonal. If only there was a way to keep my personal pages updated as effortlessly, but in a way that didn't look too corporate and databasey. I'm caught between regularly-updated/dull and individual/bespoke/stale.

I think that the answer (or something like it) may be found at PageFlakes. It's a personalised content aggregator that is typically used for pulling together news feeds from a variety of sources (CNN, Yahoo, YouTube etc) but with luck, if you can find the right set of information feeds about YOU and YOUR SCHOOL then PageFlakes can do a very passable job at creating a home page about you.

The example page that is illustrated above (the actual URL is http://www.pageflakes.com/lescarr/25235060) is formed from an RSS feed from my school press releases, a feed from our student bloggers and two feeds from two repositories - the researchy repository which gives the latest set of papers/presentations that I have written and the materials that I have most recently made available for my teaching. It also has a short description and photo that I put in by hand. All in all that makes a good current description of my status - research outputs, teaching outputs, student activity and school activity.

Because the repository has created preview images of all the documents it holds, that makes the RSS feeds much more visual and interesting (and personal) than a simple table of contents. It feels like a home page, rather than an aggregation of syndicated content.

I know that all the cool dudes discovered PageFlakes 18 months ago, but I'm quite jazzed about it as a vehicle for personalised repository content. And I do think that as institutions get to grips with marketing themselves through the web, the repository can have a role as a content provider for building rich media Web content for widgets, mashups and all kinds of social network applications.

Without getting too carried away, the repository will start to make my online life easier by managing all my research and teaching material, so that I can use it to create a bespoke web presence - my home page.

Wednesday 5 November 2008

More Things to Do With a Repository Feed

Yesterday I blogged about using an RSS feed to create an interesting visualisation of a collection of items from a repository.

The image on the right is taken from another demo page that was put together using the Widgetbox service. You simply have to paste in the URL of the RSS feed that you want to use, make a couple of selections about the colour and size of the widget and then it provides you with a bunch of HTML that you can copy and paste into your web page. It will even add the widget directly to your blog or Facebook page automatically.

Other services like Yahoo Pipes are good at combining, filtering and generally tweaking RSS feeds, so you could even create a federated widget.
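
To give a flavour of what "combining and filtering" involves, here is a minimal sketch in Python of merging several feeds into one newest-first list, with an optional keyword filter on the title. It is not how Yahoo Pipes works internally; it simply assumes RSS 2.0 feeds whose items carry an RFC 822 pubDate, and the function names and the two sample feeds are invented for the example.

```python
import xml.etree.ElementTree as ET
from datetime import timezone
from email.utils import parsedate_to_datetime


def parse_items(rss_text):
    """Return (date, title, link) tuples for the dated items in one feed."""
    items = []
    for item in ET.fromstring(rss_text).iter("item"):
        raw_date = item.findtext("pubDate")
        if not raw_date:
            continue  # undated items cannot be placed in the merged order
        when = parsedate_to_datetime(raw_date)
        if when.tzinfo is None:
            # Assumption: treat dates without a usable zone as UTC.
            when = when.replace(tzinfo=timezone.utc)
        items.append((when, item.findtext("title") or "", item.findtext("link") or ""))
    return items


def federate(feeds, keyword=None, limit=10):
    """Merge several feeds, optionally filter by title keyword, newest first."""
    merged = [entry for feed in feeds for entry in parse_items(feed)]
    if keyword:
        merged = [e for e in merged if keyword.lower() in e[1].lower()]
    merged.sort(key=lambda e: e[0], reverse=True)
    return merged[:limit]


# Hypothetical feeds standing in for a repository feed and a press release feed.
REPOSITORY_FEED = """<rss version="2.0"><channel><title>Repository</title>
<item><title>Thesis on RSS</title><link>http://example.org/eprint/1</link>
<pubDate>Tue, 04 Nov 2008 09:00:00 +0000</pubDate></item>
</channel></rss>"""

PRESS_FEED = """<rss version="2.0"><channel><title>Press</title>
<item><title>Press release</title><link>http://example.org/news/7</link>
<pubDate>Wed, 05 Nov 2008 12:00:00 +0000</pubDate></item>
</channel></rss>"""

if __name__ == "__main__":
    for when, title, link in federate([REPOSITORY_FEED, PRESS_FEED]):
        print(when.date(), title, link)
```

The merged list could then be written back out as a single feed and pasted into a widget service in place of any one repository's feed.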

Tuesday 4 November 2008

Visualising Repository Contents

Those who have followed this blog will know that I'm a sucker for a good visualisation that provides a helpful way of displaying and accessing the contents of a collection or a whole repository.


So I read with interest about cooliris, a convincing and polished implementation of the displaywall metaphor that works on media resources described in RSS feeds. Using XSLT I turned the XML export of an EPrints search result into the required MediaRSS format (making use of the eprint item thumbnails) and embedded it into a web page as a demo. The results are best viewed in their installable full-screen viewer rather than the web page-embedded Flash program, especially if the feed extends to thousands of objects!
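
For anyone who would rather not write XSLT, the same kind of feed can be produced in a few lines of Python. This is a sketch only: it starts from a hypothetical list of records (title, link, thumbnail URL, full image URL) rather than from the real EPrints XML export, whose schema is not reproduced here, and it emits only the media:thumbnail and media:content elements of Media RSS. Whether that is everything a particular viewer needs is an assumption.

```python
import xml.etree.ElementTree as ET

MEDIA_URI = "http://search.yahoo.com/mrss/"  # Media RSS namespace
ET.register_namespace("media", MEDIA_URI)    # serialise with the media: prefix


def build_media_rss(feed_title, feed_link, records):
    """Return an RSS 2.0 document with Media RSS elements, as a string."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = feed_title
    for record in records:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = record["title"]
        ET.SubElement(item, "link").text = record["link"]
        # Small preview image shown on the wall.
        ET.SubElement(item, "{%s}thumbnail" % MEDIA_URI, {"url": record["thumbnail"]})
        # Larger version shown when the item is selected.
        ET.SubElement(item, "{%s}content" % MEDIA_URI, {"url": record["image"]})
    return ET.tostring(rss, encoding="unicode")


# Hypothetical records; in practice these would come from a search export.
RECORDS = [
    {"title": "Poster A", "link": "http://example.org/1",
     "thumbnail": "http://example.org/1/small.png",
     "image": "http://example.org/1/large.png"},
    {"title": "Poster B", "link": "http://example.org/2",
     "thumbnail": "http://example.org/2/small.png",
     "image": "http://example.org/2/large.png"},
]

if __name__ == "__main__":
    print(build_media_rss("Demo collection", "http://example.org/", RECORDS))
```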

This technique is obviously best for visually attractive items, rather than a wall full of text-based journal articles, and would probably form an accompaniment to a collection listing, rather than replacing it.