Sunday, 30 November 2008

More Repository Value for users - making PowerPoint files from RSS feeds

Executive Overview: I tried and failed to find some existing user desktop applications that could work with repository collections; in the end I had to write some simple programs to demonstrate the kind of thing that can be achieved with a repository, some content and Microsoft Office. The screendump on the left is an example. Scroll down for some more examples from EPrints and DSpace.

Since my previous posting about Repository Value, I have been very aware that all my examples were focused on the web, and that the services and widgets that I described were for web page authors to use on web pages. In fact, many of them might be better used by web admins and web developers than by academic users and researchers themselves.

Although we live in a Web-dominated world, most of my colleagues do not use the Web to originate material. Work is found on the web, but created on the desktop (or laptop), and consequently the tools and services that people are mainly familiar with are those of the desktop, rather than the webtop. Cloud computing may be the way of the future, but for now services like Google Docs gain a lot of attention and few users, according to a recent report by ClickStream Technologies.

So I particularly wanted to find some good old-fashioned desktop applications that could do useful work with material in repositories. Something that could take a collection of material from a repository and synthesise a useful overview document that describes it, or display it in a slideshow. And most importantly, it had to be something that could be used by a highly educated but non-technical researcher. Or even an extremely clever but technically incompetent professor. Not a software developer!

I looked at different ways of using Microsoft Office, but it doesn't schlep material from the Web very well. Word can read in individual web pages, but collections are represented in a group of pages which are difficult to handle without programming. I had some previous experience with forcing Adobe Acrobat to create a slideshow from a repository, but that was too fiddly. I had some success playing around with the Firefox "Scrapbook" extension to grab all the linked pages from a collection as a single (humongous) web page, but I couldn't find a way of doing anything productive with that page.

The fundamental problem is that repository web pages (as displayed by DSpace, EPrints, Fedora, Digital Commons or anyone else) are too complex to interpret without creating a bespoke application that is tailored to datascraping each kind of repository (and each version of each kind of repository). So to have any chance of broad-spectrum success, I needed to find a simpler format to work with, one that is supported by all repositories.

Luckily, RSS is just such a format. Repositories use it to describe lists of recent deposits, or lists of items in their collections. Previously I was investigating it for Web-based services, but now I want to use it for a desktop-based service. Which, to be honest, pretty much means Microsoft Office. Other productivity suites exist, but if you want to target as many repository users as possible then Microsoft Office is a pragmatic first step.
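
To give a flavour of how little there is to interpret in a feed, here is a minimal Python sketch that pulls the title, description and link out of each item. It assumes a plain RSS 2.0 feed with no namespaces (a repository that serves RSS 1.0 or Atom would need different element names), and it assumes that any image arrives as an `<enclosure>` element. The sample feed, the example.org URLs and the function name `read_items` are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 feed standing in for a repository's "latest deposits" list.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example repository: latest deposits</title>
    <item>
      <title>First example paper</title>
      <link>http://repository.example.org/1/</link>
      <description>Abstract of the first paper.</description>
      <enclosure url="http://repository.example.org/1/preview.png"
                 type="image/png" length="0"/>
    </item>
    <item>
      <title>Second example paper</title>
      <link>http://repository.example.org/2/</link>
      <description>Abstract of the second paper.</description>
    </item>
  </channel>
</rss>"""


def read_items(rss_text):
    """Return one dict per <item>, with title, description, link and image."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        # Assumption: an image, if present, is an enclosure with an image/* type.
        enclosure = item.find("enclosure")
        image = None
        if enclosure is not None and enclosure.get("type", "").startswith("image/"):
            image = enclosure.get("url")
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "image": image,
        })
    return items


if __name__ == "__main__":
    for entry in read_items(SAMPLE_FEED):
        print(entry["title"], "|", entry["image"])
```

Each dict holds roughly what one slide needs: a title, some body text and, when the feed offers one, a picture.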

So, does Office have any built-in facility for reading RSS files? Unfortunately it doesn't. I seem to be out of luck with finding an off-the-shelf application that everyone has access to. I'm just going to have to make something myself.

But despite not having an RSS capability, Office has something almost as good - an open data format. I might not be able to get PowerPoint to make a slideshow directly from an RSS feed, but I can knock up an application to make the slideshow for me. And I don't even have to be a Microsoft developer to do it - because it's an open format I can just use whatever tools I am most familiar with. 

A Microsoft Office 2007 file (or Office 2008 if you are on a Mac) is actually a Zip file containing a collection of XML files. Like a "web page", some of those files are content files (each slide in a PowerPoint slideshow is a separate XML file), some are stylesheets (each slide layout is a separate XML file) and some are media files (images and videos). I wrote two shell scripts to create the appropriate XML and media files to (a) create a new slideshow with just a title slide and (b) add a new slide onto the end of a slideshow. I then wrote some XSLT to turn an RSS feed into a shell script that successively calls those two commands. Each slide in the slideshow has a title, description and image taken from the corresponding item in the RSS feed. The slideshows are created with no styling, but opening them in PowerPoint allows a theme to be chosen for them with a single click.
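
The "Zip file full of XML" structure is easy to see with a few lines of Python. The sketch below is not the shell scripts or XSLT described above; it only sorts the parts of an archive into slides, layouts and media, using the folder names that PowerPoint conventionally writes (`ppt/slides/`, `ppt/slideLayouts/`, `ppt/media/`). So that it runs without a real file, it builds a stand-in archive in memory with placeholder contents; that archive has the right part names but would not open in PowerPoint. The function names `classify_parts` and `make_demo_archive` are invented for this example.

```python
import io
import zipfile


def classify_parts(pptx_file):
    """Group the part names in a .pptx (path or file object) by their role."""
    groups = {"slides": [], "layouts": [], "media": [], "other": []}
    with zipfile.ZipFile(pptx_file) as archive:
        for name in archive.namelist():
            if name.startswith("ppt/slides/") and name.endswith(".xml"):
                groups["slides"].append(name)      # one XML file per slide
            elif name.startswith("ppt/slideLayouts/") and name.endswith(".xml"):
                groups["layouts"].append(name)     # one XML file per layout
            elif name.startswith("ppt/media/"):
                groups["media"].append(name)       # images and videos
            else:
                groups["other"].append(name)       # relationships, themes, etc.
    return groups


def make_demo_archive():
    """Build an in-memory Zip with PowerPoint-style part names.

    The contents are placeholders, so this is NOT a valid slideshow.
    """
    names = [
        "[Content_Types].xml",
        "ppt/presentation.xml",
        "ppt/slides/slide1.xml",
        "ppt/slides/_rels/slide1.xml.rels",
        "ppt/slideLayouts/slideLayout1.xml",
        "ppt/media/image1.png",
    ]
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in names:
            archive.writestr(name, "<placeholder/>")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for role, parts in classify_parts(make_demo_archive()).items():
        print(role, parts)
```

Passing the path of a real .pptx file to `classify_parts` instead of the demo archive should list its slides, layouts and media in the same way.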

The following examples show some of the slideshows I have made from repository material.

These are my latest papers and posters from the ECS EPrints repository at Southampton University. Note that this example (like the ones below, but unlike the example at the head of this post) has had a "Theme" applied by PowerPoint.

These are recent items deposited in the JISC KULTUR project repository.

This is from a photo collection at Demetrius, the DSpace repository at the Australian National University.

These items aren't from a repository at all - they are from the ECS Press Release feed. After all, there's more to the web than repositories, and there's also more to lecturers' sources!

This is some partial success - I have had to write some programs rather than use readily-available applications. On the other hand, these rough-and-ready demos could be turned into a web service quite easily - go to a web form, paste in the RSS feed URL and download a PPTX file. (Of course, I'd have to debug them first. And re-implement them in a rather more robust environment.)

But there's a lot more that could be done. For an example of the sort of well-thought-out service that operates in this space, have a look at the New England Journal of Medicine's Image Library service. You provide some query terms, and it returns all the figures from recent NEJM articles that match your query. You then choose the relevant ones, press a button and it turns them into a PowerPoint slideshow for you, with all the necessary citation information (and licenses) attached. Imagine if OAIster could provide that for you, based on all the articles on any subject in all the repositories in the world...

Here are a few other ideas off the top of my head.
  • Most PowerPoint presentations that are created are variants on a previous presentation. Gradually a presenter's message evolves over time, and their presentations reflect that. A repository not only allows the author to keep track of his/her previous presentations, it could track all the variants of each slide and allow the author to recombine them in new ways. It's very common for me to search through a dozen PowerPoint files on my laptop, to find a particular version of a particular slide that explained a point with just the right emphasis to a particular audience. Let the repository find all the individual slides that mention a particular phrase and allow me to choose le slide juste.

  • Or let the repository ingest process extract all the images in a presentation and determine the source and verify the rights clearance.

  • Or the next time I have to do a presentation to potential students, or potential funders, let the repository copy my basic set of core slides, and update it with new slides from the latest research repository feeds and the latest press release feeds.
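
The first idea above, finding the slides that mention a particular phrase, can be roughed out in Python as follows. This is only a sketch of the search step, not a repository feature. It relies on the fact that slide text is stored in `<a:t>` elements in the DrawingML namespace; everything else (the function names, the demo deck and its two phrases) is invented for illustration, and the demo slides are stripped down to the bare minimum rather than being complete slide files.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Text runs in a slide are <a:t> elements in the DrawingML namespace.
DRAWINGML_TEXT = "{http://schemas.openxmlformats.org/drawingml/2006/main}t"


def slide_text(xml_bytes):
    """Join all the text runs found in one slide's XML."""
    root = ET.fromstring(xml_bytes)
    return " ".join(node.text or "" for node in root.iter(DRAWINGML_TEXT))


def find_slides_mentioning(pptx_file, phrase):
    """Return the slide part names whose text contains the phrase (any case)."""
    wanted = phrase.lower()
    matches = []
    with zipfile.ZipFile(pptx_file) as archive:
        for name in sorted(archive.namelist()):
            if name.startswith("ppt/slides/") and name.endswith(".xml"):
                if wanted in slide_text(archive.read(name)).lower():
                    matches.append(name)
    return matches


def make_demo_deck():
    """Build an in-memory stand-in for a .pptx with two minimal slides."""
    template = (
        '<p:sld xmlns:p="http://schemas.openxmlformats.org/presentationml/2006/main" '
        'xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">'
        "<a:t>{}</a:t></p:sld>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("ppt/slides/slide1.xml", template.format("Repository value"))
        archive.writestr("ppt/slides/slide2.xml", template.format("Why Open Access matters"))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(find_slides_mentioning(make_demo_deck(), "open access"))
```

One limitation: PowerPoint can split a sentence across several text runs, and this sketch joins runs with a space, so a phrase that straddles a run boundary in the middle of a word would be missed.
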
Watch this space!


  1. Stupid I-haven't-grokked-the-spec-yet question:

    Will OAI-ORE make stuff like this easier to hack up?

  2. ORE addresses the problem of knowing what web resources are a part of an item or collection. A PowerPoint slideshow is an aggregation of XML files (slides, stylesheets etc) and so in unzipped mode, yes, ORE would help keep track of these.

  3. Les, that's so clever, can you let us know how to make one of these up for our news releases - I was actually looking for something like this earlier today!! Joyce

  4. Les, are your slides showing the front pages of the content from the repositories? If so, those of us using boring cover sheets would miss out on such a function.... and we'd better think it through!