Friday 13 February 2009

Microsoft Office at #dev8D

I joined my colleagues Chris Gutteridge and Dave Tarrant (aka BitTarrant) at the JISC Developer Happiness (#dev8D) event in London this week. At least, I came to the tail end of the event after I had dispatched some JISC bids! It was a great time, with lots of food for thought. During the closing Repositories session the discussion touched on the role of the repository in mediating between desktop documents and the world of the web (1.0, 2.0 and the cloud). In particular, one of the Fedora developers suggested that the repository could expose new "endpoints" (i.e. points of access) for the kinds of complex documents that are normally encountered as a take-it-or-leave-it package: documents like Microsoft Word files, which are now stored as explicit bundles of text, media, metadata and relationships.

This fits into so many of my soap boxes - providing more value for end users, supporting desktop activities, taking advantage of the new Office openness - so I got really excited about the possibilities. At the end of the session I sat down with Chris and we (i.e. he) started implementing an EPrints service to do just that. If the wireless network hadn't gone down, he would have finished before the conference dinner. However, he did finish and refine it the next day during the repository briefing sessions.

The image on the left shows what happens when you upload a Word 2007 document. Firstly, the Dublin Core metadata (author, title etc) of the document is applied to the eprint record itself (aka automatic metadata extraction). This has obvious advantages because it means that if you want to create a sensible, standalone record then you might be able to get away with just uploading the document and not filling out any extra metadata. If you can then look up the author and title in Web of Science you really might not need to fill out any extra metadata at all. That would be nice!
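To give a feel for why this kind of extraction is cheap, here is a minimal Python sketch. It is not the EPrints service itself (that code is not shown here); it simply assumes that an Office 2007 file is a zip package whose Dublin Core properties live in the conventional `docProps/core.xml` part. The function name `read_core_metadata` and the choice of fields are illustrative assumptions.

```python
import zipfile
import xml.etree.ElementTree as ET

# Dublin Core element namespace used inside the core-properties part.
NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def read_core_metadata(source):
    """Return Dublin Core fields from an Office 2007 file.

    `source` may be a file path or a binary file-like object.
    Assumes the properties sit at the conventional part name
    docProps/core.xml; returns an empty dict if that part is absent.
    """
    with zipfile.ZipFile(source) as package:
        try:
            xml_bytes = package.read("docProps/core.xml")
        except KeyError:
            return {}

    root = ET.fromstring(xml_bytes)
    fields = {}
    for name in ("title", "creator", "subject", "description"):
        element = root.find(f"dc:{name}", NS)
        # Skip elements that are missing or empty.
        if element is not None and element.text:
            fields[name] = element.text.strip()
    return fields
```

Calling `read_core_metadata("paper.docx")` on a hypothetical file would return a dictionary such as one with `title` and `creator` keys, which a repository could then copy into the record.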

Secondly, each of the images is extracted as a separate document in its own right. That means they get their own metadata and URLs and you could download and reuse individual figures without downloading the whole document. (In the image I have shown the figure captions as part of the metadata, but I cheated by cutting and pasting them from the original.)
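As a rough illustration of the image "bursting" step (again, a sketch under assumptions rather than the actual EPrints code), the media files can be pulled straight out of the zip package. The sketch assumes the conventional `word/media`, `ppt/media` and `xl/media` folders; strictly speaking the package's relationship files define where media lives. `list_embedded_media` is a made-up name.

```python
import posixpath
import zipfile

# Conventional media folders for Word, PowerPoint and Excel packages.
MEDIA_FOLDERS = ("word/media", "ppt/media", "xl/media")


def list_embedded_media(source):
    """Return {part_name: bytes} for each media part in an Office 2007 package.

    `source` may be a file path or a binary file-like object.
    """
    media = {}
    with zipfile.ZipFile(source) as package:
        for name in package.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            if posixpath.dirname(name) in MEDIA_FOLDERS:
                media[name] = package.read(name)
    return media
```

Each entry could then be stored as its own document, with its own metadata and URL.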

Another example is a record for a PowerPoint document shown here. By bursting out all the images used in the slideshow, the repository has automatically created a catalogue of media resources which could be used in a copyright audit to check that it is safe to make this teaching resource Open Access.

Each media resource is a separate entity (and it's not just limited to pictures and videos; it could be embedded spreadsheets and other complex documents) that is linked internally to a specific slide entity, so it would be easy to make a rather more sophisticated table of slides and resources.
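A table like that could be built from the relationship files inside the package. The sketch below is only an illustration, assuming the usual PowerPoint 2007 layout in which `ppt/slides/_rels/slideN.xml.rels` lists the parts each slide uses; the function name `media_by_slide` is hypothetical and this is not how the EPrints service is known to do it.

```python
import posixpath
import re
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the package relationship files.
REL_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"
# Conventional location of the per-slide relationship parts.
SLIDE_RELS = re.compile(r"^ppt/slides/_rels/(slide\d+)\.xml\.rels$")


def media_by_slide(source):
    """Return {slide_name: [media part names]} for a PowerPoint 2007 package."""
    table = {}
    with zipfile.ZipFile(source) as package:
        for name in package.namelist():
            match = SLIDE_RELS.match(name)
            if not match:
                continue
            root = ET.fromstring(package.read(name))
            targets = []
            for rel in root.iter(REL_NS + "Relationship"):
                if rel.get("TargetMode") == "External":
                    continue  # linked, not embedded
                # Targets are relative to the slide's own folder.
                joined = posixpath.join("ppt/slides", rel.get("Target", ""))
                resolved = posixpath.normpath(joined).lstrip("/")
                if resolved.startswith("ppt/media/"):
                    targets.append(resolved)
            table[match.group(1)] = sorted(targets)
    return table
```

The result maps names such as `slide1` to the media parts used on that slide, which is enough to render a slide-by-slide list of resources.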

And once you have all the slides listed for an individual slideshow, then the repository can make a page that shows all of the slides from all of the individual PowerPoints. Or just the ones from a particular project. Or just the ones from a particular research group.

So I think that there's a lot of mileage in this approach, especially when you combine it with SWORD and allow the Office application to automatically save the Office document into the repository in the first place.


  1. You missed a feature!

    I also made a new preview-generator plugin which unpacks the Word or PowerPoint file and, if it contains a preview image, uses that as the document preview image.

  2. Hope Chris is submitting this to the dev8D prototype competition!

  3. @dfflanders @cjg - well, not only submit it for dev8d, but stick it into EPrints asap - would love to extend this for ODF and friends.

  4. Just out of curiosity, were you submitting the files via the SWORD interface? I'm only interested because I thought it would be a nice demonstration of the sorts of interactions SWORD aka AtomPub can enable in a talk I'm giving next week.