Sunday, 8 June 2008

Even Simple Services can be Annoyingly Complicated

I know I've mentioned this before (on this blog and elsewhere) but I think that displays are important. Whether it's bibliographic displays of papers for CVs and online bibliographies, or very visual galleries, slideshows and montages of papers, presentations, posters, images and videos, a core part of the academic life is "showing off" the things you have done, and telling stories about them. Hopefully, your repository can help you with these displays.

In a previous blog (Cobbling it Together) I described a quick and dirty way of making a slideshow from a set of documents in a repository by using Acrobat to do all the heavy lifting. Several months later I created a slideshow a different way with the EPrints ZIP exporter and the Mac's iPhoto slideshow software. Both of these methods required quite a lot of manual labour to provide an end-to-end service. Now mediated deposit is one thing, but mediated use is quite another, and so I have been trying to find an easier way to produce a good looking slideshow with EPrints.

We have done various experiments with the display side, and that turns out to be quite easy. Whether it is with Flash, or a 3D renderer, or an external graphics environment, you can build interesting displays, assuming you have the right data files. The problem is getting the right data files! Quite often the items you most wish to show off are the most visual ones - posters or presentations, rather than papers. This means rich media, which almost certainly means commercial desktop presentation programs like Microsoft PowerPoint or Apple Keynote. Now, repositories make every effort to create previews and thumbnails of ingested documents, but this is mainly limited to images, videos and PDFs. Office documents have to be downloaded and opened in their native application before their contents can be viewed.

When I was managing the OR08 repository, I encouraged authors to send in the source of their presentations and the poster artwork. Many sent me PDFs only, some sent me PDFs with the original PowerPoint, and some sent me the PowerPoint files only. For that last group I made sure that I manually converted the document into PDF (using PowerPoint 2008 on the Mac). The repository automatically created preview files for the first page of the PDF files, so each presentation ended up with a preview image of sorts. (Previews are actually created in three sizes by default by the open source Ghostscript interpreter.)

The problems that I have are that (a) previews aren't made of PowerPoint documents unless a derived PDF version is supplied, (b) the previews are relatively low-resolution and (c) the creation process is not reliable. Of the 144 PDF documents in the OR08 repository, about 10% have no preview because the conversion process failed. Of the remainder, a further 10% have an inaccurate preview (missing fonts, incorrect geometry, badly positioned graphic components). To produce a full-screen slideshow of posters, what is really needed is high fidelity, high quality (200dpi) images of the documents. Even when the preview-generation software is changed to increase the resolution, this does not fix the roughly 20% of previews that do not pass the expected quality threshold. And, of course, it still requires the human depositor to submit a PDF document to make the Office original 'previewable'.
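The "roughly 20%" figure follows from the two percentages above. A minimal sketch of the arithmetic, using only the numbers quoted in this post (the function name is made up for illustration):

```python
def preview_outcomes(total, failure_rate, inaccuracy_rate):
    """Split a batch of PDFs into failed, inaccurate and usable previews.

    The inaccuracy rate applies only to the previews that were actually
    produced, i.e. to the remainder after outright failures.
    """
    failed = total * failure_rate
    inaccurate = (total - failed) * inaccuracy_rate
    usable = total - failed - inaccurate
    return failed, inaccurate, usable


# Figures from the OR08 repository: 144 PDFs, about 10% failed,
# and a further 10% of the remainder rendered inaccurately.
failed, inaccurate, usable = preview_outcomes(144, 0.10, 0.10)
unusable_share = (failed + inaccurate) / 144

print(f"failed: about {round(failed)}")
print(f"inaccurate: about {round(inaccurate)}")
print(f"usable: about {round(usable)}")
print(f"unusable share: {unusable_share:.0%}")
```

Because the second 10% is taken from the remainder rather than from the whole set, the combined share comes out a little under a fifth.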

The best results I have obtained for generating previews have been from using bona-fide PowerPoint to create PDFs and Acrobat (or the Mac's "Preview" program) to create images from those PDFs. Since Office is a personal desktop piece of software, it can't really be used in the context of a server, and the Microsoft documentation advises against it since various functions try to communicate with the user's desktop environment. This seems to preclude having a preview generation service tucked away in the repository software.

I've been experimenting with Chris G to find a manageable way to handle the production of high quality viewable PDF and preview images for Microsoft Office documents. It's not finished yet, but we're getting there. Parts of the process are in place, but they're all in pieces on the kitchen floor (metaphorically speaking). What I'm trying to do is document our experiences here, and in particular, why the whole charade is looking very different to the neat little repository service I imagined at the start.

Basically, whatever we try involves running Office on someone's desktop, and that means transferring files from the repository and back again. We looked at file sharing technologies (e.g. Samba or WebDAV) and file transfer protocols (e.g. FTP or email), but we had problems because our server is behind a firewall and none of our user desktop machines are allowed there. Our initial expectation was that the repository would initiate and control the production of previews using some kind of push technology (e.g. messages or drop folders). In the end we settled on the new EPrints REST API and on allowing the user/client/service to control the choice of items which require previews.
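To make the "client in control" arrangement concrete, here is a rough sketch of one client-initiated pass. It is not the EPrints REST API: the four callables (list_pending, download, convert, upload) are hypothetical stand-ins for whatever requests and desktop tools are actually used, and the repository is faked with two dictionaries. The point it illustrates is that every step starts on the desktop side, so nothing has to reach inwards through the firewall.

```python
from typing import Callable, Dict, Iterable, List


def run_pull_cycle(
    list_pending: Callable[[], Iterable[str]],
    download: Callable[[str], bytes],
    convert: Callable[[bytes], bytes],
    upload: Callable[[str, bytes], None],
) -> List[str]:
    """Run one client-initiated pass and return the ids that were processed.

    All four callables are hypothetical placeholders, not real EPrints calls.
    """
    processed = []
    for item_id in list_pending():
        source = download(item_id)
        try:
            derived = convert(source)
        except Exception:
            # Conversion failed (say the desktop tool fell over); leave the
            # item pending so that a later pass can try again.
            continue
        upload(item_id, derived)
        processed.append(item_id)
    return processed


# In-memory stand-in for the repository, for illustration only.
sources: Dict[str, bytes] = {"doc-1": b"slides", "doc-2": b"", "doc-3": b"poster"}
previews: Dict[str, bytes] = {}


def fake_convert(data: bytes) -> bytes:
    """Pretend conversion that fails on empty input."""
    if not data:
        raise ValueError("empty document")
    return b"PDF:" + data


done = run_pull_cycle(
    list_pending=lambda: [i for i in sources if i not in previews],
    download=lambda item_id: sources[item_id],
    convert=fake_convert,
    upload=lambda item_id, data: previews.__setitem__(item_id, data),
)
print(done)
```

In this toy run the empty document is skipped and stays pending, which is roughly the behaviour wanted when a desktop conversion needs a human to come back and click something.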

I've ended up using Automator on a Mac to control the use of Word, PowerPoint and Preview to process Office and PDF documents. Later I will investigate using .NET to control Office and Acrobat on a Windows desktop. At the moment it is happening on a physically separate computer, although we might look at virtualising the process and running it on the same machine as the repository.

This process is Automate-d but not completely automatic, because user tools are involved. Every so often when a Word document opens up, Word puts up a dialog box to ask me whether I want to trust its macros. Or when it tries to save to PDF I get a message warning me that a footer is outside the printable region. Also, Automator is wont to crash after processing about 50 documents. This means that it took about 4 days elapsed time to convert 60 PowerPoint and 100 Word documents into PDFs and thence to 200dpi PNG files. If I had been constantly in the office to give it the total 10 minutes of attention it needed, the whole process would have taken about 2 hours. Still, a "slightly hands on" process is better than "no process at all" because I need those files. And I hope that I'm only going to have to process this backlog once anyway!

Now that I've got the new files on my desktop computer, I can put them back using the REST API, but how does the repository know that they are preview files? The automated built in previews are stored in a separate place in the repository; third party services can access them but there is no public API to create them or update them. Also, only preview images are handled, not PDFs. So the files that I have created aren't stored as 'repository backstage previews' but as independent documents within the original eprint that have a specific relationship (mechanicallyDerivedVersion) to the original Office document.

This means that any eprint which contains an Office document may now acquire a number of service-generated additional documents. At the moment, EPrints doesn't use them in its internal processes (for providing thumbnails on abstract pages), but export plugins can treat them how they like. The only thing that EPrints understands is that if a document changes (is updated, or deleted) then all of the documents that were mechanicallyDerived from it must be deleted. The assumption is that the 'preview' has been rendered obsolete or out-of-date by the changes/deletions and it is the responsibility of whatever third-party service that created the documents to recreate them.
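The deletion rule can be sketched as a small data structure. This is only an illustration of the idea, not EPrints code: the Document and Eprint classes, the derived_from field and the method names are all invented here, and I am assuming that derivation can chain (an image made from a PDF that was itself made from a PowerPoint file), so the clean-up follows the chain.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Document:
    doc_id: str
    filename: str
    # Id of the document this one was mechanically derived from, if any.
    derived_from: Optional[str] = None


@dataclass
class Eprint:
    documents: Dict[str, Document] = field(default_factory=dict)

    def add(self, doc: Document) -> None:
        self.documents[doc.doc_id] = doc

    def derived_ids(self, source_id: str) -> Set[str]:
        """Collect everything derived from source_id, directly or via a chain."""
        found: Set[str] = set()
        frontier = [source_id]
        while frontier:
            current = frontier.pop()
            for doc in self.documents.values():
                if doc.derived_from == current and doc.doc_id not in found:
                    found.add(doc.doc_id)
                    frontier.append(doc.doc_id)
        return found

    def source_changed(self, source_id: str) -> Set[str]:
        """Delete every document derived from a changed or deleted source."""
        stale = self.derived_ids(source_id)
        for doc_id in stale:
            del self.documents[doc_id]
        return stale


eprint = Eprint()
eprint.add(Document("d1", "talk.ppt"))
eprint.add(Document("d2", "talk.pdf", derived_from="d1"))
eprint.add(Document("d3", "talk.png", derived_from="d2"))
eprint.add(Document("d4", "paper.pdf"))

removed = eprint.source_changed("d1")
print(sorted(removed), sorted(eprint.documents))
```

The source document itself is left alone; only the derived copies go, and it is then up to the external service to notice the gap and regenerate them.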

My 'slideshow' exporter can now look for all the PowerPoint documents it wants, and then use any image file that is mechanicallyDerived from them. Job done! The "preview" semantics of the derived files are only understood by the third party plugins that create and use them, but their temporary and dependent relationship to the other documents is understood by the repository core. As we refine this process we will doubtless add something like a derivationRelationship to the mix, so that we can tell our thumbnails from our MD5 signatures.
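The selection step in the exporter amounts to a two-stage filter. A minimal sketch, assuming each document is described by a plain dictionary with hypothetical id, filename and derived_from keys (the real exporter is an EPrints plugin and will not look like this):

```python
import os
from typing import Dict, List

POWERPOINT_EXTENSIONS = {".ppt", ".pptx"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def extension(filename: str) -> str:
    """Return the lower-cased file extension, including the dot."""
    return os.path.splitext(filename)[1].lower()


def slideshow_images(documents: List[Dict]) -> List[str]:
    """Return image filenames that were derived from PowerPoint documents."""
    powerpoint_ids = {
        d["id"] for d in documents
        if extension(d["filename"]) in POWERPOINT_EXTENSIONS
    }
    return [
        d["filename"] for d in documents
        if d.get("derived_from") in powerpoint_ids
        and extension(d["filename"]) in IMAGE_EXTENSIONS
    ]


documents = [
    {"id": 1, "filename": "talk.ppt"},
    {"id": 2, "filename": "talk.pdf", "derived_from": 1},
    {"id": 3, "filename": "talk.png", "derived_from": 1},
    {"id": 4, "filename": "paper.doc"},
    {"id": 5, "filename": "paper.png", "derived_from": 4},
    {"id": 6, "filename": "photo.jpg"},
]

images = slideshow_images(documents)
print(images)
```

Images derived from Word documents, and images that were deposited in their own right, are ignored because they do not point back at a PowerPoint file.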

The main shift in expectations remains that "preview generation" (something that was an automatic, internal service) is being supplemented by an external, partially manual (handheld) service. Sometimes, it seems, the services that a repository takes credit for are actually provided by a human cranking the handle on other pieces of software! I've just had exactly this discussion with Bill Hubbard, and it's making the boundary of my repository architecture diagram look very fuzzy!

1 comment:

  1. Is there any public documentation of "the new EPrints REST API" that you mention? I couldn't find any mention of it at