RepositoryMan: June 2008

Thursday, 26 June 2008

Inspirational Teachers

I listened to John Willinsky give an inspirational keynote at ELPub 2008 this morning. He banged the drum for Open Access and announced an OA mandate for the Stanford School of Education. According to the story, he was describing the Harvard mandate to his colleagues in a meeting and they instantly voted to adopt a similar mandate themselves. Way to go!

However, the message that I shall take home was his discussion of the connection between "public" forms of knowledge and "highly authoritative" forms of knowledge. He gave the specific example of the links made between Wikipedia and the Stanford New Encyclopedia of Philosophy, i.e. opportunities where a general and democratic information resource links back to a resource which is written and governed by domain experts. A really very good thing, according to Willinsky, who believes that the sustainability of the entire research infrastructure is based on its perception as a Public Good, one that is open and encourages the participation and engagement of its sustaining community.

In other words, the fact that many non-researchers seem to be downloading papers from our repositories shouldn't be seen as a suspicious thing. "Things on the Web are just downloaded by teenagers and pornographers" according to some colleagues who are less than Web-friendly! "If a download isn't attributable to someone in a University then it shouldn't count - it's obviously a mistake or being read by someone who can't possibly understand it." That's the attitude.

But perhaps not. According to Willinsky, our (Higher Education's) ongoing existence as a part of society depends on us acknowledging that less esoteric forms of debate and knowledge do exist (public forums and websites) and that we should expect and encourage the public to refer to our work, and link to our work and even read our work.

And I think that if repositories have a role in making collections of research material accessible, then perhaps we should be thinking about how to make them a bit more accessible to the public, in helping us become inspirational teachers with half an eye to the rest of society.

Wednesday, 25 June 2008

Repositories Should be More Like Email (apparently)

See below for a summary of an interesting JCDL 2008 paper that adds to the "repositories - they're all wrong" debate. Cathy is well-known (and, I think, well-loved) from the hypertext community for her ethnographic studies of information handling, and here she reports on a small scale study of the information management practices of research authors as they go about the task of writing papers, and the implications for repositories. The paper is noteworthy because it highlights the role of email as a personal archiving solution and argues that any repository platform will need to do better than email in a range of criteria to gain user acceptance.

Well, it's a new target for repository developers, and perhaps a new marketing slogan to look forward to (EPrints: Sucks Less Than Hotmail).

From my experience, the paper rings true in its description of ad-hoc and distributed author processes, but it is focused on a small group of Computer Scientists all of whom use LaTeX and BibTeX, so I don't know exactly how applicable its message is across the whole institution.

Marshall, C. C. 2008. From writing and analysis to the repository: taking the scholars' perspective on scholarly archiving. In Proceedings of the 8th ACM/IEEE-CS Joint Conference on Digital Libraries (Pittsburgh PA, PA, USA, June 16 - 20, 2008). JCDL '08. ACM, New York, NY, 251-260. doi: 10.1145/1378889.1378930.

(For those without subscriptions for the ACM Digital Library, Google Scholar will point you at a preprint available at tamu.edu.)

ABSTRACT: This paper reports the results of a qualitative field study of the scholarly writing, collaboration, information management, and long-term archiving practices of researchers in five related subdisciplines. The study focuses on the kinds of artifacts the researchers create in the process of writing a paper, how they exchange and store materials over the short term, how they handle references and bibliographic resources, and the strategies they use to guarantee the long term safety of their scholarly materials. The findings reveal: (1) the adoption of a new CIM infrastructure relies crucially on whether it compares favorably to email along six critical dimensions; (2) personal scholarly archives should be maintained as a side-effect of collaboration and the role of ancillary material such as datasets remains to be worked out; and (3) it is vital to consider agency when we talk about depositing new types of scholarly materials into disciplinary repositories.

The Bits I Underlined

Furthermore, from the point of view of the researchers and scientists themselves, institutional archiving arrives on the scene late in the process; the deposit of publications and datasets is an afterthought to the actual work, the research and writing. What would make archiving more integral to the entire process? What does scholarly archiving look like today from the scholar's perspective? How can normal collaborative interactions be used to improve repository quality?

I make an effort to focus closely on the practices and artifacts relevant to maintaining personal archives and contributing to institutional repositories.

Second, participants feel that versions record the development of ideas, a trail that may prove important. But how important? Much of the history and provenance of an idea can be reconstructed from communications media like email, especially when it is combined with intrinsic metadata such as file dates. Thus benign neglect coupled with imaginative interpretation will get you pretty far in reconstructing a publication's history.

What is most apparent throughout this discussion is that personal archiving is a side effect of collaboration and publication: for example, if email is used as the mechanism for sharing files, it also becomes the nexus for archiving files. If one's CV is the means by which a public list of publications is maintained, it is also used as a pointer for oneself to the most authoritative version of a publication. Personal archiving can be both opportunistic and social: participants talked about tracking down public versions of their own publications to reclaim copies of lost work.

Email is cited as a good permanent store for three reasons: (1) it is easy to browse chronologically, which makes retrieval easy and lifts the filing and organizing burden; (2) intrinsic metadata supports the reconstruction of context (for example, who made particular revisions and why); and (3) email is usually accessible from any web browser. If email is used as an archive, some care must be taken to ensure everything that is important is actually in email. Some archival material is normally in email (reviews, for example) and no extra effort needs to be expended to make it part of the record. Other types of artifacts (run output, for example) must be put into email deliberately. Email is a sufficiently good archive that some participants made the effort...

It is easy to see how email provides just enough mechanism to fulfill the minimal version of these requirements. Any CIM infrastructure must beat email along all of those dimensions if it is to be adopted in email's stead.

Tuesday, 24 June 2008

Publishing - A One-Word Oxymoron?

Why do they call it "publishing"? Wouldn't it be much more accurate to say "I've just had a paper privatised?"

Just thinking aloud.

Wednesday, 11 June 2008

Negative Click Repositories

The topic of "negative cost" repositories has been doing the rounds in the blogosphere. Chris Rusbridge has rebadged it as the negative click repository on the grounds that there is a positive cost associated with setting up and using a repository. I think I would rather talk about value or profit - the final outcome when you take the costs and benefits into consideration. Do you run a positive value repository? Is it frankly worth the effort? Are your users in scholarly profit, or are you a burden on their already overtaxed resources?

Chris quotes from Cavlec's (imaginary) repository apologist who attempts to defend a very high-cost, low-benefit repository. But he then goes on to treat that passage as if it were a factual evaluation of a real repository, a damning piece of evidence on the fundamental uselessness of repositories. It isn't! Ulysses Acqua is a straw man and his repository is a caricature of a real repository. I certainly don't accept that he describes my repository and I can easily answer yes to many of those questions. So while I'm not complacent and I recognise that there are many new services I want my repository to offer, I think we're not doing too bad on the value scale already, thank you very much.

Negative click/positive value. It's a nice rhetorical stance and a useful banner to rally the troops to, but let's not flagellate ourselves unduly. Let's recognise where good value exists and promote it! Let's foster new services around the material that repositories capture, manage and expose. Otherwise we'll just give up and run to the next bandwagon which will always sound more enticing because it has less experience with dealing with real practice!

Anyway, I think that I am in violent agreement with Chris, so to show solidarity I will do what he asked and list some positive value generators: publicity and profile (CVs, Web pages, displays, adverts for MScs/PhDs/staff), community discovery, laptop backup and asset management.

Sunday, 8 June 2008

Even Simple Services can be Annoyingly Complicated

I know I've mentioned this before (on this blog and elsewhere) but I think that displays are important. Whether it's bibliographic displays of papers for CVs and online bibliographies or very visual galleries, slideshows and montages of papers, presentations, posters, images and videos, a core part of academic life is "showing off" the things you have done, and telling stories about them. Hopefully, your repository can help you with these displays.

In a previous blog (Cobbling it Together) I described a quick and dirty way of making a slideshow from a set of documents in a repository by using Acrobat to do all the heavy lifting. Several months later I created a slideshow a different way with the EPrints ZIP exporter and the Mac's iPhoto slideshow software. Both of these methods required quite a lot of manual labour to provide an end-to-end service. Now mediated deposit is one thing, but mediated use is quite another, and so I have been trying to find an easier way to produce a good looking slideshow with EPrints.

We have done various experiments with the display side, and that turns out to be quite easy. Whether it is with Flash, or a 3D renderer, or an external graphics environment, you can build interesting displays, assuming you have the right data files. The problem is getting the right data files! Quite often the items you most wish to show off are the most visual ones - posters or presentations, rather than papers. This means rich media, which almost certainly means commercial desktop presentation programs like Microsoft PowerPoint or Apple Keynote. Now, repositories make every effort to create previews and thumbnails of ingested documents, but this is mainly limited to images, videos and PDFs. Office documents have to be downloaded and opened in their native application to be able to view their contents.

When I was managing the OR08 repository, I encouraged authors to send in the source of their presentations and the poster artwork. Many sent me PDFs only, some sent me PDFs with the original PowerPoint, and some sent me the PowerPoint files only. For that last group I made sure that I manually converted the document into PDF (using PowerPoint 2008 on the Mac). The repository automatically created preview files for the first page of the PDF files, so each presentation ended up with a preview image of sorts. (Previews are actually created in three sizes by default by the open source Ghostscript interpreter.)

The problems that I have are that (a) previews aren't made of PowerPoint documents unless a derived PDF version is supplied, (b) the previews are relatively low-resolution and (c) the creation process is not reliable. Of the 144 PDF documents in the OR08 repository, about 10% have no preview because the conversion process failed. Of the remainder, a further 10% have an inaccurate preview (missing fonts, incorrect geometry, badly positioned graphic components). To produce a full-screen slideshow of posters, what is really needed is high fidelity, high quality (200dpi) images of the documents. Even when the preview-generation software is changed to increase the resolution, this does not fix the 20% of previews that do not pass the expected quality threshold. And, of course, it still requires the human depositor to submit a PDF document to make the Office original 'previewable'.
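As a rough check on those figures, here is a small sketch of the arithmetic. It assumes the "further 10%" applies to the documents that did get a preview; the percentages in the post are approximate, so the counts are approximate too.

```python
# Figures quoted in the post; the rates are approximate ("about 10%").
total_pdfs = 144
failed_rate = 0.10      # no preview at all
inaccurate_rate = 0.10  # assumption: applied to the remainder, not the total

no_preview = total_pdfs * failed_rate
inaccurate = (total_pdfs - no_preview) * inaccurate_rate
unusable = no_preview + inaccurate
unusable_share = unusable / total_pdfs

print(f"no preview: about {no_preview:.0f} documents")
print(f"inaccurate: about {inaccurate:.0f} documents")
print(f"unusable:   about {unusable:.0f} of {total_pdfs} ({unusable_share:.0%})")
```

On that reading the unusable share comes out just under the "20%" quoted above, which is close enough for figures that were rounded to begin with.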

The best results I have obtained for generating previews have been from using bona-fide PowerPoint to create PDFs and Acrobat (or the Mac's "Preview" program) to create images from those PDFs. Since Office is a personal desktop piece of software, it can't really be used in the context of a server, and the Microsoft documentation advises against it since various functions try to communicate with the user's desktop environment. This seems to preclude having a preview generation service tucked away in the repository software.

I've been experimenting with Chris G to find a manageable way to handle the production of high quality viewable PDF and preview images for Microsoft Office documents. It's not finished yet, but we're getting there. Parts of the process are in place, but they're all in pieces on the kitchen floor (metaphorically speaking). What I'm trying to do is document our experiences here, and in particular, why the whole charade is looking very different to the neat little repository service I imagined at the start.

Basically, whatever we try it involves running Office on someone's desktop, and that means transferring files from the repository and back again. We looked at file sharing technologies (e.g. SAMBA or WebDAV), or file transfer protocols (e.g. FTP or EMAIL), but we had problems because our server is behind a firewall and none of our user desktop machines are allowed there. Our initial expectation was that the repository would initiate and control the production of previews using some kind of push technology (e.g. messages or drop folders). In the end we settled on the new EPrints REST API and allowing the user/client/service to control the choice of items which require previews.
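To illustrate the "client chooses the items" idea, here is a minimal in-memory sketch. It does not use the EPrints REST API; the `Document` class, the `derived_from` field and the `needs_preview` function are all hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical list of extensions the desktop client knows how to convert.
OFFICE_EXTENSIONS = (".ppt", ".pptx", ".doc", ".docx")


@dataclass
class Document:
    doc_id: int
    filename: str
    derived_from: Optional[int] = None  # id of the source document, if any


def needs_preview(documents: List[Document]) -> List[Document]:
    """Return the Office documents that have no derived document yet."""
    already_done = {d.derived_from for d in documents if d.derived_from is not None}
    return [
        d for d in documents
        if d.filename.lower().endswith(OFFICE_EXTENSIONS)
        and d.doc_id not in already_done
    ]


docs = [
    Document(1, "talk.ppt"),
    Document(2, "talk.pdf", derived_from=1),  # talk.ppt is already handled
    Document(3, "poster.pptx"),
    Document(4, "paper.pdf"),                 # not an Office document
]
todo = needs_preview(docs)
print([d.filename for d in todo])
```

The point of the pull model is that the desktop client, which sits outside the firewall, decides what to fetch and when; the repository only has to answer requests.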

I've ended up using Automator on a Mac to control the use of Word, PowerPoint and Preview to process Office and PDF documents. Later I will investigate using .NET to control Office and Acrobat on a Windows desktop. At the moment it is happening on a physically separate computer, although we might look at virtualising the process and running it on the same machine as the repository.

This process is Automate-d but not completely automatic, because user tools are involved. Every so often when a Word document opens up, Word puts up a dialog box to ask me whether I want to trust its macros. Or when it tries to save to PDF I get a message warning me that a footer is out of a printable region. Also, Automator is wont to crash after processing about 50 documents. This means that it took about 4 days elapsed time to convert 60 PowerPoint and 100 Word documents into PDFs and thence to 200dpi PNG files. If I had been constantly in the office to give it the total 10 minutes attention it needed, the whole process would have taken about 2 hours. Still, a "slightly hands on" process is better than "no process at all" because I need those files. And I hope that I'm only going to have to process this backlog once anyway!
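One way to make a "slightly hands on" batch less painful is to record what has already been converted, so that a rerun skips finished files. The sketch below is not the Automator workflow; `convert_pending` and `flaky_convert` are hypothetical, and in a real run the `done` set would have to be saved to disk to survive a crash.

```python
from typing import Callable, Iterable, List, Set


def convert_pending(
    filenames: Iterable[str],
    done: Set[str],
    convert: Callable[[str], None],
) -> List[str]:
    """Convert every file not already in `done`; return the files that failed."""
    failed = []
    for name in filenames:
        if name in done:
            continue  # converted on an earlier run
        try:
            convert(name)
        except Exception:
            failed.append(name)  # leave it for a later run or a human
        else:
            done.add(name)
    return failed


calls = []


def flaky_convert(name: str) -> None:
    """Stand-in converter that fails the first time it sees b.doc."""
    calls.append(name)
    if name == "b.doc" and calls.count(name) == 1:
        raise RuntimeError("dialog box needs attention")


done = set()
files = ["a.ppt", "b.doc", "c.ppt"]
first = convert_pending(files, done, flaky_convert)   # b.doc fails
second = convert_pending(files, done, flaky_convert)  # only b.doc is retried
print(first, second, calls)
```

The second run touches only the file that failed, which is the behaviour you want when the backlog is 160 documents and the tool falls over every 50 or so.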

Now that I've got the new files on my desktop computer, I can put them back using the REST API, but how does the repository know that they are preview files? The automated built in previews are stored in a separate place in the repository; third party services can access them but there is no public API to create them or update them. Also, only preview images are handled, not PDFs. So the files that I have created aren't stored as 'repository backstage previews' but as independent documents within the original eprint that have a specific relationship (mechanicallyDerivedVersion) to the original Office document.

This means that any eprint which contains an Office document may now acquire a number of service-generated additional documents. At the moment, EPrints doesn't use them in its internal processes (for providing thumbnails on abstract pages), but export plugins can treat them how they like. The only thing that EPrints understands is that if a document changes (is updated, or deleted) then all of the documents that were mechanicallyDerived from it must be deleted. The assumption is that the 'preview' has been rendered obsolete or out-of-date by the changes/deletions and it is the responsibility of whatever third-party service that created the documents to recreate them.
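The deletion rule can be sketched with a toy model. This is not EPrints code: the `EPrint` and `Document` classes and the `derived_from` field are hypothetical stand-ins for the mechanicallyDerivedVersion relationship, and following chains of derivation (an image made from a PDF made from a PowerPoint) is an assumption of the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Document:
    doc_id: int
    filename: str
    derived_from: Optional[int] = None  # id of the document this was made from


@dataclass
class EPrint:
    documents: Dict[int, Document] = field(default_factory=dict)

    def add(self, doc: Document) -> None:
        self.documents[doc.doc_id] = doc

    def _drop_derived(self, source_id: int) -> None:
        """Delete every document derived, directly or indirectly, from source_id."""
        stale = [d.doc_id for d in self.documents.values()
                 if d.derived_from == source_id]
        for doc_id in stale:
            self._drop_derived(doc_id)  # assumption: chains are followed
            self.documents.pop(doc_id, None)

    def update(self, doc_id: int, new_filename: str) -> None:
        self.documents[doc_id].filename = new_filename
        self._drop_derived(doc_id)  # derived files are now out of date

    def delete(self, doc_id: int) -> None:
        self._drop_derived(doc_id)
        del self.documents[doc_id]


eprint = EPrint()
eprint.add(Document(1, "talk.ppt"))
eprint.add(Document(2, "talk.pdf", derived_from=1))
eprint.add(Document(3, "talk.png", derived_from=2))
eprint.add(Document(4, "paper.pdf"))

eprint.update(1, "talk-v2.ppt")
print(sorted(eprint.documents))
```

After the update only the source PowerPoint and the unrelated paper remain; regenerating the PDF and PNG is left to whichever external service made them.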

My 'slideshow' exporter can now look for all the PowerPoint documents it wants, and then use any image file that is mechanicallyDerived from them. Job done! The "preview" semantics of the derived files are only understood by the third party plugins that create and use them, but their temporary and dependent relationship to the other documents is understood by the repository core. As we refine this process we will doubtlessly add something like a derivationRelationship to the mix, so that we can tell our thumbnails from our MD5 signatures.
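The selection step of such an exporter might look like the sketch below. It is written in Python for illustration and is not the actual export plugin; `Document`, `derived_from` and `slideshow_images` are hypothetical names, and only images derived directly from a PowerPoint document are picked up.

```python
from dataclasses import dataclass
from typing import List, Optional

POWERPOINT_EXTENSIONS = (".ppt", ".pptx")
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")


@dataclass
class Document:
    doc_id: int
    filename: str
    derived_from: Optional[int] = None  # id of the source document, if any


def slideshow_images(documents: List[Document]) -> List[str]:
    """Return image files derived directly from a PowerPoint document."""
    ppt_ids = {
        d.doc_id for d in documents
        if d.filename.lower().endswith(POWERPOINT_EXTENSIONS)
    }
    return [
        d.filename for d in documents
        if d.derived_from in ppt_ids
        and d.filename.lower().endswith(IMAGE_EXTENSIONS)
    ]


docs = [
    Document(1, "talk.ppt"),
    Document(2, "talk.pdf", derived_from=1),   # derived, but not an image
    Document(3, "talk.png", derived_from=1),
    Document(4, "paper.pdf"),
    Document(5, "figure.png"),                 # an image, but not derived
]
images = slideshow_images(docs)
print(images)
```

Note that the exporter has to guess from the file type which derived files are usable as slides, which is exactly the gap a derivationRelationship would fill.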

The main shift in expectations remains that "preview generation" (something that was an automatic, internal service) is being supplemented by an external, partially manual (handheld) service. Sometimes, it seems, the services that a repository takes credit for are actually provided by a human cranking the handle on other pieces of software! I've just had exactly this discussion with Bill Hubbard, and it's making the boundary of my repository architecture diagram look very fuzzy!

Friday, 6 June 2008

Friday Afternoon Features

I can now keep up with the latest repository submissions from my mobile phone! Yes, in a fit of Web 2.0 experimentation, Chris and I connected eprints.ecs.soton.ac.uk to Twitter so that every time a new publication goes live it sends out a tweet to the eprintsecs user which I am now following.