Tuesday 15 January 2008

The Myth of Complex Objects II

Following on from my previous posting, I'd like to say a few more things about complexity. In particular, I'd like to acknowledge that complexity does exist while at the same time standing by my assertion that repository users themselves aren't creating "complex objects". It's the act of putting things in a repository that creates complexity, and that has to be managed in as straightforward a way as possible.

First of all, a definition. Something is complex if it "consists of interconnected or interwoven parts" (according to the American Heritage dictionary at Answers.com).

What authors and researchers in general are doing is creating lots of simple things - a paper, a database, a presentation. They're creating them as files and directories on their computers (laptops, workstations, servers). What authors and researchers need from us are repositories to capture these simple things and manage them (for preservation and access purposes).

Complexity appears when we as repository designers notice that "many things" are being deposited by a content creator, and that these things are not entirely independent of each other. The original source of a paper, its PDF, the presentation that was created to discuss it at a conference, the video of the presentation. These things are all implicitly interconnected in the cataloguer's mind, even though personal experience says that they are probably not explicitly grouped together on the author's hard disk. We want to capture this implicit interconnection and turn it into benefit for the author or the reader.

It may be that the connection is even stronger - a group of papers in a reading list, a set of questions for an exam or a collection of photos of a single event. These examples are more likely to be stored together by the author, simply because they are naturally used together. Even so, they are still created and managed as single files in a directory structure, because those are the day-to-day tools that content creators are familiar with.

So, as responsible information designers we have to decide how to treat this complexity - this implicit relationship between files.

The easiest thing to do is to ignore interconnectedness. We can achieve that end in two ways, either by forcing users to deposit individual files in individual records (leaving the user ignorant of the relationships between the records) or else by allowing users to deposit undistinguished clumps of files in a single record (leaving the user ignorant of the nature of the relationships between the files inside a record). In both cases, subversive use of the record's metadata by the depositor may help overcome the shortcomings of the repository design and reassemble some ad hoc sense of relationship.

The most natural way for a repository to support interconnectedness and relationship amongst the files it holds is to model them in a way that its users will recognise. Hence EPrints allows each record to have many 'documents', where each document has its own metadata to describe its role and purpose. That allows a preprint, a postprint, a poster and a presentation to co-exist within the same record. Even though they may all be PDF files, there is no danger of confusing them, because they all have their own metadata descriptions. To cope with the cases where one of the documents is really a collection of things (like a photo album or a web page), each of the documents is allowed to consist of many separate files.
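The record/document/file arrangement described above can be sketched in a few lines of Python. This is only an illustration of the shape of the model, not EPrints code: the class names `Record` and `Document`, the `role` metadata key, the helper `documents_with` and all of the file names are assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Document:
    """One document within a record, carrying its own descriptive metadata."""
    metadata: Dict[str, str]  # hypothetical keys, e.g. the document's role
    files: List[str] = field(default_factory=list)  # one file or many


@dataclass
class Record:
    """A repository record that groups related documents."""
    metadata: Dict[str, str]
    documents: List[Document] = field(default_factory=list)

    def documents_with(self, key: str, value: str) -> List[Document]:
        """Return the documents whose metadata has the given key/value pair."""
        return [d for d in self.documents if d.metadata.get(key) == value]


# Several PDFs live in one record; their metadata tells them apart.
record = Record(
    metadata={"title": "An example paper"},
    documents=[
        Document({"role": "preprint"}, ["paper-v1.pdf"]),
        Document({"role": "postprint"}, ["paper-final.pdf"]),
        Document({"role": "poster"}, ["poster.pdf"]),
        Document({"role": "presentation"}, ["slides.pdf"]),
        # A document that is really a collection of things: many files.
        Document({"role": "photo album"}, ["img001.jpg", "img002.jpg"]),
    ],
)

poster = record.documents_with("role", "poster")[0]
print(poster.files)
```

The point of the sketch is that the grouping lives at two levels: documents within a record, and files within a document.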

That means there is more than one way to store a group of inter-related files like a photo collection in an EPrints repository: (a) store each image as a separate eprint record with its own metadata, and perhaps even create a top-level repository 'view' for it; (b) store the collection as a single eprint record, and store each image as a separate document; (c) store the collection as a single document made up of all of the images; or (d) turn the collection into a single ZIP/TAR/METS file and store it as a single item. Which of those choices you take really depends on the significance of the collection and the use to which you wish to put it.
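To make the difference between the four options concrete, here is a rough Python sketch that lays out the same three photos in each way and counts the resulting records, documents and files. It is a toy model only, not EPrints code: a record is represented as a plain list of documents, a document as a plain list of file names, and the functions `layouts` and `shape` and the name `collection.zip` are placeholders invented for the example.

```python
from typing import Dict, List

# Toy model: a record is a list of documents; a document is a list of file names.
Record = List[List[str]]


def layouts(images: List[str]) -> Dict[str, List[Record]]:
    """Return layouts (a)-(d) for one collection of image files."""
    return {
        # (a) one record per image, each holding a single one-file document
        "a": [[[img]] for img in images],
        # (b) one record, with one document per image
        "b": [[[img] for img in images]],
        # (c) one record, with one document holding all the images
        "c": [[list(images)]],
        # (d) one record, one document, one packaged file (placeholder name)
        "d": [[["collection.zip"]]],
    }


def shape(records: List[Record]) -> tuple:
    """Count (records, documents, files) in a layout."""
    docs = [doc for rec in records for doc in rec]
    return (len(records), len(docs), sum(len(doc) for doc in docs))


photos = ["img001.jpg", "img002.jpg", "img003.jpg"]
for option, records in layouts(photos).items():
    print(option, shape(records))
```

Moving from (a) to (d), the collection becomes easier to handle as a unit, while each individual image has less room for metadata of its own.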

So despite the fact that authors aren't themselves engaged in creating complex objects outside the repository, an EPrints repository supports sufficient complexity to allow for implicit connections and relationships between authored items to be made explicit and for users and software to take advantage of it.

Thursday 10 January 2008

The Myth of the Complex Object

Here's an extract from the introduction to a Fedora tutorial that I have a bit of an issue with:

The Problem of Digital Content

Digital content is not just documents, nor is it made up exclusively of the content from digital versions of currently owned non-digital content.

  • Conventional Objects: books and other text objects, geospatial data, images, maps
  • Complex, Compound, Dynamic Objects: video, numeric data sets and their associated code books, timed audio

As users become more sophisticated at creating and using complex digital content, digital repositories must also become more sophisticated.

In summary: "There is a problem with content! Content is complex, compound, dynamic objects! We need more sophisticated repositories to cope!" As a PhD examiner I am used to challenging rather alarmist opening paragraphs like this. So here I go...

This idea that users are creating complex objects which only complex repositories can cater for just doesn't sound right. When I was a child we had the idea that we'd all be walking around in silver spacesuits by the year 2000, but we aren't, and we aren't creating complex digital objects either. What we are creating is files. Lots of them. PDF files, Word files, spreadsheet files, video files, database files, Web files. The media types that I am working with may have got more interesting (richer), but I'm still using applications to create and edit and look after lots of files on my hard disk(s) on my computer(s).

And far from video content not being "just documents" but "complex compound dynamic objects" we in fact see Movie documents or plain old AVI files on our desktops. Even my e-science friends with their robotic labs full of experimental data and analytical data from formally defined workflows are producing and working with lots of files, not complex objects. They know the format and purpose and content of every file, and how it should be used, analysed and checked, and what applications can be used for each of these purposes.

What is complex and problematical about this? Exactly what do we need a new breed of repository for? A repository just needs to be able to manage items containing lots of documents/files, and to deliver them to lots of applications (or "services").

If we need better repositories, it is not to handle "more complex content" coming from "more sophisticated users", but to be better integrated with the working practices of those users. What we really need is better ingest/deposit features to help capture as much of their material as possible and better services to help them accomplish their tasks and to excel in their careers.

And that's not just vacuous waffle, because if a repository can't make knowledge workers (e.g. researchers or teachers) more effective at their jobs then there really is a problem!

Monday 7 January 2008

The Journey of a Thousand Deposits

My last post did come across a bit anti-researcher, especially for a researcher! I'd like to confirm that I do believe in making repositories as researcher-friendly and researcher-relevant as is humanly possible. And as librarian-friendly as possible too!

I'll take this opportunity to share a poster that I'll be using for our researchers. I'm adopting the phrase at the bottom "Research outputs go in research repositories" as my motto for 2008. To get the full effect you have to say it in one of those deep bass cinema trailer voiceover voices.

Sunday 6 January 2008

The Journey of a Thousand Miles Begins With a Lot of Effort

I suppose the New Year is a time for reflection and loin-girding, and during the enhanced Winterval binge that this UK academic enjoyed (21st Dec - 6th Jan, thanks to scheduled network downtime at the office) I have had a chance to think a bit around some repository topics.

Are repositories just wrong? Aren't they failing? These are questions that were brought up at the JISC CRIG unconference in December, a theme that has emerged from Caveat Depositor (Dorothea Salo) in recent months and one that was addressed at the inaugural UKCoRR meeting of research repository managers in the UK back in May.

They're certainly not easy services to run, requiring researchers and faculty at least to change their working practices, if not to re-evaluate their relationship to the information that they generate. And who ends up doing the hard work? Librarians! If they're not running proxy deposit services, they're having to spend endless meetings evangelizing, proselytizing and advocating the use of repositories at the grass roots, middle management and the top level of University structures. And outside the walls of the university, similar (seemingly interminable) discussions and arguments are taking place with funding bodies and governments on open access. Slowly the pieces are falling into place - the NIH mandate being the latest example. Slowly repositories are beginning to build up some useful levels of content (see roar.eprints.org for exact stats). It's still painfully slow, and certainly not an overnight success.

It's easy to feel a failure if your one-year-old repository has only a few hundred items in it, but the adoption of technology and institutional change don't come quickly. The adoption of email and word processing wasn't that quick within a University context. Our school was full of geeks who used email exclusively by 1982, but Bill Hubbard (SHERPA) tells a story of how his University's Vice Chancellor nearly provoked a rebellion by unilaterally moving all his own communication to email and refusing to read any more written memos. Still, it worked within a year. Some things just require a longer time to catch on, and then some mandating!

It is easy to imagine that the Web has revolutionised the lives of academics, and that it is only repositories that are failing in their duty to be popular. But in fact, the Web has also failed to take off in academia in important ways. No, really, just look around and see how many academics in your schools have up-to-date home pages. How many academics (who live and die by reputation) have digital profiles that aren't years out of date? This summer, one team in our school boasted a front page BBC News story about its research project. The BBC dutifully linked to the home page of the principal researcher, but it turned out that he hadn't yet updated his home page to mention the project. That'll be the THREE YEAR project that had just finished. D'oh!

In total, about a third of our academics don't have a functional home page. (I'm ignoring the "official" web page that the school portal automatically puts together based on papers, projects and recent seminars, because it's too general and too sparse.) And we're computer scientists - technology geeks, in other words. What hope the arts faculty? But it's not just us - it also looks like 20% of MIT Computer Science professors don't have web pages (according to their phone book, at least).

So it looks to me like researchers in general aren't too good at web dissemination. Don't blame repositories! They *are* a part of the solution, it's just that they're a solution that researchers aren't looking for. In other words, Dorothea is right.

BUT SO WHAT? Just because academics don't care about an issue doesn't mean that it should be dropped. This is where DS and I will have naturally different perspectives. She stands in the library and I stand in the research lab. She can't tell academics what to do. She can't change their behaviour. She can't force researchers to adopt open access, preservation-friendly practices. She can only advise and educate. That is pretty frustrating. I'm sure that she's got all the low-hanging fruit. Perhaps everyone who was going to be quickly convinced has been convinced.

But this isn't just her fight. Librarians can't boss professors; only other professors, their senior management and their funders can. So the others had better step up to the plate - the researchers, academics, professors who support repositories, open access, information preservation. Those who can see the advantage and implications of a well-maintained network of up-to-date, accessible information about research, researchers, research projects, activities - the scholarly lifecycle, its outputs and stakeholders. Those who get it - that the Web has changed the rules for everyone. In other words, this is MY problem (as an academic), not Dorothea's (as a librarian). And academics just don't listen to people unless they're forced to. And that is why (I believe) the smart money is on mandates at the moment - funder mandates, institutional mandates or departmental (patchwork) mandates. Whoever is listening to sense should just impose sense where they have authority.

Librarians often react badly to mandates because they contrast with the normal library/faculty relationship. But that's rather the point - after all the education and information has been delivered, the remaining message is "just do it". "Stop messing around." And librarians can't deliver that message. In a world in which knowledge can be easily shared and indexed for the whole planet to benefit from, it is simply no longer acceptable that research material (data, analysis and article) should slowly be lost to disorganised filing cabinets, file systems or unsupported, obsolete IT platforms. Or to propping up out-of-date publishing business models, come to that.

So what do I predict for 2008? More mandates and more content. It'll feel like slow progress, but the rate of growth of the content will start to speed up. Perhaps I'll get Tim Brody to put a speedometer on the front page of ROAR!