Tuesday, 15 January 2008

The Myth of Complex Objects II

Following on from my previous posting, I'd like to say a few more things about complexity. In particular, I'd like to acknowledge that complexity does exist while at the same time standing by my assertion that repository users themselves aren't creating "complex objects". It's the act of putting things in a repository that creates complexity, and that has to be managed in as straightforward a way as possible.

First of all, a definition. Something is complex if it "consists of interconnected or interwoven parts" (according to the American Heritage dictionary at Answers.com).

What authors and researchers in general are doing is creating lots of simple things - a paper, a database, a presentation. They're creating them as files and directories on their computers (laptops, workstations, servers). What authors and researchers need from us are repositories to capture these simple things and manage them (for preservation and access purposes).

Complexity appears when we as repository designers notice that "many things" are being deposited by a content creator, and that these things are not entirely independent of each other. Consider the original source of a paper, its PDF, the presentation that was created to discuss it at a conference, and the video of that presentation. These things are all implicitly interconnected in the cataloguer's mind, even though personal experience says that they are probably not explicitly grouped together on the author's hard disk. We want to capture this implicit interconnection and turn it into benefit for the author or the reader.

It may be that the connection is even stronger - a group of papers in a reading list, a set of questions for an exam or a collection of photos of a single event. These examples are more likely to be stored together by the author, simply because they are naturally used together. Even so, they are still created and managed as single files in a directory structure, because those are the day-to-day tools that content creators are familiar with.

So, as responsible information designers we have to decide how to treat this complexity - this implicit relationship between files.

The easiest thing to do is to ignore interconnectedness. We can achieve that end in two ways, either by forcing users to deposit individual files in individual records (leaving the user ignorant of the relationships between the records) or else by allowing users to deposit undistinguished clumps of files in a single record (leaving the user ignorant of the nature of the relationships between the files inside a record). In both cases, subversive use of the record's metadata by the depositor may help overcome the shortcomings of the repository design and reassemble some ad hoc sense of relationship.

The most natural way for a repository to support interconnectedness and relationship amongst the files it holds is to model them in a way that its users will recognise. Hence EPrints allows each record to have many 'documents', where each document has its own metadata to describe its role and purpose. That allows a preprint, a postprint, a poster and a presentation to co-exist within the same record. Even though they may all be PDF files, there is no danger of confusing them, because they all have their own metadata descriptions. To cope with the cases where one of the documents is really a collection of things (like a photo album or a web page), each of the documents is allowed to consist of many separate files.
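
To make the record / document / file nesting concrete, here is a minimal sketch in Python. It is not EPrints code and does not use its API; the class names (Record, Document, StoredFile), the `role` field and the file names are all hypothetical, chosen only to illustrate the three levels described above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StoredFile:
    """A single file on disk (hypothetical name, for illustration only)."""
    filename: str


@dataclass
class Document:
    """One 'document' in a record, with its own descriptive metadata."""
    role: str  # assumed metadata field, e.g. "preprint" or "poster"
    files: List[StoredFile] = field(default_factory=list)


@dataclass
class Record:
    """One repository record holding many documents."""
    title: str
    documents: List[Document] = field(default_factory=list)

    def documents_with_role(self, role: str) -> List[Document]:
        # Document-level metadata is what tells otherwise similar PDFs apart.
        return [d for d in self.documents if d.role == role]


record = Record(
    title="An example paper",
    documents=[
        Document("preprint", [StoredFile("preprint.pdf")]),
        Document("postprint", [StoredFile("postprint.pdf")]),
        Document("poster", [StoredFile("poster.pdf")]),
        Document("presentation", [StoredFile("slides.pdf")]),
        # A document that is really a collection of things: many files.
        Document("photo album", [StoredFile("img1.jpg"), StoredFile("img2.jpg")]),
    ],
)

poster = record.documents_with_role("poster")[0]
print(poster.files[0].filename)
```

Four of the five documents here are PDFs, but the `role` on each one keeps them distinguishable, and the photo album shows a single document made up of several files.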

That means there is more than one way to store a group of inter-related files like a photo collection in an EPrints repository:

(a) Store each image as a separate eprint record with its own metadata, and perhaps even create a top-level repository 'view' for it.
(b) Store the collection as a single eprint record, and store each image as a separate document.
(c) Store the collection as a single document made up of all of the images.
(d) Turn the collection into a single ZIP/TAR/METS file and store it as a single item.

Which of those choices you take really depends on the significance of the collection and the use to which you wish to put it.
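
The four options differ only in where the grouping happens. The sketch below, again in Python and again not EPrints code, models a repository layout as plain nested lists (records containing documents containing file names) and counts what each option produces. The function names and the archive name `collection.zip` are assumptions made for illustration.

```python
from typing import List, Tuple

# A layout is a list of records; a record is a list of documents;
# a document is a list of file names.
Layout = List[List[List[str]]]


def separate_records(images: List[str]) -> Layout:
    # (a) one record per image, each with one single-file document
    return [[[img]] for img in images]


def one_record_many_documents(images: List[str]) -> Layout:
    # (b) one record, one document per image
    return [[[img] for img in images]]


def one_document_many_files(images: List[str]) -> Layout:
    # (c) one record, one document holding every image
    return [[list(images)]]


def single_archive(images: List[str], archive_name: str = "collection.zip") -> Layout:
    # (d) the images are packed into one archive file (packing not shown)
    return [[[archive_name]]]


def shape(layout: Layout) -> Tuple[int, int, int]:
    """Return (number of records, number of documents, number of files)."""
    records = len(layout)
    documents = sum(len(record) for record in layout)
    files = sum(len(doc) for record in layout for doc in record)
    return records, documents, files


images = ["img1.jpg", "img2.jpg", "img3.jpg"]
for build in (separate_records, one_record_many_documents,
              one_document_many_files, single_archive):
    print(build.__name__, shape(build(images)))
```

For three images the shapes are (3, 3, 3), (1, 3, 3), (1, 1, 3) and (1, 1, 1) respectively: each step down the list gives up one level at which metadata could be attached to an individual image.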

So despite the fact that authors aren't themselves engaged in creating complex objects outside the repository, an EPrints repository supports sufficient complexity to allow implicit connections and relationships between authored items to be made explicit, and to allow users and software to take advantage of them.
