Sunday 14 June 2009

The Desktop Repository that's Already There

It's really time I acknowledged Peter Sefton, who's doing a lot of work on PowerPoint and slide bursting for the Fascinator Desktop, part of his project to bring open HTML formats to the desktop. Peter visited Southampton earlier this year, and inspired me on the topic. I'd just got knocked back on a JISC proposal for looking at repository-desktop integration, so it was great to talk to someone else who wanted to do something in the area. We seem to be goading each other on at the moment, and we've been tweeting and emailing each other, but I've not given him his due credit in this blog so far.

I've been surprised to see how much of the infrastructure for a desktop repository is already in place in the operating system that he and I use (Mac OS X). The Mac already has a process that extracts metadata and data contents from each file into a central database (see mds(8) in the Unix manual pages); this process is alerted to update the database every time a new file is created or an old file is changed. There is an interface for querying the database (Spotlight), either looking just for matches of the contents, or for complex boolean queries based on the metadata and contents. There is also a sophisticated framework for generating and caching previews and thumbnails (QuickLook). A system that provides data and metadata handling in a centralised database with querying and visualisation facilities all sounds very repository-like to me. And in case you think that I'm overegging this pudding, here's a list of some of the common metadata that OS X will allow you to query (not including media-specific metadata):

Audiences: The intended audience of the file.
Authors: The authors of the document.
City: The document’s city of origin.
Comment: Comments regarding the document.
ContactKeywords: A list of contacts associated with the document.
ContentCreationDate: The document’s creation date.
ContentModificationDate: Last modification date of the document.
Contributors: Contributors to this document.
Copyright: The copyright owner.
Country: The document’s country of origin.
Coverage: The scope of the document, such as a geographical location or a period of time.
Creator: The application that created the document.
Description: A description of the document.
DueDate: Due date for the item represented by the document.
DurationSeconds: Duration (in seconds) of the document.
EmailAddresses: Email addresses associated with this document.
EncodingApplications: The name of the application (such as “Acrobat Distiller”) that was responsible for converting the document in its current form.
FinderComment: This contains any Finder comments for the document.
Fonts: Fonts used in the document.
Headline: A headline-style synopsis of the document.
InstantMessageAddresses: IM addresses/screen names associated with the document.
Instructions: Special instructions or warnings associated with this document.
Keywords: Keywords associated with the document.
Kind: Describes the kind of document, such as “iCal Event.”
Languages: Language of the document.
LastUsedDate: The date and time the document was last opened.
NumberOfPages: Page count of this document.
Organizations: The organization that created the document.
PageHeight: Height of the document’s page layout in points.
PageWidth: Width of the document’s page layout in points.
PhoneNumbers: Phone numbers associated with the document.
Projects: Names of projects (other documents such as an iMovie project) that this document is associated with.
Publishers: The publisher of the document.
Recipients: The recipient of the document.
Rights: A link to the statement of rights (such as a Creative Commons or old-school copyright license) that govern the use of the document.
SecurityMethod: Encryption method used on the document.
StarRating: Rating of the document (as in the iTunes “star” rating).
StateOrProvince: The document’s state or province of origin.
Title: The title.
Version: The version number.
WhereFroms: Where the document came from, such as a URI or email address.

That's a pretty impressive list, and it is fully typed as well, so dates are dates and numbers are numeric, meaning that you can do proper range searches, not just text matches. Still, the Mac implementation has enough limitations that we haven't yet thought of it as a repository:
  1. it's a proprietary system. You can't access the thumbnails or export the metadata.
  2. there isn't any way of manually entering or editing the metadata - it's all automatically extracted from the file contents by the ingesters/importers
  3. there isn't any particularly useful way of displaying the metadata, apart from in the Finder's "Get Info" box or on the command line (using the mdls program).
Issues (1) and (3) just reduce to coding better applications. There are a number of Finder replacements, but none of them really take the metadata seriously. There are also a number of tagging applications that have emerged in the last year or so, but they use a very narrow range of metadata. Someone could add a faceted browser interface to the Finder, or integrate some more explicitly bibliographic metadata into the Apple infrastructure.
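As a rough illustration of how little code a "better application" needs to get started, here is a minimal Python sketch of the typed range search and the mdls listing mentioned above. It assumes the command-line tools mdfind and mdls are present (Mac OS X only) and that the attribute names take the kMDItem prefix, as in kMDItemNumberOfPages; the function names are my own and purely illustrative.

```python
import subprocess


def range_query(attribute, low, high):
    """Build a Spotlight query string matching low <= attribute <= high."""
    return f"{attribute} >= {low} && {attribute} <= {high}"


def parse_mdls(output):
    """Turn the 'name = value' lines printed by mdls into a dict of strings.

    Simplification: continuation lines of multi-valued attributes (printed
    as parenthesised lists) are skipped, so only the opening '(' survives.
    """
    metadata = {}
    for line in output.splitlines():
        name, separator, value = line.partition("=")
        if separator:
            metadata[name.strip()] = value.strip().strip('"')
    return metadata


def find_files(query, folder=None):
    """Run mdfind and return the matching paths (requires Mac OS X)."""
    command = ["mdfind"]
    if folder:
        command += ["-onlyin", folder]
    command.append(query)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout.splitlines()


def print_titles(query, folder=None):
    """Print each matching path with its title, if Spotlight has one."""
    for path in find_files(query, folder):
        listing = subprocess.run(["mdls", path], capture_output=True,
                                 text=True, check=True)
        print(path, parse_mdls(listing.stdout).get("kMDItemTitle"))
```

On a Mac, print_titles(range_query("kMDItemNumberOfPages", 10, 50)) would list every indexed document of between 10 and 50 pages. A faceted browser would need a fuller parser for the multi-valued attributes, but the querying side is already done by the operating system.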

Further reading around shows that issue (2) is also surmountable; extra metadata can be attached to a file through the use of the Mac filesystem's extended attributes. As well as the Title and Author information that the Microsoft Office importer produces, extended attributes with names like are inspected when the file is indexed. The value of that attribute is an "OS X Property List value" i.e. a number, boolean, date, string or array stored as binary or XML.
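To make the extended-attribute route concrete, here is a hedged Python sketch that encodes a value as a binary property list and builds the xattr command to attach it to a file. The "com.apple.metadata:" prefix is an assumption on my part (it is the convention used for attributes such as kMDItemWhereFroms), as is the choice of kMDItemKeywords in the usage note; whether Spotlight picks up any particular attribute written this way would need testing.

```python
import plistlib

# Assumed naming convention for Spotlight-visible extended attributes.
PREFIX = "com.apple.metadata:"


def xattr_command(path, item, value):
    """Build the argument list for writing one metadata item to a file.

    The value (number, boolean, date, string or list) is serialised as a
    binary property list and passed to the xattr tool as hex (-x).
    """
    payload = plistlib.dumps(value, fmt=plistlib.FMT_BINARY)
    return ["xattr", "-w", "-x", PREFIX + item, payload.hex(), path]


def decode_payload(hex_text):
    """Decode a hex-encoded property list back into a Python value."""
    return plistlib.loads(bytes.fromhex(hex_text))
```

On a Mac, passing the result of xattr_command("paper.pdf", "kMDItemKeywords", ["repositories", "desktop"]) to subprocess.run would attach the keywords to a hypothetical paper.pdf; decode_payload is there to check what was written.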

This looks like a very useful platform on which to build the researcher's desktop repository; a few added user-centric applications for browsing and editing metadata, together with some software to synchronise the desktop repository with the institutional repository (something like Time Machine) and we would have a very powerful system indeed.

Now I really do have to get on with that marking!


  1. One of the things we're trying to do is get software that will work cross-platform. I don't suppose this metadata file system is available to run elsewhere?

    See Linda Octalina's post on our filesystem watcher. So far it is for Linux, but she is doing OS X as well and Ron Ward is doing a Windows version:

    BTW I only look like I use OS X - that MacBook I had in Atlanta was running Ubuntu.

  2. The metadata filesystem (i.e. the extended attributes) will work for Linux and probably Windows in some form. Almost everyone does it; it's just that few applications use it.

  3. Hi Leslie & Peter

    Not all metadata/data content can be extracted by Spotlight. Spotlight still depends on the importers (with the .mdimporter extension) for each file type to extract the metadata. For instance, to extract full metadata for ODF documents and MS Office, the NeoLight importer needs to be installed in the /Library/Spotlight folder.
    By default Spotlight only indexes those files that it understands; if it does not understand a file, it will just provide the basic metadata of the file. This basic metadata will work cross-platform.

    I have written a blog post on using Spotlight to implement a Mac watcher: but this implementation still needs to be considered.

    Linda Octa