I've been surprised to see how much of the infrastructure for a desktop repository is already in place in the operating system that he and I use (Mac OS X). The Mac already has a process that extracts metadata and data contents from each file into a central database (see mds(8) in the Unix manual pages); this process is alerted to update the database every time a new file is created or an old file is changed. There is an interface for querying the database (Spotlight), either looking just for matches of the contents, or for complex boolean queries based on the metadata and contents. There is also a sophisticated framework for generating and caching previews and thumbnails (QuickLook). A system that provides data and metadata handling in a centralised database with querying and visualisation facilities all sounds very repository-like to me. And in case you think that I'm overegging this pudding, here's a list of some of the common metadata that OS X will allow you to query (not including media-specific metadata):
Audiences | The intended audience of the file. |
Authors | The authors of the document. |
City | The document’s city of origin. |
Comment | Comments regarding the document. |
ContactKeywords | A list of contacts associated with the document. |
ContentCreationDate | The document’s creation date. |
ContentModificationDate | Last modification date of the document. |
Contributors | Contributors to this document. |
Copyright | The copyright owner. |
Country | The document’s country of origin. |
Coverage | The scope of the document, such as a geographical location or a period of time. |
Creator | The application that created the document. |
Description | A description of the document. |
DueDate | Due date for the item represented by the document. |
DurationSeconds | Duration (in seconds) of the document. |
EmailAddresses | Email addresses associated with this document. |
EncodingApplications | The name of the application (such as “Acrobat Distiller”) that was responsible for converting the document in its current form. |
FinderComment | This contains any Finder comments for the document. |
Fonts | Fonts used in the document. |
Headline | A headline-style synopsis of the document. |
InstantMessageAddresses | IM addresses/screen names associated with the document. |
Instructions | Special instructions or warnings associated with this document. |
Keywords | Keywords associated with the document. |
Kind | Describes the kind of document, such as “iCal Event.” |
Languages | Language of the document. |
LastUsedDate | The date and time the document was last opened. |
NumberOfPages | Page count of this document. |
Organizations | The organization that created the document. |
PageHeight | Height of the document’s page layout in points. |
PageWidth | Width of the document’s page layout in points. |
PhoneNumbers | Phone numbers associated with the document. |
Projects | Names of projects (other documents such as an iMovie project) that this document is associated with. |
Publishers | The publisher of the document. |
Recipients | The recipient of the document. |
Rights | A link to the statement of rights (such as a Creative Commons or old-school copyright license) that govern the use of the document. |
SecurityMethod | Encryption method used on the document. |
StarRating | Rating of the document (as in the iTunes “star” rating). |
StateOrProvince | The document’s state or province of origin. |
Title | The title. |
Version | The version number. |
WhereFroms | Where the document came from, such as a URI or email address. |
That's a pretty impressive list, and it is fully typed as well, so dates are dates and numbers are numeric, meaning that you can do proper range searches, not just text matches. Still, the Mac implementation has enough limitations that we haven't yet thought of it as a repository:
- it's a proprietary system. You can't access the thumbnails or export the metadata.
- there isn't any way of manually entering or editing the metadata; it's all automatically extracted from the file contents by the ingesters/importers.
- there isn't any particularly useful way of displaying the metadata, apart from in the Finder's "Get Info" box or on the command line (using the mdls program).
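As a rough sketch of what a typed range search looks like from a script, the snippet below builds a Spotlight query that combines a numeric test with a date test and hands it to the mdfind command-line tool. It assumes the kMDItem-prefixed forms of the attribute names listed above and the `$time.iso()` date syntax of the Spotlight query language; the function names `build_query` and `find_files` are hypothetical, chosen only for illustration.

```python
import shutil
import subprocess
from datetime import date
from typing import List


def build_query(min_pages: int, modified_since: date) -> str:
    """Build a Spotlight query with a numeric and a date range test.

    Assumes the kMDItem-prefixed attribute names and the $time.iso()
    date form of the Spotlight query language.
    """
    iso = modified_since.strftime("%Y-%m-%dT00:00:00Z")
    return (
        f"kMDItemNumberOfPages >= {min_pages} && "
        f"kMDItemContentModificationDate >= $time.iso({iso})"
    )


def find_files(query: str) -> List[str]:
    """Run the query through mdfind; return [] where mdfind is missing."""
    if shutil.which("mdfind") is None:
        return []  # not on a Mac, so there is no Spotlight index to ask
    result = subprocess.run(
        ["mdfind", query], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    query = build_query(10, date(2009, 1, 1))
    print(query)
    for path in find_files(query):
        print(path)
```

Because the comparison is made on typed values, a document with 9 pages is excluded and one with 100 pages is included, which a plain text match on the digits could not guarantee.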
Further reading shows that issue (2), the lack of any way to enter or edit metadata by hand, is also surmountable: extra metadata can be attached to a file through the Mac filesystem's extended attributes. As well as the Title and Author information that the Microsoft Office importer produces, extended attributes with names like com.apple.metadata:kMDItemPhoneNumbers are inspected when the file is indexed. The value of that attribute is an "OS X Property List value", i.e. a number, boolean, date, string or array stored as binary or XML.
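A minimal sketch of how such an attribute might be written, assuming the value is serialised as a binary property list with Python's plistlib and attached with the Mac's xattr command-line tool (using its `-w` and `-x` options to write a hex-encoded value). The helper names and the sample phone number are hypothetical, and whether Spotlight actually indexes a given attribute is taken from the description above rather than verified here.

```python
import plistlib
import shutil
import subprocess

# Prefix used by the attribute name quoted in the text above.
ATTR_PREFIX = "com.apple.metadata:"


def encode_value(value) -> bytes:
    """Serialise a number, boolean, date, string or list as a binary plist."""
    return plistlib.dumps(value, fmt=plistlib.FMT_BINARY)


def set_metadata_attribute(path: str, key: str, value) -> bool:
    """Attach value to path as the extended attribute com.apple.metadata:<key>.

    Returns False when no xattr tool is on the PATH. The -w/-x options are
    those of the macOS xattr tool; other platforms may differ.
    """
    if shutil.which("xattr") is None:
        return False
    payload = encode_value(value).hex()
    subprocess.run(
        ["xattr", "-w", "-x", ATTR_PREFIX + key, payload, path],
        check=True,
    )
    return True


if __name__ == "__main__":
    # Round-trip the encoded value to show what would be stored.
    raw = encode_value(["555-0100"])
    print(plistlib.loads(raw))
```

Calling `set_metadata_attribute("report.doc", "kMDItemPhoneNumbers", ["555-0100"])` on a Mac would then attach the list to the file, where `report.doc` stands in for any existing document.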
This looks like a very useful platform on which to build the researcher's desktop repository; a few added user-centric applications for browsing and editing metadata, together with some software to synchronise the desktop repository with the institutional repository (something like Time Machine) and we would have a very powerful system indeed.
Now I really do have to get on with that marking!
One of the things we're trying to do is get software that will work cross-platform. I don't suppose this metadata file system is available to run elsewhere?
See Linda Octalina's post on our filesystem watcher. So far it is for Linux, but she is doing OS X as well and Ron Ward is doing a Windows version: http://lindaocta.com/?p=119
BTW I only look like I use OS X - that MacBook I had in Atlanta was running Ubuntu.
The metadata filesystem (i.e. the extended attributes) will work for Linux and probably Windows in some form. Almost every platform supports them; it's just that few applications use them.
Hi Leslie & Peter
Not all metadata and data content can be extracted by Spotlight. Spotlight still depends on importers (with the .mdimporter extension) for each file type to extract the metadata. For instance, to extract full metadata for ODF documents and MS Office files, the NeoLight importer needs to be installed in the /Library/Spotlight folder.
By default Spotlight only indexes files that it understands; for a file it does not understand, it will just provide the basic metadata of the file. This basic metadata will work cross-platform.
I have written a blog post on using Spotlight to implement a Mac watcher: http://lindaocta.com/?p=126 but this implementation still needs further consideration.
Linda Octa