Tuesday, 21 October 2008

Data Access in Repositories - Don't Overlook What We Already Have!

Dorothea Salo's latest blog entry takes EPrints and DSpace to task for not helping users analyse data files (query, slice-and-dice, facet, number-crunch, mash-up).

You can already do that, at least in Microsoft Excel. As an example, I chose a data file that is already in the MINDS repository (DSpace) and one that is in my school repository (EPrints), and created a new spreadsheet on my desktop that references data ranges in both of the archived data sets. I have put it on the Web so that you can check it out yourselves.

The screenshot shows the new spreadsheet, which calculates the average publication date of the 2900 records in the ARCL WSS dataset and counts the data points in A Longitudinal Study of Self Archiving.

The Excel cell reference syntax isn't very pretty - it is a backward-compatible munging (that's a technical term) of a URL into a UNC syntax. (And by the way, the munging was done automatically by Excel 2008 on a Mac.)

It is interesting to consider which data-oriented functions a repository can provide. However, we should not overlook the functions that we already have! And in the future, I would hope that URI-based data references will become commonplace in all our desktop applications.
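
The same kind of URI-based reference can be sketched outside Excel too. The Python below is only an illustration, not the spreadsheet described above: it assumes each archived data file is available as CSV at a stable URL, and the function names and the column names `year` and `value` are hypothetical, chosen just for this example.

```python
import csv
import io
import urllib.request


def fetch_text(url):
    """Download a data file by its URL and return its contents as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def column_values(csv_text, column):
    """Return the non-empty values of one named column from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    values = []
    for row in reader:
        cell = (row.get(column) or "").strip()
        if cell:
            values.append(cell)
    return values


def average_year(csv_text, column="year"):
    """Mean of a numeric column, in the spirit of Excel's AVERAGE over a range."""
    years = [float(v) for v in column_values(csv_text, column)]
    if not years:
        raise ValueError("no values found in column %r" % column)
    return sum(years) / len(years)


def count_points(csv_text, column="value"):
    """Number of non-empty cells in a column."""
    return len(column_values(csv_text, column))


def summarise(dspace_url, eprints_url):
    """Combine two archived data sets, each referenced only by its URL."""
    return {
        "average_year": average_year(fetch_text(dspace_url)),
        "data_points": count_points(fetch_text(eprints_url)),
    }
```

Calling `summarise` with the URLs of two archived files would return both figures in one dictionary. As with the spreadsheet, this only works for as long as the repository keeps those URLs stable.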

1 comment:

  1. Ah, but DSpace doesn't guarantee URIs for bitstreams! *groan*

    How would this interact with a file-versioning system? (I don't recall offhand whether EPrints does versioning or not.)

    Good demo. Gives one to think about the job of a repositarian vis-a-vis data-modelling inside the repository, as opposed to just putting the stuff there and letting people figure out how to be creative about accessing and using it. And whether/how repositarians then capture the results of that creativity.