RepositoryMan: October 2011

Wednesday, 26 October 2011

Rethinking the Open Access Agenda

I used to be a perfectly good computer scientist, but now I've been ruined by sociologists. Or at least that is what Professor Catherine Pope (the Marxist feminist health scientist who co-directs the Web Science Doctoral Training Centre with me) says. I am now as likely to quote Bruno Latour as Donald Knuth, and when I examine "the web", instead of a linked graph of HTML nodes I increasingly see a complex network of human activity loosely synchronised by a common need for HTTP interactions.

All of which serves as a kind of explanation of why I have come to think that we need to revisit the Budapest Open Access Initiative's obsession with information technology:

An old tradition and a new technology have converged to make possible an unprecedented public good. The old tradition is the willingness of scientists and scholars to publish the fruits of their research in scholarly journals without payment, for the sake of inquiry and knowledge. The new technology is the internet. The public good they make possible is the world-wide electronic distribution of the peer-reviewed journal literature and completely free and unrestricted access to it by all scientists, scholars, teachers, students, and other curious minds. Removing access barriers to this literature will accelerate research, enrich education, share the learning of the rich with the poor and the poor with the rich, make this literature as useful as it can be, and lay the foundation for uniting humanity in a common intellectual conversation and quest for knowledge. (see http://www.soros.org/openaccess/read)

BOAI promises that the "new technology" of the Internet (actually the Web) will transform our relationship to knowledge. But that was also one of the promises of the electric telegraph more than a century ago:

From the telegraph's earliest days, accounts of it had predicted "great social benefits": diffused knowledge, collective amity, even the prevention of crimes. (Telegraphic realism: Victorian fiction and other information systems by Richard Menke.)

There has been much good and effective work to support OA from both technical and policy perspectives - Southampton's part includes the development of the EPrints repository platform as well as the ROAR OA monitoring service - but critics still point to a disappointing amount of fruit from our efforts. Repositories multiply and green open access (self-deposited) material increases; knowledge about (and support for) OA has spread through academic management, funders and politicians, but it has not yet become a mainstream activity of researchers themselves. And now, a decade into the Open Access agenda, we are grasping the opportunity to replay all our missteps and mistakes in the pursuit of Open Data.

I am beginning to wonder whether by defining open access as a phenomenon of scholarly communication, we mistakenly created from the outset an alien and unimportant concept for the scientists and scholars who long ago outsourced the publication process to a support industry. As a consequence, OA has been best understood by (or most discussed by) the practitioners of scholarly and scientific communication - librarians and publishers - rather than by the practitioners of scholarship and science.

We have seen that the challenge of the Web can't be neatly limited to dissemination practices. In calling for researchers to open the outputs of their research, we inevitably ask them to reconsider the relationship that they have with their own work, their immediate colleagues, their academic communities, their institutions, their funders and their public. It turns out that we haven't been able to divorce the output of research from the conduct and the context of research activity. Let's move on from there.

In a recent paper Openness as infrastructure, John Wilbanks discussed the three missing components of an open infrastructure for science: the infrastructure to collaborate scientifically and produce data, the technical infrastructure to classify data and the legal infrastructure to share data - extending the technical infrastructure with a legal framework. I think that we need to go further and refocus our efforts and our rhetoric about "Open Access to Scientific Information" towards "Open Activity by Scientists" supported by three kinds of infrastructure:

Human Engagement
Methodological Analysis and
Social Trust.

The aim of open access to scientific outputs and outcomes will not be achieved until scientific practitioners see the benefit of the scientific commons, not as an anonymous dumping ground for information that can be accessed by all and sundry, but as a field of engagement that offers richer possibilities for their research and their professional activities. To realise that, scientists need more than email and Skype to work together, more than Google to aggregate their efforts and more than a copyright disclaimer to negotiate and mediate the trust relationships that make the openness that OA promises a safe and attractive, and hence realistic, proposition.

What I'm saying isn't new - there has been lots of effort and discussion about improving the benefits of repository technology to the end user/researcher, and about lowering the barriers to use. JISC has funded a number of projects in its Deposit programme, trying various strategies to increase user engagement with OA. As well as continuing to pursue this approach, we also need to step back from obsessing about the technology of information delivery, think bigger thoughts about scientific people and scientific practice and tell a bigger and more relevant story.

Sunday, 9 October 2011

Using EPrints Repositories to Collect Twitter Data

A number of our Web Science students are analysing people's use of Twitter, but the tools available to them are rather limited: since Twitter changed its terms of service, the functionality of TwapperKeeper and similar sites has been reduced. Personal tools like NodeXL (a plugin for Microsoft Excel running under Windows) do provide simple data capture from social networks, but a study will require long-term data collection over many months that is independent of reboots and power outages.

They say that to a man with a hammer, the solution to every problem looks like a nail. And so perhaps it is unsurprising that I see a role for EPrints in helping students and researchers to gather, as well as curate and preserve, their research data. Especially when the data gathering requires a managed, long-term process that results in a large dataset.

Figure: EPrints Twitter dataset, rendered in HTML

In collecting large, ephemeral datasets (tweets, Facebook updates, YouTube uploads, Flickr photos, postings on email forums, comments on web pages) a repository has a choice between:

(1) simply collecting the raw data, uninterpreted and requiring the user to analyse the material with their own programs in their own environments

(2) partially interpreting the results and providing some added value for the user by offering intelligent searches, analyses and visualisations to help the researchers get a feel for the data.

We experimented with both approaches. The first sounds simple and more appropriate (don't make the repository get in the way!), but in the end the job of handling, storing and providing a usable interface to the collection of temporal data means that some interpretation of the data is inevitable.

So instead of just constantly appending a stream of structured data objects (tweets, emails, whatever) to an external storage object (a file, database or cloud bucket) we ingest each object into an internal EPrints dataset with an appropriate schema. There is a tweet dataset for individual tweets, and a timeline dataset for collections of tweets - in theory multiple timelines can refer to the same objects in the tweet dataset. These datasets can be manipulated by the normal EPrints API and managed by the normal EPrints repository tools: you can search, export and render tweets in the same way that you can for eprints, documents, projects and users.
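To make the two-dataset idea concrete, here is a minimal sketch in Python. It is not the EPrints API (EPrints itself is written in Perl), and every name in it (Tweet, Timeline, Repository, ingest, render) is hypothetical. It only illustrates the design choice: each tweet is stored once, and timelines hold references to tweets rather than copies.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Tweet:
    tweet_id: int
    user: str
    text: str
    created_at: str  # ISO 8601 timestamp, kept as a string for simplicity


@dataclass
class Timeline:
    query: str  # the search parameters supplied by the user
    tweet_ids: list[int] = field(default_factory=list)


class Repository:
    """Toy stand-in for the tweet and timeline datasets (assumed names)."""

    def __init__(self) -> None:
        self.tweets: dict[int, Tweet] = {}
        self.timelines: dict[str, Timeline] = {}

    def ingest(self, query: str, tweet: Tweet) -> None:
        # Store each tweet once, however many timelines match it.
        self.tweets.setdefault(tweet.tweet_id, tweet)
        timeline = self.timelines.setdefault(query, Timeline(query))
        # Repeated collection runs may return the same tweet; skip duplicates.
        if tweet.tweet_id not in timeline.tweet_ids:
            timeline.tweet_ids.append(tweet.tweet_id)

    def render(self, query: str) -> list[Tweet]:
        # Resolve a timeline's references back into tweet objects.
        return [self.tweets[i] for i in self.timelines[query].tweet_ids]
```

Because a timeline only stores identifiers, two overlapping searches (say a hashtag and a user) share the underlying tweet records, which is the property the paragraph above describes.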

EPrints collects Twitter data by regular calls to the Twitter API, using the search parameters given by the user. The figure shows the results of a data collection (on the hashtag "drwho") resulting in a single Twitter timeline that is rendered as HTML for the Manage Records page. In this rendering, the timeline of tweets is shown as normal on the left of the window, with lists of top tweeters, top mentions, top hashtags and top links together with a histogram of tweet frequency on the right. These simple additions serve to give an overview of the data to the researcher - not to try to take the place of their bespoke data analysis software, but simply to help understand some of the major features of the data as it is being collected. The data can be exported in various formats (JSON, XML, HTML and CSV) for subsequent processing and analysis. The results of this analysis can themselves be ingested into EPrints for preservation and dissemination, along with the eventual research papers that describe the activity.
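The overview panel described above amounts to a handful of counts over the collected tweets. The following Python sketch shows one way such a summary could be computed; it is an illustration only, not the code behind the EPrints rendering. The function name summarise, the dictionary keys and the per-day bucketing of the histogram are all assumptions, and the regular expressions are deliberately crude.

```python
import re
from collections import Counter
from datetime import datetime


def summarise(tweets, top_n=3):
    """Summarise an iterable of dicts with 'user', 'text' and 'created_at'.

    'created_at' is assumed to be an ISO 8601 string (hypothetical schema).
    """
    tweeters, mentions, hashtags, links, per_day = (Counter() for _ in range(5))
    for tweet in tweets:
        text = tweet["text"]
        tweeters[tweet["user"]] += 1
        # Mentions and hashtags are case-insensitive, so normalise them.
        mentions.update(m.lower() for m in re.findall(r"@(\w+)", text))
        hashtags.update(h.lower() for h in re.findall(r"#(\w+)", text))
        links.update(re.findall(r"https?://\S+", text))
        # Bucket by calendar day for the frequency histogram.
        day = datetime.fromisoformat(tweet["created_at"]).date().isoformat()
        per_day[day] += 1
    return {
        "top_tweeters": tweeters.most_common(top_n),
        "top_mentions": mentions.most_common(top_n),
        "top_hashtags": hashtags.most_common(top_n),
        "top_links": links.most_common(top_n),
        "tweets_per_day": sorted(per_day.items()),
    }
```

The returned dictionary contains only strings, integers and lists, so it can be passed directly to json.dumps if a JSON export of the summary is wanted.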

All this functionality will soon be released as an EPrints Bazaar package; at the time of writing we are about to release it for testing by our graduate students. The infrastructure that we have created will then be adapted for the other sources of temporal Web data mentioned above (Flickr, YouTube, etc.).