Wednesday, 26 October 2011

Rethinking the Open Access Agenda

I used to be a perfectly good computer scientist, but now I've been ruined by sociologists. Or at least that is what Professor Catherine Pope (the Marxist feminist health scientist who co-directs the Web Science Doctoral Training Centre with me) says. I am now as likely to quote Bruno Latour as Donald Knuth, and when I examine "the web" instead of a linked graph of HTML nodes I increasingly see a complex network of human activity loosely synchronised by a common need for HTTP interactions.

All of which serves as a kind of explanation of why I have come to think that we need to revisit the Budapest Open Access Initiative's obsession with information technology:
An old tradition and a new technology have converged to make possible an unprecedented public good. The old tradition is the willingness of scientists and scholars to publish the fruits of their research in scholarly journals without payment, for the sake of inquiry and knowledge. The new technology is the internet. The public good they make possible is the world-wide electronic distribution of the peer-reviewed journal literature and completely free and unrestricted access to it by all scientists, scholars, teachers, students, and other curious minds. Removing access barriers to this literature will accelerate research, enrich education, share the learning of the rich with the poor and the poor with the rich, make this literature as useful as it can be, and lay the foundation for uniting humanity in a common intellectual conversation and quest for knowledge. (see http://www.soros.org/openaccess/read)
BOAI promises that the "new technology" of the Internet (actually the Web) will transform our relationship to knowledge. But that was also one of the promises of the electric telegraph over a century ago:
From the telegraph's earliest days, accounts of it had predicted "great social benefits": diffused knowledge, collective amity, even the prevention of crimes. (Telegraphic realism: Victorian fiction and other information systems by Richard Menke.)
There has been much good and effective work to support OA from both technical and policy perspectives - Southampton's part includes the development of the EPrints repository platform as well as the ROAR OA monitoring service - but critics still point to a disappointing amount of fruit from our efforts. Repositories multiply and green open access (self-deposited) material increases; knowledge about (and support for) OA has spread through academic management, funders and politicians, but it has not yet become a mainstream activity of researchers themselves. And now, a decade into the Open Access agenda, we are grasping the opportunity to replay all our missteps and mistakes in the pursuit of Open Data.

I am beginning to wonder whether by defining open access as a phenomenon of scholarly communication, we mistakenly created from the outset an alien and unimportant concept for the scientists and scholars who long ago outsourced the publication process to a support industry. As a consequence, OA has been best understood by (or most discussed by) the practitioners of scholarly and scientific communication - librarians and publishers - rather than by the practitioners of scholarship and science.

We have seen that the challenge of the Web can't be neatly limited to dissemination practices. In calling for researchers to open up the outputs of their research, we inevitably ask them to reconsider the relationships they have with their own work, their immediate colleagues, their academic communities, their institutions, their funders and their public. It turns out that we haven't been able to divorce the outputs of research from the conduct and the context of research activity. Let's move on from there.

In a recent paper, Openness as infrastructure, John Wilbanks discussed three missing components of an open infrastructure for science: the collaborative infrastructure to produce scientific data, the technical infrastructure to classify it, and the legal infrastructure to share it (extending the technical infrastructure with a legal framework). I think that we need to go further, and refocus our efforts and our rhetoric about "Open Access to Scientific Information" towards "Open Activity by Scientists", supported by three kinds of infrastructure:
  1. Human Engagement
  2. Methodological Analysis and
  3. Social Trust.
Open access to scientific outputs and outcomes will not be achieved until scientific practitioners see the benefit of the scientific commons, not as an anonymous dumping ground for information that can be accessed by all and sundry, but as a field of engagement that offers richer possibilities for their research and their professional activities. To realise that, scientists need more than email and Skype to work together, more than Google to aggregate their efforts, and more than a copyright disclaimer to negotiate and mediate the trust relationships that make the openness that OA promises a safe and attractive, and hence realistic, proposition.

What I'm saying isn't new - there has been lots of effort and discussion about improving the benefits of repository technology to the end user/researcher, and about lowering the barriers of use. JISC has funded a number of projects in its Deposit programme, trying various strategies to increase user engagement with OA. As well as continuing to pursue this approach, we also need to step back from obsessing about the technology of information delivery, think bigger thoughts about scientific people and scientific practice, and tell a bigger and more relevant story.

Sunday, 9 October 2011

Using EPrints Repositories to Collect Twitter Data

A number of our Web Science students are analysing people's use of Twitter, but the tools available to them are rather limited since Twitter changed its terms of service, curtailing the functionality of TwapperKeeper and similar sites. Personal tools like NodeXL (a plugin for Microsoft Excel running under Windows) do provide simple data capture from social networks, but a study may require long-term data collection over many months, independent of reboots and power outages.

They say that to a man with a hammer, every problem looks like a nail. And so perhaps it is unsurprising that I see a role for EPrints in helping students and researchers to gather, as well as curate and preserve, their research data - especially when the data gathering requires a managed, long-term process that results in a large dataset.

[Figure: an EPrints Twitter dataset, rendered in HTML]
In collecting large, ephemeral data sets (tweets, Facebook updates, YouTube uploads, Flickr photos, postings on email forums, comments on web pages), a repository has a choice between:

(1) simply collecting the raw data, uninterpreted and requiring the user to analyse the material with their own programs in their own environments

(2) partially interpreting the results and providing some added value for the user by offering intelligent searches, analyses and visualisations to help the researchers get a feel for the data.

We experimented with both approaches. The first sounds simpler and more appropriate (don't make the repository get in the way!), but in the end the job of handling, storing and providing a usable interface to a collection of temporal data means that some interpretation of the data is inevitable.

So instead of just constantly appending a stream of structured data objects (tweets, emails, whatever) to an external storage object (a file, database or cloud bucket), we ingest each object into an internal EPrints dataset with an appropriate schema. There is a tweet dataset for individual tweets, and a timeline dataset for collections of tweets - in principle, multiple timelines can refer to the same objects in the tweet dataset. These datasets can be manipulated by the normal EPrints API and managed by the normal EPrints repository tools: you can search, export and render tweets in the same way that you can for eprints, documents, projects and users.
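For concreteness, here is a minimal sketch of that arrangement in Python (EPrints itself is written in Perl, and these class and field names are illustrative, not the actual EPrints schema): tweets are stored once, and timelines refer to them by id, so several timelines can share the same tweet objects.

from dataclasses import dataclass, field

@dataclass
class Tweet:
    id: int              # Twitter's own id doubles as the object key
    author: str
    text: str
    created_at: str

@dataclass
class Timeline:
    query: str                                     # e.g. the hashtag being collected
    tweet_ids: list = field(default_factory=list)

tweets = {}                                        # the shared "tweet dataset"

def add_to_timeline(timeline, tweet):
    tweets.setdefault(tweet.id, tweet)             # each tweet is stored exactly once
    timeline.tweet_ids.append(tweet.id)            # timelines only hold references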

EPrints collects Twitter data by regular calls to the Twitter API, using the search parameters given by the user. The figure above shows the results of a data collection (on the hashtag "drwho") resulting in a single Twitter timeline that is rendered as HTML for the Manage Records page. In this rendering, the timeline of tweets is shown as normal on the left of the window, with lists of top tweeters, top mentions, top hashtags and top links, together with a histogram of tweet frequency, on the right. These simple additions serve to give the researcher an overview of the data - not to take the place of their bespoke data analysis software, but simply to help them understand some of the major features of the data as it is being collected. The data can be exported in various formats (JSON, XML, HTML and CSV) for subsequent processing and analysis. The results of this analysis can themselves be ingested into EPrints for preservation and dissemination, along with the eventual research papers that describe the activity.
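The collection loop itself is straightforward. This is not the EPrints code (which runs as a recurring repository task in Perl), just a minimal sketch of the polling pattern against the unauthenticated JSON search API that Twitter offers at the time of writing; the ingest() function is a stand-in for storing a tweet in the repository's tweet dataset.

import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

def ingest(tweet):
    # Stand-in for adding the tweet to the repository's tweet dataset
    print(tweet["id"], tweet["from_user"], tweet["text"])

def poll(query, since_id=0, interval=300):
    """Fetch new tweets matching query every few minutes, oldest first."""
    while True:
        params = urlencode({"q": query, "since_id": since_id, "rpp": 100})
        with urlopen("http://search.twitter.com/search.json?" + params) as r:
            results = json.load(r)["results"]
        for tweet in sorted(results, key=lambda t: t["id"]):
            ingest(tweet)
            since_id = max(since_id, tweet["id"])  # never re-request old tweets
        time.sleep(interval)

poll("#drwho")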

All this functionality will soon be released as an EPrints Bazaar package; as of the time of writing we are about to release it for testing by our graduate students. The infrastructure that we have created will then be adapted for other Web temporal data capture sources as mentioned above (Flickr, YouTube, etc).

Sunday, 26 June 2011

Mendeley: Measuring OA rates

Having talked about Mendeley's OA deposit rates in my last blog post, I thought it worthwhile to check how representative my chosen discipline (Computer Science) was. Rather than download the entire community for every other discipline, I performed a quick and dirty sample of the available literature in each discipline using the search function. Each Mendeley search result offers the option of saving the PDF (if available) to your library, so it is a simple matter to wget some search results and grep for PDFs.
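The same quick-and-dirty procedure, sketched in Python - with loudly-flagged assumptions: the search URL pattern and the ".pdf" marker are my guesses at the structure of Mendeley's current search pages, not a documented interface.

import re
from urllib.parse import quote
from urllib.request import urlopen

def pdf_count(term):
    # Fetch the first page of search results (200 entries per page);
    # the URL pattern is an assumption about Mendeley's site layout
    url = "http://www.mendeley.com/research-papers/search/?query=" + quote(term)
    page = urlopen(url).read().decode("utf-8", "ignore")
    # "Grep" for results offering a PDF; the marker is also an assumption
    return len(re.findall(r"\.pdf", page, re.IGNORECASE))

for term in ("JAVA", "software"):
    print(term, pdf_count(term))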

The table below shows the results of this procedure for 12 disciplines (two illustrative keywords each). The "Available PDFs" column records the number of PDFs offered on the first page of the search results (each page contains 200 results); the total number of results shows the relative coverage of the topic in Mendeley.

Computer Science appears to be in the 5-10% range of OA (18 or 11 PDFs out of a page of 200 results) which does seem to be just about average. Social Science, Medicine, Health Science, Economics and the Humanities appear to have fewer PDFs and Maths and Physics appear to have rather more.

Search term      Discipline   Available PDFs   Total Results
chromatography   Chem         10               14260
crystallography  Chem         27               4921
JAVA             CS           18               848
software         CS           11               15185
geology          Earth        36               4180
hydrodynamic     Earth        40               2853
econometrics     Economics    13               565
microeconomics   Economics    5                88
biodiversity     Env          14               4668
climate          Env          14               13003
nursing          Health       6                10723
palliative       Health       6                1978
archaeology      Hum          6                1730
Foucault         Hum          11               248
algebra          Math         101              4424
cohomology       Math         171              525
cancer           Med          11               52315
pharmacology     Med          4                62285
quasar           Phys         127              556
telescope        Phys         101              2347
cognition        Psy          11               18805
schizophrenia    Psy          17               4055
criminology      SocSci       2                154
sociology        SocSci       2                2005
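The percentages quoted above are simply the available-PDF counts divided by the 200 results shown per page; for example, for the two Computer Science terms:

for term, pdfs in (("JAVA", 18), ("software", 11)):
    print("%s: %.1f%%" % (term, 100.0 * pdfs / 200))   # 9.0% and 5.5%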

Mendeley: Download vs Upload Growth

There was a lot of talk about Mendeley at OAI7 in Geneva, especially the news that in the first quarter of 2011 the number of articles downloaded for free jumped from 300,000 to 800,000. That's really good news, confirming Mendeley as a successful service in the Open Access domain. Having done an analysis of Mendeley's impact on Open Access (see Comparing Social Sharing of Bibliographic Information with Institutional Repositories) just under a year ago, I thought I'd repeat the analysis to see the extent of the impact of their growth on deposits as well as downloads.

Results: the Computer Science discipline appears to have 2.2 times as many members as last August (74,736, up from 34,230). Of these, only 12,102 appear in the Computer Science directory listing, whose contents are now filtered by Mendeley according to their "profile completion"; the gross number was kindly provided for me by Steve Dennis at Mendeley. This filtering takes care of the long tail of accounts that have never been used. Of the filtered users, 1,676 are "OA active", having publicly shared at least one PDF document (up 21% on last August). The total number of PDFs shared by this group is 8,014, up 16% on last August, with 4.8 PDFs being shared per "active OA user" (down from 5.0 last August).
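As a quick sanity check on the headline ratios:

print(74736 / 34230)   # ~2.18, the "2.2x" membership growth
print(8014 / 1676)     # ~4.78, the 4.8 PDFs per "active OA user"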

So a big increase in user numbers results in only a small increase in publicly shared PDFs, confirming (I think) that Mendeley are not preaching to the choir, but are mainly attracting users who are not already "OA active". Mendeley's users have clearly embraced the role of "scholarly knowledge collectors"; the challenge remains how to change their behaviour from "scholarly asset maintainers" to "scholarly asset sharers".

Wednesday, 27 April 2011

Experimenting With Repository UI Design

I'm always on the lookout for engaging UI paradigms to inspire repository design, and I recently noticed that Blogger has made some new "dynamic views" available. These provide a variety of smart presentation styles that aren't a million miles away from the ones emerging on smartphone apps, combining highly visual and animated layouts.

So I've imported some repository contents into Blogger to get some hands-on experience, and I'd be interested in any feedback on whether this looks useful or compelling.


These views suit various different types of material, but the constant theme that is emerging is that a good visual is pretty much de rigueur for any resource. This means that relying on the thumbnail image of an article's first page is not going to be a good strategy (hint: they all look the same). I can foresee the need to extract figures and artwork from the PDFs and Office documents uploaded to a repository.

(Over the next few days I hope to put some more examples on the blog to help get a better feel for how this will work. But I think I might make a bulk Blogger exporter for EPrints because manual cut and pasting is only enjoyable for a few minutes!)

Tuesday, 26 April 2011

Mobile Use of Repositories

While looking at the impact of mobile devices on the development of the Web I found useful information in this March 2011 press release from web analytics company StatCounter, charting the rise of Android.
StatCounter data also pinpoints the rise and rise of mobile devices to access the Internet. The use of mobile to access the Internet compared to desktop has more than doubled worldwide from 1.72% a year ago to 4.45% today. The same trend is evident in the US with mobile Internet usage more than doubling over the past year from 2.59% to 6.32%.
I thought I'd see whether this behaviour applies equally to repositories, so I had a poke around in the usage stats for eprints.ecs.soton.ac.uk, and this is what I found:
  • 53,285 PDF downloads from 27 March 2011 (4am) to 3 April 2011 (4am).
  • Of these, 33,304 are attributed to crawlers and 19,981 to real browsers.
  • Only 0.93% of the browser downloads occur on mobile devices (70% iOS, 22% Android, 7% BlackBerry and 1% Symbian).
The use of mobiles that we are seeing for accessing research outputs in repositories is less than a quarter of the general level of mobile Internet use. An obvious reason for that is the unpalatable mixture of PDF pages and small devices, but popular applications like Mekentosj's Papers and Mendeley for iPhone seem to indicate that an attractive mobile experience should be possible.
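For the record, the arithmetic behind that comparison, using StatCounter's worldwide figure quoted above:

repo_mobile = 0.93   # % of real-browser repository downloads made from mobile devices
web_mobile = 4.45    # % of worldwide Internet use made from mobile (StatCounter)
print(repo_mobile / web_mobile)   # ~0.21, i.e. less than a quarter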
That implies that there's another exciting opportunity for repository developers to up their game!

Thursday, 14 April 2011

Faculty of 1000 Posters - Still Looking for a Silver Bullet

The F1000 Open Access Poster Repository was brought to my attention by a recent Tweet. I love repositories with posters in - they're copyright-lite and very visually attractive - and I've long advocated for more use to be made of these kinds of scholarly communication. With some success, I have pushed hard for the poster artwork to be made available online in all the conferences I have been involved in organising.

The Faculty of 1000 has a special relationship with some Biomedical conferences, inviting authors to upload their posters to the open access F1000 site. Perhaps this is an effective new way of gaining open access to specific kinds of early-report research material?

The F1000 posters site contains 909 posters. 649 of those are derived from 28 invited conferences (an admirable average of 23 posters per conference), and the remaining 260 posters are uploaded on an ad hoc basis from authors attending 148 other conferences (an average of 1.7 posters per conference).

While it is clear that the invitation approach is much more effective than the laissez faire approach, the huge size of biomedical conferences (often displaying several thousand posters over the course of four days) means that the overall success rate of this OA strategy is only 4.2% (a figure I reached by counting the total number of posters at a sample of 7 of the 28 invited conferences).

So, still no silver OA bullet!


Monday, 21 March 2011

I Won't Review Green OA, It's Spam - I DO NOT LIKE IT Sam-I-Am

According to the Times Higher, Michael Mabe (chief executive of the International Association of Scientific, Medical and Technical Publishers and a visiting professor in information science at University College London) fears that repositories are essentially "electronic buckets" with no quality control. He also expressed doubts that the academy would be able to successfully introduce peer review to such repositories, partly because it would be difficult to attract reviewers who had no "brand allegiance" to the repositories.

Let's think about this....

Q: Who are the authors of papers?
A: Researchers.

Q: Who puts papers in repositories?
A: The authors.

Q: Who reviews papers?
A: The authors of other papers.

Q: Where do they get papers to review?
A: From a URL provided by the journal editorial board.

Q: Who are the editorial board?
A: Authors of other papers.

Q: Just remind me what the publishers do?
A: Their most important job is to organise the processes that get the peer review accomplished by the other authors (see above).

Q: Where does the brand value of a journal come from?
A: It's a bit complicated, but mainly from the prestige of the authors on the editorial board and the prestige of the papers that the authors write. There is a default brand that comes from the publishing company that owns the journal, but of course that comes recursively from the brand value of all the journals that it owns.

Q: "Electronic buckets" don't sound very valuable, do they?
A: No they certainly don't - I mean, imagine the kind of material that normally ends up in a bucket! Who would want to peer-review that? But hang on - who stores stuff in buckets anyway? That's a bit of a problematic metaphor for a storage system! Try replacing "buckets" with "library shelves" and the statement becomes more accurate. What kind of material do you find on library shelves? Things that people might want to read. Things that people might want to review.

Q: But how would authors know what to review in a repository without the publishing company's branding?
A: I suppose an editorial board would send them a URL.

Friday, 11 March 2011

You Can't Trust Everything You Read on the Web

Houston, we have a problem. It turns out that trusting repositories as authoritative sources of research information is all very well and good, except when the repository is an authoritative source of demonstration (fake) documents. Sebastien Francois (one of the EPrints team at Southampton) has just reported that Google Scholar is indexing the fake documents that we make available in demoprints.eprints.org.

So when your weaker students start citing
Freiwald, W. and Bonardi, X. and Leir, X. (1998) Hellbenders in the Wild. Better Farming, 1 (4). pp. 91-134.
you know that it's just a teensy misunderstanding, OK? But if anyone needs their citation count artificially boosting, I have a repository available to monetize.

Monday, 7 March 2011

Google, Content Farms and Repositories

In recent news, Google has altered its ranking algorithms to favour sites with original material rather than so-called content farms that simply redistribute material found on other sites. Although users report satisfaction with improved results, this action has caused quite a furore, with some genuine sites losing significant business as well.

I have been worried about how this would affect repositories; after all, we technically fit the definition of a content farm: a site that exists to redistribute material that is published elsewhere. Bearing in mind that Google delivers the vast majority of our visitors to us, if the changes were to impact on our rankings, we might suffer quite badly. Now that there have been a couple of weeks for the changes to migrate around the planet, our usage stats point to business as usual.

First of all, downloads over the last quarter - no dramatic tailoffs in the last week.

And a comparison with last year (apologies for the different vertical scale) shows year-on-year stability.

So good news there: our repositories haven't been classed as valueless redistribution agents. That would have been a bit of a blow to our morale!

Sunday, 6 March 2011

The Missing Sixth Star of Open Linked Data?

In my previous posting I proposed the idea of the 5 stars of open access. There is of course one feature that the original "taxonomy" misses out completely - repositories! Not just "my favourite repository platform", but the idea of persistent, curated storage. Consequently, my proposal for open access doesn't mention repositories - a bit of an oversight!

At the moment, the entry level to the 5 stars is simply "put it on the web, with an open license". Perhaps we should change this to "put it in a repository with an open license"; perhaps we could designate a "zeroth star" for "just put it on the Web". However, the Linked Data Research Lab at DERI already propose a no-star level, which involves material being put on the web without an explicit license.

You can get away with putting material on the Web without any concern for its future safety - but not for long, especially if you want to build services on top of that material.

CKAN (the Comprehensive Knowledge Archive Network, http://ckan.net/) is a registry of open knowledge packages currently favoured by the open data community. It is built on a simple content management environment, and by November 2010 it was already returning HTTP 400- and 500-class error codes for 9% of its listed data source URLs.
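That kind of link-rot survey is easy to reproduce. Here is a sketch in Python that fetches a package's resource URLs and tallies the 4xx/5xx failures; the REST route reflects my understanding of CKAN's current API, and the package name is hypothetical.

import json
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

def resource_urls(package):
    # CKAN's REST interface; treat this route as an assumption
    meta = json.load(urlopen("http://ckan.net/api/rest/package/" + package))
    return [r["url"] for r in meta.get("resources", [])]

def status(url):
    try:
        return urlopen(Request(url, method="HEAD"), timeout=10).getcode()
    except HTTPError as e:
        return e.code
    except URLError:
        return 500                          # unreachable host counts as broken

urls = resource_urls("uk-crime-data")       # hypothetical package name
broken = sum(1 for u in urls if status(u) >= 400)
print("%d of %d resource URLs are broken" % (broken, len(urls)))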

A more extreme example is seen in the UK, where police forces recently started to release data about crime reports. But "whenever a new set of data is uploaded, the previous set will be removed from public view, making comparisons impossible unless outside developers actively store it" (see The Guardian for more details).

Repositories have an opportunity to provide management, persistence and curation services to the open data community and its international collections of linked data. Which of our OA platforms is chosen (DSpace? EPrints? Fedora? Zentity?) is not the issue - it is the philosophy and practices of repositories that are vital to the Open Data community, because the data is important and long-lived.

On the other hand, I have argued that reuse (and in this case retention) is the enemy of access. "Just put it up on the Web" is an easier injunction than "deposit it in a repository" (especially if you haven't got a repository installed) and hence more likely to succeed. So we shouldn't put repositories on the Linked Data on-ramp (step/star 1), but if not there, then where should they go?

I would argue that by step 3 (using open formats) or 4 (adding value with identifiers and semantic web tech) the data provider is being asked to make a more substantial investment, and to boost the value of their data holdings. This seems to be an appropriate point to add in extra features, especially when they will help secure the results of that investment. So the 5 stars of Linked Data would mention repositories in Level 4, but the five stars of Open Access could do so in Level 1 because they are already an accepted part of OA processes.

I'm not sure I'm comfortable with mixing the levels - it makes for confusion. Wouldn't it be much better to have one set of processes that apply to all forms of openness - the basic principles of the Web? In my previous post I pointed out that you can add 5* links to 2* PDFs and spreadsheets, so I think the solution may lie in the fact that the 5 stars are not sequential stages, but five more-or-less independent principles that each make openness more valuable and useful: licensing, machine readability, open standards, entity identification and interlinking. To these we could add "sustainability", making a constellation of six linked data properties (see the diagram above).


Friday, 4 March 2011

The Five Stars of Open Access (aka Linked Documents)

Yesterday I was having a discussion about Scholarly Communications, Open Access, Web 2 and the Semantic Web with some colleagues in our newly formed "Web and Internet Science Research Group" at Southampton. As we were comparing and contrasting more than a decade's experience of open access/open data/OER/Open Government Data, we made the following observation: reuse is the enemy of access.

There have been efforts to replace PDF with HTML as a scholarly format to make data mining easier, and movements to establish highly structured Learning Objects, rich in pedagogic metadata, to facilitate the interoperability of e-learning material. (I have been involved in both!) But both have been ignored by the community - they are too hard, they fly in the face of current practice, and they involve users learning new skills or making more effort. Some would argue that similar comments could be made about preservation and open access, or even just repositories and open access.

Although "reuse is the enemy of access" is quite a bold statement it's really just a reformulation of the old saw "the best is the enemy of the good". Attempts to do something with the material we have available are always more complex than just looking at the material we have available. Adding services, however valuable and desirable, are more problematic than "just making material available". In the repository community we've worked hard to help users get something for nothing (or something for as little effort as possible), and I'm proud that people recognise that philosophy in EPrints. But it's still a tension - you have to present Open Access as a bandwagon that's easy to climb on!

So I'm particularly impressed with Tim Berners-Lee's Five Stars of Linked Data as a means of declaring an easy onramp to the world of Linked Data, while at the same time setting out a clear means of evaluating and improving contributions and the processes required to support them. It allows the community to have their cake and eat it; to claim maximum participation (a bigger community is a more successful community) and appropriate differentiation (better value is a better agenda).

I think this approach would have served the Open Access communities (OA/OER/Open Data) very well (why didn't we think of it?) But it could yet do so, and so in the spirit of reuse I offer some early thoughts on the Five Stars of Open Access.
★ Available on the web (whatever format), but with an open licence
★★ Available as machine-readable editable data (e.g. Word instead of PDF page description)
★★★ as above plus non-proprietary format (e.g. HTML5 instead of Word)
★★★★ All the above plus, use open standards from W3C (RDF and microformats) to identify things, so that people can understand your stuff
★★★★★ All the above, plus: link your data to other people’s data to provide context i.e. link citations to DOIs and other entities to appropriate URIs (e.g. project names, author names, research groups, funders etc).
These are directly taken from Tim's document, with some subtle variations, and are intended for discussion. For a start, it shows that we haven't even got very far into 1-star territory, as we mainly fudge the licensing issue. (This comes from the fact that unlike data, our documents are often re-owned by third parties.) Pressing on, the second star is available for editable source documents rather than page images and this is also a minority activity. In our repository, there are 7271 PDFs vs 820 Office/HTML/XML documents. So a long way to go there. The third star seems even more remote (376 documents). And as for the fourth star's embedded metadata?
But the fifth star: this seems to be so valuable. If we could just get there - properly linked documents, no chasing down references, the ability to easily generate citation databases, easy lookup of the social network of authors. Sigh. What's not to like? And you can even add 5* facilities to PDF, so perhaps we will find some short cuts!

If we develop these five stars, it will help us to function as positive Open Access evangelists, while also promoting the future benefits that we would like to work towards. No mixed messages. No confusion.

Sunday, 27 February 2011

Open Access - Who Calls the Shots Now?

Three years ago on this blog (doesn't time fly!) I contrasted the efforts that librarians and academics could make in furthering Open Access. My argument (such as it was) focused on the relationships between the two communities, noting that when it came to research, librarians could only advise and assist but that academics could lead and command. Or at least in theory! In particular I backed the idea that change would come from senior managers in the academic world and from research funders. In the intervening time we have indeed seen a big increase in OA leadership in the form of mandates being adopted, but I wonder if the pace of change is not about to put even researchers in the back seat.

The Web was developed at CERN, in Switzerland, and took over the world in more than a geographic sense. It emerged from its home in a highly-funded, very collaborative, international research laboratory and carried the culture and design assumptions of its birthplace (open information exchange, minimal concern over intellectual property control, no requirements for individuals to monetize knowledge production) and stamped them on the rest of society, regardless of society's estimation of its own needs (for more, see the presentation The Information Big Bang & Its Fundamental Constants). One manifestation of the clash between the Web and "how society has historically operated" was the Budapest Open Access Initiative some ten years after the initial development of the Web.

The Web's culture of open information exchange has more recently had a very visible effect in the area of Open Government Data. A simple re-statement of the objectives of the Semantic Web as The Five Stars of Linked Data has powered a tremendous focus of activity in national and local government when allied with political agendas of Transparency and Accountability. Portals like data.gov.uk and data.gov provide access to "the raw data driving government forward" which can be used to "help society, or investigate how effective policy changes have been over time". In the UK, the Treasury's COINS database of public spending is one of 5,600 public datasets that have been made available as part of the initiative. In the US, the Open Government Directive requires each department to publish high value data sets and states that "it is important that policies evolve to realize the potential of technology for open government." Both US and UK government see the opening up of public data as the driver for political improvement, innovation and economic growth, with the Public Data Corporation as the focus of British development of an entire social and economic Open Data ecosystem.

Having watched Open Access lobbyists engage in political processes in the UK and US (with a handful of Senators, Congressmen and MPs sometimes for OA and sometimes against) it is rather a shock to see the President and the Prime Minister suddenly mandating a completely revolutionary set of national policies based on the technological affordances of the Web, and in the teeth of plenty of advisors' entrenched opposition. And rather a shock to realise that offices even more elevated than a vice chancellor are enthusiastically joining the world of open resources and open policies.

But data and publications are different things, and publications are privately owned by private publishing companies rather than stockpiled by the government. However, the decade of Open Access debate has shown that progress in OA (and OER and open data) is impeded more by individual and institutional inertia than corporate opposition. When the highest offices of government are confidently pushing forward a programme of open participation, will academics have the luxury of treading water?

How will our governments' sudden enthusiasm for open data affect Open Access? Perhaps not at all. Perhaps universities are too insulated from the administrative whims and shocks of Washington and Whitehall to be affected. (How many researchers have even heard of data.gov?) Even so, governments will indirectly cause a shakeup in the administration of public research funding, and the infrastructure needed for universities to respond adequately to the requirements of open funders will cause them to become more open themselves.

The public climate that informs the private OA debates and decisions in University boardrooms will change; pro-OA researchers and librarians will no longer be arguing from such a defensive position, not appearing as idealistic hippies. Even in the absence of direct government mandates, pro-OA decisions will be easier to support and less contentious to implement. The values of the research communities will change as public values and expectations change - when even governments become more accountable through open data, research communities that insist that their data and their research is their private property, for the sole benefit of the furtherance of their own careers, will soon appear old-fashioned and untenable.

So watch this space. It may be that Cameron and Obama will indirectly achieve what Harnad and Suber have been toiling for. I wonder what I'll have to say in another three years' time?

Rehabilitating The Third Star of Linked Data

The mantra of open data is: put your data on the web / with an open license / in a structured, reusable format / that is open / using open identifiers / that are linked with other data.

The third step/star in this process is commonly explained as using CSV rather than Excel (because the former is an open format, but the latter is taken to be a closed proprietary standard). You'll see this position stated at Linked Data Design at the W3C, and sites all around the world are copying it.

We really need to think a bit harder about this: Excel's native format is an open standard, and although an XML encoding of the complete semantics of a spreadsheet is hardly a straightforward thing to deal with, it is simple enough to extract data from. In particular, I don't see that it is significantly more difficult than dealing with CSV!

Once you've unzipped the Office Open XML data, you can iterate around the contents of the spreadsheet, or extract individual cells with ease. And without any .NET coding or impenetrable Microsoft APIs. Here's a simple example that lists the addresses and contents of all the cells in a spreadsheet.
<!-- List the address and contents of every cell in the worksheet -->
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <xsl:for-each select="/worksheet/sheetData/row/c">
      <xsl:value-of select="@r"/> = <xsl:value-of select="v"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

Of course it's simplified: I've missed off the namespaces, strings are actually stored in a lookaside table, and there can be multiple sheets in a single document. But even so, I'd rather wrangle XML than wrestle with CSV quoting any day.
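For comparison, here is roughly the same extraction in Python, with those two simplifications handled (the namespace and the shared-strings lookaside table); "data.xlsx" is a placeholder filename.

import zipfile
import xml.etree.ElementTree as ET

NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"

with zipfile.ZipFile("data.xlsx") as z:
    # The lookaside table: cells of type "s" hold indexes into this list
    strings = [t.text or "" for t in
               ET.fromstring(z.read("xl/sharedStrings.xml")).iter(NS + "t")]
    sheet = ET.fromstring(z.read("xl/worksheets/sheet1.xml"))
    for cell in sheet.iter(NS + "c"):
        value = cell.findtext(NS + "v", default="")
        if cell.get("t") == "s":            # value is a shared-string index
            value = strings[int(value)]
        print(cell.get("r"), "=", value)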

Tuesday, 18 January 2011

Response to An Open Letter on OpenAIRE

I hope that my friends and colleagues in the Open Access movement will forgive me for the following words, written in response to John Willinsky's blog post The Enlightenment 2.0: An Open Letter on OpenAIRE:

Oh punner! While you are Open,
Err on the side of caution.
O, pen airy words of knowledge
To fill us with a sparc of hope an' ne'er
Leave us Else ever 'ere.

I'll get my coat.