RepositoryMan
The Blog of a repository administrator and web scientist. Leslie Carr is a researcher and lecturer who runs a research repository for the School of Electronics and Computer Science in the University of Southampton in the UK. This blog is to record the day-to-day activities of a repository manager.
Thursday, 5 January 2012
In the six months since I analysed Mendeley's contribution to Computer Science OA in June 2011, they appear to have increased their membership of that community by 37%, and the ratio of full-text documents to community members has increased from 0.66 to 0.71. The number of OA documents has increased by 47% to 11,757, and the number of OA-active users (i.e. users who have made at least one document public through Mendeley's servers) has risen by 46% to 2,441, but this still represents only 15% of the total membership of that community.
Congratulations to Mendeley - their service is obviously rising in popularity and hence in significance to the community. OA analysts will note that the increase in open access documents comes from increased membership, rather than a change in behaviour of the community.
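The percentages in the 5 January 2012 update can be cross-checked against the June 2011 figures reported further down this page (12,102 listed members, 1,676 OA-active users, 8,014 public PDFs). What follows is only a rough sketch in Python. The January membership is not stated in the post, so it is estimated here from the quoted 37% growth, which is an assumption, as is the choice of the filtered directory listing as the baseline.

```python
# June 2011 baseline, taken from the "Download vs Upload Growth" post.
june_members = 12102    # members in the filtered CS directory listing
june_oa_users = 1676    # users who had publicly shared at least one PDF
june_pdfs = 8014        # publicly shared PDFs

# January 2012 figures quoted in the update.
jan_oa_users = 2441
jan_pdfs = 11757

# Assumption: membership is estimated from the quoted 37% growth.
jan_members = june_members * 1.37


def growth(old, new):
    """Percentage growth from old to new."""
    return 100.0 * (new - old) / old


pdf_growth = growth(june_pdfs, jan_pdfs)
user_growth = growth(june_oa_users, jan_oa_users)
docs_per_member_june = june_pdfs / june_members
docs_per_member_jan = jan_pdfs / jan_members
oa_active_share = 100.0 * jan_oa_users / jan_members

print(f"PDF growth:            {pdf_growth:.0f}%")
print(f"OA-active user growth: {user_growth:.0f}%")
print(f"Documents per member:  {docs_per_member_june:.2f} -> {docs_per_member_jan:.2f}")
print(f"OA-active share:       {oa_active_share:.0f}%")
```

Under those assumptions the rounded results agree with the figures quoted in the update, which suggests the two posts are measuring the same population.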
Wednesday, 26 October 2011
Rethinking the Open Access Agenda
I used to be a perfectly good computer scientist, but now I've been ruined by sociologists. Or at least that is what Professor Catherine Pope (the Marxist feminist health scientist who co-directs the Web Science Doctoral Training Centre with me) says. I am now as likely to quote Bruno Latour as Donald Knuth, and when I examine "the web" instead of a linked graph of HTML nodes I increasingly see a complex network of human activity loosely synchronised by a common need for HTTP interactions.
All of which serves as a kind of explanation of why I have come to think that we need to revisit the Budapest Open Access Initiative's obsession with information technology:
An old tradition and a new technology have converged to make possible an unprecedented public good. The old tradition is the willingness of scientists and scholars to publish the fruits of their research in scholarly journals without payment, for the sake of inquiry and knowledge. The new technology is the internet. The public good they make possible is the world-wide electronic distribution of the peer-reviewed journal literature and completely free and unrestricted access to it by all scientists, scholars, teachers, students, and other curious minds. Removing access barriers to this literature will accelerate research, enrich education, share the learning of the rich with the poor and the poor with the rich, make this literature as useful as it can be, and lay the foundation for uniting humanity in a common intellectual conversation and quest for knowledge. (see http://www.soros.org/openaccess/read)
BOAI promises that the "new technology" of the Internet (actually the Web) will transform our relationship to knowledge. But that was also one of the promises of the electric telegraph more than a century ago:
From the telegraph's earliest days, accounts of it had predicted "great social benefits": diffused knowledge, collective amity, even the prevention of crimes. (Telegraphic Realism: Victorian Fiction and Other Information Systems by Richard Menke.)
There has been much good and effective work to support OA from both technical and policy perspectives - Southampton's part includes the development of the EPrints repository platform as well as the ROAR OA monitoring service - but critics still point to a disappointing amount of fruit from our efforts. Repositories multiply and green open access (self-deposited) material increases; knowledge about (and support for) OA has spread through academic management, funders and politicians, but it has not yet become a mainstream activity of researchers themselves. And now, a decade into the Open Access agenda, we are grasping the opportunity to replay all our missteps and mistakes in the pursuit of Open Data.
I am beginning to wonder whether by defining open access as a phenomenon of scholarly communication, we mistakenly created from the outset an alien and unimportant concept for the scientists and scholars who long ago outsourced the publication process to a support industry. As a consequence, OA has been best understood by (or most discussed by) the practitioners of scholarly and scientific communication - librarians and publishers - rather than by the practitioners of scholarship and science.
We have seen that the challenge of the Web can't be neatly limited to dissemination practices. In calling for researchers to open the outputs of their research, we inevitably ask them to reconsider the relationship that they have with their own work, their immediate colleagues, their academic communities, their institutions, their funders and their public. It turns out that we haven't been able to divorce the output of research from the conduct and the context of research activity. Let's move on from there.
In a recent paper Openness as infrastructure, John Wilbanks discussed the three missing components of an open infrastructure for science: the infrastructure to collaborate scientifically and produce data, the technical infrastructure to classify data and the legal infrastructure to share data - extending the technical infrastructure with a legal framework. I think that we need to go further and refocus our efforts and our rhetoric about "Open Access to Scientific Information" towards "Open Activity by Scientists" supported by three kinds of infrastructure:
- Human Engagement
- Methodological Analysis and
- Social Trust.
What I'm saying isn't new - there has been lots of effort and discussion about improving the benefits of repository technology to the end user/researcher, and about lowering the barriers to use. JISC has funded a number of projects in its Deposit programme, trying various strategies to increase user engagement with OA. As well as continuing to pursue this approach, we also need to step back from obsessing about the technology of information delivery, think bigger thoughts about scientific people and scientific practice, and tell a bigger and more relevant story.
Sunday, 9 October 2011
Using EPrints Repositories to Collect Twitter Data
A number of our Web Science students are doing work analysing people's use of Twitter, and the tools available to them are rather limited now that Twitter has changed its terms of service, reducing the functionality of TwapperKeeper and similar sites. There are personal tools like NodeXL (a plugin for Microsoft Excel running under Windows) that do provide simple data capture from social networks, but a study will require long-term data collection over many months that is independent of reboots and power outages.
They say that to a man with a hammer, the solution to every problem looks like a nail. And so perhaps it is unsurprising that I see a role for EPrints in helping students and researchers to gather, as well as curate and preserve, their research data. Especially when the data gathering requires a managed, long-term process that results in a large dataset.
In collecting large, ephemeral data sets (tweets, Facebook updates, YouTube uploads, Flickr photos, postings on email forums, comments on web pages) a repository has a choice between:
(1) simply collecting the raw data, uninterpreted and requiring the user to analyse the material with their own programs in their own environments
(2) partially interpreting the results and providing some added value for the user by offering intelligent searches, analyses and visualisations to help the researchers get a feel for the data.
We experimented with both approaches. The first sounds simple and more appropriate (don't make the repository get in the way!), but in the end the job of handling, storing and providing a usable interface to the collection of temporal data means that some interpretation of the data is inevitable.
So instead of just constantly appending a stream of structured data objects (tweets, emails, whatever) to an external storage object (a file, database or cloud bucket), we ingest each object into an internal EPrints dataset with an appropriate schema. There is a tweet dataset for individual tweets, and a timeline dataset for collections of tweets - in theory, multiple timelines can refer to the same objects in the tweet dataset. These datasets can be manipulated by the normal EPrints API and managed by the normal EPrints repository tools: you can search, export and render tweets in the same way that you can for eprints, documents, projects and users.
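The two-dataset arrangement can be illustrated with a small sketch. This is not the EPrints API (EPrints itself is written in Perl), and every class, field and method name below is hypothetical. It only shows the idea of storing each tweet once and letting any number of timelines refer to it by identifier.

```python
from dataclasses import dataclass, field


@dataclass
class Tweet:
    tweet_id: int
    author: str
    text: str
    created_at: str  # ISO 8601 timestamp


@dataclass
class Timeline:
    search: str                                    # search parameters given by the user
    tweet_ids: list = field(default_factory=list)  # references into the tweet dataset


class Store:
    """Hypothetical stand-in for the two repository datasets."""

    def __init__(self):
        self.tweets = {}     # tweet dataset: one record per tweet id
        self.timelines = {}  # timeline dataset: one record per search

    def ingest(self, search, tweet):
        # Store the tweet once, however many timelines it matches.
        self.tweets.setdefault(tweet.tweet_id, tweet)
        timeline = self.timelines.setdefault(search, Timeline(search))
        if tweet.tweet_id not in timeline.tweet_ids:
            timeline.tweet_ids.append(tweet.tweet_id)

    def render(self, search):
        """Resolve a timeline's references back to tweet records."""
        return [self.tweets[i] for i in self.timelines[search].tweet_ids]
```

Ingesting the same tweet under two different searches leaves one record in the tweet store and one reference in each timeline, which is the sharing described above.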
EPrints collects Twitter data by regular calls to the Twitter API, using the search parameters given by the user. The figure shows the results of a data collection (on the hashtag "drwho") resulting in a single Twitter timeline that is rendered as HTML for the Manage Records page. In this rendering, the timeline of tweets is shown as normal on the left of the window, with lists of top tweeters, top mentions, top hashtags and top links together with a histogram of tweet frequency on the right. These simple additions serve to give an overview of the data to the researcher - not to take the place of their bespoke data analysis software, but simply to help understand some of the major features of the data as it is being collected. The data can be exported in various formats (JSON, XML, HTML and CSV) for subsequent processing and analysis. The results of this analysis can themselves be ingested into EPrints for preservation and dissemination, along with the eventual research papers that describe the activity.
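The overview described here (top tweeters, mentions, hashtags and links, plus a frequency histogram) takes only a few lines to compute. The sketch below is an illustration in Python rather than the code behind the EPrints rendering. The tweet dictionary layout, the function names and the crude regular expressions are all assumptions made for the example.

```python
import json
import re
from collections import Counter

LINK = re.compile(r"https?://\S+")


def summarise(tweets, top_n=5):
    """tweets: dicts with 'author', 'text' and 'created_at' (ISO 8601) keys."""
    authors, mentions, hashtags, links, per_day = (Counter() for _ in range(5))
    for tweet in tweets:
        text = tweet["text"]
        authors[tweet["author"]] += 1
        links.update(LINK.findall(text))
        # Strip links first so that URL fragments are not counted as hashtags.
        plain = LINK.sub(" ", text)
        mentions.update(name.lower() for name in re.findall(r"@(\w+)", plain))
        hashtags.update(tag.lower() for tag in re.findall(r"#(\w+)", plain))
        per_day[tweet["created_at"][:10]] += 1  # bucket by the YYYY-MM-DD prefix
    return {
        "top_tweeters": authors.most_common(top_n),
        "top_mentions": mentions.most_common(top_n),
        "top_hashtags": hashtags.most_common(top_n),
        "top_links": links.most_common(top_n),
        "histogram": sorted(per_day.items()),
    }


def export_json(summary):
    """Serialise the overview for later processing (one of several possible formats)."""
    return json.dumps(summary, indent=2)
```

Bucketing by day is an arbitrary choice; a busy hashtag would probably want hourly buckets instead.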
All this functionality will soon be released as an EPrints Bazaar package; as of the time of writing we are about to release it for testing by our graduate students. The infrastructure that we have created will then be adapted for other Web temporal data capture sources as mentioned above (Flickr, YouTube, etc).
Figure: EPrints Twitter Dataset, Rendered in HTML
Sunday, 26 June 2011
Mendeley: Measuring OA rates
Having talked about Mendeley's OA deposit rates in my last blog post, I thought it worthwhile to check how representative my chosen discipline (Computer Science) was. Rather than download the entire community for each other discipline, I have performed a quick and dirty sample of some of the available literature in each discipline using the search function. Each Mendeley search result offers the option of saving the PDF (if available) to your library, so it is a simple matter to wget some search results and grep for PDFs.
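The wget-and-grep step can be approximated in Python once the search result pages have been saved to disk. The post does not say which string was grepped for, so the `.pdf` marker below is a hypothetical stand-in, as are the function names and the one-file-per-keyword layout.

```python
import re
from pathlib import Path

# Hypothetical marker: the post does not record the exact pattern that was used.
PDF_MARKER = re.compile(r"\.pdf\b", re.IGNORECASE)


def count_pdf_offers(html):
    """Count the lines of a saved results page that mention a PDF, like `grep -c`."""
    return sum(1 for line in html.splitlines() if PDF_MARKER.search(line))


def survey(directory):
    """Map each saved page (for example software.html) to its PDF count."""
    return {
        page.stem: count_pdf_offers(page.read_text(encoding="utf-8", errors="replace"))
        for page in sorted(Path(directory).glob("*.html"))
    }
```

Like `grep -c`, this counts matching lines rather than matches, so a page that puts several results on one line would be under-counted.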
The table below shows the results of this procedure for 12 disciplines (Computer Science plus 11 others, with two illustrative keywords each). The "available PDFs" column records the number of PDFs offered on the first page of the search results (each page contains 200 results); the total number of results shows the relative coverage of the topic in Mendeley.
Computer Science appears to be in the 5-10% range of OA (18 or 11 PDFs out of a page of 200 results) which does seem to be just about average. Social Science, Medicine, Health Science, Economics and the Humanities appear to have fewer PDFs and Maths and Physics appear to have rather more.
| Search term | Discipline | Available PDFs | Total Results |
| --- | --- | --- | --- |
| chromatography | Chem | 10 | 14260 |
| crystallography | Chem | 27 | 4921 |
| JAVA | CS | 18 | 848 |
| software | CS | 11 | 15185 |
| geology | Earth | 36 | 4180 |
| hydrodynamic | Earth | 40 | 2853 |
| econometrics | Economics | 13 | 565 |
| microeconomics | Economics | 5 | 88 |
| biodiversity | Env | 14 | 4668 |
| climate | Env | 14 | 13003 |
| nursing | Health | 6 | 10723 |
| palliative | Health | 6 | 1978 |
| archaeology | Hum | 6 | 1730 |
| Foucault | Hum | 11 | 248 |
| algebra | Math | 101 | 4424 |
| cohomology | Math | 171 | 525 |
| cancer | Med | 11 | 52315 |
| pharmacology | Med | 4 | 62285 |
| quasar | Phys | 127 | 556 |
| telescope | Phys | 101 | 2347 |
| cognition | Psy | 11 | 18805 |
| schizophrenia | Psy | 17 | 4055 |
| criminology | SocSci | 2 | 154 |
| sociology | SocSci | 2 | 2005 |
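To turn the table above into rough per-discipline rates, the PDF counts can be divided by the number of results actually shown on the first page. The sketch below assumes that a search with fewer than 200 hits shows all of them on that page; the function name and data layout are illustrative only.

```python
# (search term, discipline, available PDFs, total results), copied from the table.
ROWS = [
    ("chromatography", "Chem", 10, 14260), ("crystallography", "Chem", 27, 4921),
    ("JAVA", "CS", 18, 848), ("software", "CS", 11, 15185),
    ("geology", "Earth", 36, 4180), ("hydrodynamic", "Earth", 40, 2853),
    ("econometrics", "Economics", 13, 565), ("microeconomics", "Economics", 5, 88),
    ("biodiversity", "Env", 14, 4668), ("climate", "Env", 14, 13003),
    ("nursing", "Health", 6, 10723), ("palliative", "Health", 6, 1978),
    ("archaeology", "Hum", 6, 1730), ("Foucault", "Hum", 11, 248),
    ("algebra", "Math", 101, 4424), ("cohomology", "Math", 171, 525),
    ("cancer", "Med", 11, 52315), ("pharmacology", "Med", 4, 62285),
    ("quasar", "Phys", 127, 556), ("telescope", "Phys", 101, 2347),
    ("cognition", "Psy", 11, 18805), ("schizophrenia", "Psy", 17, 4055),
    ("criminology", "SocSci", 2, 154), ("sociology", "SocSci", 2, 2005),
]
PAGE_SIZE = 200  # results per page, as stated in the post


def rates_by_discipline(rows):
    """Percentage of first-page results that offer a PDF, pooled per discipline."""
    totals = {}
    for _term, discipline, pdfs, total in rows:
        # Assumption: a search with fewer hits than a full page shows them all.
        shown = min(PAGE_SIZE, total)
        found, seen = totals.get(discipline, (0, 0))
        totals[discipline] = (found + pdfs, seen + shown)
    return {d: 100.0 * found / seen for d, (found, seen) in totals.items()}


for discipline, rate in sorted(rates_by_discipline(ROWS).items(), key=lambda item: item[1]):
    print(f"{discipline:10s} {rate:5.1f}%")
```

With only two keywords per discipline, these rates are as quick and dirty as the sample itself and should be read as indicative only.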
Mendeley: Download vs Upload Growth
There was a lot of talk about Mendeley at OAI7 in Geneva, especially the news that in the first quarter of 2011 the number of articles downloaded for free jumped from 300,000 to 800,000. That's really good news, confirming Mendeley as a successful service in the Open Access domain. Having done an analysis of Mendeley's impact on Open Access (see Comparing Social Sharing of Bibliographic Information with Institutional Repositories) just under a year ago, I thought I'd repeat the analysis to see the extent of the impact of their growth on deposits as well as downloads.
Results: the number of members of the Computer Science discipline appears to be 2.2x larger than last August (increased from 34,230 to 74,736). Of these, only 12,102 appear in the Computer Science directory listing, whose contents are now filtered by Mendeley according to their "profile completion"; the gross number was kindly provided for me by Steve Dennis at Mendeley. This filtering takes care of the long tail of accounts that have never been used. Of the filtered users, 1,676 are "OA active", having publicly shared at least one PDF document (up 21% on last August). The total number of PDFs shared by this group is 8,014, up 16% on last August, with 4.8 PDFs being shared per "active OA user" (down from 5.0 last August).
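The ratios in these results are easy to re-derive. In the sketch below, the values for last August are not the measured ones: they are back-calculated from the quoted growth percentages and are therefore estimates, and the variable names are purely illustrative.

```python
members_last_august = 34230
members_now = 74736
listed_members = 12102   # members in the filtered directory listing
oa_active_users = 1676
shared_pdfs = 8014

membership_ratio = members_now / members_last_august
pdfs_per_active_user = shared_pdfs / oa_active_users
oa_active_share = 100.0 * oa_active_users / listed_members

# Estimates only: back-calculated from the quoted 21% and 16% growth.
implied_users_last_august = oa_active_users / 1.21
implied_pdfs_last_august = shared_pdfs / 1.16
implied_pdfs_per_user_last_august = implied_pdfs_last_august / implied_users_last_august

print(f"Membership ratio:                 {membership_ratio:.1f}x")
print(f"PDFs per OA-active user:          {pdfs_per_active_user:.1f}")
print(f"OA-active share of listed users:  {oa_active_share:.1f}%")
print(f"Implied PDFs per user last August: {implied_pdfs_per_user_last_august:.1f}")
```

The back-calculated figure for last August rounds to the 5.0 PDFs per active user quoted in the text, so the growth percentages and the per-user ratios are mutually consistent.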
So a big increase in user numbers results in a small increase in publicly shared PDFs, confirming (I think) that Mendeley are not preaching to the choir, and are mainly attracting users who are not already "OA active". Users of Mendeley have clearly transitioned from "scholarly knowledge collectors" to "scholarly knowledge sharers". The challenge still remains how to change their behaviour from "scholarly asset maintainers" to "scholarly asset sharers".
Wednesday, 27 April 2011
Experimenting With Repository UI Design
I'm always on the lookout for engaging UI paradigms to inspire repository design, and I recently noticed that Blogger has made some new "dynamic views" available. It provides a variety of smart presentation styles that aren't a million miles away from the ones emerging on smartphone apps, combining highly visual and animated layouts.
So I've imported some repository contents into Blogger to get some hands-on experience, and I'd be interested in any feedback on whether this looks useful or compelling.
- The new blog is called Mike O'Lection - it's a little DSpace repository joke.
- New views:
  - Sidebar: http://mikeolection.blogspot.com/view/sidebar
  - Timeslide: http://mikeolection.blogspot.com/view/timeslide
  - Mosaic: http://mikeolection.blogspot.com/view/mosaic (very Tumblr)
  - Snapshot: http://mikeolection.blogspot.com/view/snapshot
  - Flipcard: http://mikeolection.blogspot.com/view/flipcard
- Original repository pages: http://eprints.ecs.soton.ac.uk/17386/, http://eprints.ecs.soton.ac.uk/21289/, http://eprints.ecs.soton.ac.uk/21622/, http://eprints.ecs.soton.ac.uk/21030/
These views suit various different types of material, but the constant theme that is emerging is that a good visual is pretty much de rigueur for any resource. This means that relying on the thumbnail image of an article's first page is not going to be a good strategy (hint: they all look the same). I can foresee the need to extract figures and artwork from the PDFs and Office documents uploaded to a repository.
(Over the next few days I hope to put some more examples on the blog to help get a better feel for how this will work. But I think I might make a bulk Blogger exporter for EPrints because manual cutting and pasting is only enjoyable for a few minutes!)
Tuesday, 26 April 2011
Mobile Use of Repositories
While looking at the impact of mobile devices on the development of the Web I found useful information in this March 2011 press release from web analytics company StatCounter, charting the rise of Android.
StatCounter data also pinpoints the rise and rise of mobile devices to access the Internet. The use of mobile to access the Internet compared to desktop has more than doubled worldwide from 1.72% a year ago to 4.45% today. The same trend is evident in the US with mobile Internet usage more than doubling over the past year from 2.59% to 6.32%.
I thought I'd see whether this behaviour applies equally to repositories, so I had a poke around in the usage stats for eprints.ecs.soton.ac.uk and this is what I found:
- 53,285 PDF downloads between 27 March 2011 (4am) and 3 April 2011 (4am).
- Of these, 33,304 are attributed to crawlers and 19,981 to real browsers.
- Only 0.93% of the browser downloads occur on mobile devices (70% iOS, 22% Android, 7% Blackberry and 1% Symbian).
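For a sense of scale, the percentages in the list above can be converted back into approximate download counts. The post gives only rounded percentages for mobile use, so the mobile and per-platform counts below are estimates, not figures taken from the logs.

```python
total_downloads = 53285
crawler_downloads = 33304
browser_downloads = 19981
mobile_percent = 0.93  # share of browser downloads made on mobile devices
platform_percent = {"iOS": 70, "Android": 22, "Blackberry": 7, "Symbian": 1}

crawler_share = 100.0 * crawler_downloads / total_downloads

# Estimates reconstructed from rounded percentages.
mobile_downloads = round(browser_downloads * mobile_percent / 100)
platform_estimates = {
    name: round(mobile_downloads * percent / 100)
    for name, percent in platform_percent.items()
}

print(f"Crawler share of all downloads: {crawler_share:.1f}%")
print(f"Estimated mobile downloads:     {mobile_downloads}")
for name, count in platform_estimates.items():
    print(f"  {name}: about {count}")
```

On these numbers the whole mobile audience for the week amounts to a couple of hundred downloads, well below the worldwide 4.45% share reported by StatCounter.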
That implies that there's another exciting opportunity for repository developers to up their game!