Sunday 27 February 2011

Rehabilitating The Third Star of Linked Data

The mantra of open data is: put your data on the web / with an open license / in a structured, reusable format / that is open / using open identifiers / that are linked with other data.

The third step/star in this process is commonly explained as using CSV rather than Excel, (because the former is an open format, but the latter is a closed proprietary standard). You'll see this position stated at Linked Data Design at the W3C and sites all around the world are copying it.

We really need to think a bit harder about this: Excel's native format is an open standard, and although an XML encoding of a the complete semantics of a spreadsheet is hardly a straightforward thing to deal with, it is simple enough to extract data from. In particular, I don't see that it is significantly more difficult than dealing with CSV!

Once you've unzipped the Office Open XML data, you can iterate around the contents of the spreadsheet, or extract individual cells with ease. And without any .NET coding or impenetrable Microsoft APIs. Here's a simple example that lists the addresses and contents of all the cells in a spreadsheet.
<xsl:template match='/'>
<xsl:for-each select="/worksheet/sheetData/row/c">
<xsl:value-of select="@r"/> = <xsl:value-of select="v"/>
</xsl:for-each>
</xsl:template>

Of course it's simplified: i've missed off the namespaces, and strings are actually stored in a lookaside table and there are multiple sheets in a single document, but even so I'd rather wrangle XML than wrestle with CSV quotes any day.

3 comments:

  1. The programming language you use doesn't have a CSV library? I don't fuss with quotes.

    ReplyDelete
  2. Also, if you need to unlock some data from non XML Excel:
    http://graphite.ecs.soton.ac.uk/excel2csv

    So, what is a good example to use for 2* data? We need one to replace excel!

    ReplyDelete
  3. Does the programming language I use not have a CSV library? No, last time I looked "sh" didn't have any support for CSV, because sed/awk/grep make a right mess of it. But at least I have a choice of common XSLT tools that I can embed into my scripts.

    But Perl has library support for operations on binary Excel files, so no-one should be at a real disadvantage in any case!

    ReplyDelete