Monday, 30 July 2007

Importing Frustrations

It all sounds so easy: "just import it from BibTeX". But of course, the ACM's idea of what should go where in BibTeX doesn't fit with mine, or with my repository's. For example, my repository has an "Official URL" field to indicate where the "official publisher's version" (ahem) is to be found. The ACM (bless 'em) instead provide a "DOI" field. That's a straightforward enough mismatch and easy to work around, but to confuse matters further they don't put a DOI in the DOI field; they put a URL there. The URL happens to be that of a DOI resolution service (their own) with the DOI stuck on the end. This, as it happens, is very easy for a human to use, but a bit of a pain for a service to interpret. Only a little bit of a pain, I hear you cry! But these import scripts are supposed to be little pieces of easy-to-write code that adapt a well-understood interop format to my database schema. Am I supposed to write a different BibTeX importer for each blooming publisher? Ick! Or am I to write a mega-disambiguation script that can understand what the data provider should have said?
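
For what it's worth, the "little bit of a pain" might look something like the sketch below. It is not the repository's actual import code; the function name normalise_doi_field, the regular expression and the sample DOI are all made up for illustration. It assumes the "DOI" value is either a bare DOI or a resolver URL with the DOI on the end, and it assumes that falling back to the public https://doi.org/ resolver is acceptable for the "Official URL" field.

```python
import re
from urllib.parse import urlparse, unquote

# Loose pattern for a DOI: "10.", a numeric registrant code, a slash, a suffix.
# This is an approximation for illustration, not a full DOI grammar.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


def normalise_doi_field(value):
    """Split a BibTeX 'doi' value into (bare_doi, official_url).

    Hypothetical helper: accepts a bare DOI, a "doi:"-prefixed DOI, or a
    resolver URL with the DOI stuck on the end. Returns None for any part
    that cannot be worked out.
    """
    value = value.strip()
    # Drop an optional "doi:" prefix so it is not mistaken for a URL scheme.
    if value.lower().startswith("doi:"):
        value = value[4:].strip()

    parsed = urlparse(value)
    if parsed.scheme in ("http", "https"):
        # The publisher gave us a URL: keep it, and look for the DOI in its path.
        candidate = unquote(parsed.path)
        official_url = value
    else:
        candidate = value
        official_url = None

    match = DOI_PATTERN.search(candidate)
    if match is None:
        return None, official_url

    doi = match.group(0)
    if official_url is None:
        # Assumption: the public resolver is a fair stand-in for "Official URL".
        official_url = "https://doi.org/" + doi
    return doi, official_url


# The DOI below is a placeholder, not a real paper.
print(normalise_doi_field("http://doi.acm.org/10.1145/1234567.1234568"))
print(normalise_doi_field("10.1145/1234567.1234568"))
```

Which only proves the point: it is a dozen lines of guesswork about what one publisher meant, and the next publisher will doubtless need a different dozen.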

Also, there's that little matter of the missing abstract, so I have to roll my own BibTeX by data scraping anyway. Roll on RDF! (But then of course, even with all the hordes of Semantic Web technologists, you can make the same mistakes with RDF that you can with BibTeX.)

Or do I just make do with whatever little scraps of help the importer does get right, and manually enter the rest (using my army of self-archiving slaves)? What's the Zen thing?

