[gcs-pcs-list] Autodiscovery and embedding metadata

Richard Cameron camster at citeulike.org
Mon Dec 20 07:23:57 EST 2004


Following on from Dan's email, I thought I'd kick things off on the 
"autodiscovery" problem.

I've got a rather personal perspective on this, which is to say there's 
an immediate application: I run a "social bookmarking" site for 
academic papers (http://www.citeulike.org), and a lot of my time is 
spent trying to solve the "URL to metadata" problem.

The problem is that when one of my users finds an article on the web 
(say, on PubMed, to give a concrete example), my server ends up with the 
URL. I then fetch that page myself, and go about the business of trying 
to figure out what the metadata associated with that article should be. 
A lot of this involves some pretty horrendous scraping code, which is 
fragile and an absolute pain to write.
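
To illustrate why this sort of scraping is fragile, here is a minimal 
Python sketch. It is not the actual CiteULike code: the page layout, the 
class names ("article-title", "author") and the function name scrape are 
all made up for the example. The scraper works only as long as the 
publisher's markup stays exactly as guessed.

```python
from html.parser import HTMLParser


class ArticleScraper(HTMLParser):
    """Scrape title and authors from one hypothetical page layout."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.authors = []
        self._field = None  # which field the current text belongs to

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        # Hard-coded guesses about the publisher's markup (assumed names):
        # this is the fragile part.
        if tag == "h1" and css_class == "article-title":
            self._field = "title"
        elif tag == "span" and css_class == "author":
            self._field = "author"

    def handle_endtag(self, tag):
        # Any closing tag ends the field we were collecting.
        self._field = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._field == "title":
            self.title = text
        elif self._field == "author":
            self.authors.append(text)


def scrape(html):
    """Return whatever metadata the layout-specific rules can find."""
    scraper = ArticleScraper()
    scraper.feed(html)
    scraper.close()
    return {"title": scraper.title, "authors": scraper.authors}


# Hypothetical article page as the scraper expects it.
OLD_LAYOUT = """
<html><body>
<h1 class="article-title">An Example Paper</h1>
<span class="author">A. Author</span>, <span class="author">B. Author</span>
</body></html>
"""

# Same article after a hypothetical redesign: only the class names changed.
NEW_LAYOUT = OLD_LAYOUT.replace("article-title", "headline").replace(
    'class="author"', 'class="byline"'
)

if __name__ == "__main__":
    print(scrape(OLD_LAYOUT))  # finds the title and both authors
    print(scrape(NEW_LAYOUT))  # finds nothing, although the article is the same
```

A purely cosmetic change to the page is enough to make the scraper come 
back empty-handed, and every site needs its own set of such rules.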

If only, I thought, there was a standard way of embedding the data into 
the HTML page itself. It seems Dan's been thinking about exactly the 
same problem from a slightly different perspective, and proposed a 
number of possible ways of doing this. I've tried to summarise his 
options on a Wiki page:


If anyone has any thoughts, or can think of why any of these methods 
would be unsuitable for a particular domain, shall we talk about it 
on the list? Feel free to edit the Wiki too, but it might be wise to 
reach some consensus on the list first.
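
As a starting point for that discussion, here is a rough Python sketch of 
what reading embedded metadata could look like from the consuming side. 
It assumes one possible convention, Dublin Core style <meta> tags with a 
"DC." prefix in the page head; that is an assumption for illustration 
and not necessarily one of the options on the Wiki. The names 
MetaTagReader and read_metadata are likewise invented for the example.

```python
from html.parser import HTMLParser


class MetaTagReader(HTMLParser):
    """Collect <meta name="DC.xxx" content="..."> pairs from a page."""

    def __init__(self, prefix="DC."):
        super().__init__()
        self.prefix = prefix  # assumed naming convention for metadata tags
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name and content is not None and name.startswith(self.prefix):
            key = name[len(self.prefix):]
            # A field such as "creator" may repeat, so keep a list per key.
            self.metadata.setdefault(key, []).append(content)


def read_metadata(html):
    """Return a dict mapping each metadata field to its list of values."""
    reader = MetaTagReader()
    reader.feed(html)
    reader.close()
    return reader.metadata


# Hypothetical article page carrying its own metadata in the head.
PAGE = """
<html><head>
<title>Publisher site: An Example Paper</title>
<meta name="DC.title" content="An Example Paper">
<meta name="DC.creator" content="A. Author">
<meta name="DC.creator" content="B. Author">
<meta name="DC.date" content="2004-12-20">
<meta name="keywords" content="not, bibliographic, metadata">
</head><body><div class="whatever">...</div></body></html>
"""

if __name__ == "__main__":
    for field, values in read_metadata(PAGE).items():
        print(field, values)
```

The same few lines would work for any site that followed the convention, 
whatever the visible page looks like, which is the attraction compared 
with per-site scraping rules.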

I know a lot of people will be on holiday at the moment, and mincemeat 
and merriment are likely to be more interesting than talking about this, 
so perhaps it's something we can pick up again in the new year?


