From daniel.chudnov at yale.edu Wed Dec 15 10:50:02 2004
From: daniel.chudnov at yale.edu (Daniel Chudnov)
Date: Wed Dec 15 10:47:03 2004
Subject: [gcs-pcs-list] discussion on service autodiscovery here?
Message-ID: <41C05D2A.5020909@yale.edu>

Greetings from chilly Connecticut.

Although not much has happened on this list, plenty is happening all over, of course. But moving aside the topic of "big search company scanning our stuff" for the moment...

Jeremy Frumkin and I posted a public draft called "Service Autodiscovery for Rapid Information Movement." In many ways our thinking was informed by some of the core issues we discussed in Baltimore: how do you get to reasonable metadata for a given item, how do you find rights info for it also, and how do you wire up disparate systems to ask such things of each other asynchronously.

Its summary reads:

"In this informal paper we review several exciting recent advances in online information services, and in how users navigate through and move information between a multitude of resource types. We consider ways in which those advances bring us closer to, but still fall short of, meeting the ultimate user requirements of simplicity and coherence. We propose expanding on the model of "autodiscovery" as a means to enable both users and machines to move more freely between human and machine interfaces."

It's at:

http://curtis.med.yale.edu/dchud/writings/sa4rim.html

We've gotten some positive feedback from some folks who've read it, and are starting to talk about testing some of the service autodiscovery possibilities with live resources. A wiki has been started, and we'd also like to move some of the discussion to a list.

Since this list presumably comprises people who've already stated an interest in and much experience with such issues, I thought this might be a good home for that discussion. :) Also, we haven't exactly had much (any) traffic so far, so it seems a potentially good conversation re-starter.
So, do y'all mind if we discuss these issues here? If more people respond to me (off-list is okay too) that they do mind than not, we'll start a different list.

Many thanks, -Dan

--
Daniel Chudnov
Yale Center for Medical Informatics
(203) 737-5789

From arhyno at uwindsor.ca Fri Dec 17 11:14:47 2004
From: arhyno at uwindsor.ca (arhyno@uwindsor.ca)
Date: Fri Dec 17 11:15:29 2004
Subject: [gcs-pcs-list] discussion on service autodiscovery here?
In-Reply-To: <41C05D2A.5020909@yale.edu>
Message-ID:

Hey Dan,

Don't take the silence as complacency, I think you and Jeremy cover so much ground in your paper that a lot of us are still trying to absorb it all. By all means, a public discussion is a great way to move forward.

art

From daniel.chudnov at yale.edu Fri Dec 17 11:23:11 2004
From: daniel.chudnov at yale.edu (Daniel Chudnov)
Date: Fri Dec 17 11:21:03 2004
Subject: [gcs-pcs-list] discussion on service autodiscovery here?
In-Reply-To:
References:
Message-ID: <41C307EF.7050105@yale.edu>

arhyno@uwindsor.ca wrote:
>
> Don't take the silence as complacency, I think you and Jeremy cover so
> much ground in your paper that a lot of us are still trying to absorb it
> all. By all means, a public discussion is a great way to move forward.

Heh, thanks Art. Considering the email woes visited upon this campus (and others?) in the last week, I'd thought of pinging the list again just to make sure it's working, but now I know!

I'm sorry the paper is so long... at least it has lots of pictures. :)

-Dan

--
Daniel Chudnov
Yale Center for Medical Informatics
(203) 737-5789

From camster at citeulike.org Mon Dec 20 07:23:57 2004
From: camster at citeulike.org (Richard Cameron)
Date: Mon Dec 20 07:23:43 2004
Subject: [gcs-pcs-list] Autodiscovery and embedding metadata
Message-ID: <05362EA1-5282-11D9-91BA-000D9336C6A0@citeulike.org>

Hi,

Following on from Dan's email, I thought I'd kick things off on the "autodiscovery" problem.
I've got a rather personal perspective on this, which is to say there's an immediate application: I run a "social bookmarking" site for academic papers

http://www.citeulike.org

and a lot of my time is spent trying to solve the "URL to metadata problem".

The problem is that when one of my users finds an article on the web (say on PubMed to give a concrete example), my server ends up with the URL. I then fetch that page myself, and go about the business of trying to figure out what the metadata associated with that article should be. A lot of this involves some pretty horrendous scraping code, which is fragile and an absolute pain to write.

If only, I thought, there was a standard way of embedding the data into the HTML page itself. It seems Dan's been thinking about exactly the same problem from a slightly different perspective, and proposed a number of possible ways of doing this. I've tried to summarise his options on a Wiki page:

http://dbk.ch.umist.ac.uk/wiki/index.php?title=Metadata_in_HTML

If anyone has any thoughts, or can think of why any of these methods would be unsuitable for a particular domain then shall we talk about it on the list? Feel free to edit the Wiki too, but it might be wise to reach some consensus on the list first.

I know a lot of people will be on holiday at the moment, and mincemeat and merriment is likely to be more interesting than talking about this, so perhaps it's something we can pick up again in the new year?

Richard.

From arhyno at uwindsor.ca Mon Dec 20 15:06:09 2004
From: arhyno at uwindsor.ca (arhyno@uwindsor.ca)
Date: Mon Dec 20 15:11:03 2004
Subject: [gcs-pcs-list] Autodiscovery and embedding metadata
In-Reply-To: <05362EA1-5282-11D9-91BA-000D9336C6A0@citeulike.org>
Message-ID:

This topic sort of came up on the Code4Lib list at one point in the context of federated searching, how do you describe an interface in such a way that a community can share strategies to snag desired content?
I keep hoping that some of the plumbing proposed for achieving device independence, such as CC/PP, might offer a less fragile conduit to web resources than scraping the HTML. Still, despite the hype, I suspect that cellphone/pda browsers are still too minor in terms of audience to see these options any time soon.

In fact, I wonder if we will ever get completely away from the notion of screen scraping even if the hooks get dramatically better. There may always be content that lives outside of the kind of containers which will allow metadata to be injected very easily. I have been thinking about this lately in connection to doing some work with saxon 8 in cocoon. It's been a bit of a pain to set up but saxon 8 supports xslt 2.0, and xslt 2.0, in turn, supports regular expressions. Whether the combination of xpath and regular expressions amounts to solid scraping mechanisms is debatable, but xpath does give a good way to describe whatever structure is available in an HTML resource, and regular expressions seem to be the main fuel for text processing outside of the structure.

The combination doesn't completely solve the sometimes gruesome aspects of stylesheet syntax, but I wonder if xslt 2.0 is a framework for a kind of language-neutral option for describing how to harvest content that exists in a third-party resource. In other words, if you had a "recipe" for extracting metadata from HTML, what format would work best for plugging into the most environments? I have an example below of looking for isbns and modifying hrefs. As more development environments support xpath, would xslt provide the basis of one format to build a registry of patterns for metadata? If not, what would?

Otherwise, I hope everyone enjoys the mincemeat and merriment in the coming days.
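
As a rough illustration of the "looks like an isbn, then verify the check digit" idea mentioned above, here is a minimal Python sketch. It is not Art's XSLT example; it assumes 10-digit ISBNs only, and the names ISBN10_CANDIDATE, is_valid_isbn10 and find_isbns are hypothetical, chosen for this illustration.

```python
import re
from typing import List

# Candidate pattern (an assumption for this sketch): nine digits followed by
# a digit or X, with optional single hyphens or spaces between characters.
ISBN10_CANDIDATE = re.compile(r"\b(?:\d[- ]?){9}[\dXx]\b")


def is_valid_isbn10(candidate: str) -> bool:
    """Return True if the candidate passes the ISBN-10 check-digit test."""
    chars = [c for c in candidate.upper() if c not in "- "]
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X":
            # X stands for 10 and is only allowed as the final check digit.
            if position != 9:
                return False
            value = 10
        elif char.isdigit():
            value = int(char)
        else:
            return False
        # Weights run from 10 (first character) down to 1 (check digit).
        total += (10 - position) * value
    return total % 11 == 0


def find_isbns(text: str) -> List[str]:
    """Find strings that look like ISBN-10s and keep those that verify."""
    return [
        match.group(0)
        for match in ISBN10_CANDIDATE.finditer(text)
        if is_valid_isbn10(match.group(0))
    ]


if __name__ == "__main__":
    print(find_isbns("See ISBN 0-306-40615-2 and the mistyped 0-306-40615-3."))
```

The regular expression only proposes candidates; the check digit decides, which keeps phone numbers and other ten-digit strings from being mistaken for ISBNs most of the time.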
Through the actions of my colleagues, I have recently discovered that one of the joys of having chocolate available at all times in your workplace is that it can be a surprisingly compelling addition to traditional breakfast fare. I don't know why such options are not put in place all year around, though the answer may come to me during my nap.

art

--- this looks like an isbn, need to do a checkdigit to be sure!

From camster at citeulike.org Thu Dec 23 06:30:14 2004
From: camster at citeulike.org (Richard Cameron)
Date: Thu Dec 23 06:30:19 2004
Subject: [gcs-pcs-list] Autodiscovery and embedding metadata
In-Reply-To:
References:
Message-ID: <03B8BF2E-54D6-11D9-9A08-000393C0D098@citeulike.org>

On 20 Dec 2004, at 20:06, arhyno@uwindsor.ca wrote:

> In fact, I wonder if we will ever get completely away from the notion of
> screen scraping even if the hooks get dramatically better. There may
> always be content that lives outside of the kind of containers which will
> allow metadata to be injected very easily.

I suspect you're right, but if we can produce something which allows harvesters to pick at least the lowest hanging fruit off the tree then that would be progress. I've just spent ten minutes trying to parse just the title of an article out of a page which formats it pretty much inconsistently every time (more on this later).

I suspect there will always be esoteric data which doesn't fit into the model, but

a) That can be dealt with by a modular approach to the metadata format. E.g. XML namespaces.

b) If it's incredibly esoteric then the chances are it won't fit into the "lowest common denominator" of data which harvesters will be interested in, so it might not be the end of the world.
> Whether the combination of xpath and regular expressions amounts to
> solid scraping mechanisms is debatable, but xpath does give a good way
> to describe whatever structure is available in an HTML resource, and
> regular expressions seem to be the main fuel for text processing
> outside of the structure.

I had this argument with a friend the other day. To do XPATH on HTML kind of assumes you have valid XHTML. At one point my site was XHTML compliant, but because browsers are massively flexible in what they'll render, it just drifts away from compliance and there's no real motivation on my part to fix it. It's just a lot of effort which doesn't buy me anything except a little badge which says "Valid XHTML".

Even if it were, my head just doesn't seem to cope with thinking about scraping out information from web pages using XPATH. I find I'm about 20 times more productive doing such work in Perl or Tcl.

Maybe this is an argument for storing the metadata in a separate file linked to from the HTML, just like RSS does, and not actually encoded in the HTML document itself?

> The combination doesn't completely solve the sometimes gruesome aspects
> of stylesheet syntax, but I wonder if xslt 2.0 is a framework for a kind
> of language-neutral option for describing how to harvest content that
> exists in a third-party resource. In other words, if you had a "recipe"
> for extracting metadata from HTML, what format would work best for
> plugging into the most environments? I have an example below of looking
> for isbns and modifying hrefs. As more development environments support
> xpath, would xslt provide the basis of one format to build a registry of
> patterns for metadata? If not, what would?

What's the advantage of scraping in XPATH over Perl/Tcl/Python? Perl was designed for slicing and dicing pieces of text, and it does so with ruthless efficiency.
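
The "separate file linked to from the HTML, just like RSS does" option raised above could be discovered along these lines. This is only a sketch under assumptions: it reuses the rel="alternate" link convention from RSS autodiscovery, the application/rdf+xml media type is just a stand-in for whatever metadata format might be agreed on, and MetadataLinkFinder and discover_metadata_links are hypothetical names. It uses Python's tolerant html.parser, so the page does not need to be valid XHTML.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetadataLinkFinder(HTMLParser):
    """Collect <link rel="alternate"> targets with a wanted media type."""

    def __init__(self, base_url, wanted_types):
        super().__init__()
        self.base_url = base_url
        self.wanted_types = wanted_types
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased from HTMLParser.
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        link_type = (attributes.get("type") or "").lower()
        href = attributes.get("href")
        if "alternate" in rel_values and link_type in self.wanted_types and href:
            # Resolve relative hrefs against the page URL.
            self.links.append(urljoin(self.base_url, href))


def discover_metadata_links(html_text, base_url,
                            wanted_types=("application/rdf+xml",)):
    """Return absolute URLs of linked metadata documents found in the page."""
    finder = MetadataLinkFinder(base_url, wanted_types)
    finder.feed(html_text)
    return finder.links


if __name__ == "__main__":
    # Deliberately sloppy markup: upper-case tags, unclosed elements.
    page = ('<HTML><HEAD>'
            '<LINK REL="alternate" TYPE="application/rdf+xml" HREF="/meta/123.rdf">'
            '<link rel="stylesheet" href="style.css">'
            '</HEAD><BODY><P>Unclosed paragraph')
    print(discover_metadata_links(page, "http://example.org/article/123"))
```

A harvester would then fetch the returned URL and parse the metadata file directly, with no scraping of the page body.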
If you want a challenge, try using your preferred method to scrape out the title and authors from this page:

http://www3.interscience.wiley.com/cgi-bin/abstract/5008818/ABSTRACT

When you're happy that it works, pick ten other articles from that web site and see if you can parse them. I suspect you won't be able to, because Wiley formats this stuff in a crazy way. I also predict that when you go back to your XPATH code to generalise it, you'll find that it's much more brittle than a few hacky lines of Perl.

I think Dan makes the good point that trying to screen scrape over so many sites is just jolly difficult. I massively underestimated how annoying it would be when I sat down to write CiteULike. In my mind, he's absolutely spot-on with his call to embed the data in a machine readable format and press publishers into using it. It will lower the barrier to people writing really useful tools like the ones outlined in his paper.

Richard.

From arhyno at uwindsor.ca Thu Dec 23 11:55:03 2004
From: arhyno at uwindsor.ca (arhyno@uwindsor.ca)
Date: Thu Dec 23 12:00:05 2004
Subject: [gcs-pcs-list] Autodiscovery and embedding metadata
In-Reply-To: <03B8BF2E-54D6-11D9-9A08-000393C0D098@citeulike.org>
Message-ID: <200412231700.iBNGxxD6008339@pmails2.uwindsor.ca>

Hi,

Sorry, I wasn't clear. I was interested in XPath more as a syntax for laying out how to drill down to specific content in HTML resources rather than focusing on particular implementation. On its own, XPath should be language agnostic in the same way that regular expressions are but my thinking was that the HTML to XHTML, if needed/desired, would come from the tidy tools or something similar. My sense is that most of us still end up looking for the third TD in a table or whatever a lot of times when working with HTML resources, and Xpath could still be a map even if it wasn't the way the content was extracted.
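
To make the "XPath as a map" idea concrete, here is a minimal Python sketch of a recipe that pairs an XPath expression with a regular expression per field. It assumes the page has already been cleaned into well-formed XHTML without a default namespace (for example by a tidy-like tool), and it uses only the limited XPath subset supported by the standard xml.etree.ElementTree module. The recipe layout, the class name "title", the "By:" prefix and the sample markup are all invented for illustration and do not describe Wiley's pages.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical recipe: where to look (XPath) and what to keep (regex group 1).
RECIPE = {
    "title": {"xpath": ".//div[@class='title']", "regex": r"^\s*(.*?)\s*$"},
    "authors": {"xpath": ".//table/tr/td[3]", "regex": r"^\s*By:\s*(.*?)\s*$"},
}

# Invented sample standing in for tidied XHTML.
SAMPLE = """<html><body>
<div class="title">  An Example Article </div>
<table><tr><td>Vol. 1</td><td>2004</td><td>By: A. Author, B. Writer</td></tr></table>
</body></html>"""


def apply_recipe(xhtml_text, recipe):
    """Apply each field's XPath, then its regex, to well-formed XHTML."""
    root = ET.fromstring(xhtml_text)
    result = {}
    for field, rule in recipe.items():
        node = root.find(rule["xpath"])
        if node is None:
            result[field] = None
            continue
        # Gather all text below the node, including text in child elements.
        text = "".join(node.itertext())
        match = re.search(rule["regex"], text, re.DOTALL)
        result[field] = match.group(1) if match else None
    return result


if __name__ == "__main__":
    print(apply_recipe(SAMPLE, RECIPE))
```

Because the recipe is plain data, the same XPath and regex strings could in principle be handed to Perl, PHP or an XSLT 2.0 processor, which is the language-neutral property being discussed.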
Again, tidy-like tools are probably options in most languages, and it could literally be the case that the HTML structure buys you zilch with a lot of resources. Some content is always going to be elusive, maybe latent semantic indexing or something like it will someday help us here.

It is infinitely preferable to not do any screen scraping at all, and there are content providers that could probably be convinced to add more hooks now that OpenURL is sort of a precedent for this kind of thing. Service Orientated Architecture should also drive a lot of web applications towards exposing a more functional layer for re-purposing content.

But for what it's worth, I took a brief look at Wiley and it is pretty darn scary. About the only compelling aspects it has are that you don't have to deal with a lot of intermediate pages or cookie management to get around the site. To see what it looks like with tidy, I put together a simple setup that uses this format:

http://137.207.120.195:8080/cocoon/holdings/xmlout?url=

For example:

http://137.207.120.195:8080/cocoon/holdings/xmlout?url=http://www3.interscience.wiley.com/cgi-bin/abstract/5008818/ABSTRACT

I also tried a simple stylesheet to extract title and author information, but wiley is sooo slow from our network today that I didn't take it very far. The setup is a slight change to the above URL:

http://137.207.120.195:8080/cocoon/holdings/wiley?url=http://www3.interscience.wiley.com/cgi-bin/abstract/5008818/ABSTRACT

And the stylesheet I used is at:

http://137.207.120.195:8080/cocoon/holdings/wiley.xsl

And I am sure it breaks all over the place. Wiley does seem to use DIV tags and the occasional CLASS attribute, and if there was a somewhat consistent use of these, then it might be an example of where Xpath could be useful, even if the coding is done in perl or php and so on.

Now time to shovel snow! Happy holidays to everyone from this normally warmer part of the frozen north.
art