Assuming you're on Windows: I've used the Perl module Win32::Clipboard in the past to futz with clipboard contents. It worked fairly well when I was on a Windows machine.<br><br>Michele<br><br><div class="gmail_quote">
On Thu, May 8, 2008 at 6:21 PM, Alec Flett <<a href="mailto:alecf@metaweb.com">alecf@metaweb.com</a>> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
So I've been using list importer a lot lately to suck in lists of<br>
things from webpages.<br>
<br>
My biggest problem lately has been that the text I want to extract on<br>
a page is fairly complex - i.e. it's usually the first <a> inside each<br>
<li>, but there is a bunch of expository text after the <a>, etc.<br>
<br>
For example, here's a list of shopping malls owned by "Madison<br>
Marquette":<br>
<a href="http://www.madisonmarquette.com/portfolio/property_listings" target="_blank">http://www.madisonmarquette.com/portfolio/property_listings</a><br>
<br>
Note that the entries are grouped by state, so the page is generally not<br>
suited to a plain copy/paste into list importer.<br>
<br>
This is the tool that has made this much easier, called "XPather"<br>
<a href="https://addons.mozilla.org/en-US/firefox/addon/1192" target="_blank">https://addons.mozilla.org/en-US/firefox/addon/1192</a><br>
<br>
Here's how I use it (the steps below assume you've installed XPather):<br>
<br>
1) On <a href="http://freebase.com" target="_blank">freebase.com</a>, go to a topic (say <a href="http://www.freebase.com/view/guid/9202a8c04000641f80000000082dc986" target="_blank">http://www.freebase.com/view/guid/9202a8c04000641f80000000082dc986</a>)<br>
and on one of the properties you want to import (say, "Shopping<br>
centers owned") click the little menu button and click "Import List".<br>
<br>
2) Go to the web page with the list you want (say <a href="http://www.madisonmarquette.com/portfolio/property_listings" target="_blank">http://www.madisonmarquette.com/portfolio/property_listings</a>).<br>
Right-click on one of the list items, in the area that has the text<br>
you want to import.<br>
<br>
3) Click "Show in XPather"<br>
<br>
4) Edit the XPath there. In the above example, the path is<br>
/html/body/div[@id='mainJoint']/div[@id='subPageMainWhiteBox']/<br>
div[@id='mainMeat']/div[@id='firstColumn']/div[@id='contentBlk']/table/<br>
tbody/tr/td[2]/blockquote[1]/a[3]<br>
<br>
5) Remove the last few indexes (the numbers inside the [ ]) - in my<br>
case I removed the [2], the [1] and the [3] because I wanted all <a>'s<br>
inside this table:<br>
/html/body/div[@id='mainJoint']/div[@id='subPageMainWhiteBox']/<br>
div[@id='mainMeat']/div[@id='firstColumn']/div[@id='contentBlk']/table/<br>
tbody/tr/td/blockquote/a<br>
<br>
6) Click "Eval" - you should get a list of items showing just the text in<br>
each <a>.<br>
<br>
7) Click the top item and then shift-click the bottom item (sadly,<br>
'select all' didn't work for me)<br>
<br>
8) Click the "Inner HTML" tab - you'll get each entry on a separate<br>
line, followed by "<!-- next result -->". Unfortunately, you have to<br>
remove each "<!-- next result -->" line by hand :(<br>
<br>
9) Copy this text, then paste it into the list importer back at<br>
<a href="http://freebase.com" target="_blank">freebase.com</a>. Click "Continue" and continue with your list importing...<br>
<br>
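To illustrate what steps 4 to 6 are doing, here is a minimal Python sketch of the same idea outside the browser. It is only an illustration, not how XPather works internally: the markup in SAMPLE is a made-up, simplified stand-in for the listings page, and the paths are shortened, relative versions of the ones above because the standard library's ElementTree supports only a subset of XPath (no absolute paths on an element).<br>

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the listings page (not the real markup).
SAMPLE = """
<html><body><div id="contentBlk"><table><tbody>
<tr>
  <td>California</td>
  <td><blockquote>
    <a href="/p/1">Mall One</a> - some expository text
    <a href="/p/2">Mall Two</a> - more text
    <a href="/p/3">Mall Three</a>
  </blockquote></td>
</tr>
<tr>
  <td>Texas</td>
  <td><blockquote>
    <a href="/p/4">Mall Four</a> - yet more text
  </blockquote></td>
</tr>
</tbody></table></div></body></html>
"""

root = ET.fromstring(SAMPLE)  # root is the <html> element

# Paths are relative to <html>, so they start at "body" rather than "/html/body".
BASE = "body/div[@id='contentBlk']/table/tbody/tr/"
indexed = BASE + "td[2]/blockquote[1]/a[3]"   # like the path XPather first shows
generalized = BASE + "td/blockquote/a"        # same path with the indexes removed


def link_texts(path):
    """Return the text of every element matched by the path, in document order."""
    return [a.text for a in root.findall(path)]


one_link = link_texts(indexed)        # matches a single link
all_links = link_texts(generalized)   # matches every link in the table

print("\n".join(all_links))
```

With the indexes in place the path matches only the one link that was clicked; with them removed it matches every link in the table, and the expository text after each link is left out because it is not part of the <a> itself.<br>
<br>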
I'd love to know if anyone has any clipboard-cleanup tools that could<br>
get rid of the "<!-- next result --> lines automatically.<br>
<br>
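As one possible approach to the cleanup, here is a small Python sketch that drops the marker lines from the copied text. The exact layout of XPather's "Inner HTML" output is an assumption based on the description in step 8, and strip_markers and SAMPLE_OUTPUT are illustrative names and data, not part of any existing tool.<br>

```python
MARKER = "<!-- next result -->"


def strip_markers(text):
    """Remove the marker (on its own line or at the end of an entry),
    drop lines that end up empty, and return one entry per line."""
    kept = []
    for line in text.splitlines():
        line = line.replace(MARKER, "").strip()
        if line:
            kept.append(line)
    return "\n".join(kept)


# Hypothetical example of the copied text, assuming the marker sits on its own line.
SAMPLE_OUTPUT = """Mall One
<!-- next result -->
Mall Two
<!-- next result -->
Mall Three
<!-- next result -->
"""

cleaned = strip_markers(SAMPLE_OUTPUT)
print(cleaned)
```

The function works on plain text only, so getting the text out of the clipboard and back in is left to whatever clipboard tool is available on your platform.<br>
<br>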
Alec<br>
<br>
_______________________________________________<br>
Developers mailing list<br>
<a href="mailto:Developers@freebase.com">Developers@freebase.com</a><br>
<a href="http://lists.freebase.com/mailman/listinfo/developers" target="_blank">http://lists.freebase.com/mailman/listinfo/developers</a><br>
</blockquote></div><br>