[Developers] wicked cool tool for data importer
Alec Flett
alecf at metaweb.com
Thu May 8 23:21:47 UTC 2008
So I've been using list importer a lot lately to suck in lists of
things from webpages.
My biggest problem lately has been that the text I want to extract on
a page is fairly complex - i.e. it's usually the first <a> inside each
<li>, but there is a bunch of expository text after the <a>, etc.
For example, here's a list of shopping malls owned by "Madison
Marquette":
http://www.madisonmarquette.com/portfolio/property_listings
Note that the entries are grouped by state, so the page is not well
suited to a plain copy/paste into list importer.
The tool that has made this much easier is called "XPather":
https://addons.mozilla.org/en-US/firefox/addon/1192
Here's how I use it (the steps below assume you've installed XPather):
1) on freebase.com, go to a topic (say http://www.freebase.com/view/guid/9202a8c04000641f80000000082dc986)
and on one of the properties you want to import (say, "Shopping
centers owned") click the little menu button and click "Import List"
2) Go to the web page that has the list you want (say http://www.madisonmarquette.com/portfolio/property_listings).
Right-click on one of the list items, in the area that has the text
you want to import
3) Click "Show in XPather"
4) Edit the xpath there. In the above example, the path is
/html/body/div[@id='mainJoint']/div[@id='subPageMainWhiteBox']/
div[@id='mainMeat']/div[@id='firstColumn']/div[@id='contentBlk']/table/
tbody/tr/td[2]/blockquote[1]/a[3]
5) Remove the last few indexes (the numbers inside the [ ]) - in my
case I removed the [2], the [1] and the [3] because I wanted all <a>'s
inside this table:
/html/body/div[@id='mainJoint']/div[@id='subPageMainWhiteBox']/
div[@id='mainMeat']/div[@id='firstColumn']/div[@id='contentBlk']/table/
tbody/tr/td/blockquote/a
6) Click "Eval" - you should get a list of items showing just the text
in each <a>
7) Click the top item and then shift-click the bottom item (sadly,
'select all' didn't work for me)
8) click the "Inner HTML" tab - you'll get each entry on a separate
line, followed by "<!-- next result -->" - Unfortunately you have to
hand-remove each "<!-- next result -->" line :(
9) Copy this text, then paste it into the list importer back at
freebase.com. Click "Continue" and continue with your list importing...
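To illustrate what steps 4 and 5 do to the XPath, here is a minimal
sketch using Python's standard xml.etree.ElementTree module. The markup
and the "Mall A".."Mall F" names are a made-up, simplified stand-in for
the real page, and the paths are shortened to the part after the
contentBlk div. ElementTree only supports a subset of XPath, so this is
an approximation of what XPather evaluates, not a replacement for it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page markup (not the real page).
SAMPLE = """
<div id="contentBlk">
  <table><tbody><tr>
    <td>
      <blockquote><a>Mall A</a> some text <a>Mall B</a></blockquote>
    </td>
    <td>
      <blockquote><a>Mall C</a> more text <a>Mall D</a> <a>Mall E</a></blockquote>
      <blockquote><a>Mall F</a></blockquote>
    </td>
  </tr></tbody></table>
</div>
"""

root = ET.fromstring(SAMPLE)

# With the indexes: only the third <a> in the first <blockquote>
# of the second <td>.
specific = root.findall("./table/tbody/tr/td[2]/blockquote[1]/a[3]")

# With the indexes removed: every <a> in any <blockquote> in any <td>.
general = root.findall("./table/tbody/tr/td/blockquote/a")

specific_text = [a.text for a in specific]
general_text = [a.text for a in general]

print(specific_text)
print(general_text)
```

The indexed path matches a single link, while the path without indexes
matches all of them in document order, which is the list you want to
hand to list importer.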
I'd love to know if anyone has any clipboard-cleanup tools that could
get rid of the "<!-- next result -->" lines automatically.
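One possible way to do that cleanup, as a minimal Python sketch. It
assumes the separator appears exactly as "<!-- next result -->", either
on its own line or at the end of an entry's line. The strip_markers
name and the sample entries are made up for illustration.

```python
MARKER = "<!-- next result -->"


def strip_markers(text):
    """Remove the separator comment and drop any lines left empty."""
    cleaned = []
    for line in text.splitlines():
        # Handles the marker on its own line or trailing an entry.
        line = line.replace(MARKER, "").strip()
        if line:
            cleaned.append(line)
    return "\n".join(cleaned)


# Hypothetical text as it might be copied from the "Inner HTML" tab.
sample = "Mall A\n<!-- next result -->\nMall B\n<!-- next result -->\n"
result = strip_markers(sample)
print(result)
```

The same function could be fed from a saved text file or from standard
input, and its output pasted into the list importer.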
Alec