[Developers] wicked cool tool for data importer

Alec Flett alecf at metaweb.com
Thu May 8 23:21:47 UTC 2008


So I've been using list importer a lot lately to suck in lists of  
things from webpages.

My biggest problem lately has been that the text I want to extract is  
buried in fairly complex markup - e.g. it's usually the first <a> inside  
each <li>, but there is a bunch of expository text after the <a>, etc.
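
To make that pattern concrete, here is a minimal sketch of "the first
<a> inside each <li>" written as an XPath and run with Python's
standard library. The markup and the mall names are made-up
placeholders, not taken from any real page, and ElementTree only
supports a small subset of XPath, so treat this as an illustration
rather than what XPather itself does.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (placeholder names only).
SAMPLE = """<ul>
  <li><a href="#one">First Mall</a> some expository text <a href="#map">map</a></li>
  <li><a href="#two">Second Mall</a> more expository text</li>
</ul>"""

root = ET.fromstring(SAMPLE)

# "a[1]" selects the first <a> child of each <li>, skipping later links.
names = [a.text for a in root.findall(".//li/a[1]")]
print(names)  # ['First Mall', 'Second Mall']
```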

For example, here's a list of shopping malls owned by "Madison  
Marquette":
http://www.madisonmarquette.com/portfolio/property_listings

Note that the entries are grouped under state headings, so the page  
isn't suitable for a straight copy/paste into list importer.

The tool that has made this much easier is a Firefox add-on called "XPather":
https://addons.mozilla.org/en-US/firefox/addon/1192

Here's how I use it (the steps below assume you've installed XPather):

1) On freebase.com, go to a topic (say http://www.freebase.com/view/guid/9202a8c04000641f80000000082dc986) 
  and, on the property you want to import into (say, "Shopping  
centers owned"), click the little menu button and click "Import List".

2) Go to the web page with the list you want (say http://www.madisonmarquette.com/portfolio/property_listings). 
  Right-click on one of the list items, in the area that has the text  
you want to import.

3) Click "Show in XPather"

4) Edit the XPath shown there. In the above example, the path is:
/html/body/div[@id='mainJoint']/div[@id='subPageMainWhiteBox']/ 
div[@id='mainMeat']/div[@id='firstColumn']/div[@id='contentBlk']/table/ 
tbody/tr/td[2]/blockquote[1]/a[3]

5) Remove the last few indexes (the numbers inside the [ ]) so the path  
matches every entry instead of just the one you clicked - in my case I  
removed the [2], the [1] and the [3] because I wanted all the <a>  
elements inside this table:
/html/body/div[@id='mainJoint']/div[@id='subPageMainWhiteBox']/ 
div[@id='mainMeat']/div[@id='firstColumn']/div[@id='contentBlk']/table/ 
tbody/tr/td/blockquote/a

6) Click "Eval" - you should get a list of results, each showing just  
the text in one <a>.

7) Click the top item and then shift-click the bottom item (sadly,  
'select all' didn't work for me)

8) Click the "Inner HTML" tab - you'll get each entry on a separate  
line, followed by "<!-- next result -->". Unfortunately you have to  
hand-remove each "<!-- next result -->" line :(

9) Copy this text, then paste it into the list importer back at  
freebase.com. Click "Continue" and continue with your list importing...
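
The index removal in step 5 can also be sketched in a few lines of
Python. This is only an illustration under one assumption: it strips
every purely numeric [N] from the path, which happens to match step 5
here because the only numeric indexes are the last three. The function
name is hypothetical, and the [@id='...'] predicates are left alone.

```python
import re

def strip_numeric_indexes(xpath):
    """Drop positional predicates like [2]; keep ones like [@id='x']."""
    return re.sub(r"\[\d+\]", "", xpath)

# The path from step 4, joined back onto one line.
path = (
    "/html/body/div[@id='mainJoint']/div[@id='subPageMainWhiteBox']/"
    "div[@id='mainMeat']/div[@id='firstColumn']/div[@id='contentBlk']/table/"
    "tbody/tr/td[2]/blockquote[1]/a[3]"
)

generalized = strip_numeric_indexes(path)
print(generalized)  # ends with .../table/tbody/tr/td/blockquote/a
```

If you only want to loosen some of the indexes (say, keep the td[2]),
edit the path by hand as in step 5 instead.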

I'd love to know if anyone has any clipboard-cleanup tools that could  
get rid of the "<!-- next result -->" lines automatically.
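
As a stopgap, a few lines of Python can do the cleanup on pasted text.
This is a rough sketch, not an existing tool: drop_markers is a
hypothetical helper name, the sample entries are placeholders, and it
assumes the marker text is exactly "<!-- next result -->" as shown in
the "Inner HTML" tab.

```python
MARKER = "<!-- next result -->"

def drop_markers(text):
    """Return the entries one per line, without marker or blank lines."""
    # Turn each marker into a line break so it works whether the marker
    # sits on its own line or directly after an entry.
    lines = text.replace(MARKER, "\n").splitlines()
    return "\n".join(line.strip() for line in lines if line.strip())

# Placeholder text in the shape described in step 8.
sample = (
    "First Mall\n"
    "<!-- next result -->\n"
    "Second Mall\n"
    "<!-- next result -->\n"
)

print(drop_markers(sample))
# First Mall
# Second Mall
```

You would still have to get the text in and out of the clipboard
yourself, e.g. by pasting it into a file and running the function on
the file's contents.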

Alec


