[Developers] wicked cool tool for data importer
Michele Berg
michele.r.berg at gmail.com
Fri May 9 04:29:17 UTC 2008
(Assuming you're on Windows) I've used the Perl module Win32::Clipboard in
the past to futz with clipboard contents. It worked fairly well when I was
on a Windows machine.
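
Off Windows, the cleanup itself could be done in a few lines of script. Here
is a minimal Python sketch, not a finished tool. It assumes each
"<!-- next result -->" marker sits on a line of its own, as the quoted
message below describes. The function name and the "Mall A" / "Mall B"
sample text are made up for illustration, and it does not touch the
clipboard itself.

```python
MARKER = "<!-- next result -->"


def strip_result_markers(text: str, marker: str = MARKER) -> str:
    """Return text with every marker-only line removed.

    Assumption: the marker appears alone on its line (surrounding
    whitespace is tolerated). All other lines are kept in order.
    """
    kept = [line for line in text.splitlines() if line.strip() != marker]
    return "\n".join(kept)


# Made-up sample standing in for text copied from the "Inner HTML" tab.
sample = "Mall A\n<!-- next result -->\nMall B\n<!-- next result -->\n"
cleaned = strip_result_markers(sample)
print(cleaned)
```

The cleaned text can then be pasted into the list importer as usual.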
Michele
On Thu, May 8, 2008 at 6:21 PM, Alec Flett <alecf at metaweb.com> wrote:
> So I've been using list importer a lot lately to suck in lists of
> things from webpages.
>
> My biggest problem lately has been that the text I want to extract on
> a page is fairly complex - i.e. it's usually the first <a> inside each
> <li>, but there is a bunch of expository text after the <a>, etc.
>
> For example, here's a list of shopping malls owned by "Madison
> Marquette":
> http://www.madisonmarquette.com/portfolio/property_listings
>
> Note that the entries are separated by state names, so the page is
> generally not suitable for plain copy/paste into list importer.
>
> The tool that has made this much easier is called "XPather":
> https://addons.mozilla.org/en-US/firefox/addon/1192
>
> Here's how I use it (assume below you've installed XPather)
>
> 1) on freebase.com, go to a topic (say
> http://www.freebase.com/view/guid/9202a8c04000641f80000000082dc986)
> and on one of the properties you want to import (say, "Shopping
> centers owned") click the little menu button and click "Import List"
>
> 2) Go to the web page with the list you want (say
> http://www.madisonmarquette.com/portfolio/property_listings).
> Right-click on one of the list items, in the area that has the text
> you want to import
>
> 3) Click "Show in XPather"
>
> 4) Edit the xpath there. In the above example, the path is
> /html/body/div[@id='mainJoint']/div[@id='subPageMainWhiteBox']/
> div[@id='mainMeat']/div[@id='firstColumn']/div[@id='contentBlk']/table/
> tbody/tr/td[2]/blockquote[1]/a[3]
>
> 5) Remove the last few indexes (the numbers inside the [ ]) - in my
> case I removed the [2], the [1] and the [3] because I wanted all <a>'s
> inside this table:
> /html/body/div[@id='mainJoint']/div[@id='subPageMainWhiteBox']/
> div[@id='mainMeat']/div[@id='firstColumn']/div[@id='contentBlk']/table/
> tbody/tr/td/blockquote/a
>
> 6) Click "Eval" - you should get a list of items showing just the text
> in each <a>
>
> 7) Click the top item and then shift-click the bottom item (sadly,
> 'select all' didn't work for me)
>
> 8) click the "Inner HTML" tab - you'll get each entry on a separate
> line, followed by "<!-- next result -->" - Unfortunately you have to
> hand-remove each "<!-- next result -->" line :(
>
> 9) Copy this text, then paste it into the list importer back at
> freebase.com. Click "Continue" and continue with your list importing...
>
> I'd love to know if anyone has any clipboard-cleanup tools that could
> get rid of the "<!-- next result -->" lines automatically.
>
> Alec
>
> _______________________________________________
> Developers mailing list
> Developers at freebase.com
> http://lists.freebase.com/mailman/listinfo/developers
>
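
Steps 4 to 6 of the quoted message (dropping the indexes so the path matches
every <a> rather than a single one) can also be tried outside the browser.
The sketch below is only an illustration under stated assumptions: it uses
Python's standard xml.etree.ElementTree, which supports just a subset of
XPath, and it runs on a small made-up, well-formed snippet rather than the
real page. Real-world HTML is often not well-formed XML, so this would not
necessarily parse the actual site.

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for the listing table; NOT the real page markup.
PAGE = """
<table><tbody><tr>
  <td>State A</td>
  <td>
    <blockquote>
      <a href="#1">Mall A</a> some expository text
      <a href="#2">Mall B</a>
      <a href="#3">Mall C</a>
    </blockquote>
    <blockquote>
      <a href="#4">Mall D</a>
    </blockquote>
  </td>
</tr></tbody></table>
"""

root = ET.fromstring(PAGE)  # root is the <table> element

# With the indexes: only the third <a> in the first <blockquote>
# of the second <td>.
indexed = root.findall("./tbody/tr/td[2]/blockquote[1]/a[3]")

# Without the indexes: every <a> in any <blockquote> in any <td>.
generalized = root.findall("./tbody/tr/td/blockquote/a")

# .text is the link text only; the expository text after each <a>
# lives in .tail and is left out.
indexed_text = [a.text for a in indexed]
all_text = [a.text for a in generalized]

print(indexed_text)
print("\n".join(all_text))
```

Printing one link text per line gives output that is already in the shape
the list importer expects, with no marker lines to remove.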