[Developers] libraries/techniques for extracting data from the Wikipedia to feed to freebase

Tom Morris tfmorris at gmail.com
Thu Feb 26 15:11:24 UTC 2009


I would definitely like to see some of the framework for Wikipedia
interpretation released so that it could be improved/expanded, but
it's possible that Freebase considers this part of its "secret sauce."
The framework certainly needs improvement, but I don't have the
machine learning chops to know how hard that would be.

Having said that, parsing wikitext, even infoboxen, should be a last
resort.  If you can find the original source for the data that the
Wikipedians are referencing, you are much better off extracting
directly from there.  Mating two lossy transforms back-to-back (the
source into wikitext, then wikitext into your parser) is just going
to create very messy data for you.
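
To make the "lossy" point concrete, here is a minimal sketch in Python.
It is not anything Freebase or DBpedia actually uses. The infobox
fragment, its field names, and the helpers parse_infobox_fields and
to_number are all made up for illustration. It only shows how a naive
line-based parser ends up with values wrapped in templates, comments,
and <ref> tags that cannot be used as numbers without more cleanup.

```python
import re

# Hypothetical infobox fragment; field names and values are invented for
# illustration and are not copied from any real Wikipedia article.
SAMPLE_WIKITEXT = """{{Infobox element
| name = Examplium
| number = 999
| boiling point K = 1234.5
| boiling point C = {{convert|961.35|C|K}} <!-- needs source -->
| phase = solid<ref>Some handbook</ref>
}}"""


def parse_infobox_fields(wikitext):
    """Naively collect '| key = value' lines into a dict of raw strings."""
    fields = {}
    for line in wikitext.splitlines():
        match = re.match(r"^\|\s*([^=]+?)\s*=\s*(.*)$", line)
        if match:
            fields[match.group(1)] = match.group(2).strip()
    return fields


def to_number(raw):
    """Return a float only if the raw value is a plain number, else None."""
    try:
        return float(raw)
    except ValueError:
        return None


fields = parse_infobox_fields(SAMPLE_WIKITEXT)
for key, raw in fields.items():
    # Values with nested templates, comments, or <ref> tags come back as None.
    print(key, "->", repr(raw), "->", to_number(raw))
```

Only the plain numeric fields survive the conversion. Everything else
needs per-template handling, which is where the messiness comes from.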

I haven't looked at it, but the DBpedia folks have at least some of
their code on SourceForge.  http://sourceforge.net/projects/dbpedia  A
related tool, LEILA, is available at
http://www.mpi-inf.mpg.de/~suchanek/downloads/leila/ but I don't know
whether feeding data into Freebase would be considered "commercial
use" or not (its license/terms of usage aren't exactly formal).

Tom

On Thu, Feb 26, 2009 at 12:37 AM, David Roberts <dvdr18 at gmail.com> wrote:
> There's an issue that I've submitted for uploading BODR chemical
> element/isotope data into freebase if you're interested:
> https://bugs.freebase.com/browse/DA-636
>
> --
> David Roberts
> http://purl.org/david
>
>
>
> 2009/2/26 Raymond Yee <raymond.yee at gmail.com>:
>> Anyone out there have a lot of experience scraping the Wikipedia for
>> facts?   The applications are many, but some examples I have in mind
>> right now include:
>>
>> 1) extracting data about chemical elements -- e.g. boiling points of
>> elements
>>
>> 2) American politicians at the federal, state, and municipal levels
>>
>> 3) visual artists and their works
>>
>> One thing that has surprised me about freebase has been the patchiness
>> of the data in it -- I wanted to plot all the boiling points of elements
>> vs atomic numbers -- but a lot of the elements are missing bps -- if you
>> go to
>>
>> http://is.gd/kVb1
>>
>> and hit "Read>>"  you'll get a list of elements w/o boiling points -- as
>> of 2009-02-26T04:53:34.3750Z (that is).
>>
>> So what I'd like to do is to use a set of Wikipedia parsers to extract
>> data that I find useful and push them into Freebase for some projects I
>> have in mind.  My quick experience with DBPedia is that it's not better
>> for chemical elements either -- but I might just be misunderstanding it.
>>
>> Does freebase have any tools it can release that we can adapt for
>> specific purposes to push more data into freebase?
>>
>> Thanks,
>> -Raymond
>> _______________________________________________
>> Developers mailing list
>> Developers at freebase.com
>> http://lists.freebase.com/mailman/listinfo/developers
>>