[Developers] Loading Freebase into a Star Schema

Paul Houle paul at ontology2.com
Wed Feb 25 20:03:32 UTC 2009


Shawn Simister wrote:
> Great work Paul,
>
> This is one of those things that I always figured should be possible but 
> I had no idea it would take so much time and effort to load the data. It 
> certainly gives me a new appreciation for all the work that the Metaweb 
> folks do to return my long, complicated MQL queries at reasonable 
> speeds. Thanks for working through the hard parts for us and documenting 
> your process so clearly.
>
> Shawn
>   
    It wasn't ~that~ much work,  but I had to wait a long time for the 
process to finish.  One take-away is that I'm not going to do it again 
with a larger dump.  It's really a research system (for figuring out what 
exactly Freebase is) rather than a practical system for extraction.

    Currently I'm thinking about using the *-schema to build a 'map' of 
Freebase.  The map would probably be 5% or less of the size of 
Freebase (in quad equivalents).  One component would be a data dictionary 
(an extraction of the Freebase schema),  another component would be a 
relationalization of the "/type/object/" predicates:  this would let you 
look up everything by name and look up the types of things.  I think a 
map creation script would run in less than an hour and will probably 
be good for a few years of growth.
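
    A rough sketch of the "/type/object/" half of such a map,  using 
Python and SQLite.  It assumes quads arrive as (source, property, 
destination, value) tuples,  with names carrying the language in the 
destination slot;  the table names,  the function names and the sample 
rows are all made up for illustration and are not taken from the real 
dump.

```python
import sqlite3

# Hypothetical sample rows in an assumed (source, property, destination, value) layout.
SAMPLE_QUADS = [
    ("/guid/1", "/type/object/name", "/lang/en", "Blade Runner"),
    ("/guid/1", "/type/object/type", "/film/film", ""),
    ("/guid/1", "/film/film/directed_by", "/guid/2", ""),
    ("/guid/2", "/type/object/name", "/lang/en", "Ridley Scott"),
    ("/guid/2", "/type/object/type", "/people/person", ""),
]


def build_map(quads):
    """Keep only the name and type predicates, each in its own indexed table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE object_name (id TEXT, lang TEXT, name TEXT)")
    conn.execute("CREATE TABLE object_type (id TEXT, type TEXT)")
    for source, prop, dest, value in quads:
        if prop == "/type/object/name":
            conn.execute(
                "INSERT INTO object_name VALUES (?, ?, ?)", (source, dest, value)
            )
        elif prop == "/type/object/type":
            conn.execute("INSERT INTO object_type VALUES (?, ?)", (source, dest))
        # Every other predicate is skipped; that is what keeps the map small.
    conn.execute("CREATE INDEX idx_object_name ON object_name (name)")
    conn.execute("CREATE INDEX idx_object_type ON object_type (id)")
    return conn


def types_for_name(conn, name):
    """Look up an object by name and return the sorted list of its types."""
    rows = conn.execute(
        "SELECT t.type FROM object_name n "
        "JOIN object_type t ON t.id = n.id "
        "WHERE n.name = ? ORDER BY t.type",
        (name,),
    )
    return [row[0] for row in rows]


freebase_map = build_map(SAMPLE_QUADS)
print(types_for_name(freebase_map, "Blade Runner"))
```

    The data dictionary component would need the schema predicates as 
well,  which this sketch leaves out.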

    Given a "map",  one could selectively load just the parts of Freebase 
that one wants,  for instance,  particular bases.  Perhaps it could 
selectively download stuff from

http://download.freebase.com/datadumps/2009-01-13/browse/
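
    Selective loading driven by a map might look something like the 
following.  This is only a sketch:  the type index stands in for whatever 
the map would provide,  the quad layout (source, property, destination, 
value) is an assumption,  and the ids,  the sample rows and the 
select_quads function are hypothetical.  It filters an in-memory list and 
does not download anything.

```python
# Hypothetical "map": object id -> set of type ids.
TYPE_INDEX = {
    "/guid/1": {"/film/film"},
    "/guid/2": {"/people/person", "/film/director"},
    "/guid/3": {"/music/album"},
}

# Hypothetical quads in an assumed (source, property, destination, value) layout.
QUADS = [
    ("/guid/1", "/film/film/directed_by", "/guid/2", ""),
    ("/guid/2", "/people/person/date_of_birth", "", "1937-11-30"),
    ("/guid/3", "/music/album/artist", "/guid/4", ""),
]


def select_quads(quads, type_index, domain_prefix):
    """Yield quads whose source object has at least one type under domain_prefix."""
    wanted_ids = {
        obj_id
        for obj_id, types in type_index.items()
        if any(t.startswith(domain_prefix) for t in types)
    }
    for quad in quads:
        if quad[0] in wanted_ids:
            yield quad


film_quads = list(select_quads(QUADS, TYPE_INDEX, "/film/"))
print(len(film_quads))
```

    Because select_quads is a generator,  the same filter could run over 
a dump file read line by line without holding the whole thing in memory.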

     I think that it would be practical to represent Freebase in an 
object-relational format:  if you're interested in Type X,  you could 
create table X that has all of the 1-1 properties associated with type X 
as columns,  then you add some extra satellite tables to handle more 
complex relations.  You could then do most of the obvious queries in an 
obvious way with SQL.  Maybe the data dictionary could be used to build 
special sorts of graph queries.  I think performance would be pretty good.
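
    To make the object-relational idea concrete,  here is a small sketch 
in Python and SQLite,  with "film" standing in for Type X.  The table 
layout,  the column names and the sample rows are my own assumptions for 
illustration:  single-valued properties become columns,  and a 
many-valued relation (who directed the film) goes into a satellite 
table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One table per type; 1-1 properties become ordinary columns.
    CREATE TABLE film (
        id TEXT PRIMARY KEY,
        name TEXT,
        initial_release_date TEXT
    );
    CREATE TABLE person (
        id TEXT PRIMARY KEY,
        name TEXT
    );
    -- Satellite table for a relation that can have many values per film.
    CREATE TABLE film_directed_by (
        film_id TEXT REFERENCES film(id),
        person_id TEXT REFERENCES person(id)
    );
    """
)

# Hypothetical ids and sample rows.
conn.executemany(
    "INSERT INTO film VALUES (?, ?, ?)",
    [
        ("/guid/1", "Blade Runner", "1982-06-25"),
        ("/guid/3", "Alien", "1979-05-25"),
    ],
)
conn.execute("INSERT INTO person VALUES (?, ?)", ("/guid/2", "Ridley Scott"))
conn.executemany(
    "INSERT INTO film_directed_by VALUES (?, ?)",
    [("/guid/1", "/guid/2"), ("/guid/3", "/guid/2")],
)


def films_directed_by(conn, director_name):
    """An 'obvious' query: names of films directed by a person, sorted by name."""
    rows = conn.execute(
        "SELECT f.name FROM film f "
        "JOIN film_directed_by d ON d.film_id = f.id "
        "JOIN person p ON p.id = d.person_id "
        "WHERE p.name = ? ORDER BY f.name",
        (director_name,),
    )
    return [row[0] for row in rows]


print(films_directed_by(conn, "Ridley Scott"))
```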

    Note that the requirements of this are quite different from those of 
the live FB system,  which has to deal with a continuously changing 
schema -- they've got to have some kind of index structure that's more 
solid than a generic RDF store but more flexible than an RDBMS.


   

