[Developers] Steps Toward a Complete History Export in Freebase

Hostile Fork hostilefork at gmail.com
Fri May 29 22:19:37 UTC 2009


Hello Freebase + developers...!

As an open source activist, I'm naturally skeptical of a proprietary  
engine being the foundation for Web 3.0+.  But Metaweb's technical  
approach to the semantic web problem fits my intuition perfectly, and  
I can't find anyone else who is doing such a stellar job at the  
execution.  Plus, pursuant to the "Free" in the Freebase name...the  
data set is available for anyone to import, index, and even serve  
through an API.

However: In speaking with the Graphd team in person, I expressed a  
concern about "community trust" because the data dump isn't a full  
export.  *Please correct me if I'm wrong*, but the finest granularity  
of download looks like this:

	http://download.freebase.com/datadumps/quad-sample.txt

There are no dates/times recording when each transaction was entered.  
This means Freebase is the only entity that can analyze the chronology,  
or implement the crucial "as_of_time" feature as explained here:

	http://blog.freebase.com/2009/02/02/mql-monday-looking-back-into-the-past-with-as_of_time/

Nor is any indication given of which user made each assertion in that  
log.  Suppose a bad piece of data is noticed: only Freebase is in a  
position to investigate the other changes made by that account.  Its  
ability to reconcile and analyze the corpus is therefore privileged.
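
To make those two gaps concrete, here is a minimal Python sketch of what 
a log with timestamps and user attribution would let anyone do.  The 
record layout (a "Change" with timestamp, user, subject, predicate, 
value, and op fields) and the sample entries are my own assumptions for 
illustration; they are not the quad dump format or anything Freebase 
publishes.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Change:
    # Hypothetical record layout: an assumption, not Freebase's schema.
    timestamp: datetime
    user: str
    subject: str
    predicate: str
    value: str
    op: str  # "insert" or "delete"


def state_as_of(log, as_of):
    """Replay the log in time order and return the assertions live at `as_of`."""
    live = set()
    for change in sorted(log, key=lambda c: c.timestamp):
        if change.timestamp > as_of:
            break
        triple = (change.subject, change.predicate, change.value)
        if change.op == "insert":
            live.add(triple)
        else:
            live.discard(triple)
    return live


def changes_by_user(log, user):
    """Return every change made by one account, for auditing."""
    return [c for c in log if c.user == user]


# Made-up sample data: one good assertion, one bad one that is later deleted.
LOG = [
    Change(datetime(2009, 1, 1), "alice", "/en/example_topic",
           "/type/object/name", "Example Topic", "insert"),
    Change(datetime(2009, 2, 1), "mallory", "/en/example_topic",
           "/people/person/profession", "Astronaut", "insert"),
    Change(datetime(2009, 3, 1), "alice", "/en/example_topic",
           "/people/person/profession", "Astronaut", "delete"),
]
```

With records like these, state_as_of(LOG, some_datetime) answers the 
"as_of_time" question locally, and changes_by_user(LOG, "mallory") pulls 
up everything else a suspect account touched.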

Lastly, these data sets are provided in monolithic files released on  
an arbitrary three-month schedule.  The most recent export was on  
March 23, 2009...and is now more than two months old.  That's far too  
long for a competing (or complementary) service based on the data to  
wait.  As anyone developing apps under Internet expectations can  
attest, even a one-minute lag for updates is too long!!

The good news is that there's a very simple solution to all of this.   
Just establish a REST API which returns all the user modifications  
that Freebase records for itself between (time1) and (time2).  It's  
perfectly fine to require a login for these queries, and to establish  
quotas...so long as these are not deliberately designed to cripple a  
mirroring effort.
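
As a rough sketch of how a mirror might consume such an API, the Python 
below builds the request URL for one (time1, time2) range and splits a 
long catch-up period into smaller windows that could each be fetched 
within a quota.  Everything specific here is hypothetical: the base URL 
is a placeholder, and the "time1"/"time2" parameter names simply echo the 
wording above.  No such endpoint exists today.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

# Placeholder address: an assumption for illustration, not a real endpoint.
BASE_URL = "https://api.example.org/changes"


def changes_url(time1, time2, base_url=BASE_URL):
    """Build the URL asking for all modifications between time1 and time2."""
    query = urlencode({"time1": time1.isoformat(), "time2": time2.isoformat()})
    return f"{base_url}?{query}"


def time_windows(start, end, step):
    """Yield consecutive (lower, upper) windows that cover start..end."""
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)
        yield cursor, upper
        cursor = upper


# Catching up from the last dump to the date of this message, 30 days at a time.
last_dump = datetime(2009, 3, 23, tzinfo=timezone.utc)
today = datetime(2009, 5, 29, tzinfo=timezone.utc)
urls = [changes_url(lo, hi)
        for lo, hi in time_windows(last_dump, today, timedelta(days=30))]
```

A mirror would fetch each URL in order and apply the returned changes, 
then keep polling with a short window to stay current.  The windows are 
contiguous, so no modification falls between two requests, provided the 
server treats one end of each range as exclusive.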

Make sense?  I'm happy to volunteer my time to assist in the  
specification + documentation of such an API.

Thanks!
--Brian

http://hostilefork.com

