[Data-modeling] keeping Freebase topics and Wikipedia pages in sync; uncertainty in who is the composer

Brian Karlak zenkat at metaweb.com
Wed Jul 22 17:51:23 UTC 2009


On Jul 22, 2009, at 7:44 AM, Iain Sproat wrote:

> And more frequent wikipedia imports to freebase can only help.  Is  
> it possible to use the wikipedia api to get more frequent updates -  
> what are the issues involved here?


The first issue is spam/vandalism detection.  These are usually found  
by the wikipedia community process within a few days, so we wait at  
least 3 days before creating a new topic in freebase.
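
As a minimal sketch of that waiting period (not our actual pipeline code), assume each candidate article carries the timestamp at which it was first seen. The names `QUARANTINE`, `eligible_for_creation`, and `filter_eligible` are hypothetical. The check models only the article's age, not whether the article survived community review.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical quarantine window; the rule above is "at least 3 days".
QUARANTINE = timedelta(days=3)

def eligible_for_creation(first_seen, now, quarantine=QUARANTINE):
    """Return True once an article has existed for at least the quarantine window."""
    return now - first_seen >= quarantine

def filter_eligible(candidates, now):
    """candidates: dict mapping article title -> first-seen datetime.

    Returns the titles old enough to be created, sorted for stable output.
    """
    return sorted(
        title for title, first_seen in candidates.items()
        if eligible_for_creation(first_seen, now)
    )
```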

The second issue is the size and complexity of the data processing.  We  
use the wikipedia API to track every edit event to every page in  
wikipedia, in near real-time.  We get between 700K and 1M of these  
every two weeks.

To run the data pipeline, we need to scrape the entire HTML and XML  
dump for each touched article, parse them all through WEX, run the  
key-based reconciliation to freebase, and then update/create the topics.   
This takes a lot of time, and is most effectively done batch-wise on  
hadoop.  We think that we can reasonably run it every week -- more  
than that and we spend more time scraping and processing churn than we  
do getting new articles.
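
One way to picture the churn trade-off, under the assumption that an article edited many times within a batch window only needs to be scraped and parsed once per batch: collapse the edit events to the set of touched articles and measure how many events were repeats. The event shape `(title, revision_id)` and both function names are hypothetical, for illustration only.

```python
from collections import Counter

def touched_articles(edit_events):
    """Collapse (title, revision_id) edit events into per-article edit counts."""
    return Counter(title for title, _revision_id in edit_events)

def churn_ratio(edit_events):
    """Fraction of events that were repeat edits to an already-touched article.

    A longer batch window tends to raise this ratio, since each article is
    processed once per batch no matter how many times it was edited.
    """
    events = list(edit_events)
    if not events:
        return 0.0
    unique_titles = len({title for title, _revision_id in events})
    return 1 - unique_titles / len(events)
```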

One thing we have discussed is creating a "current awareness" queue  
for new articles.  This would be a list of wikipedia article names  
that we know are not spam and that should be created ASAP.  This would  
be a new pipeline running in parallel with the current batch pipeline,  
so it would require a fair amount of work to implement.  However, it  
could be useful both for "current awareness" articles and for  
syncing community contributions.
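
A rough sketch of what such a queue might look like, assuming it is simply a de-duplicated, first-in-first-out list of vetted article names that a separate worker drains ahead of the batch run. Nothing like this exists yet; the class and method names are hypothetical.

```python
from collections import deque

class CurrentAwarenessQueue:
    """Hypothetical FIFO queue of vetted article names awaiting fast-track creation."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()  # every title ever queued, so repeats are ignored

    def add(self, title):
        """Queue a vetted article name; return False if it was already queued."""
        if title in self._seen:
            return False
        self._seen.add(title)
        self._queue.append(title)
        return True

    def drain(self):
        """Yield queued titles in arrival order, emptying the queue."""
        while self._queue:
            yield self._queue.popleft()
```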

Brian

