[Data-modeling] keeping Freebase topics and Wikipedia pages in sync

Brian Karlak zenkat at metaweb.com
Wed Jul 22 17:56:35 UTC 2009


On Jul 22, 2009, at 8:23 AM, Tom Morris wrote:

> what the intended synchronization strategy is, specifically  
> with regard to:
>
> a) deleted Wikipedia pages

Pages can be deleted in wikipedia for many reasons: notability  
requirements, article consolidation, vandalism, and spam.

We filter out vandalism and spam by requiring that pages exist for at  
least three days before we create the article.  Therefore most of the  
deletion events we see are due to wikipedia editorial policy -- most  
often, article consolidation.
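
A minimal sketch of an age filter like the one described above. The
function name and the timestamps are hypothetical; only the three-day
threshold comes from this message, and this is not the actual pipeline
code.

```python
from datetime import datetime, timedelta, timezone

# Threshold taken from the text: pages must exist for at least three days.
MIN_PAGE_AGE = timedelta(days=3)


def is_old_enough(created_at, now):
    """Return True once a page has existed for at least MIN_PAGE_AGE.

    Both arguments are timezone-aware datetimes (hypothetical inputs;
    where the creation time comes from is not stated in the message).
    """
    return now - created_at >= MIN_PAGE_AGE


if __name__ == "__main__":
    now = datetime(2009, 7, 22, tzinfo=timezone.utc)
    young = datetime(2009, 7, 21, tzinfo=timezone.utc)  # one day old
    old = datetime(2009, 7, 15, tzinfo=timezone.utc)    # one week old
    print(is_old_enough(young, now), is_old_enough(old, now))
```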

Since freebase does not have wikipedia's strict notability  
requirements, we have found it best to keep our topics when the  
associated wikipedia article is deleted.  We remove the /wikipedia/en  
key to note that it no longer points to a valid article, but keep the  
original /wikipedia/en_id key for attribution purposes.
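
The key handling on deletion can be sketched as follows. This assumes,
purely for illustration, that a topic's keys are held in a dict mapping
namespace to a list of key values; the function name and the sample
values are made up, and only the two namespace names come from this
message.

```python
def handle_wikipedia_deletion(topic_keys):
    """Return a copy of topic_keys reflecting a deleted Wikipedia article.

    The /wikipedia/en key is dropped because it no longer points to a
    valid article. The /wikipedia/en_id key is left untouched so the
    topic can still be attributed to its original source. The topic
    itself is kept.
    """
    updated = dict(topic_keys)          # shallow copy; input is not mutated
    updated.pop("/wikipedia/en", None)  # no error if the key is already gone
    return updated


if __name__ == "__main__":
    # Hypothetical sample values, not real Freebase data.
    keys = {
        "/wikipedia/en": ["Example_Article"],
        "/wikipedia/en_id": ["12345"],
    }
    print(handle_wikipedia_deletion(keys))
```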

> b) renamed Wikipedia pages

We experimented briefly with changing the names of topics when  
the wikipedia article name changed.  However, sampling of the runs  
showed that the results were subjectively unsatisfying, usually  
because of article consolidation and other artifacts in wikipedia.  We  
have reverted to making no changes when we see a rename event.

There is also:

c) merged wikipedia pages

Wikipedia merges can happen for two reasons: merging of duplicate  
articles, and consolidation of subtopics into a main article.  We have  
found that about 30% of the "wikipedia merges" are true  
reconciliations that should result in a merge in freebase.  We  
considered putting these on the merge queue, but with >2000 merges  
every two weeks, we quickly realized that we would swamp our  
community's bandwidth.
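
To put rough numbers on that, here is a back-of-the-envelope sketch
using the figures above. It treats 2000 as a lower bound and 30% as
exact, and assumes every candidate would need a human look; the
function name is hypothetical.

```python
def merge_queue_load(candidates=2000, true_percent=30, period_days=14):
    """Estimate review load if every wikipedia merge went on the queue.

    Returns (true_merges, candidates_per_day): how many candidates in
    one period would be real merges, and how many candidates reviewers
    would have to look at per day.
    """
    true_merges = candidates * true_percent // 100
    candidates_per_day = candidates / period_days
    return true_merges, candidates_per_day


if __name__ == "__main__":
    true_merges, per_day = merge_queue_load()
    print(f"true merges per two weeks: {true_merges}")
    print(f"candidates to review per day: {per_day:.0f}")
```

With these inputs that is about 600 real merges per two weeks, with
reviewers working through roughly 143 candidates a day to find them.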

Brian 
   

