[Data-modeling] keeping Freebase topics and Wikipedia pages in sync
Brian Karlak
zenkat at metaweb.com
Wed Jul 22 17:56:35 UTC 2009
On Jul 22, 2009, at 8:23 AM, Tom Morris wrote:
> what the intended synchronization strategy is, specifically
> with regard to:
>
> a) deleted Wikipedia pages
Pages can be deleted in wikipedia for many reasons: notability
requirements, article consolidation, vandalism, and spam.
We filter out vandalism and spam by requiring that pages exist for at
least three days before we create the article. Therefore most of the
deletion events we see are due to wikipedia editorial policy -- most
often, article consolidation.
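As a rough illustration of the three-day rule (a sketch only, not the actual loader code; the function name and the way timestamps are obtained are assumptions):

```python
from datetime import datetime, timedelta, timezone

# Minimum age stated above: a page must exist for at least three days.
MIN_PAGE_AGE = timedelta(days=3)


def is_old_enough(page_created_at, now):
    """Return True when the page has existed for at least MIN_PAGE_AGE.

    Both arguments are timezone-aware datetimes. Where the creation
    time comes from is outside the scope of this sketch.
    """
    return now - page_created_at >= MIN_PAGE_AGE
```

Pages that fail this check would simply be skipped and picked up on a later run, once they have survived long enough.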
Since freebase does not have wikipedia's strict notability
requirements, we have found it best to keep our topics when the
associated wikipedia article is deleted. We remove the /wikipedia/en
key to note that it no longer points to a valid article, but keep the
original /wikipedia/en_id key for attribution purposes.
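To make the key handling concrete, here is a minimal sketch. It assumes a topic's keys are held in a plain dict mapping namespace to a list of key values; that representation, the function name, and the sample values are hypothetical and are not how the keys are actually stored.

```python
WIKIPEDIA_TITLE_NS = "/wikipedia/en"     # points at the live article
WIKIPEDIA_ID_NS = "/wikipedia/en_id"     # kept for attribution


def handle_wikipedia_deletion(topic_keys):
    """Return a copy of topic_keys with the /wikipedia/en keys removed.

    The topic itself is kept, and so is every other namespace,
    including /wikipedia/en_id. The input dict is not modified.
    """
    updated = {ns: list(values) for ns, values in topic_keys.items()}
    updated.pop(WIKIPEDIA_TITLE_NS, None)
    return updated


# Made-up sample values, for illustration only.
topic = {
    "/wikipedia/en": ["Some_Article"],
    "/wikipedia/en_id": ["12345"],
}
after_deletion = handle_wikipedia_deletion(topic)
```

After the call, `after_deletion` still carries the `/wikipedia/en_id` entry but no longer has a `/wikipedia/en` entry.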
> b) renamed Wikipedia pages
We experimented briefly with changing the names of topics when
the wikipedia article name changed. However, sampling of the runs
showed that the results were subjectively unsatisfying, usually
because of article consolidation and other artifacts in wikipedia. We
have reverted to making no changes when we see a rename event.
There is also:
c) merged wikipedia pages
Wikipedia merges can happen for two reasons: merging of duplicate
articles, and consolidation of subtopics into a main article. We have
found that about 30% of the "wikipedia merges" are true
reconciliations that should result in a merge in freebase. We
considered putting these on the merge queue, but with >2000 merges
every two weeks, we quickly realized that we would swamp our
community's bandwidth.
Brian