[Developers] dealing with duplicates

Chris odp at freenet.de
Tue Jun 5 21:17:28 UTC 2007


Apologies for posting on a developers list as a completely non-technical 
person, but the convergence problem is simply too tempting to keep my 
mouth shut ;-)

Alec Flett wrote:
"This is a great question - convergence problems are always going to 
exist in freebase and it would be great to flush out some specific 
patterns to deal with them. After all, it takes just one broken program 
or person creating a second topic called "The Beatles" - even if some 
process clear up duplicates in 6 hours, that's 6 hours of "confused" 
data. I'd love to hear some suggestions on how to address this. "

I'd try to combine various external authoritative resources with the 
authority of the human contributors.

For many topics, Freebase already has a link to the corresponding 
Wikipedia article: it's currently used to display the Wikipedia 
description for a topic if there's no user-defined description yet. If 
you not only used these links to find the Wikipedia descriptions, but 
also stored them as an extra "Corresponding Wikipedia article" default 
property and encouraged people to actively maintain these links, that 
would give an external authority for a whole lot of topics... especially 
if it were forbidden, or at least made a bit more difficult, to enter a 
specific article more than once into this "Corresponding Wikipedia 
article" field anywhere in Freebase. If someone tries to enter a 
link to a Wikipedia article that has already been used elsewhere, 
display a note: "already used for topic x, please make sure there's no 
duplication". This would help people find duplicates or overlaps, and 
create a bit of healthy pressure to discuss problem cases too.
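
To make the idea a bit more concrete, here is a rough sketch of such a 
check in Python. It is only an illustration: the class 
WikipediaLinkIndex, the method claim() and the topic ids are made-up 
names, not anything Freebase actually provides, and treating 
"The_Beatles" and "The Beatles" as the same article is a simplifying 
assumption.

```python
from typing import Dict, Optional


class WikipediaLinkIndex:
    """Hypothetical index: each Wikipedia article may be claimed by one topic."""

    def __init__(self) -> None:
        # Maps a normalized article title to the topic id that uses it.
        self._topic_by_article: Dict[str, str] = {}

    @staticmethod
    def _normalize(article: str) -> str:
        # Assumption: underscores and spaces are interchangeable in titles.
        return " ".join(article.replace("_", " ").split())

    def claim(self, topic_id: str, article: str) -> Optional[str]:
        """Store the link, or return a warning if another topic already has it."""
        key = self._normalize(article)
        existing = self._topic_by_article.get(key)
        if existing is not None and existing != topic_id:
            # Do not overwrite; let the contributor sort out the conflict.
            return (f"already used for topic {existing}, "
                    "please make sure there's no duplication")
        self._topic_by_article[key] = topic_id
        return None


index = WikipediaLinkIndex()
index.claim("topic/1", "The_Beatles")            # first claim is accepted
warning = index.claim("topic/2", "The Beatles")  # second topic gets a warning
print(warning)
```

The sketch only warns and leaves the first link in place, which 
matches the "made a bit more difficult" variant rather than a hard ban.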

The IMDB entries which are already used in the Music domain could 
probably be used like this, too.
Other candidates are links to Musicmoz/Chefmoz/ODP categories 
(affiliation disclosure: I am editing at dmoz.org), or the Yahoo 
directory. Or expert resources that are used only for specific domains.

A developer who wants to use Freebase content combined with an external 
authority could choose any of these external resources as his main 
authority. E.g. for searches on films, IMDB might be the best fit. Or, 
depending on what you use the data for, you could assign a higher value 
to a topic that has matches in several of these selected resources.
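
As a rough sketch of that last idea, a topic could be scored by the 
external resources it links to, with weights chosen by the developer. 
Everything here is hypothetical: the function authority_score(), the 
source names, the weights and the sample topics are invented for 
illustration and do not describe an existing Freebase feature.

```python
from typing import Mapping, Optional


def authority_score(links: Mapping[str, Optional[str]],
                    weights: Mapping[str, float]) -> float:
    """Sum the weights of all external sources the topic has a link to."""
    return sum(weights.get(source, 0.0)
               for source, link in links.items()
               if link)  # a missing or empty link counts as no match


# A film search application might trust IMDB more than other sources.
film_weights = {"wikipedia": 1.0, "imdb": 2.0, "odp": 1.0}

# Two made-up topics that share a name; the second has fewer external matches.
topics = {
    "topic/1": {"wikipedia": "The Beatles", "imdb": "nm-example", "odp": None},
    "topic/2": {"wikipedia": None, "imdb": None, "odp": "Arts/Music/Example"},
}

# Rank topics so that the best-supported one comes first.
ranked = sorted(topics,
                key=lambda t: authority_score(topics[t], film_weights),
                reverse=True)
print(ranked)
```

A different application could pass different weights without touching 
the data, which is the point of letting each developer pick his own 
main authority.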

Regards,
Chris (chris2 at Freebase)





