[Developers] [Data-modeling] Data load issues

Reilly Hayes rfh at metaweb.com
Wed May 27 21:34:02 UTC 2009


Tom,

We're aware of the issues with the Open Library load (which is still  
very much in progress).

There were a few thousand misses on reconciliation of hundreds of
thousands of authors.  We're in the process of cleaning these up (via
our human judgement capture system).  We take reconciliation very
seriously and are investing considerable effort into building
reconciliation technology.

Reconciliation is hard, and reconciling people is particularly hard.   
There is bound to be error.  The effort to correct a false positive on  
reconciliation is about 20 times the effort to correct a false  
negative.  This cost differential leads us to set the thresholds very  
high for reconciliation.  Said another way, we would rather see 19  
unreconciled authors than 1 author incorrectly reconciled to another.   
For this reason, please tread gently on the merge queue.
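
As a rough illustration of the cost argument above (a sketch only, not
Metaweb's actual reconciliation logic): assume the reconciler yields a
match probability p for a candidate pair, a false positive costs 20
units to fix, and a false negative costs 1.  Merging has the lower
expected cost only when (1 - p) * 20 < p * 1, i.e. p > 20/21.  The
function names below are hypothetical.

```python
from fractions import Fraction


def merge_threshold(fp_cost, fn_cost):
    """Match probability above which merging has the lower expected cost.

    Derived from (1 - p) * fp_cost < p * fn_cost.
    """
    return Fraction(fp_cost, fp_cost + fn_cost)


def should_merge(p_match, fp_cost=20, fn_cost=1):
    """Compare expected cleanup cost of merging vs. leaving unmerged.

    p_match is an assumed match probability from some reconciler.
    """
    cost_if_merged = (1 - p_match) * fp_cost   # wrong merge must be undone
    cost_if_unmerged = p_match * fn_cost       # missed merge must be added
    return cost_if_merged < cost_if_unmerged


if __name__ == "__main__":
    t = merge_threshold(20, 1)
    print(t, float(t))  # 20/21, roughly 0.952
```

With a 20:1 cost ratio, a pair scored at 0.95 is still left unmerged,
which is why a few thousand misses are expected at this threshold.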

We have not yet addressed the non-Person Person issue you raised, but  
it is on the radar.  Thanks.

Your point on poorly disambiguated topics is correct, given the data
currently visible in Freebase.  As it turns out, Book and Edition data
*is* part of the load.  We will be loading that after the authors are
cleaned up.

If you would like to help, we're going to have some related tasks pop  
up in our human judgement capture system (RABJ) in the near future.   
I'll post the URL when they are ready.

-reilly


On May 27, 2009, at 11:35 AM, Tom Morris wrote:

> This is a query, well two queries actually, about the process of
> loading data itself as opposed to the modeling of it, so I'm not sure
> where it belongs.  Is there a good home for this type of discussion
> (assuming it isn't just an outright bug that should get tossed into
> Jira)?
>
> These issues are entirely unrelated, except that they are both about
> automated data loading processes and quality thereof.
>
> 1. A large dataset was loaded (or is still being loaded?) from Open
> Library without any apparent attempt at reconciliation with the
> existing database which has caused a number of issues including:
>
> a. creation of a substantial number of spurious Person entries for
> "Scholatistic Books Inc." et al
> b. creation of a large number of duplicate real Person topics (you've
> probably had to vote on these if you do any merge voting)
> c. creation of topics with insufficient information to disambiguate.
> Although the source information includes information about books
> authored, this is not being loaded and is a key piece of information
> for determining whether topics are duplicates
> d. creation of topics without any visible provenance.  If you know to
> use the Explore view, you can find the Open Library id and use it to
> link back to the source data so you can see the complete version, but
> it's all a manual process.
> e. the source data itself looks suspect to me and appears to conflate
> independent authors based solely on the fact that they share a common
> name
>
> The reason this is all an issue from a developer's point of view is
> that every time something like this happens, it becomes harder to do
> the resolution/disambiguation for the next data load.  Perhaps the
> view is that the "community" just has to dig in and clean up the mess,
> but a) they won't get it all and b) they've got more productive things
> to be doing with their time (e.g. modeling and typing all the stuff
> that's not done yet).
>
> It may be too late to clean up a lot of the damage, but at least
> including a link back to the OpenLibrary page with a URL template and
> *perhaps* loading the books (assuming it won't just make things worse)
> to help disambiguate would help.
>
> 2. A little tiny nit compared to the above, but I came across it just
> before sitting down to write this - this Mike Lombardi topic
> http://www.freebase.com/view/guid/9202a8c04000641f800000000befcf9e is
> typed as an Influence Node, but nothing else, which seems totally
> bizarre to me.  How did that type get derived?  Since Wikipedia has an
> infobox with born date and he's in a Born_In_1976 category, I'd
> expected him to get typed as a person.
>
> Tom
>
> p.s. The way I stumbled across #2 was kind of cool because it was the
> first actual organic instance of MJT in the wild that I've seen.  I
> was looking at Mike Love's note about his Genealogy of Influence
> project http://mike-love.net/ and noticed that he'd *just* updated
> things which I thought was a huge coincidence.  Of course when I
> looked a little closer, it was just a little MJT magic querying
> Freebase when anyone visited his home page.
