[Developers] Data load issues

Tom Morris tfmorris at gmail.com
Wed May 27 18:35:57 UTC 2009


This is a query, well two queries actually, about the process of
loading data itself as opposed to the modeling of it, so I'm not sure
where it belongs.  Is there a good home for this type of discussion
(assuming it isn't just an outright bug that should get tossed into
Jira)?

These issues are entirely unrelated, except that they are both about
automated data loading processes and quality thereof.

1. A large dataset was loaded (or is still being loaded?) from Open
Library without any apparent attempt at reconciliation with the
existing database, which has caused a number of issues, including:

 a. creation of a substantial number of spurious Person entries for
"Scholastic Books Inc." et al
 b. creation of a large number of duplicate real Person topics (you've
probably had to vote on these if you do any merge voting)
 c. creation of topics with insufficient information to disambiguate.
Although the source data includes the books each author wrote, this is
not being loaded, and it is a key piece of information for determining
whether topics are duplicates
 d. creation of topics without any visible provenance.  If you know to
use the Explore view, you can find the Open Library id and use it to
link back to the source data so you can see the complete version, but
it's all a manual process.
 e. the source data itself looks suspect to me and appears to conflate
independent authors based solely on the fact that they share a common
name

The reason this is all an issue from a developer's point of view is
that every time something like this happens, it becomes harder to do
the resolution/disambiguation for the next data load.  Perhaps the
view is that the "community" just has to dig in and clean up the mess,
but a) they won't get it all and b) they've got more productive things
to be doing with their time (e.g. modeling and typing all the stuff
that's not done yet).

It may be too late to clean up a lot of the damage, but at least
including a link back to the OpenLibrary page with a URL template and
*perhaps* loading the books (assuming it won't just make things worse)
to help disambiguate would help.
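
To make the two suggestions above concrete, here is a rough sketch, not
a description of how the loader actually works.  The URL template, the
function names, and the idea of comparing book titles as plain strings
are all assumptions made for illustration; the real Open Library path
layout and the real loader data structures may differ.

```python
from urllib.parse import quote

# Assumed URL template; the actual Open Library path layout may differ.
OPEN_LIBRARY_AUTHOR_TEMPLATE = "https://openlibrary.org/authors/{olid}"


def source_link(olid, template=OPEN_LIBRARY_AUTHOR_TEMPLATE):
    """Build a link back to the source record for an Open Library id."""
    olid = olid.strip()
    if not olid:
        raise ValueError("empty Open Library id")
    # Percent-encode the id so an odd character cannot break the URL.
    return template.format(olid=quote(olid, safe=""))


def _normalize(title):
    """Lower-case a title and collapse runs of whitespace."""
    return " ".join(title.lower().split())


def shared_titles(books_a, books_b):
    """Return the normalized titles two candidate topics have in common.

    A non-empty result is only a hint that the two topics may be the
    same author; it is a hypothetical heuristic, not a merge decision.
    """
    return {_normalize(t) for t in books_a} & {_normalize(t) for t in books_b}
```

With something like this, each loaded topic could carry a clickable
link to its source record, and two same-named Person topics that share
no titles at all would at least be flagged as doubtful merge
candidates.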

2. A tiny nit compared to the above, but I came across it just
before sitting down to write this: this Mike Lombardi topic
http://www.freebase.com/view/guid/9202a8c04000641f800000000befcf9e is
typed as an Influence Node, but nothing else, which seems totally
bizarre to me.  How did that type get derived?  Since Wikipedia has an
infobox with a birth date and he's in a Born_In_1976 category, I'd
have expected him to get typed as a person.

Tom

p.s. The way I stumbled across #2 was kind of cool because it was the
first actual organic instance of MJT in the wild that I've seen.  I
was looking at Mike Love's note about his Genealogy of Influence
project http://mike-love.net/ and noticed that he'd *just* updated
things which I thought was a huge coincidence.  Of course when I
looked a little closer, it was just a little MJT magic querying
Freebase when anyone visited his home page.

