[Developers] Importing data from World of Spectrum
Tim Kientzle
tim at metaweb.com
Mon Sep 10 20:30:18 UTC 2007
Philip Kendall wrote:
>
> 2) If so, the thing which is worrying me most is accidentally creating
> a large number of duplicate topics for things like software houses,
> where the Freebase entry happens to have a slightly different name
> from the name used by World of Spectrum. Two questions here, really:
We call this the "reconciliation problem." It is probably
our single biggest headache when importing new data. (There
are entire academic conferences on this subject; it's not a
simple problem to solve in any generic fashion.)
I suggest you start by doing some experiments against our
sandbox system to see how many things you can match.
Start with manual experiments and searches, then try
automating them. From there, a lot depends on the nature
of the data you have and how well it happens to line up.
What you'll often find is that some percentage of your
data matches up very easily. By using some creativity
("companies with the same first four letters in the same
city/state", perhaps?) you might find a number of other
matches and/or determine that your objects don't already
exist.
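To make that concrete, here is a rough sketch of the two passes in
Python: an exact pass on normalized names, then the "first four
letters plus city" heuristic to collect candidates for manual review.
It works purely on in-memory lists and does not call any Freebase
API; the record layout (dicts with "name" and "city" keys) and all
function names are assumptions made up for illustration.

```python
from collections import defaultdict


def normalize(name):
    """Lowercase and keep only letters and digits, so case,
    spacing and punctuation differences do not block a match."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def blocking_key(record):
    """Heuristic key: first four normalized characters of the name
    plus the city. The record fields are hypothetical."""
    return (normalize(record["name"])[:4], record["city"].strip().lower())


def reconcile(local_records, remote_records):
    """Split local records into exact matches, candidates that need
    a manual look, and records with no apparent counterpart."""
    by_name = {normalize(r["name"]): r for r in remote_records}
    by_key = defaultdict(list)
    for r in remote_records:
        by_key[blocking_key(r)].append(r)

    exact, candidates, unmatched = [], [], []
    for rec in local_records:
        hit = by_name.get(normalize(rec["name"]))
        possible = by_key.get(blocking_key(rec))
        if hit is not None:
            exact.append((rec, hit))
        elif possible:
            candidates.append((rec, possible))
        else:
            unmatched.append(rec)
    return exact, candidates, unmatched
```

The candidates list is deliberately kept separate from the exact
matches: a shared prefix and city is only a hint, so those pairs are
meant to be reviewed by hand rather than merged automatically.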
There will always be a few errors and there will always
be room to make manual corrections afterwards, so try not
to get too hung up on getting everything exactly right.
Deciding when your work so far is "good enough" is mostly
a judgment call. Finding someone else who can
collaborate would certainly help. It might also be useful
to import subsets of the data at a time (first, the
data that matches exactly, then additional subsets as
you try different matching strategies). That would also
make it easier for others to look at the result and give
you feedback.
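One way to organize the staged import, again only as a sketch: run
the matching strategies in order of decreasing confidence and put
each record into the batch of the first strategy that accepts it.
The function name and the (name, predicate) strategy pairs are
hypothetical; the actual import step is left out entirely.

```python
def staged_batches(records, strategies):
    """Yield (strategy_name, batch) pairs in order. Each record lands
    in the batch of the first strategy whose predicate accepts it;
    whatever is left over comes out last as "unmatched"."""
    remaining = list(records)
    for name, matches in strategies:
        batch, rest = [], []
        for rec in remaining:
            (batch if matches(rec) else rest).append(rec)
        remaining = rest
        yield name, batch
    yield "unmatched", remaining
```

Each batch can then be imported and reviewed on its own before you
move on to the next, less certain strategy.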
Good luck, and let us know how it goes.
Tim Kientzle
Metaweb Technologies