[Freebase-discuss] the theory of de-duplication / record linking /reconciliation
gmcdonald at itasoftware.com
Thu May 6 19:52:53 UTC 2010
This is a major focus of our efforts in Needle (www.needlebase.com), too. Needle is in a way a parallel-universe version of Freebase in which all the goals are the same, but half the design decisions are inverted. There's a little diatribe on the subject of data-cleanup at http://www.needlebase.com/pj2009/68-pj2009, and you can see the resulting dataset in question at https://pub.needlebase.com/actions/visualizer/V2Visualizer.do?domain=Pazz-Jop-2009 .
Although in some abstract business sense Needle and Freebase could be considered competitive, from a design perspective I would love to see us make them work together. In the case of deduplication, the differences in approach between GridWorks and Needle are great: problems that are hard in one might be easy in the other.
For example, where GridWorks is focused on tables, Needle is built around graphs. So in the case I discussed in the linked blog post, the incoming CSV file had 40 columns that represented only 4 actual kinds of values: artist1, album1, label1, points1, artist2, album2, label2, points2, etc. You couldn't effectively clean this up in GridWorks, currently, because there's not yet any way to reconcile multiple columns together. Needle combines these, and lets you rotate the data to whatever perspective(s) you need for cleanup: by artist, by album, etc.
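The reshape described above can be sketched in a few lines of plain Python. The field names and ballot data here are illustrative, not Needle's actual implementation; the point is just that once the numbered column groups are unpivoted, every artist lives in a single field and can be reconciled as one column instead of ten:

```python
# One ballot row as the wide CSV would parse it (2 of the slots shown;
# the real file repeats artist/album/label/points many more times).
row = {
    "voter": "A",
    "artist1": "Neko Case", "album1": "Middle Cyclone",
    "label1": "Anti-", "points1": "10",
    "artist2": "Phoenix", "album2": "Wolfgang Amadeus Phoenix",
    "label2": "Glassnote", "points2": "5",
}

def unpivot(row, stubs=("artist", "album", "label", "points"), slots=2):
    """Turn one wide ballot row into one record per slot, keyed by the
    four real kinds of values rather than the numbered columns."""
    records = []
    for n in range(1, slots + 1):
        records.append({stub: row[f"{stub}{n}"] for stub in stubs})
    return records

records = unpivot(row)
# records[0] -> {"artist": "Neko Case", "album": "Middle Cyclone", ...}
```

After this step, clustering "by artist" is just grouping on the single `artist` field.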
David and I compare notes periodically, but it would be interesting for both of us, I think (and you!), to have you (or anybody who's tried cleaning up anything in GridWorks) try the same task in Needle and see how it goes. Needle is a hosted system, not a client-side tool, but we have access control for privacy, and CSV and JSON exports to get your data back out again, so a lot of cleanup tasks should be suitable for both systems.
Let me know if you're interested, or just hit the big Signup button on www.needlebase.com.
On 6 May 10, at 2:45pm, Raymond Yee wrote:
> A generic task that I would guess many of us face in the Freebase
> community is the issue of removing duplicates in a set of records. This
> task is trivial if records are considered duplicates if and only if
> they are exactly identical. We obviously have this problem generically
> in Freebase, and we as a community don't have any perfect, generic,
> automatic solutions -- that's why we have the workflow around flagging
> topics for merging and the need for people to vote on such decisions.
> What I'm looking for are techniques that would help one boil away as
> many duplicates as possible before trying to load records into
> Freebase. One path forward I see is continued improvement of Gridworks,
> which has functionality for clustering facets and easy merging of
> facets. Maybe those clustering algorithms will be extended to entire
> records.
> I've been tackling problems of matching records in an ad hoc fashion,
> but I think it might be time to step back and look at the problem more
> generically. So I turned to Wikipedia
> (http://en.wikipedia.org/wiki/Record_linkage) for some basic pointers.
> I'd welcome any thoughts in terms of software, tools, papers,
> algorithms, or ways of thinking to consider.
> (BTW, the specific problem I have in mind is merging records from
> Recovery.gov and USAspending.gov to come up with a more comprehensive
> picture of Recovery Act funds. It turns out that things that should be
> keys in both systems are not as key-like as you would like. I might end
> up downloading the datasets as CSV, importing them into Gridworks, and
> looking for patterns to cluster on... but are there more effective
> generic approaches I should look at?)
> You are receiving this message because you are subscribed to the Freebase-discuss mailing list.
> To post a message to the list: Freebase-discuss at freebase.com
> To unsubscribe, view archives, etc: http://lists.freebase.com/mailman/listinfo/freebase-discuss