[Developers] how to make best use of Freebase suggest for reconciling collections of names
Shawn Simister
narphorium at gmail.com
Wed Jun 24 19:47:12 UTC 2009
I too have done some work in this area, although I haven't worked on it
much lately. I think that anyone with a programming background who gets
deep enough into Freebase eventually gravitates toward this sort of
tool-set. In fact, last year the Metaweb folks drew up some pretty
ambitious plans <https://bugs.freebase.com/browse/CLI-3291> for
something very similar to what you're describing. Over time, those plans
got scaled back to produce the reconciliation tool that you linked to.
However, it looks <https://bugs.freebase.com/browse/CLI-3718> like the
full-fledged spreadsheet loader is still in the works, so I remain hopeful.
As Tom said, you can get a lot of mileage from whitelists and blacklists of
types to reconcile against. You can also use keys from sites like Wikipedia,
IMDB, and NNDB as a proxy for some sort of "notability score", since
each of those sites has its own notability requirements. Lastly,
regular expressions can be used to filter out specific naming
structures. For example, people's names don't often end with
organization suffixes like Inc., Corp., Association, etc.
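To make those three heuristics concrete, here is a rough Python sketch. It
is only an illustration, not how any Freebase service actually works: the
candidate dictionaries (with "id", "types" and "keys" fields), the
namespace strings in NOTABILITY_NAMESPACES, and the function names are all
assumptions made up for this example, and the candidates would have to
come from whatever search results you already have.

```python
import re

# Suffixes that suggest an organization rather than a person.
ORG_SUFFIX = re.compile(r"\b(?:Inc|Corp|Association)\.?$", re.IGNORECASE)

# Hypothetical key namespaces used as a notability proxy (assumed names).
NOTABILITY_NAMESPACES = ("/wikipedia/en", "/authority/imdb", "/authority/nndb")


def looks_like_organization(name):
    """Return True if the name ends with an organization-style suffix."""
    return bool(ORG_SUFFIX.search(name.strip()))


def notability_score(candidate):
    """Count how many of the notability namespaces the candidate has a key in."""
    keys = candidate.get("keys", ())
    return sum(1 for ns in NOTABILITY_NAMESPACES if ns in keys)


def filter_candidates(query, candidates, allowed_types, blocked_types=frozenset()):
    """Apply the type whitelist/blacklist and the name-pattern filter,
    then order the survivors by the notability proxy (highest first)."""
    kept = []
    for cand in candidates:
        types = set(cand.get("types", ()))
        if types & blocked_types:
            continue  # blacklisted type
        if allowed_types and not (types & allowed_types):
            continue  # has none of the whitelisted types
        if "/people/person" in types and looks_like_organization(query):
            continue  # a person is unlikely to be named "... Inc."
        kept.append(cand)
    # sorted() is stable, so ties keep the original (search) order.
    return sorted(kept, key=notability_score, reverse=True)
```

Calling filter_candidates("Acme Corp.", candidates, {"/people/person",
"/organization/organization"}) would drop any person candidates for that
string and rank the remaining organizations by how many of the outside
sites have a key for them.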
I'd love to discuss this in more detail, and since you, Tom, and I are
all going to be at Hack Day, I'd like to propose that we do a very
informal session on entity reconciliation. Maybe we could get Colin and
Reilly and any other interested parties to join in and share their
experiences building this sort of reconciliation service.
Shawn
Raymond Yee wrote:
> [My apologies if you get duplicates of this email -- I sent this out
> already under another email address but didn't see it come through....]
>
> Hi everyone,
>
> I'm finding it a challenge to efficiently match entities that are
> identified by no more than a single string (such as the names of US
> government agencies -- e.g., "Department of State") to Freebase items.
> I'd like to describe an approach I'm taking and get your feedback on
> how to make it better.
>
> I'd like to write a Freebase Acre app that will take as input a list
> of strings and return the list of strings with Freebase ids for
> matches -- after an interactive process involving the user. (This app
> is modeled roughly on http://mqlx.com/reconciliation/recon.html.) The
> steps involved will be:
>
> 1) feed each of the strings to the freebase search api
> (http://www.freebase.com/api/service/search?help) to come up with the
> "best match". (Naively, I'd just use the match with the highest
> relevance:score -- but I'd like to figure out approaches for
> distinguishing between matches that are head and shoulders above the
> other matches vs. ones that are just a bit better than the rest.)
>
> 2) present the best matches in an input box tied to the Freebase
> suggest jQuery plugin (http://suggest.freebaseapps.com/) so that the
> user can hopefully quickly inspect what Freebase is suggesting. If
> the user is unhappy with the choice, the user can look through other
> suggestions, create a new Freebase item, or flag the item as having no
> match.
>
> 3) return the complete list of matches.
>
> I'm curious to know whether this approach is basically sound. If so,
> I plan to implement it and look for ways to make the process
> efficient. For example:
>
> a) I'd like to find ways to make it easy for the user to know which
> matches have been matched with "high confidence" by Freebase so that
> she can scan through the list quickly. Do people have suggestions
> about how to measure "high confidence" beyond a high relevance score?
>
> b) have input and output mechanisms that tie into, say, Google
> spreadsheet. I often find it convenient to work with spreadsheets
> with columns of attributes -- one of which I'd like to have is the
> Freebase ID. I'd like to point this app to a Google spreadsheet (or
> upload an Excel or OpenOffice.org calc file or CSV or TSV file) and
> then have as output the data with the Freebase ID filled in (much
> like how the reconciliation interface works).
>
> BTW, does the output of the reconciliation service reduce down to the
> same as the search api if I were to feed the reconciliation api the
> right parameters?
>
> Thanks!
> -Raymond