[Developers] Results from the reconciliation service?
Tom Morris
tfmorris at gmail.com
Thu Feb 19 20:28:52 UTC 2009
Thanks for the tips and the pointers. I'll have a play with the new
reconciliation service and see if it's better. I think the problems
are partly algorithmic and partly data-based. The fact that the database
is still basically in bootstrap mode makes it harder for the
algorithms to work effectively.
One of the things I've noticed is that, at least for names, the
Wikipedia importer has left a lot of data on the table. There's a
fairly common pattern used in the articles where they begin "<name1>
(also known as <name2>, <name3>)" where the aliases haven't been
captured (and name1 should get recorded as an alias too if it's
different from the name used in the title of the article). This is
making the algorithms (and users) work much harder than is necessary.
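
As a rough sketch of the alias capture described above (the regex, the
function name, and the sample names are purely illustrative; this is not the
importer's actual code, and real article leads are messier than this):

```python
import re

# Hypothetical pattern for leads of the form:
#   "<name1> (also known as <name2>, <name3>) ..."
LEAD_ALIAS_RE = re.compile(
    r"^(?P<name>[^(]+?)\s*\(also known as (?P<aliases>[^)]+)\)"
)


def extract_aliases(title, lead_sentence):
    """Return aliases found in the lead sentence, excluding the article title."""
    match = LEAD_ALIAS_RE.match(lead_sentence)
    if not match:
        return []
    aliases = []
    # name1 is itself an alias when it differs from the article title.
    lead_name = match.group("name").strip()
    if lead_name != title:
        aliases.append(lead_name)
    # Assumes aliases are separated by commas or the word "or".
    for alias in re.split(r",\s*|\s+or\s+", match.group("aliases")):
        alias = alias.strip()
        if alias and alias != title and alias not in aliases:
            aliases.append(alias)
    return aliases


aliases = extract_aliases(
    "Bob Example",
    "Robert Example (also known as Bob Example, R. J. Example) was a person.",
)
```

Here name1 ("Robert Example") is kept as an alias because it differs from the
article title, while the title itself is skipped.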
On the algorithmic side of things, edit distance is all well and
good, but not all strings are created equal, especially for names.
The current algorithms don't seem to do anything special with
honorifics, generational suffixes, etc. (e.g. Dr., Jr., III, Gen.,
General, Sgt.), middle names/initials, nicknames, etc. Applications
can obviously pre-process things, but the reconciler would be a lot
more useful if it understood some of the rules and conventions which
apply to personal names.
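
To make the idea concrete, here is a minimal sketch of the kind of
name-aware normalization a client (or the reconciler) could apply before
computing edit distance. The word lists and the function name are
assumptions for illustration only, and reducing middle names to initials is
just one possible policy:

```python
# Illustrative, deliberately incomplete lists based on the examples above.
HONORIFICS = {"dr", "gen", "general", "sgt"}
SUFFIXES = {"jr", "iii"}


def normalize_person_name(name):
    """Lower-case a name, drop honorifics and generational suffixes,
    and reduce middle names to initials."""
    tokens = [t.strip(".,").lower() for t in name.split()]
    tokens = [t for t in tokens if t]
    # Strip leading honorifics such as "Dr." or "Gen.".
    while tokens and tokens[0] in HONORIFICS:
        tokens.pop(0)
    # Strip trailing generational suffixes such as "Jr." or "III".
    while tokens and tokens[-1] in SUFFIXES:
        tokens.pop()
    # Reduce middle names to initials so "Luther" and "L." compare equal.
    if len(tokens) > 2:
        tokens = [tokens[0]] + [t[0] for t in tokens[1:-1]] + [tokens[-1]]
    return " ".join(tokens)


a = normalize_person_name("Dr. Martin Luther King, Jr.")
b = normalize_person_name("Martin L. King")
```

With this policy both spellings normalize to the same string, so the
edit distance only has to deal with the parts of the name that matter.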
Another person-specific thing is lifespan. On the data side,
Wikipedia uses a pretty common convention of (<birth date> - <death
date>) immediately following the name (and aliases) and this
information isn't getting extracted. On the resolution service side
of things, I'll often only want to consider people who were alive
during a certain time.
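
A minimal sketch of both halves, assuming the lifespan appears as a
parenthesized "<birth date> - <death date>" with four-digit years and a plain
hyphen surrounded by spaces (the function names are hypothetical, and only
the years are extracted):

```python
import re

# Assumed form: "(<birth date> - <death date>)".
LIFESPAN_RE = re.compile(r"\(([^()]+?)\s+-\s+([^()]+?)\)")
YEAR_RE = re.compile(r"\b(\d{4})\b")


def extract_lifespan(lead_sentence):
    """Return (birth_year, death_year), or None if no lifespan is found."""
    match = LIFESPAN_RE.search(lead_sentence)
    if not match:
        return None
    birth = YEAR_RE.search(match.group(1))
    death = YEAR_RE.search(match.group(2))
    if not (birth and death):
        return None
    return int(birth.group(1)), int(death.group(1))


def alive_during(lifespan, start_year, end_year):
    """True if the lifespan overlaps the interval [start_year, end_year]."""
    birth, death = lifespan
    return birth <= end_year and death >= start_year


lifespan = extract_lifespan(
    "Abraham Lincoln (February 12, 1809 - April 15, 1865) was the 16th President."
)
```

A reconciliation client could then discard candidates for which
alive_during() is false for the period it cares about.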
Different clients of the resolution service are going to want
different things from it, so some tunable knobs would be useful.
Being able to specify weighting coefficients for scoring would be
helpful. The type blacklists that Shawn mentions would be useful as
well.
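
To illustrate the sort of knobs I have in mind, here is a sketch of a scoring
function with caller-supplied weights and a type blacklist. Everything here
is hypothetical (the record layout, the parameter names, the default
weights), and difflib's similarity ratio stands in for whatever string
distance the service really uses:

```python
from difflib import SequenceMatcher

# Hypothetical default coefficients; callers can override them.
DEFAULT_WEIGHTS = {"name": 0.7, "type": 0.3}


def score_candidate(query, candidate, weights=None, type_blacklist=()):
    """Score a candidate topic against a query, in the range 0.0 to 1.0."""
    weights = DEFAULT_WEIGHTS if weights is None else weights
    # Candidates carrying a blacklisted type are rejected outright.
    if any(t in type_blacklist for t in candidate["types"]):
        return 0.0
    # Stand-in string similarity (not a true edit distance).
    name_sim = SequenceMatcher(
        None, query["name"].lower(), candidate["name"].lower()
    ).ratio()
    type_match = 1.0 if query["type"] in candidate["types"] else 0.0
    total = sum(weights.values())
    return (weights["name"] * name_sim + weights["type"] * type_match) / total


query = {"name": "GM", "type": "/business/company"}
company = {"name": "GM", "types": ["/business/company"]}
film = {"name": "GM", "types": ["/film/film"]}

company_score = score_candidate(query, company)
film_score = score_candidate(query, film)
blocked_score = score_candidate(query, film, type_blacklist=["/film/film"])
```

A client that cares more about type agreement than spelling could pass, say,
weights={"name": 0.4, "type": 0.6} without the service changing at all.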
How were the requirements for the new resolution service collected?
What community has input to them?
Tom
On Tue, Feb 17, 2009 at 6:55 PM, Colin Evans <colin at metaweb.com> wrote:
>
> On Feb 17, 2009, at 3:26 PM, Shawn Simister wrote:
>
> I've experimented with that service in the past and, like you, I found it
> lacking in several ways. Luckily, there's a new and much improved
> reconciliation service in early development right now. It lets you add a lot
> more constraints which gives much better results than the old one.
>
> For example, entering the following query into the new service should give
> you what you're looking for:
>
> {
> "/type/object/name":"GM",
> "/type/object/type":"/business/company"
> }
>
> Yes! As Shawn mentioned, we've got a new service in early development now.
> There is a first cut at a JavaScript UI for reconciling spreadsheets that
> Peter Burns is developing here:
> http://mqlx.com/reconciliation/recon.html
> The UI code is open-sourced here - feel free to post patches and
> improvements:
> http://github.com/metaweb/reconciliation_ui/tree/master
> The service will give back a JSON list of suggested reconciliation results
> in order of likelihood, and if there were enough matches in your query, it
> may decide that the first record is definitely a good match. It uses string
> distance matches, and you can put in the names of adjacent topics as well.
> Here's the query that we often test with that reconciles to the correct
> release of the film "Ocean's Eleven". Notice the flattened CVT property
> pointing to the name of the actor through the performance CVT:
> {
> "/type/object/name":"Oceans Eleven",
> "/type/object/type":"/film/film",
> "/film/film/starring/actor":"George Clooney"
> }
> The service is still under development, but please try it out and feel free
> to ask questions.
> Thanks!
> Colin
>
> _______________________________________________
> Developers mailing list
> Developers at freebase.com
> http://lists.freebase.com/mailman/listinfo/developers