[Freebase-discuss] 99% accuracy?

Shawn Simister simister at google.com
Fri Mar 30 20:49:56 UTC 2012


On Thu, Mar 29, 2012 at 10:49 PM, Fred Katz <fmkatz at gmail.com> wrote:

> This target is mentioned on the list from time to time.
> Is there anything written about:
> - just what it means: 99% true, or consistent, or complete? Or, all of
> the above?
>

We are looking at minimizing two types of errors in Freebase: identity
errors (duplicate and conflated topics) and factual errors. Folks who have
used Refinery have seen how we measure identity errors using
Matchmaker <http://www.freebase.com/apps/user/stefanomazzocchi/matchmaker>.
We're also developing tools like Fact Checker
<http://www.freebase.com/apps/user/sukritiramesh/factchecker> to find and
measure factual errors. We're currently much better at resolving identity
errors than factual errors, but we have people working on both aspects of
the problem here at Google. Experience has also shown us that identity
errors are much more damaging than factual errors.

We don't currently have any targets for completeness although we certainly
want to have high coverage of the most popular entities so that people can
build useful applications on top of Freebase.


> - how accuracy is measured.  Are Freebase assertions compared with
> some other source?  Are Integrity constraints used? Do you test
> samples?
>

Because of the size of Freebase, we do all of our quality measurements on
random samples, which are verified by multiple human judges. We have some
integrity constraints, like incompatible types
<http://www.freebase.com/view/dataworld/incompatible_types>, but they are
only used to find bad data and not to measure the overall quality of
Freebase.
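
As a rough illustration of sample-based measurement with multiple judges, here is a minimal Python sketch. It is not the process used at Google; the vote labels ("correct"/"incorrect"), the strict-majority rule, the function names, and the choice of a Wilson score interval are all assumptions made for the example.

```python
import math
from collections import Counter


def majority_label(votes):
    """Return the label picked by a strict majority of judges, else None."""
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    return label if n * 2 > len(votes) else None


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if total == 0:
        raise ValueError("need at least one resolved sample")
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def estimate_accuracy(judgments):
    """judgments: dict of sampled item id -> list of judge votes.

    Items without a strict majority are dropped (an assumption; they could
    instead be sent to an extra judge).
    """
    resolved = [majority_label(v) for v in judgments.values()]
    resolved = [r for r in resolved if r is not None]
    correct = sum(1 for r in resolved if r == "correct")
    low, high = wilson_interval(correct, len(resolved))
    return correct / len(resolved), low, high
```

The interval matters as much as the point estimate: a target like 99% can only be confirmed when the sample is large enough for the lower bound to clear it.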


> - where someone could see current measures of data quality for
> different domains.
>

I'd really like to see this sort of information available at the domain
level. I think one approach that might work would be to have a script that
calculates the data quality stats from the dumps and writes them to the
wiki. With the Freebase data dumps in BigQuery
<https://developers.google.com/bigquery/> it's possible to run a lot of
calculations that would normally time out in MQL.
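
A minimal sketch of what the aggregation step of such a script might look like, in Python. No such script exists yet as far as this message says; the input shape (pairs of domain and a "flagged" boolean produced by some upstream check), the function names, and the tab-separated report layout are hypothetical, and reading the dumps or writing to the wiki is left out.

```python
from collections import defaultdict


def per_domain_rates(records):
    """records: iterable of (domain, flagged) pairs.

    `flagged` is True when some check marked the record as bad.
    Returns {domain: (flagged_count, total_count, flagged_rate)}.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for domain, is_flagged in records:
        totals[domain] += 1
        if is_flagged:
            flagged[domain] += 1
    return {d: (flagged[d], totals[d], flagged[d] / totals[d]) for d in totals}


def format_report(rates):
    """Render the per-domain rates as tab-separated text, one domain per line."""
    lines = ["domain\tflagged\ttotal\trate"]
    for domain in sorted(rates):
        bad, total, rate = rates[domain]
        lines.append(f"{domain}\t{bad}\t{total}\t{rate:.2%}")
    return "\n".join(lines)
```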


> If you're targeting 99%, you must be keeping track of how close you are.
> Could that information be shared?
>

It's still a work in progress. I know that an internal survey of topics
from across Freebase showed a low, single-digit percentage of identity
errors, but I'd like to see a more rigorous study before we release any
official numbers.

-- 
Shawn Simister

Developer Programs Engineer
Google, San Francisco
http://freebase.com