[Freebase-discuss] 99% accuracy?
Tom Morris
tfmorris at gmail.com
Fri Mar 30 15:29:37 UTC 2012
On Fri, Mar 30, 2012 at 1:49 AM, Fred Katz <fmkatz at gmail.com> wrote:
> This target is mentioned on the list from time to time.
> Is there anything written about:
> - just what it means: 99% true, or consistent, or complete? Or, all of
> the above?
> - how accuracy is measured. Are Freebase assertions compared with
> some other source? Are Integrity constraints used? Do you test
> samples?
> - where someone could see current measures of data quality for
> different domains.
>
> If you're targeting 99%, you must be keeping track of how close you are.
> Could that information be shared?
The number gets bandied around a lot and confuses information
retrieval pros (precision? recall?) as well as mere mortals like me.
Sometimes you'll hear it repeated as "99% accurate, 95% of the time"
or some derivative, which makes even less sense.
I'd love to have a staff member explain exactly what quality goals
they have for which parts of Freebase and how they're measured, but
here's my understanding of how things work currently (and how I
explain it when asked):
The measure only applies to new bulk data loads. The 95% relates
purely to sample size. The sample size is selected to give a 95%
confidence interval for the results. Each instance from the sample is
reviewed by multiple (N=3?) human reviewers who vote on whether the
proposed change is valid. The reviewers (by majority vote?) must
agree that 99% of the samples are valid or the entire bulk data load
is sent back for rework.
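
As a rough sketch of the process described above (not Freebase's actual
tooling): the function names, the 1% margin of error, the
normal-approximation sample-size formula, and the ">=" comparison against
the 99% threshold are all assumptions made for illustration. The three
reviewers and the majority vote are the guesses given in the text.

```python
import math


def sample_size(margin_of_error, expected_rate=0.99, z=1.96):
    """Sample size for estimating a proportion at ~95% confidence (z=1.96).

    Uses the normal approximation n = z^2 * p * (1 - p) / e^2.
    The margin of error is an assumption; the source does not state one.
    """
    p = expected_rate
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)


def majority_valid(votes):
    """True if more than half of the reviewers voted the change valid."""
    return sum(votes) > len(votes) / 2


def load_passes(sample_votes, threshold=0.99):
    """Accept the bulk load only if enough sampled items were judged valid.

    sample_votes holds one list of boolean reviewer votes per sampled item.
    """
    valid = sum(majority_valid(votes) for votes in sample_votes)
    return valid / len(sample_votes) >= threshold


# Hypothetical review results: 100 sampled items, 3 reviewers each.
reviews = [[True, True, True]] * 99 + [[False, False, True]]
print(sample_size(0.01))      # items to sample for a +/-1% margin
print(load_passes(reviews))   # 99 of 100 judged valid
```

Note that a failed check rejects the whole load, not just the sampled
items that were voted down.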
OK, so what doesn't this cover? It doesn't cover historical bulk
loads. Things loaded in the pre-history of Freebase range from
crappy source data (Chef Moz), to data that was loaded without any
attempt at reconciliation (i.e. everything is a new topic) like SF
MOMA, to loads that used a version of the current sample-based
quality evaluation, but with less formal quality checks (often just
asking people to eyeball things in the sandbox) and quality thresholds
set more by "feel."
It doesn't cover recall (i.e. coverage). There's no statement of "99%
of films ever made" or "99% of films in IMDB" or even "99% of films in
Wikipedia."
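
To make the precision/recall distinction concrete, here is a toy
calculation. The film IDs and the reference list are invented for
illustration and are not Freebase data.

```python
def precision(loaded, correct):
    """Fraction of loaded items that are correct."""
    return len(loaded & correct) / len(loaded)


def recall(loaded, reference):
    """Fraction of the reference set that was loaded (coverage)."""
    return len(loaded & reference) / len(reference)


# Hypothetical: 10 films exist in some reference list, only 3 were loaded,
# and all 3 loaded films are correct.
reference = {f"film_{i}" for i in range(10)}
loaded = {"film_0", "film_1", "film_2"}

print(precision(loaded, reference))  # every loaded film is right
print(recall(loaded, reference))     # but most films are missing
```

A load like this would pass a 99% check on what was loaded while
covering less than a third of the reference list.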
It doesn't cover continuous processes such as the Wikipedia importer.
Anything which is newly added to Wikipedia that already exists in
Freebase will end up as a duplicate until someone discovers it and
merges it (e.g. all the asteroids Chris has recently merged, all the
National Register of Historic Places locations as Wikipedia catches up
with what we have, etc).
It doesn't cover data entered through the web client. It doesn't
cover data provided by user scripts through the MQL write API.
It doesn't address systematic bias in the evaluation system (e.g.
people confused by the UI). It doesn't address factual accuracy. It's
more about matches vs non-matches (e.g. this is the same John Smith as
in Wikipedia, as opposed to we have John Smith's birth date correct).
Both quality and coverage vary widely by domain, but I've never seen
any public numbers broken down by domain.
Tom