[Data-modeling] "Public person" and privacy
Robert Cook
robert at metaweb.com
Thu Jun 5 21:43:23 UTC 2008
Michael --
I agree with your critiques of Freebase, but I don't believe any of
them are existential. We have chosen to default to an open model
because it is both easiest to implement and has the greatest
flexibility. With our limited resources (and they are limited, given
the scope), we have to prioritize features carefully.
Fortunately, given the structured nature of Freebase and its
fine-grained attribution, there are many features that could be
developed to reduce the human labor needed for vandalism and error
correction. We've taken a "wait and see" approach: we'll watch where
problems arise and allocate effort accordingly (and quickly if
necessary). In some cases, application developers in the community
have already begun some of this work.
Here are some of the mechanisms we foresee helping with this problem.
We welcome other suggestions:
- Live monitoring of edit events that have a high chance of being
vandalism. Examples include changes to display names of topics that
are highly connected or people topics that are 'claimed' by Freebase
users. There are some nascent tools for handling this already, but
they will need improvement as vandalism becomes more of a problem.
- Run daily consistency checks. There are some errors that can be
automatically detected simply because they defy common sense. Rules,
inferencing and statistical analysis can flag inconsistencies for
human attention, e.g. a person's birth date can't come after his
date of death; a tropical cyclone can't have wind speeds > 400 km/h;
porn stars are not normally politicians, etc. These consistency
checks are one of the main reasons that Freebase should include
inferencing logic from sources like OpenCyc. Internally, we are
already using statistical methods to flag topics that have
incompatible types (a person cannot also be a location, for
instance).
- Provide mechanisms for 'pre-hoc' content moderation. With the
exception of schemas, Freebase, like Wikipedia, is entirely 'post-hoc'
moderated: changes are visible in real time to all and can be
reverted by any interested party. By contrast, 'pre-hoc' moderation
allows users to contribute changes, but those changes are not
visible until approved. This is done through a time stamp that is
attached to a topic and indicates when changes to that topic were
last approved. When the data is rendered in an application (perhaps
on the Freebase site itself), changes after that time stamp are not
shown.
- Property instance write protection. There are "brand name" data
sets that, when loaded into Freebase, shouldn't be modified. Examples
include the CIA World Factbook and the Center For Responsive Politics
campaign contribution data set for US politicians. The current
write-protection mechanisms aren't quite sufficient to support these
cases, but finer-grained write protection is planned, under which
only members of a particular user group can add, change or delete
property values of types within a particular domain.
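
To make two of the mechanisms listed above more concrete, here is a
minimal sketch of a rule-based consistency check and of the approval
time stamp filter. Everything in it is an assumption made for
illustration: the Person and Edit records, the INCOMPATIBLE_TYPES
table and both function names are hypothetical, and none of it
reflects Freebase's actual schema or API.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


# Hypothetical record shapes, not Freebase's real data model.
@dataclass
class Person:
    name: str
    born: Optional[date] = None
    died: Optional[date] = None
    types: frozenset = frozenset()


@dataclass
class Edit:
    made_at: datetime
    prop: str
    value: str


# Pairs of types assumed, for illustration, never to apply to one topic.
INCOMPATIBLE_TYPES = [("person", "location")]


def consistency_flags(topic: Person) -> list:
    """Return flags for human review; nothing is corrected automatically."""
    flags = []
    if topic.born and topic.died and topic.born > topic.died:
        flags.append("birth date is after date of death")
    for a, b in INCOMPATIBLE_TYPES:
        if a in topic.types and b in topic.types:
            flags.append(f"incompatible types: {a} and {b}")
    return flags


def approved_view(edits, last_approved: datetime):
    """Pre-hoc filter: keep only edits made at or before the approval time stamp."""
    return [e for e in edits if e.made_at <= last_approved]
```

An application rendering a topic would call approved_view with the
topic's last-approved time stamp and simply not show later edits,
while consistency_flags would run in a daily batch and queue its
output for a person to look at.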
Robert
On Jun 5, 2008, at 2:55 AM, Michael Scott wrote:
> there was a deathly silence on this - i wonder why
>
> privacy cuts right to the heart of what Freebase exposes itself to -
> the unlocked front door
>
> i've lurked on these lists for some months now in an attempt to get a
> measure of Freebase's prospects - how big its community is - to what
> extent it will succeed in what it aims to be
>
> from a corporate perspective i can see how the underlying product
> could be a very attractive proposition - a malleable database that
> would facilitate the unification of information across an enterprise -
> no need to lock the front door if there is always someone at home -
> there are natural constraints of good behaviour and supervision
> associated with wanting to keep one's job - it's fairly obvious what
> the policy should be and there's usually sufficient resources to
> guarantee that it's more than just words
>
> but as a public database the question of what Freebase contains
> depends precisely on what resources are available to back up any
> constraints - the "clear community standards" - because if these are
> more or less just words then what does Freebase contain
>
> from this perspective we could say that "public" is just another way
> of saying "more open to abuse"
>
> so let's say that Freebase becomes the world's number one public
> database - terabytes of potentially dirty data - how does the world
> handle that
>
> you can see how Wikipedia devolves responsibility to eyeballs -
> someone sees something is wrong and flags it or fixes it
>
> but with Freebase the eyeballs are more than likely going to be
> downstream somewhere the other end of an application that is
> extracting data from Freebase - at that point it is already too late -
> the "it's a wiki fix it" mantra doesn't apply - the information has
> been served up in a different context - and most likely in a context
> that would prefer not to serve dirty data - like water from a tap in a
> restaurant
>
> so the question is how from a software perspective does Freebase
> address the potential dirtiness of its data
>
> Polya has a nice little formula for calculating the probability of
> there still being mistakes in a document after more than one person
> has proofread it - perhaps something similar could be used to get
> users to estimate the dirtiness of the data in Freebase
>
> http://mathworld.wolfram.com/ProofreadingMistakes.html
>
> also - this lists all the Wikipedia edits by Metaweb
>
> http://wikiscanner.virgil.gr/f.php?ip1=64.81.62.32-63
>
> you can see where Robert Cook fixes some abuse about evil and killing
> kittens - is there already a way to do this on Freebase - is there a
> way to measure the degree to which any piece of data has been abused -
> are there plans for it
>
>
> _______________________________________________
> Data-modeling mailing list
> Data-modeling at freebase.com
> http://lists.freebase.com/mailman/listinfo/data-modeling