[Data-modeling] "Public person" and privacy

Michael Scott michael_scott at mac.com
Mon Jun 16 23:26:17 UTC 2008


Robert

I think really what I'm trying to get to the bottom of is quite what  
community Freebase aims to serve.

Wikipedia, for all its current authority, does still reflect its
hobbyist origins. There is a kind of fault tolerance built into the
readership that doesn't really mind when flaws appear. They might even  
be amusing.

But the readership of Freebase is different. Surely no one is going to  
consult the web interface to Freebase in preference to reading  
Wikipedia. If all goes well then the readers of Freebase will
predominantly be applications: initially, small-scale applications
created by personal investment. Freebase also has hobbyist roots. But
inevitably if it's to really grow then corporate investment must find  
ways to leverage the data, and this is where I still hit a wall.

If a large respectable organisation starts using Freebase as a data  
source in some way, what will the response be when it encounters  
something like this:

http://www.freebase.com/view/guid/9202a8c04000641f80000000060b1403

With my hobbyist hat on, Johnny's jokes are just amusing. But with my  
work hat on it's a different story.


On 5 Jun 2008, at 22:43, Robert Cook wrote:

> Michael --
>
> I agree with your critiques of Freebase, but I don't believe any of
> them are existential.  We have chosen to default to an open model
> because it is both easiest to implement and has the greatest
> flexibility.  With our limited resources (and they are limited, given
> the scope), we have to prioritize features carefully.
>
> Fortunately with the structured nature of Freebase and the fine-
> grained attribution, there are many features that could be developed
> to reduce the human labor for vandalism and error correction.  We've
> taken a "wait and see" approach to see where problems arise and we'll
> allocate effort accordingly (and quickly if necessary).  In some
> cases, application developers in the community have already begun some
> of this work.
>
> Here are some of the mechanisms we foresee helping with this problem.
> We welcome other suggestions:
>
> - Live monitoring of edit events that have a high chance of being
> vandalism.  Examples include changes to display names of topics that
> are highly connected or people topics that are 'claimed' by Freebase
> users.  There are some nascent tools for handling this already, but
> they will need improvement as vandalism becomes more of a problem.
>
> - Run daily consistency checks.  There are some errors that can be
> automatically detected simply because they defy common sense.   Rules,
> inferencing and statistical analysis can flag inconsistencies for
> human attention -- e.g. a person's birth date can't come after his
> date of death; a tropical cyclone can't have wind speeds > 400 km/h;
> porn stars are not normally politicians, etc.  These consistency
> checks are one of the main reasons that Freebase should include
> inferencing logic from sources like OpenCyc.  Internally, we are
> already using statistical methods to flag topics that have
> incompatible types (a person cannot also be a location, for
> instance).
>
> - Provide mechanisms for 'pre-hoc' content moderation.  With the
> exception of schemas, Freebase, like Wikipedia, is entirely 'post-hoc'
> moderated, where changes are visible in real time to all and can be
> reverted by any interested party.   Instead, 'pre-hoc' moderation
> allows users to contribute changes, but those changes will not be
> visible until approved.  This is done through a time stamp that is
> attached to a topic that indicates when changes to that topic were
> last approved.  When the data is rendered in an application (perhaps
> on the Freebase site itself), changes after that time stamp are not
> shown.
>
> - Property instance write protection.  There are "brand name" data
> sets that when loaded into Freebase shouldn't be modified.  Examples
> include the CIA World Factbook and the Center For Responsive Politics
> campaign contribution data set for US politicians.  The current write-
> protection mechanisms aren't quite sufficient to support these cases,
> but there is planned support for write protection, where only members
> of a particular user group can add, change or delete property values
> to types within a particular domain.
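
A minimal sketch of the daily consistency checks described in the list
above, assuming a simplified in-memory topic record. The Topic class, its
field names and the INCOMPATIBLE_TYPES table are hypothetical and are not
Freebase's schema or API. The three rules are the ones given as examples:
birth after death, wind speed over 400 km/h, and a person that is also a
location.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Set

# Hypothetical rule data; a real deployment would derive this from the schema.
INCOMPATIBLE_TYPES = [frozenset({"person", "location"})]
MAX_PLAUSIBLE_WIND_KMH = 400.0


@dataclass
class Topic:
    """Hypothetical, flattened view of a topic (not the Freebase data model)."""
    name: str
    types: Set[str] = field(default_factory=set)
    date_of_birth: Optional[date] = None
    date_of_death: Optional[date] = None
    max_wind_speed_kmh: Optional[float] = None


def check_topic(topic: Topic) -> List[str]:
    """Return human-readable problems that should be flagged for review."""
    problems = []
    if (topic.date_of_birth is not None
            and topic.date_of_death is not None
            and topic.date_of_birth > topic.date_of_death):
        problems.append("birth date is after date of death")
    if (topic.max_wind_speed_kmh is not None
            and topic.max_wind_speed_kmh > MAX_PLAUSIBLE_WIND_KMH):
        problems.append("wind speed exceeds plausible maximum")
    for combo in INCOMPATIBLE_TYPES:
        # Flag the topic only if it carries every type in the combination.
        if combo <= topic.types:
            problems.append("incompatible types: " + ", ".join(sorted(combo)))
    return problems


if __name__ == "__main__":
    suspect = Topic(
        name="Example topic",
        types={"person", "location"},
        date_of_birth=date(1990, 1, 1),
        date_of_death=date(1980, 1, 1),
    )
    for problem in check_topic(suspect):
        print(suspect.name, "->", problem)
```

The function only flags topics; it never edits them, which matches the
idea of raising inconsistencies for human attention.
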
>
> Robert
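
The 'pre-hoc' moderation item above (an approval time stamp on a topic,
with later changes hidden) can be sketched roughly as follows. The Edit
record and the approved_view function are hypothetical names, and
treating a topic with no approval stamp as showing nothing is an
assumption; the message does not say what happens in that case.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Edit:
    """Hypothetical record of one property write on a topic."""
    property_name: str
    value: str
    written_at: datetime


def approved_view(edits: List[Edit],
                  last_approved_at: Optional[datetime]) -> Dict[str, str]:
    """Return property -> value using only edits made at or before approval."""
    visible: Dict[str, str] = {}
    if last_approved_at is None:
        # Assumption: a topic that was never approved shows no data.
        return visible
    # Replay edits oldest first so the latest approved write wins.
    for edit in sorted(edits, key=lambda e: e.written_at):
        if edit.written_at <= last_approved_at:
            visible[edit.property_name] = edit.value
    return visible


if __name__ == "__main__":
    history = [
        Edit("name", "Original name", datetime(2008, 6, 1, 12, 0)),
        Edit("name", "Vandalised name", datetime(2008, 6, 10, 9, 30)),
    ]
    approved = datetime(2008, 6, 5, 0, 0)
    print(approved_view(history, approved))
```

An application that reads through such a filter never sees unapproved
writes, while contributors can still make them.
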
>
> On Jun 5, 2008, at 2:55 AM, Michael Scott wrote:
>
>> there was a deathly silence on this - i wonder why
>>
>> privacy cuts right to the heart of what Freebase exposes itself to -
>> the unlocked front door
>>
>> i've lurked on these lists for some months now in an attempt to get a
>> measure of Freebase's prospects - how big its community is - to what
>> extent it will succeed in what it aims to be
>>
>> from a corporate perspective i can see how the underlying product
>> could be a very attractive proposition - a malleable database that
>> would facilitate the unification of information across an enterprise -
>> no need to lock the front door if there is always someone at home -
>> there are natural constraints of good behaviour and supervision
>> associated with wanting to keep one's job - it's fairly obvious what
>> the policy should be and there's usually sufficient resources to
>> guarantee that it's more than just words
>>
>> but as a public database the question of what Freebase contains
>> depends precisely on what resources are available to back up any
>> constraints - the "clear community standards" - because if these are
>> more or less just words then what does Freebase contain
>>
>> from this perspective we could say that "public" is just another way
>> of saying "more open to abuse"
>>
>> so let's say that Freebase becomes the world's number one public
>> database - terabytes of potentially dirty data - how does the world
>> handle that
>>
>> you can see how Wikipedia devolves responsibility to eyeballs -
>> someone sees something is wrong and flags it or fixes it
>>
>> but with Freebase the eyeballs are more than likely going to be
>> downstream somewhere the other end of an application that is
>> extracting data from Freebase - at that point it is already too late -
>> the "it's a wiki fix it" mantra doesn't apply - the information has
>> been served up in a different context - and most likely in a context
>> that would prefer not to serve dirty data - like water from a tap in a
>> restaurant
>>
>> so the question is how from a software perspective does Freebase
>> address the potential dirtiness of its data
>>
>> Polya has a nice little formula for calculating the probability of
>> there still being mistakes in a document after more than one person
>> has proofread it - perhaps something similar could be used to get
>> users to estimate the dirtiness of the data in Freebase
>>
>> 	http://mathworld.wolfram.com/ProofreadingMistakes.html
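
For reference, a small sketch of Polya's estimate. If two independent
proofreaders find A and B mistakes, with C found by both, the expected
number still unnoticed is (A - C)(B - C) / C. Applying this to Freebase
(say, two reviewers auditing the same sample of topics) is an assumption
here, as are the function names; the estimate also relies on the two
reviewers working independently.

```python
def estimate_total_errors(found_by_first: int, found_by_second: int,
                          found_by_both: int) -> float:
    """Estimated total number of errors, A * B / C."""
    _validate(found_by_first, found_by_second, found_by_both)
    return found_by_first * found_by_second / found_by_both


def estimate_missed_errors(found_by_first: int, found_by_second: int,
                           found_by_both: int) -> float:
    """Estimated errors that neither reviewer noticed, (A - C)(B - C) / C."""
    _validate(found_by_first, found_by_second, found_by_both)
    return ((found_by_first - found_by_both)
            * (found_by_second - found_by_both) / found_by_both)


def _validate(a: int, b: int, c: int) -> None:
    # The estimate is undefined when the reviewers share no findings.
    if c <= 0:
        raise ValueError("need at least one error found by both reviewers")
    if c > min(a, b):
        raise ValueError("overlap cannot exceed either reviewer's count")


if __name__ == "__main__":
    print(estimate_missed_errors(30, 24, 20))
```

With 30 and 24 errors found and 20 in common, the estimate is 2 errors
still unnoticed out of an estimated 36 in total.
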
>>
>> also - this lists all the Wikipedia edits by Metaweb
>>
>> 	http://wikiscanner.virgil.gr/f.php?ip1=64.81.62.32-63
>>
>> you can see where Robert Cook fixes some abuse about evil and killing
>> kittens - is there already a way to do this on Freebase - is there a
>> way to measure the degree to which any piece of data has been abused -
>> are there plans for it
>>
>>
>> _______________________________________________
>> Data-modeling mailing list
>> Data-modeling at freebase.com
>> http://lists.freebase.com/mailman/listinfo/data-modeling


