[Data-modeling] English Words

Reilly Hayes rfh at metaweb.com
Wed Aug 12 22:45:48 UTC 2009


Loading Wordnet has been a topic of internal discussion for a while.
I think it is important to bring some of the fruits of that internal
discussion into this conversation.  I've pasted in the document from
our internal wiki.  Don't click on the links; they go to our internal
wiki.


As an aside, we've been told (by a leading NLP researcher who is also
a fan of Freebase) that the Prolog version of the Wordnet database is
the best place to start.  http://wordnetcode.princeton.edu/3.0/WNprolog-3.0.tar.gz

-r

Finally Loading Wordnet

 From Metaweb


Wordnet has a bit of a tortured history at Metaweb. One of the  
earliest test cases for graphd, it seems to be one of those obviously  
useful data sets which we have somehow never managed to load. A number  
of early Freebase users (most notably Powerset) have made use of it.  
In the general NLP community Wordnet seems to be heavily used.

Basically, Wordnet is a machine-readable description of the English
language carrying information roughly comparable to a dictionary and a
thesaurus. For example, Wordnet knows that leukemia is a form of
cancer, which is a type of sickness, and that there are several words
or phrases for specific types of leukemia (hyponyms), for example
"myeloid leukemia."

Contents

1 Summary
1.1 Reconciling Wordnet
1.1.1 Cotyping Topics and Wordnet Words
1.1.1.1 Usage
1.1.2 Separate Identities for Wordnet Words
1.1.2.1 Usage
1.2 String Values Revisited

Summary

I'm suggesting that we load Wordnet in a novel way: Each Wordnet word
would become an instance of a new top-level type (/common/symbol?)
analogous to /common/topic. Wordnet words would never be typed as
/common/topic but would be related to topics, the most common (only)
relationship meaning roughly "this symbol (word) can stand for this
topic".

  Reconciling Wordnet

The Wordnet schema is mature and amenable to relatively direct
translation into Freebase schema. The principal problem to be resolved
by a Wordnet load is how to relate Wordnet words to existing Metaweb
topics. Basically there are two choices:

- cotype existing topics as "Wordnet words" where we can reconcile
  Wordnet to existing topics, and create new identities for words that
  we cannot reconcile
- create new identities for all Wordnet words and relate those
  identities to existing topics when they can be reconciled

    Wordnet "synsets" may be better candidates for reconciliation with
    topics. Nix 01:35, 1 August 2009 (UTC)

    Wikipedia has several structures that map words to topics, including
    disambiguation pages and a template that refers to other notable uses
    of a word. This should be easy to extract from wex. Nix 01:35, 1
    August 2009 (UTC)

  Cotyping Topics and Wordnet Words

Cotyping entities as Topics and Wordnet Words engenders a number of  
problems:

- The spelling of a word (name) is fundamental to its identity; not so
  for a topic. For example, someone might decide that a Wikipedia
  article entitled "Leukemia" was really about "Myeloid Leukemia" and
  rename it. From the standpoint of concepts, renaming is entirely
  correct; from the standpoint of words, renaming is disastrous.
- Creation of Topics which are just English words is going to lead to
  multiple matches in autocomplete and may cause confusion. Usually, a
  word is not what you want to see; you want the concept that the word
  represents.
- Translations for Concepts and Words are different. Typically, a
  concept will have one "best" representation in every language. For
  words, this is not true at all. Most words have many possible
  translations depending on the concept they are intended to express.
    I think this one is pretty much a dealbreaker. Nix 01:35, 1 August
    2009 (UTC)
- Information about a word, for example its etymology, may not apply to
  all translations. For example, the English word "gift" is Germanic
  (compare the German "Mitgift"), while the French equivalent,
  "cadeau", is Romance.
- Because words have many senses, it is difficult to reconcile topics
  to words.

  Usage

Looking for topics which are special cases of a topic looks something  
like this:

{
    ...
    "type" : "/common/wordnet_word",
    "hyponym" : [{
       "type" : "/common/topic",
       "name" : null,
       "id" : null
    }],
    ...
}

  Separate Identities for Wordnet Words

An alternative to merging Wordnet Words and Freebase topics is to
create a new "top-level" type analogous to /common/topic to represent
words. Where we can reconcile them, instances of this new
/common/wordnet_word type are related to topics, typically those whose
name has the same spelling. The reconciliation problem is the same as
in the co-typing case (hard), but since reconciliation "failure"
doesn't result in confusing pairs of words and concepts with similar
spellings, solving it is much less urgent. The Wordnet data is usable
as is, and any connections between it and the existing world of
Freebase topics are added bonuses.

At a very basic level, this representation makes a clear distinction  
between a concept and the symbols used to represent that concept in  
various human languages.

Problems with separate Wordnet identities:

- A type with implicit language binding such as /common/wordnet_word
  fights with our existing /lang/* localization mechanism. Probably, we
  don't want to allow a German name for an English word as this simple
  pairing is woefully inadequate for the purposes of establishing a
  translation.
    i think this is ok. but see comment below on joining through a
    /type/text. Nix 01:35, 1 August 2009 (UTC)

  Usage

Looking for topics which are special cases of a topic looks something  
like this:

{
    ...
    "symbol" : {
       "lang" : "/lang/en",
       "hyponym" : [{
          "topic" : [{
             "name" : null,
             "id" : null
          }]
       }]
    },
    ...
}
where "symbol" is a new property of /common/topic which associates a
topic with a language symbol (a Wordnet word being the first example
of such).
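
To make the shape of that query concrete, here is a minimal in-memory
sketch of the separate-identities model. It is not Freebase code: the
Topic and Word classes, the link() helper, and the ids /en/leukemia and
/en/myeloid_leukemia are hypothetical names chosen for illustration. It
only shows the traversal the query describes: topic -> symbol ->
hyponym -> topic.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare by identity; the objects reference each other
class Topic:
    id: str
    name: str
    symbols: list = field(default_factory=list)   # words that can stand for this topic


@dataclass(eq=False)
class Word:
    spelling: str
    lang: str
    hyponyms: list = field(default_factory=list)  # more specific words
    topics: list = field(default_factory=list)    # topics this word can stand for


def link(word, topic):
    """Record that `word` can stand for `topic` (both directions)."""
    word.topics.append(topic)
    topic.symbols.append(word)


def special_cases(topic, lang="/lang/en"):
    """Topics reachable via topic -> symbol -> hyponym -> topic."""
    found = []
    for word in topic.symbols:
        if word.lang != lang:
            continue
        for hypo in word.hyponyms:
            for t in hypo.topics:
                if t not in found:
                    found.append(t)
    return found


# Hypothetical data mirroring the leukemia example used in this document.
leukemia_topic = Topic("/en/leukemia", "Leukemia")
myeloid_topic = Topic("/en/myeloid_leukemia", "Myeloid leukemia")
leukemia = Word("leukemia", "/lang/en")
myeloid = Word("myeloid leukemia", "/lang/en")
leukemia.hyponyms.append(myeloid)
link(leukemia, leukemia_topic)
link(myeloid, myeloid_topic)

print([t.name for t in special_cases(leukemia_topic)])
```

A word that was never reconciled simply has an empty topics list, so it
drops out of the result without affecting any topic.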

String Values Revisited

All of this (and some past discussions with Warren) prompts me to  
revisit the MQL value type /type/text. An MQL string value (instance
of /type/text) is represented as a single link whose left is the  
subject, right is the language identity (for example /lang/en), and  
value is the string itself. This representation was chosen to save  
primitives. MQL, at some transformational expense, makes this single  
primitive look like an object with properties like "value" and "lang".
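
As a rough illustration of the compact form, the sketch below models
one link primitive and the object-like view that MQL presents on top of
it. The field names left, right and value come from the description
above; the Link class, the as_text_object() helper and the subject id
/en/leukemia are assumptions made for the example, not graphd or MQL
internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    left: str    # the subject identity
    right: str   # the language identity, for example /lang/en
    value: str   # the string itself


def as_text_object(link):
    """Present a single link the way a /type/text value is presented."""
    return {"value": link.value, "lang": link.right}


# Hypothetical subject id; one primitive carries the whole name.
name = Link(left="/en/leukemia", right="/lang/en", value="Leukemia")
print(as_text_object(name))
```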

However, given the presence of an English Wordnet and the possibility
of other Wordnets, it would be quite natural to create actual
identities to represent words and phrases. As the sole
representation for text, this is extremely expensive. For example,
naming something "leukemia" would cost 5 primitives:

  1. the link from subject to "leukemia"
  2. the identity for "leukemia"
  3. the permission for the identity "leukemia"
  4. the value link carrying the rawstring "leukemia"
  5. a link from the identity "leukemia" to the language, /lang/en

To a very slight degree, the primitive burn is offset by the fact that
naming any subsequent thing "leukemia" will cost one primitive, a link
to the identity.
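
The arithmetic can be sketched as follows, assuming (as described
above) that the compact form costs one link per name and the expanded
form costs 5 primitives for the first use of a spelling and 1 for each
later use. The function names are made up for the example.

```python
def compact_cost(uses):
    """Current /type/text form: one link primitive per name."""
    return uses


def expanded_cost(uses):
    """Reified form: 5 primitives for the first use, 1 link per later use."""
    if uses == 0:
        return 0
    return 5 + (uses - 1)


# Columns: number of things sharing the name, compact cost, expanded cost.
for uses in (1, 10, 1000):
    print(uses, compact_cost(uses), expanded_cost(uses))
```

Under these assumptions the expanded form carries a fixed overhead of 4
primitives per distinct string no matter how often the string is
reused, so the offset really is slight: it shrinks in relative terms
but never pays for itself in primitive count.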

As our sole representation for strings, this is excessively expensive.
However, if we're going to be loading Wordnet anyway, it becomes
tempting to allow this "expanded" form of /type/text in addition to
the current compact form because it resolves the tension between
localized strings and language symbols (words).

Unfortunately, implementation of a hybrid scheme is going to be
relatively expensive: When asking for an object's name, we need to use
a graph "or" to ask for both types of name. Moreover, when writing
English strings, we would need to check for existing words with the
same spelling and refer to those instead of creating a literal string
value. Lastly, when creating a new Wordnet word, we would need to
check for existing English strings with the same spelling and replace
them with links to the new word. Pretty daunting.
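
The three obligations above can be sketched with a toy in-memory store.
This is only an illustration under simplifying assumptions (English
only, one name per subject, one word per spelling); the HybridNames
class, its method names and the ids used are hypothetical and do not
correspond to any graphd or MQL API.

```python
class HybridNames:
    """Toy store holding names either as literals or as links to words."""

    def __init__(self):
        self.word_by_spelling = {}   # spelling -> word id
        self.spelling_by_word = {}   # word id -> spelling
        self.names = {}              # subject -> ("literal", str) or ("word", word id)

    def write_name(self, subject, spelling):
        # Writing: prefer a link to an existing word with the same spelling.
        if spelling in self.word_by_spelling:
            self.names[subject] = ("word", self.word_by_spelling[spelling])
        else:
            self.names[subject] = ("literal", spelling)

    def create_word(self, word_id, spelling):
        # Creating a word: replace matching literals with links to it.
        self.word_by_spelling[spelling] = word_id
        self.spelling_by_word[word_id] = spelling
        for subject, (kind, value) in list(self.names.items()):
            if kind == "literal" and value == spelling:
                self.names[subject] = ("word", word_id)

    def read_name(self, subject):
        # Reading: handle both representations (the "or" in the text).
        kind, value = self.names[subject]
        if kind == "literal":
            return value
        return self.spelling_by_word[value]


store = HybridNames()
store.write_name("/subject/1", "leukemia")              # stored as a literal
store.create_word("/word/en/leukemia", "leukemia")      # literal becomes a link
store.write_name("/subject/2", "leukemia")              # links straight away
print(store.names)
```

The scan in create_word touches every stored name, which hints at why
the real thing would be daunting at the scale of the whole graph.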

It would be interesting to know how many English strings in the
current OTG would be candidates for replacement with a link to a
Wordnet word.

I don't propose that we actually do this, but it might be worth  
thinking about. Certainly, if you're a fan of "reified strings" this  
is the sort of thing that they're supposed to be good for.

Much of the need for this would go away if you could search "through"  
a /type/text in mql using a reverse property. then the /type/text  
would lead you straight to the "word" object. Nix 01:35, 1 August 2009  
(UTC)

On Aug 12, 2009, at 3:20 PM, Jeff Prucher wrote:

>
>
>> -----Original Message-----
>> From: data-modeling-bounces at freebase.com
>> [mailto:data-modeling-bounces at freebase.com] On Behalf Of Iain Sproat
>> Sent: Wednesday, August 12, 2009 5:34 AM
>> To: Freebase data modeling mailing list
>> Subject: Re: [Data-modeling] English Words
>>
>> Arthur,
>>
>> Thanks for taking a look at it - I've since tweaked the
>> schema (once again!).  Your work made me realise that I was
>> trying to be a bit too clever with a separate synonym
>> property, and that synonyms are already taken care of by the
>> omnipresent "also known as" /common/topic/alias
>> property.   I've changed the synset properties so that word
>> morphology has its own property/CVT, and am using the
>> /common/topic/alias for all synonyms.  The data you've added
>> should now display correctly in http://dictionary.freebaseapps.com
>>
>> I agree that each semantic meaning should be a completely
>> separate topic from any other semantic meaning.  e.g. a rat
>> (animal) should be a separate topic from rat (informer).  If
>> different meanings have been merged together in the same
>> topic, then please flag the topic for split.
>
> I disagree that word instances should be merged with the topic for the
> thing they represent. They are really not the same thing at all. The
> abstract notion of the genus Rattus (which is what /en/rat represents)
> is not the same thing as the English noun "rat", which is also
> definitely not the same thing as the Spanish word "rata" or the German
> word "Ratte", which is what this approach seems to imply.
>
> Also, by relying on aliases for synonymy, we lose the ability to do
> WordNet-y things like tell which sense of the word "rat" the topic for
> "rattus" (or "informer") is synonymous with.
>
>> We're lacking Dictionary data at the moment, so the most
>> useful way to contribute would be to import dictionary
>> definitions to Freebase (Wiktionary and WordNet would be good
>> starting points).  Also, working with Shawn's Alias app to
>> improve topic aliases would definitely help.
>
> One idea about WordNet that's been suggested is that a "word" type in
> Freebase could be created that didn't include /common/topic. This
> would prevent the client from becoming cluttered up with topics for
> words (which users would obviously try to use in place of the topics
> for the thing the words represent), but would be no less easy to use
> through the API.
>
> Jeff
>
>> Iain
>>
>> On Wed, Aug 12, 2009 at 1:29 PM, Arthur van Hoff
>> <arthur.van.hoff at gmail.com> wrote:
>>> Hi Iain,
>>>
>>> Thanks for doing this. It looks very promising. I tried manually
>>> adding two synonyms for "rat" (verbs) from wordnet; I'm not sure I
>>> did it right. Can you check?
>>>
>>> Have you considered how other languages feature in this schema? It
>>> would be great if it were possible to find synonyms for words in
>>> other human languages. We could scrape a lot of translations from
>>> Wikipedia if that is useful.
>>>
>>> I noticed that for the noun "Rat" you have merged the concept of the
>>> Animal with the Noun. I'm not sure that this is the right approach.
>>> In my view the noun "Rat" is not the same as the animal "Rat". This
>>> approach might get confusing once there are nouns in other languages
>>> for the word "Rat".
>>>
>>> Alternatively, you could model the noun Rat as a separate topic with
>>> a property which refers to the defining topic (the animal). That way
>>> the animal topic would have reverse properties for all nouns in all
>>> languages (eventually). Perhaps that will work?
>>>
>>> I'd like to contribute some, let me know if there is anything I can
>>> do.
>>>
>>> Thanks.
>>>
>>>
>>> On Tue, Aug 11, 2009 at 9:30 PM, Iain Sproat
>>> <iainsproat at gmail.com> wrote:
>>>>
>>>> I made a few tweaks to the schema at http://writing.freebase.com,
>>>> which meant that the WordNet stuff didn't work so well on
>>>> freebase.com (you can't easily see all the synsets of a word).  To
>>>> compensate, I've created a freebase dictionary app at
>>>> http://dictionary.freebaseapps.com which emulates the WordNet web
>>>> interface.
>>>> There are only a couple of dictionary examples in freebase (waiting
>>>> until the schema is stable before importing WordNet) - and these
>>>> can be seen at http://dictionary.freebaseapps.com/?word=rat &
>>>> http://dictionary.freebaseapps.com/?word=red
>>>> There's also a bleeding edge view (showing hypernyms and hyponyms) at
>>>> http://2.dictionary.sprocketonline.user.dev.freebaseapps.com/?word=rat
>>>> Finally, I've added a pronunciation type to the base but haven't
>>>> filled in any data for that yet.
>>>> http://www.freebase.com/type/schema/base/writing/pronounciation?domain=/base/writing
>>>> Iain
>>>> On Tue, Aug 11, 2009 at 1:39 AM, Iain Sproat
>>>> <iainsproat at gmail.com> wrote:
>>>>>
>>>>> I've had a go at modelling this.  My effort is primarily a synonym
>>>>> set type and a word CVT (linked to the synonym property of synset).
>>>>> see also
>>>>> http://www.freebase.com/view/guid/9202a8c04000641f8000000007cf5081
>>>>> I went a bit overboard and also modelled glyphs, graphemes,
>>>>> diacritics, lexical categories, morphemes, phonemes etc. - all in
>>>>> the (poorly named) writing base.  There's a few things missing,
>>>>> particularly lemmas and word roots which would be useful if anyone
>>>>> is planning on using freebase data with NLP.
>>>>> One of the things I noticed was that freebase only really has
>>>>> nouns. I assume that verbs, adjectives etc. are also suitable for
>>>>> freebase, but nobody's yet loaded them?
>>>>> Iain
>>>>>
>>>>> On Fri, May 8, 2009 at 1:10 AM, spencer kelly
>>>>> <spencerkelly86 at gmail.com> wrote:
>>>>>>
>>>>>> agree, i think the value of linguistic data is >= the value of any
>>>>>> other data we have in freebase -- only more awkward to enter.
>>>>>> with faith in the modelling power of the graph, i assume someone
>>>>>> will figure out a good way to do it eventually.
>>>>>>
>>>>>> _______________________________________________
>>>>>> Data-modeling mailing list
>>>>>> Data-modeling at freebase.com
>>>>>> http://lists.freebase.com/mailman/listinfo/data-modeling
>>>>>>
>>>>>
>>>
>>>
>>>
>>> --
>>> Arthur van Hoff
>>> arthur.van.hoff at gmail.com
>>> 650-283-0842
>>>
