[Data-modeling] English Words
Reilly Hayes
rfh at metaweb.com
Wed Aug 12 22:45:48 UTC 2009
Loading wordnet has been a topic of internal discussion for a while.
I think it is important to bring some of the fruits of that internal
discussion into this conversation. I've pasted in the document from
our internal Wiki. Don't click on the links, they go to our internal
wiki.
As an aside, we've been told (by a leading NLP researcher who is also
a fan of Freebase) that the prolog version of the wordnet database is
the best place to start. http://wordnetcode.princeton.edu/3.0/WNprolog-3.0.tar.gz
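
For anyone starting from that tarball, here is a minimal sketch of reading two of the fact files. It assumes the layout described in the WordNet Prolog documentation: wn_s.pl holds facts of the form s(synset_id, w_num, 'word', ss_type, sense_number, tag_count). and wn_hyp.pl holds hyp(synset_id, synset_id)., with apostrophes inside words doubled. The function names are hypothetical, and the layout should be checked against the files themselves.

```python
import re

# Assumed fact layout (verify against the prologdb documentation):
#   s(synset_id,w_num,'word',ss_type,sense_number,tag_count).
#   hyp(synset_id,synset_id).
S_FACT = re.compile(
    r"^s\((\d+),(\d+),'((?:[^']|'')*)',([nvasr]),(\d+),(\d+)\)\.$"
)
HYP_FACT = re.compile(r"^hyp\((\d+),(\d+)\)\.$")


def parse_s(line):
    """Parse one s/6 fact into a dict, or return None if it does not match."""
    m = S_FACT.match(line.strip())
    if m is None:
        return None
    synset_id, w_num, word, ss_type, sense, tag_count = m.groups()
    return {
        "synset_id": int(synset_id),
        "w_num": int(w_num),
        # Prolog escapes an apostrophe inside a quoted atom by doubling it.
        "word": word.replace("''", "'"),
        "ss_type": ss_type,
        "sense_number": int(sense),
        "tag_count": int(tag_count),
    }


def parse_hyp(line):
    """Parse one hyp/2 fact into a pair of synset ids, or return None."""
    m = HYP_FACT.match(line.strip())
    return (int(m.group(1)), int(m.group(2))) if m else None
```

Each synset id then becomes a natural key for whatever identity the load creates, and the hyp pairs become links between those identities.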
-r
Finally Loading Wordnet
From Metaweb
Wordnet has a bit of a tortured history at Metaweb. One of the
earliest test cases for graphd, it seems to be one of those obviously
useful data sets which we have somehow never managed to load. A number
of early Freebase users (most notably Powerset) have made use of it.
In the general NLP community Wordnet seems to be heavily used.
Basically, Wordnet is a machine-readable description of the English
language carrying information roughly comparable to a dictionary and a
thesaurus. For example, Wordnet knows that leukemia is a form of
cancer, which is a type of sickness, and that there are several words
or phrases for specific types of leukemia (hyponyms), for example
"myeloid leukemia."
Contents
1 Summary
1.1 Reconciling Wordnet
1.1.1 Cotyping Topics and Wordnet Words
1.1.1.1 Usage
1.1.2 Separate Identities for Wordnet Words
1.1.2.1 Usage
1.2 String Values Revisited
Summary
I'm suggesting that we load Wordnet in a novel way: each Wordnet word
would become an instance of a new top-level type (/common/symbol?)
analogous to /common/topic. Wordnet words would never be typed as
/common/topic but would be related to topics, the most common (only)
relationship meaning roughly "this symbol (word) can stand for this
topic."
Reconciling Wordnet
The Wordnet schema is mature and amenable to relatively direct
translation into Freebase schema. The principal problem to be resolved
by a Wordnet load is how to relate Wordnet words to existing Metaweb
topics. Basically there are two choices:
- cotype existing topics as "Wordnet words" where we can reconcile
  Wordnet to existing topics, and create new identities for words that
  we cannot reconcile
- create new identities for all Wordnet words and relate those
  identities to existing topics when they can be reconciled
wordnet "synsets" may be better candidates for reconciliation with
topics. Nix 01:35, 1 August 2009 (UTC)
wikipedia has several structures that map words to topics, including
disambiguation pages and a template that refers to other notable uses
of a word. this should be easy to extract from wex. Nix 01:35, 1
August 2009 (UTC)
Cotyping Topics and Wordnet Words
Cotyping entities as Topics and Wordnet Words engenders a number of
problems:
- The spelling of a word (name) is fundamental to its identity; not so
  for a topic. For example, someone might decide that a Wikipedia
  article entitled "Leukemia" was really about "Myeloid Leukemia" and
  rename it. From the standpoint of concepts, renaming is entirely
  correct; from the standpoint of words, renaming is disastrous.
- Creation of Topics which are just English words is going to lead to
  multiple matches in autocomplete and may cause confusion. Usually, a
  word is not what you want to see; you want the concept that the word
  represents.
- Translations for Concepts and Words are different. Typically, a
  concept will have one "best" representation in every language. For
  words, this is not true at all. Most words have many possible
  translations depending on the concept they are intended to express.
    I think this one is pretty much a dealbreaker. Nix 01:35, 1 August
    2009 (UTC)
- Information about a word, for example its etymology, may not apply to
  its translations. For example, the English word "gift" is Germanic,
  from "mitgift"; the French equivalent, "cadeau", is Romance.
- Because words have many senses, it is difficult to reconcile topics to
  words.
Usage
Looking for topics which are special cases of a topic looks something
like this:
{
  ...
  "type" : "/common/wordnet_word",
  "hyponym" : [{
    "type" : "/common/topic",
    "name" : null,
    "id" : null
  }],
  ...
}
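
As a rough illustration of how a client might assemble the query above, here is a small Python sketch that builds it as a plain dictionary and serializes it to JSON. The /common/wordnet_word type and the hyponym property are the proposal's names, not an existing schema; the function name and the choice to constrain on "name" are assumptions standing in for the elided parts of the query.

```python
import json


def cotyped_hyponym_query(word_name):
    """Build the cotyping-style query for hyponyms of one word.

    Hypothetical helper: it constrains on "name" only because the
    original query elides its other constraints.
    """
    return [{
        "type": "/common/wordnet_word",
        "name": word_name,
        "hyponym": [{
            "type": "/common/topic",
            "name": None,  # None serializes to JSON null: "fill this in"
            "id": None,
        }],
    }]


query = cotyped_hyponym_query("leukemia")
print(json.dumps(query, indent=2))
```

Building the query as data and letting json.dumps serialize it avoids the hand-written quoting and comma mistakes that are easy to make in a literal query string.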
Separate Identities for Wordnet Words
An alternative to merging Wordnet Words and Freebase topics is to
create a new "top-level" type analogous to /common/topic to represent
words. Where we can reconcile them, instances of this new
/common/wordnet_word type are related to topics, typically those whose
name has the same spelling. The reconciliation problem is the same as
in the co-typing case (hard), but since reconciliation "failure"
doesn't result in confusing pairs of words and concepts with similar
spellings, solving it is much less urgent. The Wordnet data is usable
as is, and any connections between it and the existing world of
Freebase topics are added bonuses.
At a very basic level, this representation makes a clear distinction
between a concept and the symbols used to represent that concept in
various human languages.
Problems with separate Wordnet identities:
A type with implicit language binding such as /common/wordnet_word
fights with our existing /lang/* localization mechanism. Probably, we
don't want to allow a German name for an English word as this simple
pairing is woefully inadequate for the purposes of establishing a
translation.
i think this is ok. but see comment below on joining through a /type/
text. Nix 01:35, 1 August 2009 (UTC)
Usage
Looking for topics which are special cases of a topic looks something
like this:
{
  ...
  "symbol" : {
    "lang" : "/lang/en",
    "hyponym" : [{
      "topic" : [{
        "name" : null,
        "id" : null
      }]
    }]
  }
  ...
}
where "symbol" is a new property of /common/topic which associates a
topic with a language symbol (a Wordnet word being the first example of
such).
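
To make the separate-identity model concrete, here is a small in-memory sketch in Python. The class names, the link helper, and the topic ids are all hypothetical; the only thing taken from the proposal is the shape of the traversal: topic, then its symbols in one language, then their hyponyms, then the topics those hyponyms stand for.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity comparison; the objects reference each other
class Topic:
    id: str
    name: str
    symbols: list = field(default_factory=list)


@dataclass(eq=False)
class Symbol:
    spelling: str
    lang: str
    hyponyms: list = field(default_factory=list)
    topics: list = field(default_factory=list)


def link(symbol, topic):
    """Record that this symbol (word) can stand for this topic."""
    symbol.topics.append(topic)
    topic.symbols.append(symbol)


def special_cases(topic, lang="/lang/en"):
    """Topics reachable via topic -> symbol -> hyponym -> topic."""
    found = []
    for symbol in topic.symbols:
        if symbol.lang != lang:
            continue
        for hyponym in symbol.hyponyms:
            found.extend(hyponym.topics)
    return found


# Illustrative data only; the ids are made up for this sketch.
leukemia = Topic("/en/leukemia", "Leukemia")
myeloid = Topic("/en/myeloid_leukemia", "Myeloid leukemia")
w_leukemia = Symbol("leukemia", "/lang/en")
w_myeloid = Symbol("myeloid leukemia", "/lang/en")
w_unreconciled = Symbol("hairy cell leukemia", "/lang/en")  # no topic yet
w_leukemia.hyponyms += [w_myeloid, w_unreconciled]
link(w_leukemia, leukemia)
link(w_myeloid, myeloid)
```

Note that the unreconciled word simply contributes nothing to the result: a reconciliation "failure" costs a missing link rather than a confusing duplicate topic.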
String Values Revisited
All of this (and some past discussions with Warren) prompts me to
revisit the MQL value type /type/text. A MQL string value (instance
of /type/text) is represented as a single link whose left is the
subject, right is the language identity (for example /lang/en), and
value is the string itself. This representation was chosen to save
primitives. MQL, at some transformational expense, makes this single
primitive look like an object with properties like "value" and "lang".
However, given the presence of an English Wordnet and the possibility
of other Wordnets, it would be quite natural to create actual
identities to represent words and phrases. As the sole
representation for text, this is extremely expensive. For example,
naming something "leukemia" would cost 5 primitives:
- the link from subject to "leukemia"
- the identity for "leukemia"
- the permission for the identity "leukemia"
- the value link carrying the rawstring "leukemia"
- a link from the identity "leukemia" to the language, /lang/en
To a very slight degree, the primitive burn is offset by the fact that
naming any subsequent thing "leukemia" will cost one primitive, a link
to the identity.
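
The arithmetic can be sketched as follows. The constants come straight from the counts above (5 primitives for the first use of a string, 1 for each reuse, and a single link per name in the current compact form); the function names are hypothetical.

```python
FIRST_USE_COST = 5  # link + identity + permission + value link + language link
REUSE_COST = 1      # one link to the already existing identity
COMPACT_COST = 1    # current /type/text form: a single link per name


def expanded_cost(names):
    """Primitives needed if every name links to a word identity."""
    distinct = len(set(names))
    reuses = len(names) - distinct
    return distinct * FIRST_USE_COST + reuses * REUSE_COST


def compact_cost(names):
    """Primitives needed with the current single-link representation."""
    return len(names) * COMPACT_COST
```

Under these counts the expanded form always costs 4 extra primitives per distinct string, however often the string is reused, so reuse narrows the relative gap but never closes it.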
As our sole representation for strings, this is excessively expensive.
However, if we're going to be loading Wordnet anyway, it becomes
tempting to allow this "expanded" form of /type/text in addition to
the current compact form because it resolves the tension between
localized strings and language symbols (words).
Unfortunately, implementation of a hybrid scheme is going to be
relatively expensive: When asking for an object's name, we need to use
a graph "or" to ask for both types of name. Moreover, when writing
English strings, we would need to check for existing words with the
same spelling and refer to those instead of creating a literal string
value. Lastly, when creating a new Wordnet word, we would need to
check for existing English strings with the same spelling and replace
them with links to the new word. Pretty daunting.
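
The three obligations of the hybrid scheme can be sketched with a toy in-memory store. This is an illustration under heavy assumptions, not how graphd or MQL would implement it: the class, its dictionaries, and the method names are all hypothetical, and the full scan in create_word is exactly the kind of cost that makes the real thing daunting.

```python
class HybridNameStore:
    """Toy store holding names as literal strings or as links to words."""

    def __init__(self):
        self.words = {}     # (lang, spelling) -> word id
        self.spelling = {}  # word id -> (lang, spelling)
        self.literal = {}   # subject -> list of (lang, string)
        self.linked = {}    # subject -> list of word ids

    def write_name(self, subject, text, lang="/lang/en"):
        # On write, prefer an existing word with the same spelling.
        word_id = self.words.get((lang, text))
        if word_id is not None:
            self.linked.setdefault(subject, []).append(word_id)
        else:
            self.literal.setdefault(subject, []).append((lang, text))

    def read_names(self, subject, lang="/lang/en"):
        # On read, ask for both kinds of name (the "or").
        names = [s for l, s in self.literal.get(subject, []) if l == lang]
        for word_id in self.linked.get(subject, []):
            word_lang, spelling = self.spelling[word_id]
            if word_lang == lang:
                names.append(spelling)
        return names

    def create_word(self, word_id, text, lang="/lang/en"):
        # On word creation, replace matching literals with links.
        self.words[(lang, text)] = word_id
        self.spelling[word_id] = (lang, text)
        for subject, values in self.literal.items():
            if (lang, text) in values:
                self.literal[subject] = [v for v in values if v != (lang, text)]
                self.linked.setdefault(subject, []).append(word_id)
```

A caller never sees which representation backs a name: read_names returns the same spelling before and after create_word migrates the literal to a link.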
It would be interesting to know how many English strings in the
current OTG would be candidates for replacement with a link to a
Wordnet word.
I don't propose that we actually do this, but it might be worth
thinking about. Certainly, if you're a fan of "reified strings" this
is the sort of thing that they're supposed to be good for.
Much of the need for this would go away if you could search "through"
a /type/text in mql using a reverse property. then the /type/text
would lead you straight to the "word" object. Nix 01:35, 1 August 2009
(UTC)
On Aug 12, 2009, at 3:20 PM, Jeff Prucher wrote:
>
>
>> -----Original Message-----
>> From: data-modeling-bounces at freebase.com
>> [mailto:data-modeling-bounces at freebase.com] On Behalf Of Iain Sproat
>> Sent: Wednesday, August 12, 2009 5:34 AM
>> To: Freebase data modeling mailing list
>> Subject: Re: [Data-modeling] English Words
>>
>> Arthur,
>>
>> Thanks for taking a look at it - I've since tweaked the
>> schema (once again!). Your work made me realise that I was
>> trying to be a bit too clever with a separate synonym
>> property, and that synonyms are already taken care of by the
>> omnipresent "also known as" /common/topic/alias
>> property. I've changed the symset properties so that word
>> morphology has its own property/CVT, and am using the
>> /common/topic/alias for all synonyms. The data you've added
>> should now display correctly in http://dictionary.freebaseapps.com
>>
>> I agree that each semantic meaning should be a completely
>> separate topic from any other semantic meaning. e.g. a rat
>> (animal) should be a separate topic from rat (informer). If
>> different meanings have been merged together in the same
>> topic, then please flag the topic for split.
>
> I disagree that word instances should be merged with the topic for
> the thing they represent. They are really not the same thing at all.
> The abstract notion of the genus Rattus (which is what /en/rat
> represents) is not the same thing as the English noun "rat", which is
> also definitely not the same thing as the Spanish word "rata" or the
> German word "Ratten", which is what this approach seems to imply.
>
> Also, by relying on aliases for synonymy, we lose the ability to do
> WordNet-y things like tell which sense of the word "rat" the topic for
> "rattus" (or "informer") is synonymous with.
>
>> We're lacking Dictionary data at the moment, so the most
>> useful way to contribute would be to import dictionary
>> definitions to Freebase (Wiktionary and WordNet would be good
>> starting points). Also, working with Shawn's Alias app to
>> improve topic aliases would definitely help.
>
> One idea about WordNet that's been suggested is that a "word" type in
> Freebase could be created that didn't include /common/topic. This
> would prevent the client from becoming cluttered up with topics for
> words (which users would obviously try to use in place of the topics
> for the thing the words represent), but would be no less easy to use
> through the API.
>
> Jeff
>
>> Iain
>>
>> On Wed, Aug 12, 2009 at 1:29 PM, Arthur van Hoff
>> <arthur.van.hoff at gmail.com> wrote:
>>> Hi Iain,
>>>
>>> Thanks for doing this. It looks very promising. I tried manually
>>> adding two synonyms for "rat" (verbs) from wordnet, I'm not sure I
>>> did it right. Can you check?
>>>
>>> Have you considered how other languages feature in this schema? It
>>> would be great if it were possible to find synonyms for words in
>>> other human languages. We could scrape a lot of translations from
>>> Wikipedia if that is useful.
>>>
>>> I noticed that for the noun "Rat" you have merged the concept of the
>>> Animal with the Noun. I'm not sure that this is the right approach.
>>> In my view the noun "Rat" is not the same as the animal "Rat". This
>>> approach might get confusing once there are nouns in other languages
>>> for the word "Rat".
>>>
>>> Alternatively, you could model the noun Rat as a separate topic with
>>> a property which refers to the defining topic (the animal). That way
>>> the animal topic would have reverse properties for all nouns in all
>>> languages (eventually). Perhaps that will work?
>>>
>>> I'd like to contribute some, let me know if there is anything I can
>>> do.
>>>
>>> Thanks.
>>>
>>>
>>> On Tue, Aug 11, 2009 at 9:30 PM, Iain Sproat
>>> <iainsproat at gmail.com> wrote:
>>>>
>>>> I made a few tweaks to the schema at http://writing.freebase.com,
>>>> which meant that the WordNet stuff didn't work so well on
>>>> freebase.com (you can't easily see all the symsets of a word). To
>>>> compensate, I've created a freebase dictionary app at
>>>> http://dictionary.freebaseapps.com which emulates the WordNet web
>>>> interface.
>>>>
>>>> There's only a couple of dictionary examples in freebase (waiting
>>>> until the schema is stable before importing WordNet) - and these
>>>> can be seen at http://dictionary.freebaseapps.com/?word=rat &
>>>> http://dictionary.freebaseapps.com/?word=red
>>>>
>>>> There's also a bleeding edge view (showing hypernyms and hyponyms) at
>>>> http://2.dictionary.sprocketonline.user.dev.freebaseapps.com/?word=rat
>>>>
>>>> Finally, I've added a pronounciation type to the base but haven't
>>>> filled in any data for that yet.
>>>> http://www.freebase.com/type/schema/base/writing/pronounciation?domain=/base/writing
>>>>
>>>> Iain
>>>> On Tue, Aug 11, 2009 at 1:39 AM, Iain Sproat
>>>> <iainsproat at gmail.com> wrote:
>>>>>
>>>>> I've had a go at modelling this. My effort is primarily a synonym
>>>>> set type and a word CVT (linked to the synonym property of
>>>>> symset). See also
>>>>> http://www.freebase.com/view/guid/9202a8c04000641f8000000007cf5081
>>>>>
>>>>> I went a bit overboard and also modelled glyphs, graphemes,
>>>>> diacritics, lexical categories, morphemes, phonemes etc. - all in
>>>>> the (poorly named) writing base. There's a few things missing,
>>>>> particularly lemmas and word roots, which would be useful if anyone
>>>>> is planning on using freebase data with NLP.
>>>>>
>>>>> One of the things I noticed was that freebase only really has
>>>>> nouns. I assume that verbs, adjectives etc. are also suitable for
>>>>> freebase, but nobody's yet loaded them?
>>>>>
>>>>> Iain
>>>>>
>>>>> On Fri, May 8, 2009 at 1:10 AM, spencer kelly
>>>>> <spencerkelly86 at gmail.com> wrote:
>>>>>>
>>>>>> agree, i think the value of linguistic data is >= the value of
>>>>>> any other data we have in freebase -- only more awkward to enter.
>>>>>> with faith in the modelling power of the graph, i assume someone
>>>>>> will figure out a good way to do it eventually.
>>>>>>
>>>>>> _______________________________________________
>>>>>> Data-modeling mailing list
>>>>>> Data-modeling at freebase.com
>>>>>> http://lists.freebase.com/mailman/listinfo/data-modeling
>>>>>>
>>>
>>> --
>>> Arthur van Hoff
>>> arthur.van.hoff at gmail.com
>>> 650-283-0842