From tfmorris at gmail.com Mon Aug 3 17:38:12 2009
From: tfmorris at gmail.com (Tom Morris)
Date: Mon, 3 Aug 2009 13:38:12 -0400
Subject: [Data-modeling] Disambiguating places with the same name
Message-ID:

Currently Freebase has too little information to disambiguate the Town
of Ithaca, New York from the City of Ithaca, New York because they
both have the same name and same containment hierarchy. Additionally,
the postal service treats them as a single place (actually Ithaca has
multiple zip codes, but the 5-digit zip code 12550 is shared by both
the City of Newburgh and the Town of Newburgh, NY), and people
generally just talk about "Ithaca" without distinguishing the city
from the town.

Resolving geographic place names is a fundamental and important
capability, so I think the existing situation needs to be improved
upon. At a minimum, we need a way to distinguish the city from the
town, but additionally I think a way is needed to refer to them
collectively when we don't care which one.

There's another little nit in that the administrative containment
hierarchy used for census, government, etc. and the geographic
containment aren't compatible (see the map at
http://www.freebase.com/view/en/ithaca_united_states), but that's more
an issue for the future when people are trying to do roll-ups
aggregating data.

As an initial proposal I'd suggest either restoring the Wikipedia
naming convention of Ithaca (city)/Ithaca (town) or some similar
convention and introducing a third location named just "Ithaca" which
contains the other two. Anyone got other suggestions for how to make
this work better?

Tom

From jeff at metaweb.com Mon Aug 3 18:04:50 2009
From: jeff at metaweb.com (Jeff Prucher)
Date: Mon, 3 Aug 2009 11:04:50 -0700
Subject: [Data-modeling] Disambiguating places with the same name
In-Reply-To:
References:
Message-ID:

For something like the two Ithacas, we could rename them to their more
official names, "City of Ithaca" and "Town of Ithaca".
I like this better than the WP solution ("Ithaca (city)"), since it's
an actual name of the town, rather than just a disambiguating
convention. In general, I think it's good that we don't use "official"
names for most locations, but where there's basically no other way to
tell them apart, I think it might help.

Jeff
> _______________________________________________
> Data-modeling mailing list
> Data-modeling at freebase.com
> http://lists.freebase.com/mailman/listinfo/data-modeling

From jackpark at gmail.com Mon Aug 3 18:05:45 2009
From: jackpark at gmail.com (Jack Park)
Date: Mon, 3 Aug 2009 11:05:45 -0700
Subject: [Data-modeling] Disambiguating places with the same name
In-Reply-To:
References:
Message-ID: <5179aafa0908031105ia1424dcibfca5ac48ed8683b@mail.gmail.com>

As Wikipedia learned over time, using names for things is a weak way
to identify those things. Finding solutions to subject identification,
it seems to me as I navigate identity space while building topic maps,
is a non-trivial issue. It's an issue that warrants more effort than
it might take to solve this one case. I'd like to see more dialogue on
subject identification using sets of key/value properties. One reality
lurking in that inquiry is the existence of cultural approaches to
subject identity that don't easily blend with others.

Jack
From kurt at spaceship.com Mon Aug 3 18:45:33 2009
From: kurt at spaceship.com (Kurt Bollacker)
Date: Mon, 3 Aug 2009 11:45:33 -0700
Subject: [Data-modeling] Disambiguating places with the same name
In-Reply-To: <5179aafa0908031105ia1424dcibfca5ac48ed8683b@mail.gmail.com>
References: <5179aafa0908031105ia1424dcibfca5ac48ed8683b@mail.gmail.com>
Message-ID: <20090803184533.GJ28167@spaceship.com>

On Mon, Aug 03, 2009 at 11:05:45AM -0700, Jack Park wrote:
> As Wikipedia learned over time, using names for things is a weak way
> to identify those things. Finding solutions to subject identification,
> it seems to me as I navigate identity space while building topic maps,
> is a non-trivial issue. It's an issue that warrants more effort than
> it might take to solve this one case.
> I'd like to see more dialogue on
> subject identification using sets of key/value properties. One reality
> lurking in that inquiry is the existence of cultural approaches to
> subject identity that don't easily blend with others.

I once wrote a heuristic that would choose a disambiguating property
for topics in the cases where the /type/object/type and
/type/object/name properties failed to do so. It worked reasonably
well, but it was too slow to operate in realtime. An example was to
distinguish the two "Harrison Ford" actors in Freebase when one had
virtually no properties. I think birthdate was the answer in this
case. A failing of my approach was that it was ephemeral -- as data
changed and was added to Freebase, the disambiguating property would
change.

Kurt :-)
From faye at metaweb.com Mon Aug 3 18:49:20 2009
From: faye at metaweb.com (Faye Harris)
Date: Mon, 03 Aug 2009 11:49:20 -0700
Subject: [Data-modeling] Disambiguating places with the same name
In-Reply-To:
References:
Message-ID: <4A773130.9070404@metaweb.com>

+1 on Jeff's proposal.

This solves the immediate problem in a matter of seconds, so that
Ithaca seekers can easily disambiguate the two. Then we can look for,
research, discuss, debate, implement, and improve real solutions that
will probably take a while.

-- Faye
From robert at metaweb.com Mon Aug 3 19:08:28 2009
From: robert at metaweb.com (Robert Cook)
Date: Mon, 3 Aug 2009 12:08:28 -0700
Subject: [Data-modeling] Disambiguating places with the same name
In-Reply-To: <4A773130.9070404@metaweb.com>
References: <4A773130.9070404@metaweb.com>
Message-ID: <7CD11508-F29A-490D-B22C-2D4900F418A0@metaweb.com>

We should be careful when renaming cities as they are expected to be
used in mailing addresses. It may look awkward to always see "City of
Ithaca" when most people would simply write it Ithaca.

Also, one point of Tom's email was lost, I think. Most people don't
care about the difference between the two. It may be better to create
a new Ithaca that contains the City and Town. And if so, we could
rename them as Jeff suggests.

R
From paul at ontology2.com Mon Aug 3 19:20:57 2009
From: paul at ontology2.com (Paul Houle)
Date: Mon, 03 Aug 2009 15:20:57 -0400
Subject: [Data-modeling] Disambiguating places with the same name
In-Reply-To:
References:
Message-ID: <4A773899.7010108@ontology2.com>

Tom Morris wrote:
> Currently Freebase has too little information to disambiguate the Town
> of Ithaca, New York from the City of Ithaca, New York because they
> both have the same name and same containment hierarchy. Additionally,
> the postal service treats them as a single place (actually Ithaca has
> multiple zip codes, but the 5-digit zip code 12550 is shared by both
> the City of Newburgh and the Town of Newburgh, NY), and people
> generally just talk about "Ithaca" without distinguishing the city
> from the town.
>

ZIP codes are bad. They're not about geography, they're about the way
the post office delivers mail. There are some ZIP codes that span more
than one state!

Many US states have a good GIS authority which provides good shapes and
other data for the whole state. Barring that, your best source is the
TIGER data provided by the US Census.

TIGER gets complicated when you go to levels finer than counties, but
that reflects the truth on the ground. The legal status of spatial
subdivisions varies in different US states.
In NY, for instance, it's common to have a "Town of X" and a "City of
X"; every point in NY has a "city", "town" or the equivalent thereof
assigned to it.

Legal documents tend to say "Town Of Caroline", "County Of Tompkins."
I pay taxes to the "Town Of Caroline" (near Ithaca) although my
mailing address says "Brooktondale, NY" -- Brooktondale has no
government, although it does have a volunteer fire department,
community center, post office and general store.

There are plenty of other strange things about NY:

http://en.wikipedia.org/wiki/Administrative_divisions_of_New_York

One of them is that the five boroughs of NYC are actually counties.

In New Mexico, it's a different story. Most of the land area in
Socorro County is not part of a town or city, and the only government
involved is the county.

From jeff at metaweb.com Mon Aug 3 19:58:37 2009
From: jeff at metaweb.com (Jeff Prucher)
Date: Mon, 3 Aug 2009 12:58:37 -0700
Subject: [Data-modeling] Disambiguating places with the same name
In-Reply-To: <7CD11508-F29A-490D-B22C-2D4900F418A0@metaweb.com>
References: <4A773130.9070404@metaweb.com> <7CD11508-F29A-490D-B22C-2D4900F418A0@metaweb.com>
Message-ID:

Good point. The Region type can be used for creating a joint Ithaca
topic that contains both. I agree we shouldn't go crazy with renaming
cities (or anything else) to an official, rather than colloquial,
name, but I also wouldn't get too worried about the postal
implications -- our data will be used in a lot of different ways, and
sometimes these ways will have competing interests in terms of what
kind of display names are more or less acceptable. Addresses in
non-English-speaking countries already fail this test, anyway.
Jeff
From tfmorris at gmail.com Mon Aug 3 23:25:58 2009
From: tfmorris at gmail.com (Tom Morris)
Date: Mon, 3 Aug 2009 19:25:58 -0400
Subject: [Data-modeling] Disambiguating places with the same name
In-Reply-To:
References: <4A773130.9070404@metaweb.com> <7CD11508-F29A-490D-B22C-2D4900F418A0@metaweb.com>
Message-ID:

On Mon, Aug 3, 2009 at 3:58 PM, Jeff Prucher wrote:
> Good point. The Region type can be used for creating a joint Ithaca topic
> that contains both. I agree we shouldn't go crazy with renaming cities (or
> anything else) to an official, rather than colloquial, name, but I also
> wouldn't get too worried about the postal implications -- our data will be
> used in a lot of different ways, and sometimes these ways will have
> competing interests in terms of what kind of display names are more or less
> acceptable. Addresses in non-English-speaking countries already fail this
> test, anyway.

Using a Region, plus some name disambiguation sounds like a good
approach, at least for the time being. It'll certainly fix the
machine reconciliation problem (when used with a few rules), but we'll
have to see what happens with user entered data.
I suspect that'll be heavily influenced by the order the entries appear
in the autocomplete box and their perceived correctness. People tend to
pick the first acceptable, rather than the best, choice.

I've gone ahead and created the Region
http://www.freebase.com/edit/topic/guid/9202a8c04000641f800000000de6abd4

One of the problems currently with the official name vs common name
situation is that there's no way to tell which one is which.
Wikipedia tends towards the common name in the title and the official
name in the lead to the article, but a) this is only a convention and
b) that styling is much more easily interpreted by a human than a
computer. If I could specifically ask for the official name (or
common name), that would be an improvement.

Responding to a few things further back up thread:

- Names as identifiers - Yes, this is sub-optimal in many cases, but
it's how humans do it in an awful lot of cases (names + context
actually), so we're kind of stuck living with it and finding ways to
map back and forth to their view of the world. My alma mater is
partly in the city of Ithaca and partly in the town of Ithaca, but I
never had to worry about this distinction until now. It was all just
"Ithaca."

- TIGER shape files - That doesn't really help with the "Ithaca, NY"
-> Freebase topic ID for the general Ithaca, NY. If I had a lat/long
that I was trying to resolve to its containing named shape, they
might be useful.

Tom

From jackpark at gmail.com Mon Aug 3 23:46:35 2009
From: jackpark at gmail.com (Jack Park)
Date: Mon, 3 Aug 2009 16:46:35 -0700
Subject: [Data-modeling] Disambiguating places with the same name
In-Reply-To:
References: <4A773130.9070404@metaweb.com> <7CD11508-F29A-490D-B22C-2D4900F418A0@metaweb.com>
Message-ID: <5179aafa0908031646r54096f7aud9cd961d6c232f3b@mail.gmail.com>

Agree "names for things" is how we do it, emphasis on "names+context".
That walks away from names alone for things, and wanders into names +
other properties.
We're back to names + properties. Can't seem to escape that. But...
perhaps it's all in the wrist, so to speak: just how one stubs in
those properties. We already saw that "Ithaca the Town" is not really
all that acceptable.

In F2F conversations, context somehow winds its way into the dialogue
unless one is forced to back up and disambiguate. Google never did
that. Likely, never will, either. Here in Freebase, we have an
opportunity to pop multiple hits and ask for human selection (which,
when you think about it, is precisely what Google does already). This
chunk of this thread seems to be orbiting using "region" as one of
those disambiguating properties. What if the user is not familiar with
regions? Still, it's a start. Don't want to sound like a stuck record,
but it does seem that Freebase cannot duck subject identity issues; I
salute the effort to resolve issues as they arise.

Jack
From cycjay at gmail.com Tue Aug 4 04:55:09 2009
From: cycjay at gmail.com (Vijay Raj)
Date: Mon, 3 Aug 2009 22:55:09 -0600
Subject: [Data-modeling] Disambiguating places with the same name
In-Reply-To: <5179aafa0908031646r54096f7aud9cd961d6c232f3b@mail.gmail.com>
References: <4A773130.9070404@metaweb.com> <7CD11508-F29A-490D-B22C-2D4900F418A0@metaweb.com> <5179aafa0908031646r54096f7aud9cd961d6c232f3b@mail.gmail.com>
Message-ID: <65d8f61d0908032155r50c9be43pfc44d85519162a77@mail.gmail.com>

Hi All,

I joined this list just today, so let me introduce myself. My name is
Vijay Raj and I design mobile phone/base station physical layer
algorithms for a living. Over the last few years, ontology engineering
has become a passion of mine. Naturally I have been following the
development of Freebase. I am impressed at the strides made by the
Metaweb team and the openness of data/architecture.

As for my ontology background, I started out working with the OpenCyc
ontology. I actively participate in another open source group called
CycFoundation.org. As part of working with CycFoundation, I wrote an
algorithm in LISP/Python to automatically map all OpenCyc terms to
Wikipedia article names, for the very first time. This mapping was
released in DBpedia 3.0. Since then I have continued to work on Cyc
and related issues.

This being my very first email thread, I should note that my comments
here are more thinking aloud than suggestions to seasoned veterans of
Freebase.

When I was doing the Cyc to Wikipedia mapping, we had an analogous
problem. How to distinguish and find a mapping for a concept lexically
represented as "bank" in OpenCyc? It could mean "bank the financial
institution" or "bank of a river".
The algorithm I wrote measures the closeness of all concepts represented as bank in OpenCyc and all Wikipedia articles that have bank in article names. Here in this case, Freebase also has a unique GUID for a concept. Here "City of Ithaca" and "Town of Ithaca" are two separate entities. People may refer to both of them as Ithaca, and that may be sufficient reason to create a region for Ithaca. But if you see the names as just the English representation of a concept, then the burden of disambiguation is on the program (or person) that uses this data. The program that uses this data has to be context aware, and the context should then be compared against knowledge represented in the ontology to disambiguate. Assuming that the KB should not be overly concerned with lexical overlap of concepts, having just an alias of Ithaca for both the concepts is sufficient instead of a new concept that contains both the city and town of Ithaca. I didn't want to be too verbose in my very first email. Please let me know if I overreached for a Freebase newbie. I hope to learn a lot from this email list and interact with very smart ontologists and knowledge engineers here. With kind regards, Vijay. From paul at ontology2.com Tue Aug 4 13:55:15 2009 From: paul at ontology2.com (Paul Houle) Date: Tue, 04 Aug 2009 09:55:15 -0400 Subject: [Data-modeling] Disambiguating places with the same name In-Reply-To: <5179aafa0908031646r54096f7aud9cd961d6c232f3b@mail.gmail.com> References: <4A773130.9070404@metaweb.com> <7CD11508-F29A-490D-B22C-2D4900F418A0@metaweb.com> <5179aafa0908031646r54096f7aud9cd961d6c232f3b@mail.gmail.com> Message-ID: <4A783DC3.7070601@ontology2.com> Jack Park wrote: > Agree "names for things" is how we do it, emphasis on "names+context". > That walks away from names alone for things, and wanders into names + > other properties. We're back to names + properties. Can't seem to > escape that. But...Perhaps it's all in the wrist, so to speak: just how > one stubs in those properties. We already saw that "Ithaca the Town" > is not really all that acceptable. > In my mind, "the naming of things" is a special "area of concern" in generic databases. That is, it makes sense to have a group of specialized data structures and heuristics about names. Some major aspects are: * getting the appropriate name for a topic, and * finding a topic given a name Right now I'm doing a project involving a specific problem domain; on one hand I've got a database derived from DBpedia and Freebase, and on the other hand I've got some other files where the main useful identifier is a human-readable label. Two sorts of context exist here: (i) the problem domain (a selection from the main database) and (ii) the nature of the items being pulled in from the secondary database (also a secondary database). Starting with an item from the secondary database, phrase matches are a good way to find candidates in the primary database, filtered, of course, by being in the identified problem domain.
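A minimal sketch of the candidate lookup described above, assuming a tiny in-memory stand-in for the primary database. The `Topic` record, its `domain` field, and the `find_candidates` helper are hypothetical illustrations rather than anything from Freebase or DBpedia, and "phrase match" is taken here to mean a case-insensitive match on a contiguous run of words.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topic:
    topic_id: str   # hypothetical identifier, not a real Freebase id
    name: str
    domain: str     # hypothetical tag marking the problem domain


def contains_phrase(name: str, label: str) -> bool:
    """True if the words of `label` occur as a contiguous run in `name`."""
    name_words = name.lower().split()
    label_words = label.lower().split()
    if not label_words:
        return False
    span = len(label_words)
    return any(
        name_words[i:i + span] == label_words
        for i in range(len(name_words) - span + 1)
    )


def find_candidates(label, topics, domain):
    """Phrase-match `label` against topic names, keeping only `domain`."""
    return [
        t for t in topics
        if t.domain == domain and contains_phrase(t.name, label)
    ]


# Toy primary database (made-up records for illustration only).
PRIMARY = [
    Topic("t1", "City of Ithaca", "places"),
    Topic("t2", "Town of Ithaca", "places"),
    Topic("t3", "Ithaca", "poems"),
    Topic("t4", "Newburgh", "places"),
]

candidates = find_candidates("Ithaca", PRIMARY, "places")
print([t.name for t in candidates])
```

In this toy data the domain filter removes the out-of-domain "Ithaca" but still leaves more than one match for the label.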
The trouble is that I often get multiple candidates; it turns out that numerical scoring based on features in the primary database name is effective for finding candidates that match the problem domain. I found features and computed these scores by hand, using an ad hoc process a lot like C4.5, but if my set was a lot bigger it might have been worth training an SVM. Efforts in this area inevitably get compared to Cyc, but it's easy to get the wrong idea from that. Like Cyc, I think that generic databases will need to compile "knowledge" about specific topic areas. However, that "knowledge" may or may not be represented symbolically and may or may not be compiled by hand. For instance, one kind of commonsense reasoning that people can do is, given a name, guess if the person named is male or female. This ability can be had by statistical models of two types: (i) a database of known names, nicknames and genders. If you find 150 women named "Mary" and no men, you can make a statement with very high confidence. (ii) a model of how names "sound"; this can be implemented with, say, an N=5 Markov chain at the letter level. With Freebase it's a quick project to train either of the above models. From tfmorris at gmail.com Tue Aug 4 15:36:09 2009 From: tfmorris at gmail.com (Tom Morris) Date: Tue, 4 Aug 2009 11:36:09 -0400 Subject: [Data-modeling] Disambiguating places with the same name In-Reply-To: <65d8f61d0908032155r50c9be43pfc44d85519162a77@mail.gmail.com> References: <4A773130.9070404@metaweb.com> <7CD11508-F29A-490D-B22C-2D4900F418A0@metaweb.com> <5179aafa0908031646r54096f7aud9cd961d6c232f3b@mail.gmail.com> <65d8f61d0908032155r50c9be43pfc44d85519162a77@mail.gmail.com> Message-ID: On Tue, Aug 4, 2009 at 12:55 AM, Vijay Raj wrote: > I joined this list just today, so let me introduce myself. My name is Vijay > Raj and I design mobile phone/base station physical layer algorithms for a > living. For the last few years, ontology engineering has become a passion of > mine. 
Naturally I have been following the development of Freebase. I am > impressed at the strides made by the Metaweb team and the openness of > data/architecture. Welcome Vijay! > Here in this case, freebase also has a unique GUID for a concept. Here "City > of Ithaca" and "Town of Ithaca" are two separate entities. People may refer > to both of them as Ithaca, that may be sufficient reason to create a region > for Ithaca. But if you see the names as just English representation of a > concept, then the burden of disambiguation is on the program (or person) > that uses this data. The program that uses this data has to be context > aware, and the context should then be compared against knowledge represented > in the ontology to disambiguate. > > Assuming that the KB should not be overly concerned with lexical overlap of > concepts, having just an alias of Ithaca for both the concepts is sufficient > instead of new concept that contains both the city and town of Ithaca. The point is that if you force people (or a classifier) to choose one of the administrative units, "City of Ithaca" or "Town of Ithaca," when what they want is the named place "Ithaca" that geographically contains the area of the two administrative units, you're going to force them either to a) lie or b) omit the information that they have. Both of the administrative units are already aliased "Ithaca," but that doesn't solve the problem if neither is the right choice. One of the strengths of Freebase is that neither the schema nor the data is fixed. They can be molded to suit the needs of users. Of course, since everything is malleable and because there's a tradeoff between correctness and usability, one needs to carefully choose the best place to address any particular problem.
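The containment argument above can be sketched in a few lines, assuming a toy in-memory model. The `Place` record, its `aliases` and `contains` fields, and the `resolve` helper are hypothetical names chosen for illustration, not Freebase schema; the sketch only shows why a shared alias is not enough when the intended referent is the enclosing place.

```python
from dataclasses import dataclass, field


@dataclass
class Place:
    name: str
    kind: str                                    # illustrative labels: "city", "town", "region"
    aliases: set = field(default_factory=set)
    contains: list = field(default_factory=list)


city = Place("City of Ithaca", "city", aliases={"Ithaca"})
town = Place("Town of Ithaca", "town", aliases={"Ithaca"})
region = Place("Ithaca", "region", contains=[city, town])

PLACES = [city, town, region]


def resolve(label, places, kind=None):
    """Return places whose name or alias equals `label`, optionally by kind."""
    hits = [p for p in places if label == p.name or label in p.aliases]
    if kind is not None:
        hits = [p for p in hits if p.kind == kind]
    return hits


# With aliases alone, "Ithaca" stays ambiguous between the two
# administrative units, and neither one is the general place.
admin_only = resolve("Ithaca", [city, town])
print([p.name for p in admin_only])

# With the region present, a caller who does not care which unit is meant
# can pick the enclosing place and still reach both units through it.
general = resolve("Ithaca", PLACES, kind="region")
print(general[0].name, [p.name for p in general[0].contains])
```

The design choice illustrated is that the general "Ithaca" is a third record that contains the other two, rather than a label shared between them.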
Tom From jackpark at gmail.com Tue Aug 4 16:14:21 2009 From: jackpark at gmail.com (Jack Park) Date: Tue, 4 Aug 2009 09:14:21 -0700 Subject: [Data-modeling] Disambiguating places with the same name In-Reply-To: References: <4A773130.9070404@metaweb.com> <7CD11508-F29A-490D-B22C-2D4900F418A0@metaweb.com> <5179aafa0908031646r54096f7aud9cd961d6c232f3b@mail.gmail.com> <65d8f61d0908032155r50c9be43pfc44d85519162a77@mail.gmail.com> Message-ID: <5179aafa0908040914r32aa596eh4b5d96e1756ab8f0@mail.gmail.com> Patrick Durusau and I gave a telecon talk to the Ontolog community [1] in which we spoke of Hobson's choice: you can select any horse in the barn, so long as it's the first one. This is equivalent to a kind of conundrum that exists when ontological commitments are made on behalf of the many by a select few. It strikes me that Freebase has this as if magic opportunity to rise above such commitments and, perhaps, invent the future, as many in the topic mapping community are trying to do. Jack [1] http://ontolog.cim3.net/cgi-bin/wiki.pl?ConferenceCall_2006_04_27 From cycjay at gmail.com Wed Aug 5 01:11:33 2009 From: cycjay at gmail.com (Vijay Raj) Date: Tue, 4 Aug 2009 19:11:33 -0600 Subject: [Data-modeling] Disambiguating places with the same name In-Reply-To: References: <4A773130.9070404@metaweb.com> <7CD11508-F29A-490D-B22C-2D4900F418A0@metaweb.com> <5179aafa0908031646r54096f7aud9cd961d6c232f3b@mail.gmail.com> <65d8f61d0908032155r50c9be43pfc44d85519162a77@mail.gmail.com> Message-ID: <65d8f61d0908041811r7b00cb15qce79d9ea4daef844@mail.gmail.com> On Tue, Aug 4, 2009 at 9:36 AM, Tom Morris wrote: > On Tue, Aug 4, 2009 at 12:55 AM, Vijay Raj wrote: > > > I joined this list just today, so let me introduce myself. My name is Vijay > > Raj and I design mobile phone/base station physical layer algorithms for a > > living. For the last few years, ontology engineering has become a passion of > > mine. Naturally I have been following the development of Freebase. I am > > impressed at the strides made by the Metaweb team and the openness of > > data/architecture. > > Welcome Vijay! Thank you Tom. > > > Here in this case, freebase also has a unique GUID for a concept. Here "City > > of Ithaca" and "Town of Ithaca" are two separate entities. People may refer > > to both of them as Ithaca, that may be sufficient reason to create a region > > for Ithaca. But if you see the names as just English representation of a > > concept, then the burden of disambiguation is on the program (or person) > > that uses this data. The program that uses this data has to be context > > aware, and the context should then be compared against knowledge represented > > in the ontology to disambiguate. > > > > Assuming that the KB should not be overly concerned with lexical overlap of > > concepts, having just an alias of Ithaca for both the concepts is sufficient > > instead of new concept that contains both the city and town of Ithaca. > > The point is that if you force people (or a classifier) to choose one > of the administrative units, "City of Ithaca" or "Town of Ithaca," > when what they want is the named place "Ithaca" that geographically > contains the area of the two administrative units, you're going to > force them either to a) lie or b) omit the information that they have. > Both of the administrative units are already aliased "Ithaca," but > that doesn't solve the problem if neither is the right choice. If the purpose of the additional topic is to represent a more general region, like "Bay Area" for the greater San Francisco region, I agree it is the right thing to do. I interpreted earlier, mistakenly, that the purpose is to help bypass disambiguation between the two disjoint regions. Maybe because users are not careful to distinguish between the two separate topics. From paul at ontology2.com Fri Aug 7 18:14:47 2009 From: paul at ontology2.com (Paul Houle) Date: Fri, 07 Aug 2009 14:14:47 -0400 Subject: [Data-modeling] "Duck Types" in Freebase Message-ID: <4A7C6F17.1060108@ontology2.com> Something I've seen in how types are used in Freebase and how the system pushes me to design schemas is a pattern similar to the "Duck Typing" used in many languages. Rather than working out what sort of existing type is correct for the value of the property, it often makes sense to define a new type. I was designing a set of types for describing nuclear reactors and this came up. Many nuclear reactors contain a "Moderator", which is a substance that slows neutrons down, increasing the interaction cross-sections, and reducing the amount of fissile material required to make a critical mass. My first instinct in modeling is to say, "A NuclearModerator isA Substance" so that the system allows you to say that http://www.freebase.com/view/en/graphite is used as a moderator in a reactor, but you can't say that http://www.freebase.com/view/en/frank_zappa is used as a moderator. Now, Freebase doesn't have a "Substance" type; instead it has "Chemical Element", "Chemical Compound", "Ingredient", "Nutrient" and other nonoverlapping types. In particular, some moderators are compounds, http://www.freebase.com/view/en/heavy_water and others are elements or allotropes of elements, so there's no existing "master type" that contains all of the substances which could be moderators. The right thing to do in this case is to create a new type with a name like "NuclearReactorModerator", and force the "Moderator" property to be of that type. Practically, this works quite well. After adding a few reactor instances, you'd have about 5-10 or so items under NuclearReactorModerator. If we find a new kind of reactor that uses a new moderator, it's not hard to add a type.
In the meantime, the new type lets FB provide useful autocompletion for the type, a list of possible moderator materials, and an annotation on the moderator materials that they have that use. The one area where I feel a little uncomfortable is that I'd like an official and consistent answer on how I should say "Reactor X has no moderator" which is the case in http://www.freebase.com/view/en/liquid_metal_cooled_reactor I'd like something a little more definitive than leaving the field blank, since FB's "Open World Assumption" means that lots of fields are going to be left blank because nobody bothered to fill them in. Many people link to http://www.freebase.com/view/en/independent but there's also http://www.freebase.com/view/guid/9202a8c04000641f800000000bc15fef and http://www.freebase.com/view/guid/9202a8c04000641f800000000bd0b576 and http://www.freebase.com/view/guid/9202a8c04000641f800000000bb2bb90 and http://www.freebase.com/view/guid/9202a8c04000641f800000000bd5dd72 Something ought to be done about this... From spatial.db at gmail.com Fri Aug 7 18:39:11 2009 From: spatial.db at gmail.com (Ed Laurent) Date: Fri, 7 Aug 2009 14:39:11 -0400 Subject: [Data-modeling] "Duck Types" in Freebase In-Reply-To: <4A7C6F17.1060108@ontology2.com> References: <4A7C6F17.1060108@ontology2.com> Message-ID: I suggest you use "None": http://www.freebase.com/view/en/independent Unfortunately, this very important topic has somehow become merged with Independent. I agree with Faye that something needs to be done about that. There was a long discussion about the use of "None" awhile back. I think on this list. -Ed From jeff at metaweb.com Fri Aug 7 23:58:32 2009 From: jeff at metaweb.com (Jeff Prucher) Date: Fri, 7 Aug 2009 16:58:32 -0700 Subject: [Data-modeling] Suggestion for addressing potentially offensive content Message-ID: The problem of filtering out offensive content from Freebase data is something that keeps coming up, and clearly needs to be addressed. Kirrily and I were discussing this, and we think it can be done simply using our type system (and maybe with a little help from acre). Looking at the ways other sites deal with this, we liked Flickr's approach, which uses two levels of filter (besides "none"): Moderate and Restricted.
Flickr says of these that moderate "may be considered offensive by some people" (which seems to include artistic nudes and the like) and that restricted is "unsuitable for children, your grandmother or your workmates". So we propose two new types. These types would have no properties, and would not include /common/topic. The types would be called something like "Moderate Content" and "Restricted Content" (I just made these labels up right now; better names will hopefully be suggested). "Restricted Content" would have "Moderate Content" as an included type. These types could be applied to any topic or image in Freebase. It would then be very simple for applications to exclude either of these types from their results. Also, if there are types that are likely to always have offensive instances, those could include the Moderate or Restricted Content types. We would then need a queue (probably an acre/RABJy sort of thing) to process the "flagged as potentially offensive" topics and to add the types as appropriate. But we need a model before we can implement it. What do you think? Jeff Prucher Type Librarian & Ontologist Metaweb Technologies, Inc. From vishal at metaweb.com Sat Aug 8 01:32:53 2009 From: vishal at metaweb.com (Vishal Talwar) Date: Fri, 7 Aug 2009 18:32:53 -0700 (PDT) Subject: [Data-modeling] wfh monday Message-ID: <394908.118701249695173559.JavaMail.root@zimbra01.corp.sjc1.metaweb.com> I need to be around to receive a bed frame, but if it arrives early enough I'll come in for the rest of the day. Vishal From evening0star at gmail.com Sat Aug 8 20:36:23 2009 From: evening0star at gmail.com (evening0star) Date: Sat, 8 Aug 2009 16:36:23 -0400 Subject: [Data-modeling] Suggestion for addressing potentially offensive content In-Reply-To: References: Message-ID: <259286b0908081336l606c8947r3b915be3d1e1ac28@mail.gmail.com> Good idea! 
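The two-type proposal above can be sketched as follows, assuming a toy model in which each topic carries a set of type ids and included types are expanded when a type is applied. The ids `/x/moderate_content`, `/x/restricted_content` and `/x/photo`, and the helpers `apply_type` and `exclude`, are hypothetical; the thread leaves the real names open and none of these are existing Freebase types or API calls.

```python
# Hypothetical type ids, used only for this illustration.
MODERATE = "/x/moderate_content"
RESTRICTED = "/x/restricted_content"

# "Restricted Content" has "Moderate Content" as an included type.
INCLUDED_TYPES = {RESTRICTED: {MODERATE}}


def apply_type(topic_types, type_id):
    """Return a new type set with `type_id` and all its included types added."""
    result = set(topic_types)
    pending = [type_id]
    seen = set()
    while pending:
        current = pending.pop()
        if current in seen:
            continue
        seen.add(current)
        result.add(current)
        pending.extend(INCLUDED_TYPES.get(current, ()))
    return result


def exclude(topics, type_id):
    """Drop every topic that carries `type_id`."""
    return {name: types for name, types in topics.items() if type_id not in types}


# Made-up topics: one unflagged, one flagged moderate, one flagged restricted.
topics = {
    "landscape photo": {"/x/photo"},
    "artistic nude": apply_type({"/x/photo"}, MODERATE),
    "explicit photo": apply_type({"/x/photo"}, RESTRICTED),
}

# Excluding only the restricted type keeps moderate content visible.
print(sorted(exclude(topics, RESTRICTED)))

# Excluding the moderate type also hides restricted content, because
# applying the restricted type added the moderate type as well.
print(sorted(exclude(topics, MODERATE)))
```

Under these assumptions an application needs a single type test per strictness level, which is the simplicity the proposal is after.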
From narphorium at gmail.com Sat Aug 8 21:37:20 2009 From: narphorium at gmail.com (Shawn Simister) Date: Sat, 08 Aug 2009 17:37:20 -0400 Subject: [Data-modeling] Suggestion for addressing potentially offensive content In-Reply-To: References: Message-ID: <4A7DF010.3070408@gmail.com> This sounds like a good solution to me. It would be interesting to be able to co-type individual images or articles as Restricted Content so that a topic could have a mix of restricted and non-restricted content associated with it which would allow the Freebase client to choose how to display that topic depending on the user's display settings. For example, porn actors could have a full-body picture marked as restricted but also have a cropped version which could be displayed to more sensitive users. Or a topic about a mass murderer might have a second version of its article which leaves out some of the more violent details. It might also be helpful to get some bases to use Restricted Content as an included type on some of their schema to avoid filling up the queue. Shawn From iainsproat at gmail.com Mon Aug 10 21:39:34 2009 From: iainsproat at gmail.com (Iain Sproat) Date: Tue, 11 Aug 2009 01:39:34 +0400 Subject: [Data-modeling] English Words In-Reply-To: References: Message-ID: I've had a go at modelling this. My effort is primarily a synonym set type and a word CVT (linked to the synonym property of symset). See also http://www.freebase.com/view/guid/9202a8c04000641f8000000007cf5081 I went a bit overboard and also modelled glyphs, graphemes, diacritics, lexical categories, morphemes, phonemes etc. - all in the (poorly named) writing base. There's a few things missing, particularly lemmas and word roots which would be useful if anyone is planning to use Freebase data with NLP. One of the things I noticed was that Freebase only really has nouns. I assume that verbs, adjectives etc.
are also suitable for freebase, but nobody's yet loaded them? Iain On Fri, May 8, 2009 at 1:10 AM, spencer kelly wrote: > agree, i think the value of linguistic data is >= the value of any other > data we have in freebase -- only more awkward to enter. > with faith in the modelling power of the graph, i assume someone will > figure out a good way to do it eventually. > > _______________________________________________ > Data-modeling mailing list > Data-modeling at freebase.com > http://lists.freebase.com/mailman/listinfo/data-modeling > > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.freebase.com/pipermail/data-modeling/attachments/20090811/50d28416/attachment.htm From tfmorris at gmail.com Tue Aug 11 15:41:07 2009 From: tfmorris at gmail.com (Tom Morris) Date: Tue, 11 Aug 2009 11:41:07 -0400 Subject: [Data-modeling] Type for groups/sets of people (or any entity?) In-Reply-To: <7510661B-F071-4613-8C5F-03B74AC5C57A@metaweb.com> References: <7510661B-F071-4613-8C5F-03B74AC5C57A@metaweb.com> Message-ID: On Mon, Jul 20, 2009 at 5:13 PM, Robert Cook wrote: > > So, I tried the list type on it, and the results are here: > http://www.freebase.com/edit/topic/en/erinyes If you type Erinyes in the search box and navigate to the topic, you end up at http://socialist.freebase.com/list/en/erinyes which is styled completely differently than a normal topic page. This seems like an undesirable side effect of using the List type which probably makes it unacceptable for general use. I applied it to The Wachowski Brothers, inadvertently disappearing all the useful information on their topic page. You can still navigate to it directly, but if you use the search box, you end up in Socialist mode and any links you follow stay stuck in that display mode. 
Tom On Mon, Jul 20, 2009 at 5:13 PM, Robert Cook wrote: > > So, I tried the list type on it, and the results are here: > http://www.freebase.com/edit/topic/en/erinyes > This does seem awkward in this case as one wouldn't normally think of this as a list and the members as entries, so perhaps it's a bit too generic (but clearly this isn't a group of people either.) > My data modeling instincts tell me that this would quickly turn into something like a Group of Mythological Entities type, so this may not be the best example of when to use a generic list type. > R > On Jul 20, 2009, at 2:02 PM, Ed Laurent wrote: > > Another use case is/are Erinyes. > > -Ed > > > On Mon, Jul 20, 2009 at 4:56 PM, Robert Cook wrote: >> >> On Jul 20, 2009, at 1:20 PM, Iain Sproat wrote: >> >> > The reason for creating the type was that groups are often mentioned >> > in literature, art and media, and not just as a list of >> > individuals. The group as an entity is the focus, and not the >> > component pieces. >> > >> > the name "people" seems a little awkward. >> > >> > Yeah, agreed. My initial idea of the type is that it is to be used >> > for small definable groups which share some sort of connection and >> > relevance to each other; e.g. played music together, attended an >> > event together, lived together, worked closely together, are/were >> > conjoined, wrote together etc. "People" doesn't really get that >> > across. >> >> This is very interesting. Normally I would say that these could be >> defined with properties on a topic, but what I think you're implying >> is that there is no topic in many cases. For instance, you would >> normally be able to tell if people lived together if they all shared >> the same value on their "Places lived" property. But perhaps you >> don't know where they lived, but just that they lived together. >> >> I think a generic list type is a great way to start capturing data if >> even in a semi-structured way.
As long as the data gets into >> Freebase, then it should be straightforward to upgrade the formality >> as needed. >> >> R >> From iainsproat at gmail.com Tue Aug 11 19:30:28 2009 From: iainsproat at gmail.com (Iain Sproat) Date: Tue, 11 Aug 2009 23:30:28 +0400 Subject: [Data-modeling] English Words In-Reply-To: References: Message-ID: I made a few tweaks to the schema at http://writing.freebase.com, which meant that the WordNet stuff didn't work so well on freebase.com (you can't easily see all the symsets of a word). To compensate, I've created a freebase dictionary app at http://dictionary.freebaseapps.com which emulates the WordNet web interface. There's only a couple of dictionary examples in freebase (waiting until the schema is stable before importing WordNet) - and these can be seen at http://dictionary.freebaseapps.com/?word=rat & http://dictionary.freebaseapps.com/?word=red There's also a bleeding edge view (showing hypernyms and hyponyms) at http://2.dictionary.sprocketonline.user.dev.freebaseapps.com/?word=rat Finally, I've added a pronounciation type to the base but haven't filled in any data for that yet. http://www.freebase.com/type/schema/base/writing/pronounciation?domain=/base/writing Iain On Tue, Aug 11, 2009 at 1:39 AM, Iain Sproat wrote: > I've had a go at modelling this. My effort is primarily a synonym set > type and a word > CVT (linked to the > synonym property of symset).
see also > http://www.freebase.com/view/guid/9202a8c04000641f8000000007cf5081 > I went a bit overboard and also modelled glyphs, graphemes, diacritic, > lexical categories, morphemes, Phonemes etc. - all in the (poorly named) writing > base . There's a few things missing, > particularly lemmas and word roots which would be useful if anyone is > planning using freebase data with NLP. > > One of the things I noticed was that freebase only really has nouns. I > assume that verbs, adjectives etc. are also suitable for freebase, but > nobody's yet loaded them? > > Iain > > On Fri, May 8, 2009 at 1:10 AM, spencer kelly wrote: > >> agree, i think the value of linguistic data is >= the value of any other >> data we have in freebase -- only more awkward to enter. >> with faith in the modelling power of the graph, i assume someone will >> figure out a good way to do it eventually. >> >> _______________________________________________ >> Data-modeling mailing list >> Data-modeling at freebase.com >> http://lists.freebase.com/mailman/listinfo/data-modeling >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.freebase.com/pipermail/data-modeling/attachments/20090811/7f32c166/attachment-0001.htm From arthur.van.hoff at gmail.com Wed Aug 12 09:29:34 2009 From: arthur.van.hoff at gmail.com (Arthur van Hoff) Date: Wed, 12 Aug 2009 11:29:34 +0200 Subject: [Data-modeling] English Words In-Reply-To: References: Message-ID: Hi Ian, Thanks for doing this. It looks very promising. I tried manually adding two synonyms for "rat" (verbs) from wordnet, I'm not sure I did it right. Can you check? Have you considered how other languages feature in this schema? It would be great if it were possible to find synonyms for words in other human languages. We could scrape a lot of translations from Wikipedia if that is useful. I noticed that for the noun "Rat" you have merged the concept of the Animal with the Noun. 
I'm not sure that this is the right approach. In my view the noun "Rat" is not the same as the animal "Rat". This approach might get confusing once there are nouns in other languages for the word "Rat". Alternatively, you could model the noun Rat as a seperate topic with a property which refers to the defining topic (the animal). That way the animal topic would have reverse properties for all nouns in all languages (eventually). Perhaps that will work? I'd like to contribute some, let me know if there is anything I can do. Thanks. On Tue, Aug 11, 2009 at 9:30 PM, Iain Sproat wrote: > I made a few tweaks to the schema at http://writing.freebase.com, which > meant that the WordNet stuff didn't work so well on freebase.com (you > can't easily see all the symsets of a word). To compensate, I've created a > freebase dictionary app at http://dictionary.freebaseapps.com which > emulates the WordNet web interface. > There's only a couple of dictionary examples in freebase (waiting until the > schema is stable before importing WordNet) - and these can be seen at > http://dictionary.freebaseapps.com/?word=rat & > http://dictionary.freebaseapps.com/?word=red > There's also a bleeding > edge view (showing hypernyms and hyponyms) at > http://2.dictionary.sprocketonline.user.dev.freebaseapps.com/?word=rat > > Finally, I've added a pronounciation type to the base but haven't filled in > any data for that yet. > http://www.freebase.com/type/schema/base/writing/pronounciation?domain=/base/writing > > Iain > > On Tue, Aug 11, 2009 at 1:39 AM, Iain Sproat wrote: > >> I've had a go at modelling this. My effort is primarily a synonym set >> typeand a word >> CVT (linked to >> the synonym property of symset). see also >> http://www.freebase.com/view/guid/9202a8c04000641f8000000007cf5081 >> I went a bit overboard and also modelled glyphs, graphemes, diacritic, >> lexical categories, morphemes, Phonemes etc. - all in the (poorly named) writing >> base . 
There's a few things missing, >> particularly lemmas and word roots which would be useful if anyone is >> planning using freebase data with NLP. >> >> One of the things I noticed was that freebase only really has nouns. I >> assume that verbs, adjectives etc. are also suitable for freebase, but >> nobody's yet loaded them? >> >> Iain >> >> On Fri, May 8, 2009 at 1:10 AM, spencer kelly wrote: >> >>> agree, i think the value of linguistic data is >= the value of any other >>> data we have in freebase -- only more awkward to enter. >>> with faith in the modelling power of the graph, i assume someone will >>> figure out a good way to do it eventually. >>> >>> _______________________________________________ >>> Data-modeling mailing list >>> Data-modeling at freebase.com >>> http://lists.freebase.com/mailman/listinfo/data-modeling >>> >>> >> > > _______________________________________________ > Data-modeling mailing list > Data-modeling at freebase.com > http://lists.freebase.com/mailman/listinfo/data-modeling > > -- Arthur van Hoff arthur.van.hoff at gmail.com 650-283-0842 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.freebase.com/pipermail/data-modeling/attachments/20090812/c45abf72/attachment.htm From iainsproat at gmail.com Wed Aug 12 12:34:03 2009 From: iainsproat at gmail.com (Iain Sproat) Date: Wed, 12 Aug 2009 16:34:03 +0400 Subject: [Data-modeling] English Words In-Reply-To: References: Message-ID: Arthur, Thanks for taking a look at it - I've since tweaked the schema (once again!). Your work made me realise that I was trying to be a bit too clever with a separate synonym property, and that synonyms are already taken care of by the omnipresent "also known as" /common/topic/alias property. I've changed the symset properties so that word morphology to have its own property/CVT, and am using the /common/topic/alias for all synonyms. 
The data you've added should now display correctly in http://dictionary.freebaseapps.com Rather wonderfully Freebase already takes care of the language problem - all text values have a /type/text/lang property. The MQL reference documentation has a good explanation of how this behaves http://www.freebase.com/docs/mql/ch02.html#namesandids For finding more synonyms from wikipedia we're again lucky that we already have an Alias app to do just that (thanks Shawn) http://aliases.freebaseapps.com/ noun/Rat - I'm not too sure what you mean here? The lexical category is straight from WordNet. http://wordnetweb.princeton.edu/perl/webwn?s=rat&sub=Search+WordNet&o2=&o0=1&o7=&o5=&o1=1&o6=&o4=&o3=&h= I agree that each semantic meaning should be a completely separate topic from any other semantic meaning. e.g. a rat (animal) should be a separate topic from rat (informer). If different meanings have been merged together in the same topic, then please flag the topic for split. We're lacking Dictionary data at the moment, so the most useful way to contribute would be to import dictionary definitions to Freebase (Wiktionary and WordNet would be good starting points). Also, working with Shawn's Alias app to improve topic aliases would definitely help. Iain On Wed, Aug 12, 2009 at 1:29 PM, Arthur van Hoff wrote: > Hi Ian, > > Thanks for doing this. It looks very promising. I tried manually adding two > synonyms for "rat" (verbs) from wordnet, I'm not sure I did it right. Can > you check? > > Have you considered how other languages feature in this schema? It would be > great if it were possible to find synonyms for words in other human > languages. We could scrape a lot of translations from Wikipedia if that is > useful. > > I noticed that for the noun "Rat" you have merged the concept of the Animal > with the Noun. I'm not sure that this is the right approach. In my view the > noun "Rat" is not the same as the animal "Rat". 
This approach might get > confusing once there are nouns in other languages for the word "Rat". > > Alternatively, you could model the noun Rat as a seperate topic with a > property which refers to the defining topic (the animal). That way the > animal topic would have reverse properties for all nouns in all languages > (eventually). Perhaps that will work? > > I'd like to contribute some, let me know if there is anything I can do. > > Thanks. > > > On Tue, Aug 11, 2009 at 9:30 PM, Iain Sproat wrote: >> >> I made a few tweaks to the schema at http://writing.freebase.com, which >> meant that the WordNet stuff didn't work so well on freebase.com (you can't >> easily see all the symsets of a word). To compensate, I've created a >> freebase dictionary app at http://dictionary.freebaseapps.com which emulates >> the WordNet web interface. >> There's only a couple of dictionary examples in freebase (waiting until >> the schema is stable before importing WordNet) - and these can be seen >> at http://dictionary.freebaseapps.com/?word=rat & http://dictionary.freebaseapps.com/?word=red >> There's also a bleeding edge view (showing hypernyms and hyponyms) >> at http://2.dictionary.sprocketonline.user.dev.freebaseapps.com/?word=rat >> Finally, I've added a pronounciation type to the base but haven't filled >> in any data for that >> yet. http://www.freebase.com/type/schema/base/writing/pronounciation?domain=/base/writing >> Iain >> On Tue, Aug 11, 2009 at 1:39 AM, Iain Sproat wrote: >>> >>> I've had a go at modelling this. My effort is primarily a synonym set >>> type and a word CVT (linked to the synonym property of symset). see >>> also http://www.freebase.com/view/guid/9202a8c04000641f8000000007cf5081 >>> I went a bit overboard and also modelled glyphs, graphemes, diacritic, >>> lexical categories, morphemes, Phonemes etc. - all in the (poorly >>> named) writing base.
There's a few things missing, particularly lemmas and >>> word roots which would be useful if anyone is planning using freebase data >>> with NLP. >>> One of the things I noticed was that freebase only really has nouns. I >>> assume that verbs, adjectives etc. are also suitable for freebase, but >>> nobody's yet loaded them? >>> Iain >>> >>> On Fri, May 8, 2009 at 1:10 AM, spencer kelly >>> wrote: >>>> >>>> agree, i think the value of linguistic data is >= the value of any other >>>> data we have in freebase -- only more awkward to enter. >>>> with faith in the modelling power of the graph, i assume someone will >>>> figure out a good way to do it eventually. >>>> >>> >> > > -- > Arthur van Hoff > arthur.van.hoff at gmail.com > 650-283-0842 > > From tfmorris at gmail.com Wed Aug 12 13:36:32 2009 From: tfmorris at gmail.com (Tom Morris) Date: Wed, 12 Aug 2009 09:36:32 -0400 Subject: [Data-modeling] Type for groups/sets of people (or any entity?) In-Reply-To: References: <7510661B-F071-4613-8C5F-03B74AC5C57A@metaweb.com> Message-ID: On Tue, Aug 11, 2009 at 11:41 AM, Tom Morris wrote: > On Mon, Jul 20, 2009 at 5:13 PM, Robert Cook wrote: >> >> So, I tried the list type on it, and the results are here: >>
http://www.freebase.com/edit/topic/en/erinyes > > If you type Erinyes in the search box and navigate to the topic, you > end up at http://socialist.freebase.com/list/en/erinyes which is > styled completely differently than a normal topic page. This seems > like an undesirable side effect of using the List type which probably > makes it unacceptable for general use. It appears that this is considered a bug because the following bug reports have been created in JIRA. https://bugs.freebase.com/browse/CLI-8846 https://bugs.freebase.com/browse/CLI-8845 https://bugs.freebase.com/browse/CLI-8844 Tom From arthur.van.hoff at gmail.com Wed Aug 12 13:55:34 2009 From: arthur.van.hoff at gmail.com (Arthur van Hoff) Date: Wed, 12 Aug 2009 15:55:34 +0200 Subject: [Data-modeling] English Words In-Reply-To: References: Message-ID: Hi Ian, On Wed, Aug 12, 2009 at 2:34 PM, Iain Sproat wrote: > > Rather wonderfully Freebase already takes care of the language problem > - all text values have a /type/text/lang property. The MQL reference > documentation has a good explanation of how this behaves > http://www.freebase.com/docs/mql/ch02.html#namesandids I'm aware of the language property for text values. However, I am not sure that is enough. I don't think it is true that each English verb translates directly into a French verb, with the same properties and synonym sets. Either way, it will get very confusing if all synonyms are listed in all languages for each synonym set, since eventually there will be hundreds, randomly mixed together from various languages. > noun/Rat - I'm not too sure what you mean here? The lexical > category is straight from WordNet. > > http://wordnetweb.princeton.edu/perl/webwn?s=rat&sub=Search+WordNet&o2=&o0=1&o7=&o5=&o1=1&o6=&o4=&o3=&h= For "rat" I don't think that the mammal and the noun should be the same topic. These should be separate topics since they are separate concepts (one is an animal, the other is a feature of a language).
-- Arthur van Hoff arthur.van.hoff at gmail.com 650-283-0842 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.freebase.com/pipermail/data-modeling/attachments/20090812/c0174375/attachment-0001.htm From iainsproat at gmail.com Wed Aug 12 15:19:07 2009 From: iainsproat at gmail.com (Iain Sproat) Date: Wed, 12 Aug 2009 19:19:07 +0400 Subject: [Data-modeling] English Words In-Reply-To: References: Message-ID: > I don't think it is true that each english verb translates directly into a french verb Agreed, they don't. But that's where the beauty of a symset (aka topic) comes in - symsets are language independent. The concept of a dog (animal) is still the same in any language, whether it's called a chien, hund, koira or dog. > eventually there will be hundreds, randomly mixed together from various languages We can filter aliases by the /type/text/lang property value, which I think the freebase client does already - so it shouldn't become a problem. Hopefully we can build this semantic dictionary into a translation engine, where a word in /lang/en is used to find a language independent semantic concept (a freebase topic) and from that we can select one of that topic's /lang/fr or /lang/de synonyms. (as a side, we'd need a rating system for the synonyms, so that an obscure or archaic word wasn't chosen) > For "rat" I dont think that the mamal and the noun should be the same topic. > These should be separate topics since they are separate concepts (one is an > animal, the other is a feature of a language). agreed as /en/noun is a different topic from /en/rat - but they are linked through the /base/writing/symset/category property. i.e. rat (animal) is in the noun category, rather than a verb or adjective. 
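A rough sketch of the lookup described above, in case it helps to see it concretely. This is only an illustration under assumptions: the alias table is invented sample data held in memory (not the result of a real Freebase query), and the function name translate is made up. It finds the topics that carry a word as an alias in the source language, then keeps only the aliases tagged with the target language.

```python
# Toy illustration only: the topic ids and aliases below are invented sample
# data, not the result of a real Freebase query.
TOPIC_ALIASES = {
    "/en/dog": [
        ("dog", "/lang/en"),
        ("chien", "/lang/fr"),
        ("Hund", "/lang/de"),
    ],
    "/en/red": [
        ("red", "/lang/en"),
        ("rouge", "/lang/fr"),
        ("rot", "/lang/de"),
    ],
}


def translate(word, source_lang, target_lang, aliases=TOPIC_ALIASES):
    """Go from a word to language-independent topics, then to target-language aliases."""
    results = {}
    for topic_id, names in aliases.items():
        # Step 1: does this topic carry the word as an alias in the source language?
        matches = any(
            value.lower() == word.lower() and lang == source_lang
            for value, lang in names
        )
        if matches:
            # Step 2: keep only the aliases tagged with the target language.
            results[topic_id] = [value for value, lang in names if lang == target_lang]
    return results


print(translate("dog", "/lang/en", "/lang/fr"))
```

A word with several senses would come back with several topics, which is where the rating system for synonyms mentioned above would have to pick one.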
Iain From iainsproat at gmail.com Wed Aug 12 16:25:34 2009 From: iainsproat at gmail.com (Iain Sproat) Date: Wed, 12 Aug 2009 20:25:34 +0400 Subject: [Data-modeling] English Words In-Reply-To: References: Message-ID: On Wed, Aug 12, 2009 at 7:19 PM, Iain Sproat wrote: > Hopefully we can build this semantic dictionary into a translation > engine, And a prototype translation tool: http://1.translate.sprocketonline.user.dev.freebaseapps.com/index But we really need to fill out more aliases to make it useful. http://aliases.freebaseapps.com/ Iain -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.freebase.com/pipermail/data-modeling/attachments/20090812/71e763cb/attachment.htm From jason at metaweb.com Wed Aug 12 17:06:35 2009 From: jason at metaweb.com (Jason Douglas) Date: Wed, 12 Aug 2009 10:06:35 -0700 Subject: [Data-modeling] English Words In-Reply-To: References: Message-ID: <768FA9C9-8E1D-4846-8742-208E801B5B6B@metaweb.com> On Aug 10, 2009, at 2:39 PM, Iain Sproat wrote: > One of the things I noticed was that freebase only really has > nouns. I assume that verbs, adjectives etc. are also suitable for > freebase, but nobody's yet loaded them? Yeah... I'm especially interested in adjectival forms of nouns (since nouns are what we start with as you noted): * Mexico --> Mexican * "Red Wine" Wines --> Red Wines * etc. (David Huynh could probably expand on the specific cases a lot better than I can...) 
-jason From jeff at metaweb.com Wed Aug 12 22:20:05 2009 From: jeff at metaweb.com (Jeff Prucher) Date: Wed, 12 Aug 2009 15:20:05 -0700 Subject: [Data-modeling] English Words In-Reply-To: References: Message-ID: > -----Original Message----- > From: data-modeling-bounces at freebase.com > [mailto:data-modeling-bounces at freebase.com] On Behalf Of Iain Sproat > Sent: Wednesday, August 12, 2009 5:34 AM > To: Freebase data modeling mailing list > Subject: Re: [Data-modeling] English Words > > Arthur, > > Thanks for taking a look at it - I've since tweaked the > schema (once again!). Your work made me realise that I was > trying to be a bit too clever with a separate synonym > property, and that synonyms are already taken care of by the > omnipresent "also known as" /common/topic/alias > property. I've changed the symset properties so that word morphology > to have its own property/CVT, and am using the > /common/topic/alias for all synonyms. The data you've added > should now display correctly in http://dictionary.freebaseapps.com > > I agree that each semantic meaning should be a completely > separate topic from any other semantic meaning. e.g. a rat > (animal) should be a separate topic from rat (informer). If > different meanings have been merged together in the same > topic, then please flag the topic for split. I disagree that word instances should be merged with the topic for the thing they represent. They are really not the same thing at all. The abstract notion of the genus Rattus (which is what /en/rat represents) is not the same thing as the English noun "rat", which is also definitely not the same thing as the Spanish word "rata" or the German word "Ratten", which is what this approach seems to imply. Also, by relying on aliases for synonymy, we lose the ability to do WordNet-y things like tell which sense of the word "rat" the topic for "rattus" (or "informer") is synonymous with. 
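To make the distinction concrete, here is a minimal sketch of words kept as objects separate from the topics they stand for. Everything in it is an assumption for illustration: the class names Word and WordSense, the glosses, and the topic id /en/informer are hypothetical, not existing Freebase schema. The point is only that each sense carries its own link to a topic, which a flat alias on the topic cannot express.

```python
from dataclasses import dataclass, field


@dataclass
class WordSense:
    gloss: str     # short description of this sense of the word
    topic_id: str  # the topic this sense stands for (hypothetical ids)


@dataclass
class Word:
    spelling: str
    lang: str
    senses: list = field(default_factory=list)


# Invented sample data: one English word with two senses.
rat = Word("rat", "/lang/en", [
    WordSense("rodent of the genus Rattus", "/en/rat"),
    WordSense("person who informs on others", "/en/informer"),
])


def senses_for_topic(word, topic_id):
    """Return the glosses of the senses of `word` that stand for `topic_id`."""
    return [sense.gloss for sense in word.senses if sense.topic_id == topic_id]


print(senses_for_topic(rat, "/en/informer"))
```

With aliases alone, both topics would simply list "rat" and there would be nothing to say which sense is meant.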
> We're lacking Dictionary data at the moment, so the most > useful way to contribute would be to import dictionary > definitions to Freebase (Wiktionary and WordNet would be good > starting points). Also, working with Shawn's Alias app to > improve topic aliases would definitely help. One idea about WordNet that's been suggested is that a "word" type in Freebase could be created that didn't include /common/topic. This would prevent the client from becoming cluttered up with topics for words (which users would obviously try to use in place of the topics for the thing the words represent), but would be no less easy to use through the API. Jeff > Iain > > On Wed, Aug 12, 2009 at 1:29 PM, Arthur van > Hoff wrote: > > Hi Ian, > > > > Thanks for doing this. It looks very promising. I tried manually > > adding two synonyms for "rat" (verbs) from wordnet, I'm not > sure I did > > it right. Can you check? > > > > Have you considered how other languages feature in this schema? It > > would be great if it were possible to find synonyms for > words in other > > human languages. We could scrape a lot of translations from > Wikipedia > > if that is useful. > > > > I noticed that for the noun "Rat" you have merged the > concept of the > > Animal with the Noun. I'm not sure that this is the right > approach. In > > my view the noun "Rat" is not the same as the animal "Rat". This > > approach might get confusing once there are nouns in other > languages for the word "Rat". > > > > Alternatively, you could model the noun Rat as a seperate > topic with a > > property which refers to the defining topic (the animal). > That way the > > animal topic would have reverse properties for all nouns in all > > languages (eventually). Perhaps that will work? > > > > I'd like to contribute some, let me know if there is > anything I can do. > > > > Thanks. 
> > > > > > On Tue, Aug 11, 2009 at 9:30 PM, Iain Sproat > wrote: > >> > >> I made a few tweaks to the schema at http://writing.freebase.com, > >> which meant that the WordNet stuff didn't work so well on > >> freebase.com (you can't easily see all the symsets of a word). ?To > >> compensate, I've created a freebase dictionary app at? > >> http://dictionary.freebaseapps.com?which emulates the > WordNet web interface. > >> There's only a couple of dictionary examples in freebase (waiting > >> until the schema is stable before importing WordNet) - and > these can > >> be seen at?http://dictionary.freebaseapps.com/?word=rat?&? > >> http://dictionary.freebaseapps.com/?word=red > >> There's also a bleeding edge view (showing hypernyms and > hyponyms) at? > >> > http://2.dictionary.sprocketonline.user.dev.freebaseapps.com/?word=ra > >> t Finally, I've added a pronounciation type to the base > but haven't > >> filled in any data for that yet.? > >> > http://www.freebase.com/type/schema/base/writing/pronounciation?domai > >> n=/base/writing > >> Iain > >> On Tue, Aug 11, 2009 at 1:39 AM, Iain Sproat > wrote: > >>> > >>> I've had a go at modelling this. ?My effort is primarily > a synonym > >>> set type and a word CVT (linked to the synonym property > of symset). ? > >>> see also? > >>> http://www.freebase.com/view/guid/9202a8c04000641f8000000007cf5081 > >>> I went a bit overboard and also modelled glyphs, graphemes, > >>> diacritic, lexical categories, morphemes, Phonemes etc. - > all in the > >>> (poorly > >>> named)?writing base. ?There's a few things missing, particularly > >>> lemmas and word roots which would be useful if anyone is planning > >>> using freebase data with NLP. > >>> One of the things I noticed was that freebase only really > has nouns. ? > >>> I assume that verbs, adjectives etc. are also suitable > for freebase, > >>> but nobody's yet loaded them? 
> >>> Iain > >>> > >>> On Fri, May 8, 2009 at 1:10 AM, spencer kelly > >>> > >>> wrote: > >>>> > >>>> agree, i think the value of linguistic data is >= the > value of any > >>>> other data we have in freebase -- only more awkward to enter. > >>>> with faith in the modelling power of the graph, i assume someone > >>>> will figure out a good way to do it eventually. > >>>> > >>>> _______________________________________________ > >>>> Data-modeling mailing list > >>>> Data-modeling at freebase.com > >>>> http://lists.freebase.com/mailman/listinfo/data-modeling > >>>> > >>> > >> > >> > >> _______________________________________________ > >> Data-modeling mailing list > >> Data-modeling at freebase.com > >> http://lists.freebase.com/mailman/listinfo/data-modeling > >> > > > > > > > > -- > > Arthur van Hoff > > arthur.van.hoff at gmail.com > > 650-283-0842 > > > > _______________________________________________ > > Data-modeling mailing list > > Data-modeling at freebase.com > > http://lists.freebase.com/mailman/listinfo/data-modeling > > > > > _______________________________________________ > Data-modeling mailing list > Data-modeling at freebase.com > http://lists.freebase.com/mailman/listinfo/data-modeling > From rfh at metaweb.com Wed Aug 12 22:45:48 2009 From: rfh at metaweb.com (Reilly Hayes) Date: Wed, 12 Aug 2009 15:45:48 -0700 Subject: [Data-modeling] English Words In-Reply-To: References: Message-ID: Loading wordnet has been a topic of internal discussion for a while. I think it is important to bring some of the fruits of that internal discussion into this conversation. I've pasted in the document from our internal Wiki. Don't click on the links, they go to our internal wiki. As an aside, we've been told (by a leading NLP researcher who is also a fan of Freebase) that the prolog version of the wordnet database is the best place to start. 
http://wordnetcode.princeton.edu/3.0/WNprolog-3.0.tar.gz -r

Finally Loading Wordnet
From Metaweb

Wordnet has a bit of a tortured history at Metaweb. One of the earliest test cases for graphd, it seems to be one of those obviously useful data sets which we have somehow never managed to load. A number of early Freebase users (most notably Powerset) have made use of it. In the general NLP community Wordnet seems to be heavily used.

Basically, Wordnet is a machine-readable description of the English language carrying information roughly comparable to a dictionary and a thesaurus. For example, Wordnet knows that leukemia is a form of cancer, which is a type of sickness, and that there are several words or phrases for specific types of leukemia (hyponyms), for example "myeloid leukemia."

Summary
I'm suggesting that we load Wordnet in a novel way:
* Each Wordnet word would become an instance of a new top-level type (/common/symbol?) analogous to /common/topic.
* Wordnet words would never be typed as /common/topic but would be related to topics, the most common (only) relationship meaning roughly "this symbol (word) can stand for this topic".

Reconciling Wordnet
The Wordnet schema is mature and amenable to relatively direct translation into Freebase schema. The principal problem to be resolved by a Wordnet load is how to relate Wordnet words to existing Metaweb topics. Basically there are two choices:
* cotype existing topics as "Wordnet words" where we can reconcile Wordnet to existing topics, and create new identities for words that we cannot reconcile
* create new identities for all Wordnet words and relate those identities to existing topics when they can be reconciled
wordnet "synsets" may be better candidates for reconciliation with topics.
Nix 01:35, 1 August 2009 (UTC)
wikipedia has several structures that map words to topics, including disambiguation pages and a template that refers to other notable uses of a word. this should be easy to extract from wex. Nix 01:35, 1 August 2009 (UTC)

Cotyping Topics and Wordnet Words
Cotyping entities as Topics and Wordnet Words engenders a number of problems:
* The spelling of a word (name) is fundamental to identity; not so for a topic. For example, someone might decide that a Wikipedia article entitled "Leukemia" was really about "Myeloid Leukemia" and rename it. From the standpoint of concepts, renaming is entirely correct; from the standpoint of words, renaming is disastrous.
* Creation of Topics which are just English words is going to lead to multiple matches in autocomplete and may cause confusion. Usually, a word is not what you want to see; you want the concept that the word represents.
* Translations for Concepts and Words are different. Typically, a concept will have one "best" representation in every language. For words, this is not true at all. Most words have many possible translations depending on the concept they are intended to express.
  I think this one is pretty much a dealbreaker. Nix 01:35, 1 August 2009 (UTC)
* Information about a word, for example philology, may not apply to all translations. For example, the English word "gift" is Germanic, from "mitgift"; the French equivalent, "cadeau", is Romance.
* Because words have many senses, it is difficult to reconcile topics to words.

Usage
Looking for topics which are special cases of a topic looks something like this:

    {
      ...
      "type" : "/common/wordnet_word",
      "hyponym" : [{
        "type" : "/common/topic",
        "name" : null,
        "id" : null
      }],
      ...
    }

Separate Identities for Wordnet Words
An alternative to merging Wordnet Words and Freebase topics is to create a new "top-level" type analogous to /common/topic to represent words.
Where we can reconcile them, instances of this new /common/wordnet_word type are related to topics, typically those whose name has the same spelling. The reconciliation problem is the same as in the co-typing case (hard), but since reconciliation "failure" doesn't result in confusing pairs of words and concepts with similar spellings, solving it is much less urgent. The Wordnet data is usable as is, and any connections between it and the existing world of Freebase topics are added bonuses.

At a very basic level, this representation makes a clear distinction between a concept and the symbols used to represent that concept in various human languages.

Problems with separate Wordnet identities:

- A type with implicit language binding such as /common/wordnet_word fights with our existing /lang/* localization mechanism. Probably, we don't want to allow a German name for an English word, as this simple pairing is woefully inadequate for the purposes of establishing a translation.

  i think this is ok. but see comment below on joining through a /type/text. Nix 01:35, 1 August 2009 (UTC)

Usage

Looking for topics which are special cases of a topic looks something like this:

{
  ...
  "symbol" : {
    "lang" : "/lang/en",
    "hyponym" : [{
      "topic" : [{
        "name" : null,
        "id" : null
      }]
    }]
  }
  ...
}

where "symbol" is a new property of /common/topic which associates a topic with a language symbol (a Wordnet word being the first example of such).

String Values Revisited

All of this (and some past discussions with Warren) prompts me to revisit the MQL value type /type/text. A MQL string value (instance of /type/text) is represented as a single link whose left is the subject, right is the language identity (for example /lang/en), and value is the string itself. This representation was chosen to save primitives. MQL, at some transformational expense, makes this single primitive look like an object with properties like "value" and "lang".
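
The single-link representation described above can be sketched roughly as follows. This is only an illustration in Python, not graphd or MQL code: the `Link` class, the `as_text_object` helper, and the subject id "/en/leukemia" are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One graph primitive (hypothetical model): a link with two ends and a value."""
    left: str   # the subject identity, e.g. "/en/leukemia" (illustrative id)
    right: str  # the language identity, e.g. "/lang/en"
    value: str  # the string itself


def as_text_object(link: Link) -> dict:
    """Present the single primitive as an object with "value" and "lang" properties."""
    return {"value": link.value, "lang": link.right}


# One primitive is enough to name the subject in English.
name_link = Link(left="/en/leukemia", right="/lang/en", value="Leukemia")
print(as_text_object(name_link))
```

The point of the sketch is that the "object" seen by a query is derived on the fly; nothing beyond the one link is stored.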
However, given the presence of an English Wordnet and the possibility of other Wordnets, it would be quite natural to create actual identities to represent words and phrases. As the sole representation for text, this is extremely expensive. For example, naming something "leukemia" would cost 5 primitives:

- the link from subject to "leukemia"
- the identity for "leukemia"
- the permission for the identity "leukemia"
- the value link carrying the rawstring "leukemia"
- a link from the identity "leukemia" to the language, /lang/en

To a very slight degree, the primitive burn is offset by the fact that naming any subsequent thing "leukemia" will cost one primitive, a link to the identity.

As our sole representation for strings, this is excessively expensive. However, if we're going to be loading Wordnet anyway, it becomes tempting to allow this "expanded" form of /type/text in addition to the current compact form, because it resolves the tension between localized strings and language symbols (words).

Unfortunately, implementation of a hybrid scheme is going to be relatively expensive: when asking for an object's name, we need to use a graph "or" to ask for both types of name. Moreover, when writing English strings, we would need to check for existing words with the same spelling and refer to those instead of creating a literal string value. Lastly, when creating a new Wordnet word, we would need to check for existing English strings with the same spelling and replace them with links to the new word. Pretty daunting.

It would be interesting to know how many English strings in the current OTG would be candidates for replacement with a link to a Wordnet word.

I don't propose that we actually do this, but it might be worth thinking about. Certainly, if you're a fan of "reified strings" this is the sort of thing that they're supposed to be good for.

Much of the need for this would go away if you could search "through" a /type/text in mql using a reverse property.
then the /type/text would lead you straight to the "word" object. Nix 01:35, 1 August 2009 (UTC)

On Aug 12, 2009, at 3:20 PM, Jeff Prucher wrote:

>> -----Original Message-----
>> From: data-modeling-bounces at freebase.com
>> [mailto:data-modeling-bounces at freebase.com] On Behalf Of Iain Sproat
>> Sent: Wednesday, August 12, 2009 5:34 AM
>> To: Freebase data modeling mailing list
>> Subject: Re: [Data-modeling] English Words
>>
>> Arthur,
>>
>> Thanks for taking a look at it - I've since tweaked the
>> schema (once again!). Your work made me realise that I was
>> trying to be a bit too clever with a separate synonym
>> property, and that synonyms are already taken care of by the
>> omnipresent "also known as" /common/topic/alias
>> property. I've changed the symset properties so that word
>> morphology to have its own property/CVT, and am using the
>> /common/topic/alias for all synonyms. The data you've added
>> should now display correctly in http://dictionary.freebaseapps.com
>>
>> I agree that each semantic meaning should be a completely
>> separate topic from any other semantic meaning. e.g. a rat
>> (animal) should be a separate topic from rat (informer). If
>> different meanings have been merged together in the same
>> topic, then please flag the topic for split.
>
> I disagree that word instances should be merged with the topic for the
> thing they represent. They are really not the same thing at all. The
> abstract notion of the genus Rattus (which is what /en/rat represents)
> is not the same thing as the English noun "rat", which is also
> definitely not the same thing as the Spanish word "rata" or the German
> word "Ratten", which is what this approach seems to imply.
>
> Also, by relying on aliases for synonymy, we lose the ability to do
> WordNet-y things like tell which sense of the word "rat" the topic for
> "rattus" (or "informer") is synonymous with.
>
>> We're lacking Dictionary data at the moment, so the most
>> useful way to contribute would be to import dictionary
>> definitions to Freebase (Wiktionary and WordNet would be good
>> starting points). Also, working with Shawn's Alias app to
>> improve topic aliases would definitely help.
>
> One idea about WordNet that's been suggested is that a "word" type in
> Freebase could be created that didn't include /common/topic. This would
> prevent the client from becoming cluttered up with topics for words
> (which users would obviously try to use in place of the topics for the
> thing the words represent), but would be no less easy to use through
> the API.
>
> Jeff
>
>> Iain
>>
>> On Wed, Aug 12, 2009 at 1:29 PM, Arthur van Hoff wrote:
>>> Hi Ian,
>>>
>>> Thanks for doing this. It looks very promising. I tried manually
>>> adding two synonyms for "rat" (verbs) from wordnet, I'm not sure I did
>>> it right. Can you check?
>>>
>>> Have you considered how other languages feature in this schema? It
>>> would be great if it were possible to find synonyms for words in other
>>> human languages. We could scrape a lot of translations from Wikipedia
>>> if that is useful.
>>>
>>> I noticed that for the noun "Rat" you have merged the concept of the
>>> Animal with the Noun. I'm not sure that this is the right approach. In
>>> my view the noun "Rat" is not the same as the animal "Rat". This
>>> approach might get confusing once there are nouns in other languages
>>> for the word "Rat".
>>>
>>> Alternatively, you could model the noun Rat as a seperate topic with a
>>> property which refers to the defining topic (the animal). That way the
>>> animal topic would have reverse properties for all nouns in all
>>> languages (eventually). Perhaps that will work?
>>>
>>> I'd like to contribute some, let me know if there is anything I can do.
>>>
>>> Thanks.
>>>
>>> On Tue, Aug 11, 2009 at 9:30 PM, Iain Sproat wrote:
>>>>
>>>> I made a few tweaks to the schema at http://writing.freebase.com,
>>>> which meant that the WordNet stuff didn't work so well on
>>>> freebase.com (you can't easily see all the symsets of a word). To
>>>> compensate, I've created a freebase dictionary app at
>>>> http://dictionary.freebaseapps.com which emulates the WordNet web
>>>> interface.
>>>> There's only a couple of dictionary examples in freebase (waiting
>>>> until the schema is stable before importing WordNet) - and these can
>>>> be seen at http://dictionary.freebaseapps.com/?word=rat &
>>>> http://dictionary.freebaseapps.com/?word=red
>>>> There's also a bleeding edge view (showing hypernyms and hyponyms) at
>>>> http://2.dictionary.sprocketonline.user.dev.freebaseapps.com/?word=rat
>>>> Finally, I've added a pronounciation type to the base but haven't
>>>> filled in any data for that yet.
>>>> http://www.freebase.com/type/schema/base/writing/pronounciation?domain=/base/writing
>>>> Iain
>>>> On Tue, Aug 11, 2009 at 1:39 AM, Iain Sproat wrote:
>>>>>
>>>>> I've had a go at modelling this. My effort is primarily a synonym
>>>>> set type and a word CVT (linked to the synonym property of symset).
>>>>> see also
>>>>> http://www.freebase.com/view/guid/9202a8c04000641f8000000007cf5081
>>>>> I went a bit overboard and also modelled glyphs, graphemes,
>>>>> diacritic, lexical categories, morphemes, Phonemes etc. - all in the
>>>>> (poorly named) writing base. There's a few things missing,
>>>>> particularly lemmas and word roots which would be useful if anyone
>>>>> is planning using freebase data with NLP.
>>>>> One of the things I noticed was that freebase only really has nouns.
>>>>> I assume that verbs, adjectives etc. are also suitable for freebase,
>>>>> but nobody's yet loaded them?
>>>>> Iain
>>>>>
>>>>> On Fri, May 8, 2009 at 1:10 AM, spencer kelly wrote:
>>>>>>
>>>>>> agree, i think the value of linguistic data is >= the value of any
>>>>>> other data we have in freebase -- only more awkward to enter.
>>>>>> with faith in the modelling power of the graph, i assume someone
>>>>>> will figure out a good way to do it eventually.
>>>>>>
>>>>>> _______________________________________________
>>>>>> Data-modeling mailing list
>>>>>> Data-modeling at freebase.com
>>>>>> http://lists.freebase.com/mailman/listinfo/data-modeling
>>>
>>> --
>>> Arthur van Hoff
>>> arthur.van.hoff at gmail.com
>>> 650-283-0842

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.freebase.com/pipermail/data-modeling/attachments/20090812/2a2d2b23/attachment-0001.htm
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 2434 bytes
Desc: not available
Url : http://lists.freebase.com/pipermail/data-modeling/attachments/20090812/2a2d2b23/attachment-0001.bin

From arthur.van.hoff at gmail.com Thu Aug 13 07:31:51 2009
From: arthur.van.hoff at gmail.com (Arthur van Hoff)
Date: Thu, 13 Aug 2009 09:31:51 +0200
Subject: [Data-modeling] English Words
In-Reply-To:
References:
Message-ID:

On Thu, Aug 13, 2009 at 12:20 AM, Jeff Prucher wrote:

> I disagree that word instances should be merged with the topic for the
> thing they represent. They are really not the same thing at all. The
> abstract notion of the genus Rattus (which is what /en/rat represents)
> is not the same thing as the English noun "rat", which is also
> definitely not the same thing as the Spanish word "rata" or the German
> word "Ratten", which is what this approach seems to imply.

I agree. That is exactly the point I was trying to make. Thanks Jeff.

--
Arthur van Hoff
arthur.van.hoff at gmail.com
650-283-0842
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.freebase.com/pipermail/data-modeling/attachments/20090813/f8bac80a/attachment.htm

From williams.bruce at gmail.com Thu Aug 13 08:11:08 2009
From: williams.bruce at gmail.com (Bruce Williams)
Date: Thu, 13 Aug 2009 01:11:08 -0700
Subject: [Data-modeling] English Words
In-Reply-To:
References:
Message-ID: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com>

> Hopefully we can build this semantic dictionary into a translation
> engine, where a word in /lang/en is used to find a language
> independent semantic concept (a freebase topic) and from that we can
> select one of that topic's /lang/fr or /lang/de synonyms. (as a side,
> we'd need a rating system for the synonyms, so that an obscure or
> archaic word wasn't chosen)
>

This is called "Conceptual Dependency" and was tried in the 80's. Nobody got it to work.
We now have much better statistical translation systems, which is how Google etc. do it.

I am working in the field of Information Extraction, so this thread is very interesting.

Bruce Williams

From iainsproat at gmail.com Thu Aug 13 10:12:08 2009
From: iainsproat at gmail.com (Iain Sproat)
Date: Thu, 13 Aug 2009 14:12:08 +0400
Subject: [Data-modeling] English Words
In-Reply-To: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com>
References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com>
Message-ID:

I've created a diagram showing my understanding of the various options:
http://img.freebase.com/api/trans/raw/guid/9202a8c04000641f800000000e1c4f09
(I modified the /common/symbol idea slightly from what is described in the wiki: I added a 'definition' CVT, explained below.)

On Thu, Aug 13, 2009 at 12:20 AM, Jeff Prucher wrote:
>
> I disagree that word instances should be merged with the topic for the thing
> they represent.

My assumption was that each alias in the /common/topic/alias property could be treated as a separate word 'topic'. I don't think it's entirely wrong, but I'll agree it is confusing. (Plus, using /type/text is very restrictive, as further properties can't be easily included.)

+1 to a /common/symbol type. (Would it also work with hieroglyphs and other logograms?)

> Because words have many senses, it is difficult to reconcile topics to words.

Agreed, very difficult without enough information. But further information might be derived from the word's use - such as emotional content, formality and subjectiveness; and these sorts of properties could also be modelled. e.g. the word "rat" is far more opinionated/negative than the word "informer" when used to describe someone who informs.*

To allow for this sort of modelling, I've shown a "definition" CVT between /common/symbol and /common/topic in the diagram to help solve this sort of problem. This could have additional properties which assist in deciding what word is appropriate for a particular context. It might also allow a statistical/fuzzy relationship between symbols and their meanings. I think the CVT would also allow the language property to be removed from /common/symbol, making symbols language/culture independent and reusable.

*/user/bgoldenberg pointed out projects such as SentiWordNet which try to model these sorts of variables for different words.
http://nmis.isti.cnr.it/sebastiani/Publications/LREC06.pdf

Iain

P.S. Turns out Stefano already has a really great translate app: http://translate.stefanomazzocchi.user.dev.freebaseapps.com/

On Thu, Aug 13, 2009 at 12:11 PM, Bruce Williams wrote:
>
> > Hopefully we can build this semantic dictionary into a translation
> > engine, where a word in /lang/en is used to find a language
> > independent semantic concept (a freebase topic) and from that we can
> > select one of that topic's /lang/fr or /lang/de synonyms. (as a side,
> > we'd need a rating system for the synonyms, so that an obscure or
> > archaic word wasn't chosen)
> >
>
> This is called "Conceptual Dependency" and was tried in the 80's.
> Nobody got it to work. We now have much better statistical translation
> systems, which is how Google, etc do it
>
> I am working in the field of Information Extraction, so this thread is
> very interesting.
>
> Bruce Williams
> _______________________________________________
> Data-modeling mailing list
> Data-modeling at freebase.com
> http://lists.freebase.com/mailman/listinfo/data-modeling

From arthur.van.hoff at gmail.com Thu Aug 13 07:44:53 2009
From: arthur.van.hoff at gmail.com (Arthur van Hoff)
Date: Thu, 13 Aug 2009 09:44:53 +0200
Subject: [Data-modeling] English Words
In-Reply-To:
References:
Message-ID:

This is a great document. I'm encouraged by the amount of thought that has gone into it.
We are heavy users of wordnet data, but it is incomplete in many ways, and this would be a great platform to work with. We've had to extend the data with the common forms of verbs, as well as many other missing words that we found lying around on the web.

FYI, I don't think many /type/text entities would match wordnet words. Wordnet is not very rich in nouns, and it does not contain all senses of verbs etc. However, over time I'm hoping that the Wordnet data set can be extended so that we can build a more complete dictionary.

One observation. After manually entering a few words using the Freebase UI using Iain's base it is obvious to me that this is a very error-prone process. In particular it will be very hard to detect and correct incorrect entries. A better approach would be to build an Acre front end for manual editing/correction of the data.

Other useful types of words to include are titles (Mr. Mrs. Dr., etc.), abbreviations (OMG), people names (Smith, Johnson, ...), punctuation, and slang. This will make it a very useful data set for many applications.

On Thu, Aug 13, 2009 at 12:45 AM, Reilly Hayes wrote:

> Loading wordnet has been a topic of internal discussion for a while. I
> think it is important to bring some of the fruits of that internal
> discussion into this conversation. I've pasted in the document from our
> internal Wiki. Don't click on the links, they go to our internal wiki.
> As an aside, we've been told (by a leading NLP researcher who is also a fan
> of Freebase) that the prolog version of the wordnet database is the best
> place to start. http://wordnetcode.princeton.edu/3.0/WNprolog-3.0.tar.gz
>
> -r
>
> Finally Loading Wordnet From Metaweb
>
> Wordnet has a bit of a tortured history at Metaweb. One of the earliest
> test cases for graphd, it seems to be one of those obviously useful data
> sets which we have somehow never managed to load. A number of early
> Freebase users (most notably Powerset) have made use of it.
In the general NLP community Wordnet seems to be heavily used. > > Basically, Wordnet is an machine readable description of the English > language carrying information roughly comparable to a dictionary and a > thesaurus. For example, Wordnet knows that leukemiais a form of cancer which is a type of sickness, and that there are several > words or phrases for specific types of leukemia (hyponyms) for example, > "myeloid leukemia." > Contents [hide] > > - 1 Summary > - 1.1 Reconciling Wordnet > - 1.1.1 Cotyping Topics and Wordnet Words > - 1.1.1.1 Usage > - 1.1.2 Separate Identities for Wordnet Words > - 1.1.2.1 Usage > - 1.2 String Values Revisited > > Summary > > I'm suggesting that we load Wordnet in a novel way: Each Wordnet word would > become an instance new top-level type, (/common/symbol?) analogous to > /common/topic. Wordnet words would never be typed as /common/topic but would > be related to topics the most common (only) relationship meaning roughly > "this symbol (word) can stand for this topic" > Reconciling Wordnet > > The Wordnet schema is mature and amenable to relatively direct translation > into Freebase schema. The principle problem to be resolved by a Wordnet load > is how to relate Wordnet words to existing Metaweb topics. Basically there > are two choices: > > - cotype existing topics as "Wordnet words" where we can reconcile > Wordnet to existing topics, create new new identites for words than we > cannot reconcile > - create new identities for all Wordnet words and relate those > identites to existing topics when they can be reconciled > > wordnet "synsets" may be better candidates for reconciliation with topics. > Nix 01:35, 1 August 2009 > (UTC) wikipedia has several structures that map words to topics, including > disambiguation pages and a template that refers to other notable uses of a > word. this should be easy to extract from wex. 
Nix01:35, 1 August 2009 (UTC) > Cotyping Topics and Wordnet Words > > Cotyping entities as Topics and Wordnet Words engenders a number of > problems: > > - The spelling of a word (name) is fundamental to identity, not so for > a topic > > For example someone might decide that a Wikipedia article entitled > "Leukemia" was really about "Myeloid Leukemia" and rename it. From the > standpoint of concepts, renaming is entirely correct, from the standpoint of > words, renaming is disastrous. > > - Creation of Topics which are just English words is going to lead to > multiple matches in autocomplete and may cause confusion. Usually, a word is > not what you want to see, you want the concept that the word represents. > > > - Translations for Concepts and Words are different. Typically, a > concept will have one "best" representation in every language. For words, > this is not true at all. Most words have many possible translations > depending on the concept they are intended to express. > > I think this one is pretty much a dealbreaker. Nix01:35, 1 August 2009 (UTC) > > - Information about a word, for example philology, may not apply to all > translations. For example the English word "gift" is Germanic, from > "mitgift", the French equivalent, "cadeau", is romantic. > > > - Because words have many senses, it is difficult to reconcile topics > to words. > > Usage > > Looking for topics which are special cases of a topic looks something like > this: > > { > ... > "type" : "/common/wordnet_word", > "hyponym" : [{ > "type" : "/common/topic", > "name" : "null" > "id" : null, > }], > ... > } > > Separate Identities for Wordnet Words > > An alternative to merging Wordnet Words and Freebase topics is to create a > new "top-level" type analogous to /common/topic to represent words. Where we > can reconcile them, instances of this new /common/wordnet_word type are > related to topics, typically those whose name has the same spelling. 
The > reconciliation problem is the same as in the co-typing case (hard) but since > reconciliation "failure" doesn't result in confusing pairs of words and > concepts with similar spellings solving it is much less urgent. The Wordnet > data is usable as is, and any connections between it and the existing world > of Freebase topics are added bonuses. > > At a very basic level, this representation makes a clear distinction > between a concept and the symbols used to represent that concept in various > human languages. > > Problems with separate Wordnet identities: > > - A type with implicit language binding such as /common/wordnet_word > fights with our existing /lang/* localization mechanism. Probably, we don't > want to allow a German name for an English word as this simple pairing is > woefully inadequate for the purposes of establishing a translation. > > i think this is ok. but see comment below on joining through a /type/text. > Nix 01:35, 1 August 2009 > (UTC) Usage > > Looking for topics which are special cases of a topic looks something like > this: > > { > ... > "symbol" : { > "lang" : "/lang/en", > "hyponym" : [{ > "topic" : [{ > "name" : null, > "id" : null, > }] > }], > } > ... > } > > where "symbol" is a new property of /common/topic which associates a topic > with a language symbol. associates a topic with with a language symbol > (wordnet word being the first example of such). > String Values Revisited > > All of this (and some past discussions with Warren) prompts me to revisit > the MQL value type /type/text. A MQL string value (instance of /type/text) > is represented as a single link whose left is the subject, right is the > language identity (for example /lang/en), and value is the string itself. > This representation was chosen to save primitives. MQL, at some > transformational expense, makes this single primitive look like an object > with properties like "value" and "lang". 
> > However, given the presence of an English Wordnet and the possibility of > other Wordnets, it would be quite natural to create actual identities to > represent words and phrases. As a the sole representation for text, this is > extremely expensive. For example, naming something "leukemia" would cost 5 > primitives: > > - the link from subject to "leukemia" > - the identity for "leukemia" > - the permission for the identity "leukemia" > - the value link carrying the rawstring "leukemia" > - a link from the identity "leukemia" to the language, /lang/en > > To a very slight degree, the primitive burn is offset by the fact that > naming any subsequent thing "leukemia" will cost one primitive, a link to > the identity. > > As our sole representation for strings, this excessively expensive. > However, if we're going to be loading Wordnet anyway, it becomes tempting to > allow this "expanded" form of /type/text in addition to the current compact > form because it resolves the tension between localized strings and language > symbols (words). > > Unfortunately, implementation of a hybrid scheme is going to be relatively > expensive: When asking for an object's name, we need to use a graph or to > ask for both types of name. Moreover, when writing English strings, we would > need to check for existing words with the same spelling and refer to those > instead of creating a literal string value. Lastly, when creating a new > Wordnet word, we would need to check for existing English strings with the > same spelling and replace them with links to the new word. Pretty daunting. > > It would be interesting to know how many English strings in the current OTG > would be candidates for replacement wit h a link to a Wordnet word. > > I don't propose that we actually do this, but it might be worth thinking > about. Certainly, if you're a fan of "reified strings" this is the sort of > thing that they're supposed to be good for. 
> Much of the need for this would go away if you could search "through" a > /type/text in mql using a reverse property. then the /type/text would lead > you straight to the "word" object. Nix01:35, 1 August 2009 (UTC) > On Aug 12, 2009, at 3:20 PM, Jeff Prucher wrote: > > > > -----Original Message----- > > From: data-modeling-bounces at freebase.com > > [mailto:data-modeling-bounces at freebase.com] > On Behalf Of Iain Sproat > > Sent: Wednesday, August 12, 2009 5:34 AM > > To: Freebase data modeling mailing list > > Subject: Re: [Data-modeling] English Words > > > Arthur, > > > Thanks for taking a look at it - I've since tweaked the > > schema (once again!). Your work made me realise that I was > > trying to be a bit too clever with a separate synonym > > property, and that synonyms are already taken care of by the > > omnipresent "also known as" /common/topic/alias > > property. I've changed the symset properties so that word morphology > > to have its own property/CVT, and am using the > > /common/topic/alias for all synonyms. The data you've added > > should now display correctly in http://dictionary.freebaseapps.com > > > I agree that each semantic meaning should be a completely > > separate topic from any other semantic meaning. e.g. a rat > > (animal) should be a separate topic from rat (informer). If > > different meanings have been merged together in the same > > topic, then please flag the topic for split. > > > I disagree that word instances should be merged with the topic for the > thing > they represent. They are really not the same thing at all. The abstract > notion of the genus Rattus (which is what /en/rat represents) is not the > same thing as the English noun "rat", which is also definitely not the same > thing as the Spanish word "rata" or the German word "Ratten", which is what > this approach seems to imply. 
> > Also, by relying on aliases for synonymy, we lose the ability to do > WordNet-y things like tell which sense of the word "rat" the topic for > "rattus" (or "informer") is synonymous with. > > We're lacking Dictionary data at the moment, so the most > > useful way to contribute would be to import dictionary > > definitions to Freebase (Wiktionary and WordNet would be good > > starting points). Also, working with Shawn's Alias app to > > improve topic aliases would definitely help. > > > One idea about WordNet that's been suggested is that a "word" type in > Freebase could be created that didn't include /common/topic. This would > prevent the client from becoming cluttered up with topics for words (which > users would obviously try to use in place of the topics for the thing the > words represent), but would be no less easy to use through the API. > > Jeff > > Iain > > > On Wed, Aug 12, 2009 at 1:29 PM, Arthur van > > Hoff wrote: > > Hi Ian, > > > Thanks for doing this. It looks very promising. I tried manually > > adding two synonyms for "rat" (verbs) from wordnet, I'm not > > sure I did > > it right. Can you check? > > > Have you considered how other languages feature in this schema? It > > would be great if it were possible to find synonyms for > > words in other > > human languages. We could scrape a lot of translations from > > Wikipedia > > if that is useful. > > > I noticed that for the noun "Rat" you have merged the > > concept of the > > Animal with the Noun. I'm not sure that this is the right > > approach. In > > my view the noun "Rat" is not the same as the animal "Rat". This > > approach might get confusing once there are nouns in other > > languages for the word "Rat". > > > Alternatively, you could model the noun Rat as a seperate > > topic with a > > property which refers to the defining topic (the animal). > > That way the > > animal topic would have reverse properties for all nouns in all > > languages (eventually). Perhaps that will work? 
> > > I'd like to contribute some, let me know if there is > > anything I can do. > > > Thanks. > > > > On Tue, Aug 11, 2009 at 9:30 PM, Iain Sproat > > wrote: > > > I made a few tweaks to the schema at http://writing.freebase.com, > > which meant that the WordNet stuff didn't work so well on > > freebase.com (you can't easily see all the symsets of a word). To > > compensate, I've created a freebase dictionary app at > > http://dictionary.freebaseapps.com which emulates the > > WordNet web interface. > > There's only a couple of dictionary examples in freebase (waiting > > until the schema is stable before importing WordNet) - and > > these can > > be seen at http://dictionary.freebaseapps.com/?word=rat & > > http://dictionary.freebaseapps.com/?word=red > > There's also a bleeding edge view (showing hypernyms and > > hyponyms) at > > > http://2.dictionary.sprocketonline.user.dev.freebaseapps.com/?word=ra > > t Finally, I've added a pronounciation type to the base > > but haven't > > filled in any data for that yet. > > > http://www.freebase.com/type/schema/base/writing/pronounciation?domai > > n=/base/writing > > Iain > > On Tue, Aug 11, 2009 at 1:39 AM, Iain Sproat > > wrote: > > > I've had a go at modelling this. My effort is primarily > > a synonym > > set type and a word CVT (linked to the synonym property > > of symset). > > see also > > http://www.freebase.com/view/guid/9202a8c04000641f8000000007cf5081 > > I went a bit overboard and also modelled glyphs, graphemes, > > diacritic, lexical categories, morphemes, Phonemes etc. - > > all in the > > (poorly > > named) writing base. There's a few things missing, particularly > > lemmas and word roots which would be useful if anyone is planning > > using freebase data with NLP. > > One of the things I noticed was that freebase only really > > has nouns. > > I assume that verbs, adjectives etc. are also suitable > > for freebase, > > but nobody's yet loaded them? 
> > Iain > > > On Fri, May 8, 2009 at 1:10 AM, spencer kelly > > > > wrote: > > > agree, i think the value of linguistic data is >= the > > value of any > > other data we have in freebase -- only more awkward to enter. > > with faith in the modelling power of the graph, i assume someone > > will figure out a good way to do it eventually. > > > _______________________________________________ > > Data-modeling mailing list > > Data-modeling at freebase.com > > http://lists.freebase.com/mailman/listinfo/data-modeling > > -- Arthur van Hoff arthur.van.hoff at gmail.com 650-283-0842 -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.freebase.com/pipermail/data-modeling/attachments/20090813/a0ddb66f/attachment-0001.htm
From tfmorris at gmail.com Thu Aug 13 20:22:35 2009 From: tfmorris at gmail.com (Tom Morris) Date: Thu, 13 Aug 2009 16:22:35 -0400 Subject: [Data-modeling] English Words In-Reply-To: References: Message-ID: On Thu, Aug 13, 2009 at 3:44 AM, Arthur van Hoff wrote: > Other useful types of words to include are titles (Mr. Mrs. Dr., etc.), Postnominals or name suffixes (e.g. Jr, Sr., MD, PhD, Esq., etc.) as well as prefixes/titles. > abbreviations (OMG), people names (Smith, Johnson, ...), punctuation, and There's a Names base at http://givennames.freebase.com/ but I haven't been able to get an answer as to its provenance, so I have no idea how reliable it is. > slang. This will make it a very useful data set for many applications. A significant weakness of the current name schema is that there's no way of telling what a name/alias represents, so you don't know if it's the official name, an abbreviation, a nickname, a maiden (birth) name, or what. Also there are no dates associated with the names for those instances where a name was changed at a particular point in time. For topics derived from Wikipedia, the main name is usually the common name as opposed to the official name (if they're different), while for most other entries the main name is usually the official name, so you can't even make any assumptions based on whether it's an alias or not.
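As a rough illustration of the gap described above, here is a minimal Python sketch of a name record that carries both a kind and an optional validity period. It is not part of any Freebase schema: the NameRecord class, the kind labels, the names_on helper, and the example data are all hypothetical and exist only to show what "typed, dated names" could look like.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class NameRecord:
    """Hypothetical record: a name plus what it represents and when it applied."""
    text: str
    kind: str  # assumed labels: "official", "common", "abbreviation", "nickname", "birth"
    valid_from: Optional[date] = None  # None means "no known start"
    valid_to: Optional[date] = None    # None means "still in use"


def names_on(records: List[NameRecord], day: date, kind: str) -> List[str]:
    """Return the names of the given kind that were in effect on `day`."""
    result = []
    for record in records:
        starts_ok = record.valid_from is None or record.valid_from <= day
        ends_ok = record.valid_to is None or day < record.valid_to
        if record.kind == kind and starts_ok and ends_ok:
            result.append(record.text)
    return result


# Made-up data: an organisation that changed its official name in 2005.
records = [
    NameRecord("Example Widgets Incorporated", "official", valid_to=date(2005, 1, 1)),
    NameRecord("Example Holdings Incorporated", "official", valid_from=date(2005, 1, 1)),
    NameRecord("ExWid", "abbreviation"),
]

print(names_on(records, date(2001, 6, 1), "official"))
print(names_on(records, date(2009, 8, 13), "official"))
```

With something along these lines, a client would not have to guess whether the main name is the common or the official one; it could ask for the kind it wants at the date it cares about.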
Tom
From spatial.db at gmail.com Thu Aug 13 20:29:02 2009 From: spatial.db at gmail.com (Ed Laurent) Date: Thu, 13 Aug 2009 16:29:02 -0400 Subject: [Data-modeling] English Words In-Reply-To: References: Message-ID: I've also done some name modeling in the LitCentral base: http://www.freebase.com/type/schema/base/litcentral/person_full_name http://www.freebase.com/type/schema/base/litcentral/person_nickname http://www.freebase.com/type/schema/base/litcentral/person_name_suffix http://www.freebase.com/type/schema/base/litcentral/person_name_honorary_title http://www.freebase.com/type/schema/base/litcentral/name_suffix_description I toyed around with making these date mediated but couldn't find a solution that I liked. -Ed -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.freebase.com/pipermail/data-modeling/attachments/20090813/3e13ddb0/attachment.htm
From tfmorris at gmail.com Thu Aug 13 21:15:47 2009 From: tfmorris at gmail.com (Tom Morris) Date: Thu, 13 Aug 2009 17:15:47 -0400 Subject: [Data-modeling] English Words In-Reply-To: References: Message-ID: Oops! Sorry Ed, I should have remembered that since I looked at it not long ago. In revisiting the schema, I notice that you've used Sean's /user/narphorium/natural_language/abbreviated_topic so that's another thread to be pulled in here. Tom
From sm at metaweb.com Thu Aug 13 23:28:10 2009 From: sm at metaweb.com (Scott Meyer) Date: Thu, 13 Aug 2009 16:28:10 -0700 Subject: [Data-modeling] English Words In-Reply-To: References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com> Message-ID: <4A84A18A.7090109@metaweb.com> Iain, Do you have a source for Word Morphology/Morpheme data? I don't see this in WordNet. For example WordNet notes that "break" is a derivationally related form of "breakable" but doesn't seem to know that "breakable" is a composition of "break" and "-able". -Scott
From stefano at metaweb.com Fri Aug 14 00:48:02 2009 From: stefano at metaweb.com (Stefano Mazzocchi) Date: Thu, 13 Aug 2009 17:48:02 -0700 Subject: [Data-modeling] What to do with 'cancelled' movies?
Message-ID: <4A84B442.5050607@metaweb.com> I've been trying to add 'released dates' to movies but I keep running into movies that have not been released but have been canceled, and whose articles have been deleted from Wikipedia: http://www.freebase.com/view/en/ajnabee_shehr_mein http://deletionpedia.dbatley.com/w/index.php?title=Ajnabee_Shehr_Mein_%28deleted_12_Mar_2008_at_02:00%29 Should we follow Wikipedia and delete, or should we be more lax and mark accordingly? If so, what is the proper way to mark a movie as "cancelled"? -- Stefano Mazzocchi Application Catalyst Metaweb Technologies, Inc. stefano at metaweb.com -------------------------------------------------------------------
From gordon at metaweb.com Fri Aug 14 01:10:19 2009 From: gordon at metaweb.com (Gordon Mackenzie) Date: Thu, 13 Aug 2009 18:10:19 -0700 Subject: [Data-modeling] What to do with 'cancelled' movies? In-Reply-To: <4A84B442.5050607@metaweb.com> References: <4A84B442.5050607@metaweb.com> Message-ID: <7A7D6B22-12A5-4C57-BE66-89AA9FB79651@metaweb.com> I delete em. Unless we want to start a base of cancelled film projects and add properties/types for capturing who was involved and when, with some sort of scale as to how solid the rumored/actual involvement of the principals with the doomed project was. We could do the same with people who were going to work on a released film. As for sketchy film projects that haven't definitely been cancelled, no release date is better than some of the ones that have release dates that have already passed.
~ Gordon <<< gordon at metaweb.com >>>
From sm at metaweb.com Fri Aug 14 01:56:11 2009 From: sm at metaweb.com (Scott Meyer) Date: Thu, 13 Aug 2009 18:56:11 -0700 Subject: [Data-modeling] English Words In-Reply-To: References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com> Message-ID: <4A84C43B.3040303@metaweb.com> Hi, I'm the "I" in the wiki page that Reilly posted... The schema suggested was quite vague, mostly because I just wanted to make the point that language symbols deserved their own identities. Sadly, I was quite vague about the schema, specifically, the distinction between a synset and a word. I'll try and correct this now. Iain Sproat wrote: >> Because words have many senses, it is difficult to reconcile topics to words. > > Agreed, very difficult without enough information. But further > information might be derived from the word's use - such as emotional > content, formality and subjectiveness; and these sorts of properties > could also be modelled. e.g.
the word "rat" is far more > opinionated/negative than the word "informer" when used to describe > someone who informs.* > > To allow for this sort of modelling, I've shown a "definition" CVT > between /common/symbol and /common/topic in the diagram to help solve > this sort of problem. This could have additional properties which > assist in deciding what word is appropriate for a particular context. > It might also allow a statistical/fuzzy relationship between symbols > and their meanings.
Ah, so you're proposing that we resolve the quandary between associating topics with words/symbols (corresponding to "name") or synsets (corresponding to "definition") by associating both. The other choices would be:
1. Reconcile topic to Synset, use value of name to locate symbol/word. This allows direct navigation to the Synset which was implied by this sort of usage:
> {
>   ...
>   "symbol" : {
>     "lang" : "/lang/en",
>     "hyponym" : [{
>       "topic" : [{
>         "name" : null,
>         "id" : null,
>       }]
>     }],
>   }
>   ...
> }
The possibility of this sort of usage seems to get lots of people very excited; traversing through specific words has a much smaller fan club, and, as the last section of the document points out, is actually something that we explicitly decided not to do in the past. Moving from a topic to the symbol/word which represents it requires two queries as we don't do joins on value. We have to ask for a topic's name (say, "Rat") and then get its synset and ask for the word attached to that synset whose name is "Rat". Since we're associating a topic with a particular definition, the process of reconciliation is going to be hard: for any synset, find all topics whose name is spelled the same as some word in the synset and show the whole mess to a human being.
2. Reconcile topic to word only. This makes it easy to locate the symbol, but, as each symbol is a member of multiple synsets, it makes the exciting query above much harder to write and to execute.
The actual work of reconciliation could be as simple as "finding topics whose name or alias matches the spelling of a given word". But this doesn't seem to add much value.
3. Reconcile to both synset and word directly with two separate properties, say "stands for" and "means". With just one wordnet, this works fine, but as soon as we have multiple wordnets, figuring out which "stands for" goes with which "means" becomes cumbersome.
4. What you propose above, use a CVT to bundle symbol and meaning together.
BTW, when I say "reconcile" (in any of its various morphologies) I mean "write the property relating a topic to a set of symbols which might represent it." Usually we use "reconcile" in the sense of "decide that two things are actually identical." OK, so I guess I favor #1. Given the existing name property, an explicit word-topic association seems largely redundant.
> I think the CVT would also allow the language property to be removed > from /common/symbol, making symbols language/culture independent and > reusable.
I'm a bit skeptical here. Language independence would seem to degenerate into one of two cases:
1. Translation via synsets
This is essentially what a translating dictionary offers. The results are notoriously hit-or-miss. Also, it isn't clear that synsets are inherently multi-lingual.
2. Enumeration of individual "good" translations
For example "the English word gift as translated by the French word cadeau and the German word Geschenk and..." I suspect that this is going to generate many, many symbols whose English spelling is identical and whose meaning is only penetrable to the patient polyglot.
Conversely, I think that our existing topics form an excellent, maximally specific locus for translations. If I name a topic about an informer "Rat" I could reasonably expect that translations would be as accurate as possible ("mouchard") given the contents of the topic.
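To make the difference between options 1 and 4 a little more concrete, here is a small Python sketch. It is an illustration under assumptions, not Freebase or MQL code: the Word, Synset, Topic and Definition classes and the word_for_topic helper are hypothetical names, and the "rat" data is made up from the examples in this thread. Under option 1 the topic only points at a synset, so finding the word takes a second step that matches on the topic's name; under option 4 a CVT-like record stores the word and synset pairing directly.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Word:
    spelling: str
    lang: str = "/lang/en"


@dataclass
class Synset:
    gloss: str
    words: List[Word] = field(default_factory=list)


@dataclass
class Topic:
    name: str
    synset: Optional[Synset] = None  # option 1: topic is reconciled to a synset only


@dataclass
class Definition:
    """Option 4: a CVT-like record bundling one word with one synset."""
    word: Word
    synset: Synset


def word_for_topic(topic: Topic) -> Optional[Word]:
    """Option 1 lookup: first read the topic's name and synset, then search
    that synset for a word whose spelling matches the name."""
    if topic.synset is None:
        return None
    for word in topic.synset.words:
        if word.spelling.lower() == topic.name.lower():
            return word
    return None


# Made-up data: "rat" belongs to two synsets, "informer" to one.
rat, informer = Word("rat"), Word("informer")
rodent = Synset("long-tailed rodent", [rat])
snitch = Synset("person who informs", [rat, informer])

# Option 1: two steps, and the second depends on the name matching a spelling.
topic = Topic("Rat", synset=snitch)
found = word_for_topic(topic)

# Option 4: the pairing is stored explicitly, so no name matching is needed.
definition = Definition(word=rat, synset=snitch)
```

The sketch also shows why option 1 leans on the existing name property: if the topic were named "Mouchard", the second step would find nothing in an English synset, whereas the explicit pairing in option 4 would still hold.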
-Scott
From alecf at metaweb.com Fri Aug 14 17:25:00 2009 From: alecf at metaweb.com (Alec Flett) Date: Fri, 14 Aug 2009 10:25:00 -0700 Subject: [Data-modeling] What to do with 'cancelled' movies? In-Reply-To: <7A7D6B22-12A5-4C57-BE66-89AA9FB79651@metaweb.com> References: <4A84B442.5050607@metaweb.com> <7A7D6B22-12A5-4C57-BE66-89AA9FB79651@metaweb.com> Message-ID: On Aug 13, 2009, at 6:10 PM, Gordon Mackenzie wrote: > I delete em. > > Unless we want to start a base of cancelled film projects and add > properties/types for capturing who was involved when with a some sort > of scale as to how solid the rumored/actual involvement of the > principals with the doomed project. I actually think that's pretty interesting information, maybe we could do something like Deceased Person - cotype it as a "Canceled Film" with a cancellation date? Alec
From spatial.db at gmail.com Fri Aug 14 17:33:22 2009 From: spatial.db at gmail.com (Ed Laurent) Date: Fri, 14 Aug 2009 13:33:22 -0400 Subject: [Data-modeling] What to do with 'cancelled' movies? In-Reply-To: References: <4A84B442.5050607@metaweb.com> <7A7D6B22-12A5-4C57-BE66-89AA9FB79651@metaweb.com> Message-ID: +1. Don't know if there are typical cancellation reasons that could be used to populate an enumerated type, but a short text input property might be useful to collect that information for now. -Ed -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.freebase.com/pipermail/data-modeling/attachments/20090814/7a3c35a5/attachment.htm
From jon at metaweb.com Fri Aug 14 18:57:30 2009 From: jon at metaweb.com (Jon Reitsma) Date: Fri, 14 Aug 2009 11:57:30 -0700 (PDT) Subject: [Data-modeling] What to do with 'cancelled' movies? In-Reply-To: <2055763793.140481250276175773.JavaMail.root@zimbra01.corp.sjc1.metaweb.com> Message-ID: <1264917225.140501250276250011.JavaMail.root@zimbra01.corp.sjc1.metaweb.com> +1 on keeping the data too.
I actually have been voting no on Gordon's flags (but losing I think). I think it might be useful in the popstra world too - projects on hold, dead, etc. j
From stefano at metaweb.com Fri Aug 14 19:12:24 2009 From: stefano at metaweb.com (Stefano Mazzocchi) Date: Fri, 14 Aug 2009 12:12:24 -0700 Subject: [Data-modeling] What to do with 'cancelled' movies? In-Reply-To: <1264917225.140501250276250011.JavaMail.root@zimbra01.corp.sjc1.metaweb.com> References: <1264917225.140501250276250011.JavaMail.root@zimbra01.corp.sjc1.metaweb.com> Message-ID: <4A85B718.5050903@metaweb.com> Jon Reitsma wrote: > +1 on keeping the data too. I actually have been voting no on Gordon's flags (but losing I think). I think it might be useful in the popstra world too - projects on hold, dead, etc. Ok, so keeping it is. Now, how do we model this? -- Stefano Mazzocchi Application Catalyst Metaweb Technologies, Inc. stefano at metaweb.com -------------------------------------------------------------------
From rfh at metaweb.com Fri Aug 14 19:19:01 2009 From: rfh at metaweb.com (Reilly Hayes) Date: Fri, 14 Aug 2009 12:19:01 -0700 Subject: [Data-modeling] What to do with 'cancelled' movies? In-Reply-To: <4A85B718.5050903@metaweb.com> References: <1264917225.140501250276250011.JavaMail.root@zimbra01.corp.sjc1.metaweb.com> <4A85B718.5050903@metaweb.com> Message-ID: <08257311-A2B3-4AEA-9EC5-C6AFE29870A9@metaweb.com> I suggest a blanket type for unreleased and incomplete films. On Aug 14, 2009, at 12:12 PM, Stefano Mazzocchi wrote: > Jon Reitsma wrote: >> +1 on keeping the data too. I actually have been voting no on >> Gordon's flags (but losing I think). I think it might be useful in >> the popstra world too - projects on hold, dead, etc. > > Ok, so keeping it is. > > Now, how do we model this?
-------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 2434 bytes Desc: not available Url : http://lists.freebase.com/pipermail/data-modeling/attachments/20090814/566e6f25/attachment-0001.bin
From gordon at metaweb.com Fri Aug 14 21:23:38 2009 From: gordon at metaweb.com (Gordon Mackenzie) Date: Fri, 14 Aug 2009 14:23:38 -0700 Subject: [Data-modeling] What to do with 'cancelled' movies? In-Reply-To: <08257311-A2B3-4AEA-9EC5-C6AFE29870A9@metaweb.com> References: <1264917225.140501250276250011.JavaMail.root@zimbra01.corp.sjc1.metaweb.com> <4A85B718.5050903@metaweb.com> <08257311-A2B3-4AEA-9EC5-C6AFE29870A9@metaweb.com> Message-ID: Dead Media commons?
Dead Movies, Dead Music Releases, Dead Written Works ~ Gordon <<< gordon at metaweb.com >>> On Aug 14, 2009, at 12:19 PM, Reilly Hayes wrote: > > I suggest a blanket type for unreleased and incomplete films. > > > > On Aug 14, 2009, at 12:12 PM, Stefano Mazzocchi wrote: > >> Jon Reitsma wrote: >>> +1 on keeping the data too. I actually have been voting no on >>> Gordon's flags (but losing I think). I think it might be useful >>> in the popstra world too - projects on hold, dead, etc. >> >> Ok, so keeping it is. >> >> Now, how do we model this? >> >>> >>> j >>> ----- Original Message ----- >>> From: "Alec Flett" >>> To: "Freebase data modeling mailing list" >> > >>> Sent: Friday, August 14, 2009 10:25:00 AM GMT -08:00 US/Canada >>> Pacific >>> Subject: Re: [Data-modeling] What to do with 'cancelled' movies? >>> >>> >>> On Aug 13, 2009, at 6:10 PM, Gordon Mackenzie wrote: >>> >>>> I delete em. >>>> >>>> Unless we want to start a base of cancelled film projects and add >>>> properties/types for capturing who was involved when with a some >>>> sort >>>> of scale as to how solid the rumored/actual involvement of the >>>> principals with the doomed project. >>> >>> I actually think that's pretty interesting information, maybe we >>> could >>> do something like Deceased Person - cotype it as a "Canceled Film" >>> with a cancellation date? >>> >>> Alec >>> >>>> We could do the same with people who were going to work on a >>>> released >>>> film. >>>> >>>> As for sketchy film projects that haven't definitely been >>>> cancelled, >>>> no release date is better than some of the ones that have release >>>> dates that have already passed. 
>>>> >>>> ~ Gordon >>>> >>>> <<< gordon at metaweb.com >>> >>>> >>>> >>>> >>>> On Aug 13, 2009, at 5:48 PM, Stefano Mazzocchi wrote: >>>> >>>>> I've been trying to add 'released dates' to movies but I keep >>>>> running >>>>> into movies that have not been released yet but have been canceled >>>>> and >>>>> their article deleted from wikipedia: >>>>> >>>>> http://www.freebase.com/view/en/ajnabee_shehr_mein >>>>> >>>>> http://deletionpedia.dbatley.com/w/index.php?title=Ajnabee_Shehr_Mein_%28deleted_12_Mar_2008_at_02:00%29 >>>>> >>>>> Should we follow wikipedia and delete, or should we be more lax >>>>> and >>>>> mark >>>>> accordingly? if so, what is the proper way to mark a movie as >>>>> "cancelled"? >>>>> >>>>> -- >>>>> Stefano Mazzocchi Application >>>>> Catalyst >>>>> Metaweb Technologies, Inc. >>>>> stefano at metaweb.com >>>>> ------------------------------------------------------------------- >>>>> >>>>> _______________________________________________ >>>>> Data-modeling mailing list >>>>> Data-modeling at freebase.com >>>>> http://lists.freebase.com/mailman/listinfo/data-modeling >>>> _______________________________________________ >>>> Data-modeling mailing list >>>> Data-modeling at freebase.com >>>> http://lists.freebase.com/mailman/listinfo/data-modeling >>> >>> _______________________________________________ >>> Data-modeling mailing list >>> Data-modeling at freebase.com >>> http://lists.freebase.com/mailman/listinfo/data-modeling >>> _______________________________________________ >>> Data-modeling mailing list >>> Data-modeling at freebase.com >>> http://lists.freebase.com/mailman/listinfo/data-modeling >> >> >> -- >> Stefano Mazzocchi Application Catalyst >> Metaweb Technologies, Inc. 
stefano at metaweb.com >> ------------------------------------------------------------------- >> >> _______________________________________________ >> Data-modeling mailing list >> Data-modeling at freebase.com >> http://lists.freebase.com/mailman/listinfo/data-modeling > > _______________________________________________ > Data-modeling mailing list > Data-modeling at freebase.com > http://lists.freebase.com/mailman/listinfo/data-modeling

From stefano at metaweb.com Fri Aug 14 22:00:40 2009
From: stefano at metaweb.com (Stefano Mazzocchi)
Date: Fri, 14 Aug 2009 15:00:40 -0700
Subject: [Data-modeling] What to do with 'cancelled' movies?
In-Reply-To:
References: <1264917225.140501250276250011.JavaMail.root@zimbra01.corp.sjc1.metaweb.com> <4A85B718.5050903@metaweb.com> <08257311-A2B3-4AEA-9EC5-C6AFE29870A9@metaweb.com>
Message-ID: <4A85DE88.9040902@metaweb.com>

Gordon Mackenzie wrote:
> Dead Media commons?
>
> Dead Movies, Dead Music Releases, Dead Written Works

Hmmm, not sure about how general we should make this. I mean, is a 'dead movie' a movie that started production and then stopped? Or is it a movie that finished production and never got released? Or is it a movie that never went into production at all?

Music and written works (or buildings, or software) all seem to have different lifecycles, so I lean toward having a specialized type for each.

Ok, so something like "Cancelled Film" extends "Film" and adds properties such as:

  reason_for_cancellation: text
  cancellation_date: date

Thoughts?

> > ~ Gordon > > <<< gordon at metaweb.com >>> > > > > On Aug 14, 2009, at 12:19 PM, Reilly Hayes wrote: > >> I suggest a blanket type for unreleased and incomplete films. >> >> >> >> On Aug 14, 2009, at 12:12 PM, Stefano Mazzocchi wrote: >> >>> Jon Reitsma wrote: >>>> +1 on keeping the data too. I actually have been voting no on >>>> Gordon's flags (but losing I think). I think it might be useful >>>> in the popstra world too - projects on hold, dead, etc.
>>> Ok, so keeping it is. >>> >>> Now, how do we model this? >>> >>>> j >>>> ----- Original Message ----- >>>> From: "Alec Flett" >>>> To: "Freebase data modeling mailing list" >>> Sent: Friday, August 14, 2009 10:25:00 AM GMT -08:00 US/Canada >>>> Pacific >>>> Subject: Re: [Data-modeling] What to do with 'cancelled' movies? >>>> >>>> >>>> On Aug 13, 2009, at 6:10 PM, Gordon Mackenzie wrote: >>>> >>>>> I delete em. >>>>> >>>>> Unless we want to start a base of cancelled film projects and add >>>>> properties/types for capturing who was involved when with a some >>>>> sort >>>>> of scale as to how solid the rumored/actual involvement of the >>>>> principals with the doomed project. >>>> I actually think that's pretty interesting information, maybe we >>>> could >>>> do something like Deceased Person - cotype it as a "Canceled Film" >>>> with a cancellation date? >>>> >>>> Alec >>>> >>>>> We could do the same with people who were going to work on a >>>>> released >>>>> film. >>>>> >>>>> As for sketchy film projects that haven't definitely been >>>>> cancelled, >>>>> no release date is better than some of the ones that have release >>>>> dates that have already passed. >>>>> >>>>> ~ Gordon >>>>> >>>>> <<< gordon at metaweb.com >>> >>>>> >>>>> >>>>> >>>>> On Aug 13, 2009, at 5:48 PM, Stefano Mazzocchi wrote: >>>>> >>>>>> I've been trying to add 'released dates' to movies but I keep >>>>>> running >>>>>> into movies that have not been released yet but have been canceled >>>>>> and >>>>>> their article deleted from wikipedia: >>>>>> >>>>>> http://www.freebase.com/view/en/ajnabee_shehr_mein >>>>>> >>>>>> http://deletionpedia.dbatley.com/w/index.php?title=Ajnabee_Shehr_Mein_%28deleted_12_Mar_2008_at_02:00%29 >>>>>> >>>>>> Should we follow wikipedia and delete, or should we be more lax >>>>>> and >>>>>> mark >>>>>> accordingly? if so, what is the proper way to mark a movie as >>>>>> "cancelled"? 
-- Stefano Mazzocchi Application Catalyst Metaweb Technologies, Inc.
stefano at metaweb.com -------------------------------------------------------------------

From gordon at metaweb.com Fri Aug 14 22:12:35 2009
From: gordon at metaweb.com (Gordon Mackenzie)
Date: Fri, 14 Aug 2009 15:12:35 -0700
Subject: [Data-modeling] What to do with 'cancelled' movies?
In-Reply-To: <4A85DE88.9040902@metaweb.com>
References: <1264917225.140501250276250011.JavaMail.root@zimbra01.corp.sjc1.metaweb.com> <4A85B718.5050903@metaweb.com> <08257311-A2B3-4AEA-9EC5-C6AFE29870A9@metaweb.com> <4A85DE88.9040902@metaweb.com>
Message-ID:

I really am reluctant to populate /film/film with movies that got cancelled or were only proposed. I want to keep them at least one step removed, so that outside consumers of our film topics don't have to worry about topics that are sometimes based on trial balloons sent up by agents/film makers trying to generate a deal.

I'd want this to be under another domain, or maybe it could be a /film/cancelled_film type with some properties to link in people and a subject property (like the video game, book, or events that spawned this project). But not with an included type of /film/film.

~ Gordon

<<< gordon at metaweb.com >>>

On Aug 14, 2009, at 3:00 PM, Stefano Mazzocchi wrote: > Gordon Mackenzie wrote: >> Dead Media commons? >> >> Dead Movies, Dead Music Releases, Dead Written Works > > Hmmm, not sure about how general we should make this. I mean, a 'dead > movie' is a movie that started production and then stoppped? or is a > movie that finished production and never got released? or it's a movie > that never went into production at all? > > music and written works (or buildings, or software) all seem to have > different lifecycles, so I lean toward having a specialized type for > each. > > Ok, so something like "Cancelled Film" extends "Film" and adds > properties such as: > > reason_for_cancellation : text > cancellation_date: date > > Thoughts?
> >> >> ~ Gordon >> >> <<< gordon at metaweb.com >>> >> >> >> >> On Aug 14, 2009, at 12:19 PM, Reilly Hayes wrote: >> >>> I suggest a blanket type for unreleased and incomplete films. >>> >>> >>> >>> On Aug 14, 2009, at 12:12 PM, Stefano Mazzocchi wrote: >>> >>>> Jon Reitsma wrote: >>>>> +1 on keeping the data too. I actually have been voting no on >>>>> Gordon's flags (but losing I think). I think it might be useful >>>>> in the popstra world too - projects on hold, dead, etc. >>>> Ok, so keeping it is. >>>> >>>> Now, how do we model this? >>>> >>>>> j >>>>> ----- Original Message ----- >>>>> From: "Alec Flett" >>>>> To: "Freebase data modeling mailing list" >>>> Sent: Friday, August 14, 2009 10:25:00 AM GMT -08:00 US/Canada >>>>> Pacific >>>>> Subject: Re: [Data-modeling] What to do with 'cancelled' movies? >>>>> >>>>> >>>>> On Aug 13, 2009, at 6:10 PM, Gordon Mackenzie wrote: >>>>> >>>>>> I delete em. >>>>>> >>>>>> Unless we want to start a base of cancelled film projects and add >>>>>> properties/types for capturing who was involved when with a some >>>>>> sort >>>>>> of scale as to how solid the rumored/actual involvement of the >>>>>> principals with the doomed project. >>>>> I actually think that's pretty interesting information, maybe we >>>>> could >>>>> do something like Deceased Person - cotype it as a "Canceled Film" >>>>> with a cancellation date? >>>>> >>>>> Alec >>>>> >>>>>> We could do the same with people who were going to work on a >>>>>> released >>>>>> film. >>>>>> >>>>>> As for sketchy film projects that haven't definitely been >>>>>> cancelled, >>>>>> no release date is better than some of the ones that have release >>>>>> dates that have already passed. 
>>>>>> >>>>>> ~ Gordon >>>>>> >>>>>> <<< gordon at metaweb.com >>> >>>>>> >>>>>> >>>>>> >>>>>> On Aug 13, 2009, at 5:48 PM, Stefano Mazzocchi wrote: >>>>>> >>>>>>> I've been trying to add 'released dates' to movies but I keep >>>>>>> running >>>>>>> into movies that have not been released yet but have been >>>>>>> canceled >>>>>>> and >>>>>>> their article deleted from wikipedia: >>>>>>> >>>>>>> http://www.freebase.com/view/en/ajnabee_shehr_mein >>>>>>> >>>>>>> http://deletionpedia.dbatley.com/w/index.php?title=Ajnabee_Shehr_Mein_%28deleted_12_Mar_2008_at_02:00%29 >>>>>>> >>>>>>> Should we follow wikipedia and delete, or should we be more lax >>>>>>> and >>>>>>> mark >>>>>>> accordingly? if so, what is the proper way to mark a movie as >>>>>>> "cancelled"? >>>>>>> >>>>>>> -- >>>>>>> Stefano Mazzocchi Application >>>>>>> Catalyst >>>>>>> Metaweb Technologies, Inc. >>>>>>> stefano at metaweb.com >>>>>>> ------------------------------------------------------------------- >>>>>>> >>>>>>> _______________________________________________ >>>>>>> Data-modeling mailing list >>>>>>> Data-modeling at freebase.com >>>>>>> http://lists.freebase.com/mailman/listinfo/data-modeling >>>>>> _______________________________________________ >>>>>> Data-modeling mailing list >>>>>> Data-modeling at freebase.com >>>>>> http://lists.freebase.com/mailman/listinfo/data-modeling >>>>> _______________________________________________ >>>>> Data-modeling mailing list >>>>> Data-modeling at freebase.com >>>>> http://lists.freebase.com/mailman/listinfo/data-modeling >>>>> _______________________________________________ >>>>> Data-modeling mailing list >>>>> Data-modeling at freebase.com >>>>> http://lists.freebase.com/mailman/listinfo/data-modeling >>>> >>>> -- >>>> Stefano Mazzocchi Application Catalyst >>>> Metaweb Technologies, Inc. 
stefano at metaweb.com >>>> ------------------------------------------------------------------- >>>> >>>> _______________________________________________ >>>> Data-modeling mailing list >>>> Data-modeling at freebase.com >>>> http://lists.freebase.com/mailman/listinfo/data-modeling >>> _______________________________________________ >>> Data-modeling mailing list >>> Data-modeling at freebase.com >>> http://lists.freebase.com/mailman/listinfo/data-modeling >> >> _______________________________________________ >> Data-modeling mailing list >> Data-modeling at freebase.com >> http://lists.freebase.com/mailman/listinfo/data-modeling > > > -- > Stefano Mazzocchi Application Catalyst > Metaweb Technologies, Inc. stefano at metaweb.com > ------------------------------------------------------------------- > > _______________________________________________ > Data-modeling mailing list > Data-modeling at freebase.com > http://lists.freebase.com/mailman/listinfo/data-modeling

From bgoldenberg at gmail.com Fri Aug 14 22:13:25 2009
From: bgoldenberg at gmail.com (Benjamin Goldenberg)
Date: Fri, 14 Aug 2009 15:13:25 -0700
Subject: [Data-modeling] What to do with 'cancelled' movies?
In-Reply-To: <4A85DE88.9040902@metaweb.com>
References: <1264917225.140501250276250011.JavaMail.root@zimbra01.corp.sjc1.metaweb.com> <4A85B718.5050903@metaweb.com> <08257311-A2B3-4AEA-9EC5-C6AFE29870A9@metaweb.com> <4A85DE88.9040902@metaweb.com>
Message-ID:

I'm not sure it makes sense to include the film type. Unlike a deceased person, who was a living person at one point, a cancelled film was never a released film.

If I search for films starring a specific actor, I would expect to get back films I could actually go rent. Or, if I wanted to know how many films Steven Spielberg has directed, I don't think it's terribly obvious that I would have to exclude the cancelled films.

Ben

On Fri, Aug 14, 2009 at 3:00 PM, Stefano Mazzocchi wrote: > Gordon Mackenzie wrote: >> Dead Media commons?
>> >> Dead Movies, Dead Music Releases, Dead Written Works > > Hmmm, not sure about how general we should make this. I mean, a 'dead > movie' is a movie that started production and then stoppped? or is a > movie that finished production and never got released? or it's a movie > that never went into production at all? > > music and written works (or buildings, or software) all seem to have > different lifecycles, so I lean toward having a specialized type for each. > > Ok, so something like "Cancelled Film" extends "Film" and adds > properties such as: > > reason_for_cancellation : text > cancellation_date: date > > Thoughts? > >> >> ~ Gordon >> >> <<< gordon at metaweb.com >>> >> >> >> >> On Aug 14, 2009, at 12:19 PM, Reilly Hayes wrote: >> >>> I suggest a blanket type for unreleased and incomplete films. >>> >>> >>> >>> On Aug 14, 2009, at 12:12 PM, Stefano Mazzocchi wrote: >>> >>>> Jon Reitsma wrote: >>>>> +1 on keeping the data too. I actually have been voting no on >>>>> Gordon's flags (but losing I think). I think it might be useful >>>>> in the popstra world too - projects on hold, dead, etc. >>>> Ok, so keeping it is. >>>> >>>> Now, how do we model this? >>>> >>>>> j >>>>> ----- Original Message ----- >>>>> From: "Alec Flett" >>>>> To: "Freebase data modeling mailing list" >>>> Sent: Friday, August 14, 2009 10:25:00 AM GMT -08:00 US/Canada >>>>> Pacific >>>>> Subject: Re: [Data-modeling] What to do with 'cancelled' movies? >>>>> >>>>> >>>>> On Aug 13, 2009, at 6:10 PM, Gordon Mackenzie wrote: >>>>> >>>>>> I delete em. >>>>>> >>>>>> Unless we want to start a base of cancelled film projects and add >>>>>> properties/types for capturing who was involved when with a some >>>>>> sort >>>>>> of scale as to how solid the rumored/actual involvement of the >>>>>> principals with the doomed project.
>>>>> I actually think that's pretty interesting information, maybe we >>>>> could >>>>> do something like Deceased Person - cotype it as a "Canceled Film" >>>>> with a cancellation date? >>>>> >>>>> Alec >>>>> >>>>>> We could do the same with people who were going to work on a >>>>>> released >>>>>> film. >>>>>> >>>>>> As for sketchy film projects that haven't definitely been >>>>>> cancelled, >>>>>> no release date is better than some of the ones that have release >>>>>> dates that have already passed. >>>>>> >>>>>> ~ Gordon >>>>>> >>>>>> <<< gordon at metaweb.com >>> >>>>>> >>>>>> >>>>>> >>>>>> On Aug 13, 2009, at 5:48 PM, Stefano Mazzocchi wrote: >>>>>> >>>>>>> I've been trying to add 'released dates' to movies but I keep >>>>>>> running >>>>>>> into movies that have not been released yet but have been canceled >>>>>>> and >>>>>>> their article deleted from wikipedia: >>>>>>> >>>>>>> http://www.freebase.com/view/en/ajnabee_shehr_mein >>>>>>> >>>>>>> http://deletionpedia.dbatley.com/w/index.php?title=Ajnabee_Shehr_Mein_%28deleted_12_Mar_2008_at_02:00%29 >>>>>>> >>>>>>> Should we follow wikipedia and delete, or should we be more lax >>>>>>> and >>>>>>> mark >>>>>>> accordingly? if so, what is the proper way to mark a movie as >>>>>>> "cancelled"? >>>>>>> >>>>>>> -- >>>>>>> Stefano Mazzocchi Application >>>>>>> Catalyst >>>>>>> Metaweb Technologies, Inc.
stefano at metaweb.com > ------------------------------------------------------------------- > > _______________________________________________ > Data-modeling mailing list > Data-modeling at freebase.com > http://lists.freebase.com/mailman/listinfo/data-modeling >

From spatial.db at gmail.com Fri Aug 14 22:25:35 2009
From: spatial.db at gmail.com (Ed Laurent)
Date: Fri, 14 Aug 2009 18:25:35 -0400
Subject: [Data-modeling] What to do with 'cancelled' movies?
In-Reply-To:
References: <1264917225.140501250276250011.JavaMail.root@zimbra01.corp.sjc1.metaweb.com> <4A85B718.5050903@metaweb.com> <08257311-A2B3-4AEA-9EC5-C6AFE29870A9@metaweb.com> <4A85DE88.9040902@metaweb.com>
Message-ID:

lukeschubert and I have been having a similar discussion off list about Micronations. I think we're going to take a similar approach, where we create a Micronation base and model a schema similar to the Location schema, but one that doesn't include most of the Location types or delegate their properties.

-Ed

On Fri, Aug 14, 2009 at 6:13 PM, Benjamin Goldenberg wrote: > I'm not sure it makes sense to include the film type. Unlike a > deceased person, who was a living person at one point, a cancelled > film was never a released film. > > If I search for films staring a specific actor, I would expect to get > back films I could actually go rent. Or, if I wanted to know how many > films Steven Spielberg has directed, I don't think it's terribly > obvious that I would have to exclude the cancelled films. > > Ben > > On Fri, Aug 14, 2009 at 3:00 PM, Stefano Mazzocchi > wrote: > > Gordon Mackenzie wrote: > >> Dead Media commons? > >> > >> Dead Movies, Dead Music Releases, Dead Written Works > > > > Hmmm, not sure about how general we should make this. I mean, a 'dead > > movie' is a movie that started production and then stoppped? or is a > > movie that finished production and never got released? or it's a movie > > that never went into production at all?
> > > > music and written works (or buildings, or software) all seem to have > > different lifecycles, so I lean toward having a specialized type for > each. > > > > Ok, so something like "Cancelled Film" extends "Film" and adds > > properties such as: > > > > reason_for_cancellation : text > > cancellation_date: date > > > > Thoughts? > > > >> > >> ~ Gordon > >> > >> <<< gordon at metaweb.com >>> > >> > >> > >> > >> On Aug 14, 2009, at 12:19 PM, Reilly Hayes wrote: > >> > >>> I suggest a blanket type for unreleased and incomplete films. > >>> > >>> > >>> > >>> On Aug 14, 2009, at 12:12 PM, Stefano Mazzocchi wrote: > >>> > >>>> Jon Reitsma wrote: > >>>>> +1 on keeping the data too. I actually have been voting no on > >>>>> Gordon's flags (but losing I think). I think it might be useful > >>>>> in the popstra world too - projects on hold, dead, etc. > >>>> Ok, so keeping it is. > >>>> > >>>> Now, how do we model this? > >>>> > >>>>> j > >>>>> ----- Original Message ----- > >>>>> From: "Alec Flett" > >>>>> To: "Freebase data modeling mailing list" < > data-modeling at freebase.com > >>>>> Sent: Friday, August 14, 2009 10:25:00 AM GMT -08:00 US/Canada > >>>>> Pacific > >>>>> Subject: Re: [Data-modeling] What to do with 'cancelled' movies? > >>>>> > >>>>> > >>>>> On Aug 13, 2009, at 6:10 PM, Gordon Mackenzie wrote: > >>>>> > >>>>>> I delete em. > >>>>>> > >>>>>> Unless we want to start a base of cancelled film projects and add > >>>>>> properties/types for capturing who was involved when with a some > >>>>>> sort > >>>>>> of scale as to how solid the rumored/actual involvement of the > >>>>>> principals with the doomed project. > >>>>> I actually think that's pretty interesting information, maybe we > >>>>> could > >>>>> do something like Deceased Person - cotype it as a "Canceled Film" > >>>>> with a cancellation date? > >>>>> > >>>>> Alec > >>>>> > >>>>>> We could do the same with people who were going to work on a > >>>>>> released > >>>>>> film. 
> >>>>>> > >>>>>> As for sketchy film projects that haven't definitely been > >>>>>> cancelled, > >>>>>> no release date is better than some of the ones that have release > >>>>>> dates that have already passed. > >>>>>> > >>>>>> ~ Gordon > >>>>>> > >>>>>> <<< gordon at metaweb.com >>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> On Aug 13, 2009, at 5:48 PM, Stefano Mazzocchi wrote: > >>>>>> > >>>>>>> I've been trying to add 'released dates' to movies but I keep > >>>>>>> running > >>>>>>> into movies that have not been released yet but have been canceled > >>>>>>> and > >>>>>>> their article deleted from wikipedia: > >>>>>>> > >>>>>>> http://www.freebase.com/view/en/ajnabee_shehr_mein > >>>>>>> > >>>>>>> > http://deletionpedia.dbatley.com/w/index.php?title=Ajnabee_Shehr_Mein_%28deleted_12_Mar_2008_at_02:00%29 > >>>>>>> > >>>>>>> Should we follow wikipedia and delete, or should we be more lax > >>>>>>> and > >>>>>>> mark > >>>>>>> accordingly? if so, what is the proper way to mark a movie as > >>>>>>> "cancelled"? > >>>>>>> > >>>>>>> -- > >>>>>>> Stefano Mazzocchi Application > >>>>>>> Catalyst > >>>>>>> Metaweb Technologies, Inc. 
stefano at metaweb.com > > ------------------------------------------------------------------- > > > > _______________________________________________ > > Data-modeling mailing list > > Data-modeling at freebase.com > > http://lists.freebase.com/mailman/listinfo/data-modeling > > > _______________________________________________ > Data-modeling mailing list > Data-modeling at freebase.com > http://lists.freebase.com/mailman/listinfo/data-modeling >

From faye at metaweb.com Fri Aug 14 22:55:09 2009
From: faye at metaweb.com (Faye Harris)
Date: Fri, 14 Aug 2009 15:55:09 -0700
Subject: [Data-modeling] What to do with 'cancelled' movies?
In-Reply-To:
References: <1264917225.140501250276250011.JavaMail.root@zimbra01.corp.sjc1.metaweb.com> <4A85B718.5050903@metaweb.com> <08257311-A2B3-4AEA-9EC5-C6AFE29870A9@metaweb.com> <4A85DE88.9040902@metaweb.com>
Message-ID: <4A85EB4D.1050006@metaweb.com>

How about we use the "Unfinished Work" type, /media_common/unfinished_work? Jeff and I created it when I wanted to model F. Scott Fitzgerald's "The Love of the Last Tycoon".

A project may be unfinished for a number of reasons. We can add a property to supply the reason that the work is unfinished if we want. Then just use cotypes to identify the medium involved and fill out properties. For example, cotype "Film" for a cancelled film, then use the properties of the "Film" type to populate the cancelled film's director, producers, etc.; cotype "Book" for an unfinished book, and use the properties of the "Book" type to fill out the author.

-- Faye

Ed Laurent wrote: > lukeschubert and I have been having a similar discussion off list > about Micronations > .
I think we're > going to take a similar approach where we create a Micronation base > and model schema that are similar to those of Location schema but > don't include most of the Location types or delegate their properties. > > -Ed > > On Fri, Aug 14, 2009 at 6:13 PM, Benjamin Goldenberg > > wrote: > > I'm not sure it makes sense to include the film type. Unlike a > deceased person, who was a living person at one point, a cancelled > film was never a released film. > > If I search for films staring a specific actor, I would expect to get > back films I could actually go rent. Or, if I wanted to know how many > films Steven Spielberg has directed, I don't think it's terribly > obvious that I would have to exclude the cancelled films. > > Ben > > On Fri, Aug 14, 2009 at 3:00 PM, Stefano > Mazzocchi> wrote: > > Gordon Mackenzie wrote: > >> Dead Media commons? > >> > >> Dead Movies, Dead Music Releases, Dead Written Works > > > > Hmmm, not sure about how general we should make this. I mean, a > 'dead > > movie' is a movie that started production and then stoppped? or is a > > movie that finished production and never got released? or it's a > movie > > that never went into production at all? > > > > music and written works (or buildings, or software) all seem to have > > different lifecycles, so I lean toward having a specialized type > for each. > > > > Ok, so something like "Cancelled Film" extends "Film" and adds > > properties such as: > > > > reason_for_cancellation : text > > cancellation_date: date > > > > Thoughts? > > > >> > >> ~ Gordon > >> > >> <<< gordon at metaweb.com >>> > >> > >> > >> > >> On Aug 14, 2009, at 12:19 PM, Reilly Hayes wrote: > >> > >>> I suggest a blanket type for unreleased and incomplete films. > >>> > >>> > >>> > >>> On Aug 14, 2009, at 12:12 PM, Stefano Mazzocchi wrote: > >>> > >>>> Jon Reitsma wrote: > >>>>> +1 on keeping the data too. I actually have been voting no on > >>>>> Gordon's flags (but losing I think). 
I think it might be useful > >>>>> in the popstra world too - projects on hold, dead, etc. > >>>> Ok, so keeping it is. > >>>> > >>>> Now, how do we model this? > >>>> > >>>>> j > >>>>> ----- Original Message ----- > >>>>> From: "Alec Flett" > > >>>>> To: "Freebase data modeling mailing list" > > >>>>> Sent: Friday, August 14, 2009 10:25:00 AM GMT -08:00 US/Canada > >>>>> Pacific > >>>>> Subject: Re: [Data-modeling] What to do with 'cancelled' movies? > >>>>> > >>>>> > >>>>> On Aug 13, 2009, at 6:10 PM, Gordon Mackenzie wrote: > >>>>> > >>>>>> I delete em. > >>>>>> > >>>>>> Unless we want to start a base of cancelled film projects > and add > >>>>>> properties/types for capturing who was involved when with a > some > >>>>>> sort > >>>>>> of scale as to how solid the rumored/actual involvement of the > >>>>>> principals with the doomed project. > >>>>> I actually think that's pretty interesting information, maybe we > >>>>> could > >>>>> do something like Deceased Person - cotype it as a "Canceled > Film" > >>>>> with a cancellation date? > >>>>> > >>>>> Alec > >>>>> > >>>>>> We could do the same with people who were going to work on a > >>>>>> released > >>>>>> film. > >>>>>> > >>>>>> As for sketchy film projects that haven't definitely been > >>>>>> cancelled, > >>>>>> no release date is better than some of the ones that have > release > >>>>>> dates that have already passed. 
> >>>>>> > >>>>>> ~ Gordon > >>>>>> > >>>>>> <<< gordon at metaweb.com >>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> On Aug 13, 2009, at 5:48 PM, Stefano Mazzocchi wrote: > >>>>>> > >>>>>>> I've been trying to add 'released dates' to movies but I keep > >>>>>>> running > >>>>>>> into movies that have not been released yet but have been > canceled > >>>>>>> and > >>>>>>> their article deleted from wikipedia: > >>>>>>> > >>>>>>> http://www.freebase.com/view/en/ajnabee_shehr_mein > >>>>>>> > >>>>>>> > http://deletionpedia.dbatley.com/w/index.php?title=Ajnabee_Shehr_Mein_%28deleted_12_Mar_2008_at_02:00%29 > >>>>>>> > >>>>>>> Should we follow wikipedia and delete, or should we be > more lax > >>>>>>> and > >>>>>>> mark > >>>>>>> accordingly? if so, what is the proper way to mark a movie as > >>>>>>> "cancelled"? > >>>>>>> > >>>>>>> -- > >>>>>>> Stefano Mazzocchi Application > >>>>>>> Catalyst > >>>>>>> Metaweb Technologies, Inc. > >>>>>>> stefano at metaweb.com > >>>>>>> > ------------------------------------------------------------------- > >>>>>>> > >>>>>>> _______________________________________________ > >>>>>>> Data-modeling mailing list > >>>>>>> Data-modeling at freebase.com > >>>>>>> http://lists.freebase.com/mailman/listinfo/data-modeling > >>>>>> _______________________________________________ > >>>>>> Data-modeling mailing list > >>>>>> Data-modeling at freebase.com > >>>>>> http://lists.freebase.com/mailman/listinfo/data-modeling > >>>>> _______________________________________________ > >>>>> Data-modeling mailing list > >>>>> Data-modeling at freebase.com > >>>>> http://lists.freebase.com/mailman/listinfo/data-modeling > >>>>> _______________________________________________ > >>>>> Data-modeling mailing list > >>>>> Data-modeling at freebase.com > >>>>> http://lists.freebase.com/mailman/listinfo/data-modeling > >>>> > >>>> -- > >>>> Stefano Mazzocchi Application > Catalyst > >>>> Metaweb Technologies, Inc. 
> stefano at metaweb.com > >>>> > ------------------------------------------------------------------- > >>>> > >>>> _______________________________________________ > >>>> Data-modeling mailing list > >>>> Data-modeling at freebase.com > >>>> http://lists.freebase.com/mailman/listinfo/data-modeling > >>> _______________________________________________ > >>> Data-modeling mailing list > >>> Data-modeling at freebase.com > >>> http://lists.freebase.com/mailman/listinfo/data-modeling > >> > >> _______________________________________________ > >> Data-modeling mailing list > >> Data-modeling at freebase.com > >> http://lists.freebase.com/mailman/listinfo/data-modeling > > > > > > -- > > Stefano Mazzocchi Application Catalyst > > Metaweb Technologies, Inc. > stefano at metaweb.com > > ------------------------------------------------------------------- > > > > _______________________________________________ > > Data-modeling mailing list > > Data-modeling at freebase.com > > http://lists.freebase.com/mailman/listinfo/data-modeling > > > _______________________________________________ > Data-modeling mailing list > Data-modeling at freebase.com > http://lists.freebase.com/mailman/listinfo/data-modeling > > > ------------------------------------------------------------------------ > > _______________________________________________ > Data-modeling mailing list > Data-modeling at freebase.com > http://lists.freebase.com/mailman/listinfo/data-modeling > From rfh at metaweb.com Fri Aug 14 23:11:12 2009 From: rfh at metaweb.com (Reilly Hayes) Date: Fri, 14 Aug 2009 16:11:12 -0700 Subject: [Data-modeling] What to do with 'cancelled' movies? 
In-Reply-To: <4A85EB4D.1050006@metaweb.com> References: <1264917225.140501250276250011.JavaMail.root@zimbra01.corp.sjc1.metaweb.com> <4A85B718.5050903@metaweb.com> <08257311-A2B3-4AEA-9EC5-C6AFE29870A9@metaweb.com> <4A85DE88.9040902@metaweb.com> <4A85EB4D.1050006@metaweb.com> Message-ID: There is also the finished but not released issue to deal with here. Reasons why these should remain typed as Films 1) They may be finished/released at a later date. 2) They may be used as the basis for a derivative work 3) The relations between actors and directors is still pertinent (aren't the 6 degrees of kevin bacon the reason we built freebase?) -r On Aug 14, 2009, at 3:55 PM, Faye Harris wrote: > How about we use the "Unfinished Work" type: > /media_common/unfinished_work? Jeff and I created it when I wanted to > model F. Scott Fitzgerald's "The Love of the Last Tycoon". A project > may > be unfinished for a number of reasons. We can add a property to supply > the reason that the work is unfinished if we want. > > Then just use cotypes to identify the medium involved and fill out > properties. I.e. cotype "Film" for a canceled film, then use the > properties of the "Film" type to populate the cancelled film's > director, > producers, etc., and cotype "Book" for a unfinished book, and use the > properties of "Book" type to fill out the author. > > -- Faye
-------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 2434 bytes Desc: not available Url : http://lists.freebase.com/pipermail/data-modeling/attachments/20090814/afacf299/attachment.bin From spencerkelly86 at gmail.com Mon Aug 17 01:55:40 2009 From: spencerkelly86 at gmail.com (Spencer Kelly) Date: Sun, 16 Aug 2009 21:55:40 -0400 Subject: [Data-modeling] Suggestion for addressing potentially offensive content In-Reply-To: <4A7DF010.3070408@gmail.com> References: <4A7DF010.3070408@gmail.com> Message-ID: ya, it would be better flagging just one offensive piece of data, not the whole topic. the offensive warning should just pop-up if this data is being immediately displayed. i'm thinking two things: my particularly vulgar draft type shouldn't bother the browse page, because it wasn't going to be displayed anyways. if a band is called the f***yous, its only their name thats offensive. so if a view doesn't include this property it wouldn't be offensive. 
(lets not let one type taint all the rest)

+1
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.freebase.com/pipermail/data-modeling/attachments/20090816/3bb8031b/attachment.htm
From spencerkelly86 at gmail.com Mon Aug 17 17:39:54 2009
From: spencerkelly86 at gmail.com (Spencer Kelly)
Date: Mon, 17 Aug 2009 13:39:54 -0400
Subject: [Data-modeling] English Words
In-Reply-To: <4A84C43B.3040303@metaweb.com>
References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com> <4A84C43B.3040303@metaweb.com>
Message-ID:

+1 /common/symbol! fabulous idea.
i understand /common/symbol as aliases with metadata - and they connect in some way to /common/topic for their meaning.
I'm excited about adding pronunciation data, and word derivation data. so anything that allows this (without whacking out the graph) i'll be behind.
dbpedia does #1 i think.
hooray!

From iainsproat at gmail.com Mon Aug 17 18:16:16 2009
From: iainsproat at gmail.com (Iain Sproat)
Date: Mon, 17 Aug 2009 22:16:16 +0400
Subject: [Data-modeling] English Words
In-Reply-To: <4A84C43B.3040303@metaweb.com>
References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com> <4A84C43B.3040303@metaweb.com>
Message-ID:

Are we agreed that a freebase topic is a symset (and vice versa)?

If so, the debate seems to be solely based on how best to model the
relationship between a symbol (aka word) and the symset. Either way
it wouldn't prevent us loading the wordnet symset data to freebase
(particularly verbs and adjectives) and reconciling the nouns with
existing freebase topics. I think that in itself would be incredibly
beneficial.

On Fri, Aug 14, 2009 at 5:56 AM, Scott Meyer wrote:
> 1. Reconcile topic to Synset, use value of name to locate symbol/word.
>
> This allows direct navigation to the Synset which was implied by this
> sort of usage:
>
>> {
>>    ...
>>    "symbol" : {
>>       "lang" : "/lang/en",
>>       "hyponym" : [{
>>          "topic" : [{
>>             "name" : null,
>>             "id" : null,
>>          }]
>>       }],
>>    }
>>    ...
>> }

I can't quite work out the MQL (hyponym seems to expect a CVT with a
property /topic?), and I'm not sure it compares exactly with the
description for #1. Should it instead, perhaps, be the following?
(direct link between symbol and topic):
> {
>    "type" : "/common/symbol",
>    "/common/symbol/lang" : "/lang/en",
>    "/common/symbol/hyponym" : [{
>       "type" : "/common/topic",
>       "name" : null,
>       "id" : null
>    }],
>    .........
> }

And using the same format to describe #4 (here hyponym expects a CVT
/common/hyponym)
> {
>    "type" : "/common/symbol",
>    "/common/symbol/hyponym" : [{
>       "topic" : {
>          "name" : null,
>          "id" : null
>       },
>       "lang" : "/lang/en",
>       "type" : "/common/hyponym",
>       ......
>    }]
> }

#1 would seem to be less primitive hungry than #4, and is easier to
query. I'm still a fan of #4 as it would normalize /common/symbol
objects, but I could live with either.

Iain

From arthur.van.hoff at gmail.com Mon Aug 17 19:32:46 2009
From: arthur.van.hoff at gmail.com (Arthur van Hoff)
Date: Mon, 17 Aug 2009 12:32:46 -0700
Subject: [Data-modeling] English Words
In-Reply-To:
References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com> <4A84C43B.3040303@metaweb.com>
Message-ID:

On Mon, Aug 17, 2009 at 11:16 AM, Iain Sproat wrote:
> Are we agreed that a freebase topic is a symset (and vice versa)?
>

I do not think that is the right way to go. The noun "Rat", the animal
"Rat", and the verb "rat" should all be distinct topics in my opinion.

--
Arthur van Hoff
arthur.van.hoff at gmail.com
650-283-0842
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://lists.freebase.com/pipermail/data-modeling/attachments/20090817/efbb82f1/attachment-0001.htm From spatial.db at gmail.com Mon Aug 17 20:51:10 2009 From: spatial.db at gmail.com (Ed Laurent) Date: Mon, 17 Aug 2009 16:51:10 -0400 Subject: [Data-modeling] English Words In-Reply-To: References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com> <4A84C43B.3040303@metaweb.com> Message-ID: This English word discussion could be really useful for linking hypernyms, hyponyms, and antonyms. What seems to be missing is the relationship among homonyms (e.g., rat, rat, rat), although maybe I just don't understand the jargon. I suggest adding a homonym type that includes "Synonym Set" and has a "Homonym(s)" property that links to a "Homonym relationship" CVT. If this proposal is accepted, some thought is needed about how the pronunciation properties should be linked because the same pronunciation (and possibly word source) data could be linked to multiple topics. -Ed On Mon, Aug 17, 2009 at 3:32 PM, Arthur van Hoff wrote: > On Mon, Aug 17, 2009 at 11:16 AM, Iain Sproat wrote: > >> Are we agreed that a freebase topic is a symset (and vice versa)? >> > > I do not think that is the right way to go. The noun "Rat", the animal > "Rat", and the verb "rat" should all be distinct topics in my opinion. > > -- > Arthur van Hoff > arthur.van.hoff at gmail.com > 650-283-0842 > > _______________________________________________ > Data-modeling mailing list > Data-modeling at freebase.com > http://lists.freebase.com/mailman/listinfo/data-modeling > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.freebase.com/pipermail/data-modeling/attachments/20090817/424a0658/attachment.htm From paul at ontology2.com Mon Aug 17 21:10:50 2009 From: paul at ontology2.com (Paul Houle) Date: Mon, 17 Aug 2009 17:10:50 -0400 Subject: [Data-modeling] English Words In-Reply-To: References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com> <4A84C43B.3040303@metaweb.com> Message-ID: <4A89C75A.5070109@ontology2.com> Iain Sproat wrote: > Are we agreed that a freebase topic is a symset (and vice versa)? > I can imagine that a particular type might tag something that's a symset, but there are plenty of Fb types that represent a conflation of several related things. For instance, http://www.freebase.com/view/en/the_new_york_times represents both a "creative work" (something that would exist in a bibliographic database and that a library might have holdings of) and is also an "organization", as a business unit of : http://www.freebase.com/view/en/the_new_york_times_company In ordinary use, people conflate things like this, but a detailed model might separate them. From sm at metaweb.com Mon Aug 17 22:07:39 2009 From: sm at metaweb.com (Scott Meyer) Date: Mon, 17 Aug 2009 15:07:39 -0700 Subject: [Data-modeling] English Words In-Reply-To: References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com> <4A84C43B.3040303@metaweb.com> Message-ID: <4A89D4AB.2050403@metaweb.com> Iain Sproat wrote: > Are we agreed that a freebase topic is a symset (and vice versa)? I don't think so. I'm suggesting that WN words and synsets be cleanly separated from topics. Topics are "things" and words and synsets are language-dependent symbolic representations of things. Is there any reason to believe that, say, Chinese synsets should correspond to English synsets? If we had half a dozen wordnets, we could answer that question. As we don't, I think that it behooves us not to constrain foreign language wordnets. 
So concretely two fundamental types:

   /common/symbol - corresponding to Wordnet Words
   /common/synset - corresponding to Wordnet Synsets

With properties lifted directly from Wordnet 3.0, pretty much the
structure you've proposed without the reconciliation of topics.
The linkage between topic and synset would be a synset property:

   /common/synset/represents - expected type of /common/topic

with the meaning: "some or all of the words in this synset may
be used to represent this topic" The preferred symbol is the
one with the same spelling as the topic's name. I'd call the
reverse property:

   /common/topic/symbols - expected type of /common/synset

Symbols and synsets would be monolingual either by convention
with language being given by a property:

   /common/symbol/language
   /common/synset/language

or implicitly with a language specific type name. For example:

   /common/symbol_en
   /common/synset_en

I favor the former.

> If so, the debate seems to be solely based on how best to model the
> relationship between a symbol (aka word) and the symset. Either way
> it wouldn't prevent us loading the wordnet symset data to freebase
> (particularly verbs and adjectives) and reconciling the nouns with
> existing freebase topics. I think that in itself would be incredibly
> beneficial.

I agree completely about the benefits of loading wordnet. My contention
is twofold:

1. We can load Wordnet rapidly only if we don't reconcile Wordnet identities
   to Freebase topics. Out of 380,000 entries in /usr/share/dict/words,
   there are about 90,000 which match exactly the name or alias of a topic,
   given that all we have to go on is the spelling of words, synsets, and
   the text of definitions, this looks like a substantial reconciliation
   problem, one that has, in the past prevented us from loading Wordnet.

2. There are structural reasons why a separate meta-level for language
   constructs is a good idea: basically, symbols are different from the things
   they represent.
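
For readers who find the shape easier to see outside MQL, here is a minimal sketch of the proposed structure in plain Python. It is only an illustration under stated assumptions: the class names `Topic` and `Synset`, the helpers `link` and `hyponym_topics`, and the rodent/rat sample data are all hypothetical and are not part of any Freebase or Wordnet API. The field names mirror the properties proposed above (`/common/topic/symbols`, `/common/synset/represents`, `/common/synset/language`), and the traversal follows topic -> synset -> hyponym synset -> topic.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # compare by identity; the two classes reference each other
class Topic:
    """Hypothetical stand-in for a Freebase topic (a language-independent thing)."""
    id: str
    name: str
    symbols: list = field(default_factory=list)  # mirrors /common/topic/symbols


@dataclass(eq=False)
class Synset:
    """Hypothetical stand-in for a monolingual Wordnet synset."""
    words: list
    language: str = "/lang/en"                      # mirrors /common/synset/language
    hyponyms: list = field(default_factory=list)    # more specific synsets
    represents: list = field(default_factory=list)  # mirrors /common/synset/represents


def link(synset, topic):
    """Record both directions of the synset <-> topic link."""
    synset.represents.append(topic)
    topic.symbols.append(synset)


def hyponym_topics(topic, language="/lang/en"):
    """Walk topic -> synsets -> hyponym synsets -> the topics those represent."""
    found = []
    for synset in topic.symbols:
        if synset.language != language:
            continue  # synsets are monolingual, so filter by language first
        for hypo in synset.hyponyms:
            for other in hypo.represents:
                if other not in found:
                    found.append(other)
    return found


# Illustrative data only; the ids and word lists are made up for the example.
rodent = Topic("/en/rodent", "Rodent")
rat = Topic("/en/rat", "Rat")
rodent_syn = Synset(["rodent", "gnawer"])
rat_syn = Synset(["rat"])
rodent_syn.hyponyms.append(rat_syn)
link(rodent_syn, rodent)
link(rat_syn, rat)

print([t.name for t in hyponym_topics(rodent)])
```

The point of the sketch is that topics never hold words directly: every hop between two topics goes through the language-level synsets, which is what keeps the symbolic layer separate from the things it represents.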
> On Fri, Aug 14, 2009 at 5:56 AM, Scott Meyer wrote:
>> 1. Reconcile topic to Synset, use value of name to locate symbol/word.
>>
>> This allows direct navigation to the Synset which was implied by this
>> sort of usage:
>>
>>> {
>>>    ...
>>>    "symbol" : {
>>>       "lang" : "/lang/en",
>>>       "hyponym" : [{
>>>          "topic" : [{
>>>             "name" : null,
>>>             "id" : null,
>>>          }]
>>>       }],
>>>    }
>>>    ...
>>> }
>
> I can't quite work out the MQL (hyponym seems to expect a CVT with a
> property /topic?), and I'm not sure it compares exactly with the
> description for #1. Should it instead, perhaps, be the following?
> (direct link between symbol and topic):
>> {.
>>    "type" : "/common/symbol"
>>    "/common/symbol/lang" : "/lang/en",
>>    "/common/symbol/hyponym" : [{
>>       "type" : "/common/topic",
>>       "name" : null,
>>       "id" : null,
>>    }],
>>    .........
>> }

This is reconciling synset and topic. In my example above, the property
"hyponym" was a property of synset and "topic" was the property linking a
synset and a topic. Here's that example in the vocabulary I proposed above:

{                                          # we start with some topic
   ...
   "/common/topic/symbols" : [{            # find synsets used to represent it
      "language" : "/lang/en",
      "hyponym" : [{                       # find hyponyms of those
         "/common/synset/represents" : [{  # find topics which those represent
            "name" : null,
            "id" : null
         }]
      }]
   }]
}

Does that make more sense?

-Scott

From iainsproat at gmail.com Tue Aug 18 08:39:55 2009
From: iainsproat at gmail.com (Iain Sproat)
Date: Tue, 18 Aug 2009 12:39:55 +0400
Subject: [Data-modeling] English Words
In-Reply-To: <4A89D4AB.2050403@metaweb.com>
References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com> <4A84C43B.3040303@metaweb.com> <4A89D4AB.2050403@metaweb.com>
Message-ID:

On Tue, Aug 18, 2009 at 12:51 AM, Ed Laurent wrote:
>What seems to be missing is the relationship among homonyms (e.g., rat, rat, rat), although maybe I just don't understand the
>jargon.
Ed, the relationship between homonyms, rat, rat & rat, is that they
will be the same /common/symbol - I think there would actually be a
different /common/symbol for each language; a denormalization of the
data, but no less correct.

On Tue, Aug 18, 2009 at 1:10 AM, Paul Houle wrote:
> there are plenty of Fb types that represent a conflation of
> several related things

In theory these should be separate topics, although that's not always practical.

On Tue, Aug 18, 2009 at 2:07 AM, Scott Meyer wrote:
> Is there any reason to believe that, say, Chinese synsets should
> correspond to English synsets?

The definition of a synonym set, according to WordNet, is "a set of
words that are interchangeable in some context without changing the
truth value of the proposition in which they are embedded." Does the
language alter the context of a proposition? Probably so, and you'd be
correct in saying an English synset would be different from a Chinese
synset.

> So concretely two fundamental types:
>
>    /common/symbol - corresponding to Wordnet Words
>    /common/synset - corresponding to Wordnet Synsets
>
> With properties lifted directly from Wordnet 3.0, pretty much the
> structure you've proposed without the reconciliation of topics.
> The linkage between topic and synset would be a synset property:
>
>    /common/synset/represents - expected type of /common/topic
>
> with the meaning: "some or all of the words in this synset may
> be used to represent this topic" The preferred symbol is the
> one with the same spelling as the topic's name. I'd call the
> reverse property:
>
>    /common/topic/symbol - expected type of /common/synset
>
> Symbols and synsets would be monolingual either by convention
> with language being given by a property:
>
>    /common/symbol/lang
>    /common/synset/lang

+1. Looks good.
(note that I changed the key of property language to lang, and removed
the plural on /common/topic/symbol to match convention)

Let's load WordNet!
Iain From paul at ontology2.com Tue Aug 18 14:01:21 2009 From: paul at ontology2.com (Paul Houle) Date: Tue, 18 Aug 2009 10:01:21 -0400 Subject: [Data-modeling] English Words In-Reply-To: References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com> <4A84C43B.3040303@metaweb.com> Message-ID: <4A8AB431.1080507@ontology2.com> Iain Sproat wrote: > Are we agreed that a freebase topic is a symset (and vice versa)? > Here's a better example of a topic that has two meanings, http://www.freebase.com/view/en/oxygen or http://en.wikipedia.org/wiki/Oxygen Wikipedia goes right out and says it..."This article is about the chemical element and its most stable form, O_2 or dioxygen. For other forms of this element, see Allotropes of Oxygen." This is annoying because you can't make entirely truthful statements about "Oxygen" if you conflate the element and the diatomic gas. For instance, most of the mass of the ocean (water) is the ElementOxygen. If a system also understood that people breathe "Oxygen" it could come to the wrong conclusion that people can breathe in the ocean. Note that freebase treats oxygen as a "Chemical Element", but also a "Medical Treatment"; the Medical Treatment is the use of the diatomic gas, which doesn't appear to be otherwise documented in Freebase. The text in wikipedia does a good job at explaining the taxonomy of substances: reading it, it makes clear distinctions between elements, compounds, allotropes, etc. So far, generic databases have done a poor job of taxonomizing "stuff". It seems to me that the problem is tractable, but people have stopped short of the work it takes to do it: an introductory chemistry textbook does a good job of explaining it that bypasses the "representational thorns" that Cyc and other efforts have gotten caught up on. 
From tfmorris at gmail.com Tue Aug 18 14:36:07 2009
From: tfmorris at gmail.com (Tom Morris)
Date: Tue, 18 Aug 2009 10:36:07 -0400
Subject: [Data-modeling] English Words
In-Reply-To: <4A89D4AB.2050403@metaweb.com>
References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com> <4A84C43B.3040303@metaweb.com> <4A89D4AB.2050403@metaweb.com>
Message-ID:

Thanks for doing a comprehensive writeup.

On Mon, Aug 17, 2009 at 6:07 PM, Scott Meyer wrote:
> Iain Sproat wrote:
>> Are we agreed that a freebase topic is a symset (and vice versa)?
>
> I don't think so. I'm suggesting that WN words and synsets be
> cleanly separated from topics. Topics are "things" and words and
> synsets are language-dependent symbolic representations of things.
> Is there any reason to believe that, say, Chinese synsets should
> correspond to English synsets? If we had half a dozen wordnets,
> we could answer that question. As we don't, I think that it
> behooves us not to constrain foreign language wordnets.

If you did have the other language data, I'm convinced it would show
that concepts/synsets actually *are* language/culture-dependent.

> Does that make more sense?

Yes, I much prefer keeping the words/symbols and their
meanings/concepts separate and linking them with appropriately typed
links.

Tom

From tfmorris at gmail.com Tue Aug 18 14:54:08 2009
From: tfmorris at gmail.com (Tom Morris)
Date: Tue, 18 Aug 2009 10:54:08 -0400
Subject: [Data-modeling] [Dbpedia-discussion] English Words
In-Reply-To: <4A8AB431.1080507@ontology2.com>
References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com> <4A84C43B.3040303@metaweb.com> <4A8AB431.1080507@ontology2.com>
Message-ID:

[I left dbpedia-discuss on the distribution, but I'm not sure why they
got tacked on at the very end of the conversation. They'll probably
need to go check the data-modeling archives to get caught up on what
the conversation was about.]
On Tue, Aug 18, 2009 at 10:01 AM, Paul Houle wrote:
> Iain Sproat wrote:
>> Are we agreed that a freebase topic is a symset (and vice versa)?
>>
>    Here's a better example of a topic that has two meanings,
>
> http://www.freebase.com/view/en/oxygen
>
> or
>
> http://en.wikipedia.org/wiki/Oxygen
>
>    Wikipedia goes right out and says it..."This article is about the
> chemical element and its most stable form, O_2 or dioxygen. For other
> forms of this element, see Allotropes of Oxygen."

I think Iain was referring to the defined intent of a Freebase topic
and a synset. The example that you've identified represents a bug
where the difference between the definition of a Wikipedia article
(whatever the editors want it to be) and a Freebase topic (a single
concept) hasn't yet been cleaned up.

>    This is annoying because you can't make entirely truthful statements
> about "Oxygen" if you conflate the element and the diatomic gas.

Wikipedia is what it is, so it's really up to the users of structured
data to decide how much effort they want to put into tidying things
up. Unfortunately, there are lots of folks who would be happy to make
use of a well structured data set when it's done, but not a lot of
folks (so far?) who are interested in contributing time and effort to
making it better structured. Unless someone figures out how to break
the circle, we'll all be left bemoaning the inadequate and "annoying"
state of the data.

> It seems to me that the problem is tractable,
> but people have stopped short of the work it takes to do it:

So did you click the little "split" flag on the edit page? Better
yet, did you use one of the split tools to cleave the properties into
two sets and distribute them among the appropriate topics for the gas
and the chemical element?

I agree that "people" need to deal with this, but "people" includes
everyone with a vested interest. If you need better data for your
app, that includes you.
Tom From kurt at spaceship.com Tue Aug 18 22:39:58 2009 From: kurt at spaceship.com (Kurt Bollacker) Date: Tue, 18 Aug 2009 15:39:58 -0700 Subject: [Data-modeling] English Words In-Reply-To: <4A89D4AB.2050403@metaweb.com> References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com> <4A84C43B.3040303@metaweb.com> <4A89D4AB.2050403@metaweb.com> Message-ID: <20090818223958.GM6305@spaceship.com> I believe that Scott's opinion represents a practical and useful approach to Freebase's ingest of Wordnet. However, without substantial reconciliation, Wordnet is just a parallel semantic network that leverages the very nice graph store that also hosts Freebase. So my understanding is then that the main advantage of having Wordnet in Freebase is to essentially fork it to allow for its collaboratively driven growth and evolution. Am I understanding all of this correctly? Kurt :-) On Mon, Aug 17, 2009 at 03:07:39PM -0700, Scott Meyer wrote: > Iain Sproat wrote: > > Are we agreed that a freebase topic is a symset (and vice versa)? > > I don't think so. I'm suggesting that WN words and synsets be > cleanly separated from topics. Topics are "things" and words and > synsets are language-dependent symbolic representations of things. > Is there any reason to believe that, say, Chinese synsets should > correspond to English synsets? If we had half a dozen wordnets, > we could answer that question. As we don't, I think that it > behooves us not to constrain foreign language wordnets. > > So concretely two fundamental types: > > /common/symbol - corresponding to Wordnet Words > /common/synset - corresponding to Wordnet Synsets > > With properties lifted directly from Wordnet 3.0, pretty much the > structure you've proposed without the reconciliation of topics.
> The linkage between topic and synset would be a synset property: > > /common/synset/represents - expected type of /common/topic > > with the meaning: "some or all of the words in this synset may > be used to represent this topic" The preferred symbol is the > one with the same spelling as the topic's name. I'd call the > reverse property: > > /common/topic/symbols - expected type of /common/synset > > Symbols and synsets would be monolingual either by convention > with language being given by a property: > > /common/symbol/language > /common/synset/language > > or implicitly with a language specific type name. For example: > > /common/symbol_en > /common/synset_en > > I favor the former. > > > If so, the debate seems to be solely based on how best to model the > > relationship between a symbol (aka word) and the symset. Either way > > it wouldn't prevent us loading the wordnet symset data to freebase > > (particularly verbs and adjectives) and reconciling the nouns with > > existing freebase topics. I think that in itself would be incredibly > > beneficial. > > I agree completely about the benefits of loading wordnet. My contention > is twofold: > > 1. We can load Wordnet rapidly only if we don't reconcile Wordnet identities > to Freebase topics. Out of 380,000 entries in /usr/share/dict/words, > there are about 90,000 which match exactly the name or alias of a topic, > given that all we have to go on is the spelling of words, synsets, and > the text of definitions, this looks like a substantial reconciliation > problem, one that has, in the past prevented us from loading Wordnet. > > 2. There are structural reasons why a separate meta-level for language > constructs is a good idea: basically, symbols are different from the things > they represent. > > > On Fri, Aug 14, 2009 at 5:56 AM, Scott Meyer wrote: > >> 1. Reconcile topic to Synset, use value of name to locate symbol/word. 
> >> > >> This allows direct navigation to the Synset which was implied by this > >> sort of usage: > >> > >>> { > >>> ... > >>> "symbol" : { > >>> "lang" : "/lang/en", > >>> "hyponym" : [{ > >>> "topic" : [{ > >>> "name" : null, > >>> "id" : null, > >>> }] > >>> }], > >>> } > >>> ... > >>> } > > > > I can't quite work out the MQL (hyponym seems to expect a CVT with a > > property /topic?), and I'm not sure it compares exactly with the > > description for #1. Should it instead, perhaps, be the following? > > (direct link between symbol and topic): > >> {. > >> "type" : "/common/symbol" > >> "/common/symbol/lang" : "/lang/en", > >> "/common/symbol/hyponym" : [{ > >> "type" : "/common/topic", > >> "name" : null, > >> "id" : null, > >> }], > >> ......... > >> } > > This is reconciling synset and topic. In my example above, > the property "hyponym" was a property of synset and "topic" > was the property linking a synset and a topic. Here's that > example in the vocabulary I proposed above: > > { # we start with some topic > ... > "/common/topic/symbols" : [{ # find synsets used to represent it > "language" : "/lang/en", > "hyponym" : [{ # find hyponyms of those > "/common/synset/represents" : [{ find topics which those represent > "name" : null, > "id" : null > }] > }] > }] > } > > Does that make more sense? 
> > -Scott > _______________________________________________ > Data-modeling mailing list > Data-modeling at freebase.com > http://lists.freebase.com/mailman/listinfo/data-modeling From rfh at metaweb.com Tue Aug 18 23:26:31 2009 From: rfh at metaweb.com (Reilly Hayes) Date: Tue, 18 Aug 2009 16:26:31 -0700 Subject: [Data-modeling] English Words In-Reply-To: <20090818223958.GM6305@spaceship.com> References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com> <4A84C43B.3040303@metaweb.com> <4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com> Message-ID: <2E3991A6-75AC-4AFF-9BC6-659C450F138B@metaweb.com> Linking the Wordnet space to freebase Topics, Types, and Properties would be an important goal. "Our" wordnet would represent the relations built into wordnet, plus a set of relations between wordnet entities and freebase topics. I'm not sure how important "forking" wordnet would be. -r On Aug 18, 2009, at 3:39 PM, Kurt Bollacker wrote: > > I beleive that Scott's opinion represents a practical and useful > approach to Freebase's ingest of Wordnet. However, without > substantial reconciliation, then Wordnet is just a parallel semantic > network that leverages the very nice graph store that also hosts > Freebase. So my understanding is then that the main advantage of > having Wordnet in Freebase is to essentially fork it to allow for its > collaboratively driven growth and evolution. > > Am I understanding all of this correctly? > > Kurt :-) > > > > > > On Mon, Aug 17, 2009 at 03:07:39PM -0700, Scott Meyer wrote: >> Iain Sproat wrote: >>> Are we agreed that a freebase topic is a symset (and vice versa)? >> >> I don't think so. I'm suggesting that WN words and synsets be >> cleanly separated from topics. Topics are "things" and words and >> synsets are language-dependent symbolic representations of things. >> Is there any reason to believe that, say, Chinese synsets should >> correspond to English synsets? 
If we had half a dozen wordnets, >> we could answer that question. As we don't, I think that it >> behooves us not to constrain foreign language wordnets. >> >> So concretely two fundamental types: >> >> /common/symbol - corresponding to Wordnet Words >> /common/synset - corresponding to Wordnet Synsets >> >> With properties lifted directly from Wordnet 3.0, pretty much the >> structure you've proposed without the reconciliation of topics. >> The linkage between topic and synset would be a synset property: >> >> /common/synset/represents - expected type of /common/topic >> >> with the meaning: "some or all of the words in this synset may >> be used to represent this topic" The preferred symbol is the >> one with the same spelling as the topic's name. I'd call the >> reverse property: >> >> /common/topic/symbols - expected type of /common/synset >> >> Symbols and synsets would be monolingual either by convention >> with language being given by a property: >> >> /common/symbol/language >> /common/synset/language >> >> or implicitly with a language specific type name. For example: >> >> /common/symbol_en >> /common/synset_en >> >> I favor the former. >> >>> If so, the debate seems to be solely based on how best to model the >>> relationship between a symbol (aka word) and the symset. Either way >>> it wouldn't prevent us loading the wordnet symset data to freebase >>> (particularly verbs and adjectives) and reconciling the nouns with >>> existing freebase topics. I think that in itself would be >>> incredibly >>> beneficial. >> >> I agree completely about the benefits of loading wordnet. My >> contention >> is twofold: >> >> 1. We can load Wordnet rapidly only if we don't reconcile Wordnet >> identities >> to Freebase topics. 
Out of 380,000 entries in /usr/share/dict/words, >> there are about 90,000 which match exactly the name or alias of a >> topic, >> given that all we have to go on is the spelling of words, synsets, >> and >> the text of definitions, this looks like a substantial reconciliation >> problem, one that has, in the past prevented us from loading Wordnet. >> >> 2. There are structural reasons why a separate meta-level for >> language >> constructs is a good idea: basically, symbols are different from >> the things >> they represent. >> >>> On Fri, Aug 14, 2009 at 5:56 AM, Scott Meyer wrote: >>>> 1. Reconcile topic to Synset, use value of name to locate symbol/ >>>> word. >>>> >>>> This allows direct navigation to the Synset which was implied by >>>> this >>>> sort of usage: >>>> >>>>> { >>>>> ... >>>>> "symbol" : { >>>>> "lang" : "/lang/en", >>>>> "hyponym" : [{ >>>>> "topic" : [{ >>>>> "name" : null, >>>>> "id" : null, >>>>> }] >>>>> }], >>>>> } >>>>> ... >>>>> } >>> >>> I can't quite work out the MQL (hyponym seems to expect a CVT with a >>> property /topic?), and I'm not sure it compares exactly with the >>> description for #1. Should it instead, perhaps, be the following? >>> (direct link between symbol and topic): >>>> {. >>>> "type" : "/common/symbol" >>>> "/common/symbol/lang" : "/lang/en", >>>> "/common/symbol/hyponym" : [{ >>>> "type" : "/common/topic", >>>> "name" : null, >>>> "id" : null, >>>> }], >>>> ......... >>>> } >> >> This is reconciling synset and topic. In my example above, >> the property "hyponym" was a property of synset and "topic" >> was the property linking a synset and a topic. Here's that >> example in the vocabulary I proposed above: >> >> { # we start with some topic >> ... 
>> "/common/topic/symbols" : [{ # find synsets used to represent it >> "language" : "/lang/en", >> "hyponym" : [{ # find hyponyms of those >> "/common/synset/represents" : [{ find topics which those >> represent >> "name" : null, >> "id" : null >> }] >> }] >> }] >> } >> >> Does that make more sense? >> >> -Scott >> _______________________________________________ >> Data-modeling mailing list >> Data-modeling at freebase.com >> http://lists.freebase.com/mailman/listinfo/data-modeling > _______________________________________________ > Data-modeling mailing list > Data-modeling at freebase.com > http://lists.freebase.com/mailman/listinfo/data-modeling -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 2434 bytes Desc: not available Url : http://lists.freebase.com/pipermail/data-modeling/attachments/20090818/3e8bc846/attachment.bin From kurt at spaceship.com Tue Aug 18 23:53:50 2009 From: kurt at spaceship.com (Kurt Bollacker) Date: Tue, 18 Aug 2009 16:53:50 -0700 Subject: [Data-modeling] English Words In-Reply-To: <2E3991A6-75AC-4AFF-9BC6-659C450F138B@metaweb.com> References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com> <4A84C43B.3040303@metaweb.com> <4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com> <2E3991A6-75AC-4AFF-9BC6-659C450F138B@metaweb.com> Message-ID: <20090818235350.GO6305@spaceship.com> On Tue, Aug 18, 2009 at 04:26:31PM -0700, Reilly Hayes wrote: > > Linking the Wordnet space to freebase Topics, Types, and Properties > would be an important goal. "Our" wordnet would represent the > relations built into wordnet, plus a set of relations between wordnet > entities and freebase topics. That makes me happy. It's just not what Mr. Meyer was saying. > I'm not sure how important "forking" wordnet would be. I suspect forking would just be an eventuality. For example, I know of linguists trying to build a multiligual Wordnet. 
So far, the lack of a good technological foundation is one of the blocks. If Wordnet is imported into Freebase, then the path for other language wordnets would be made much clearer. In many cases, the grounding of synsets into first-class Freebase topics may provide a "translation gateway" between the different languages. Kurt :-) > -r > > > On Aug 18, 2009, at 3:39 PM, Kurt Bollacker wrote: > > > > >I beleive that Scott's opinion represents a practical and useful > >approach to Freebase's ingest of Wordnet. However, without > >substantial reconciliation, then Wordnet is just a parallel semantic > >network that leverages the very nice graph store that also hosts > >Freebase. So my understanding is then that the main advantage of > >having Wordnet in Freebase is to essentially fork it to allow for its > >collaboratively driven growth and evolution. > > > >Am I understanding all of this correctly? > > > > Kurt :-) > > > > > > > > > > > >On Mon, Aug 17, 2009 at 03:07:39PM -0700, Scott Meyer wrote: > >>Iain Sproat wrote: > >>>Are we agreed that a freebase topic is a symset (and vice versa)? > >> > >>I don't think so. I'm suggesting that WN words and synsets be > >>cleanly separated from topics. Topics are "things" and words and > >>synsets are language-dependent symbolic representations of things. > >>Is there any reason to believe that, say, Chinese synsets should > >>correspond to English synsets? If we had half a dozen wordnets, > >>we could answer that question. As we don't, I think that it > >>behooves us not to constrain foreign language wordnets. > >> > >>So concretely two fundamental types: > >> > >>/common/symbol - corresponding to Wordnet Words > >>/common/synset - corresponding to Wordnet Synsets > >> > >>With properties lifted directly from Wordnet 3.0, pretty much the > >>structure you've proposed without the reconciliation of topics.
> >>The linkage between topic and synset would be a synset property: > >> > >>/common/synset/represents - expected type of /common/topic > >> > >>with the meaning: "some or all of the words in this synset may > >>be used to represent this topic" The preferred symbol is the > >>one with the same spelling as the topic's name. I'd call the > >>reverse property: > >> > >>/common/topic/symbols - expected type of /common/synset > >> > >>Symbols and synsets would be monolingual either by convention > >>with language being given by a property: > >> > >>/common/symbol/language > >>/common/synset/language > >> > >>or implicitly with a language specific type name. For example: > >> > >>/common/symbol_en > >>/common/synset_en > >> > >>I favor the former. > >> > >>>If so, the debate seems to be solely based on how best to model the > >>>relationship between a symbol (aka word) and the symset. Either way > >>>it wouldn't prevent us loading the wordnet symset data to freebase > >>>(particularly verbs and adjectives) and reconciling the nouns with > >>>existing freebase topics. I think that in itself would be > >>>incredibly > >>>beneficial. > >> > >>I agree completely about the benefits of loading wordnet. My > >>contention > >>is twofold: > >> > >>1. We can load Wordnet rapidly only if we don't reconcile Wordnet > >>identities > >>to Freebase topics. Out of 380,000 entries in /usr/share/dict/words, > >>there are about 90,000 which match exactly the name or alias of a > >>topic, > >>given that all we have to go on is the spelling of words, synsets, > >>and > >>the text of definitions, this looks like a substantial reconciliation > >>problem, one that has, in the past prevented us from loading Wordnet. > >> > >>2. There are structural reasons why a separate meta-level for > >>language > >>constructs is a good idea: basically, symbols are different from > >>the things > >>they represent. > >> > >>>On Fri, Aug 14, 2009 at 5:56 AM, Scott Meyer wrote: > >>>>1. 
Reconcile topic to Synset, use value of name to locate symbol/ > >>>>word. > >>>> > >>>>This allows direct navigation to the Synset which was implied by > >>>>this > >>>>sort of usage: > >>>> > >>>>>{ > >>>>> ... > >>>>> "symbol" : { > >>>>> "lang" : "/lang/en", > >>>>> "hyponym" : [{ > >>>>> "topic" : [{ > >>>>> "name" : null, > >>>>> "id" : null, > >>>>> }] > >>>>> }], > >>>>> } > >>>>> ... > >>>>>} > >>> > >>>I can't quite work out the MQL (hyponym seems to expect a CVT with a > >>>property /topic?), and I'm not sure it compares exactly with the > >>>description for #1. Should it instead, perhaps, be the following? > >>>(direct link between symbol and topic): > >>>>{. > >>>> "type" : "/common/symbol" > >>>> "/common/symbol/lang" : "/lang/en", > >>>> "/common/symbol/hyponym" : [{ > >>>> "type" : "/common/topic", > >>>> "name" : null, > >>>> "id" : null, > >>>> }], > >>>> ......... > >>>>} > >> > >>This is reconciling synset and topic. In my example above, > >>the property "hyponym" was a property of synset and "topic" > >>was the property linking a synset and a topic. Here's that > >>example in the vocabulary I proposed above: > >> > >>{ # we start with some topic > >> ... > >> "/common/topic/symbols" : [{ # find synsets used to represent it > >> "language" : "/lang/en", > >> "hyponym" : [{ # find hyponyms of those > >> "/common/synset/represents" : [{ find topics which those > >>represent > >> "name" : null, > >> "id" : null > >> }] > >> }] > >> }] > >>} > >> > >>Does that make more sense? 
> >> > >>-Scott > >>_______________________________________________ > >>Data-modeling mailing list > >>Data-modeling at freebase.com > >>http://lists.freebase.com/mailman/listinfo/data-modeling > >_______________________________________________ > >Data-modeling mailing list > >Data-modeling at freebase.com > >http://lists.freebase.com/mailman/listinfo/data-modeling > From jfry at metaweb.com Wed Aug 19 00:40:40 2009 From: jfry at metaweb.com (Jeff Fry) Date: Tue, 18 Aug 2009 17:40:40 -0700 Subject: [Data-modeling] Cornel West, Noam Chomsky and Angela Davis: Musical Artists Message-ID: <4A8B4A08.1090307@metaweb.com> I know this has been discussed before, and that it's not easy to solve under our current schema...but it is still quite jarring when I encounter someone like Chomsky listed as a musical artist for having, say recorded a speech or read a book aloud. (In Chomsky's case it's catchy ditties such as 'Excerpt From "A Hard Choice"' and 'Pacification') It can be particularly jarring since musical artist often ends up as their most notable type, e.g. http://www.sandbox-freebase.com/view/en/angela_davis http://www.sandbox-freebase.com/view/en/assata_shakur Any thoughts as to how this could be improved, either in the data model or in the client? 
Jeff From iainsproat at gmail.com Wed Aug 19 09:19:27 2009 From: iainsproat at gmail.com (Iain Sproat) Date: Wed, 19 Aug 2009 13:19:27 +0400 Subject: [Data-modeling] English Words In-Reply-To: <2E3991A6-75AC-4AFF-9BC6-659C450F138B@metaweb.com> References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com> <4A84C43B.3040303@metaweb.com> <4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com> <2E3991A6-75AC-4AFF-9BC6-659C450F138B@metaweb.com> Message-ID: On Wed, Aug 19, 2009 at 3:26 AM, Reilly Hayes wrote: >On Aug 18, 2009, at 3:39 PM, Kurt Bollacker wrote: >> However, without substantial reconciliation, then Wordnet is just a parallel semantic >> network that leverages the very nice graph store that also hosts >> Freebase. The key point is the /common/synset/represents property. This reconciles the WordNet synsets with the Freebase topics. >> I'm not sure how important "forking" wordnet would be. > I suspect forking would just be an eventuality. It's the same with every data source imported into Freebase, they are all essentially a fork of the original source. The freebase API and the RDF availability allows fairly unencumbered access to any software to generate Diffs and carry out Merges. The biggest problems are the matching of schemas between freebase and the data source (I believe there's a few freebase folks working on an RDF ontology mapping base); and also a lack of API on the data source side for writing back. (WordNet's only method of contribution seems to be suggestions via email). > In many cases, the grounding of synsets > into first-class Freebase topics may provide a "translation gateway" > between the different languages. There's also slang dictionaries and fictional languages. (Elfish, Klingon etc..) 
Iain From iainsproat at gmail.com Wed Aug 19 10:26:53 2009 From: iainsproat at gmail.com (Iain Sproat) Date: Wed, 19 Aug 2009 14:26:53 +0400 Subject: [Data-modeling] Cornel West, Noam Chomsky and Angela Davis: Musical Artists In-Reply-To: <4A8B4A08.1090307@metaweb.com> References: <4A8B4A08.1090307@metaweb.com> Message-ID: An interesting problem, and is relevant also to participants in recordings such as http://www.freebase.com/view/en/watergate_tapes If we're to solve this using the schema, I'd be hesitant to touch the /music/artist as it is a fairly prevalent type and would probably break quite a lot of things. An option would be a new non-breaking type. We could create a new /music/recorded_speaker type, and tweak the client to display the /music/artist further down the page if the /music/recorded_speaker is present. Rather than being a 'bucket' type, the /music/recorded_speaker type could link to /music/track through a mediating CVT which asserts property values such as narration, speech, debate, interview, conversation, poetry recital etc.. Iain On Wed, Aug 19, 2009 at 4:40 AM, Jeff Fry wrote: > I know this has been discussed before, and that it's not easy to solve > under our current schema...but it is still quite jarring when I > encounter someone like Chomsky listed as a musical artist for having, > say recorded a speech or read a book aloud. (In Chomsky's case it's > catchy ditties such as 'Excerpt From "A Hard Choice"' and 'Pacification') > > It can be particularly jarring since musical artist often ends up as > their most notable type, e.g. > http://www.sandbox-freebase.com/view/en/angela_davis > http://www.sandbox-freebase.com/view/en/assata_shakur > > Any thoughts as to how this could be improved, either in the data model > or in the client? 
> > Jeff > _______________________________________________ > Data-modeling mailing list > Data-modeling at freebase.com > http://lists.freebase.com/mailman/listinfo/data-modeling > From paul at ontology2.com Wed Aug 19 14:23:19 2009 From: paul at ontology2.com (Paul Houle) Date: Wed, 19 Aug 2009 10:23:19 -0400 Subject: [Data-modeling] Cornel West, Noam Chomsky and Angela Davis: Musical Artists In-Reply-To: References: <4A8B4A08.1090307@metaweb.com> Message-ID: <4A8C0AD7.2040204@ontology2.com> Iain Sproat wrote: > An interesting problem, and is relevant also to participants in > recordings such as http://www.freebase.com/view/en/watergate_tapes > > If we're to solve this using the schema, I'd be hesitant to touch the > /music/artist as it is a fairly prevalent type and would probably > break quite a lot of things. > > An option would be a new non-breaking type. We could create a new > /music/recorded_speaker type, and tweak the client to display the > /music/artist further down the page if the /music/recorded_speaker is > present. > It's the baggage that comes along with duck typing. It would be nice to have something that recognizes the commonalities of "/music/recorded_speaker" and "/music/artist" however... Is there a way to distinguish the "Singer" role from the "Musical Artist" role? It would be nice to have a way to tag who did what on each track, which would be interesting for groups like http://www.freebase.com/view/en/the_beatles or http://www.freebase.com/view/en/shonen_knife where different vocalists are on different tracks. Speaking of which, the record for http://www.freebase.com/view/en/naoko_yamano has her name in Japanese kanji as a primary name, which isn't going to be useful at all for non-Japanese speakers (I see the "ko" at the end and that's it.) I've done some work modelling Asian names for digital library systems and it's a pain in the butt. 
Obviously the system already has her name ('Also known as: Naoko Yamano') but it's choosing to display the Kanji form of the name. It's also good to keep a Kana form of Japanese proper names around too, since Japanese proper names are sorted alphabetically by kana. Converting kanji -> kana is a tricky task that I've never seen automated in production systems. (I suspect it's more tractable than people think...) Converting "Naoko Yamano"@en to kana would be easier, but there's still the detail that "Naoko Yamano"@en is expressed in the western format that puts the family name at the end, but you'll probably find an occasional well-meaning otaku who would write her name "Yamano Naoko". A knowledge-based system could probably deal with this since there aren't that many distinct Japanese names anyway... Immediately I'd be happier to see the headline of the page be 山野直子 (Naoko Yamano) or (Naoko Yamano) 山野直子 From kurt at spaceship.com Wed Aug 19 14:35:24 2009 From: kurt at spaceship.com (Kurt Bollacker) Date: Wed, 19 Aug 2009 07:35:24 -0700 Subject: [Data-modeling] English Words In-Reply-To: References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com> <4A84C43B.3040303@metaweb.com> <4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com> <2E3991A6-75AC-4AFF-9BC6-659C450F138B@metaweb.com> Message-ID: <20090819143524.GQ6305@spaceship.com> On Wed, Aug 19, 2009 at 01:19:27PM +0400, Iain Sproat wrote: > On Wed, Aug 19, 2009 at 3:26 AM, Reilly Hayes wrote: > >On Aug 18, 2009, at 3:39 PM, Kurt Bollacker wrote: > >> I'm not sure how important "forking" wordnet would be. > > I suspect forking would just be an eventuality. > > It's the same with every data source imported into Freebase, they are > all essentially a fork of the original source. Not necessarily. You have a choice. When you bring in a new dataset you can choose to either make it a one-off snapshot or an ongoing maintenance effort to keep up with changes.
If bringing in Wordnet is a one-off, then the fork will happen. If it is a maintenance effort (like Metaweb's fabulous data team does with Wikipedia updates), then you simply have a different version (something close to a curated superset) of the original source data. I believe that if this is a Wordnet fork, then it has a chance to grow more quickly and in richer ways than the original. Kurt :-) From spencerkelly86 at gmail.com Wed Aug 19 14:45:00 2009 From: spencerkelly86 at gmail.com (Spencer Kelly) Date: Wed, 19 Aug 2009 10:45:00 -0400 Subject: [Data-modeling] Cornel West, Noam Chomsky and Angela Davis: Musical Artists In-Reply-To: References: <4A8B4A08.1090307@metaweb.com> Message-ID: I think we need a property on musical track: 'uses sample'. Then the distinction between a 'track contribution' and a 'used sample' will be voluntary versus non-voluntary usage. From tfmorris at gmail.com Wed Aug 19 15:19:39 2009 From: tfmorris at gmail.com (Tom Morris) Date: Wed, 19 Aug 2009 11:19:39 -0400 Subject: [Data-modeling] Cornel West, Noam Chomsky and Angela Davis: Musical Artists In-Reply-To: References: <4A8B4A08.1090307@metaweb.com> Message-ID: The problem extends to the recordings themselves, since typing a spoken word performance as "musical track" isn't really correct. I agree that we need to be careful about perturbing the music domain, but the problem will just get worse over time, so it's better to fix it now than later. There are also recordings of things which aren't music or spoken word, such as animal calls, sound effects, ambient sounds, security tapes, etc. Tom On Wed, Aug 19, 2009 at 6:26 AM, Iain Sproat wrote: > An interesting problem, and is relevant also to participants in > recordings such as http://www.freebase.com/view/en/watergate_tapes > > If we're to solve this using the schema, I'd be hesitant to touch the > /music/artist as it is a fairly prevalent type and would probably > break quite a lot of things.
> > An option would be a new non-breaking type. We could create a new > /music/recorded_speaker type, and tweak the client to display the > /music/artist further down the page if the /music/recorded_speaker is > present. > > Rather than being a 'bucket' type, the /music/recorded_speaker type > could link to /music/track through a mediating CVT which asserts > property values such as narration, speech, debate, interview, > conversation, poetry recital etc.. > > Iain > > On Wed, Aug 19, 2009 at 4:40 AM, Jeff Fry wrote: >> I know this has been discussed before, and that it's not easy to solve >> under our current schema...but it is still quite jarring when I >> encounter someone like Chomsky listed as a musical artist for having, >> say recorded a speech or read a book aloud. (In Chomsky's case it's >> catchy ditties such as 'Excerpt From "A Hard Choice"' and 'Pacification') >> >> It can be particularly jarring since musical artist often ends up as >> their most notable type, e.g. >> http://www.sandbox-freebase.com/view/en/angela_davis >> http://www.sandbox-freebase.com/view/en/assata_shakur >> >> Any thoughts as to how this could be improved, either in the data model >> or in the client?
>> >> Jeff >> _______________________________________________ >> Data-modeling mailing list >> Data-modeling at freebase.com >> http://lists.freebase.com/mailman/listinfo/data-modeling >> > _______________________________________________ > Data-modeling mailing list > Data-modeling at freebase.com > http://lists.freebase.com/mailman/listinfo/data-modeling > From tfmorris at gmail.com Wed Aug 19 15:23:02 2009 From: tfmorris at gmail.com (Tom Morris) Date: Wed, 19 Aug 2009 11:23:02 -0400 Subject: [Data-modeling] Cornel West, Noam Chomsky and Angela Davis: Musical Artists In-Reply-To: <4A8C0AD7.2040204@ontology2.com> References: <4A8B4A08.1090307@metaweb.com> <4A8C0AD7.2040204@ontology2.com> Message-ID: On Wed, Aug 19, 2009 at 10:23 AM, Paul Houle wrote: > Speaking of which, the record for > > http://www.freebase.com/view/en/naoko_yamano > > has her name in Japanese kanji as a primary name, which isn't going to > be useful at all for non-Japanese speakers (I see the "ko" at the end > and that's it.) I've done some work modelling Asian names for digital > library systems and it's a pain in the butt. Obviously the system > already has her name ('Also known as: Naoko Yamano') but it's choosing > to display the Kanji form of the name. That's a bug. The Japanese name is mistyped as English. http://www.freebase.com/api/service/mqlread?query={%20%22query%22%3A%20[{%20%22guid%22%3A%20%22%239202a8c04000641f8000000003b306c3%22%2C%20%22name%22%3A%20[{}]%2C%20%22%2Fcommon%2Ftopic%2Falias%22%3A%20[{}]%20}]%20} You've correctly identified that the primary name should be in English, per the Freebase guidelines, so all you have to do is click the "Edit" button and make the change. You won't be able to type the Kanji name correctly as Japanese through the web client, so just move it to an alias (where it will continue to be mistyped as English until someone fixes it up via MQL). 
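For anyone who would rather read that mqlread link than click it, here is a small sketch that decodes the percent-encoded query above and rebuilds the same kind of URL. It uses only the Python standard library and sends no request; the function name is illustrative, and encoding every reserved character (safe="") is my choice, not something the service is known to require.

```python
import json
from urllib.parse import quote, unquote

# The query string copied from the link above.
ENCODED = (
    "{%20%22query%22%3A%20[{%20%22guid%22%3A%20%22%239202a8c04000641f8000000003b306c3%22"
    "%2C%20%22name%22%3A%20[{}]%2C%20%22%2Fcommon%2Ftopic%2Falias%22%3A%20[{}]%20}]%20}"
)

# Decode the percent-escapes and parse the MQL envelope as JSON.
envelope = json.loads(unquote(ENCODED))


def mqlread_url(env, base="http://www.freebase.com/api/service/mqlread"):
    """Rebuild an mqlread URL from a query envelope (string building only)."""
    return base + "?query=" + quote(json.dumps(env), safe="")


if __name__ == "__main__":
    # Show the decoded envelope and the rebuilt URL.
    print(json.dumps(envelope, indent=2))
    print(mqlread_url(envelope))
```

The decoded envelope asks for every "name" and every "/common/topic/alias" value on the topic with that guid; the empty [{}] clauses return each value together with its language tag, which is how the mistyped name shows up.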
Tom From rfh at metaweb.com Wed Aug 19 15:25:24 2009 From: rfh at metaweb.com (Reilly Hayes) Date: Wed, 19 Aug 2009 08:25:24 -0700 Subject: [Data-modeling] English Words In-Reply-To: <20090819143524.GQ6305@spaceship.com> References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com> <4A84C43B.3040303@metaweb.com> <4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com> <2E3991A6-75AC-4AFF-9BC6-659C450F138B@metaweb.com> <20090819143524.GQ6305@spaceship.com> Message-ID: I must admit that I don't know if wordnet facilitates this by maintaining keys between versions. If they do, there is not a need to "fork". If they don't, we could still build a transition process, but it wouldn't be fun. -r On Aug 19, 2009, at 7:35 AM, Kurt Bollacker wrote: > > On Wed, Aug 19, 2009 at 01:19:27PM +0400, Iain Sproat wrote: >> On Wed, Aug 19, 2009 at 3:26 AM, Reilly Hayes wrote: >>> On Aug 18, 2009, at 3:39 PM, Kurt Bollacker wrote: >>>> I'm not sure how important "forking" wordnet would be. >>> I suspect forking would just be an eventuality. >> >> It's the same with every data source imported into Freebase, they are >> all essentially a fork of the original source. > > Not necessarily. You have a choice. When you bring in a new dataset > you can choose to either make it a one-off snapshot or an ongoing > maintenance effort to keep up with changes. If bringing in Wordnet is > a one-off, then the fork will happen. If it is a maintenance effort > (like Metaweb's fabulous data team does with Wikipedia updates), then > you simply have a different version (something close to a curated > superset) of the original source data. > > I believe that if this is Wordnet fork, then it has a chance to grow > more quickly and in richer ways that the original.
> > Kurt :-) From jeff at metaweb.com Wed Aug 19 16:51:15 2009 From: jeff at metaweb.com (Jeff Prucher) Date: Wed, 19 Aug 2009 09:51:15 -0700 Subject: [Data-modeling] Cornel West, Noam Chomsky and Angela Davis: Musical Artists In-Reply-To: <4A8C0AD7.2040204@ontology2.com> References: <4A8B4A08.1090307@metaweb.com> <4A8C0AD7.2040204@ontology2.com> Message-ID: <0CF787396A5C40C3B5CF17669AC27C26@amd> > Is there a way to distinguish the > "Singer" role from the "Musical Artist" role? It would be > nice to have a way to tag who did what on each track, which > would be interesting for groups like > > http://www.freebase.com/view/en/the_beatles > > or > > http://www.freebase.com/view/en/shonen_knife > > where different vocalists are on different tracks. The property /music/track/contributions is for this purpose. Jeff From sm at metaweb.com Wed Aug 19 16:19:01 2009 From: sm at metaweb.com (Scott Meyer) Date: Wed, 19 Aug 2009 09:19:01 -0700 Subject: [Data-modeling] English Words In-Reply-To: <20090818223958.GM6305@spaceship.com> References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com> <4A84C43B.3040303@metaweb.com> <4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com> Message-ID: <4A8C25F5.1090602@metaweb.com> Kurt Bollacker wrote: > I believe that Scott's opinion represents a practical and useful > approach to Freebase's ingest of Wordnet. However, without > substantial reconciliation, Wordnet is just a parallel semantic > network that leverages the very nice graph store that also hosts > Freebase. 
So my understanding is then that the main advantage of > having Wordnet in Freebase is to essentially fork it to allow for its > collaboratively driven growth and evolution. You're right that simply loading wordnet as a parallel hierarchy is of limited benefit. My hope is that people who use Wordnet and Freebase might be inclined to publish some of the reconciliation work implicit in "using Wordnet and Freebase". In addition, I'd expect Metaweb to contribute, either via direct onslaught of the data team, circuitously with a dictionary data game, or in other creative ways. As for an explicit "fork" of Wordnet, I'm pretty much agnostic. If Freebase Wordnet becomes a center of growth and evolution, that's great. If Freebase just publishes Wordnet as it comes from Princeton, that's fine too. > Am I understanding all of this correctly? Of course! :-) -Scott From arthur.van.hoff at gmail.com Wed Aug 19 17:39:28 2009 From: arthur.van.hoff at gmail.com (Arthur van Hoff) Date: Wed, 19 Aug 2009 10:39:28 -0700 Subject: [Data-modeling] English Words In-Reply-To: <4A8C25F5.1090602@metaweb.com> References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com> <4A84C43B.3040303@metaweb.com> <4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com> <4A8C25F5.1090602@metaweb.com> Message-ID: On Wed, Aug 19, 2009 at 9:19 AM, Scott Meyer wrote: > You're right that simply loading wordnet as a parallel hierarchy > is of limited benefit. My hope is that people who use > Wordnet and Freebase might be inclined to publish some of the > reconciliation work implicit in "using Wordnet and Freebase". It would be extremely valuable to us to have a version of wordnet in Freebase, specially if it is maintained and extended by the community. Freebase is great at defining topics, but it lacks data on nouns, plurals, verbs, verb-tenses, hyphenation, adjectives, punctuation, person names, titles, slang, abbreviations, etc. 
All of which are instrumental in parsing and understanding text. The loading of Wordnet is a good start... We've cobbled together some data from various sources, but it would be great if it could all be linked together in one place, and against multiple languages. This would greatly simplify the task of maintaining the data. We're very motivated to make this work, and we're willing to contribute a whole bunch of data that we've gathered statistically by parsing the web. -- Arthur van Hoff - Grand Master of Alphabetical Order The Ellerdale Project, Menlo Park, CA From bgoldenberg at gmail.com Wed Aug 19 18:01:59 2009 From: bgoldenberg at gmail.com (Benjamin Goldenberg) Date: Wed, 19 Aug 2009 11:01:59 -0700 Subject: [Data-modeling] English Words In-Reply-To: References: <4A84C43B.3040303@metaweb.com> <4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com> <2E3991A6-75AC-4AFF-9BC6-659C450F138B@metaweb.com> <20090819143524.GQ6305@spaceship.com> Message-ID: On Wed, Aug 19, 2009 at 8:25 AM, Reilly Hayes wrote: > > I must admit that I don't know if wordnet facilitates this by maintaining keys > between versions. If they do, there is not a need to "fork". If they > don't, we could still build a transition process, but it wouldn't be fun. As I understand it, the synset offset numbers are not stable between versions, but the word#pos#form identifiers are. Ben > > -r > > > On Aug 19, 2009, at 7:35 AM, Kurt Bollacker wrote: > >> >> On Wed, Aug 19, 2009 at 01:19:27PM +0400, Iain Sproat wrote: >>> >>> On Wed, Aug 19, 2009 at 3:26 AM, Reilly Hayes wrote: >>>> >>>> On Aug 18, 2009, at 3:39 PM, Kurt Bollacker wrote: >>>>> >>>>> I'm not sure how important "forking" wordnet would be. >>>> >>>> I suspect forking would just be an eventuality. 
>>> >>> It's the same with every data source imported into Freebase, they are >>> all essentially a fork of the original source. >> >> Not necessarily. You have a choice. When you bring in a new dataset >> you can choose to either make it a one-off snapshot or an ongoing >> maintenance effort to keep up with changes. If bringing in Wordnet is >> a one-off, then the fork will happen. If it is a maintenance effort >> (like Metaweb's fabulous data team does with Wikipedia updates), then >> you simply have a different version (something close to a curated >> superset) of the original source data. >> >> I believe that if this is a Wordnet fork, then it has a chance to grow >> more quickly and in richer ways than the original. >> >> Kurt :-) >> > > From iainsproat at gmail.com Wed Aug 19 18:10:36 2009 From: iainsproat at gmail.com (Iain Sproat) Date: Wed, 19 Aug 2009 22:10:36 +0400 Subject: [Data-modeling] English Words In-Reply-To: References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com> <4A84C43B.3040303@metaweb.com> <4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com> <4A8C25F5.1090602@metaweb.com> Message-ID: Do we have a consensus on the below schema? 
/common/symbol /common/symbol/lang expecting /type/lang /common/symbol/synset reciprocated by /common/synset/synonym /common/synset /common/synset/lang expecting /type/lang /common/synset/synonym reciprocated by /common/symbol/synset /common/synset/represents expecting /common/topic and, finally, a new property in topic: /common/topic/synset reciprocated by /common/synset/represents I think that's the bare minimum to get this started. We can look at more features/properties later once this has got off the ground. Iain From arthur.van.hoff at gmail.com Wed Aug 19 18:20:46 2009 From: arthur.van.hoff at gmail.com (Arthur van Hoff) Date: Wed, 19 Aug 2009 11:20:46 -0700 Subject: [Data-modeling] English Words In-Reply-To: References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com> <4A84C43B.3040303@metaweb.com> <4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com> <4A8C25F5.1090602@metaweb.com> Message-ID: +1 On Wed, Aug 19, 2009 at 11:10 AM, Iain Sproat wrote: > Do we have a consensus on the below schema? > > /common/symbol > /common/symbol/lang expecting /type/lang > /common/symbol/synset reciprocated by /common/synset/synonym > > /common/synset > /common/synset/lang expecting /type/lang > /common/synset/synonym reciprocated by /common/symbol/synset > /common/synset/represents expecting /common/topic > > and, finally, a new property in topic: > /common/topic/synset reciprocated by /common/synset/represents > > I think that's the bare minimum to get this started. We can look at > more features/properties later once this has got off the ground. > > Iain > _______________________________________________ > Data-modeling mailing list > Data-modeling at freebase.com > http://lists.freebase.com/mailman/listinfo/data-modeling > -- Arthur van Hoff arthur.van.hoff at gmail.com 650-283-0842 -------------- next part -------------- An HTML attachment was scrubbed... 
URL: http://lists.freebase.com/pipermail/data-modeling/attachments/20090819/718e3b62/attachment-0001.htm From paul at ontology2.com Wed Aug 19 18:33:44 2009 From: paul at ontology2.com (Paul Houle) Date: Wed, 19 Aug 2009 14:33:44 -0400 Subject: [Data-modeling] Cornel West, Noam Chomsky and Angela Davis: Musical Artists In-Reply-To: References: <4A8B4A08.1090307@metaweb.com> <4A8C0AD7.2040204@ontology2.com> Message-ID: <4A8C4588.9050200@ontology2.com> Tom Morris wrote: > > That's a bug. The Japanese name is mistyped as English. > > http://www.freebase.com/api/service/mqlread?query={%20%22query%22%3A%20[{%20%22guid%22%3A%20%22%239202a8c04000641f8000000003b306c3%22%2C%20%22name%22%3A%20[{}]%2C%20%22%2Fcommon%2Ftopic%2Falias%22%3A%20[{}]%20}]%20} > > You've correctly identified that the primary name should be in > English, per the Freebase guidelines, so all you have to do is click > the "Edit" button and make the change. You won't be able to type the > Kanji name correctly as Japanese through the web client, so just move > it to an alias (where it will continue to be mistyped as English until > someone fixes it up via MQL). > Well, I don't think it's entirely realistic to expect users to fix problems like this manually, and as you point out, the current interface doesn't support it. It appears this error was introduced in an automated import process and it ought to be fixed by an automated gardening process. One trouble with manual fixes is that they're often going to be wrong. For instance, I got into japanese pop culture long before I got into data modeling; i had (and still have) language skills that are probably typical of the kind of gaijin who'd care about Shonen Knife: I can read kana, maybe 50 kanji, and know a few hundred words of the spoken language and something about particles and grammar. With great concentration I can put together a stilted and formal sentence, but I can't conjugate verbs correctly to speak informally. 
Trouble is, that's enough to be dangerous. All the time I hear and read names given family name first, and I went through a phase where it was "cool" and "correct" to give names in the Japanese format that puts family names first. If you have a population of anime otaku university students, you'll get 50% of names entered wrong. I know better now because I've worked on a number of projects where we spent entirely too much time thinking about Icelandic patronymics (Eric's-son) and other problems in properly segmenting names. More than once we came to the conclusion that overly complex user interfaces for name entry would increase the error rate, not decrease it -- particularly because people who have names coded by non-English conventions would have a hard time reading instructions in English. Looking at the problem today I'd take a knowledge-based approach; I'm certain that heuristics could be derived that would segment names better than any population of non-experts and better than names would be segmented by their owners. Of course, the existence of Freebase goes some way towards making that possible. From tfmorris at gmail.com Wed Aug 19 19:14:06 2009 From: tfmorris at gmail.com (Tom Morris) Date: Wed, 19 Aug 2009 15:14:06 -0400 Subject: [Data-modeling] Cornel West, Noam Chomsky and Angela Davis: Musical Artists In-Reply-To: <4A8C4588.9050200@ontology2.com> References: <4A8B4A08.1090307@metaweb.com> <4A8C0AD7.2040204@ontology2.com> <4A8C4588.9050200@ontology2.com> Message-ID: On Wed, Aug 19, 2009 at 2:33 PM, Paul Houle wrote: > Tom Morris wrote: >> >> That's a bug. The Japanese name is mistyped as English. 
>> >> http://www.freebase.com/api/service/mqlread?query={%20%22query%22%3A%20[{%20%22guid%22%3A%20%22%239202a8c04000641f8000000003b306c3%22%2C%20%22name%22%3A%20[{}]%2C%20%22%2Fcommon%2Ftopic%2Falias%22%3A%20[{}]%20}]%20} >> >> You've correctly identified that the primary name should be in >> English, per the Freebase guidelines, so all you have to do is click >> the "Edit" button and make the change. You won't be able to type the >> Kanji name correctly as Japanese through the web client, so just move >> it to an alias (where it will continue to be mistyped as English until >> someone fixes it up via MQL). >> > Well, I don't think it's entirely realistic to expect users to fix > problems like this manually, and as you point out, the current > interface doesn't support it. It appears this error was introduced in > an automated import process and it ought to be fixed by an automated > gardening process. OK, let me revise my suggestion. Go to http://bugs.freebase.com and file a bug report with as much detail as you can muster and sit back and wait for Metaweb to prioritize and, eventually, fix the problem. Although the data was originally imported by an automated process, choosing the Kanji name was a manual step: users are asked to vote on the "best" topic overall, and they may have had some reason for preferring the version with the Kanji name. The topic with the Kanji came from a MusicBrainz import, but it's not clear that they know what language the names are in. One thing the importer could probably do is give preference to aliases which are in a Western character set, although that might have other undesirable side effects. 
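As a rough illustration of the "prefer aliases in a Western character set" idea above, here is a minimal Python sketch. It is an assumption about how such a preference could be expressed, not a description of the actual Freebase or MusicBrainz importer; the function names is_latin and pick_display_name and the sample Japanese string are made up for the example.

```python
import unicodedata


def is_latin(text):
    """Return True if every alphabetic character in text is a Latin-script letter."""
    letters = [ch for ch in text if ch.isalpha()]
    # unicodedata.name() returns names such as "LATIN SMALL LETTER A";
    # the second argument is the fallback for unnamed code points.
    return bool(letters) and all(
        unicodedata.name(ch, "").startswith("LATIN") for ch in letters
    )


def pick_display_name(names):
    """Prefer the first Latin-script candidate; otherwise keep the first name given."""
    for name in names:
        if is_latin(name):
            return name
    return names[0] if names else None


# Hypothetical candidates: a placeholder Japanese string and the alias from the thread.
candidates = ["日本語の名前", "Naoko Yamano"]
print(pick_display_name(candidates))
```

One of the side effects hinted at above is visible here: a Latin-script alias is not necessarily English, so a script check like this can only stand in for a proper language tag.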
Tom From iainsproat at gmail.com Wed Aug 19 19:18:53 2009 From: iainsproat at gmail.com (Iain Sproat) Date: Wed, 19 Aug 2009 23:18:53 +0400 Subject: [Data-modeling] Cornel West, Noam Chomsky and Angela Davis: Musical Artists In-Reply-To: References: <4A8B4A08.1090307@metaweb.com> <4A8C0AD7.2040204@ontology2.com> <4A8C4588.9050200@ontology2.com> Message-ID: There's also the names base http://givennames.freebase.com/ for anyone interested in names. Iain On Wed, Aug 19, 2009 at 11:14 PM, Tom Morris wrote: > On Wed, Aug 19, 2009 at 2:33 PM, Paul Houle wrote: >> Tom Morris wrote: >>> >>> That's a bug. ?The Japanese name is mistyped as English. >>> >>> http://www.freebase.com/api/service/mqlread?query={%20%22query%22%3A%20[{%20%22guid%22%3A%20%22%239202a8c04000641f8000000003b306c3%22%2C%20%22name%22%3A%20[{}]%2C%20%22%2Fcommon%2Ftopic%2Falias%22%3A%20[{}]%20}]%20} >>> >>> You've correctly identified that the primary name should be in >>> English, per the Freebase guidelines, so all you have to do is click >>> the "Edit" button and make the change. ?You won't be able to type the >>> Kanji name correctly as Japanese through the web client, so just move >>> it to an alias (where it will continue to be mistyped as English until >>> someone fixes it up via MQL). >>> >> ? ?Well, ?I don't think it's entirely realistic to expect users to fix >> problems like this manually, ?and as you point out, ?the current >> interface doesn't support it. ?It appears this error was introduced in >> an automated import process and it ought to be fixed by an automated >> gardening process. > > OK, let me revise my suggestion. ?Go to http://bugs.freebase.com and > file a bug report with as much detail as you can muster and sit back > and wait for Metaweb to prioritize and, eventually, fix the problem. 
> > Although the data was originally imported by an automated process, the > process of choosing the Kanji name was a manual one, although users > are asked to vote on the "best" topic overall and may have had some > reason for preferring the version with the Kanji name. > > The topic with the Kanji came from a MusicBrainz import, but it's not > clear that they know what language the names are in. ?One thing the > importer could probably do is give preference to aliases which are in > a Western character set, although that might have other undesirable > side effects. > > Tom > _______________________________________________ > Data-modeling mailing list > Data-modeling at freebase.com > http://lists.freebase.com/mailman/listinfo/data-modeling > From jason at metaweb.com Wed Aug 19 19:25:34 2009 From: jason at metaweb.com (Jason Douglas) Date: Wed, 19 Aug 2009 12:25:34 -0700 Subject: [Data-modeling] English Words In-Reply-To: References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com> <4A84C43B.3040303@metaweb.com> <4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com> <4A8C25F5.1090602@metaweb.com> Message-ID: <6C940FC3-6490-4413-A2C4-1C6308D2CA68@metaweb.com> On Aug 19, 2009, at 10:39 AM, Arthur van Hoff wrote: > On Wed, Aug 19, 2009 at 9:19 AM, Scott Meyer wrote: > You're right that simply loading wordnet as a parallel hierarchy > is of limited benefit. My hope is that people who use > Wordnet and Freebase might be inclined to publish some of the > reconciliation work implicit in "using Wordnet and Freebase". > > It would be extremely valuable to us to have a version of wordnet in > Freebase, specially if it is maintained and extended by the > community. Freebase is great at defining topics, but it lacks data > on nouns, plurals, verbs, verb-tenses, hyphenation, adjectives, > punctuation, person names, titles, slang, abbreviations, etc. All of > which are instrumental in parsing and understanding text. 
The > loading of Wordnet is a good start... This is all very experimental, but we have been trying some things in this area. There's this type: http://www.freebase.com/type/schema/freebase/linguistic_hint That we've been applying as "bare properties" (no type assertion) to various objects. David has even written a quick-and-dirty acre app for editing these values: http://linguistics.dfhuynh.user.dev.freebaseapps.com/?type=/location/country This app can't handle types with lots of instances and could certainly be improved (please feel free to clone it and do that!), but I thought it was worth sharing as we're also interested in seeing all of those questions you posed addressed. We were also just talking yesterday about how if WordNet does get loaded in Freebase, that it might make more sense for the expected type of these properties to be synsets rather than strings. -jason From rfh at metaweb.com Wed Aug 19 19:46:25 2009 From: rfh at metaweb.com (Reilly Hayes) Date: Wed, 19 Aug 2009 12:46:25 -0700 Subject: [Data-modeling] English Words In-Reply-To: <6C940FC3-6490-4413-A2C4-1C6308D2CA68@metaweb.com> References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com> <4A84C43B.3040303@metaweb.com> <4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com> <4A8C25F5.1090602@metaweb.com> <6C940FC3-6490-4413-A2C4-1C6308D2CA68@metaweb.com> Message-ID: FYI, http://www.image-net.org/ On Aug 19, 2009, at 12:25 PM, Jason Douglas wrote: > On Aug 19, 2009, at 10:39 AM, Arthur van Hoff wrote: > >> On Wed, Aug 19, 2009 at 9:19 AM, Scott Meyer wrote: >> You're right that simply loading wordnet as a parallel hierarchy >> is of limited benefit. My hope is that people who use >> Wordnet and Freebase might be inclined to publish some of the >> reconciliation work implicit in "using Wordnet and Freebase". >> >> It would be extremely valuable to us to have a version of wordnet in >> Freebase, specially if it is maintained and extended by the >> community. 
Freebase is great at defining topics, but it lacks data >> on nouns, plurals, verbs, verb-tenses, hyphenation, adjectives, >> punctuation, person names, titles, slang, abbreviations, etc. All of >> which are instrumental in parsing and understanding text. The >> loading of Wordnet is a good start... > > This is all very experimental, but we have been trying some things in > this area. There's this type: > > http://www.freebase.com/type/schema/freebase/linguistic_hint > > That we've been applying as "bare properties" (no type assertion) to > various objects. David has even written a quick-and-dirty acre app > for editing these values: > > http://linguistics.dfhuynh.user.dev.freebaseapps.com/?type=/location/country > > This app can't handle types with lots of instances and could certainly > be improved (please feel free to clone it and do that!), but I thought > it was worth sharing as we're also interested in seeing all of those > questions you posed addressed. We were also just talking yesterday > about how if WordNet does get loaded in Freebase, that it might make > more sense for the expected type of these properties to be synsets > rather than strings. > > -jason > _______________________________________________ > Data-modeling mailing list > Data-modeling at freebase.com > http://lists.freebase.com/mailman/listinfo/data-modeling -------------- next part -------------- A non-text attachment was scrubbed... 
Name: smime.p7s Type: application/pkcs7-signature Size: 2434 bytes Desc: not available Url : http://lists.freebase.com/pipermail/data-modeling/attachments/20090819/0f1c2e03/attachment.bin From iainsproat at gmail.com Wed Aug 19 20:12:27 2009 From: iainsproat at gmail.com (Iain Sproat) Date: Thu, 20 Aug 2009 00:12:27 +0400 Subject: [Data-modeling] Cornel West, Noam Chomsky and Angela Davis: Musical Artists In-Reply-To: References: <4A8B4A08.1090307@metaweb.com> <4A8C0AD7.2040204@ontology2.com> <4A8C4588.9050200@ontology2.com> Message-ID: On Wed, Aug 19, 2009 at 7:19 PM, Tom Morris wrote: > There are also recordings of things which aren't > music or spoken word, such as animal calls, sound effects, ambient > sounds, security tapes, etc. I've cobbled together a schema in a new base, http://sounds.freebase.com/ There are types for audio recording, sampled audio recording and vocal recording, as well as for audio recording contributions (non-vocal) and vocal recording contributions. Using this to solve the /music/artist problem would mean that /music/track/contributions and /music/track/artist could move from /music/track to /base/sounds/audio_recording. Iain On Wed, Aug 19, 2009 at 11:18 PM, Iain Sproat wrote: > There's also the names base http://givennames.freebase.com/ for anyone > interested in names. > > Iain > > On Wed, Aug 19, 2009 at 11:14 PM, Tom Morris wrote: >> On Wed, Aug 19, 2009 at 2:33 PM, Paul Houle wrote: >>> Tom Morris wrote: >>>> >>>> That's a bug. The Japanese name is mistyped as English. >>>> >>>> http://www.freebase.com/api/service/mqlread?query={%20%22query%22%3A%20[{%20%22guid%22%3A%20%22%239202a8c04000641f8000000003b306c3%22%2C%20%22name%22%3A%20[{}]%2C%20%22%2Fcommon%2Ftopic%2Falias%22%3A%20[{}]%20}]%20} >>>> >>>> You've correctly identified that the primary name should be in >>>> English, per the Freebase guidelines, so all you have to do is click >>>> the "Edit" button and make the change. 
?You won't be able to type the >>>> Kanji name correctly as Japanese through the web client, so just move >>>> it to an alias (where it will continue to be mistyped as English until >>>> someone fixes it up via MQL). >>>> >>> ? ?Well, ?I don't think it's entirely realistic to expect users to fix >>> problems like this manually, ?and as you point out, ?the current >>> interface doesn't support it. ?It appears this error was introduced in >>> an automated import process and it ought to be fixed by an automated >>> gardening process. >> >> OK, let me revise my suggestion. ?Go to http://bugs.freebase.com and >> file a bug report with as much detail as you can muster and sit back >> and wait for Metaweb to prioritize and, eventually, fix the problem. >> >> Although the data was originally imported by an automated process, the >> process of choosing the Kanji name was a manual one, although users >> are asked to vote on the "best" topic overall and may have had some >> reason for preferring the version with the Kanji name. >> >> The topic with the Kanji came from a MusicBrainz import, but it's not >> clear that they know what language the names are in. ?One thing the >> importer could probably do is give preference to aliases which are in >> a Western character set, although that might have other undesirable >> side effects. 
>> >> Tom >> _______________________________________________ >> Data-modeling mailing list >> Data-modeling at freebase.com >> http://lists.freebase.com/mailman/listinfo/data-modeling >> > From iainsproat at gmail.com Wed Aug 19 20:52:33 2009 From: iainsproat at gmail.com (Iain Sproat) Date: Thu, 20 Aug 2009 00:52:33 +0400 Subject: [Data-modeling] English Words In-Reply-To: References: <4A84C43B.3040303@metaweb.com> <4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com> <4A8C25F5.1090602@metaweb.com> <6C940FC3-6490-4413-A2C4-1C6308D2CA68@metaweb.com> Message-ID: https://bugs.freebase.com/browse/DA-899 On Wed, Aug 19, 2009 at 11:46 PM, Reilly Hayes wrote: > > FYI, http://www.image-net.org/ > > > On Aug 19, 2009, at 12:25 PM, Jason Douglas wrote: > >> On Aug 19, 2009, at 10:39 AM, Arthur van Hoff wrote: >> >>> On Wed, Aug 19, 2009 at 9:19 AM, Scott Meyer wrote: >>> You're right that simply loading wordnet as a parallel hierarchy >>> is of limited benefit. ?My hope is that people who use >>> Wordnet and Freebase might be inclined to publish some of the >>> reconciliation work implicit in "using Wordnet and Freebase". >>> >>> It would be extremely valuable to us to have a version of wordnet in >>> Freebase, specially if it is maintained and extended by the >>> community. Freebase is great at defining topics, but it lacks data >>> on nouns, plurals, verbs, verb-tenses, hyphenation, adjectives, >>> punctuation, person names, titles, slang, abbreviations, etc. All of >>> which are instrumental in parsing and understanding text. The >>> loading of Wordnet is a good start... >> >> This is all very experimental, but we have been trying some things in >> this area. ?There's this type: >> >> ? ? ? ?http://www.freebase.com/type/schema/freebase/linguistic_hint >> >> That we've been applying as "bare properties" (no type assertion) to >> various objects. 
?David has even written a quick-and-dirty acre app >> for editing these values: >> >> >> ?http://linguistics.dfhuynh.user.dev.freebaseapps.com/?type=/location/country >> >> This app can't handle types with lots of instances and could certainly >> be improved (please feel free to clone it and do that!), but I thought >> it was worth sharing as we're also interested in seeing all of those >> questions you posed addressed. ?We were also just talking yesterday >> about how if WordNet does get loaded in Freebase, that it might make >> more sense for the expected type of these properties to be synsets >> rather than strings. >> >> -jason >> _______________________________________________ >> Data-modeling mailing list >> Data-modeling at freebase.com >> http://lists.freebase.com/mailman/listinfo/data-modeling > > > _______________________________________________ > Data-modeling mailing list > Data-modeling at freebase.com > http://lists.freebase.com/mailman/listinfo/data-modeling > > From arthur.van.hoff at gmail.com Wed Aug 19 21:43:06 2009 From: arthur.van.hoff at gmail.com (Arthur van Hoff) Date: Wed, 19 Aug 2009 14:43:06 -0700 Subject: [Data-modeling] English Words In-Reply-To: References: <4A84C43B.3040303@metaweb.com> <4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com> <4A8C25F5.1090602@metaweb.com> <6C940FC3-6490-4413-A2C4-1C6308D2CA68@metaweb.com> Message-ID: How are the various word types (noun, verb, ...) going to be distinguished? On Wed, Aug 19, 2009 at 1:52 PM, Iain Sproat wrote: > https://bugs.freebase.com/browse/DA-899 > > On Wed, Aug 19, 2009 at 11:46 PM, Reilly Hayes wrote: > > > > FYI, http://www.image-net.org/ > > > > > > On Aug 19, 2009, at 12:25 PM, Jason Douglas wrote: > > > >> On Aug 19, 2009, at 10:39 AM, Arthur van Hoff wrote: > >> > >>> On Wed, Aug 19, 2009 at 9:19 AM, Scott Meyer wrote: > >>> You're right that simply loading wordnet as a parallel hierarchy > >>> is of limited benefit. 
My hope is that people who use > >>> Wordnet and Freebase might be inclined to publish some of the > >>> reconciliation work implicit in "using Wordnet and Freebase". > >>> > >>> It would be extremely valuable to us to have a version of wordnet in > >>> Freebase, specially if it is maintained and extended by the > >>> community. Freebase is great at defining topics, but it lacks data > >>> on nouns, plurals, verbs, verb-tenses, hyphenation, adjectives, > >>> punctuation, person names, titles, slang, abbreviations, etc. All of > >>> which are instrumental in parsing and understanding text. The > >>> loading of Wordnet is a good start... > >> > >> This is all very experimental, but we have been trying some things in > >> this area. There's this type: > >> > >> http://www.freebase.com/type/schema/freebase/linguistic_hint > >> > >> That we've been applying as "bare properties" (no type assertion) to > >> various objects. David has even written a quick-and-dirty acre app > >> for editing these values: > >> > >> > >> > http://linguistics.dfhuynh.user.dev.freebaseapps.com/?type=/location/country > >> > >> This app can't handle types with lots of instances and could certainly > >> be improved (please feel free to clone it and do that!), but I thought > >> it was worth sharing as we're also interested in seeing all of those > >> questions you posed addressed. We were also just talking yesterday > >> about how if WordNet does get loaded in Freebase, that it might make > >> more sense for the expected type of these properties to be synsets > >> rather than strings. 
> >> > >> -jason > >> _______________________________________________ > >> Data-modeling mailing list > >> Data-modeling at freebase.com > >> http://lists.freebase.com/mailman/listinfo/data-modeling > > > > > > _______________________________________________ > > Data-modeling mailing list > > Data-modeling at freebase.com > > http://lists.freebase.com/mailman/listinfo/data-modeling > > > > > _______________________________________________ > Data-modeling mailing list > Data-modeling at freebase.com > http://lists.freebase.com/mailman/listinfo/data-modeling > -- Arthur van Hoff arthur.van.hoff at gmail.com 650-283-0842 -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.freebase.com/pipermail/data-modeling/attachments/20090819/03c807d4/attachment.htm From sm at metaweb.com Wed Aug 19 22:01:10 2009 From: sm at metaweb.com (Scott Meyer) Date: Wed, 19 Aug 2009 15:01:10 -0700 Subject: [Data-modeling] English Words In-Reply-To: References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com> <4A84C43B.3040303@metaweb.com> <4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com> <4A8C25F5.1090602@metaweb.com> Message-ID: <4A8C7626.7000105@metaweb.com> Iain Sproat wrote: > Do we have a consensus on the below schema? Looks like an excellent start! Can anyone suggest a couple of "interesting" synsets to try this out on? -Scott > /common/symbol > /common/symbol/lang expecting /type/lang > /common/symbol/synset reciprocated by /common/synset/synonym > > /common/synset > /common/synset/lang expecting /type/lang > /common/synset/synonym reciprocated by /common/symbol/synset > /common/synset/represents expecting /common/topic > > and, finally, a new property in topic: > /common/topic/synset reciprocated by /common/synset/represents > > I think that's the bare minimum to get this started. We can look at > more features/properties later once this has got off the ground. 
> > Iain > _______________________________________________ > Data-modeling mailing list > Data-modeling at freebase.com > http://lists.freebase.com/mailman/listinfo/data-modeling > From paritosh at metaweb.com Thu Aug 20 00:58:10 2009 From: paritosh at metaweb.com (Praveen Paritosh) Date: Wed, 19 Aug 2009 17:58:10 -0700 (PDT) Subject: [Data-modeling] English Words In-Reply-To: <1159393406.19391250729712348.JavaMail.root@zimbra01.corp.sjc1.metaweb.com> Message-ID: <316768215.19581250729890564.JavaMail.root@zimbra01.corp.sjc1.metaweb.com> ----- "Scott Meyer" wrote: | Iain Sproat wrote: | > Do we have a consensus on the below schema? | | Looks like an excellent start! | | Can anyone suggest a couple of "interesting" synsets to | try this out on? The English word "set" is said to have the largest entry in the OED and maps to a large number of synsets. http://wordnetweb.princeton.edu/perl/webwn?s=set From rfh at metaweb.com Thu Aug 20 02:43:04 2009 From: rfh at metaweb.com (Reilly Hayes) Date: Wed, 19 Aug 2009 19:43:04 -0700 Subject: [Data-modeling] English Words In-Reply-To: <316768215.19581250729890564.JavaMail.root@zimbra01.corp.sjc1.metaweb.com> References: <316768215.19581250729890564.JavaMail.root@zimbra01.corp.sjc1.metaweb.com> Message-ID: <5AF7A5E9-D241-432E-B554-A1B7890EAFC7@metaweb.com> Given that the Freebase schema is weak on abstract & non-material concepts, this will not link to a large number of Freebase topics. -r On Aug 19, 2009, at 5:58 PM, Praveen Paritosh wrote: > > ----- "Scott Meyer" wrote: > > | Iain Sproat wrote: > | > Do we have a consensus on the below schema? > | > | Looks like an excellent start! > | > | Can anyone suggest a couple of "interesting" synsets to > | try this out on? > > The English word "set" is said to have the largest entry in the OED and > maps to a large number of synsets.
> > http://wordnetweb.princeton.edu/perl/webwn?s=set > > _______________________________________________ > Data-modeling mailing list > Data-modeling at freebase.com > http://lists.freebase.com/mailman/listinfo/data-modeling -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 2434 bytes Desc: not available Url : http://lists.freebase.com/pipermail/data-modeling/attachments/20090819/9af1a3c3/attachment.bin From rfh at metaweb.com Thu Aug 20 02:48:04 2009 From: rfh at metaweb.com (Reilly Hayes) Date: Wed, 19 Aug 2009 19:48:04 -0700 Subject: [Data-modeling] English Words In-Reply-To: References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com> <4A84C43B.3040303@metaweb.com> <4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com> <4A8C25F5.1090602@metaweb.com> Message-ID: <3700175F-00E7-4959-9936-7ABB2D2079CA@metaweb.com> Represents may imply a stronger link than would be desirable. Also, the wordnet synset may correspond to a piece of schema (type or topic) rather than a topic. I'm eager to use this to discover candidate assertion templates (a la SnowBall), which will only work with predicate (property) mapping. -r On Aug 19, 2009, at 11:10 AM, Iain Sproat wrote: > Do we have a consensus on the below schema? > > /common/symbol > /common/symbol/lang expecting /type/lang > /common/symbol/synset reciprocated by /common/synset/synonym > > /common/synset > /common/synset/lang expecting /type/lang > /common/synset/synonym reciprocated by /common/symbol/synset > /common/synset/represents expecting /common/topic > > and, finally, a new property in topic: > /common/topic/synset reciprocated by /common/synset/represents > > I think that's the bare minimum to get this started. We can look at > more features/properties later once this has got off the ground. 
> > Iain > _______________________________________________ > Data-modeling mailing list > Data-modeling at freebase.com > http://lists.freebase.com/mailman/listinfo/data-modeling -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 2434 bytes Desc: not available Url : http://lists.freebase.com/pipermail/data-modeling/attachments/20090819/6c1bafe6/attachment-0001.bin From jason at metaweb.com Thu Aug 20 03:37:49 2009 From: jason at metaweb.com (Jason Douglas) Date: Wed, 19 Aug 2009 20:37:49 -0700 Subject: [Data-modeling] English Words In-Reply-To: <3700175F-00E7-4959-9936-7ABB2D2079CA@metaweb.com> References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com> <4A84C43B.3040303@metaweb.com> <4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com> <4A8C25F5.1090602@metaweb.com> <3700175F-00E7-4959-9936-7ABB2D2079CA@metaweb.com> Message-ID: <5D58799D-7E10-4F40-B9B5-A244570D6EC9@metaweb.com> On Aug 19, 2009, at 7:48 PM, Reilly Hayes wrote: > Also, the wordnet synset may correspond to a piece of schema (type > or topic) rather than a topic. +1. I think supporting schema objects (in both directions) is important. -jason From rfh at metaweb.com Thu Aug 20 03:45:59 2009 From: rfh at metaweb.com (Reilly Hayes) Date: Wed, 19 Aug 2009 20:45:59 -0700 Subject: [Data-modeling] English Words In-Reply-To: <3700175F-00E7-4959-9936-7ABB2D2079CA@metaweb.com> References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com> <4A84C43B.3040303@metaweb.com> <4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com> <4A8C25F5.1090602@metaweb.com> <3700175F-00E7-4959-9936-7ABB2D2079CA@metaweb.com> Message-ID: <2D2A19AF-A394-4AD9-B0C2-9A7C4CE48531@metaweb.com> On Aug 19, 2009, at 7:48 PM, Reilly Hayes wrote: > > Represents may imply a stronger link than would be desirable. 
Also, > the wordnet synset may correspond to a piece of schema (type or > topic) rather than a topic. This was supposed to read: may correspond to a piece of schema (type or property) rather than a topic. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.freebase.com/pipermail/data-modeling/attachments/20090819/eb01fa85/attachment.htm -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 2434 bytes Desc: not available Url : http://lists.freebase.com/pipermail/data-modeling/attachments/20090819/eb01fa85/attachment.bin From iainsproat at gmail.com Thu Aug 20 09:05:52 2009 From: iainsproat at gmail.com (Iain Sproat) Date: Thu, 20 Aug 2009 13:05:52 +0400 Subject: [Data-modeling] English Words In-Reply-To: <2D2A19AF-A394-4AD9-B0C2-9A7C4CE48531@metaweb.com> References: <4A84C43B.3040303@metaweb.com> <4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com> <4A8C25F5.1090602@metaweb.com> <3700175F-00E7-4959-9936-7ABB2D2079CA@metaweb.com> <2D2A19AF-A394-4AD9-B0C2-9A7C4CE48531@metaweb.com> Message-ID: On Thu, Aug 20, 2009 at 1:43 AM, Arthur van Hoff wrote: > How are the various word types (noun, verb, ...) going to be distinguished? This would be an additional feature to add on later. I would like to see the basic schema get put in place and tested before we begin adding more functionality. A property on synset would probably work. On Thu, Aug 20, 2009 at 7:45 AM, Reilly Hayes wrote: > Represents may imply a stronger link than would be desirable. Also, the > wordnet synset may correspond to a piece of schema (type or property) rather > than a topic. Would "equivalent" be a better word? or "associates"? To allow for types or properties, we'd have to alter the /common/synset/represents (or equivalent, or associates) property to expect /type/object.
Iain From philip-freebase at shadowmagic.org.uk Thu Aug 20 09:52:25 2009 From: philip-freebase at shadowmagic.org.uk (Philip Kendall) Date: Thu, 20 Aug 2009 10:52:25 +0100 Subject: [Data-modeling] Cornel West, Noam Chomsky and Angela Davis: Musical Artists In-Reply-To: References: <4A8B4A08.1090307@metaweb.com> <4A8C0AD7.2040204@ontology2.com> <4A8C4588.9050200@ontology2.com> Message-ID: <20090820095224.GA13823@sphinx.mythic-beasts.com> On Wed, Aug 19, 2009 at 03:14:06PM -0400, Tom Morris wrote: > On Wed, Aug 19, 2009 at 2:33 PM, Paul Houle wrote: > > > Well, I don't think it's entirely realistic to expect users to fix > > problems like this manually, and as you point out, the current > > interface doesn't support it. It appears this error was introduced in > > an automated import process and it ought to be fixed by an automated > > gardening process. [ ... ] > Although the data was originally imported by an automated process, the > process of choosing the Kanji name was a manual one, although users > are asked to vote on the "best" topic overall and may have had some > reason for preferring the version with the Kanji name. I think it's worth noting here that the merger between the English-named and Kanji-named topics happened on 21 August 200*7*. We've all learnt a lot about this sort of thing since then -- I don't think it's worth pursuing this one example too much. I've fixed things up so the English name is the /lang/en name, the Kanji name is the /lang/ja name and the Kanji name is an alias (so that it's displayed in the client). Cheers, Phil -- Philip Kendall http://www.shadowmagic.org.uk/ From crism at maden.org Thu Aug 20 13:18:31 2009 From: crism at maden.org (Christopher R.
Maden) Date: Thu, 20 Aug 2009 09:18:31 -0400 Subject: [Data-modeling] Cornel West, Noam Chomsky and Angela Davis: Musical Artists In-Reply-To: <4A8B4A08.1090307@metaweb.com> References: <4A8B4A08.1090307@metaweb.com> Message-ID: <4A8D4D27.4070405@maden.org> Jeff Fry wrote: > I know this has been discussed before, and that it's not easy to solve > under our current schema...but it is still quite jarring when I > encounter someone like Chomsky listed as a musical artist for having, > say recorded a speech or read a book aloud. (In Chomsky's case it's > catchy ditties such as 'Excerpt From "A Hard Choice"' and 'Pacification') > > Any thoughts as to how this could be improved, either in the data model > or in the client? There are two problems here. One is that MusicBrainz contains audiobooks and music (and really should be called AudioBrainz). When audiobooks were marked as such, they were not imported, and their credited artists were not marked as musical artists. However, not all audiobooks were correctly marked, and the import process really had no choice but to believe MusicBrainz that they were musical albums. The only fix for this is to de-type or delete things as needed. The other is the general conflation of recorded sound with music. There is an extensive proposal for addressing that: DA-598. It was deemed inexpedient previously. That said, Cornel West did, I believe, record a musical album. ~Chris -- Chris Maden, text nerd "If Buzz Aldrin really went to the moon, why is he so afraid to show his real long form birth certificate?"
-- the Internet GnuPG Fingerprint: C6E4 E2A9 C9F8 71AC 9724 CAA3 19F8 6677 0077 C319 From tfmorris at gmail.com Thu Aug 20 13:59:31 2009 From: tfmorris at gmail.com (Tom Morris) Date: Thu, 20 Aug 2009 09:59:31 -0400 Subject: [Data-modeling] English Words In-Reply-To: References: <4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com> <4A8C25F5.1090602@metaweb.com> <3700175F-00E7-4959-9936-7ABB2D2079CA@metaweb.com> <2D2A19AF-A394-4AD9-B0C2-9A7C4CE48531@metaweb.com> Message-ID: The amount of discussion this has generated clearly indicates that there's enough interest to move forward. One practical concern is licensing. The WordNet license has specific requirements on redistribution which would, by my interpretation, require modification to the Freebase licenses. It's a simple and straightforward matter of including/referencing the WordNet license, but it needs to be done. It might be worth considering a general mechanism to deal with this kind of simple MIT/BSD style license where all you really need to do is include a copy of the license. If we had a specific type for licenses and linked the imported data to the appropriate license, it would get incorporated into the TSV dumps automatically without having to modify the licenses by hand each time. Another thing which might be useful to folks working on this is to review previous work done by people attempting to incorporate/extend WordNet. One potential issue would seem to be the lack of strong identifiers. There's a report at http://www.aclweb.org/anthology/W/W08/W08-0507.pdf on the experiences of one team that attempted some work in this space. I don't have time to review it in detail, but it might be illuminating for those working on this little project. Tom From crism at maden.org Thu Aug 20 14:22:13 2009 From: crism at maden.org (Christopher R.
Maden) Date: Thu, 20 Aug 2009 10:22:13 -0400 Subject: [Data-modeling] Cornel West, Noam Chomsky and Angela Davis: Musical Artists In-Reply-To: <4A8C0AD7.2040204@ontology2.com> References: <4A8B4A08.1090307@metaweb.com> <4A8C0AD7.2040204@ontology2.com> Message-ID: <4A8D5C15.5050102@maden.org> Paul Houle wrote: > It would be nice to have something that recognizes the commonalities of > "/music/recorded_speaker" and "/music/artist" however... Is there a way > to distinguish the "Singer" role from the "Musical Artist" role? It > would be nice to have a way to tag who did what on each track, which > would be interesting for groups like > > http://www.freebase.com/view/en/the_beatles > > or > > http://www.freebase.com/view/en/shonen_knife > > where different vocalists are on different tracks. Jeff P. has already observed that there are relationships for this. Within a group, the membership CVT can indicate the role, and albums and tracks have specifics. I've done some detailed data entry for a few Tori Amos albums (part of a very slow OCD project on my CD collection), by way of example. > Speaking of which, the record for > > http://www.freebase.com/view/en/naoko_yamano > > has her name in Japanese kanji as a primary name, which isn't going to > be useful at all for non-Japanese speakers (I see the "ko" at the end > and that's it.) More MusicBrainz fun. MusicBrainz only contains one name for any artist, and I decided it was better for an English user to see an incomprehensible name (which might actually be comprehensible to some users) than none at all. A project to transliterate the non-Latin names never got out of my imagination, in part because the lack of any linguistic information at all made it difficult or impossible to know what transliteration to use for hanzi. ~Chris -- Chris Maden, text nerd "If Buzz Aldrin really went to the moon, why is he so afraid to show his real long form birth certificate?"
-- the Internet GnuPG Fingerprint: C6E4 E2A9 C9F8 71AC 9724 CAA3 19F8 6677 0077 C319 From arthur.van.hoff at gmail.com Thu Aug 20 15:22:35 2009 From: arthur.van.hoff at gmail.com (Arthur van Hoff) Date: Thu, 20 Aug 2009 08:22:35 -0700 Subject: [Data-modeling] English Words In-Reply-To: References: <4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com> <4A8C25F5.1090602@metaweb.com> <3700175F-00E7-4959-9936-7ABB2D2079CA@metaweb.com> <2D2A19AF-A394-4AD9-B0C2-9A7C4CE48531@metaweb.com> Message-ID: On Thu, Aug 20, 2009 at 2:05 AM, Iain Sproat wrote: > On Thu, Aug 20, 2009 at 1:43 AM, Arthur van > Hoff wrote: > > How are the various word types (noun, verb, ...) going to be > distinguished? > > This would be an additional feature to add on later. I would like to > see the basic schema get put in place and tested before we begin > adding more functionality. A property on synset would probably work. > That is a shame. Nouns in wordnet are the least interesting, they are incomplete, and they are often abstract, and in general won't directly correspond with Freebase topics. I'm hoping that we can capture most/all of Wordnet data, so that the data can be seen as an alternative source. A subset is not useful to us. It would seem easy enough to add a property on synset to get started with other word types at the same time. -- Arthur van Hoff arthur.van.hoff at gmail.com 650-283-0842 -------------- next part -------------- An HTML attachment was scrubbed...
URL: http://lists.freebase.com/pipermail/data-modeling/attachments/20090820/c104bc61/attachment.htm From iainsproat at gmail.com Thu Aug 20 16:11:41 2009 From: iainsproat at gmail.com (Iain Sproat) Date: Thu, 20 Aug 2009 20:11:41 +0400 Subject: [Data-modeling] English Words In-Reply-To: References: <4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com> <4A8C25F5.1090602@metaweb.com> <3700175F-00E7-4959-9936-7ABB2D2079CA@metaweb.com> <2D2A19AF-A394-4AD9-B0C2-9A7C4CE48531@metaweb.com> Message-ID: On Thu, Aug 20, 2009 at 7:22 PM, Arthur van Hoff wrote: > That is a shame. Just until https://bugs.freebase.com/browse/DA-899 gets addressed. I'd hate to see it stall through feature creep. > It would seem easy enough to add a property on synset to get started with other word types at the same time. Perhaps a property /common/synset/lexical_category expecting a new type /language/lexical_category? > I'm hoping that we can capture most/all of > Wordnet data, so that the data can be seen as an alternative source. A > subset is not useful to us. +1. Hopefully the entire WordNet lexicon will be imported. A few additional data points might be left out initially, but we can develop the schema and they can be added as we progress. Iain From iainsproat at gmail.com Thu Aug 20 17:39:44 2009 From: iainsproat at gmail.com (Iain Sproat) Date: Thu, 20 Aug 2009 21:39:44 +0400 Subject: [Data-modeling] English Words In-Reply-To: References: <4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com> <4A8C25F5.1090602@metaweb.com> <3700175F-00E7-4959-9936-7ABB2D2079CA@metaweb.com> <2D2A19AF-A394-4AD9-B0C2-9A7C4CE48531@metaweb.com> Message-ID: On Thu, Aug 20, 2009 at 5:59 PM, Tom Morris wrote: > The amount of discussion this has generated clearly indicates that > there's enough interest to move forward. >..... > Another thing which might be useful to folks working on this is to > review previous work done by people attempting to incorporate/extend > WordNet.
One potential issue would seem to be the lack of strong > identifiers. There's a report at > http://www.aclweb.org/anthology/W/W08/W08-0507.pdf on the experiences > of one team that attempted some work in this space. Had a quick read, and here's my exec summary: They used WordNet's accompanying "outdated legacy" software and "out-of-date" text-based data structure as the technical framework for their research and failed. The software limits a word to only 16 synsets, synset files are restricted to "988 direct hyponymous synsets", it uses a custom character escaping format, the length of a word is restricted to 425 characters, and the software doesn't use GUIDs and instead uses the word string, which is an issue where homonyms are considered. The software also throws "poor, cryptic error messages" and is "insufficiently documented". Provided Freebase doesn't suffer from these issues, we have a greater chance of a positive outcome. Someone mentioned previously in this thread that the WordNet data had been parsed in the very early days of Freebase - I'd assume that any problems with WordNet's "rather idiosyncratic, fully text-based data structure" have been overcome by the data importing folks on Freebase's staff already. The paper does mention the Open Biomedical Ontologies Foundry - http://www.obofoundry.org/. A collection of biomedical-based ontologies - worth a look for schema ideas and data. Selected references of interest cited in the paper (and pdf links): Boyd-Graber et al. 2006 - proposal to add weighted connections between synsets http://wordnet.cs.princeton.edu/papers/wordnetplusintro.pdf Oltramari et al. 2002 - suggestion to remodel WordNet's taxonomical structure http://www.loa-cnr.it/Papers/Oltramari2.pdf Bodenreider and Burgun 2002 Characterizing the definitions of anatomical concepts in WordNet http://mor.nlm.nih.gov/pubs/pdf/2002-wordnet-ob.pdf Bentivogli et.
al Integrating WordNet with domain-specific knowledge http://www.fi.muni.cz/gwc2004/proc/101.pdf Iain From rfh at metaweb.com Thu Aug 20 19:35:03 2009 From: rfh at metaweb.com (Reilly Hayes) Date: Thu, 20 Aug 2009 12:35:03 -0700 Subject: [Data-modeling] English Words In-Reply-To: References: <4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com> <4A8C25F5.1090602@metaweb.com> <3700175F-00E7-4959-9936-7ABB2D2079CA@metaweb.com> <2D2A19AF-A394-4AD9-B0C2-9A7C4CE48531@metaweb.com> Message-ID: <648A0754-B8AF-4CEF-9F08-59B97D20275A@metaweb.com> On Aug 20, 2009, at 10:39 AM, Iain Sproat wrote: > I'd assume that any problems with WordNet's "rather > idiosyncratic, fully text-based data structure" have been overcome by > the data importing folks on Freebase's staff already. Please look at the Prolog version of the WordNet db on the WordNet site. This is already a solved problem. -r -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 2434 bytes Desc: not available Url : http://lists.freebase.com/pipermail/data-modeling/attachments/20090820/a0021025/attachment.bin From jeff at metaweb.com Thu Aug 20 19:37:27 2009 From: jeff at metaweb.com (Jeff Prucher) Date: Thu, 20 Aug 2009 12:37:27 -0700 Subject: [Data-modeling] English Words In-Reply-To: References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com><4A84C43B.3040303@metaweb.com><4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com><4A8C25F5.1090602@metaweb.com> Message-ID: <16D5A88B78354829B432058C589D7836@amd> My one concern about this is that, if I understand us correctly, /common/symbol will be the equivalent of WordNet's "base word", and the CVT /common/synset will represent the various senses of each base word. When we start modeling synonyms, hypernyms, etc., these will be linked to the synsets, which will mean linkages to CVTs, rather than standard types. 
I'm not sure how big a problem this will be -- one of the reasons not to include /common/topic is to allow this modeling to happen without mucking up the client too much, so the fact that CVTs linking to CVTs is a big no-no in the client may not affect this. (It's not a no-no in MQL, but anyone building an interface around this is going to have to deal with this.) So that's probably one reason to try out a bare-bones model to make sure it even does what we want it to before tacking on additional properties. Jeff > -----Original Message----- > From: data-modeling-bounces at freebase.com > [mailto:data-modeling-bounces at freebase.com] On Behalf Of Iain Sproat > Sent: Wednesday, August 19, 2009 11:11 AM > To: Freebase data modeling mailing list > Subject: Re: [Data-modeling] English Words > > Do we have a consensus on the below schema? > > /common/symbol > /common/symbol/lang expecting /type/lang > /common/symbol/synset reciprocated by /common/synset/synonym > > /common/synset > /common/synset/lang expecting /type/lang > /common/synset/synonym reciprocated by /common/symbol/synset > /common/synset/represents expecting /common/topic > > and, finally, a new property in topic: > /common/topic/synset reciprocated by /common/synset/represents > > I think that's the bare minimum to get this started. We can > look at more features/properties later once this has got off > the ground.
> > Iain > _______________________________________________ > Data-modeling mailing list > Data-modeling at freebase.com > http://lists.freebase.com/mailman/listinfo/data-modeling > From jeff at metaweb.com Thu Aug 20 19:42:54 2009 From: jeff at metaweb.com (Jeff Prucher) Date: Thu, 20 Aug 2009 12:42:54 -0700 Subject: [Data-modeling] English Words In-Reply-To: References: <4A84C43B.3040303@metaweb.com><4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com><4A8C25F5.1090602@metaweb.com><3700175F-00E7-4959-9936-7ABB2D2079CA@metaweb.com><2D2A19AF-A394-4AD9-B0C2-9A7C4CE48531@metaweb.com> Message-ID: > -----Original Message----- > From: data-modeling-bounces at freebase.com > [mailto:data-modeling-bounces at freebase.com] On Behalf Of Iain Sproat > Sent: Thursday, August 20, 2009 2:06 AM > To: Freebase data modeling mailing list > Subject: Re: [Data-modeling] English Words > > On Thu, Aug 20, 2009 at 1:43 AM, Arthur van > Hoff wrote: > > How are the various word types (noun, verb, ...) going to > be distinguished? > > This would be an additional feature to add on later. I would > like to see the basic schema get put in place and tested > before we begin adding more functionality. A property on > synset would probably work. This depends on our implementation. If "rat (n.)" and "rat (v.)" are two different instances of /common/symbol, then a property for part of speech should go on /common/symbol. If, however, we think the noun and verb share the same /common/symbol node, synset would be the right place. I'd vote for the former, especially since we almost certainly are going to want to be able to include inflected forms -- which vary between parts of speech, but we should see how WordNet does this, too. > On Thu, Aug 20, 2009 at 7:45 AM, Reilly Hayes wrote: > > Represents may imply a stronger link than would be > desirable. Also, > > the wordnet synset may correspond to a piece of schema (type or > > property) rather than a topic.
Rather than say "rather", I'd say "as well as". The synset for "person" that means an individual human being corresponds both to /people/person and /en/person. > Would "equivalent" be a better word? or "associates"? > To allow for types or properties, we'd have to alter the > /common/synset/represents (or equivalent, or associates) > property to expect /type/object. The technical term is "designatum", but nobody knows what that means, so I From jeff at metaweb.com Thu Aug 20 22:00:28 2009 From: jeff at metaweb.com (Jeff Prucher) Date: Thu, 20 Aug 2009 15:00:28 -0700 Subject: [Data-modeling] English Words In-Reply-To: <16D5A88B78354829B432058C589D7836@amd> References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com><4A84C43B.3040303@metaweb.com><4A89D4AB.2050403@metaweb.com><20090818223958.GM6305@spaceship.com><4A8C25F5.1090602@metaweb.com> <16D5A88B78354829B432058C589D7836@amd> Message-ID: <61459498606B4147B388DBCBC85AB50B@amd> I've loaded a schema to sandbox: I linked synset to /type/object, per recent discussion, and omitted "lang" from the synset, since lang is already on symbol. Please add some data and see how it works! Jeff > -----Original Message----- > From: data-modeling-bounces at freebase.com > [mailto:data-modeling-bounces at freebase.com] On Behalf Of Jeff Prucher > Sent: Thursday, August 20, 2009 12:37 PM > To: 'Freebase data modeling mailing list' > Subject: Re: [Data-modeling] English Words > > My one concern about this is that, if I understand us > correctly, /common/symbol will be the equivalent of WordNet's > "base word", and the CVT /common/synset will represent the > various senses of each base word. When we start modeling > synonyms, hypernyms, etc., these will be linked to the > synsets, which will mean linkages to CVTs, rather than > standard types. 
I'm not sure how big a problem this will -- > one of the reasons not to include /common/topic is to allow > this modeling to happen without mucking up the client too > much, so that fact that CVTs linking to CVTs is a big no-no > in the client may not affect this. (It's not a no-no in MQL, > but anyone building an interface around this is going to have > to deal with this.) > > So that's probably one reason to try out a bare-bones model > to make sure it even does what we want it to before tacking > on additional properties. > > Jeff > > > -----Original Message----- > > From: data-modeling-bounces at freebase.com > > [mailto:data-modeling-bounces at freebase.com] On Behalf Of Iain Sproat > > Sent: Wednesday, August 19, 2009 11:11 AM > > To: Freebase data modeling mailing list > > Subject: Re: [Data-modeling] English Words > > > > Do we have a consensus on the below schema? > > > > /common/symbol > > /common/symbol/lang expecting /type/lang /common/symbol/synset > > reciprocated by /common/synset/synonym > > > > /common/synset > > /common/synset/lang expecting /type/lang /common/synset/synonym > > reciprocated by /common/symbol/synset /common/synset/represents > > expecting /common/topic > > > > and, finally, a new property in topic: > > /common/topic/synset reciprocated by /common/synset/represents > > > > I think that's the bare minimum to get this started. We > can look at > > more features/properties later once this has got off the ground. 
> > > > Iain > > _______________________________________________ > > Data-modeling mailing list > > Data-modeling at freebase.com > > http://lists.freebase.com/mailman/listinfo/data-modeling > > > > _______________________________________________ > Data-modeling mailing list > Data-modeling at freebase.com > http://lists.freebase.com/mailman/listinfo/data-modeling > From sm at metaweb.com Thu Aug 20 21:40:58 2009 From: sm at metaweb.com (Scott Meyer) Date: Thu, 20 Aug 2009 14:40:58 -0700 Subject: [Data-modeling] English Words In-Reply-To: <16D5A88B78354829B432058C589D7836@amd> References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com><4A84C43B.3040303@metaweb.com><4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com><4A8C25F5.1090602@metaweb.com> <16D5A88B78354829B432058C589D7836@amd> Message-ID: <4A8DC2EA.5020100@metaweb.com> Jeff Prucher wrote: > My one concern about this is that, if I understand us correctly, > /common/symbol will be the equivalent of WordNet's "base word", and the CVT > /common/synset will represent the various senses of each base word. When we > start modeling synonyms, hypernyms, etc., these will be linked to the > synsets, which will mean linkages to CVTs, rather than standard types. I'm > not sure how big a problem this will -- one of the reasons not to include > /common/topic is to allow this modeling to happen without mucking up the > client too much, so that fact that CVTs linking to CVTs is a big no-no in > the client may not affect this. (It's not a no-no in MQL, but anyone > building an interface around this is going to have to deal with this.) > > So that's probably one reason to try out a bare-bones model to make sure it > even does what we want it to before tacking on additional properties. I don't think that synsets should be CVTs. Ian didn't model them that way: http://www.freebase.com/type/schema/base/writing/symset?domain=%2Fbase%2Fwriting All of the linkages between synsets are direct. 
Certainly hypernym/hyponym are no problem. Antonym is bidirectional, already modeled as a CVT. -Scott From jeff at metaweb.com Thu Aug 20 22:28:54 2009 From: jeff at metaweb.com (Jeff Prucher) Date: Thu, 20 Aug 2009 15:28:54 -0700 Subject: [Data-modeling] English Words In-Reply-To: <4A8DC2EA.5020100@metaweb.com> References: <775e001a0908130111x50dcc92fn561c449b16d25a70@mail.gmail.com><4A84C43B.3040303@metaweb.com><4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com><4A8C25F5.1090602@metaweb.com> <16D5A88B78354829B432058C589D7836@amd> <4A8DC2EA.5020100@metaweb.com> Message-ID: > -----Original Message----- > From: data-modeling-bounces at freebase.com > [mailto:data-modeling-bounces at freebase.com] On Behalf Of Scott Meyer > Sent: Thursday, August 20, 2009 2:41 PM > To: Freebase data modeling mailing list > Subject: Re: [Data-modeling] English Words > > Jeff Prucher wrote: > > My one concern about this is that, if I understand us correctly, > > /common/symbol will be the equivalent of WordNet's "base word", and > > the CVT /common/synset will represent the various senses of > each base > > word. When we start modeling synonyms, hypernyms, etc., > these will be > > linked to the synsets, which will mean linkages to CVTs, > rather than > > standard types. I'm not sure how big a problem this will -- one of > > the reasons not to include /common/topic is to allow this > modeling to > > happen without mucking up the client too much, so that fact > that CVTs > > linking to CVTs is a big no-no in the client may not affect this. > > (It's not a no-no in MQL, but anyone building an interface > around this > > is going to have to deal with this.) > > > > So that's probably one reason to try out a bare-bones model to make > > sure it even does what we want it to before tacking on > additional properties. > > I don't think that synsets should be CVTs. 
Ian didn't model > them that way: > > http://www.freebase.com/type/schema/base/writing/symset?domain =%2Fbase%2Fwriting > > All of the linkages between synsets are direct. Certainly > hypernym/hyponym are no problem. Antonym is bidirectional, > already modeled as a CVT. Ah -- I see now; don't know how I missed that. I've updated the schema on sandbox to reflect that. Jeff From sm at metaweb.com Thu Aug 20 22:33:10 2009 From: sm at metaweb.com (Scott Meyer) Date: Thu, 20 Aug 2009 15:33:10 -0700 Subject: [Data-modeling] English Words In-Reply-To: References: <4A84C43B.3040303@metaweb.com><4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com><4A8C25F5.1090602@metaweb.com><3700175F-00E7-4959-9936-7ABB2D2079CA@metaweb.com><2D2A19AF-A394-4AD9-B0C2-9A7C4CE48531@metaweb.com> Message-ID: <4A8DCF26.3080909@metaweb.com> Jeff Prucher wrote: > > >> -----Original Message----- >> From: data-modeling-bounces at freebase.com >> [mailto:data-modeling-bounces at freebase.com] On Behalf Of Iain Sproat >> Sent: Thursday, August 20, 2009 2:06 AM >> To: Freebase data modeling mailing list >> Subject: Re: [Data-modeling] English Words >> >> On Thu, Aug 20, 2009 at 1:43 AM, Arthur van >> Hoff wrote: >>> How are the various word types (noun, verb, ...) going to >> be distinguished? >> >> This would be an additional feature to add on later. I would >> like to see the basic schema get put in place and tested >> before we begin adding more functionality. A property on >> synset would probably work. > > This depends on our implementation. If "rat (n.)" and "rat (v.)" are two > different instances of /common/symbol, then a property for part of speech > should go on /common/symbol. If, however, we think the noun and verb share > the same /common/symbol node, synset would be the right place. 
I'd vote for > the former, especially since we almost certainly are going to want to be > able to include inflected forms -- which vary between parts of speech, but > we should see how WordNet does this, too. Looking at the prolog schema, it appears that a "word" is just a synset with one member. Part-of-speech, encoded as a byte in front of the 8-byte synset id, applies equally to words and synsets. So, I think that the natural translation into Freebase is to have /common/symbol include /common/synset, with the addition of the /common/symbol type meaning: "the name of this object is an English word" with the definition as given by /common/synset/definition. This seems to handle the existing "rat" example nicely: "rat" is both a symbol and a synset containing hyponyms: pocket rat, brown rat, etc. We will have topic<->synset mappings for "rodent", "rat", "brown rat", etc. http://wordnetweb.princeton.edu/perl/webwn?o2=&o0=1&o7=&o5=&o1=1&o6=&o4=&o3=1&s=rat&i=11&h=1101000000000100000100000000#c There doesn't appear to be any generic symbol for "rat-in-any-part-of-speech". Searching for rat yields two separate synset hierarchies, one for nouns, one for verbs. So back to /common/symbol, does it add any properties to synset? Ian's model, http://www.freebase.com/type/schema/base/writing/word adds morpheme and stem, but those don't seem to come from WordNet. Entomology would be another candidate.
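The synset id layout described above (one leading part-of-speech byte followed by an 8-byte synset id) can be sketched roughly as follows. This is only an illustration: the digit-to-category mapping (1 noun, 2 verb, 3 adjective, 4 adverb) is taken from memory of the WordNet Prolog documentation and should be checked against it, the function name is hypothetical, and the ids used with it are made up rather than real WordNet entries.

```python
from typing import Tuple

# Assumed mapping from the leading digit of a Prolog synset_id to its
# syntactic category (verify against the WordNet prologdb documentation).
CATEGORY_BY_DIGIT = {"1": "noun", "2": "verb", "3": "adjective", "4": "adverb"}


def split_synset_id(synset_id: str) -> Tuple[str, str]:
    """Split a 9-digit synset_id into (category, 8-digit synset offset)."""
    if len(synset_id) != 9 or not synset_id.isdigit():
        raise ValueError("expected a 9-digit synset id, got %r" % synset_id)
    category = CATEGORY_BY_DIGIT.get(synset_id[0])
    if category is None:
        raise ValueError("unknown category digit in %r" % synset_id)
    # The remaining eight digits identify the synset within that category.
    return category, synset_id[1:]
```

Under this reading, a noun "rat" and a verb "rat" necessarily get different ids, which matches the observation that there is no generic symbol covering every part of speech.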
-Scott From rfh at metaweb.com Fri Aug 21 15:38:21 2009 From: rfh at metaweb.com (Reilly Hayes) Date: Fri, 21 Aug 2009 08:38:21 -0700 Subject: [Data-modeling] English Words In-Reply-To: <4A8DCF26.3080909@metaweb.com> References: <4A84C43B.3040303@metaweb.com><4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com><4A8C25F5.1090602@metaweb.com><3700175F-00E7-4959-9936-7ABB2D2079CA@metaweb.com><2D2A19AF-A394-4AD9-B0C2-9A7C4CE48531@metaweb.com> <4A8DCF26.3080909@metaweb.com> Message-ID: <7AAD8D69-A32B-4251-907C-086EDA2CD03D@metaweb.com> > > adds morpheme and stem, but those don't seem to come from Wordnet. > Entomology > would be another candidate. I don't mean to bug you, but I think you meant etymology. -r -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 2434 bytes Desc: not available Url : http://lists.freebase.com/pipermail/data-modeling/attachments/20090821/da7d7a57/attachment.bin From sm at metaweb.com Fri Aug 21 17:14:05 2009 From: sm at metaweb.com (Scott Meyer) Date: Fri, 21 Aug 2009 10:14:05 -0700 Subject: [Data-modeling] English Words In-Reply-To: <7AAD8D69-A32B-4251-907C-086EDA2CD03D@metaweb.com> References: <4A84C43B.3040303@metaweb.com><4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com><4A8C25F5.1090602@metaweb.com><3700175F-00E7-4959-9936-7ABB2D2079CA@metaweb.com><2D2A19AF-A394-4AD9-B0C2-9A7C4CE48531@metaweb.com> <4A8DCF26.3080909@metaweb.com> <7AAD8D69-A32B-4251-907C-086EDA2CD03D@metaweb.com> Message-ID: <4A8ED5DD.1040206@metaweb.com> Reilly Hayes wrote: > >> >> adds morpheme and stem, but those don't seem to come from Wordnet. >> Entomology >> would be another candidate. > > > I don't mean to bug you, but I think you meant etymology. Oh, these insect-level concerns about words and what they mean... My bad. 
-Scott From iainsproat at gmail.com Fri Aug 21 19:01:15 2009 From: iainsproat at gmail.com (Iain Sproat) Date: Fri, 21 Aug 2009 23:01:15 +0400 Subject: [Data-modeling] English Words In-Reply-To: <4A8DCF26.3080909@metaweb.com> References: <20090818223958.GM6305@spaceship.com> <4A8C25F5.1090602@metaweb.com> <3700175F-00E7-4959-9936-7ABB2D2079CA@metaweb.com> <2D2A19AF-A394-4AD9-B0C2-9A7C4CE48531@metaweb.com> <4A8DCF26.3080909@metaweb.com> Message-ID: On Fri, Aug 21, 2009 at 2:33 AM, Scott Meyer wrote: > Looking at the prolog schema, it appears that a "word" is just a synset with > one member. Part-of-speech, encoded as a byte in front of the 8-byte > synset id, applies equally to words and synsets. Am I right in thinking this assumes that each homograph is a unique /common/symbol? > So, I think that the natural translation into Freebase is to have > /common/symbol include /common/synset, with the addition of the > /common/symbol type meaning: "the name of this object is an English > word" with the definition as given by /common/synset/definition. Seems to be breaking away from synset as a synonym *set* i.e. predominantly a group of synonyms, rather than a single word. Not that it's wrong, but perhaps counter-intuitive. If all homographs of a word have a separate /common/symbol object, I'd agree that it would be better if /common/synset was included in /common/symbol rather than the current unique /common/synset/parent property. > This seems to handle the existing "rat" example nicely, "rat" is both > a symbol and a synset containing hyponyms: pocket rat, brown rat, etc. > We will have topic<->synset mappings for "rodent", "rat", "brown rat", > etc. Synonyms - We'd also need a new property for noting synonymous synsets; i.e. those symbols that are interchangeable into a phrase but still retain the semantics of that phrase in its context. > So back to /common/symbol, does it add any properties to synset?
Ian's model, > > http://www.freebase.com/type/schema/base/writing/word > > adds morpheme and stem, but those don't seem to come from Wordnet. Etymology > would be another candidate. Verbal pronunciation - where does this fit in? What about hieroglyphics? On Fri, Aug 21, 2009 at 2:00 AM, Jeff Prucher wrote: >I linked synset to /type/object, per recent discussion, and omitted "lang" >from the synset, since lang is already on symbol. Silly question, but how are 'borrowed' words modelled? e.g. "Et Cetera". Is this a Latin word in an English synset (i.e. both a /common/symbol/lang & /common/synset/lang are required)? Or is it an English word in an English synset but with a Latin source in its etymology (so only one of the /lang properties is required, as you've suggested)? Iain From spatial.db at gmail.com Fri Aug 21 20:06:57 2009 From: spatial.db at gmail.com (Ed Laurent) Date: Fri, 21 Aug 2009 16:06:57 -0400 Subject: [Data-modeling] English Words In-Reply-To: References: <4A8C25F5.1090602@metaweb.com> <3700175F-00E7-4959-9936-7ABB2D2079CA@metaweb.com> <2D2A19AF-A394-4AD9-B0C2-9A7C4CE48531@metaweb.com> <4A8DCF26.3080909@metaweb.com> Message-ID: On Fri, Aug 21, 2009 at 3:01 PM, Iain Sproat wrote: > > > > So, I think that the natural translation into Freebase is to have > > /common/symbol include /common/synset, with the addition of the > > /common/symbol type meaning: "the name of this object is an English > > word" with the definition as given by the /common/symset/definition" > > Seems to be breaking away from synset as a synonym *set* i.e. > predominantly a group of synonyms, rather than a single word. > Not that it's wrong, but perhaps counter-intuitive. I agree that this discussion has been wandering in and out of symbols vs. semantics and that has been the source of my confusion. If we are only modeling symbols, then I can understand why hyper-, hypo-, and morphonyms are included in the model but not synonyms and antonyms.
I also don't understand why homonyms have been dismissed from the discussion because those seem more relevant to the symbol than synonyms or antonyms. If we are talking about symbols and synonyms, then these should be modeled separately IMO. That would allow the symbol properties to be applied to homonyms (maybe not always?), and allow synonyms and antonyms topics to have different relationships with different homonym topics. Again, I plead ignorance on this discussion subject but, in general, lumping symbol and semantic properties in the same schema doesn't make sense to me. -Ed -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.freebase.com/pipermail/data-modeling/attachments/20090821/e9b2beba/attachment.htm From rfh at metaweb.com Fri Aug 21 23:34:26 2009 From: rfh at metaweb.com (Reilly Hayes) Date: Fri, 21 Aug 2009 16:34:26 -0700 Subject: [Data-modeling] The Curse of the ISBN Message-ID: <61B633FC-96B7-4A7E-BE72-B16B0E5DD8A1@metaweb.com> Hello All -- One of the challenges of loading books is dealing with ISBNs. Both the ISO and Wikipedia claim that they are unique identifiers for book editions. Because of this, we'd really like ISBNs to act as keys within Freebase. Ideally, we'd like to have an /isbn/ namespace so that people can externally reference book editions in Freebase with an ISBN-based URI. However, experience in the field shows that ISBNs aren't guaranteed to be unique. Publishers can and do reuse ISBNs. Sometimes they are reused for a completely different book. More commonly, they are reused for the same book but with differences in format or binding. This is still a small subset of cases, but it is common enough that we can't ignore or skip these cases. But Freebase keys can point to one and only one Freebase topic. Once a value wants to point to two or more topics, it can no longer be used as a key. So, we're left with a paradox. 
ISBNs should act like keys, allowing external users to reference Freebase entities by ISBNs -- but ISBNs can't be keys, since we can't guarantee uniqueness. And note that ISBNs are not the only identifiers that have this problem: UPC codes are also notoriously reused. Freebase needs some way to deal with these "weak keys" that somehow solves all of these constraints in a general way. Specifically, a "weak key" should: (1) provide a consistent pattern that can be used across all weak keys; (2) provide a mechanism to pretend the key is strong by returning a single "best" item; (3) clearly demarcate that the semantics in the keyspace are different from "normal" keys; (4) allow identification of all entities that share the weak key. We've spent quite a bit of time over the last few months discussing ways to resolve this conundrum, and we think we've finally come up with an acceptable solution that we'd like to get your feedback on. The basic idea is that ISBNs should point to their own dedicated nodes of type /book/isbn. Then, instead of having a /book/book_edition/isbn be a /type/rawstring value, it will instead be a property link to the /book/isbn node. A root-level namespace ("/weak/") will be created that holds all namespaces with the weak key nature. Keys in the weak namespace point to weak key containers. For example "/weak/isbn/9780670063260" will point to the "container node" for that ISBN. Weak key containers for ISBN will be typed as /book/isbn. The /book/book_edition/isbn13 will be created as a property that points to nodes with an expected type of /book/isbn. Add a property to the key value type reversing the property from the target type (/book/isbn/items). (Note that, because of permissioning, it is essential that the master property be FROM /book/book_edition TO /book/isbn.) Containers will be named with the ISBN (for client display purposes). For example, container node "/weak/isbn/9780670063260" will be named "9780670063260".
The container node is co-typed as a namespace, containing the single key "best" that points to the object that is "best". For example, "/weak/isbn/9780670063260/best" would resolve to http://www.freebase.com/edit/topic/guid/9202a8c04000641f80000000099fe6b6 Gardening tasks will be created that will look for /book/isbn nodes that don't fit these rules, and create all necessary links so that the rules are fulfilled. We've thought through all the consequences of this proposal, and we're fairly certain that this proposal gives us the desired behavior, without too many adverse side effects. We can go into the details in follow-up emails if you're interested. Please let us know your thoughts. We'd like to implement this proposal (along with ISBN13 normalization, remember that?) before the end of the month. -r -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.freebase.com/pipermail/data-modeling/attachments/20090821/e5e79ac3/attachment.htm -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 2434 bytes Desc: not available Url : http://lists.freebase.com/pipermail/data-modeling/attachments/20090821/e5e79ac3/attachment.bin From stefano at metaweb.com Sat Aug 22 17:43:34 2009 From: stefano at metaweb.com (Stefano Mazzocchi) Date: Sat, 22 Aug 2009 10:43:34 -0700 Subject: [Data-modeling] The Curse of the ISBN In-Reply-To: <61B633FC-96B7-4A7E-BE72-B16B0E5DD8A1@metaweb.com> References: <61B633FC-96B7-4A7E-BE72-B16B0E5DD8A1@metaweb.com> Message-ID: <4A902E46.5000205@metaweb.com> Reilly Hayes wrote: > > Hello All -- > > One of the challenges of loading books is dealing with ISBNs. Both the > ISO and Wikipedia claim that they are unique identifiers for book > editions. Because of this, we'd really like ISBNs to act as keys within > Freebase.
Ideally, we'd like to have an /isbn/ namespace so that people > can externally reference book editions in Freebase with an ISBN-based URI. > > However, experience in the field shows that ISBNs aren't guaranteed to > be unique. Publishers can and do reuse ISBNs. Sometimes they are > reused for a completely different book. More commonly, they are reused > for the same book but with differences in format or binding. This is > still a small subset of cases, but it is common enough that we can't > ignore or skip these cases. But Freebase keys can point to one and only > one Freebase topic. Once a value wants to point to two or more topics, > it can no longer be used as a key. > > So, we're left with a paradox. ISBNs should act like keys, allowing > external users to reference Freebase entities by ISBNs -- but ISBNs > can't be keys, since we can't guarantee uniqueness. And note that ISBNs > are not the only identifiers that have this problem: UPC codes are also > notoriously reused. Freebase needs some way to deal with these "weak > keys" that somehow solves all of these constraints in a general way. > Specifically, a "weak key" should: > > 1. Provide a consistent pattern that can be used across all weak keys > 2. Provide a mechanism to pretend the key is strong by returning a > single "best" item > 3. Clearly demarcate that the semantics in the keyspace are different > from "normal" keys > 4. Allow identification of all entities that share the weak key > > We've spent quite a bit of time over the last few months discussing ways > to resolve this conundrum, and we think we've finally come up with an acceptable solution that we'd like to get your feedback on. The basic idea is > that ISBNs should point to their own dedicated nodes of type /book/isbn. > Then, instead of having a /book/book_edition/isbn be a /type/rawstring > value, it will instead be a property link to the /book/isbn node. > > 1.
A root-level namespace ("/weak/") will be created that holds all > namespaces with the weak key nature. > 2. Keys in the weak namespace point to weak key containers. For > example "/weak/isbn/9780670063260" will point to the "container > node" for that ISBN. > 3. Weak key containers for ISBN will be typed as /book/isbn. > 4. The /book/book_edition/isbn13 will be created as a property that > points to nodes with an expected type of /book/isbn. > 5. Add a property to the key value type reversing the property from > the target type (/book/isbn/items). (Note that, because of > permissioning, it is essential that the master property be FROM > /book/book_edition TO /book/isbn.) > 6. Containers will be named with the ISBN (for client display > purposes). For example, container node "/weak/isbn/9780670063260" > will be named "9780670063260". > 7. The container node is co-typed as a namespace, containing the single > key "best" that points to the object that is "best". For example, > "/weak/isbn/9780670063260/best" would resolve > to http://www.freebase.com/edit/topic/guid/9202a8c04000641f80000000099fe6b6 > 8. Gardening tasks will be created that will look for /book/isbn > nodes that don't fit these rules, and create all necessary links > so that the rules are fulfilled. > > We've thought through all the consequences of this proposal, and we're > fairly certain that this proposal gives us the desired behavior, without > too many adverse side effects. We can go into the details in follow-up > emails if you're interested. > > Please let us know your thoughts. We'd like to implement this > proposal (along with ISBN13 normalization, remember that?) before the > end of the month.
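The container scheme in the numbered list above can be sketched in a few lines. This is a minimal in-memory illustration only, not Freebase or MQL code: the class names, the `link` and `resolve` methods, and the edition ids are all hypothetical, and defaulting "best" to the first linked edition is a placeholder assumption, since the proposal leaves that choice to the gardening tasks.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union


@dataclass
class WeakKeyContainer:
    """Hypothetical stand-in for a /book/isbn container node."""
    name: str                                        # named with the ISBN for display
    items: List[str] = field(default_factory=list)   # every edition sharing the ISBN
    best: Optional[str] = None                       # target of the single "best" key


class WeakNamespace:
    """Hypothetical stand-in for a namespace such as /weak/isbn."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix
        self._containers: Dict[str, WeakKeyContainer] = {}

    def link(self, key: str, topic_id: str) -> WeakKeyContainer:
        """Attach a topic to the container for `key`, creating it if needed."""
        container = self._containers.setdefault(key, WeakKeyContainer(name=key))
        if topic_id not in container.items:
            container.items.append(topic_id)
        if container.best is None:
            # Placeholder policy (an assumption): first linked topic is "best".
            container.best = topic_id
        return container

    def resolve(self, path: str) -> Union[WeakKeyContainer, Optional[str]]:
        """Resolve '<prefix>/<key>' to a container, '<prefix>/<key>/best' to a topic."""
        if not path.startswith(self.prefix + "/"):
            raise KeyError(path)
        parts = path[len(self.prefix) + 1:].split("/")
        container = self._containers[parts[0]]
        if len(parts) == 1:
            return container
        if parts[1:] == ["best"]:
            return container.best
        raise KeyError(path)
```

In this sketch, resolving "/weak/isbn/9780670063260/best" behaves like a strong key and returns one topic, while resolving the container itself exposes every edition that shares the ISBN.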
I like the approach; it looks very reasonable considering how complex the issue is (most libraries still to this day gloriously pretend this issue doesn't exist and reveal it to each other under their breath not to be overheard, much like Pythagoreans did with irrational numbers ;-) I like the fact that if referenced, a weak key will look like a strong one, since that's what statistically people would expect. But I like also the fact that we're not simply ignoring the issue but we're tooling for this concept of 'weak identifiers'. The only concern I have is really about perception and it's the use of the word '/weak/' to classify it. Calling the rational numbers 'incomplete' would cause Pythagoreans to cut off your head in ancient Greece; accusing ISBNs of being 'weak' might induce a sense of "who do you think you are to call our identifiers weak?" that might hurt our ability to attract volunteers from the libraries in this space (librarians tend to be very proud people, especially with classification and identification systems). I understand that 'weak' is a very precise definition of what this identification scheme is, and can apply to others just as well, but maybe there is a word we can use that has the same meaning but doesn't inspire a "our identifiers are better than yours" undertone to the casual observer. Thoughts? -- Stefano Mazzocchi Application Catalyst Metaweb Technologies, Inc.
stefano at metaweb.com ------------------------------------------------------------------- From rnewman at twinql.com Sat Aug 22 18:17:39 2009 From: rnewman at twinql.com (Richard Newman) Date: Sat, 22 Aug 2009 11:17:39 -0700 Subject: [Data-modeling] The Curse of the ISBN In-Reply-To: <4A902E46.5000205@metaweb.com> References: <61B633FC-96B7-4A7E-BE72-B16B0E5DD8A1@metaweb.com> <4A902E46.5000205@metaweb.com> Message-ID: > I understand that 'weak' is a very precise definition of what this > identification scheme is, and can apply to others just as well, but > maybe there is a word we can use that has the same meaning but doesn't > inspire a "our identifiers are better than yours" undertone to the > casual observer. > > Thoughts? Other word ideas to spawn discussion: * Partial * Inconclusive * Pseudo-unique * Approximate * Reused/reissued (might not be general enough) then there are some which are harder to abbreviate, e.g.: * "Usually sufficient" :) -R From rnewman at twinql.com Sat Aug 22 18:26:37 2009 From: rnewman at twinql.com (Richard Newman) Date: Sat, 22 Aug 2009 11:26:37 -0700 Subject: [Data-modeling] The Curse of the ISBN In-Reply-To: References: <61B633FC-96B7-4A7E-BE72-B16B0E5DD8A1@metaweb.com> <4A902E46.5000205@metaweb.com> Message-ID: <8A5DE5FA-6BE3-4CCF-B0B2-ED1D98139526@twinql.com> > I understand that 'weak' is a very precise definition of what this > identification scheme is, and can apply to others just as well, but > maybe there is a word we can use that has the same meaning but doesn't > inspire a "our identifiers are better than yours" undertone to the > casual observer. > > Thoughts? Another thread came to mind: "contextual", "context-dependent", etc. ISBNs are treated as unambiguous within a certain context (such as a bookshop; it would be a very rare shop indeed whose computer systems were aware of reuse of ISBNs). If you search for an ISBN in Barnes and Noble, you're not going to get multiple results.
In the general sense ISBNs can be rendered unambiguous by the introduction of (varying degrees of) additional information: "the hardback edition", "the book titled 'Foo'", "the work first published in the sixties". -R From iainsproat at gmail.com Sat Aug 22 19:06:27 2009 From: iainsproat at gmail.com (Iain Sproat) Date: Sat, 22 Aug 2009 23:06:27 +0400 Subject: [Data-modeling] The Curse of the ISBN In-Reply-To: <8A5DE5FA-6BE3-4CCF-B0B2-ED1D98139526@twinql.com> References: <61B633FC-96B7-4A7E-BE72-B16B0E5DD8A1@metaweb.com> <4A902E46.5000205@metaweb.com> <8A5DE5FA-6BE3-4CCF-B0B2-ED1D98139526@twinql.com> Message-ID: I like this implementation of having a /best and /items. However, Stefano makes a good point with weak being a bad name, but I think it goes further than naming perceptions. I'd argue that *all* keys are "weak" to some degree. Calling something "strong" is saying that the original datasource is perfect and has absolutely no possibility of any conflated data elements. Very unlikely IMHO. Any Wikipedia imported article with a Freebase 'split' flag is the proof here. It shows that the Wikipedia key is in fact actually 2 semantic ideas, and thus "weak" to some extent. The correct functionality would be that after splitting a Freebase topic a Wikipedia key should then point to both resultant Freebase topics. As far as I'm aware this isn't currently happening. We neatly sidestep it by moving the key to only one of the topics; the other is keyless. Would it not be better to add /best and /items properties to *all* keys? Providing extra/redundant functionality in "strong" keys wouldn't break them, but will provide a consistent interface to anyone querying keys. This would also remove the need for a /weak namespace.
Iain On Sat, Aug 22, 2009 at 10:26 PM, Richard Newman wrote: >> I understand that 'weak' is a very precise definition of what this >> identification scheme is, and can apply to others just as well, but >> maybe there is a word we can use that has the same meaning but doesn't >> inspire a "our identifiers are better than yours" undertone to the >> casual observer. >> >> Thoughts? > > > Another thread came to mind: "contextual", "context-dependent", etc. > > ISBNs are treated as unambiguous within a certain context (such as a > bookshop; it would be a very rare shop indeed whose computer systems > were aware of reuse of ISBNs). If you search for an ISBN in Barnes and > Noble, you're not going to get multiple results. > > In the general sense ISBNs can be rendered unambiguous by the > introduction of (varying degrees of) additional information: "the > hardback edition", "the book titled 'Foo'", "the work first published > in the sixties". > > -R > From iainsproat at gmail.com Sun Aug 23 03:10:27 2009 From: iainsproat at gmail.com (Iain Sproat) Date: Sun, 23 Aug 2009 07:10:27 +0400 Subject: [Data-modeling] The Curse of the ISBN In-Reply-To: References: <61B633FC-96B7-4A7E-BE72-B16B0E5DD8A1@metaweb.com> <4A902E46.5000205@metaweb.com> <8A5DE5FA-6BE3-4CCF-B0B2-ED1D98139526@twinql.com> Message-ID: On Sat, Aug 22, 2009 at 11:06 PM, Iain Sproat wrote: > Any Wikipedia imported article with a freebase 'split' flag is the > proof here. Actually that's a bad example; I'll instead use the openlibrary "et al" authors who have been cropping up in the delete queue. These are in fact more than one author, so the openlibrary key would in theory relate to multiple freebase topics.
Iain From philip-freebase at shadowmagic.org.uk Mon Aug 24 11:18:55 2009 From: philip-freebase at shadowmagic.org.uk (Philip Kendall) Date: Mon, 24 Aug 2009 12:18:55 +0100 Subject: [Data-modeling] The Curse of the ISBN In-Reply-To: <61B633FC-96B7-4A7E-BE72-B16B0E5DD8A1@metaweb.com> References: <61B633FC-96B7-4A7E-BE72-B16B0E5DD8A1@metaweb.com> Message-ID: <20090824111854.GG13823@sphinx.mythic-beasts.com> On Fri, Aug 21, 2009 at 04:34:26PM -0700, Reilly Hayes wrote: > > One of the challenges of loading books is dealing with ISBNs. [ ... ] > The basic idea is that ISBNs should point to their own dedicated nodes > of type /book/isbn. Then, instead of having a /book/book_edition/isbn > be a /type/rawstring value, it will instead be a property link to the / > book/isbn node. > A root-level namespace ("/weak/") will be created that holds all > namespaces with the weak key nature. > Keys in the weak namespace point to weak key containers. For example "/ > weak/isbn/9780670063260" will point to the "container node" for that > ISBN. > Weak key containers for ISBN will be typed as /book/isbn. > The /book/book_edition/isbn13 will be created as a property that > points to nodes with an expected type of /book/isbn. > Add a property to the key value type reversing the property from the > target type (/book/isbn/items.) (Note that, because of permissioning > it is essential that the master property be FROM /book/book_edition > TO /book/isbn.) I'd have thought that both the /book/book_edition and /book/isbn nodes would be editable by anyone, but this comment implies this may not be the case, or is this just planning for situations which don't exist yet? > Containers will be named with the ISBN (for client display purposes). > For example, container node "/weak/isbn/9780670063260" will be named > "9780670063260". > The container node is cotyped as namespace, containing the single key > "best" that points to the object that "best". 
For example, "/weak/isbn/ > 9780670063260/best" would resolve to > http://www.freebase.com/edit/topic/guid/9202a8c04000641f80000000099fe6b6 > Gardening tasks will be created that will look for /book/isbn nodes > that don't fit these rules, and create all necessary links so that the > rules are fulfilled. The data structure looks as good as it can to me, but I'm slightly concerned about what support the client is going to have for editing any of this. A few use cases: * Creating a new book with a new ISBN * Creating a new book reusing an existing ISBN * Changing which of two books the "best" key for a particular ISBN points to If any of these involve "use the Query Editor", I think there's a problem here. > Please let us know you're thoughts. We'd like to implement this > proposal (along with ISBN13 normalization, remember that?) before the > end of the month. What is happening with ISBN13 normalization? Are we getting a new "ISBN13" field, or are we going to replace all the ISBN10s with ISBN13s? If the latter, I do think it's important that we record what's actually written on the book somewhere. Cheers, Phil -- Philip Kendall http://www.shadowmagic.org.uk/ From rfh at metaweb.com Mon Aug 24 16:52:01 2009 From: rfh at metaweb.com (Reilly Hayes) Date: Mon, 24 Aug 2009 09:52:01 -0700 Subject: [Data-modeling] The Curse of the ISBN In-Reply-To: <20090824111854.GG13823@sphinx.mythic-beasts.com> References: <61B633FC-96B7-4A7E-BE72-B16B0E5DD8A1@metaweb.com> <20090824111854.GG13823@sphinx.mythic-beasts.com> Message-ID: On Aug 24, 2009, at 4:18 AM, Philip Kendall wrote: > I'd have thought that both the /book/book_edition and /book/isbn nodes > would be editable by anyone, but this comment implies this may not be > the case, or is this just planning for situations which don't exist > yet? The intention is that /book/isbn nodes will be locked down by a regular automated gardening task. This task will also ensure that the key path and name property of the node are in sync. 
New nodes created by data team processes will automatically be created "locked down". New nodes created by users will be locked down by the gardening task. > > > The data structure looks as good as it can to me, but I'm slightly > concerned about what support the client is going to have for editing > any > of this. A few use cases: > > * Creating a new book with a new ISBN This will require the user to press "create new" when entering the field. > * Creating a new book reusing an existing ISBN This will just work, as the client will match on the name. > * Changing which of two books the "best" key for a particular ISBN > points to This will be automated and determined by the publication date (newest wins). We'll make sure the domain admins can change this as well. > > If any of these involve "use the Query Editor", I think there's a > problem here. No query editor required. > What is happening with ISBN13 normalization? Are we getting a new > "ISBN13" field, or are we going to replace all the ISBN10s with > ISBN13s? > If the latter, I do think it's important that we record what's > actually > written on the book somewhere. The field pointing from /book/book_edition to /book/isbn will be named isbn13. This is the implementation. We have not made provisions for storing what is actually on the book (in terms of ISBN) in the graph. However, we're working on a mechanism for making some raw source data available for user inspection. It won't be in the graph, but it should be accessible in the client and via MQL extensions. -r -------------- next part -------------- An HTML attachment was scrubbed... URL: http://lists.freebase.com/pipermail/data-modeling/attachments/20090824/baf4cf82/attachment.htm -------------- next part -------------- A non-text attachment was scrubbed...
Name: smime.p7s Type: application/pkcs7-signature Size: 2434 bytes Desc: not available Url : http://lists.freebase.com/pipermail/data-modeling/attachments/20090824/baf4cf82/attachment.bin From kurt at spaceship.com Mon Aug 24 17:15:28 2009 From: kurt at spaceship.com (Kurt Bollacker) Date: Mon, 24 Aug 2009 10:15:28 -0700 Subject: [Data-modeling] The Curse of the ISBN In-Reply-To: References: <61B633FC-96B7-4A7E-BE72-B16B0E5DD8A1@metaweb.com> <20090824111854.GG13823@sphinx.mythic-beasts.com> Message-ID: <20090824171528.GA17744@spaceship.com> On Mon, Aug 24, 2009 at 09:52:01AM -0700, Reilly Hayes wrote: > On Aug 24, 2009, at 4:18 AM, Philip Kendall wrote: > >The data structure looks as good as it can to me, but I'm slightly > >concerned about what support the client is going to have for editing > >any > >of this. A few use cases: > > > >* Creating a new book with a new ISBN > > This will require the user to press "create new" when entering the > field. Will this be a Web Client UI feature (immediate key creation when node is created) or gardening task batch processing (delayed key creation)? If the latter, what will the delay be? Kurt :-) From rfh at metaweb.com Mon Aug 24 17:40:30 2009 From: rfh at metaweb.com (Reilly Hayes) Date: Mon, 24 Aug 2009 10:40:30 -0700 Subject: [Data-modeling] The Curse of the ISBN In-Reply-To: <20090824171528.GA17744@spaceship.com> References: <61B633FC-96B7-4A7E-BE72-B16B0E5DD8A1@metaweb.com> <20090824111854.GG13823@sphinx.mythic-beasts.com> <20090824171528.GA17744@spaceship.com> Message-ID: <963B1676-8F34-46D7-9A62-5947F0E2FEF1@metaweb.com> The key will not be created immediately. The key will be created as part of a gardening task. This will be nightly at first, and then a triggered event that will fire shortly after the node is created.
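One pass of the gardening task discussed in this thread could be sketched as below. It is a rough illustration under stated assumptions, not the data team's actual process: the function name, the dictionary result, and the edition ids are hypothetical, and the handling of editions with no publication date (fall back to the first listed edition) is an assumption the thread does not cover. The two rules it encodes do come from the thread: the container name is kept in sync with its key, and "best" goes to the newest publication date.

```python
from datetime import date
from typing import Dict, List, Optional, Tuple


def garden_container(
    key: str,
    current_name: Optional[str],
    editions: List[Tuple[str, Optional[date]]],
) -> Dict[str, object]:
    """Return the repaired state for one weak-key container.

    `editions` pairs a (hypothetical) edition id with its publication date,
    or None when the date is unknown.
    """
    dated = [(published, edition_id) for edition_id, published in editions
             if published is not None]
    if dated:
        # Newest publication date wins; ties fall back to the larger id.
        best: Optional[str] = max(dated)[1]
    elif editions:
        # Assumption: with no dates at all, keep the first listed edition.
        best = editions[0][0]
    else:
        best = None
    return {
        "name": key,                        # name is forced to match the key
        "best": best,
        "name_fixed": current_name != key,  # True if the name had drifted
    }
```

Whether this runs nightly or from a trigger only changes when it is called; the repair logic would be the same either way.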
-r On Aug 24, 2009, at 10:15 AM, Kurt Bollacker wrote: > > On Mon, Aug 24, 2009 at 09:52:01AM -0700, Reilly Hayes wrote: >> On Aug 24, 2009, at 4:18 AM, Philip Kendall wrote: >>> The data structure looks as good as it can to me, but I'm slightly >>> concerned about what support the client is going to have for editing >>> any >>> of this. A few use cases: >>> >>> * Creating a new book with a new ISBN >> >> This will require the user to press "create new" when entering the >> field. > > Will this be a Web Client UI feature (immediate key creation when node > is created) or gardening task batch processing (delayed key creation). > If the latter, what will the delay be? > > Kurt :-) > > > _______________________________________________ > Data-modeling mailing list > Data-modeling at freebase.com > http://lists.freebase.com/mailman/listinfo/data-modeling From jeff at metaweb.com Mon Aug 24 18:53:02 2009 From: jeff at metaweb.com (Jeff Prucher) Date: Mon, 24 Aug 2009 11:53:02 -0700 Subject: [Data-modeling] English Words In-Reply-To: <4A8DCF26.3080909@metaweb.com> References: <4A84C43B.3040303@metaweb.com><4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com><4A8C25F5.1090602@metaweb.com><3700175F-00E7-4959-9936-7ABB2D2079CA@metaweb.com><2D2A19AF-A394-4AD9-B0C2-9A7C4CE48531@metaweb.com> <4A8DCF26.3080909@metaweb.com> Message-ID: <29E22A11C3EF4FB18087D813260C74AB@p4> > -----Original Message----- > From: data-modeling-bounces at freebase.com > [mailto:data-modeling-bounces at freebase.com] On Behalf Of Scott Meyer > Sent: Thursday, August 20, 2009 3:33 PM > To: Freebase data modeling mailing list > Subject: Re: [Data-modeling] English Words > > Jeff Prucher wrote: > > > > > >> -----Original
Message----- > >> From: data-modeling-bounces at freebase.com > >> [mailto:data-modeling-bounces at freebase.com] On Behalf Of > Iain Sproat > >> Sent: Thursday, August 20, 2009 2:06 AM > >> To: Freebase data modeling mailing list > >> Subject: Re: [Data-modeling] English Words > >> > >> On Thu, Aug 20, 2009 at 1:43 AM, Arthur van > >> Hoff wrote: > >>> How are the various word types (noun, verb, ...) going to > >> be distinguished? > >> > >> This would be an additional feature to add on later. I > would like to > >> see the basic schema get put in place and tested before we begin > >> adding more functionality. A property on synset would > probably work. > > > > This depends on our implementation. If "rat (n.)" and "rat > (v.)" are > > two different instances of /common/symbol, then a property > for part of > > speech should go on /common/symbol. If, however, we think > the noun and > > verb share the same /common/symbol node, synset would be the right > > place. I'd vote for the former, especially since we almost > certainly > > are going to want to be able to included inflected forms -- > which vary > > between parts of speech, but we should see how WordNet does > this, too. > > Looking at the prolog schema, it appears that a "word" is > just a synset with one member. Part-of-speach, encoded as a > byte in front of the 8-byte synset id, applies equally to > words and synsets. > > So, I think that the natural translation into Freebase is to > have /common/symbol include /common/synset, with the addition > of the /common/symbol type meaning: "the name of this object > is an English word" with the definition as given by the > /common/symset/definition" > > This seems to handle the existing "rat" example nicely, "rat" > is both a symbol and a synset containing hyponyms: pocket > rat, brown rat, etc. > We will have topic<->synset mappings for "rodent", "rat", > "brown rat", etc. 
> > http://wordnetweb.princeton.edu/perl/webwn?o2=&o0=1&o7=&o5=&o1 > =1&o6=&o4=&o3=1&s=rat&i=11&h=1101000000000100000100000000#c I disagree. If all we wanted to do was mirror WordNet, this approach would work. But it seems that there's a lot of interest in non-WordNet-type data (although how serious this interest is, e.g., does anyone have data for it or does it just sound like a good idea, I can't say), such as etymology, pronunciation, and inflection. Doing these would require that /common/symbol and /common/synset be distinct. For example, WordNet gives five noun synsets for "rat"; these are all synsets of the same English word -- there are not five different words "rat", with unique etymologies, inflections, etc. Unfortunately, WordNet does not care about homonyms: there are two English verbs "cleave" (to split and to cling to), which are inflected differently ("cleft" and "cleaved", respectively). But if you search for the verb "cleft", you get both "cleaves". This will make the extracting of the data into separate /common/symbol and /common/synset nodes rather difficult (unless of course someone else has already solved it, which wouldn't surprise me). 
Jeff From sm at metaweb.com Mon Aug 24 22:40:46 2009 From: sm at metaweb.com (Scott Meyer) Date: Mon, 24 Aug 2009 15:40:46 -0700 Subject: [Data-modeling] English Words In-Reply-To: <29E22A11C3EF4FB18087D813260C74AB@p4> References: <4A84C43B.3040303@metaweb.com><4A89D4AB.2050403@metaweb.com> <20090818223958.GM6305@spaceship.com><4A8C25F5.1090602@metaweb.com><3700175F-00E7-4959-9936-7ABB2D2079CA@metaweb.com><2D2A19AF-A394-4AD9-B0C2-9A7C4CE48531@metaweb.com> <4A8DCF26.3080909@metaweb.com> <29E22A11C3EF4FB18087D813260C74AB@p4> Message-ID: <4A9316EE.4000107@metaweb.com> Jeff Prucher wrote: > > >> -----Original Message----- >> From: data-modeling-bounces at freebase.com >> [mailto:data-modeling-bounces at freebase.com] On Behalf Of Scott Meyer >> Sent: Thursday, August 20, 2009 3:33 PM >> To: Freebase data modeling mailing list >> Subject: Re: [Data-modeling] English Words >> >> Jeff Prucher wrote: >>> >>> >>>> -----Original Message----- >>>> From: data-modeling-bounces at freebase.com >>>> [mailto:data-modeling-bounces at freebase.com] On Behalf Of >> Iain Sproat >>>> Sent: Thursday, August 20, 2009 2:06 AM >>>> To: Freebase data modeling mailing list >>>> Subject: Re: [Data-modeling] English Words >>>> >>>> On Thu, Aug 20, 2009 at 1:43 AM, Arthur van >>>> Hoff wrote: >>>>> How are the various word types (noun, verb, ...) going to >>>> be distinguished? >>>> >>>> This would be an additional feature to add on later. I >> would like to >>>> see the basic schema get put in place and tested before we begin >>>> adding more functionality. A property on synset would >> probably work. >>> This depends on our implementation. If "rat (n.)" and "rat >> (v.)" are >>> two different instances of /common/symbol, then a property >> for part of >>> speech should go on /common/symbol. If, however, we think >> the noun and >>> verb share the same /common/symbol node, synset would be the right >>> place. 
I'd vote for the former, especially since we almost >> certainly >>> are going to want to be able to included inflected forms -- >> which vary >>> between parts of speech, but we should see how WordNet does >> this, too. >> >> Looking at the prolog schema, it appears that a "word" is >> just a synset with one member. Part-of-speach, encoded as a >> byte in front of the 8-byte synset id, applies equally to >> words and synsets. >> >> So, I think that the natural translation into Freebase is to >> have /common/symbol include /common/synset, with the addition >> of the /common/symbol type meaning: "the name of this object >> is an English word" with the definition as given by the >> /common/symset/definition" >> >> This seems to handle the existing "rat" example nicely, "rat" >> is both a symbol and a synset containing hyponyms: pocket >> rat, brown rat, etc. >> We will have topic<->synset mappings for "rodent", "rat", >> "brown rat", etc. >> >> http://wordnetweb.princeton.edu/perl/webwn?o2=&o0=1&o7=&o5=&o1 >> =1&o6=&o4=&o3=1&s=rat&i=11&h=1101000000000100000100000000#c > > I disagree. If all we wanted to do was mirror WordNet, this approach would > work. But it seems that there's a lot of interest in non-WordNet-type data > (although how serious this interest is, e.g., does anyone have data for it > or does it just sound like a good idea, I can't say), such as etymology, > pronunciation, and inflection. Doing these would require that /common/symbol > and /common/synset be distinct. For example, WordNet gives five noun synsets > for "rat"; these are all synsets of the same English word -- there are not > five different words "rat", with unique etymologies, inflections, etc. > Unfortunately, WordNet does not care about homonyms: there are two English > verbs "cleave" (to split and to cling to), which are inflected differently > ("cleft" and "cleaved", respectively). But if you search for the verb > "cleft", you get both "cleaves". 
This will make the extracting of the data
> into separate /common/symbol and /common/synset nodes rather difficult
> (unless of course someone else has already solved it, which wouldn't
> surprise me).

If Wiktionary is to be believed, cleave has two different etymologies, one for each sense you describe. If /common/symbol includes etymology then using a single /common/symbol for all synsets is the wrong way to go. Doesn't seem unusual for meaning to cleave closer to etymology than to orthography. :-) Seems like morphology would have the same problem.

I think that just mirroring WordNet is an excellent first step. Perhaps the prudent thing to do is defer /common/symbol until someone has an actionable proposal (i.e., data to load) which actually differentiates it from /common/synset.

FWIW, Wordnet gives cleave three senses
http://wordnetweb.princeton.edu/perl/webwn?o2=&o0=1&o7=&o5=&o1=1&o6=&o4=&o3=&s=cleave&i=4&h=100000110101001000#c

-Scott

From spencerkelly86 at gmail.com Tue Aug 25 01:34:37 2009
From: spencerkelly86 at gmail.com (Spencer Kelly)
Date: Mon, 24 Aug 2009 21:34:37 -0400
Subject: [Data-modeling] English Words
In-Reply-To: <29E22A11C3EF4FB18087D813260C74AB@p4>
References: <4A8C25F5.1090602@metaweb.com> <3700175F-00E7-4959-9936-7ABB2D2079CA@metaweb.com> <2D2A19AF-A394-4AD9-B0C2-9A7C4CE48531@metaweb.com> <4A8DCF26.3080909@metaweb.com> <29E22A11C3EF4FB18087D813260C74AB@p4>
Message-ID:

On Mon, Aug 24, 2009 at 2:53 PM, Jeff Prucher wrote:
> does anyone have data for it or does it just sound like a good idea, I can't say),

you betcha. the CMU Pronouncing Dictionary is great and gpl, as well. The moby lexicon (also gpl) has a great Pronunciator. etymology is harder, but there's lots of data.

> For example, WordNet gives five noun synsets
> for "rat"; these are all synsets of the same English word -- there are not
> five different words "rat", with unique etymologies, inflections, etc.

ya, yuck. well is it easier to split these homonyms before or after?
...and let's take a moment to realize that what we're doing is making the world's greatest ever dictionary.

From arthur.van.hoff at gmail.com Tue Aug 25 02:18:13 2009
From: arthur.van.hoff at gmail.com (Arthur van Hoff)
Date: Mon, 24 Aug 2009 19:18:13 -0700
Subject: [Data-modeling] English Words
In-Reply-To:
References: <3700175F-00E7-4959-9936-7ABB2D2079CA@metaweb.com> <2D2A19AF-A394-4AD9-B0C2-9A7C4CE48531@metaweb.com> <4A8DCF26.3080909@metaweb.com> <29E22A11C3EF4FB18087D813260C74AB@p4>
Message-ID:

On Mon, Aug 24, 2009 at 6:34 PM, Spencer Kelly wrote:
>
> ...and let's take a moment to realize that what we're doing is making the
> world's greatest ever dictionary.

+1 !

--
Arthur van Hoff
arthur.van.hoff at gmail.com
650-283-0842

From sm at metaweb.com Tue Aug 25 23:34:16 2009
From: sm at metaweb.com (Scott Meyer)
Date: Tue, 25 Aug 2009 16:34:16 -0700
Subject: [Data-modeling] English Words
In-Reply-To:
References: <4A8C25F5.1090602@metaweb.com> <3700175F-00E7-4959-9936-7ABB2D2079CA@metaweb.com> <2D2A19AF-A394-4AD9-B0C2-9A7C4CE48531@metaweb.com> <4A8DCF26.3080909@metaweb.com> <29E22A11C3EF4FB18087D813260C74AB@p4>
Message-ID: <4A9474F8.5080003@metaweb.com>

Spencer Kelly wrote:
>
> On Mon, Aug 24, 2009 at 2:53 PM, Jeff Prucher wrote:
>> does anyone have data for it or does it just sound like a good idea,
>> I can't say),
>
> you betcha.
> the CMU Pronouncing Dictionary is great and gpl,
> as well. The moby lexicon (also gpl) has a
> great Pronunciator.

I took a quick look at Moby and was reminded that pronunciation sometimes follows meaning: "close the door" vs. "how close are you?"
The number of such cases is characterized as "several hundred;" tractable for by-hand reconciliation.

So, considering only pronunciation, data structure is one of:

1. put pronunciation (a text property) on synset
   - live with a large amount of duplicate data
   - no ability to join (homonyms)

2. put pronunciation on synset only if spelling (or pronunciation) is different from hypernym
   - not sure how much this saves, if anything

3. put pronunciation on symbol and live with several hundred symbols which have the same name but different pronunciations
   - interning is complex
   - Effectively, this makes pronunciation part of a symbol's "name"
   - no homonym joining

4. make pronunciation be a CVT which allows pronunciation to be associated with a synset
   - complex
   - having it both ways costs a lot

5. Like #4 but with the convention that no synset linkage means that the single pronunciation applies to all synsets using this symbol.
   - But why have CVT at all?

6. put pronunciation on either the symbol or the synset and (by convention) check the synset first. Like #5 but get rid of the CVT
   + Optimal for the 1 symbol - 1 pronunciation case
   - effectively, symbols are identical to synsets, so why have symbols?

7. Create a /common/pronunciation type (name is Unicode IPA?) and associate this with synsets. Same as #1 but with identities for pronunciation
   - intern "by hand"
   + Nice in that it does not duplicate the symbol/synset name again.
   + allows easy join for homonym

By "interning" I mean "creating an identity only if an equivalent identity does not already exist." Concretely, before creating the pronunciation "cloz" we would first check to see if that identity exists and use any existing identity rather than creating a duplicate. In Freebase, this is typically done with a namespace.

Considering just the problem of representing pronunciation, I like #7 the best. Conversely, I don't see much value in having /common/symbol distinct from /common/synset.
If anything, symbol seems to complicate life for pronunciation.

Concrete recommendation for Wordnet+Moby:

Two types, /common/synset and /common/pronunciation, related by a single property, pronounced<->meanings, which would allow querying of homonyms like so:

    {
      "type" : "/common/synset",
      "pronounced" : {
        "name" : null,      # get IPA
        "meanings" : [{
          "name" : null     # homonyms
        }]
      }
    }

All nice and neat as long as we're talking about one dialect/language. As soon as we admit different pronunciations (British vs. American) we're going to need some way (CVT) to represent "alternate" pronunciations.

Anyway... If you buy all this, then it seems reasonable to load Wordnet (synsets) first, then add pronunciation later.

-Scott

From jeff at metaweb.com Thu Aug 27 18:45:25 2009
From: jeff at metaweb.com (Jeff Prucher)
Date: Thu, 27 Aug 2009 11:45:25 -0700
Subject: [Data-modeling] English Words
In-Reply-To: <4A9474F8.5080003@metaweb.com>
References: <4A8C25F5.1090602@metaweb.com> <3700175F-00E7-4959-9936-7ABB2D2079CA@metaweb.com> <2D2A19AF-A394-4AD9-B0C2-9A7C4CE48531@metaweb.com><4A8DCF26.3080909@metaweb.com><29E22A11C3EF4FB18087D813260C74AB@p4> <4A9474F8.5080003@metaweb.com>
Message-ID: <8D3E5DA7811B47168A1BA6C50F4180AB@amd>

> -----Original Message-----
> From: data-modeling-bounces at freebase.com
> [mailto:data-modeling-bounces at freebase.com] On Behalf Of Scott Meyer
>
> Anyway... If you buy all this, then it seems reasonable to
> load Wordnet
> (synsets) first, then add pronunciation later.

I agree -- let's do the synsets first (skipping the issue of symbols entirely for now). (Not that we shouldn't continue the other discussions -- we can do it in parallel!)
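
For readers following the thread, here is a minimal Python sketch of the interning step and the pronounced<->meanings link that Scott proposes. It assumes an in-memory dict can stand in for a Freebase namespace. The names PronunciationNamespace, link and homonyms, and the pronunciation strings "kliv", "cloz" and "clos", are hypothetical and chosen only for illustration; this is not Freebase or MQL code.

```python
class PronunciationNamespace:
    """Hypothetical stand-in for a namespace keyed by pronunciation string."""

    def __init__(self):
        self._by_name = {}  # pronunciation string -> pronunciation record

    def intern(self, name):
        """Return the existing record for `name`, creating it only if absent."""
        record = self._by_name.get(name)
        if record is None:
            record = {"name": name, "meanings": []}
            self._by_name[name] = record
        return record


def link(namespace, synset_name, pronunciation):
    """Create a synset record and attach it to its interned pronunciation."""
    pron = namespace.intern(pronunciation)
    synset = {"name": synset_name, "pronounced": pron}
    pron["meanings"].append(synset)  # the reverse side of the property
    return synset


def homonyms(synset):
    """Names of the other synsets that share this synset's pronunciation."""
    shared = synset["pronounced"]["meanings"]
    return [other["name"] for other in shared if other is not synset]


if __name__ == "__main__":
    ns = PronunciationNamespace()
    split = link(ns, "cleave (split)", "kliv")
    link(ns, "cleave (cling)", "kliv")
    shut = link(ns, "close (shut)", "cloz")
    link(ns, "close (near)", "clos")
    print(homonyms(split))  # the other "cleave" synset
    print(homonyms(shut))   # empty: no other synset is pronounced "cloz"
```

Because every synset points at a single interned pronunciation record, finding homonyms is one hop out and one hop back, which is the join that option #7 is meant to make easy.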
Jeff

From pauljmackay at gmail.com Sun Aug 30 16:24:45 2009
From: pauljmackay at gmail.com (Paul Mackay)
Date: Sun, 30 Aug 2009 09:24:45 -0700
Subject: [Data-modeling] List of backwards compatible operations on schemas
Message-ID:

Would it be possible to document a list of operations that can be made to a schema once a Base has been populated with a reasonable amount of data? It would help to know what can be done once data is present and what would require more complex data migration steps.

Or if a list like this already exists does anyone know of a link?

thanks
paul

From zenkat at metaweb.com Mon Aug 31 22:50:03 2009
From: zenkat at metaweb.com (Brian Karlak)
Date: Mon, 31 Aug 2009 15:50:03 -0700
Subject: [Data-modeling] The Curse of the ISBN
In-Reply-To:
References: <61B633FC-96B7-4A7E-BE72-B16B0E5DD8A1@metaweb.com> <4A902E46.5000205@metaweb.com> <8A5DE5FA-6BE3-4CCF-B0B2-ED1D98139526@twinql.com>
Message-ID: <6A0B4A70-0DE8-4C33-BA33-D223E8F73123@metaweb.com>

Hello Iain --

One of the drivers behind the /weak/isbn proposal is the conflict between needing to have a key resolve to a single item in Freebase, and allowing multiple editions to display the same ISBN.

With wikipedia keys -- and other "strong" keys -- the second constraint doesn't exist. Unlike an ISBN, a wikipedia key doesn't need to show up as a topic property. It's OK if we pick one topic as the "best" topic to keep the keys on since the other topics don't need to show them.

Since the weak key proposal requires the creation of many helper nodes & properties -- along with gardening tasks to keep it all in sync -- it's probably best that we reserve it for cases where we really need to allow a key to show up on multiple topics as a property.
However, it is a pattern that we hope to use in other cases (like UPCs) where contradictory constraints exist.

Brian

On Aug 22, 2009, at 12:06 PM, Iain Sproat wrote:

> I like this implementation of having a /best and /items. However,
> Stefano makes a good point with weak being a bad name, but I think it
> goes further than naming perceptions.
>
> I'd argue that *all* keys are "weak" to some degree. By calling
> something "strong" is saying that the original datasource is perfect
> and has absolutely no possibility of any conflated data elements.
> Very unlikely IMHO.
>
> Any Wikipedia imported article with a freebase 'split' flag is the
> proof here. It shows that the wikipedia key is in fact actually 2
> semantic ideas, and thus "weak" to some extent. The correct
> functionality would be that after splitting a freebase topic a
> wikipedia key should then point to both resultant Freebase topics. As
> far as I'm aware this isn't currently happening. We neatly sidestep
> it by moving the key to only one of the topics, the other is keyless.
>
> Would it not be better to add a /best and /items properties to *all*
> keys? Providing extra/redundant functionality in "strong" keys
> wouldn't break them, but will provide a consistent interface to anyone
> querying keys. This would also remove the need for a /weak namespace.
>
> Iain
>
> On Sat, Aug 22, 2009 at 10:26 PM, Richard Newman
> wrote:
>>> I understand that 'weak' is a very precise definition of what this
>>> identification scheme is, and can apply to others just as well, but
>>> maybe there is a word we can use that has the same meaning but
>>> doesn't
>>> inspire a "our identifiers are better than yours" undertone to the
>>> casual observer.
>>>
>>> Thoughts?
>>
>>
>> Another thread came to mind: "contextual", "context-dependent", etc.
>>
>> ISBNs are treated as unambiguous within a certain context (such as a
>> bookshop --
it would be a very rare shop indeed whose computer systems >> were aware of reuse of ISBNs). If you search for an ISBN in Barnes >> and >> Noble, you're not going to get multiple results. >> >> In the general sense ISBNs can be rendered unambiguous by the >> introduction of (varying degrees of) additional information: "the >> hardback edition", "the book titled 'Foo'", "the work first published >> in the sixties". >> >> -R From jamie at metaweb.com Mon Aug 31 23:05:30 2009 From: jamie at metaweb.com (Jamie Taylor) Date: Mon, 31 Aug 2009 16:05:30 -0700 (PDT) Subject: [Data-modeling] English Words In-Reply-To: <119656610.49301251759509287.JavaMail.root@zimbra01.corp.sjc1.metaweb.com> Message-ID: <1086592356.49321251759930919.JavaMail.root@zimbra01.corp.sjc1.metaweb.com> If we are starting with Synsets (which I think is an excellent idea) and a few of the other basic Wordnet structures I have some scripts from a previous Wordnet loading experiment that I can run. I'll get a candidate load up in a few days. From a basic Wordnet backbone we can add the other fancier structures as desired later.... J ----- Original Message ----- From: "Jeff Prucher" To: "Freebase data modeling mailing list" Sent: Thursday, August 27, 2009 11:45:25 AM GMT -08:00 US/Canada Pacific Subject: Re: [Data-modeling] English Words > -----Original Message----- > From: data-modeling-bounces at freebase.com > [mailto:data-modeling-bounces at freebase.com] On Behalf Of Scott Meyer > > Anyway... If you buy all this, then it seems reasonable to > load Wordnet > (synsets) first, then add pronunciation later.
I agree -- let's do the synsets first (skipping the issue of symbols entirely for now). (Not that we shouldn't continue the other discussions -- we can do it in parallel!) Jeff From zenkat at metaweb.com Mon Aug 31 23:12:32 2009 From: zenkat at metaweb.com (Brian Karlak) Date: Mon, 31 Aug 2009 16:12:32 -0700 Subject: [Data-modeling] The Curse of the ISBN In-Reply-To: <4A902E46.5000205@metaweb.com> References: <61B633FC-96B7-4A7E-BE72-B16B0E5DD8A1@metaweb.com> <4A902E46.5000205@metaweb.com> Message-ID: <7EADEC1C-578C-4027-9953-D31AF8B79092@metaweb.com> On Aug 22, 2009, at 10:43 AM, Stefano Mazzocchi wrote: > I understand that 'weak' is a very precise definition of what this > identification scheme is, and can apply to others just as well, but > maybe there is a word we can use that has the same meaning but doesn't > inspire a "our identifiers are better than yours" undertone to the > casual observer. Well, it's important to note that the "/weak/" namespace is an indication of how the keys are used in Freebase. It's not a property of the keys themselves. For instance, ISBNs are a strong key for booksellers. An ISBN uniquely identifies a single book that is available for sale. A bookseller's supply chain software will use ISBNs as strong keys, just as logistics companies will use UPC as strong keys. Freebase, however, is trying to use these keys in ways that they were not designed for. Specifically, we're trying to track the historic usage of ISBN & UPC codes on items that were available for purchase in the past. We're also trying to track book editions with more granularity than a bookseller by tracking binding format, cover art, price, printing run, and the like.
In other words, it's only because the Freebase definition of what a "book edition" is differs from a bookseller's that we're forced to use the key in a "weak" manner.

Because of this, I believe it's perfectly appropriate to use the "weak" namespace to manage these keys in Freebase. It's not a reflection on the keys themselves -- just how they are used in Freebase. If anyone happens to see the namespace in Freebase (they are rather hidden, after all), it's a simple matter to explain that the limitation is ours, not theirs.

Brian

From sm at metaweb.com Mon Aug 31 22:44:39 2009
From: sm at metaweb.com (Scott Meyer)
Date: Mon, 31 Aug 2009 15:44:39 -0700
Subject: [Data-modeling] List of backwards compatible operations on schemas
In-Reply-To:
References:
Message-ID: <4A9C5257.5050700@metaweb.com>

Paul Mackay wrote:
> Would it be possible to document a list of operations that can be made to a
> schema once a Base has been populated with a reasonable amount of data?
> It would help to know what can be done once data is present and what
> would require more complex data migration steps.

We don't have any documentation that deals with backwards compatibility, so I wrote up the following:

The simple answer is: Adding properties is always backwards compatible, changing properties is sometimes backwards compatible.

Let's define some terms: With respect to schema, "backwards compatible" means that MQL queries formulated with version N of a schema still function with version N + 1. The goal of a "backwards compatible" schema change is to allow existing applications to continue to run.

One obvious invariant is property names. If a property, /somedomain/sometype/foo, is present in schema version N, then it has to be present in version N+1. Furthermore, the type of that property has to be "compatible".
For object types, "compatible" means it has a superset of property names where the type of each property in the old set is compatible: If /somedomain/sometype/foo once got you to an object with a property named "bar", it should continue to do so in future versions of the schema.

For value types, "compatible" is a bit slipperier as the reaction of a JSON consumer to a change from:

    "some_property" : 1.0

to:

    "some_property" : 1

is unspecified. Mostly, you can get away with the obvious set of changes.

Subtleties

In some cases, for example when there is no data for a property, just removing the property is probably the right thing to do. But you will break queries containing:

    "deletedproperty" : null

Arguably, MQL should silently ignore this case, but it doesn't. Avoid creating properties that you don't have actual data for. Adding them later is easy, deleting, not so much.

A very common schema change is converting a simple value to a CVT, for example changing a country's population from an integer to a dated integer. Such a change is not backwards compatible so we would have to create a new property for the dated population.

OK, now we have /mycountry/population, an integer, and /mycountry/dated_population, a CVT. We have some existing data for the former, none for the latter. How to move forward:

1. Add population data in the new format only. The existing data in the old format will continue to exist but will become less and less useful over time. This toes the schema compatibility line, but as existing applications will gradually cease to be useful due to lack of data, the fulfillment of the schema compatibility promise is a bit empty.

2. Preserve both types of population, setting the simple integer based on the most recent dated population. This keeps applications using the old schema fully functional at the expense of some ongoing data gardening activity.
Whether this counts as "denormalization" - evil duplication of data - depends on the contents of the property and how it gets used. For example, if we have reliable population data for Elbonia dated 2000, and someone hears a news report on Elbonia giving the current population as "23 million", using 23,000,000 for the simple value is arguably better than waiting another two years to get the official 23,275,381, or, worse, having a user fabricate a dated value (23,000,000 as of 2009) when the date is, in fact, unknown.

Simple schema makes the data much easier to use: Grabbing "population" is much easier than grabbing the most recent population value sorted by date.

In the relatively near future, you'll be able to write an "extended MQL" property which can, among other things, compute a simple property, such as "age", on the basis of more complex data, in that case: date of birth, date of death.

3. Actively delete the simple population property. This comes off as a bit hostile, but given the way that things play out in scenario #1, it might just be honest.

Hope this helps,

-Scott
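
As an illustration of the rules above, here is a rough Python sketch of the "superset of property names" check and of option 2 (deriving the simple value from the most recent dated one). Everything in it is an assumption made for the example: a schema is modelled as a nested dict of property name to expected type, value types are plain strings, dated populations are dicts with "date" (ISO string) and "number" keys, and the function names are hypothetical. It is not Freebase's schema format and it does not handle schemas that refer back to themselves.

```python
def is_compatible(old, new):
    """Check the backwards-compatibility rule for one expected type.

    Object types are dicts (property name -> expected type); value types
    are strings such as "int". Both encodings are assumptions of this sketch.
    """
    if isinstance(old, dict):
        if not isinstance(new, dict):
            return False  # an object type was replaced by a value type
        # Every old property must still exist, with a compatible type.
        return all(
            name in new and is_compatible(old_type, new[name])
            for name, old_type in old.items()
        )
    # Value types: be conservative and accept only an identical type name.
    return old == new


def simple_population(dated_populations):
    """Option 2: the simple value is the most recent dated population."""
    if not dated_populations:
        return None
    latest = max(dated_populations, key=lambda entry: entry["date"])
    return latest["number"]


if __name__ == "__main__":
    v1 = {"population": "int"}
    # Adding a property keeps old queries working.
    v2 = {"population": "int",
          "dated_population": {"number": "int", "date": "datetime"}}
    # Turning the simple value into a CVT in place does not.
    v3 = {"population": {"number": "int", "date": "datetime"}}
    print(is_compatible(v1, v2))
    print(is_compatible(v1, v3))
```

Treating any change of value type as incompatible is deliberately stricter than "you can get away with the obvious set of changes"; a real check would need a list of which value-type changes consumers tolerate.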