[Data-modeling] Languages and Rosetta
Kurt Bollacker
kurt at spaceship.com
Tue May 5 06:47:50 UTC 2009
On Tue, May 05, 2009 at 01:39:37AM -0400, Tom Morris wrote:
> > No. It is intended to co-exist and supplement the /language domain.
>
> Co-exist in the terms of schema/types or topics or both? It's a very
> visible subject (obviously), so I think we need to make it easy for
> users to make correct choices. The fact that Languoid is the primary
> type for your schema will obviously help keep civilians from picking
> it, but things aren't as clear cut for topics.
Schema for sure, but in the cases where there is dispute or radically
different use cases (which usually means the uses don't overlap)
topics as well.
> >> I've flagged a few things for merger, but now I've come across
> >> something that I can't even flag
> >> http://www.freebase.com/view/base/rosetta/group/Chinese (because of
> >> permissions perhaps?)
> >
> > oooh.. You picked a good one. Chinese is one of the few linguistic
> > entities that the linguists consider to be a "macrolanguage" or
> > "language group" rather than a specific language. The Freebase topic
> > for Chinese:
> >
> > http://www.freebase.com/view/en/chinese_language
> >
> > includes both the "language" and "group" properties in an intuitive
> > (but non-rigorous to linguists) way.
>
> For the record, I was trying to merge
> http://www.freebase.com/view/base/rosetta/group/Chinese
> http://www.freebase.com/view/en/chinese_language
Yup. I got that.
> as a Language family.
>
> > A good indicator of this is the
> > fact that this topic is typed both "Human Language" and "Language
> > Family", which is taxonomically impossible. However, for the real
> > world, this entropy is OK.
>
> Actually, from my reading of the Wikipedia article, upon which the
> Freebase typing is based, it's pretty clear that it should be a
> Language Family, not a Human Language. I think before I started it
> was only typed as a Human Language, not a Language Family,
That's why I said it was a good example. The "intuitive"
understanding is that Chinese is a language, not a family. It even
has an ISO 639-3 code, a concession to practical usage rather than
linguistic rigor. The linguist would remove the "Human Language"
type, but this would destroy useful information. We need to support
both worlds.
> but from
> what I've seen in my travels through Freebase, I think you'll find
> that almost everything has been typed Human Language - dialects,
> languages, families, you name it. Given the lack of guidance and the
> amateurs doing the typing, that's not too surprising, but I don't
> think it's an irrecoverable situation. As long as the linguists'
> definitions aren't too completely wacky, I don't see why users
> couldn't be convinced to use them (particularly if the type hierarchy
> is already completely populated).
I think that's right. The creation of the rosetta base and languoids
is a first step in this transition, which does not trample on any
existing uses of Human Language.
> > The taxonomy of languoids in the Rosetta Base has been created by
> > linguists, and is purposefully trying to stay out of the way of common
> > usage of languages when there is potential confusion.
>
> Unfortunately, that's really hard to do because of the way the
> autocomplete works. Unless it's clear from the description that a
> topic or type is from someplace other than the commons, there's no way
> for a user to tell.
I've never seen a controversial typing of Human Language yet in
Freebase. It's obviously either right or wrong. If it were not for
the pending and yet-to-be-found merges, I'd eliminate the Human
Language typing on all topics that do not have ISO 639-3 codes.
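As a rough sketch of that cleanup rule: the snippet below drops the
"Human Language" type from any topic that has no ISO 639-3 code, and
skips topics that still have a merge pending. The Topic class, its
field names, and the strip_human_language function are hypothetical
and exist only for illustration; this is not Freebase's data model or
API, and the type names are just the display names used in this thread.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

HUMAN_LANGUAGE = "Human Language"    # display name only, not a real type id
LANGUAGE_FAMILY = "Language Family"  # display name only, not a real type id


@dataclass
class Topic:
    """Hypothetical stand-in for a topic; not Freebase's actual schema."""
    name: str
    types: Set[str] = field(default_factory=set)
    iso_639_3: Optional[str] = None   # None means no ISO 639-3 code
    merge_pending: bool = False       # True while a merge is outstanding


def strip_human_language(topics: List[Topic]) -> List[str]:
    """Remove the Human Language type from topics lacking an ISO 639-3 code.

    Topics with a pending merge are left untouched, since the merge may
    bring in a code. Returns the names of the topics that were changed.
    """
    changed = []
    for topic in topics:
        if topic.merge_pending:
            continue
        if HUMAN_LANGUAGE in topic.types and not topic.iso_639_3:
            topic.types.discard(HUMAN_LANGUAGE)
            changed.append(topic.name)
    return changed


if __name__ == "__main__":
    sample = [
        Topic("Chinese", {HUMAN_LANGUAGE, LANGUAGE_FAMILY}, iso_639_3="zho"),
        Topic("Sino-Tibetan", {HUMAN_LANGUAGE, LANGUAGE_FAMILY}),
        Topic("Unmerged duplicate", {HUMAN_LANGUAGE}, merge_pending=True),
    ]
    print(strip_human_language(sample))
```

Under this rule Chinese keeps both types because it has a 639-3 code,
which matches the "support both worlds" position above.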
> > I'd leave it alone until we are sure that the merge-meisters at
> > Metaweb have done their stuff with the 600 pending requests.
>
> OK, I'll put things on the back burner until the merge happens. I
> haven't seen anything except the ones that I flagged show up in the
> Freebase public queue, so I presume these are all off in some magic
> internal-only Metaweb queue currently.
I think the total was 624 merges. It seemed silly to mark all of them
explicitly since they had been hand vetted by linguists (I wrote a
reconciliation tool to help them). For the record, I used no internal
Metaweb tools to find or vet these language merges.
Kurt :-)