[Data-modeling] how much work done on modeling of personal names -- even for surname + given name?
Tom Morris
tfmorris at gmail.com
Thu Mar 5 19:45:35 UTC 2009
This is a thorny problem, but it's an important one since people and
their names are central to so many different types of application. I
think Ed's model is a great start and will send some comments on that
separately.
On Wed, Mar 4, 2009 at 2:24 PM, Robert Cook <robert at metaweb.com> wrote:
> I think that it would be fairly straightforward to create a model that
> has high utility but near impossible to create a model that was
> "right" that would be used by anybody.
For casual users, dealing with modern Western names, with the
constraint that everything has to be entered and output in exactly the
format it will be used in without any transformations, you won't beat
the utility/usability ratio of the current schema. If all those
parameters are fixed, we probably should just stop now. I don't think
that would be in Freebase's best interest though. I think there are
more useful schemas which still have high usability.
>> All elements of a single name need to be tied together and then the
>> aliasing of multiple names layered on top of that. Otherwise you
>> can't correctly model Mary Smith and Mrs. Mary Smith Jones. She's
>> never known as Mary Smith Smith.
>
> It's true that this would be not modeled explicitly capturing the
> state change, but it could be done in a useful way still. The real
> ambiguity in this case exists in the real world. Does Mary Smith Jones
> consider "jones" to be her middle name or part of her composite family
> name? This may be different from person to person.
Sure, it can vary from person to person, but it's an easily answered
question with a well defined answer. Just ask Mary or see where on
that credit card application she put 'Smith' or check the name of the
column that contains 'Smith' in the database. In the present-day USA,
she'd probably use 'Smith-Jones' with a hyphen, but even if she
decides her surname is 'Smith Jones,' it'll still be in a single
column in the database or single field on the form.
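As a rough sketch of what "tied together" could look like (Python, with
made-up class names that are not part of any existing Freebase schema, and
assuming Mary answered that her married surname is 'Smith Jones'): each
name is a complete unit, and a person carries a list of whole names rather
than independent lists of parts.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class PersonalName:
    """One complete name; its parts always travel together (hypothetical type)."""
    given: str
    surname: Optional[str] = None

    def full(self) -> str:
        # Western given-then-surname order is an assumption of this sketch.
        return " ".join(p for p in (self.given, self.surname) if p)


@dataclass
class Person:
    """Aliasing is layered on top: a person has several whole names."""
    names: List[PersonalName] = field(default_factory=list)

    def known_as(self) -> List[str]:
        return [n.full() for n in self.names]


mary = Person(names=[
    PersonalName(given="Mary", surname="Smith"),
    PersonalName(given="Mary", surname="Smith Jones"),
])

print(mary.known_as())  # ['Mary Smith', 'Mary Smith Jones']
```

Because the parts are never pooled across names, there is no way to
recombine them into 'Mary Smith Smith'.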
> Worse is the carrying forward of family names to offspring as given (or family?) names.
The fact that they're lexically equivalent doesn't make them
semantically equivalent. If they're using it as a given name, it's a
given name. Bruce Lee/Li and Lee Majors aren't using the same name.
This applies to Chris' case as well. If I name my daughter 'Moon
Unit,' then that's a girl's given name, whatever anyone else thinks it
means.
> Put another way, an imperfect model with data always trumps a
> "correct" one with no data.
I agree that modeling in the abstract is a fool's game, but I'm a
little confused by the assertion that there's no structured name data
available. I run into it at every turn. Here's a database that I was
looking at loading last night, with surname and given names stored
separately:
http://bioguide.congress.gov/scripts/biodisplay.pl?index=K000107
Here's the same person in another database that has the surname split
out:
http://dbpedia.org/page/John_F._Kennedy
Pretty much any corporate/organizational/governmental database with
personal names is going to have some type of structure.
>> You'll also want a place to put honorific prefixes (Dr., Prof., Gen.,
>> Rev., etc) and generational suffixes (Jr, Sr, III), particularly if
>> you're going to be constructing "full" names out of their component
>> pieces.
>
> These seem orthogonal and could be added if there is data.
I agree that they're orthogonal. How important they are really
depends on whether you want structured/semi-structured places for all
the pieces of a name. If you're going to maintain a separate "Display Name"
or some such which has all the pieces in the correct order, it
obviously is less critical. If not, you need prefixes and suffixes,
even if you don't attempt to assign any semantic meaning to them.
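To illustrate the "construct the full name from its pieces" case, here is
a minimal Python sketch. The type, field names, and the example person are
hypothetical, and the prefix/given/surname/suffix order is a Western
assumption. The prefix and suffix are stored as opaque strings with no
semantics attached.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NameParts:
    """Hypothetical record holding the pieces of one name."""
    given: str
    surname: str
    prefix: Optional[str] = None  # honorific, e.g. "Dr.", "Rev."
    suffix: Optional[str] = None  # generational, e.g. "Jr", "III"


def display_name(n: NameParts) -> str:
    """Join whichever pieces are present, in Western display order."""
    parts = [n.prefix, n.given, n.surname, n.suffix]
    return " ".join(p for p in parts if p)


print(display_name(NameParts("John", "Smith", prefix="Dr.", suffix="III")))
# Dr. John Smith III
print(display_name(NameParts("John", "Smith")))
# John Smith
```

Without the two optional fields there is nowhere to keep "Dr." or "III",
so the assembled name can never match a separately maintained display name.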
On Wed, Mar 4, 2009 at 5:16 PM, glenn mcdonald
<gmcdonald at itasoftware.com> wrote:
> Here's another approach: instead of isolating just surname, make your
> new property be /people/person/sortname. [...] The big advantage of this is
> that it keeps you out of the quagmire of modeling all the internal
> semantic complexity of worldwide naming patterns, but still allows
> you to model the sorting, which is the thing you most often care about.
The disadvantage is that it enforces a single global collating
sequence. That may be acceptable for a single application or a single
company, but quickly falls apart when faced with a diverse set of
requirements for sorting. At least if you've got the basic name parts
identified, applications can choose their own collating sequence.
They can even add an application-specific sort name using a private
type/property if need be, but it doesn't really work on a global
basis.
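A small Python sketch of the difference, using a hypothetical two-field
name record: when the parts are identified, each application supplies its
own sort key instead of relying on one stored sortname.

```python
from typing import List, NamedTuple, Tuple


class Name(NamedTuple):
    """Hypothetical record with the basic name parts identified."""
    given: str
    surname: str


def by_surname(n: Name) -> Tuple[str, str]:
    # Surname first, then given name; casefold for case-insensitive order.
    return (n.surname.casefold(), n.given.casefold())


def by_given(n: Name) -> Tuple[str, str]:
    # Given name first, as a different application might prefer.
    return (n.given.casefold(), n.surname.casefold())


people: List[Name] = [
    Name("Lee", "Majors"),
    Name("John", "Kennedy"),
    Name("Bruce", "Lee"),
]

print([f"{n.given} {n.surname}" for n in sorted(people, key=by_surname)])
# ['John Kennedy', 'Bruce Lee', 'Lee Majors']
print([f"{n.given} {n.surname}" for n in sorted(people, key=by_given)])
# ['Bruce Lee', 'John Kennedy', 'Lee Majors']
```

A single global sortname property would bake in one of these orders for
every consumer of the data.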
On Thu, Mar 5, 2009 at 1:02 PM, Kirrily Robert <kirrily at metaweb.com> wrote:
> I'm convinced that any CVT that has fields for "given name", "family
> name", etc won't work. If we did have a CVT we'd need to have fields
> of name part (eg. given name), name value (eg. "John"). That would
> allow you to say "saint name: Catherine" or "Patronym: Williamson" or
> "Regnal name: Charles" in addition to the more common (western)
> options. However, it wouldn't allow you to easily sort by surname.
While it's true that a surname/given name model doesn't support
patronymics, I don't think that necessarily means it "won't work." It
will have less than 100% coverage for structured naming of the world's
names, but I think that's a coverage/complexity tradeoff that can be
made. People with a single name are covered, since those are just given
names. Royal names, papal names, etc. are rare enough that I think they can be
shoehorned into whatever model works for the 90% case. Patronyms I
think are borderline, but I'd like to see them included.
> However, it wouldn't allow you to easily sort by surname.
The more general problem is to sort by the convention commonly accepted
in the society in question. In Iceland and, I'm guessing, other
patronymic societies, this is usually by given name.
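For what it's worth, here is a rough Python sketch of the (name part, name
value) shape Kirrily describes, with a sort key that takes a per-society
preference order. All the names in it (NamePart, first_part, sort_key, the
part labels, and the example given name 'Jon') are invented for
illustration and are not an existing schema.

```python
from typing import List, Optional, Sequence, Tuple

# One entry per name part: (part kind, value), e.g. ("patronym", "Williamson").
NamePart = Tuple[str, str]


def first_part(parts: List[NamePart], kind: str) -> Optional[str]:
    """Return the first value recorded for the given part kind, if any."""
    for part_kind, value in parts:
        if part_kind == kind:
            return value
    return None


def sort_key(parts: List[NamePart],
             prefer: Sequence[str] = ("family name", "given name")) -> str:
    """Sort on the first preferred part kind that this name actually has."""
    for kind in prefer:
        value = first_part(parts, kind)
        if value is not None:
            return value.casefold()
    # No preferred part present: fall back to whatever is listed first.
    return parts[0][1].casefold() if parts else ""


john = [("given name", "John"), ("family name", "Smith")]
jon = [("given name", "Jon"), ("patronym", "Williamson")]
charles = [("regnal name", "Charles")]

print(sort_key(john))                          # smith
print(sort_key(jon))                           # jon
print(sort_key(john, prefer=("given name",)))  # john
print(sort_key(charles))                       # charles
```

The lookup is more work than reading a dedicated surname field, which is
the usability cost of the more general shape.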
Comments on Ed's proposal to follow...
Tom