[Data-modeling] Location question
Richard Newman
rnewman at twinql.com
Sun Jul 12 05:54:51 UTC 2009
(Apologies if this has been discussed to death; a quick scan through
the archive didn't satisfy my curiosity. I began phrasing this as an
email to my only FB list, developers, but decided that it was a data-
modeling issue… so here I am.)
I recently ended up down a rabbit hole exploring common
categorizations of the United States — e.g., "Mountain States",
"Pacific Northwest" — and I'd appreciate some insight, opinion, or
brutal rejection from those more knowledgeable.
The Census Bureau Divisions (e.g., "Pacific") are clear-cut. The
common categorizations (e.g., "Pacific Northwest") on the other hand,
based as they are on convention and wooly usage, pose some interesting
challenges, which apply to notional regions around the world. They're
clearly locations — they are regions that contain other locations,
just like the Census Bureau Divisions — but they're not definite (one
cannot definitively say that Pendleton, OR is in the Palouse, because
some people exclude Oregon from that region).
This leads to at least two expressiveness problems:
* The "Northwestern United States" is always considered to include
Oregon and Washington… but also *sometimes* Idaho, Montana, Wyoming,
Southeast Alaska, and parts of Northern California (presumably
depending on who's talking). I don't see a good type or property in
Freebase to express varying levels of "considered membership" of any
kind (location containment or otherwise). I suppose I could put
memberships in my own domain (rendering them personal), or a base
(rendering them subjective), but surely some definitions of these
regions belong in the Commons? Linkage between these categories and
their states is a valuable navigation tool.
* The previous bullet mentioned "parts of Northern California" and
"Southeast Alaska". I feel uneasy reifying these entities ("the parts
of California sometimes considered to be in the Northwestern United
States" is a pretty bad name for a topic), even in my own domain, and
doing so still leaves the problem of choosing which parts to include!
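To make the first bullet concrete, here is a minimal sketch of qualified containment, in which each claim carries a status instead of being a bare "contains" link. It is not a Freebase schema: the Membership class, the "always" and "sometimes" labels, and the members() helper are all hypothetical names invented for illustration. Only whole states named above are listed; the sub-state pieces are left out precisely because of the reification problem in the second bullet.

```python
from dataclasses import dataclass


# Hypothetical model, not a Freebase type: a containment claim that
# records how firmly the member is considered part of the region.
@dataclass(frozen=True)
class Membership:
    region: str
    member: str
    status: str  # "always" or "sometimes"; labels invented for this sketch


NORTHWEST = "Northwestern United States"

# Claims taken from the description above; sub-state parts are omitted.
CLAIMS = [
    Membership(NORTHWEST, "Oregon", "always"),
    Membership(NORTHWEST, "Washington", "always"),
    Membership(NORTHWEST, "Idaho", "sometimes"),
    Membership(NORTHWEST, "Montana", "sometimes"),
    Membership(NORTHWEST, "Wyoming", "sometimes"),
]


def members(region, include_sometimes=False):
    """Return member names for a region, optionally including disputed ones."""
    allowed = {"always", "sometimes"} if include_sometimes else {"always"}
    return [c.member for c in CLAIMS if c.region == region and c.status in allowed]
```

With something like this, a browsing tool could show the firm members by default and offer the disputed ones as an opt-in, rather than forcing a single answer.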
The general approach to wooly containment seems to have been "dart
throwing", judging by examples:
<http://www.freebase.com/view/en/northern_california/-/location/location/contains>
-- Northern California appears to consist of the Bay Area, with SoMa
warranting its own mention! No Redding, which seems odd. Is there a
better way?
I realize that this is data modeling, and these are the breaks.
However, this occurred in the context of playing again with Parallax*,
in which I see no way to define a set of entities other than as the
range of a property of some other entity (not even through
intersection) — i.e., if I want to map cities in just Oregon and
Washington, the precise constituents of "Northwestern United States"
suddenly become very important, because it might be my only way to get
a handle on that collection. That experience suggests that this might
be an issue that shouldn't just be swept under the rug, particularly
given the broad applicability and importance of geographic data.
The alternatives to solving this are presumably either to omit
containment links for these categories (not much use when I pick
"Pacific Northwest" in Parallax and see no contained locations at
all), or to assert only those things which are reasonable certainties.
I'm not particularly happy with the latter approach, for the simple reason
that it leads to definitions that appear artificially narrow (because
only the intersection of divergent definitions has been included) and
differ from a reading of the English text imported from Wikipedia.
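The narrowing effect can be shown with plain sets. This is only a sketch: the three definitions and their labels are invented for illustration, using only states named earlier in this message, and the function names are hypothetical.

```python
# Hypothetical competing definitions of "Northwestern United States".
# The labels and exact memberships are assumptions made for this sketch.
DEFINITIONS = {
    "narrow": {"Oregon", "Washington"},
    "with_idaho": {"Oregon", "Washington", "Idaho"},
    "broad": {"Oregon", "Washington", "Idaho", "Montana", "Wyoming"},
}


def core(definitions):
    """States every definition agrees on (the 'reasonable certainties')."""
    return set.intersection(*definitions.values())


def envelope(definitions):
    """States that appear in at least one definition."""
    return set.union(*definitions.values())


def disputed(definitions):
    """States that some definitions include and others leave out."""
    return envelope(definitions) - core(definitions)
```

Asserting only core() gives Oregon and Washington, which is defensible but drops everything in disputed(), and that is exactly where the result stops matching the Wikipedia text.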
Thoughts?
-R
* Kicks the hell out of Wolfram Alpha for this kind of exploration, by
the way; Alpha has difficulty doing any kind of work with sets. Good
job, David!