[Data-modeling] Location question

Richard Newman rnewman at twinql.com
Sun Jul 12 05:54:51 UTC 2009


(Apologies if this has been discussed to death; a quick scan through  
the archive didn't satisfy my curiosity. I began phrasing this as an  
email to my only FB list, developers, but decided that it was a data- 
modeling issue… so here I am.)

I recently ended up down a rabbit hole exploring common  
categorizations of the United States — e.g., "Mountain States",  
"Pacific Northwest" — and I'd appreciate some insight, opinion, or  
brutal rejection from those more knowledgeable.

The Census Bureau Divisions (e.g., "Northwest") are clear-cut. The  
common categorizations (e.g., "Pacific Northwest") on the other hand,  
based as they are on convention and wooly usage, pose some interesting  
challenges, which apply to notional regions around the world. They're  
clearly locations — they are regions that contain other locations,  
just like the Census Bureau Divisions — but they're not definite (one  
cannot definitively say that Pendleton, OR is in the Palouse, because  
some people exclude Oregon from that region).

This leads to at least two expressiveness problems:

* The "Northwestern United States" is always considered to include  
Oregon and Washington… but also *sometimes* Idaho, Montana, Wyoming,  
Southeast Alaska, and parts of Northern California (presumably  
depending on who's talking). I don't see a good type or property in  
Freebase to express varying levels of "considered membership" of any  
kind (location containment or otherwise). I suppose I could put  
memberships in my own domain (rendering them personal), or a base  
(rendering them subjective), but surely some definitions of these  
regions belong in the Commons? Linkage between these categories and  
their states is a valuable navigation tool.

* The previous bullet mentioned "parts of Northern California" and  
"Southeast Alaska". I feel uneasy reifying these entities ("the parts  
of California sometimes considered to be in the Northwestern United  
States" is a pretty bad name for a topic), even in my own domain, and  
doing so still leaves the problem of choosing which parts to include!

The general approach to wooly containment seems to have been "dart  
throwing", judging by examples:

<http://www.freebase.com/view/en/northern_california/-/location/location/contains>

-- Northern California appears to consist of the Bay Area, with SoMa  
warranting its own mention! No Redding, which seems odd. Is there a  
better way?


I realize that this is data modeling, and these are the breaks.  
However, this occurred in the context of playing again with Parallax*,  
in which I see no way to define a set of entities other than as the  
range of a property of some other entity (not even through  
intersection) — i.e., if I want to map cities in just Oregon and  
Washington, the precise constituents of "Northwestern United States"  
suddenly become very important, because it might be my only way to get  
a handle on that collection. That experience suggests that this might  
be an issue that shouldn't just be swept under the rug, particularly  
given the broad applicability and importance of geographic data.

The alternatives to solving this are presumably either to omit  
containment links for these categories (not much use when I pick  
"Pacific Northwest" in Parallax and see no contained locations at  
all), or to assert only those things which are reasonable certainties.  
I'm not particularly happy with the latter approach, for the simple reason  
that it leads to definitions that appear artificially narrow (because  
only the intersection of divergent definitions has been included) and  
differ from a reading of the English text imported from Wikipedia.
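
To illustrate why asserting only the certainties looks narrow, here is a rough Python sketch that compares the intersection and the union of several divergent definitions. The three definitions are made up for illustration (no real source was consulted), and `DEFINITIONS` and `support` are hypothetical names.

```python
from collections import Counter

# Made-up definitions, for illustration only; real sources will differ.
DEFINITIONS = {
    "definition_a": {"Oregon", "Washington"},
    "definition_b": {"Oregon", "Washington", "Idaho"},
    "definition_c": {"Oregon", "Washington", "Idaho", "Montana", "Wyoming"},
}


def support(definitions):
    """Map each state to the fraction of definitions that include it."""
    counts = Counter(
        state for states in definitions.values() for state in states
    )
    return {state: n / len(definitions) for state, n in counts.items()}


scores = support(DEFINITIONS)

# States included by every definition: the "reasonable certainties".
narrow = sorted(state for state, score in scores.items() if score == 1.0)
# States included by at least one definition.
broad = sorted(scores)
```

In this toy data the intersection keeps two states while the union keeps five, and the per-state fraction is one possible way to record the difference rather than discarding it.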

Thoughts?

-R

* Kicks the hell out of Wolfram Alpha for this kind of exploration, by  
the way; Alpha has difficulty doing any kind of work with sets. Good  
job, David!

