[Developers] Geographer, incorrect data, and asking the right questions in the right context
Tom Morris
tfmorris at gmail.com
Sat Mar 7 00:04:14 UTC 2009
Hi Stefano,
I started to reply to this earlier, but it fell through the cracks
until Sean's recent post on the thread reminded me. First, let me say
that, although my tone can be brusque, I wasn't trying to insult your
application. I think Geographer is cool! I'm just not convinced
that it's being applied to a problem that it's well suited for. If
people were using it to geocode little-known places that they had
personal knowledge of, I'd love it. Also, handling of places is a
fundamental building block that is important to get right, so I tend
to get a little, shall we say, "excited."
I agree that enabling useful user contribution is key and that the
work needs to be easy to accomplish and fine grained so that it can be
started and stopped easily. However, I still think the current
workflow is missing the mark.
More inline below...
On Fri, Feb 20, 2009 at 3:09 PM, Stefano Mazzocchi <stefano at metaweb.com> wrote:
> Tom Morris wrote:
>
> [snip]
>
>> The general point of understanding the context that the user is
>> operating in is an important one though. I was playing with
>> Geographer yesterday and really question its premise. I got presented
>> with a whole raft of things that were a) misspelled, but in GNIS with
>> their correct spelling, b) U.S. census places, which clearly have geo
>> information associated with them somewhere, c) small NH towns which
>> are in GNIS or d) were Roman names for modern day British cities and
>> towns. These problems each have solutions which don't have anything
>> to do with users dragging pushpins around maps of areas that they
>> aren't familiar with and the fact that they now have geocoordinates is
>> just going to mask the real problem. Gazetteers have been well
>> understood since print days. Freebase needs to be making better use
>> of existing data sources like GNIS.
>
> There are several different points that you raise here and I feel
> compelled to address them separately:
>
> 1) "Gazetteers exist, so use them": I don't think there will ever be the
> day that Freebase will run out of external databases to harvest or
> cross-reference data against. Your point is valid, and I received the
> same exact criticism from people internally, but I feel it looks at the
> problem only and exclusively from the data quality angle while I'm more
> interested in the contribution-enticing angle of the problem.
>
> Generally speaking, it's easier and more natural for computer scientists
> (and I know because I'm one) to think in terms of 'better programs'
> rather than 'better social dynamics'. Believe me: I have to fight that
> natural tendency every day too :-)
It's important to not only entice users to contribute, but to get them
to contribute in a useful way. As I mentioned in my original note,
context is key. Since users are presented with random place names,
there is no opportunity for them to use personal knowledge to solve
the problem. They are presented with Google's best guess as to the
location, which is often correct, so they are going to be habituated
to always accepting its suggestion. If you're using this as a hack to
skirt Google's ToS on data harvesting, fine, but you're not really
getting useful user contribution and the same information is available
directly from GNIS without the degradation introduced by the user
clicking slightly off the mark with the map zoomed out.
The way the situation is presented to the user biases their response.
Although there's a Skip button, the users are strongly encouraged to
click on the map since it dominates the screen. Let's look at two
examples. The first case is the misspelling that I was talking about
in my first note. Freebase had imported Rancho Murrieta, CA (sic)
when it was misspelled on Wikipedia and then failed to track WP when
it corrected the mistake and spelled it correctly as Rancho Murieta.
The problem that needed solving is not placing "Rancho Murrieta" on a
map, but getting it spelled correctly and merged with its duplicate.
Additionally, Google was confused and positioned the map several miles
away from the correctly spelled location. Clicking on the map would
have added more data to the phantom location, reinforcing the
impression that it was an actual place. The user has no way of
indicating "misspelled" or anything other than "place" or "skip." If
they follow the directions that the app leads them in, all they're
going to do is reinforce/amplify the bad data in the system.
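To illustrate the kind of check I have in mind for the misspelling case, here is a minimal sketch that compares an imported name against a list of official names before anyone is asked to put it on a map. The function name, the cutoff value and the tiny name list are all made up for illustration; a real check would run against the actual GNIS names.

```python
import difflib

def suggest_spellings(name, official_names, cutoff=0.85):
    """Return up to three official names that closely resemble `name`,
    best match first. An empty list means no near match was found."""
    return difflib.get_close_matches(name, official_names, n=3, cutoff=cutoff)

# Hypothetical stand-in for a gazetteer name list (not real GNIS data).
official_names = ["Rancho Murieta", "Rancho Cordova", "Murrieta"]

# A near match that differs from the imported name suggests a probable
# misspelling or duplicate, which should be flagged rather than geocoded.
print(suggest_spellings("Rancho Murrieta", official_names))
```

Anything that comes back with a near match could be routed to a "possible duplicate" queue instead of the map.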
The second example is the Town of Corning, NY (/en/corning_new_york,
not to be confused with the City of Corning, NY /en/corning) which was
the first thing that I was presented with when I tried Geographer again
just now. I was shown a map with a large "Corning" right in the
middle of it. Where will 99% of users click? That's right, on the
City of Corning. Now I'm not exactly sure where the Town of Corning
is, but I can tell from looking at GNIS that it's not in quite the
same place as the city. Of course this process still won't get the
GNIS ID for the town of Corning added to the entry, so someone else
will have to do that and then they're going to have to resolve the
discrepancy between the GNIS geolocation for the town and the
geolocation invented when the user clicked on the map.
If you look at the Wikipedia page at
http://en.wikipedia.org/wiki/index.html?curid=260076 you'll see that
it not only has the GNIS ID and lat/long, but the year for the census
population, the FIPS code and a bunch of other useful information
which is getting dropped.
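The discrepancy between a clicked location and a gazetteer location is at least easy to detect mechanically. Here is a rough sketch using the standard haversine formula; the function names, the 1 km tolerance and the coordinates are assumptions for illustration only, not real values for Corning.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def needs_review(clicked, reference, tolerance_km=1.0):
    """True if the clicked point is farther than the tolerance from the
    reference (gazetteer) point. Both are (lat, lon) tuples."""
    return haversine_km(*clicked, *reference) > tolerance_km

# Made-up coordinates, roughly 11 km apart.
clicked = (42.0, -77.0)
reference = (42.1, -77.0)
print(needs_review(clicked, reference))
```

This only flags the conflict; someone still has to decide which value is right, which is the extra work I'm complaining about.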
Freeloading off user contribution is all well and good, but it can't
be a substitute for doing the basics correctly in the first place.
GNIS isn't some exotic fringe database that should be queued up for
loading at some indefinite point in the future. It's a fundamental
piece of infrastructural scaffolding which can anchor the rest of the
work in this space.
> 3) "The fact that they now have geocoordinates is just going to mask the
> real problem": this is the focal issue for me.
>
> My very personal opinion on this matter (and, beware: many people inside
> and outside the company disagree with me on this) is that an incorrect
> statement is better than no statement at all.
That would explain the multitude of incorrect statements in Freebase. :-)
Seriously though, I hope the examples above demonstrated that there
are cases where this is not true. We can argue about relative
frequencies, but clearly there are cases where bad data is worse than
no data at all.
> 2) The easier it is for me to estimate the burden of contribution (dragging
> a pushpin on a map vs. entering lat/long by hand), the higher my
> probability of contribution.
Making things easier for people doesn't necessarily mean pushing pins
in maps. If you presented the top 3 choices from GNIS, nicely
formatted, perhaps displayed on a map, and asked the user to pick A,
B, C or "none of these is a good match," I bet you'd get much better
data. If you're going to have them invent the location from scratch
in some circumstances, why not ask them to rate how good they think
their positioning is? That way you can take all the "it's somewhere
around here" quality ratings and run them through another stage of
processing.
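As a sketch of what that could look like, the snippet below builds an A/B/C question from candidate matches and routes hand-placed pins by the user's own rating. Everything here is hypothetical: the Candidate type, the rating labels and the "second_pass" queue name are mine, and the candidates would come from a GNIS lookup that isn't shown.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    name: str
    lat: float
    lon: float

def build_options(candidates: List[Candidate], limit: int = 3) -> List[str]:
    """Label the top candidates A, B, C... and append a 'none' choice."""
    options = [
        f"{chr(ord('A') + i)}. {c.name} ({c.lat:.3f}, {c.lon:.3f})"
        for i, c in enumerate(candidates[:limit])
    ]
    options.append("None of these is a good match")
    return options

# Hypothetical self-rating scale for pins the user places from scratch.
RATINGS = ("exact", "close", "somewhere around here")

def route_pin(rating: str) -> str:
    """Send low-confidence pins to a second stage of processing."""
    if rating not in RATINGS:
        raise ValueError(f"unknown rating: {rating!r}")
    return "second_pass" if rating == "somewhere around here" else "accept"

# Placeholder candidates, not real gazetteer entries.
candidates = [
    Candidate("Place One", 1.0, 2.0),
    Candidate("Place Two", 3.0, 4.0),
    Candidate("Place Three", 5.0, 6.0),
    Candidate("Place Four", 7.0, 8.0),
]
for line in build_options(candidates):
    print(line)
```

The point of the "none" choice and the rating is that the user gets a way to say "I don't know" without skipping the item entirely.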
As I said up top, I think there are good uses for Geographer in data
capture. I just think that the problem it is being applied to has
different root causes and we should go back and fix the root cause,
not attempt to patch things up after the fact.
Tom