[Developers] Geographer, incorrect data, and asking the right questions in the right context
Stefano Mazzocchi
stefano at metaweb.com
Mon Mar 23 01:39:07 UTC 2009
Tom,
thanks much for taking the time to write this and very sorry for the
delayed response, which is not because of lack of interest but simply
because I wanted to take the time to think it through.
First, I wanted to say that feedback like this is precious because rare
(most people that don't like something simply walk away or remain
silent) so thank you for that.
Tom Morris wrote:
> Hi Stefano,
>
> I started to reply to this earlier, but it fell through the cracks
> until Sean's recent post on the thread reminded me. First, let me say
> that, although my tone can be brusque, I wasn't trying to insult your
> application.
Oh, don't worry, it didn't feel brusque at all. Maybe after 10 years of
Apache mailing lists and 5 years of MIT professors' criticism, I'm the
first to appreciate honesty even when it borders on being a little brutal.
So, no worries about sugar coating the pill with me :-)
> I think Geographer is cool! I'm just not convinced
> that it's being applied to a problem that it's well suited for. If
> people were using it to geocode little known places that they had
> personal knowledge of, I'd love it. Also, handling of places is a
> fundamental building block that is important to get right, so I tend
> to get a little, shall we say, "excited."
And excitement is precisely what I want to entice. Of course, in a
positive way though, so any criticism on that front is highly welcome
and highly valuable.
> I agree that enabling useful user contribution is key and that the
> work needs to be easy to accomplish and fine grained so that it can be
> started and stopped easily. However I still think the current
> workflow is missing the mark.
Ok, I'm all ears.
> More inline below...
>
> On Fri, Feb 20, 2009 at 3:09 PM, Stefano Mazzocchi <stefano at metaweb.com> wrote:
>> Tom Morris wrote:
>>
>> [snip]
>>
>>> The general point of understanding the context that the user is
>>> operating in is an important one though. I was playing with
>>> Geographer yesterday and really question its premise. I got presented
>>> with a whole raft of things that were a) misspelled, but in GNIS with
>>> their correct spelling, b) U.S. census places, which clearly have geo
>>> information associated with them somewhere, c) small NH towns which
>>> are in GNIS or d) were Roman names for modern day British cities and
>>> towns. These problems each have solutions which don't have anything
>>> to do with users dragging pushpins around maps of areas that they
>>> aren't familiar with and the fact that they now have geocoordinates is
>>> just going to mask the real problem. Gazetteers have been well
>>> understood since print days. Freebase needs to be making better use
>>> of existing data sources like GNIS.
>> There are several different points that you raise here and I feel
>> compelled to address them separately:
>>
>> 1) "Gazetteers exist, so use them": I don't think there will ever be the
>> day that Freebase will run out of external databases to harvest or
>> cross-reference data against. Your point is valid, and I received the
>> same exact criticism from people internally, but I feel it looks at the
>> problem only and exclusively from the data quality angle while I'm more
>> interested in the contribution-enticing angle of the problem.
>>
>> Generally speaking, it's easier and more natural for computer scientists
>> (and I know because I'm one) to think in terms of 'better programs'
>> rather than 'better social dynamics'. Believe me: I have to fight that
>> natural tendency every day too :-)
>
> It's important to not only entice users to contribute, but to get them
> to contribute in a useful way. As I mentioned in my original note,
> context is key. Since users are presented with random place names,
> there is no opportunity for them to use personal knowledge to solve
> the problem. They are presented with Google's best guess as to the
> location, which is often correct, so they are going to be habituated
> to always accepting its suggestion. If you're using this as a hack to
> skirt Google's ToS on data harvesting, fine, but you're not really
> getting useful user contribution and the same information is available
> directly from GNIS without the degradation introduced by the user
> clicking slightly off the mark with the map zoomed out.
My very personal sampling indicates that such degradation is actually
equivalent to the degradation inflicted on publicly available GNIS data
simply by the fact that older triangulation techniques were used.
Even in the Getty TGN Gazetteer (which is regarded as one of the
highest quality datasets in this space), a personal sampling (mostly
small villages in the Alps of northern Italy) turned up entries that
were off by as much as half a mile, probably because the data was taken
from a survey by the Italian military geographic institute, which didn't
really have that precision back in the day.
As weird as it might seem, I don't know how much data in those gazetteers
has been overlaid on satellite images and rechecked for accuracy.
Which is what brought me to prototype the 'adjuster' part of the
Geographer app.
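To put a number on "off by half a mile", one rough approach is to
compare the gazetteer coordinate with the adjusted one using the
great-circle (haversine) distance. This is only a sketch: the function
names, the tolerance, and the idea of flagging entries this way are
assumptions for illustration, not how the Geographer adjuster works.

```python
import math

EARTH_RADIUS_KM = 6371.0088   # mean Earth radius
HALF_MILE_KM = 0.804672       # 0.5 statute miles in kilometres


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/long points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def needs_recheck(gazetteer_pos, adjusted_pos, tolerance_km=HALF_MILE_KM):
    """True when the two (lat, lon) positions disagree by more than the
    tolerance. The half-mile default is a hypothetical threshold."""
    return haversine_km(*gazetteer_pos, *adjusted_pos) > tolerance_km
```

A check like this only says that two sources disagree, not which one
is right; the disagreeing pairs would still need a human or a third
dataset to settle them.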
> The way the situation is presented to the user biases their response.
> Although there's a Skip button, the users are strongly encouraged to
> click on the map since it dominates the screen. Let's look at two
> examples. The first case is the misspelling that I was talking about
> in my first note. Freebase had imported Rancho Murrieta, CA (sic)
> when it was misspelled on Wikipedia and then failed to track WP when
> it corrected the mistake and spelled it correctly as Rancho Murieta.
> The problem that needed solving is not placing "Rancho Murrieta" on a
> map, but getting it spelled correctly and merged with its duplicate.
> Additionally, Google was confused and positioned the map several miles
> away from the correctly spelled location. Clicking on the map would
> have added more data to the phantom location, reinforcing the
> impression that it was an actual place. The user has no way of
> indicating "misspelled" or anything other than "place" or "skip." If
> they follow the directions that the app leads them in, all they're
> going to do is reinforce/amplify the bad data in the system.
Point taken.
I agree that the interface is designed to entice a contribution and it
is probably biased toward the idea that any contribution is better than
no contribution at all... if only because once there is a lat/long info
there, the topic can now be plotted on a map and people familiar with
the topic (or cleanup bots that can cross reference other datasets) will
immediately notice the mistake and are more likely to correct it than to
add the coordinates if they are not there (or the topic didn't show up
on a map).
Like I mentioned before, even inside Metaweb, many people disagree with
me on this: what is the marginal value of a statement? for whom? how can
we evaluate its 'truthfulness'? is there such a thing as a marginal
enticing probability of a statement (the ability for a contributed
statement to entice another one)? is this value dependent on its
truthfulness? is this value independent of the UI?
These are the questions that mostly tickle my curiosity while observing
our development of the "games with a purpose".
> The second example is the Town of Corning, NY (/en/corning_new_york,
> not to be confused with the City of Corning, NY /en/corning) which was
> the first thing that I was presented with when I tried Geographer again
> just now. I was presented a map with a large "Corning" right in the
> middle of it. Where will 99% of users click? That's right, on the
> City of Corning. Now I'm not exactly sure where the Town of Corning
> is, but I can tell from looking at GNIS that it's not in quite the
> same place as the city. Of course this process still won't get the
> GNIS id for the town of Corning added to the entry, so someone else
> will have to do that and then they're going to have to resolve the
> discrepancy between the GNIS geolocation for the town and the
> geolocation invented when the user clicked on the map.
>
> If you look at the Wikipedia page at
> http://en.wikipedia.org/wiki/index.html?curid=260076 you'll see that
> it not only has the GNIS ID and lat/long, but the year for the census
> population, the FIPS code and a bunch of other useful information
> which is getting dropped.
>
> Freeloading off user contribution is all well and good, but it can't
> be a substitute for doing the basics correctly in the first place.
> GNIS isn't some exotic fringe database that should be queued up for
> loading at some indefinite point in the future. It's a fundamental
> piece of infrastructural scaffolding which can anchor the rest of the
> work in this space.
Again, like I mentioned before, importing existing data into Freebase is
an ongoing effort and it will hardly ever be completed.
My primary interest with Geographer was not to compensate for or highlight
deficiencies in the data acquisition, not at all, but to prototype and
experiment with different kinds of user interfaces for data contribution
(and also test Acre in those scenarios).
I'm not interested in criticizing our data importing efforts or in
evaluating what is and what is not 'fundamental' Freebase data
scaffolding: it's simply not my place, nor do I have sufficient
knowledge to evaluate such criticism.
I am interested, on the other hand, in whatever ideas you might have for
making Geographer (or anything derived from it) more useful, maybe by
making it work on data types that are not as obvious and easy to find in
existing datasets.
>> 3) "The fact that they now have geocoordinates is just going to mask the
>> real problem": this is the focal issue for me.
>>
>> My very personal opinion on this matter (and, beware: many people inside
>> and outside the company disagree with me on this) is that an incorrect
>> statement is better than no statement at all.
>
> That would explain the multitude of incorrect statements in Freebase. :-)
Well, given that my personal contributions to Freebase's data amount to
a minuscule fraction and that my laissez-faire liberal attitude toward
a-priori quality control is not shared by many, I strongly doubt that.
> Seriously though, I hope the examples above demonstrated that there
> are cases where this is not true. We can argue about relative
> frequencies, but clearly there are cases where bad data is worse than
> no data at all.
Of course. My statement was obviously meant to spark a healthy debate
rather than lay misleading ground rules.
The real issue for me is finding ways of getting closer to answering the
questions that I posed above.
>> 2) The easier it is for me to estimate the burden of contribution (drag
>> pushpin on map vs. entering lat/long by hand) the higher my
>> probability of contribution.
>
> Making things easier for people doesn't necessarily mean pushing pins
> in maps. If you presented the top 3 choices from GNIS, nicely
> formatted, perhaps displayed on a map, and asked the user to pick A,
> B, C or "none of these is a good match," I bet you'd get much better
> data.
Very good suggestion.
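A minimal sketch of what that "pick A, B, C or none" question could
look like, assuming a gazetteer has already been loaded as
(name, id) pairs. The rows and ids below are hand-made placeholders,
not real GNIS records, and plain string similarity from Python's
difflib stands in for whatever matching a real lookup would use.

```python
import difflib

# Hypothetical rows: (place name, made-up feature id). Not real GNIS data.
GAZETTEER = [
    ("Rancho Murieta", "hypo-1"),
    ("Rancho Mirage", "hypo-2"),
    ("Rancho Cordova", "hypo-3"),
    ("Murrieta", "hypo-4"),
    ("Corning", "hypo-5"),
]

NONE_OPTION = "none of these is a good match"


def top_candidates(query, gazetteer, n=3, cutoff=0.6):
    """Return up to n (name, id) pairs whose names best resemble query."""
    by_name = dict(gazetteer)
    matches = difflib.get_close_matches(query, list(by_name), n=n, cutoff=cutoff)
    return [(name, by_name[name]) for name in matches]


def build_question(query, gazetteer):
    """Format the candidates as lettered options plus an explicit opt-out."""
    choices = top_candidates(query, gazetteer)
    options = [f"{letter}) {name} [{fid}]"
               for letter, (name, fid) in zip("ABC", choices)]
    options.append(NONE_OPTION)
    return options
```

With the misspelled "Rancho Murrieta" as the query, the correctly
spelled "Rancho Murieta" comes back as the closest candidate, so the
user is asked to confirm a match instead of placing a phantom
location on a map.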
> If you're going to have them invent the location from scratch
> in some circumstances, why not ask them to rate how good they think
> their positioning is? That way you can take all the "it's somewhere
> around here" quality ratings and run them through another stage of
> processing.
Another good idea.
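One way to picture that second stage, as a sketch only: each placement
carries the user's own rating, and anything below a threshold is set
aside for further processing instead of being treated as settled. The
1-5 scale, the threshold, and the record fields are all assumptions
made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    topic_id: str
    lat: float
    lon: float
    confidence: int  # hypothetical scale: 1 = "somewhere around here", 5 = "I know this spot"


def route(placements, threshold=3):
    """Split placements into (accepted, needs_review) by self-rated confidence."""
    accepted, needs_review = [], []
    for p in placements:
        if p.confidence >= threshold:
            accepted.append(p)
        else:
            needs_review.append(p)
    return accepted, needs_review
```

The review queue could then be fed to a gazetteer cross-check or shown
to other users, so a low-confidence click never silently becomes the
location of record.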
> As I said up top, I think there are good uses for Geographer in data
> capture. I just think that the problem it is being applied to has
> different root causes and we should go back and fix the root cause,
> not attempt to patch things up after the fact.
I think there is value in your criticism and I'll take you up on that
for the next round of prototyping around Geographer.
Again, thank you for your precious feedback, I hope you'll stick around to
give me more in the next iteration of Geographer's prototypes.
--
Stefano Mazzocchi Application Catalyst
Metaweb Technologies, Inc. stefano at metaweb.com