[Developers] how to make best use of Freebase suggest for reconciling collections of names

Raymond Yee raymond.yee at gmail.com
Wed Jun 24 20:11:07 UTC 2009


Thanks, Tom, for answering my query -- I was hoping you would do so!  
See my responses below.

Tom Morris wrote:
> That sounds like a useful tool.  So useful in fact, that it should be
> provided as part of the basic offering of Freebase.
>   
I'm glad to hear that my suggestion is along the lines of what others 
think would be useful. I too think the tools should be part of the 
Freebase offering -- but I remain interested enough in getting some 
entities reconciled that I'm willing to put some work towards building 
some tools (though they might be fairly rough.)
> One of the things that you don't describe at all is what role Types
> play in this process.  Are you looking for entities of a specific type
> (government agencies)?  Obviously the Freebase topics are going to be
> a mixture of typed, untyped, and non-existent, so you'll need to deal
> with them all on that end, but if you are looking for a specific
> target type or types, you can help yourself.
>   
I didn't mention Types at all because I was hoping to get fairly far w/o 
constraining or suggesting by Types.  But yes, often I'll be reconciling 
items of uniform typing and I'd like to be able to express constraints 
like "the item is a government agency and is, in fact, an American 
government agency, but if a promising Freebase entity is not so typed, don't 
eliminate it."  From glancing at 
http://www.freebase.com/api/service/search?help#limiting I'm under the 
impression that I should be able to express this statement. 

What would be nice is to express these constraints simply in the UI of a 
reconciliation app, something that would let me tweak constraints and 
see how I can get things to match. 
> On Wed, Jun 24, 2009 at 1:47 PM, Raymond Yee <raymond.yee at gmail.com> wrote:
>
>   
>> a) I'd like to find ways to make it easy for the user to know which names have been matched with "high confidence" by Freebase so that she can scan through the list quickly... Do people have suggestions about how to measure "high confidence" beyond a high relevance score?
>>
>>     
>
> I looked at two things when I did this: 1) absolute score and 2) delta
> between top score and the rest as compared to the standard deviation
> for the series.  That will help, but search's scoring isn't tunable
> and it's attempting to be all things to all people, so it may not
> return results that are maximally useful for your application.
>   
Thanks for the pointer -- did you weight #1 and #2 in some way?
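
As a rough illustration of the two signals Tom describes (the absolute
top score, and the gap between the top score and the rest measured
against the standard deviation of the series), here is a minimal Python
sketch. The function name, both thresholds, the use of the runner-up as
"the rest", and the use of the population standard deviation are all
assumptions made for illustration; the thread does not say how the two
signals were combined or weighted.

```python
from statistics import pstdev


def is_high_confidence(scores, min_score=50.0, min_gap_in_stdevs=1.5):
    """Return True if the top score looks like a clear winner.

    scores: relevance scores returned for one query, in any order.
    min_score and min_gap_in_stdevs are illustrative thresholds
    (assumptions), not values taken from the thread.
    """
    if not scores:
        return False
    ranked = sorted(scores, reverse=True)
    top = ranked[0]
    # Signal 1: the absolute score must clear a floor.
    if top < min_score:
        return False
    if len(ranked) == 1:
        return True  # a single candidate has nothing to compete with
    # Signal 2: gap to the runner-up, in units of the series' spread.
    spread = pstdev(ranked)
    if spread == 0:
        return False  # every candidate ties, so there is no clear winner
    gap = top - ranked[1]
    return gap / spread >= min_gap_in_stdevs
```

With these defaults, a series such as [100, 20, 18, 15] is flagged as
high confidence, while [100, 95, 20] is not, because the runner-up is
too close to the top score. The thresholds would need tuning against
real search results.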
> One of the ways that I worked around this was to basically consider
> the search score to be one input to an overall scoring system that
> takes other factors into account.  White lists and black lists of
> types can be a help (e.g. if it's typed a Play, it's probably not a
> government agency -- unless it's a farce, of course :-))  I've also
> resorted to going back to the original Wikipedia article to do
> analysis on it for useful scoring hints.
>   
Great idea...
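
One way to picture the approach Tom describes, where the search score
is only one input and white lists and black lists of types adjust it,
is the sketch below. The function names, the bonus and penalty sizes,
the candidate dictionary layout, and the type IDs in the usage note are
illustrative assumptions, not anything specified in the thread. Note
that types only nudge the score; nothing is filtered out, so a
promising untyped topic stays in the running.

```python
def combined_score(search_score, candidate_types, wanted_types,
                   unlikely_types, bonus=20.0, penalty=30.0):
    """Adjust a search score using type white and black lists.

    bonus and penalty are arbitrary illustrative weights (assumptions).
    """
    score = float(search_score)
    types = set(candidate_types)
    if types & set(wanted_types):
        score += bonus      # white-listed type: more likely a match
    if types & set(unlikely_types):
        score -= penalty    # black-listed type: less likely a match
    return score


def rank_candidates(candidates, wanted_types, unlikely_types):
    """Sort candidates (dicts with 'id', 'score', 'types') best first."""
    return sorted(
        candidates,
        key=lambda c: combined_score(
            c["score"], c["types"], wanted_types, unlikely_types
        ),
        reverse=True,
    )
```

For example, with wanted_types=["/government/government_agency"] and
unlikely_types=["/theater/play"], a candidate typed as a play with a
search score of 60 ranks below an untyped candidate with a score of 55,
and both rank below an agency-typed candidate with a score of 50.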
>   
>> b) have input and output mechanisms that tie into, say, Google spreadsheet.  I often find it convenient to work with spreadsheets with columns of attributes -- one of which I'd like to have is the Freebase ID.  I'd like to point this app to a Google spreadsheet (or upload an Excel or OpenOffice.org Calc file or CSV or TSV file) and then have as an output the data with the Freebase ID filled in... (much like how the reconciliation interface works)
>>
>>     
>
> Depending how big your data set is, you're probably going to want to
> have multiple people working on it.  This means that it would be
> useful to have a shared queue that they work from.
>   
I agree...I'd like to be able to specify some context for others to 
understand what matching is happening.  A bit like Amazon's Mechanical 
Turk....
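
For the CSV case mentioned in (b), the "data in, same data out with a
Freebase ID column" step could look roughly like the sketch below. The
function name, the "freebase_id" column name, and the lookup callable
are assumptions for illustration; lookup stands in for whatever
reconciliation step (search plus human review) produces an ID, and no
Freebase API call is made here.

```python
import csv


def add_freebase_ids(in_file, out_file, name_column, lookup):
    """Copy CSV rows from in_file to out_file, adding a freebase_id column.

    lookup is a hypothetical callable that takes a name and returns a
    Freebase ID string, or None when no match was accepted.
    """
    reader = csv.DictReader(in_file)
    fieldnames = list(reader.fieldnames or [])
    if "freebase_id" not in fieldnames:
        fieldnames.append("freebase_id")
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        match = lookup(row[name_column])
        # Leave the cell empty for unmatched rows so they can be
        # reviewed later.
        row["freebase_id"] = match or ""
        writer.writerow(row)
```

Unmatched rows keep an empty cell rather than being dropped, which fits
the idea of a queue that several people can work through.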
> Not really relevant to your particular application, but something the
> whole data import process could really use is a much more powerful
> table parser - something which is at least equivalent to what Excel or
> Calc can do.  Using a spreadsheet tool as a front end would allow you
> to do this part there, but you really should be able to just dump html
> tables or csv files or anything somewhat structured into the list
> import box and massage it to get reasonable results without resorting
> to a separate tool.
>   
I agree.  In one of my particular cases, I have an XML document that I'd 
like to enrich with an attribute to indicate the Freebase ID.
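
A minimal sketch of that XML enrichment, using Python's standard
xml.etree.ElementTree module, might look like this. The function name,
the "freebase-id" attribute name, the assumption that the name to match
is the element's text, and the lookup callable are all hypothetical;
the real document's structure is not described in this thread.

```python
import xml.etree.ElementTree as ET


def annotate_with_freebase_ids(xml_text, tag, lookup, attr="freebase-id"):
    """Return xml_text with attr set on each matched <tag> element.

    lookup is a hypothetical callable mapping a name to a Freebase ID
    string, or None when there is no accepted match.
    """
    root = ET.fromstring(xml_text)
    for elem in root.iter(tag):
        name = (elem.text or "").strip()
        match = lookup(name)
        if match:
            elem.set(attr, match)  # unmatched elements are left untouched
    return ET.tostring(root, encoding="unicode")
```

Elements without an accepted match are left as they were, so the
document can be run through the same step again as more names are
reconciled.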
> Something similar to this has been on my To Do list for a while, but
> a) I haven't had time to work on it and b) I fundamentally believe
> that such basic pieces of plumbing should be provided as part of the
> basic system.  Having said that, if you get started on it, I'll help
> to the extent that I can.
>   
Thanks, Tom, for your offer of help.  I'll see how far I get and report 
back.

-Raymond
> Tom