[Freebase-discuss] Lexical cleanup of Freebase topic names

Brian Karlak zenkat at google.com
Mon Apr 30 19:50:10 UTC 2012


On Mon, Apr 30, 2012 at 9:49 AM, Tom Morris <tfmorris at gmail.com> wrote:

>
> Names invented by artists are certainly a challenging domain to tackle
> first!  Is the analysis lexical only or do you include factors such as
> where the band is from, where the album was released, etc?  Including such
> domain knowledge might help filter out some of the misidentifications.
>

The original MBZ load tried to take band origin into account.  The
misassigned names that made it through were the ones where that signal was
incomplete.

This fix, however, is purely lexical.  To be honest, we could try to add in
the other signals, but I suspect we'd be getting diminishing returns over
the manual fixes people have already submitted.
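
To illustrate what a purely lexical check can look like, here is a minimal
sketch in Python.  It is not the code used for this fix; the function names
are made up for the example, and it only guesses the script of a name from
Unicode character names.  Script is not the same thing as language (Cyrillic
does not imply Russian), so this is at best a first-pass signal.

```python
import unicodedata
from collections import Counter


def scripts_in(name):
    """Count the letters of `name` per rough script label.

    The label is the first word of the Unicode character name,
    e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC". This is an
    approximation, not a real Unicode script lookup.
    """
    counts = Counter()
    for ch in name:
        if ch.isalpha():
            label = unicodedata.name(ch, "UNKNOWN").split()[0]
            counts[label] += 1
    return counts


def dominant_script(name):
    """Return the most frequent script label, or None if there are no letters."""
    counts = scripts_in(name)
    return counts.most_common(1)[0][0] if counts else None
```

A name whose `scripts_in` result has more than one key (for example both
GREEK and LATIN) is a candidate for the "mixed" case discussed further down.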


>
>
>> Topics that were created from the English Wikipedia have been explicitly
>> excluded from the fix.
>>
>
> I see a number of Cyrillic digraphs which came from Wikipedia. e.g.
>
> http://www.freebase.com/view/m/05c2wwp
> http://www.freebase.com/view/m/05c2yw_
>
> which makes me wonder if the Wikipedia filter has a leak in it.  They're
> also a) not Russian and b) often associated with multiple languages.
>

I think I know what is going on here ... these were re-written by a
gardening process, and so the attribution has been obscured.  We need to
make sure that we also check for the presence of a /wikipedia/en key.
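
Roughly, the tightened filter would look like the sketch below.  This is
only an illustration: the record layout (a dict with "creator" and a list of
"key" entries carrying a "namespace") and the loader id are assumptions for
the example, not the actual cleanup code.

```python
# Hypothetical creator id, for illustration only.
WIKIPEDIA_CREATORS = {"/user/hypothetical_wikipedia_loader"}


def has_wikipedia_en_key(topic):
    """True if the topic carries any key in the /wikipedia/en namespace."""
    return any(
        key.get("namespace") == "/wikipedia/en"
        for key in topic.get("key", [])
    )


def eligible_for_fix(topic):
    """Exclude topics that came from the English Wikipedia.

    Attribution alone is not enough: a gardening process may have
    rewritten the topic and obscured the original creator, so the
    /wikipedia/en key is checked as well.
    """
    if topic.get("creator") in WIKIPEDIA_CREATORS:
        return False
    if has_wikipedia_en_key(topic):
        return False
    return True
```

With this check, a topic that was rewritten by some other process but still
has a /wikipedia/en key is excluded from the fix.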


>> If you see anything incorrect, or would like to provide more precise
>> language assignment than our automated algorithms were able to guess, feel
>> free to add the proper language code in column C of the spreadsheet.
>>
>
> Is there a language code for symbols which cross languages (like Phil's
> particle topics)?  Is there a way to indicate that a name is in multiple
> languages e.g. http://www.freebase.com/view/m/0d_39rz
>  Το μπλουζ του κόσμου - World blues
>
Currently Freebase has no way of representing "symbol" languages or "mixed"
languages.

As per the previous thread, I think that the best we can do with symbols is
to keep them as /lang/en, since it is still the "default" language (and
it's how those topics are referred to in English).

For mixed languages, I'm torn.  In some cases, splitting them into two
entries, one for each language, would be "best" ... although that could be
technically incorrect in some cases as well.

As it stands, I'm leaning towards the non-English language, as per the
original rationale behind this fix -- these will look strange to English
speakers.  However, I recognize that it's entirely arbitrary, and I'm open
to suggestions.
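
Spelled out as a rule, the current leaning is something like the sketch
below.  The function name and the tie-break are assumptions for the example
(as is any language id other than /lang/en), and the rule itself is the
arbitrary choice described above, not a settled policy.

```python
DEFAULT_LANG = "/lang/en"


def choose_lang(detected):
    """Pick one language id for a name.

    `detected` is the set of language ids guessed for the parts of the
    name. Symbol-only names are expected to arrive with an empty set.
    """
    non_english = sorted(lang for lang in detected if lang != DEFAULT_LANG)
    if not non_english:
        # Symbols or English-only names stay on the default language.
        return DEFAULT_LANG
    # Mixed names: prefer a non-English language. Taking the first id
    # in sorted order is an arbitrary but deterministic tie-break.
    return non_english[0]
```

For a name like the Greek/English example above, this would assign the
Greek language id rather than /lang/en.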


> The proposed changes are generally a net positive and, at 17,000 names, it
> is pretty small in the grand scheme of things, so I'm not opposed to it,
> but it would be nice to get it in as good a shape as possible.
>

Cool news.  Please feel free to update the spreadsheet as you feel
appropriate.  I think that will be the best way of getting the
highest-quality load in.

Brian