[Freebase-discuss] Lexical cleanup of Freebase topic names

Brian Karlak zenkat at google.com
Sat Apr 28 00:58:01 UTC 2012


On Thu, Apr 12, 2012 at 6:05 AM, Tom Morris <tfmorris at gmail.com> wrote:

> Some lexical cleanups which I think *would* be a good idea include:
>
> - converting all strings to Unicode Normalization Form C
> (http://unicode.org/reports/tr15/#Norm_Forms).  This is generally
> useful to make life simpler for people, but also the input fields of
> the web client just can't deal with combining marks without getting
> very confused
>
> - removing embedded tabs from names
>
> - removing trailing equals signs (and spaces) from translated book
> titles which were of the form "<lang 1 title> =: <lang 2 title>" and
> are now of the form "<lang 1 title> ="
>
> - adding missing articles (The, A, An, Das, Le, La, Il, El, etc) back
> to book titles missing them
>
> - fixing names where UTF-8 encoding was mistakenly interpreted as Mac
> Roman (/user/coco's art loads) or ISO Latin-1 (books, other topics)
> and then loaded back in as UTF-8 again
>

These are all good ideas, and I've filed DATA-471
<http://bugs.freebase.com/browse/DATA-471> to
track them.  The gardening team needs to get the new version of the system
into production before they can take them on, but once that is out, these
will be at the top of the list.
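
For anyone who wants to experiment in the meantime, here is a rough Python
sketch of three of the cleanups Tom lists (NFC normalization, tab removal,
trailing "=" removal) plus the double-encoding repair.  It is only an
illustration, not the gardening team's code: the function names are made
up, replacing a tab with a single space is an assumption, and the
missing-article fix is not attempted.

```python
import re
import unicodedata


def clean_name(name: str) -> str:
    """Hypothetical helper: apply the simple lexical cleanups to one name."""
    # Unicode Normalization Form C: compose combining marks where possible.
    name = unicodedata.normalize("NFC", name)
    # Embedded tabs: replaced by one space (assumption) so words stay apart.
    name = name.replace("\t", " ")
    # Dangling "=" (and surrounding spaces) left at the end of a title.
    name = re.sub(r"\s*=\s*$", "", name)
    return name


def repair_mojibake(name: str, wrong_codec: str = "latin-1") -> str:
    """Hypothetical helper: undo 'UTF-8 bytes read as wrong_codec'.

    If the round trip fails, the name was probably not mis-decoded this
    way, so it is returned unchanged.
    """
    try:
        return name.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return name
```

For the Mac Roman case, pass wrong_codec="mac_roman".  The repair is a
heuristic: a name that legitimately contains a sequence such as "Ã©" would
also be rewritten, so the results would still need review before loading.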

> I think removing names which are misencoded as "English" (even if
> we're able to guess with 100% accuracy) is problematic until better
> fallback rules are in place.  A lot of things don't work correctly
> when there's no English name (e.g. the history display
> http://www.freebase.com/history/view/m/093b7vw).  MQL's default is
> English.  People have had years of the English-only paradigm ingrained
> in their behavior (and more importantly, their apps).  Etc, etc.
>

Well, it's important to note that ever since MusicBrainz was imported,
we've had more than a million topics in Freebase that don't have a /lang/en
name.  Now that Freebase is part of a large global company, the number of
non-English entities is only going to grow.

Back in November 2010, when we did the MusicBrainz load, we had a discussion
about this <http://goo.gl/k3vIO>.  Based on that, we added in fallback
support to display non-English topic names in www.freebase.com when a
/lang/en name was not available.  In addition, we've done a lot of work to
internationalize the new dev.freebase.com client, including the ability to
edit non-English names.

On the flip side, mis-encoding non-English names as /lang/en has a high
cost.  Imagine the response of an English speaker to "與妳到永久"
<http://www.freebase.com/view/m/0lnjw9> in a list of otherwise English
results.  It's not just wrong -- it's
incomprehensible and confusing.  By keeping these obviously non-English
strings labelled as /lang/en, we're leaving ourselves open to a
particularly ugly bug.

Note that this doesn't apply to all languages.  Most English speakers are
comfortable with names in other languages, as long as they use Latin
script.  We're fine with "Der Ring des Nibelungen" and "Canciones de mi
corazón".  In fact, these names may be considered the proper names of these
topics in English.  However, names in non-Latin scripts will always appear
incorrect to English speakers.  "Скорее и быстро"
<http://www.freebase.com/view/m/0lg4kb> and "أنده عليك"
<http://www.freebase.com/view/m/0_s7dd> will never be
perceived as correct -- at a minimum, transliteration is required.

Based on this principle, we have identified a limited set of ~17K names
currently labelled as /lang/en that contain a majority of characters from
non-Latin character sets.  For all of these, we made our best guess as to
the proper language.  The technology for doing this is not perfect for
short name strings, unfortunately -- sometimes we mis-annotate Ukrainian as
Russian, for instance.  As with English and Spanish, however, it's our
belief that such mis-annotations are unlikely to be perceived as
gratuitously incorrect -- i.e., the new language is at least "more correct"
than the old English one.
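
To make the selection rule concrete, the "majority of characters from
non-Latin character sets" test could be sketched roughly as below.  This
is an illustration under assumptions, not the code that produced the ~17K
set: the function name is made up, only letters are counted, a strict
majority is required, and the language-guessing step is not shown.

```python
import unicodedata


def is_mostly_non_latin(name: str) -> bool:
    """Hypothetical check: are most letters in `name` from non-Latin scripts?"""
    # Ignore digits, punctuation, spaces and combining marks (assumption).
    letters = [ch for ch in name if ch.isalpha()]
    if not letters:
        return False
    # Unicode character names of Latin letters start with "LATIN".
    non_latin = sum(
        1 for ch in letters
        if not unicodedata.name(ch, "").startswith("LATIN")
    )
    # Strict majority: a 50/50 split is not flagged.
    return non_latin * 2 > len(letters)
```

With these choices, "Der Ring des Nibelungen" and "Canciones de mi
corazón" are left alone, while the Chinese, Russian and Arabic examples
above are flagged.  How ties and non-letter characters are treated is a
design choice that matters for short, mixed names.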

This set is available for review in this public Google spreadsheet
<https://docs.google.com/spreadsheet/ccc?key=0Araptci5cAfadHhWQVFSeXdTUE96dWhDZzdJLTVJRlE>.
The changes have also been loaded into the sandbox with attribution
/user/google_gardener/attr/6
<http://www.sandbox-freebase.com/inspect/user/google_gardener/attr/6?limit=1000>.
Almost all of the changes are updates to mis-annotated MusicBrainz
entries.  Topics that were created from the English Wikipedia have been
explicitly excluded from the fix.

We think this is a small, targeted fix that will increase the quality of
the data we offer to our community.  Please give it a look and let us know
what you think.  If you see anything incorrect, or would like to provide
more precise language assignment than our automated algorithms were able to
guess, feel free to add the proper language code in column C of the
spreadsheet.  If all goes well, we plan to push sometime next week.

Thanks,
Brian

PS -- We are aware of a bug where a few topics with equation-like
names such as "hΨ=0" made it through our filters.  We are adding manual
fixes to correct those few instances.