[Developers] Bad chars

Christopher R. Maden crism at metaweb.com
Mon Feb 9 12:50:48 UTC 2009


Ofer Kalisky <kalisky at hotmail.com> wrote:
> In some entries there are bad characters, which make the quadruples file (in the datadumps) bad for processing (with python's csv for example).
> 
> see /guid/9202a8c04000641f8000000002d1dbca for an example. (see the history, cause I just deleted the bad char)
> 
> Is there anything to do with this? cause every time I try to work with the datadumps I must do some iterations till I get rid of all entries with bad chars...

Cleaning up bad instances as you find them is a good start.

If there is a systemic problem you can identify, please file a bug report in Jira at <URL: https://bugs.freebase.com/ > and we will try to clean it up.

That particular item was a musical track loaded from MusicBrainz, and if you follow the link at the bottom of the page, you’ll see that U+001A is in the title of the track there, too.  That is a valid character, though some systems (particularly DOS/Windows ones) will interpret the ^Z as an end-of-file marker when reading text.

The MusicBrainz import process didn’t consider the possibility of control characters in the names of things; if this is a widespread problem, we could do a clean-up pass.  Were the other problems you found in topics derived from MusicBrainz?
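
In the meantime, something like the following might help you find the affected rows in one pass instead of by trial and error.  It is only a rough sketch: it assumes the quadruples file is UTF-8 text with tab-separated fields, and the function names and sample rows are made up for illustration, not taken from any Freebase tool.

```python
import re

# C0 control characters other than tab, LF and CR, plus DEL.
# U+001A (^Z) falls inside this range.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def find_control_chars(lines):
    """Yield (line_number, column_index, code_points) for each affected field.

    `lines` is any iterable of text lines; fields are assumed to be
    tab-separated (an assumption about the dump layout).
    """
    for line_number, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split("\t")
        for column_index, field in enumerate(fields):
            hits = CONTROL_CHARS.findall(field)
            if hits:
                yield line_number, column_index, ["U+%04X" % ord(c) for c in hits]


def strip_control_chars(text):
    """Return `text` with the control characters above removed."""
    return CONTROL_CHARS.sub("", text)


# Hypothetical sample rows; the identifiers are invented for this example.
sample = [
    "/guid/abc\t/type/object/name\t/lang/en\tSome\x1aTrack\n",
    "/guid/def\t/type/object/name\t/lang/en\tClean Title\n",
]
report = list(find_control_chars(sample))
cleaned = [strip_control_chars(row) for row in sample]
```

To run it over a real dump, pass an open file object rather than a list, for example one opened with `open(path, encoding="utf-8", newline="\n")`, so that only LF ends a line.  The report also shows which topics are affected, which would help answer the MusicBrainz question below.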

~Chris
-- 
Christopher R. Maden
Data Architect
Freebase.com: <URL: http://www.freebase.com/ >
Metaweb Technologies, Inc. <URL: http://www.metaweb.com/ >

