[Developers] mql_escape and UTF-8

Christopher R. Maden crism at metaweb.com
Wed Jul 30 16:19:42 UTC 2008


I am going to be somewhat overly detailed in this reply so that it will 
be archived for anyone else who is wondering.

Wikipedia article names can include Unicode characters:

Gabriel García Márquez

They do not include underscores, HTML entities, URL encodings, or 
anything else.

To refer to a Wikipedia article in a URL, one must turn spaces into 
underscores and URL-encode the result.  The underscore convention is 
Wikipedia's; the URL-encoding requirement applies to any URL on any Web 
site.  The canonical URL for the above-mentioned article is:

http://en.wikipedia.org/wiki/Gabriel_Garc%C3%ADa_M%C3%A1rquez

The acute accented i is Unicode character U+00ED.  Some broken systems 
will encode the character as %ED; this is wrong, though some Web servers 
will accept it.  The correct URL encoding is to turn the character into 
a UTF-8 byte sequence (whose details I am not going to go into here). 
The UTF-8 byte sequence for í is C3 AD, so the URL encoding is %C3%AD.
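
As a minimal sketch of the steps above, assuming Python 3 (where str is 
already a Unicode string): the standard library's urllib.parse.quote 
encodes to UTF-8 by default before percent-escaping each byte.  The 
function name wikipedia_url is only illustrative.

```python
from urllib.parse import quote

def wikipedia_url(title):
    """Build an English Wikipedia URL from an article name (illustrative helper)."""
    # Wikipedia's convention: spaces become underscores.
    path = title.replace(" ", "_")
    # quote() encodes the text as UTF-8, then percent-escapes each
    # byte that is not allowed to appear literally in a URL path.
    return "http://en.wikipedia.org/wiki/" + quote(path)

# U+00ED becomes the two UTF-8 bytes C3 AD.
print("\u00ed".encode("utf-8"))
print(wikipedia_url("Gabriel Garc\u00eda M\u00e1rquez"))
```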

Unfortunately, many standard URL escaping libraries do not correctly 
handle characters with Unicode codepoints above 127 (U+007F), which is 
why Kurt and I wrote the code that he posted.

The byte-wise URL encoding is horribly annoying, which is why Freebase 
uses a simpler escape mechanism.  Every character in a key is either 
represented by itself, or by a dollar sign and four hex digits.  The 
four hex digits are the Unicode codepoint for that character; í becomes 
$00ED.  Spaces are turned into $0020, but are rarely used.  To reduce 
annoyance when dealing with Wikipedia names, and to make our URLs look 
prettier, we copied the convention of turning spaces into _ before 
key-encoding them.

The key corresponding to the canonical name for the article about 
Gabriel García Márquez is thus Gabriel_Garc$00EDa_M$00E1rquez.
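
A rough Python 3 sketch of this key encoding follows.  It is not the 
code Kurt posted; encode_key is a hypothetical name, and the set of 
characters that represent themselves (ASCII letters, digits, underscore 
and hyphen) is an assumption, as is rejecting codepoints that do not fit 
in four hex digits.

```python
import string

# Assumption: only these characters are "represented by themselves".
LITERAL = set(string.ascii_letters + string.digits + "_-")

def encode_key(name):
    """Turn an article name into a $hhhh-escaped key (illustrative sketch)."""
    out = []
    # Spaces become underscores before key-encoding, as described above.
    for ch in name.replace(" ", "_"):
        if ch in LITERAL:
            out.append(ch)
        elif ord(ch) <= 0xFFFF:
            # Dollar sign plus the codepoint as four uppercase hex digits.
            out.append("$%04X" % ord(ch))
        else:
            # Assumption: four hex digits cannot express this codepoint.
            raise ValueError("codepoint does not fit in four hex digits: %r" % ch)
    return "".join(out)

print(encode_key("Gabriel Garc\u00eda M\u00e1rquez"))
```

The input is assumed to use precomposed characters (U+00ED rather than 
i followed by a combining accent); a decomposed string would produce 
different escapes.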

When working between these systems in Python, it is important to use 
Unicode strings, not normal byte strings, at all times.  Similarly, in 
Java, remember that all strings are UTF-16: sequences of 16-bit code 
units, with characters outside the Basic Multilingual Plane taking two 
units.  Encoding or decoding Freebase $hhhh syntax should be 
straightforward in either case.
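
For the decoding direction, a minimal Python 3 sketch might look like 
this.  The name decode_key is hypothetical, and the sketch assumes every 
escape is exactly a dollar sign followed by four hex digits, and that a 
literal underscore in the key stands for a space.

```python
import re

# A dollar sign followed by exactly four hex digits.
_ESCAPE = re.compile(r"\$([0-9A-Fa-f]{4})")

def decode_key(key):
    """Turn a $hhhh-escaped key back into a readable name (illustrative sketch)."""
    # Undo the space-to-underscore convention first, so that an escaped
    # underscore ($005F) survives as a real underscore afterwards.
    spaced = key.replace("_", " ")
    # Replace each escape with the character at that Unicode codepoint.
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), spaced)

print(decode_key("Gabriel_Garc$00EDa_M$00E1rquez"))
```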

HTH,
Chris
-- 
Christopher R. Maden
Data Architect
Freebase.com: <URL: http://www.freebase.com/ >
Metaweb Technologies, Inc. <URL: http://www.metaweb.com/ >

