[Developers] mql_escape and UTF-8
Christopher R. Maden
crism at metaweb.com
Wed Jul 30 16:19:42 UTC 2008
I am going to be somewhat overly detailed in this reply so that it will
be archived for anyone else who is wondering.
Wikipedia article names can include Unicode characters:
Gabriel García Márquez
They do not include underscores, HTML entities, URL encodings, or any
other kind of escaping.
To refer to a Wikipedia article in a URL, one must turn spaces into
underscores and URL-encode the result. The URL-encoding step is required
for any URL on any Web site, not just Wikipedia; the underscores are
Wikipedia's own convention. The canonical URL for the above-mentioned
article is:
http://en.wikipedia.org/wiki/Gabriel_Garc%C3%ADa_M%C3%A1rquez
The acute-accented i is Unicode character U+00ED. Some broken systems
will encode the character as %ED (its single-byte Latin-1 value); this
is wrong, though some Web servers will accept it. The correct URL
encoding is to turn the character into a UTF-8 byte sequence (whose
details I am not going to go into here) and then percent-encode each
byte. The UTF-8 byte sequence for í is C3 AD, so the URL encoding is
%C3%AD.
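The difference between the two encodings can be seen in a few lines of
Python. This is only an illustrative sketch in Python 3 (where every str
is a Unicode string); the helper name percent_encode_char is made up for
this example and is not part of any library.

```python
def percent_encode_char(ch, encoding="utf-8"):
    """Percent-encode one character, byte by byte, in the given encoding.

    Hypothetical helper for illustration only.
    """
    return "".join("%{:02X}".format(byte) for byte in ch.encode(encoding))


i_acute = "\u00ed"  # í, U+00ED

# Correct: encode to UTF-8 first, then escape each byte.
correct = percent_encode_char(i_acute)

# What the broken systems produce: the single Latin-1 byte.
broken = percent_encode_char(i_acute, encoding="latin-1")
```

Here correct is "%C3%AD" and broken is "%ED".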
Unfortunately, many standard URL escaping libraries do not correctly
handle characters with Unicode codepoints above 127 (U+007F), which is
why Kurt and I wrote the code that he posted.
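For comparison, here is a minimal sketch of the whole name-to-URL step.
It is not the code Kurt posted. It assumes Python 3, whose
urllib.parse.quote encodes str input as UTF-8 by default, and the
function name wikipedia_url is hypothetical. It also ignores any other
normalization Wikipedia may apply to titles.

```python
from urllib.parse import quote


def wikipedia_url(title, lang="en"):
    """Build a Wikipedia URL from an article name (illustrative sketch)."""
    # Step 1: Wikipedia's convention, spaces become underscores.
    path = title.replace(" ", "_")
    # Step 2: generic URL encoding; quote() percent-encodes the UTF-8
    # bytes of every non-ASCII character and leaves "_" alone.
    return "http://{}.wikipedia.org/wiki/{}".format(lang, quote(path))


url = wikipedia_url("Gabriel Garc\u00eda M\u00e1rquez")
```

With the example name, url comes out as
http://en.wikipedia.org/wiki/Gabriel_Garc%C3%ADa_M%C3%A1rquez.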
The byte-wise URL encoding is horribly annoying, which is why Freebase
uses a simpler escape mechanism. Every character in a key is either
represented by itself, or by a dollar sign and four hex digits. The
four hex digits are the Unicode codepoint for that character; í becomes
$00ED. Spaces are turned into $0020, but are rarely used. To reduce
annoyance when dealing with Wikipedia names, and to make our URLs look
prettier, we copied the convention of turning spaces into _ before
key-encoding them.
The key corresponding to the canonical name for the article about
Gabriel García Márquez is thus Gabriel_Garc$00EDa_M$00E1rquez.
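A rough sketch of that encoding in Python 3 follows. The function name
key_escape is hypothetical, and the set of characters that are
"represented by themselves" is an assumption made for this example
(ASCII letters, digits, underscore, and hyphen); the description above
does not spell out the exact set. Characters above U+FFFF do not fit in
four hex digits, so this sketch simply rejects them.

```python
import string

# ASSUMPTION: which characters may appear unescaped in a key.
SAFE_CHARS = set(string.ascii_letters + string.digits + "_-")


def key_escape(name):
    """Turn an article name into a $hhhh-escaped key (illustrative sketch)."""
    out = []
    # Spaces become underscores before key-encoding.
    for ch in name.replace(" ", "_"):
        if ch in SAFE_CHARS:
            out.append(ch)
            continue
        codepoint = ord(ch)
        if codepoint > 0xFFFF:
            # Four hex digits cannot hold this codepoint.
            raise ValueError("cannot escape U+{:X}".format(codepoint))
        out.append("${:04X}".format(codepoint))
    return "".join(out)


key = key_escape("Gabriel Garc\u00eda M\u00e1rquez")
```

For the example name, key is Gabriel_Garc$00EDa_M$00E1rquez.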
When working between these systems in Python, it is important to use
Unicode strings, not normal (byte) strings, at all times. Similarly, in
Java, remember that all strings are UTF-16 (sequences of 16-bit code
units; characters above U+FFFF take two). Encoding or decoding Freebase
$hhhh syntax should be straightforward in either case.
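As a sketch of the decoding direction in Python 3 (where str is already
a Unicode string): the function name key_unescape is hypothetical, and
turning underscores back into spaces assumes the key came from a
Wikipedia-style name.

```python
import re

# A dollar sign followed by exactly four hex digits.
ESCAPE_RE = re.compile(r"\$([0-9A-Fa-f]{4})")


def key_unescape(key):
    """Turn a $hhhh-escaped key back into a name (illustrative sketch)."""
    # Undo the space convention first, so that an escaped underscore
    # ($005F) is not mistaken for a space afterwards.
    text = key.replace("_", " ")
    # Replace each $hhhh with the character at that codepoint.
    return ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), text)


name = key_unescape("Gabriel_Garc$00EDa_M$00E1rquez")
```

Here name is the Unicode string "Gabriel García Márquez".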
HTH,
Chris
--
Christopher R. Maden
Data Architect
Freebase.com: <URL: http://www.freebase.com/ >
Metaweb Technologies, Inc. <URL: http://www.metaweb.com/ >