[Developers] mql_escape and UTF-8
Shug Boabby
shug.boabby at gmail.com
Wed Jul 30 16:39:54 UTC 2008
Thanks Chris... I think I'd already worked all that out, but I was
just wondering whether anybody had actually written a Java
encoder/decoder between UTF-8 and MW Hex. I realise it should be simple
to convert the $hhhh syntax, but it is troublesome to have to write
this code myself. I really wish you'd decided to just use the URL
encoding scheme, as that would require no additional work on our side
of things (despite it looking ugly). The $hhhh scheme is just not
standard enough (although, admittedly, prettier).
2008/7/30 Christopher R. Maden <crism at metaweb.com>:
> I am going to be somewhat overly detailed in this reply so that it will
> be archived for anyone else who is wondering.
>
> Wikipedia article names can include Unicode characters:
>
> Gabriel García Márquez
>
> They do not include underscores, HTML entities, URL encodings, or
> anything else.
>
> To refer to a Wikipedia article in a URL, one must turn spaces into
> underscores and URL-encode the result. This is true of any URL on any
> Web site, not just Wikipedia. The canonical URL for the above-mentioned
> article is:
>
> http://en.wikipedia.org/wiki/Gabriel_Garc%C3%ADa_M%C3%A1rquez
>
> The acute accented i is Unicode character U+00ED. Some broken systems
> will encode the character as %ED; this is wrong, though some Web servers
> will accept it. The correct URL encoding is to turn the character into
> a UTF-8 byte sequence (whose details I am not going to go into here).
> The UTF-8 byte sequence for í is C3 AD, so the URL encoding is %C3%AD.
>
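For the archive, here is a minimal Python 3 sketch of the steps described above: spaces become underscores, then each non-ASCII character is percent-encoded as its UTF-8 bytes. The function name `wikipedia_url` and its `lang` parameter are made up for illustration; `urllib.parse.quote` is from the standard library. This is not the code Kurt posted, just one way it could look.

```python
from urllib.parse import quote


def wikipedia_url(title, lang="en"):
    """Build an article URL from a plain Unicode title (hypothetical helper)."""
    # Wikipedia convention: spaces become underscores before URL-encoding.
    path = title.replace(" ", "_")
    # quote() percent-encodes the UTF-8 bytes of every non-ASCII character.
    # Underscores are unreserved and are left alone.
    encoded = quote(path, safe="", encoding="utf-8")
    return "http://%s.wikipedia.org/wiki/%s" % (lang, encoded)


# The UTF-8 byte sequence for U+00ED is C3 AD, hence %C3%AD in the URL.
assert "\u00ed".encode("utf-8") == b"\xc3\xad"
print(wikipedia_url("Gabriel Garc\u00eda M\u00e1rquez"))
```

Passing `safe=""` means a `/` in a title would also be escaped; whether that is wanted depends on the site, so treat it as an assumption.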
> Unfortunately, many standard URL escaping libraries do not correctly
> handle characters with Unicode codepoints above 127 (U+007F), which is
> why Kurt and I wrote the code that he posted.
>
> The byte-wise URL encoding is horribly annoying, which is why Freebase
> uses a simpler escape mechanism. Every character in a key is either
> represented by itself, or by a dollar sign and four hex digits. The
> four hex digits are the Unicode codepoint for that character; í becomes
> $00ED. Spaces are turned into $0020, but are rarely used. To reduce
> annoyance when dealing with Wikipedia names, and to make our URLs look
> prettier, we copied the convention of turning spaces into _ before
> key-encoding them.
>
> The key corresponding to the canonical name for the article about
> Gabriel García Márquez is thus Gabriel_Garc$00EDa_M$00E1rquez.
>
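A rough Python 3 sketch of the key encoding as described above. The message does not say exactly which characters are "represented by themselves", so the `SAFE` set here (ASCII letters, digits, underscore, hyphen) is an assumption, as are the name `key_encode` and the choice to reject characters above U+FFFF, which do not fit in four hex digits.

```python
import string

# Assumption: only these characters pass through unescaped.
SAFE = set(string.ascii_letters + string.digits + "_-")


def key_encode(name):
    """Encode a Unicode name into $hhhh key syntax (illustrative sketch)."""
    out = []
    # Spaces are turned into underscores before key-encoding.
    for ch in name.replace(" ", "_"):
        if ch in SAFE:
            out.append(ch)
        elif ord(ch) <= 0xFFFF:
            # Dollar sign plus the codepoint as four uppercase hex digits.
            out.append("$%04X" % ord(ch))
        else:
            # Four hex digits cannot hold this codepoint; behaviour here
            # is an assumption, not documented in this thread.
            raise ValueError("codepoint above U+FFFF: %r" % ch)
    return "".join(out)


print(key_encode("Gabriel Garc\u00eda M\u00e1rquez"))
```

The input should be in composed form (a single U+00ED rather than i plus a combining accent), otherwise the resulting key will differ from the one shown above.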
> When working between these systems in Python, it is important to use
> Unicode strings, not normal strings, at all times. Similarly, in Java,
> remember that all strings are sequences of UTF-16 code units (2 bytes
> each). Encoding or decoding Freebase $hhhh syntax should be
> straightforward in either case.
>
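Going the other way, a decoding sketch in Python 3 under the same assumptions. The name `key_decode` is hypothetical. Underscores are converted to spaces before the escapes are expanded, so an escaped underscore ($005F) survives as a literal underscore.

```python
import re

# A dollar sign followed by exactly four hex digits.
_ESCAPE = re.compile(r"\$([0-9A-Fa-f]{4})")


def key_decode(key):
    """Decode $hhhh key syntax back to a Unicode name (illustrative sketch)."""
    # Undo the space-to-underscore convention first, so that an escaped
    # underscore produced in the next step is not turned into a space.
    text = key.replace("_", " ")
    # Replace each $hhhh escape with the character at that codepoint.
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


print(key_decode("Gabriel_Garc$00EDa_M$00E1rquez"))
```

This treats every bare underscore as a space, which matches the Wikipedia case where article names contain no underscores; keys from other namespaces may need different handling.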
> HTH,
> Chris
> --
> Christopher R. Maden
> Data Architect
> Freebase.com: <URL: http://www.freebase.com/ >
> Metaweb Technologies, Inc. <URL: http://www.metaweb.com/ >
> _______________________________________________
> Developers mailing list
> Developers at freebase.com
> http://lists.freebase.com/mailman/listinfo/developers
>