[Developers] mql_escape and UTF-8
brendan
brendan at metaweb.com
Wed Jul 30 18:09:56 UTC 2008
fyi, our bugbase is only available on our internal network, so most of
you won't be able to see that ticket.
brendan
On Jul 30, 2008, at 11:07 AM, Warren Harris wrote:
> I just ran this by Chris Maden, but perhaps you want to weigh in: http://jira.metaweb.com/browse/ME-986
>
> Warren
>
> On Jul 30, 2008, at 10:26 AM, Nick Thompson wrote:
>
>> The best way to think of the $xxxx encoding is as "MQL key encoding".
>> These encoded keys are only used for /type/key/value properties I
>> believe.
>>
>> The original reason for MQL key encoding was to allow slash-separated
>> MQL ids. It is also used to allow "." in the sort syntax and to
>> allow the use of comparison suffixes like "<" and ">" without
>> ambiguity. So we needed some syntax to escape these characters so
>> that arbitrary text could be stored in /type/key.
>>
>> I will add that MQL key escaping should really be thought of as an
>> aspect of the string encoding of /type/key, not as inherent in
>> /type/key itself. Because MQL key encoding and decoding are one-to-one
>> mappings, it should be possible to provide unencoded access to MQL
>> keys. I think there is some low-level support for this in MQL
>> enumerations, but it's not exposed through /type/key as far as I know.
>>
>> So why not URL encoding?
>>
>> Unfortunately URL encoding is the worst quoting syntax in common use.
>> Decoding is straightforward, but implementations differ about which
>> characters need to be encoded, and the rules are different in
>> different parts of the URL. We didn't want to define a new encoding,
>> but URL encoding seemed like a very risky choice. You would be able to
>> use a stock URL decoder, but your stock URL encoder might produce
>> confusing problems with some characters.
>>
>> Furthermore, there are cases where you have to layer URL encoding on
>> top of MQL key encoding. Double URL encoding gets really, really ugly.
>> This is the reason for the choice of '$' as an escape character:
>> $ does not require escaping in URLs. Since the key encoding is
>> stricter than URL encoding about which characters are escaped,
>> MQL ids should all be valid in URLs, and URL-decoding an MQL id
>> should not change the MQL id at all.
>>
>> So yes, it's a pain, but it is a pretty reasonable solution to some
>> tricky problems. We do need a better public definition of the
>> encoding, test cases, and a library of implementations in various
>> languages. At this point the Python code that Kurt posted is probably
>> the best starting point for implementors.
>>
>> nick
>>
>> Shug Boabby wrote:
>>> Thanks Chris... I think I'd already worked all that out, but I was
>>> just wondering if anybody had actually written a Java encoder/decoder
>>> between UTF-8/MW Hex. I realise it should be simple to convert the
>>> $000 syntax, but it is troublesome to have to write this code myself.
>>> I really wish you'd decided to just use the URL encoding scheme, as
>>> that would require no additional work on our side of things (despite
>>> it looking ugly). The $ syntax is just not standard enough (although,
>>> admittedly, prettier).
>>>
>>> 2008/7/30 Christopher R. Maden <crism at metaweb.com>:
>>>> I am going to be somewhat overly detailed in this reply so that it
>>>> will be archived for anyone else who is wondering.
>>>>
>>>> Wikipedia article names can include Unicode characters:
>>>>
>>>> Gabriel García Márquez
>>>>
>>>> They do not include underscores, HTML entities, URL encodings, or
>>>> anything else.
>>>>
>>>> To refer to a Wikipedia article in a URL, one must turn spaces into
>>>> underscores and URL-encode the result. The URL-encoding step applies
>>>> to any URL on any Web site, not just Wikipedia. The canonical URL for
>>>> the above-mentioned article is:
>>>>
>>>> http://en.wikipedia.org/wiki/Gabriel_Garc%C3%ADa_M%C3%A1rquez
>>>>
>>>> The acute accented i is Unicode character U+00ED. Some broken systems
>>>> will encode the character as %ED; this is wrong, though some Web
>>>> servers will accept it. The correct URL encoding is to turn the
>>>> character into a UTF-8 byte sequence (whose details I am not going to
>>>> go into here). The UTF-8 byte sequence for í is C3 AD, so the URL
>>>> encoding is %C3%AD.
>>>>
>>>> Unfortunately, many standard URL escaping libraries do not correctly
>>>> handle characters with Unicode codepoints above 127 (U+007F), which
>>>> is why Kurt and I wrote the code that he posted.
>>>>
>>>> The byte-wise URL encoding is horribly annoying, which is why Freebase
>>>> uses a simpler escape mechanism. Every character in a key is either
>>>> represented by itself, or by a dollar sign and four hex digits. The
>>>> four hex digits are the Unicode codepoint for that character; í
>>>> becomes $00ED. Spaces are turned into $0020, but are rarely used. To
>>>> reduce annoyance when dealing with Wikipedia names, and to make our
>>>> URLs look prettier, we copied the convention of turning spaces into _
>>>> before key-encoding them.
>>>>
>>>> The key corresponding to the canonical name for the article about
>>>> Gabriel García Márquez is thus Gabriel_Garc$00EDa_M$00E1rquez.
>>>>
>>>> When working between these systems in Python, it is important to use
>>>> Unicode strings, not normal strings, at all times. Similarly, in Java,
>>>> remember that all strings are UTF-16 (Unicode in 2-byte code units).
>>>> Encoding or decoding Freebase $hhhh syntax should be straightforward
>>>> in either case.
>>>>
>>>> HTH,
>>>> Chris
>>>> --
>>>> Christopher R. Maden
>>>> Data Architect
>>>> Freebase.com: <URL: http://www.freebase.com/ >
>>>> Metaweb Technologies, Inc. <URL: http://www.metaweb.com/ >
>>>> _______________________________________________
>>>> Developers mailing list
>>>> Developers at freebase.com
>>>> http://lists.freebase.com/mailman/listinfo/developers
>>>>
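
A minimal Python sketch of the $hhhh key encoding Chris describes above, for anyone who wants something to start from. This is not the code Kurt posted. The names encode_key, decode_key and SAFE_CHARS are made up for illustration, and the set of characters left unescaped (ASCII letters, digits, "_" and "-") is an assumption, since the thread does not spell it out. It assumes Python 3, where every str is already Unicode.

```python
import re
from urllib.parse import quote

# Assumption (not stated in the thread): ASCII letters, digits, "_" and "-"
# are the characters that may appear unescaped in a key.
SAFE_CHARS = frozenset(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789_-"
)

# One escape is a dollar sign followed by exactly four hex digits.
ESCAPE_RE = re.compile(r"\$([0-9A-Fa-f]{4})")


def encode_key(name: str) -> str:
    """Turn spaces into "_", then escape every unsafe character as $hhhh."""
    out = []
    for ch in name.replace(" ", "_"):
        if ch in SAFE_CHARS:
            out.append(ch)
        elif ord(ch) <= 0xFFFF:
            out.append("$%04X" % ord(ch))
        else:
            # Four hex digits cannot hold code points above U+FFFF; the
            # thread does not say how those are handled, so refuse.
            raise ValueError("cannot encode U+%06X in four hex digits" % ord(ch))
    return "".join(out)


def decode_key(key: str) -> str:
    """Replace each $hhhh escape with its character; "_" is left alone."""
    return ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), key)


name = "Gabriel Garc\u00eda M\u00e1rquez"
key = encode_key(name)
print(key)                            # Gabriel_Garc$00EDa_M$00E1rquez
print(decode_key(key))                # Gabriel_García_Márquez
print(quote(name.replace(" ", "_")))  # Gabriel_Garc%C3%ADa_M%C3%A1rquez
```

The last line shows the byte-wise URL encoding of the same name for comparison. decode_key deliberately does not turn "_" back into a space, because in this sketch a literal underscore and a converted space look the same once encoded.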