[Developers] mql_escape and UTF-8
Warren Harris
warren at metaweb.com
Wed Jul 30 18:07:34 UTC 2008
I just ran this by Chris Maden, but perhaps you want to weigh in: http://jira.metaweb.com/browse/ME-986
Warren
On Jul 30, 2008, at 10:26 AM, Nick Thompson wrote:
> The best way to think of the $xxxx encoding is as "MQL key encoding".
> These encoded keys are only used for /type/key/value properties I
> believe.
>
> The original reason for MQL key encoding was to allow slash-separated
> MQL ids. It is also used to allow "." in the sort syntax and to
> allow the use of comparison suffixes like "<" and ">" without
> ambiguity. So we needed some syntax to escape these characters so
> that arbitrary text could be stored in /type/key.
>
> I will add that MQL key escaping should really be thought of as an
> aspect of the string encoding of /type/key, not as inherent in
> /type/key itself. Because MQL key encoding and decoding are one-to-one
> mappings, it should be possible to provide unencoded access to MQL
> keys. I think there is some low-level support for this in MQL
> enumerations, but it's not exposed through /type/key as far as I know.
>
> So why not URL encoding?
>
> Unfortunately URL encoding is the worst quoting syntax in common use.
> Decoding is straightforward, but implementations differ about which
> characters need to be encoded, and the rules are different in
> different parts of the URL. We didn't want to define a new encoding,
> but URL encoding seemed like a very risky choice. You would be able
> to use a stock URL decoder, but your stock URL encoder might produce
> confusing problems with some characters.
>
> Furthermore, there are cases where you have to layer URL encoding on
> top of MQL key encoding. Double URL encoding gets really ugly.
> This is the reason for the choice of '$' as an escape character:
> $ does not require escaping in URLs. Since the key encoding is
> stricter than URL encoding about which characters are escaped,
> MQL ids should all be valid in URLs, and URL-decoding of an MQL id
> should not change the MQL id at all.
>
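A minimal sketch of the two points above, assuming Python 3 and its standard urllib.parse module (neither is named in the thread). The key is the example that appears later in this thread, and the variable names are only illustrative.

```python
from urllib.parse import quote, unquote

key = "Gabriel_Garc$00EDa_M$00E1rquez"

# URL-decoding only acts on "%" escapes; a key has none, so it is unchanged.
decoded = unquote(key)

# Python's quote() escapes "$" by default, so it is listed as safe here.
in_url = quote(key, safe="$")

# Percent-encoding an already percent-encoded string escapes each "%" again.
once = quote("\u00ed")
twice = quote(once)

print(decoded == key, in_url == key)  # True True
print(once, twice)                    # %C3%AD %25C3%25AD
```

Note that this particular encoder has to be told that "$" is safe, which is a small instance of the encoder disagreement described above.
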
> So yes, it's a pain, but it is a pretty reasonable solution to some
> tricky problems. We do need a better public definition of the
> encoding, test cases, and a library of implementations in various
> languages. At this point the Python code that Kurt posted is probably
> the best starting point for implementors.
>
> nick
>
> Shug Boabby wrote:
>> Thanks Chris... I think I'd already worked all that out, but I was
>> just wondering if anybody had actually written a Java encoder/decoder
>> between UTF-8/MW Hex. I realise it should be simple to convert the
>> $000 syntax, but it is troublesome to have to write this code myself.
>> I really wish you'd decided to just use the URL encoding scheme, as
>> that would require no additional work on our side of things (despite
>> it looking ugly). The $ syntax is just not standard enough (although,
>> admittedly, prettier).
>>
>> 2008/7/30 Christopher R. Maden <crism at metaweb.com>:
>>> I am going to be somewhat overly detailed in this reply so that it
>>> will be archived for anyone else who is wondering.
>>>
>>> Wikipedia article names can include Unicode characters:
>>>
>>> Gabriel García Márquez
>>>
>>> They do not include underscores, HTML entities, URL encodings, or
>>> anything else.
>>>
>>> To refer to a Wikipedia article in a URL, one must turn spaces into
>>> underscores and URL-encode the result. The URL-encoding step applies
>>> to any URL on any Web site, not just Wikipedia. The canonical URL for
>>> the above-mentioned article is:
>>>
>>> http://en.wikipedia.org/wiki/Gabriel_Garc%C3%ADa_M%C3%A1rquez
>>>
>>> The acute-accented i is Unicode character U+00ED. Some broken
>>> systems will encode the character as %ED; this is wrong, though some
>>> Web servers will accept it. The correct URL encoding is to turn the
>>> character into a UTF-8 byte sequence (whose details I am not going
>>> to go into here). The UTF-8 byte sequence for í is C3 AD, so the URL
>>> encoding is %C3%AD.
>>>
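For illustration, the same steps can be sketched with Python 3's standard urllib.parse.quote, which encodes to UTF-8 by default. The helper name wikipedia_path is hypothetical, and this is not the code Kurt posted.

```python
from urllib.parse import quote


def wikipedia_path(title):
    """Turn spaces into underscores, then percent-encode the UTF-8 bytes."""
    return quote(title.replace(" ", "_"), safe="")


title = "Gabriel Garc\u00eda M\u00e1rquez"
print(wikipedia_path(title))  # Gabriel_Garc%C3%ADa_M%C3%A1rquez

# The broken variant: encoding the character as a single Latin-1 byte.
print(quote("\u00ed", encoding="latin-1"))  # %ED
```
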
>>> Unfortunately, many standard URL escaping libraries do not correctly
>>> handle characters with Unicode codepoints above 127 (U+007F), which
>>> is why Kurt and I wrote the code that he posted.
>>>
>>> The byte-wise URL encoding is horribly annoying, which is why
>>> Freebase uses a simpler escape mechanism. Every character in a key
>>> is either represented by itself, or by a dollar sign and four hex
>>> digits. The four hex digits are the Unicode codepoint for that
>>> character; í becomes $00ED. Spaces are turned into $0020, but are
>>> rarely used. To reduce annoyance when dealing with Wikipedia names,
>>> and to make our URLs look prettier, we copied the convention of
>>> turning spaces into _ before key-encoding them.
>>>
>>> The key corresponding to the canonical name for the article about
>>> Gabriel García Márquez is thus Gabriel_Garc$00EDa_M$00E1rquez.
>>>
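A rough sketch of the $hhhh scheme as described above, in Python 3 (where str is already Unicode). It is not the code Kurt posted. The set of characters left unescaped (ASCII letters, digits, "_" and "-") is an assumption, since the thread does not list them, and the names key_encode and key_decode are hypothetical.

```python
import re
import string

# Assumption: the thread does not say which characters stay unescaped.
SAFE_CHARS = set(string.ascii_letters + string.digits + "_-")


def key_encode(text):
    """Turn spaces into "_", then escape every other unsafe character as $hhhh."""
    out = []
    for ch in text.replace(" ", "_"):
        if ch in SAFE_CHARS:
            out.append(ch)
        elif ord(ch) <= 0xFFFF:
            out.append("$%04X" % ord(ch))
        else:
            # Four hex digits cannot hold code points above U+FFFF.
            raise ValueError("code point too large for $hhhh: %r" % ch)
    return "".join(out)


def key_decode(key):
    """Replace each $hhhh escape with the character it names."""
    return re.sub(r"\$([0-9A-Fa-f]{4})",
                  lambda m: chr(int(m.group(1), 16)), key)


key = key_encode("Gabriel Garc\u00eda M\u00e1rquez")
print(key)              # Gabriel_Garc$00EDa_M$00E1rquez
print(key_decode(key))  # Gabriel_García_Márquez
```

The decoder leaves underscores alone, because a key alone does not say whether an underscore was originally a space.
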
>>> When working between these systems in Python, it is important to use
>>> Unicode strings, not normal (byte) strings, at all times. Similarly,
>>> in Java, remember that all strings are UTF-16 (sequences of 16-bit
>>> code units). Encoding or decoding Freebase $hhhh syntax should be
>>> straightforward in either case.
>>>
>>> HTH,
>>> Chris
>>> --
>>> Christopher R. Maden
>>> Data Architect
>>> Freebase.com: <URL: http://www.freebase.com/ >
>>> Metaweb Technologies, Inc. <URL: http://www.metaweb.com/ >
>>> _______________________________________________
>>> Developers mailing list
>>> Developers at freebase.com
>>> http://lists.freebase.com/mailman/listinfo/developers