[Developers] mql_escape and UTF-8

Warren Harris warren at metaweb.com
Wed Jul 30 18:07:34 UTC 2008


I just ran this by Chris Maden, but perhaps you want to weigh in: http://jira.metaweb.com/browse/ME-986

Warren

On Jul 30, 2008, at 10:26 AM, Nick Thompson wrote:

> The best way to think of the $xxxx encoding is as "MQL key encoding".
> These encoded keys are only used for /type/key/value properties I
> believe.
>
> The original reason for MQL key encoding was to allow slash-separated
> MQL ids.  It is also used to allow "." in the sort syntax and to
> allow the use of comparison suffixes like "<" and ">" without
> ambiguity.  So we needed some syntax to escape these characters so that
> arbitrary text could be stored in /type/key.
>
> I will add that MQL key escaping should really be thought of as an
> aspect of the string encoding of /type/key, not as inherent in
> /type/key itself.  Because MQL key encoding and decoding are one-to-one
> mappings, it should be possible to provide unencoded access to MQL
> keys.  I think there is some low-level support for this in MQL
> enumerations, but it's not exposed through /type/key as far as I know.
>
> So why not URL encoding?
>
> Unfortunately URL encoding is the worst quoting syntax in common use.
> Decoding is straightforward, but implementations differ about which
> characters need to be encoded, and the rules are different in different
> parts of the URL.  We didn't want to define a new encoding, but URL
> encoding seemed like a very risky choice.  You would be able to use a
> stock URL decoder, but your stock URL encoder might produce confusing
> problems with some characters.
>
> Furthermore there are cases where you have to layer URL encoding on top
> of MQL key encoding.  Double URL encoding gets really really ugly.
> This is the reason for the choice of '$' as an escape character -
> $ does not require escaping in URLs.  Since the key encoding is
> stricter than URL encoding about which characters are escaped,
> MQL ids should all be valid in URLs, and URL-decoding of an MQL id
> should not change the MQL id at all.
>
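To illustrate the point about '$' and URL encoding, here is a minimal Python sketch using the standard library's urllib.parse. The sample key is the one from Chris's message further down the thread; the variable names are only for illustration. It shows that URL-decoding leaves an encoded key alone, while a stock encoder may or may not escape the '$', depending on its settings.

```python
from urllib.parse import quote, unquote

# Sample key taken from the thread; any $hhhh-encoded key would do.
key = "Gabriel_Garc$00EDa_M$00E1rquez"

# URL-decoding only rewrites %XX sequences, and the key contains none.
decoded = unquote(key)

# Python's quote() escapes "$" by default (its default safe set is "/").
default_quoted = quote(key)

# Declaring "$" safe makes the encoder leave the key unchanged.
dollar_safe_quoted = quote(key, safe="$")

print(decoded)             # Gabriel_Garc$00EDa_M$00E1rquez
print(default_quoted)      # Gabriel_Garc%2400EDa_M%2400E1rquez
print(dollar_safe_quoted)  # Gabriel_Garc$00EDa_M$00E1rquez
```

The default_quoted result is a small instance of the "your stock URL encoder might produce confusing problems" concern: the key survives decoding, but an encoder has to be told that '$' is safe.
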
> So yes, it's a pain, but it is a pretty reasonable solution to some
> tricky problems.  We do need a better public definition of the encoding,
> test cases, and a library of implementations in various languages - at
> this point the Python code that Kurt posted is probably the best
> starting point for implementors.
>
>     nick
>
> Shug Boabby wrote:
>> Thanks Chris... I think I'd already worked all that out, but I was
>> just wondering if anybody had actually written a Java encoder/decoder
>> between UTF-8/MW Hex. I realise it should be simple to convert the
>> $000 syntax, but it is troublesome to have to write this code myself.
>> I really wish you'd decided to just use the URL encoding scheme as
>> that would require no additional work on our side of things (despite
>> it looking ugly). The MW Hex scheme is just not standard enough
>> (although, admittedly, prettier).
>>
>> 2008/7/30 Christopher R. Maden <crism at metaweb.com>:
>>> I am going to be somewhat overly detailed in this reply so that it will
>>> be archived for anyone else who is wondering.
>>>
>>> Wikipedia article names can include Unicode characters:
>>>
>>> Gabriel García Márquez
>>>
>>> They do not include underscores, HTML entities, URL encodings, or
>>> anything else.
>>>
>>> To refer to a Wikipedia article in a URL, one must turn spaces into
>>> underscores and URL-encode the result.  The URL-encoding step applies
>>> to any URL on any Web site, not just Wikipedia.  The canonical URL for
>>> the above-mentioned article is:
>>>
>>> http://en.wikipedia.org/wiki/Gabriel_Garc%C3%ADa_M%C3%A1rquez
>>>
>>> The acute accented i is Unicode character U+00ED.  Some broken systems
>>> will encode the character as %ED; this is wrong, though some Web servers
>>> will accept it.  The correct URL encoding is to turn the character into
>>> a UTF-8 byte sequence (whose details I am not going to go into here).
>>> The UTF-8 byte sequence for í is C3 AD, so the URL encoding is %C3%AD.
>>>
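As a rough illustration of the difference between %C3%AD and %ED, the following Python sketch uses urllib.parse.quote from the standard library. The variable names are hypothetical; the Latin-1 call is included only to reproduce the broken behaviour described above.

```python
from urllib.parse import quote

title = "Gabriel García Márquez"

# Wikipedia convention: spaces become underscores before URL-encoding.
path = title.replace(" ", "_")

# Correct: percent-encode the UTF-8 bytes of each non-ASCII character.
utf8_encoded = quote(path, encoding="utf-8")

# Broken variant: percent-encode the single Latin-1 byte instead.
latin1_encoded = quote(path, encoding="latin-1")

print(utf8_encoded)    # Gabriel_Garc%C3%ADa_M%C3%A1rquez
print(latin1_encoded)  # Gabriel_Garc%EDa_M%E1rquez
```
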
>>> Unfortunately, many standard URL escaping libraries do not correctly
>>> handle characters with Unicode codepoints above 127 (U+007F), which is
>>> why Kurt and I wrote the code that he posted.
>>>
>>> The byte-wise URL encoding is horribly annoying, which is why Freebase
>>> uses a simpler escape mechanism.  Every character in a key is either
>>> represented by itself, or by a dollar sign and four hex digits.  The
>>> four hex digits are the Unicode codepoint for that character; í becomes
>>> $00ED.  Spaces are turned into $0020, but are rarely used.  To reduce
>>> annoyance when dealing with Wikipedia names, and to make our URLs look
>>> prettier, we copied the convention of turning spaces into _ before
>>> key-encoding them.
>>>
>>> The key corresponding to the canonical name for the article about
>>> Gabriel García Márquez is thus Gabriel_Garc$00EDa_M$00E1rquez.
>>>
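The encoding steps above can be sketched in Python roughly as follows. This is not the code Kurt posted: key_encode and SAFE_CHARS are hypothetical names, and the set of characters that are stored as themselves is an assumption (ASCII letters, digits, "_" and "-"), since the thread does not spell out the exact rule.

```python
import string

# Assumption: only these characters are represented by themselves.
# The real rule may differ; the thread does not define it.
SAFE_CHARS = set(string.ascii_letters + string.digits + "_-")


def key_encode(name):
    """Hypothetical helper: turn a name into a $hhhh-escaped key."""
    out = []
    # Follow the Wikipedia convention first: spaces become underscores.
    for ch in name.replace(" ", "_"):
        if ch in SAFE_CHARS:
            out.append(ch)
        elif ord(ch) <= 0xFFFF:
            # Dollar sign followed by the code point as four hex digits.
            out.append("$%04X" % ord(ch))
        else:
            # Four hex digits cannot hold code points above U+FFFF.
            raise ValueError("code point above U+FFFF: %r" % ch)
    return "".join(out)


print(key_encode("Gabriel García Márquez"))  # Gabriel_Garc$00EDa_M$00E1rquez
```

Because "$" is not in the assumed safe set, a literal dollar sign is itself escaped (as $0024), which keeps the mapping unambiguous.
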
>>> When working between these systems in Python, it is important to use
>>> Unicode strings, not normal strings, at all times.  Similarly, in Java,
>>> remember that all strings are UTF-16 (sequences of 16-bit code units).
>>> Encoding or decoding Freebase $hhhh syntax should be straightforward in
>>> either case.
>>>
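For the decoding direction, a minimal Python sketch might look like this. key_decode is a hypothetical name, and the sketch assumes every escape is exactly a dollar sign followed by four hex digits, as described above. It leaves underscores alone, since whether to turn them back into spaces depends on the caller.

```python
import re

# A "$" followed by exactly four hex digits, per the description above.
_ESCAPE = re.compile(r"\$([0-9A-Fa-f]{4})")


def key_decode(key):
    """Hypothetical helper: replace each $hhhh escape with its character."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), key)


print(key_decode("Gabriel_Garc$00EDa_M$00E1rquez"))  # Gabriel_García_Márquez
```
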
>>> HTH,
>>> Chris
>>> --
>>> Christopher R. Maden
>>> Data Architect
>>> Freebase.com: <URL: http://www.freebase.com/ >
>>> Metaweb Technologies, Inc. <URL: http://www.metaweb.com/ >
>>> _______________________________________________
>>> Developers mailing list
>>> Developers at freebase.com
>>> http://lists.freebase.com/mailman/listinfo/developers
>>>
