[Developers] mql_escape and UTF-8

Shug Boabby shug.boabby at gmail.com
Thu Jul 31 09:41:49 UTC 2008


Thanks for all the explanations... what would really help is if you
could produce some code to do the escaping/unescaping in some of the
more popular languages. Python code exists, but JavaScript (JSONP) and
Java (server side) are my two use cases.

2008/7/30 brendan <brendan at metaweb.com>:
> fyi, our bugbase is only available in our internal network, so most of you
> won't be able to see that.
> brendan
> On Jul 30, 2008, at 11:07 AM, Warren Harris wrote:
>
> I just ran this by Chris Maden, but perhaps you want to weigh
> in: http://jira.metaweb.com/browse/ME-986
> Warren
> On Jul 30, 2008, at 10:26 AM, Nick Thompson wrote:
>
> The best way to think of the $xxxx encoding is as "MQL key encoding".
> These encoded keys are only used for /type/key/value properties I
> believe.
>
> The original reason for MQL key encoding was to allow slash-separated
> MQL ids.  It is also used to allow "." in the sort syntax and to
> allow the use of comparison suffixes like "<" and ">" without
> ambiguity.  So we needed some syntax to escape these characters so that
> arbitrary text could be stored in /type/key.
>
> I will add that MQL key escaping should really be thought of as an
> aspect of the string encoding of /type/key, not as inherent in /type/key
> itself.  Because MQL key encoding and decoding are one-to-one mappings,
> it should be possible to provide unencoded access to MQL keys - I think
> there is some low-level support for this in MQL enumerations but it's
> not exposed through /type/key as far as I know.
>
> So why not URL encoding?
>
> Unfortunately URL encoding is the worst quoting syntax in common use.
> Decoding is straightforward, but implementations differ about which
> characters need to be encoded, and the rules are different in different
> parts of the URL.  We didn't want to define a new encoding, but URL
> encoding seemed like a very risky choice.  You would be able to use a
> stock URL decoder, but your stock URL encoder might produce confusing
> problems with some characters.
>
> Furthermore there are cases where you have to layer URL encoding on top
> of MQL key encoding.  Double URL encoding gets really really ugly.
> This is the reason for the choice of '$' as an escape character -
> $ does not require escaping in URLs.  Since the key encoding is
> stricter than URL encoding about which characters are escaped,
> MQL ids should all be valid in URLs, and URL-decoding of an MQL id
> should not change the MQL id at all.
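
For anyone who wants to check the two points above, here is a small Python 3 sketch using only the standard library. The key is the García Márquez example from further down this thread; the function name is made up for illustration and is not Metaweb code.

```python
from urllib.parse import quote, unquote

# Example key taken from Chris's message in this thread.
key = "Gabriel_Garc$00EDa_M$00E1rquez"


def survives_url_decoding(mql_key):
    """True if URL-decoding leaves the key untouched (it contains no '%')."""
    return unquote(mql_key) == mql_key


print(survives_url_decoding(key))

# Python's quote() escapes "$" by default, so it has to be listed as safe
# for the key to pass through a URL encoder unchanged.
print(quote(key, safe="$") == key)

# Layering URL encoding on top of URL encoding: every "%" becomes "%25".
once = quote("\u00ed")   # "%C3%AD"
twice = quote(once)      # "%25C3%25AD"
print(once, twice)
```

This only shows that a stock URL decoder leaves a key alone. A stock URL encoder may still need to be told that "$" is safe, as the second check shows.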
>
> So yes, it's a pain, but it is a pretty reasonable solution to some
> tricky problems.  We do need a better public definition of the encoding,
> test cases, and a library of implementations in various languages - at
> this point the Python code that Kurt posted is probably the best
> starting point for implementors.
>
>     nick
>
> Shug Boabby wrote:
>
> Thanks Chris... I think I'd already worked all that out, but I was
> just wondering if anybody had actually written a Java encoder/decoder
> between UTF-8/MW Hex. I realise it should be simple to convert the
> $hhhh syntax, but it is troublesome to have to write this code myself.
> I really wish you'd decided to just use the URL encoding scheme as
> that would require no additional work on our side of things (despite
> it looking ugly). The $hhhh scheme is just not standard enough
> (although, admittedly, prettier).
>
> 2008/7/30 Christopher R. Maden <crism at metaweb.com>:
>
> I am going to be somewhat overly detailed in this reply so that it will
> be archived for anyone else who is wondering.
>
> Wikipedia article names can include Unicode characters:
>
> Gabriel García Márquez
>
> They do not include underscores, HTML entities, URL encodings, or
> anything else.
>
> To refer to a Wikipedia article in a URL, one must turn spaces into
> underscores and URL-encode the result.  The URL-encoding step applies
> to any URL on any Web site, not just Wikipedia.  The canonical URL for
> the above-mentioned article is:
>
> http://en.wikipedia.org/wiki/Gabriel_Garc%C3%ADa_M%C3%A1rquez
>
> The acute accented i is Unicode character U+00ED.  Some broken systems
> will encode the character as %ED; this is wrong, though some Web servers
> will accept it.  The correct URL encoding is to turn the character into
> a UTF-8 byte sequence (whose details I am not going to go into here).
> The UTF-8 byte sequence for í is C3 AD, so the URL encoding is %C3%AD.
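
The URL-encoding steps described above can be sketched in Python 3 with the standard library. This is only an illustration, not the code Kurt posted; the helper name wikipedia_url is hypothetical.

```python
from urllib.parse import quote

WIKI_PREFIX = "http://en.wikipedia.org/wiki/"


def wikipedia_url(title):
    """Turn spaces into underscores, then percent-encode the UTF-8 bytes."""
    # quote() encodes str input as UTF-8 by default; stated explicitly here.
    return WIKI_PREFIX + quote(title.replace(" ", "_"), safe="", encoding="utf-8")


title = "Gabriel Garc\u00eda M\u00e1rquez"
print(wikipedia_url(title))

# The UTF-8 byte sequence for U+00ED is C3 AD, hence %C3%AD.
print("\u00ed".encode("utf-8").hex())

# The broken variant: encoding the codepoint as a single Latin-1 byte.
print(quote("\u00ed", encoding="latin-1"))
```

The last line reproduces the %ED form that the message calls wrong. It is shown only for comparison.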
>
> Unfortunately, many standard URL escaping libraries do not correctly
> handle characters with Unicode codepoints above 127 (U+007F), which is
> why Kurt and I wrote the code that he posted.
>
> The byte-wise URL encoding is horribly annoying, which is why Freebase
> uses a simpler escape mechanism.  Every character in a key is either
> represented by itself, or by a dollar sign and four hex digits.  The
> four hex digits are the Unicode codepoint for that character; í becomes
> $00ED.  Spaces are turned into $0020, but are rarely used.  To reduce
> annoyance when dealing with Wikipedia names, and to make our URLs look
> prettier, we copied the convention of turning spaces into _ before
> key-encoding them.
>
> The key corresponding to the canonical name for the article about
> Gabriel García Márquez is thus Gabriel_Garc$00EDa_M$00E1rquez.
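
A minimal Python 3 sketch of the encoding just described follows. It rests on two assumptions that this thread does not spell out: that only ASCII letters, digits, "_" and "-" may appear literally in a key, and that codepoints above U+FFFF are simply rejected because they do not fit in four hex digits. The function name is made up; treat Kurt's posted code as the reference.

```python
import re

# Assumption: everything except ASCII letters, digits, "_" and "-" is escaped.
_NEEDS_ESCAPE = re.compile(r"[^A-Za-z0-9_-]")


def _escape(match):
    codepoint = ord(match.group(0))
    if codepoint > 0xFFFF:
        # Assumption: four hex digits cannot hold this, so refuse it.
        raise ValueError("codepoint U+%X does not fit in $hhhh" % codepoint)
    return "$%04X" % codepoint


def mql_key_encode(name):
    """Spaces become underscores, then unsafe characters become $hhhh."""
    return _NEEDS_ESCAPE.sub(_escape, name.replace(" ", "_"))


print(mql_key_encode("Gabriel Garc\u00eda M\u00e1rquez"))
# Gabriel_Garc$00EDa_M$00E1rquez
```

Because spaces are folded into underscores before escaping, a name that already contains an underscore produces the same key as one with a space in that position. Skip the replace() call if you need the strict one-to-one mapping.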
>
> When working between these systems in Python, it is important to use
> Unicode strings, not normal byte strings, at all times.  Similarly, in
> Java, remember that all strings are UTF-16 (16-bit code units, with
> characters outside the Basic Multilingual Plane taking two).  Encoding
> or decoding Freebase $hhhh syntax should be straightforward in either case.
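
Decoding goes the other way. Here is a rough Python 3 sketch, assuming that a "$" followed by exactly four hex digits is always an escape and that anything else is passed through unchanged. The function name is hypothetical, and accepting lowercase hex digits is my own leniency rather than anything stated in this thread.

```python
import re

# A "$" followed by exactly four hex digits (either case is accepted here).
_ESCAPED = re.compile(r"\$([0-9A-Fa-f]{4})")


def mql_key_decode(key):
    """Replace each $hhhh escape with the character at that codepoint."""
    return _ESCAPED.sub(lambda m: chr(int(m.group(1), 16)), key)


print(mql_key_decode("Gabriel_Garc$00EDa_M$00E1rquez"))
# Gabriel_García_Márquez
```

Underscores are deliberately left alone, since the key by itself does not say whether an underscore started life as a space. Call .replace("_", " ") on the result if you want the Wikipedia-style display name.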
>
> HTH,
> Chris
>
> --
> Christopher R. Maden
> Data Architect
> Freebase.com: <URL: http://www.freebase.com/ >
> Metaweb Technologies, Inc. <URL: http://www.metaweb.com/ >
>
> _______________________________________________
> Developers mailing list
> Developers at freebase.com
> http://lists.freebase.com/mailman/listinfo/developers

