[Data-modeling] Writing UTF-8 in MQL strings
Kurt Bollacker
kurt at spaceship.com
Sat Mar 28 01:29:53 UTC 2009
On Fri, Mar 27, 2009 at 05:55:41PM -0400, Christopher R. Maden wrote:
> Kurt Bollacker wrote:
> > I've been writing language data to sandbox, and I'm trying to create
> > names for languages that use UTF-8 characters. Consider the "More"
> > language at:
>
> No, they use Unicode characters.
>
> UTF-8 is a way of representing those characters as a series of bytes.
>
> In JSON, characters should be encoded with their Unicode code points,
> not as byte-wise representations.
>
> The string you are looking for is “Mòoré” or “M\u00f2or\u00e9.” The “u”
> in “\u” stands for “Unicode.”
>
> It happens that the first 256 characters of Unicode are identical to the
> first (and total) 256 characters of ISO Latin 1, which may have confused
> you. However, the Latin 1 representations are the correct ones, as they
> are also the Unicode representations.
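To make the distinction in the quoted explanation concrete, here is a minimal Python sketch (not part of the original exchange). It assumes only the standard json module; the variable names are illustrative. It shows that the \u-escaped and the literal JSON spellings parse to the same string, that the code points match the Latin 1 byte values, and that the UTF-8 bytes are a different thing again.

```python
import json

name = "M\u00f2or\u00e9"  # the five characters M, ò, o, r, é

# JSON may spell non-ASCII characters as \uXXXX escapes or literally.
escaped_json = json.dumps(name)                      # ASCII-only JSON text
literal_json = json.dumps(name, ensure_ascii=False)  # keeps ò and é as-is

# Both spellings parse back to the same five-character string.
assert json.loads(escaped_json) == json.loads(literal_json) == name

# Code points are numbers assigned to characters; encodings turn them into bytes.
code_points = [ord(ch) for ch in name]
latin1_bytes = list(name.encode("latin-1"))  # one byte per character here
utf8_bytes = name.encode("utf-8")            # ò and é take two bytes each

print(escaped_json)
print(code_points, latin1_bytes, utf8_bytes)
```

For this name the Latin 1 byte values equal the code points, while the UTF-8 form is seven bytes long for five characters.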
I kinda got close to this. I should have looked at the HTML of the
pages causing problems. It turns out that the Freebase web UI uses
HTML numeric character references exclusively rather than including
encoded Unicode characters. However, if you look at:
http://mql.freebaseapps.com/ch02.html#typetext
in the MQL Reference Guide, one can see that
"The text of a /type/text value must be a string of Unicode
characters, encoded using the UTF-8 encoding."
Thus, MQL writes are supposed to be done with UTF-8 encoded
characters, *NOT* Unicode code points. If you do that, however, you
run into the same incorrect string display that I did. So the problem
is either:
- The documentation is wrong and MQL does not support UTF-8, but
  rather 16-bit Unicode code points directly.
OR
- The client is buggy in its display of UTF-8 characters.
Either way, I need policy certainty before I write potentially
thousands of display names that require Unicode.
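As a hedged illustration of how this kind of incorrect display can arise, here is a small Python sketch. The thread does not show the actual garbled output or how the client decodes strings, so the Latin 1 mis-decoding below is an assumption about one common failure mode, not a diagnosis of the Freebase client. All names are illustrative.

```python
import json

name = "M\u00f2or\u00e9"  # intended display name: M, ò, o, r, é

# A correct round trip: the JSON document is UTF-8 on the wire,
# and the reader decodes it as UTF-8.
body = json.dumps({"name": name}, ensure_ascii=False).encode("utf-8")
assert json.loads(body.decode("utf-8"))["name"] == name

# Assumed failure mode: the UTF-8 bytes get treated as one character each
# (for example, decoded as Latin 1), so each accented letter becomes two.
garbled = name.encode("utf-8").decode("latin-1")

# Writing the UTF-8 bytes as \u escapes, byte by byte, gives the same result.
bytewise_json = '"M\\u00c3\\u00b2or\\u00c3\\u00a9"'
assert json.loads(bytewise_json) == garbled

# If that is what happened, the damage is reversible.
repaired = garbled.encode("latin-1").decode("utf-8")

print(len(name), len(garbled), repaired == name)
```

If the stored names look like the seven-character garbled form, the bytes were probably interpreted one per character somewhere between the write and the display.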
So I suppose I should submit this as a bug. Any opinions?
Kurt :-)