[Developers] mql_escape and UTF-8

Shug Boabby shug.boabby at gmail.com
Wed Jul 30 15:08:12 UTC 2008


Hi all,

I posted a few times last week about the Wikipedia ID and the Freebase
name. I'm now running into the problem that the /wikipedia/en key and
the actual Wikipedia name use completely different encoding schemes.

The (actual) Wikipedia name is a URL-encoded string that may
optionally use underscores instead of spaces.

The /wikipedia/en key uses MW Hex encoding and uses underscores
instead of spaces.

I am coding in Java. Using the URLEncoder/URLDecoder classes (and a
regex to deal with spaces), I can comfortably convert Wikipedia names
to and from UTF-8 and a URL-safe form.
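
For reference, a minimal sketch of that round trip. The class and method
names (WikipediaNames, toTitle, toWikipediaName) are made up for this
example, and it assumes underscores and spaces are interchangeable in a
title, as they are on Wikipedia. It uses plain string replacement rather
than a regex for the spaces.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper, not part of any Freebase or Wikipedia library.
final class WikipediaNames {

    /** URL-encoded Wikipedia name -> readable title (a Java String). */
    static String toTitle(String wikipediaName) {
        try {
            // URLDecoder turns '+' into a space (form encoding), so
            // protect literal plus signs before decoding.
            String safe = wikipediaName.replace("+", "%2B");
            return URLDecoder.decode(safe, "UTF-8").replace('_', ' ');
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    /** Readable title -> URL-encoded Wikipedia name with underscores. */
    static String toWikipediaName(String title) {
        try {
            // Swap spaces for underscores first; URLEncoder would
            // otherwise emit '+' for each space.
            return URLEncoder.encode(title.replace(' ', '_'), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The decoded title is an ordinary Java String; calling getBytes("UTF-8")
on it gives the UTF-8 bytes if those are needed.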

However, I am unable to convert the MW Hex keys to/from UTF-8 because
I cannot find any existing code to do the conversion (I am also unsure
what this apparently custom encoding is called). Does anybody know of
any existing Java code to encode/decode MW Hex to/from UTF-8?
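
In case it helps frame the question, here is a rough sketch of what such
a converter might look like. It rests on an assumption that is not
confirmed anywhere in this message: that each escaped character in a key
is written as '$' followed by four hex digits giving a UTF-16 code unit
(for example "$0028" for "("). The set of characters left unescaped
(ASCII letters, digits, '_' and '-') is also a guess, and the names
MwHexKeys, decodeKey and encodeKey are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the "$XXXX" key format is an assumption.
final class MwHexKeys {

    // Assumed escape shape: '$' plus exactly four hex digits.
    private static final Pattern ESCAPE =
            Pattern.compile("\\$([0-9A-Fa-f]{4})");

    /** Key -> readable title, expanding escapes and underscores. */
    static String decodeKey(String key) {
        Matcher m = ESCAPE.matcher(key);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(key, last, m.start());
            // Each escape is taken as one UTF-16 code unit.
            out.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        out.append(key, last, key.length());
        return out.toString().replace('_', ' ');
    }

    /** Readable title -> key, escaping everything outside the safe set. */
    static String encodeKey(String title) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < title.length(); i++) {
            char c = title.charAt(i);
            if (c == ' ') {
                out.append('_');
            } else if ((c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9') || c == '_' || c == '-') {
                out.append(c); // assumed safe set, not a confirmed rule
            } else {
                out.append(String.format("$%04X", (int) c));
            }
        }
        return out.toString();
    }
}
```

Under those assumptions, encodeKey("Blade Runner (film)") would give
"Blade_Runner_$0028film$0029", and decodeKey reverses it. As with any
Java String, getBytes("UTF-8") on the decoded title yields UTF-8 bytes.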

Also, I believe this subtle point should be documented in more detail
alongside any examples making use of /wikipedia/en because it means
that /wikipedia/en is definitely *not* the Wikipedia name.

