[Developers] esoteric unicode issue with the blurb API
Alec Flett
alecf at metaweb.com
Wed Mar 11 00:13:14 UTC 2009
Hey folks -
This is an extremely esoteric question for anyone using the blurb
service from outside of a browser or ACRE. If you don't care that much
about unicode, or always use a browser or ACRE, this shouldn't affect
you at all.
We're trying to address an ambiguity in our blurb API, and hoping to get
input from anyone who uses the "blurb" service.
In particular, there is a parameter called "maxlength", which is
defined in the documentation as:
maxlength: maximum number of characters in the blurb, default is 200
but what this does not indicate is what kind of characters. In
particular, are these unicode characters or 8-bit bytes? This service
typically returns UTF-8, which is a variable-length character encoding
in which each non-ASCII character is represented with 2 or more
bytes.
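
To make the variable-length point concrete, here is a small Python sketch
(not part of the blurb service; the helper name utf8_len and the sample
characters are just illustrative) that counts how many bytes a single
character takes once encoded as UTF-8:

```python
def utf8_len(ch: str) -> int:
    """Return the number of bytes needed to encode ch as UTF-8."""
    return len(ch.encode("utf-8"))

# Illustrative samples: ASCII, Latin-1 range, BMP symbol, non-BMP emoji.
samples = ["a", "\u00e9", "\u20ac", "\U0001F600"]
widths = {ch: utf8_len(ch) for ch in samples}

for ch, n in widths.items():
    print(f"U+{ord(ch):04X} takes {n} byte(s) in UTF-8")
```

ASCII characters stay at one byte each, so the two ways of counting only
diverge once non-ASCII text shows up.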
Up until recently, the implementation was "maximum number of 8-bit
encoded characters" (that is, bytes), but due to some internal code
refactoring, the meaning became "maximum number of unicode characters".
In particular what this means is that when you ask for a maxlength of,
say, 200, and there are non-ASCII characters in the original article/blurb,
they will be encoded as multi-byte sequences and you may
get, say, 212 bytes back. Those 212 bytes decode
to exactly 200 unicode characters, but the response size in bytes is still
greater than the maxlength that you passed in.
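
For anyone who wants to see the difference on their own data, here is a
rough Python sketch of the two interpretations. It is not the service's
actual code: truncate_chars and truncate_bytes are hypothetical helpers,
and the sample text is made up so that the numbers match the 200 vs. 212
example above.

```python
def truncate_chars(text: str, maxlength: int) -> str:
    """Interpretation 1: maxlength counts unicode characters."""
    return text[:maxlength]

def truncate_bytes(text: str, maxlength: int) -> str:
    """Interpretation 2: maxlength counts UTF-8 bytes.

    A character cut in half at the boundary is dropped, so the result
    is always valid UTF-8 and never exceeds maxlength bytes.
    """
    clipped = text.encode("utf-8")[:maxlength]
    return clipped.decode("utf-8", errors="ignore")

# Made-up article text: 12 two-byte characters followed by plain ASCII.
article = "\u00e9" * 12 + "a" * 300

by_chars = truncate_chars(article, 200)
by_bytes = truncate_bytes(article, 200)

print(len(by_chars), len(by_chars.encode("utf-8")))  # characters, bytes
print(len(by_bytes), len(by_bytes.encode("utf-8")))  # characters, bytes
```

A client that sizes a fixed byte buffer from maxlength would be affected
by the first interpretation; a client that counts characters after
decoding would not.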
This seems like the right behavior to me, but I wanted to run it by
anyone on the list to see if this behavior would mess anyone up.
Alec