[Developers] esoteric unicode issue with the blurb API

Alec Flett alecf at metaweb.com
Wed Mar 11 00:13:14 UTC 2009


Hey folks -
this is an extremely esoteric question for anyone using the blurb  
service from outside of a browser or ACRE. If you don't care that much  
about unicode, or always use a browser or ACRE, this shouldn't affect  
you at all.

We're trying to address an ambiguity in our blurb API, and are hoping
to get input from anyone who uses the "blurb" service.

In particular, there is a parameter called "maxlength", which is  
defined in the documentation as:
maxlength: maximum number of characters in the blurb, default is 200

But this does not say what kind of characters. In particular, are
these Unicode characters or 8-bit bytes? This service typically
returns UTF-8, which is a variable-length character encoding in which
non-ASCII characters are represented with 2 or more bytes.
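
To make the distinction concrete, here is a small Python sketch
comparing the two counts. The sample string is made up for
illustration and is not an actual blurb response.

```python
# Hypothetical sample text ("Café Zoë"), written with escapes so the
# code points are unambiguous; not taken from the blurb service.
text = "Caf\u00e9 Zo\u00eb"

char_count = len(text)                  # number of Unicode code points
byte_count = len(text.encode("utf-8"))  # number of bytes once encoded as UTF-8

print(char_count, byte_count)  # 8 10
```

The two accented letters each take 2 bytes in UTF-8, so the byte
count is larger than the character count.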

Up until recently, the implementation was "maximum number of 8-bit
encoded characters" (that is, bytes), but due to some internal code
refactoring, the meaning became "maximum number of unicode characters".

In particular, what this means is that when you ask for a maxlength
of, say, 200, and there are non-ASCII characters in the original
article/blurb, they will be encoded as multi-byte sequences and you
may get, say, 212 bytes back. Those 212 bytes decode to exactly 200
Unicode characters, but the response size in bytes is still greater
than the maxlength that you passed in.
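
For anyone who wants to check how their client would cope, the two
interpretations of maxlength can be sketched in Python as below. The
function names and the sample article are hypothetical, and this is
not the service's actual implementation, only an illustration of the
200-character / 212-byte case described above.

```python
def truncate_chars(text, maxlength=200):
    """Keep at most `maxlength` Unicode characters (the newer meaning)."""
    return text[:maxlength]


def truncate_bytes(text, maxlength=200):
    """Keep at most `maxlength` UTF-8 bytes (the older meaning)."""
    encoded = text.encode("utf-8")[:maxlength]
    # errors="ignore" drops a trailing partial multi-byte sequence
    # that the byte slice may have left behind.
    return encoded.decode("utf-8", errors="ignore")


# Hypothetical article: 12 two-byte characters followed by plain ASCII.
article = "\u00e9" * 12 + "a" * 300

by_chars = truncate_chars(article)
by_bytes = truncate_bytes(article)

print(len(by_chars), len(by_chars.encode("utf-8")))  # 200 212
print(len(by_bytes), len(by_bytes.encode("utf-8")))  # 188 200
```

A client that sizes a fixed byte buffer from maxlength would be
affected by the character-based meaning; a client that only cares
about displayed length would not.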

This seems like the right behavior to me, but I wanted to run it by  
anyone on the list to see if this behavior would mess anyone up.


Alec