[Developers] $0027 encodings in wikipedia topic names
Drew Perttula
drewp at bigasterisk.com
Wed Feb 11 09:42:51 UTC 2009
curl http://rdf.freebase.com/rdf/en.barack_obama | grep value.value
<http://rdf.freebase.com/ns/type.value.value> "Barry_O$0027Bama"],
<http://rdf.freebase.com/ns/type.value.value> "Pres$002E_Obama"],
...
What is that $0027 encoding? I've never seen that style before. I also
think the underscores might be spaces, but I'm not sure if that's always
true. (I.e. maybe sometimes they are real underscores, and you can't
tell which case is which.)
In that same file, the objects of
<http://rdf.freebase.com/ns/type.object.name> are normal-looking Unicode
strings without any $ escapes.
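For the record, I'm currently using this Python to undo the escaping:
import re
s = s.replace("_", " ")
# the four characters after "$" are hex digits (e.g. $002E), so \d alone is not enough
s = re.sub(r'\$([0-9A-Fa-f]{4})', lambda g: unichr(int(g.group(1), 16)), s)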
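Here is the same idea as a small self-contained function, checked against the two values quoted above. This is only a sketch: the name unescape_key is made up for illustration, it is written for Python 3 (chr instead of unichr), and it assumes every escape is "$" followed by exactly four hex digits and that every literal underscore stands for a space. Neither assumption is confirmed by anything in the RDF output.

```python
import re

# Assumption: an escape is "$" followed by exactly four hex digits.
_ESCAPE = re.compile(r'\$([0-9A-Fa-f]{4})')

def unescape_key(s):
    """Hypothetical helper: undo $XXXX escapes, treat "_" as a space."""
    # Replace underscores before decoding, so that an escaped
    # underscore ($005F) would come out as a real underscore.
    s = s.replace("_", " ")
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), s)

print(unescape_key("Barry_O$0027Bama"))  # Barry O'Bama
print(unescape_key("Pres$002E_Obama"))   # Pres. Obama
```

Doing the underscore replacement first matters only if real underscores are themselves escaped as $005F, which I have not verified.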