[Developers] /wikipedia/en_id not giving results for redirected ids

Shug Boabby shug.boabby at gmail.com
Sat Jul 26 19:02:35 UTC 2008


Hold on... this is the wrong way round, right? I want to get the
Wikipedia names (e.g. "Mr_Spock") from the WEX dumps, not the Freebase
keys. The names in the WEX dumps are definitely *not* the Wikipedia
names, because they have spaces instead of underscores. There may be
other differences, but they are not documented.
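
To make the direction concrete, here is a minimal sketch of what I am
after. It assumes the only differences are whitespace versus underscores
and the case of the first letter, which is exactly the part that is not
documented, so treat it as a guess. wex_name_to_wiki_name() is a
hypothetical helper name, not something from WEX or Freebase.

```python
import re

def wex_name_to_wiki_name(name):
    """Hypothetical helper: guess the Wikipedia-style name for a WEX article name.

    Assumption (unverified): the two differ only in spaces vs. underscores
    and in the case of the first character.
    """
    # Collapse runs of whitespace/underscores and trim both ends.
    name = re.sub(r'[_\s]+', ' ', name).strip()
    if not name:
        return name
    # Wikipedia-style names use underscores and an upper-case first letter.
    name = name.replace(' ', '_')
    return name[0].upper() + name[1:]

print(wex_name_to_wiki_name("Mr Spock"))
```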

2008/7/24 Kurt Bollacker <kurt at metaweb.com>:
>
> On Thu, Jul 24, 2008 at 10:35:10PM +0100, Shug Boabby wrote:
>> and I would still like somebody to clarify for
>> me how to get the actual "wikipedia/en" name, given the article.name
>> from the WEX dumps (spaces to underscores, but what else?).
>
> Here are some python functions we often use to convert WP names to
> Freebase keys.  mql_escape() may be all you need, but cleanwikiword()
> is helpful for some messy WP names.  You may also use utf8unescape()
> when you need to handle a WP name you got from a HTTP GET.
>
> Keep in mind that not all WP names will resolve to Freebase IDs, for
> reasons such as they are not topics (e.g. disambiguation articles) or
> are brand new names that Freebase hasn't synced to yet.
>
> Let me know if you have questions.                              Kurt :-)
>
>
>
> ######################################################################
> import codecs,struct,re
>
> # Let's deal with URL escaping of UTF
> utf8decode=codecs.getdecoder('utf-8')
> isodecode=codecs.getdecoder('iso8859_1')
>
> def sub3(mo):
>    return(utf8decode(struct.pack("BBB",int(mo.group(1),16),int(mo.group(2),16),int(mo.group(3),16)))[0])
>
> def sub2(mo):
>    return(utf8decode(struct.pack("BB",int(mo.group(1),16),int(mo.group(2),16)))[0])
>
> def sub1(mo):
>    try:
>        return(utf8decode(struct.pack("B",int(mo.group(1),16)))[0])
>    except UnicodeDecodeError:
>        return(isodecode(struct.pack("B",int(mo.group(1),16)))[0])
>
> def utf8unescape(s):
>    ''' Converts UTF-8 strings that have been URL escaped
>        back into UTF-8. '''
>    # Inline flags go at the start of each pattern; newer Python
>    # versions reject them anywhere else.
>    # Get 3-byte UTF-8 sequences 1110xxxx 10yyyyyy 10zzzzzz
>    s=re.sub('(?im)%(e[0-9a-f])%([89ab][0-9a-f])%([89ab][0-9a-f])',sub3,s)
>    # Get 2-byte UTF-8 sequences 110xxxxx 10yyyyyy
>    s=re.sub('(?im)%([cd][0-9a-f])%([89ab][0-9a-f])',sub2,s)
>    # Get 1-byte UTF-8 sequences 0xxxxxxx
>    s=re.sub('(?im)%([0-7][0-9a-f])',sub1,s)
>    # Nuke any illegal characters.
>    s=re.sub(u'[\ud800-\udfff\ufdd0-\ufdef\ufffe\uffff]','',s)
>    s=re.sub('[\x00-\x08\x0b\x0c\x0e-\x1f]','',s)
>    return(s)
>
> # Clean up whitespace
> def cleanwikiword(s):
>    ''' Clean up the spacing chars of a Wikipedia name '''
>    s=utf8unescape(s)
>    s=re.sub('^[_ \t\r\n]+','',s)
>    s=re.sub('[_ \t\r\n]+$','',s)
>    s=re.sub('[_ \t\r\n]+','_',s)
>    # Capitalize the first character; skip empty names to avoid an IndexError.
>    if s:
>        s=s[0].upper()+s[1:]
>    return(s)
>
> # Do the MW hex encoding
> def dollarhex(mo):
>    return(("$%04x" % ord(mo.group(1))).upper())
>
> def mql_escape(s):
>    ''' Convert a string into a valid Freebase key value. '''
>    s=re.sub('([^-A-Za-z0-9_])',dollarhex,s)
>    s=re.sub('(^[-_])',dollarhex,s)
>    s=re.sub('([-_])$',dollarhex,s)
>    return(s)
>
> # This function shows how the above can be used together.
> def wikiurltomwkey(s):
>    return(mql_escape(cleanwikiword(utf8unescape(s))))
>
> # Do a test
> if __name__=='__main__':
>    s0='----%e9%80%81 %d4%90 %45 %1f  -_-_   blah____'
>    s1=utf8unescape(s0)
>    s2=cleanwikiword(s1)
>    s3=mql_escape(s2)
>    print(':'+repr(s0)+':\n:'+repr(s1)+':\n:'+repr(s2)+':\n:'+repr(s3)+':\n')
>
> _______________________________________________
> Developers mailing list
> Developers at freebase.com
> http://lists.freebase.com/mailman/listinfo/developers
>
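
Since what I need is the opposite direction, here is a rough sketch of
undoing the $XXXX encoding that the quoted mql_escape() produces. The
mql_escape() below is my Python 3 port of the quoted function, and
mql_unescape() is a hypothetical name of my own, not a Freebase
function. It assumes every escape is exactly four hex digits, so
characters outside the Basic Multilingual Plane are not handled.

```python
import re

def mql_escape(s):
    """Python 3 port of the quoted mql_escape(): string -> Freebase key."""
    def dollarhex(mo):
        # Same output as the quoted ("$%04x" % ord(...)).upper()
        return "$%04X" % ord(mo.group(1))
    s = re.sub(r'([^-A-Za-z0-9_])', dollarhex, s)  # anything outside the safe set
    s = re.sub(r'(^[-_])', dollarhex, s)           # leading '-' or '_'
    s = re.sub(r'([-_])$', dollarhex, s)           # trailing '-' or '_'
    return s

def mql_unescape(key):
    """Hypothetical inverse: turn each $XXXX group back into its character.

    Assumption: escapes are always exactly four hex digits.
    """
    return re.sub(r'\$([0-9A-Fa-f]{4})',
                  lambda mo: chr(int(mo.group(1), 16)),
                  key)

print(mql_unescape(mql_escape("Mr._Spock")))
```

This only recovers the underscore form that went into mql_escape(); it
says nothing about whether that form matches the WEX article.name.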

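For names taken from a URL, the standard library in Python 3 can do the
percent-decoding step on its own. This is a swapped-in alternative to
the quoted utf8unescape(), not a port of it: invalid byte sequences come
back as U+FFFD here rather than being left escaped, and control
characters are not stripped. url_unescape() is a hypothetical wrapper
name.

```python
from urllib.parse import unquote

def url_unescape(s):
    """Hypothetical wrapper: percent-decode a URL-escaped name as UTF-8.

    Invalid UTF-8 is replaced with U+FFFD instead of raising an error.
    """
    return unquote(s, encoding='utf-8', errors='replace')

print(url_unescape("Mr_Spock"))
```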