[Developers] /wikipedia/en_id not giving results for redirected ids
Shug Boabby
shug.boabby at gmail.com
Sat Jul 26 22:59:38 UTC 2008
Thanks Alex! This should all really be documented in the WEX downloads file!
So what you're saying here is that the /wikipedia/en keys are not
actually the same as the Wikipedia Names, because you use a dollar
notation to escape certain characters (punctuation as well as
non-ASCII). Do you have any "inverse" code that will convert the
/wikipedia/en keys back into Wikipedia names?
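
A minimal sketch of what such an inverse could look like, assuming the only escape form is the "$" plus four hex digits notation produced by the mql_escape function quoted below. The names mql_unescape and freebase_id_to_wikipedia_name are hypothetical, not Metaweb functions, and the sketch is written for Python 3. Code points above U+FFFF are not handled.

```python
import re

# One escape is a dollar sign followed by exactly four hex digits, e.g. "$002E".
_DOLLAR_HEX = re.compile(r"\$([0-9A-Fa-f]{4})")

def mql_unescape(key):
    """Hypothetical inverse of mql_escape: turn $XXXX escapes back into characters."""
    return _DOLLAR_HEX.sub(lambda mo: chr(int(mo.group(1), 16)), key)

def freebase_id_to_wikipedia_name(freebase_id, prefix="/wikipedia/en/"):
    """Strip the namespace prefix, then undo the dollar escaping."""
    if not freebase_id.startswith(prefix):
        raise ValueError("not a %s key: %r" % (prefix, freebase_id))
    return mql_unescape(freebase_id[len(prefix):])

name = freebase_id_to_wikipedia_name(
    "/wikipedia/en/String_Quartet_No$002E_1_$0028Beethoven$0029")
print(name)  # String_Quartet_No._1_(Beethoven)
```

The result keeps underscores; replacing them with spaces would give the form that appears in the WEX dumps.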
For the avoidance of doubt... to what extent are the Wikipedia Names
(article.name) URL escaped? And what encoding do they use? Once spaces
have been converted to underscores, what about the /wikipedia/en keys?
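
One way to check the URL-escaping question empirically against the dump is sketched below. It assumes that any escaping present would be percent-encoded UTF-8, which nothing in this thread confirms for WEX. The helper names are hypothetical, and the code is Python 3 using only the standard library.

```python
from urllib.parse import unquote

def percent_decode_utf8(name):
    """Decode %XX escapes as UTF-8; raises UnicodeDecodeError on invalid byte sequences."""
    return unquote(name, encoding="utf-8", errors="strict")

def is_percent_escaped(name):
    """True if decoding changes the string, i.e. it contained decodable %XX escapes."""
    return percent_decode_utf8(name) != name

def to_underscore_form(name):
    """Decode any escapes, then use underscores in place of spaces."""
    return percent_decode_utf8(name).replace(" ", "_")

# Illustrative inputs only; these are not taken from the WEX dump.
for sample in ["Mr Spock", "Caf%C3%A9 au lait"]:
    print(repr(sample), is_percent_escaped(sample), repr(to_underscore_form(sample)))
```

A title that legitimately contains a percent sign followed by two hex digits would be misread as an escape, so this is a heuristic check rather than a definitive test.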
2008/7/26 Alexander Marks <al at metaweb.com>:
> And, just to be clear, those same Wikipedia names can always be converted into real Freebase keys, using Python logic like this:
>
> >>> article_name = "String Quartet No. 1 (Beethoven)"
> >>> freebase_id = "/wikipedia/en/" + mql_escape(article_name.replace(" ", "_"))
> >>> print freebase_id
> /wikipedia/en/String_Quartet_No$002E_1_$0028Beethoven$0029
>
> where mql_escape is defined in Kurt's earlier message.
>
> Al
>
> ----- Original Message -----
> From: "Alexander Marks" <al at metaweb.com>
> To: "For discussions about MQL, Freebase API and apps built on Freebase" <developers at freebase.com>
> Cc: "For discussions about MQL, Freebase API and apps built on Freebase" <developers at freebase.com>
> Sent: Saturday, July 26, 2008 12:29:34 PM (GMT-0800) America/Los_Angeles
> Subject: Re: [Developers] /wikipedia/en_id not giving results for redirected ids
>
> The names you see in WEX are *exclusively* Wikipedia names, not Freebase names. In MediaWiki, spaces are equivalent to underscores, so you can simply substitute underscores if you prefer that format (for instance, if you are generating URLs).
>
> See http://meta.wikimedia.org/wiki/Help:Page_name#Spaces_vs._underscores for more naming details.
>
> Al
>
> ----- Original Message -----
> From: "Shug Boabby" <shug.boabby at gmail.com>
> To: "For discussions about MQL, Freebase API and apps built on Freebase" <developers at freebase.com>
> Sent: Saturday, July 26, 2008 12:02:35 PM (GMT-0800) America/Los_Angeles
> Subject: Re: [Developers] /wikipedia/en_id not giving results for redirected ids
>
> hold on... this is the wrong way round, right? I want to get the
> Wikipedia Names (e.g. "Mr_Spock") from the WEX dumps, not the Freebase
> names. The WEX Dumps are definitely *not* the Wikipedia names, because
> they have spaces instead of underscores. There may be other
> differences, but they are not documented.
>
> 2008/7/24 Kurt Bollacker <kurt at metaweb.com>:
>>
>> On Thu, Jul 24, 2008 at 10:35:10PM +0100, Shug Boabby wrote:
>>> and I would still like somebody to clarify for
>>> me how to get the actual "wikipedia/en" name, given the article.name
>>> from the WEX dumps (spaces to underscores, but what else?).
>>
>> Here are some python functions we often use to convert WP names to
>> Freebase keys. mql_escape() may be all you need, but cleanwikiword()
>> is helpful for some messy WP names. You may also use utf8unescape()
>> when you need to handle a WP name you got from a HTTP GET.
>>
>> Keep in mind that not all WP names will resolve to Freebase IDs, for
>> reasons such as they are not topics (e.g. disambiguation articles) or
>> are brand new names that Freebase hasn't synced to yet.
>>
>> Let me know if you have questions. Kurt :-)
>>
>>
>>
>> ######################################################################
>> import codecs,struct,re
>>
>> # Let's deal with URL escaping of UTF
>> utf8decode=codecs.getdecoder('utf-8')
>> isodecode=codecs.getdecoder('iso8859_1')
>>
>> def sub3(mo):
>>     return(utf8decode(struct.pack("BBB",int(mo.group(1),16),int(mo.group(2),16),int(mo.group(3),16)))[0])
>>
>> def sub2(mo):
>>     return(utf8decode(struct.pack("BB",int(mo.group(1),16),int(mo.group(2),16)))[0])
>>
>> def sub1(mo):
>>     try:
>>         return(utf8decode(struct.pack("B",int(mo.group(1),16)))[0])
>>     except UnicodeDecodeError:
>>         return(isodecode(struct.pack("B",int(mo.group(1),16)))[0])
>>
>> def utf8unescape(s):
>>     ''' Converts UTF-8 strings that have been URL escaped
>>         back into UTF-8. '''
>>     # Get 3-byte UTF-8 sequences 1110xxxx 10yyyyyy 10zzzzzz
>>     s=re.sub('%(e[0-9a-f])%([89ab][0-9a-f])%([89ab][0-9a-f])(?im)',sub3,s)
>>     # Get 2-byte UTF-8 sequences 110xxxxx 10yyyyyy
>>     s=re.sub('%([cd][0-9a-f])%([89ab][0-9a-f])(?im)',sub2,s)
>>     # Get 1-byte UTF-8 sequences 0xxxxxxx
>>     s=re.sub('%([0-7][0-9a-f])(?im)',sub1,s)
>>     # Nuke any illegal characters.
>>     s=re.sub(u'[\ud800-\udfff\ufdd0-\ufdef\ufffe\uffff]','',s)
>>     s=re.sub('[\x00-\x08\x0b\x0c\x0e-\x1f]','',s)
>>     return(s)
>>
>> # Clean up whitespace
>> def cleanwikiword(s):
>>     ''' Clean up the spacing chars of a Wikipedia name '''
>>     s=utf8unescape(s)
>>     s=re.sub('^[_ \t\r\n]+','',s)
>>     s=re.sub('[_ \t\r\n]+$','',s)
>>     s=re.sub('[_ \t\r\n]+','_',s)
>>     s=s[0].upper()+s[1:]
>>     return(s)
>>
>> # Do the MW hex encoding
>> def dollarhex(mo):
>>     return(("$%04x" % ord(mo.group(1))).upper())
>>
>> def mql_escape(s):
>>     ''' Convert a string into a valid Freebase key value. '''
>>     s=re.sub('([^-A-Za-z0-9_])',dollarhex,s)
>>     s=re.sub('(^[-_])',dollarhex,s)
>>     s=re.sub('([-_])$',dollarhex,s)
>>     return(s)
>>
>> # This function shows how the above can be used together.
>> def wikiurltomwkey(s):
>>     return(mql_escape(cleanwikiword(utf8unescape(s))))
>>
>> # Do a test
>> if __name__=='__main__':
>>     s0='----%e9%80%81 %d4%90 %45 %1f -_-_ blah____'
>>     s1=utf8unescape(s0)
>>     s2=cleanwikiword(s1)
>>     s3=mql_escape(s2)
>>     print ':'+repr(s0)+':\n:'+repr(s1)+':\n:'+repr(s2)+':\n:'+repr(s3)+':\n'
>>
>> _______________________________________________
>> Developers mailing list
>> Developers at freebase.com
>> http://lists.freebase.com/mailman/listinfo/developers
>>