[Developers] Bulk download from sandbox.freebase.com

Tim Kientzle tim at metaweb.com
Wed Oct 10 01:26:50 UTC 2007


Thank you! Thank you!

I've been looking for a good test case for the Parallel HTTP requester 
I'm playing with.  This fills my requirements quite nicely.  ;-)

TBKK

Colin Evans wrote:
> Hi Kavitha,
> If you posted the script and queries that you're using, it would help me 
> to try to debug the issues that you're encountering.
> 
> The approach that we're currently using for dumping large chunks of the 
> graph is to do a simple cursored query for all guids of topics, like this:
> 
> ---
> {
>   "cursor":true,
>   "query":[{
>     "guid":null,
>     "limit":1000,
>     "type":"/common/topic"
>   }]
> }
> ---
> 
> .. and then run this query for each guid that we get back:
> 
> ---
> {
>   "query":[{
>     "guid":<guid>,
>     "/type/namespace/keys":[{
>       "limit":1000,
>       "link":{
>         "creator":null,
>         "timestamp":null
>       },
>       "namespace":null,
>       "optional":true,
>       "value":null
>     }],
>     "/type/object/key":[{
>       "limit":1000,
>       "link":{
>         "creator":null,
>         "timestamp":null
>       },
>       "namespace":null,
>       "optional":true,
>       "value":null
>     }],
>     "/type/reflect/any_master":[{
>       "guid":null,
>       "limit":1000,
>       "link":{
>         "creator":null,
>         "master_property":null,
>         "timestamp":null
>       },
>       "name":null,
>       "optional":true,
>       "type":[]
>     }],
>     "/type/reflect/any_reverse":[{
>       "guid":null,
>       "limit":1000,
>       "link":{
>         "creator":null,
>         "master_property":null,
>         "timestamp":null
>       },
>       "name":null,
>       "optional":true,
>       "type":[]
>     }],
>     "t:/type/reflect/any_value":[{
>       "lang":null,
>       "limit":1000,
>       "link":{
>         "creator":null,
>         "master_property":null,
>         "timestamp":null
>       },
>       "optional":true,
>       "type":"/type/text",
>       "value":null
>     }],
>     "v:/type/reflect/any_value":[{
>       "limit":1000,
>       "link":{
>         "creator":null,
>         "master_property":null,
>         "timestamp":null
>       },
>       "optional":true,
>       "type":null,
>       "value":null
>     }]
>   }]
> }
> ---
> 
> The above query uses some not-formally-supported features in order to 
> get every inbound and outbound property, so don't be surprised if we 
> remove and replace some of the above structures as MQL progresses.  Note 
> that the above query will time out for some guids -- this is because 
> there are too many inbound and outbound properties for the query to 
> finish in time.  At this point, I recommend that you skip those guids -- 
> the system should get faster with time.
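
A minimal Python sketch of the two-step approach described above, not Metaweb's actual dump code. The `fetch` callable is hypothetical: it is assumed to send one MQL envelope to the read service and return the decoded JSON response, raising `TimeoutError` when a query times out. It is also an assumption that the response carries its rows in `result` and the next cursor in `cursor` (false once the results are exhausted); check that against the cursor documentation.

```python
import copy

# First query from the message above: page through all topic guids.
GUID_QUERY = {"guid": None, "limit": 1000, "type": "/common/topic"}


def iter_guids(fetch, limit=1000):
    """Yield every topic guid by following the cursor page by page."""
    cursor = True  # True asks the service to open a new cursor
    while cursor:
        envelope = {"cursor": cursor, "query": [dict(GUID_QUERY, limit=limit)]}
        response = fetch(envelope)
        for row in response.get("result", []):
            yield row["guid"]
        # Assumed: next cursor value, or False when there are no more pages.
        cursor = response.get("cursor", False)


def dump_topics(fetch, detail_template, limit=1000):
    """Run the per-guid query for each topic, skipping guids that time out."""
    results, skipped = {}, []
    for guid in iter_guids(fetch, limit):
        query = copy.deepcopy(detail_template)
        query["guid"] = guid  # fill in the <guid> placeholder
        try:
            results[guid] = fetch({"query": [query]})["result"]
        except TimeoutError:
            skipped.append(guid)  # too many links to finish in time
    return results, skipped
```

Here `detail_template` would be the inner object of the second query as a Python dict. Keeping the skipped guids in a list makes it possible to retry just those later.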
> 
> Pulling down one topic at a time will be slow.  I recommend making full 
> use of HTTP pipelining, and also that you set up multiple threads to 
> query the system in parallel.  The latency should stay fairly constant, 
> but you should be able to get better throughput with multiple 
> simultaneous queries.
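
The parallel part can be sketched with a thread pool. This is an illustration only: `fetch_topic` is a hypothetical function that runs the per-guid query for one guid, and the default of 8 workers is an arbitrary starting point, not a recommended value. HTTP pipelining is not shown, since it depends on the HTTP client in use.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_all(guids, fetch_topic, workers=8):
    """Run fetch_topic(guid) on several threads; return (results, failed)."""
    results, failed = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to its guid so failures can be attributed.
        futures = {pool.submit(fetch_topic, guid): guid for guid in guids}
        for future in as_completed(futures):
            guid = futures[future]
            try:
                results[guid] = future.result()
            except Exception as exc:  # e.g. a timeout on a heavily linked topic
                failed[guid] = exc
    return results, failed
```

Each request still takes as long as before; the gain is that several are in flight at once, which is the throughput effect described above.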
> 
> We're working through a lot of these issues ourselves right now, and so 
> we don't currently have better answers on how to do this.  Given that, 
> we're open to making the data available, and are just trying to figure 
> out the best way to do it. 
> 
> Thanks
> Colin Evans
> 
> 
> Kavitha Srinivas wrote:
>> Hello
>>     We wrote some JavaScript to read every instance of a topic using
>> cursors from sandbox.freebase.com.  The first few reads from the
>> cursor succeeded.  After that, we keep getting timeouts on the read.
>> We amended our script to (a) retry multiple times, which did not work
>> because it appeared to get stuck at the same point, and (b) use limits
>> of 10 to get data, but this will simply not scale.  Any help is
>> appreciated.
>>
>> Thanks!
>> Kavitha
>>
>>
>> On Oct 8, 2007, at 3:53 PM, Tim Kientzle wrote:
>>
>>   
>>> If you can come up with a good way to do this
>>> and performance really turns out to be a problem,
>>> we *may* be able to host it internally and provide
>>> the dump as a downloadable file.  (Have the dump
>>> regenerated once a week or so, perhaps?)
>>>
>>> No promises, but it's a possibility.
>>>
>>> You should, of course, try to make it work
>>> externally first.  Our internal connections
>>> are faster, but may not be as much faster as you
>>> think.
>>>
>>> TBKK
>>>
>>> Shawn Simister wrote:
>>>     
>>>> I'm eager to see how this turns out. Even with a cursor returning 100
>>>> instances at a time it could take a while to get every instance in
>>>> Freebase. Maybe there's some way that Metaweb could run your query
>>>> locally and just let us download the JSON results as one file. Then we
>>>> can convert those query results to RDF locally. You might also consider
>>>> contacting the DBpedia.org folks. If I remember correctly, they were
>>>> also interested in getting RDF dumps of the Freebase data.
>>>>
>>>> Shawn
>>>>
>>>> John Giannandrea wrote:
>>>>       
>>>>> Kavitha Srinivas wrote:
>>>>>
>>>>>         
>>>>>> Here's what we tried -- we tried to get any instance of /common/topic
>>>>>> and dump all of its links to other instances of /common/topic. When
>>>>>> we try this with no explicit limits set, this gives us some randomly
>>>>>> selected instances (within a default limit, which I guess is 100).
>>>>>>
>>>>>>           
>>>>> To succeed at this you will need to use the "cursor" feature of MQL
>>>>> documented here.
>>>>>
>>>>> http://www.freebase.com/view/helptopic?id=%239202a8c04000641f800000000544e139#cursors
>>>>>
>>>>> -jg
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Developers mailing list
>>>>> Developers at freebase.com
>>>>> http://lists.freebase.com/mailman/listinfo/developers

