[Developers] Bulk download from sandbox.freebase.com

Kavitha Srinivas ksrinivs at gmail.com
Wed Oct 10 13:21:36 UTC 2007


Yes it does!  Thanks all for your suggestions.
Kavitha


On Oct 10, 2007, at 12:04 AM, Kurt Bollacker wrote:

> On Tue, Oct 09, 2007 at 10:39:45PM -0400, Kavitha Srinivas wrote:
>> Hi Tim
>>    Thanks for offering to help us.  The key issue might be timeouts
>> on objects with many properties.  We played with different limits in
>> this query, but with no luck.
>> Here's the query file we used, as a sample.
>
> There are only a few cases of topics with too many properties to ask
> for in a single query:
>
>  - Getting the instances of common types
>  - Getting the keys of large namespaces
>  - Getting all of the properties of literal languages (e.g. /lang/en)
>
> While there are ways to use cursors to get these, there is a simpler
> solution.  Since you are planning a full crawl, (almost) all property
> instances will be accessed twice (one from each direction) with the
> query that Colin sent.  You can speed up your crawl and likely avoid
> (most of) the timeouts by only asking for the property instances once
> each as in:
>
>
>  {
>         "guid":TOPIC_GUID_GOES_HERE,
>         "/type/reflect/any_master":[{
>             "optional":true,
>             "limit":1000,
>             "guid":null,
>             "link":{"creator":null,"timestamp":null,"master_property":null},
>             "name":null,
>             "type":[]
>         }],
>         "v:/type/reflect/any_value":[{
>             "optional":true,
>             "limit":1000,
>             "link":{"creator":null,"timestamp":null,"master_property":null},
>             "type":null,
>             "value":null
>         }],
>         "/type/object/key":[{
>             "optional":true,
>             "limit":1000,
>             "link":{"creator":null,"timestamp":null},
>             "value":null,
>             "namespace":null
>         }]
> }
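
A minimal sketch of filling a topic guid into the query above, in case it helps. The function names buildTopicQuery and buildEnvelope are made up for illustration, and the envelope simply mirrors the queries={"qname":{...}} form that Kavitha's script in this thread sends; nothing here is an official client.

```javascript
// Hypothetical helper: returns a fresh "link" clause for each property.
function linkClause(withMasterProperty) {
  var link = { creator: null, timestamp: null };
  if (withMasterProperty) link.master_property = null;
  return link;
}

// Builds the one-direction query from the message above for a single guid.
function buildTopicQuery(guid) {
  return {
    "guid": guid,
    "/type/reflect/any_master": [{
      "optional": true, "limit": 1000, "guid": null,
      "link": linkClause(true), "name": null, "type": []
    }],
    "v:/type/reflect/any_value": [{
      "optional": true, "limit": 1000,
      "link": linkClause(true), "type": null, "value": null
    }],
    "/type/object/key": [{
      "optional": true, "limit": 1000,
      "link": linkClause(false), "value": null, "namespace": null
    }]
  };
}

// Wraps the query in the {"qname":{"query":...}} envelope (assumed format).
function buildEnvelope(guid) {
  return JSON.stringify({ qname: { query: buildTopicQuery(guid) } });
}
```

Building the query as an object and serializing it with JSON.stringify avoids hand-escaping quotes in a string literal.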
>
> After the crawl, you will have to go back and reattach the links from
> the other direction (the opposite side will be named in the "guid" for
> "any_master" and "namespace" for "key"s).
>
> Hope this helps.....						Kurt :-)
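
The reattachment step described above could be sketched roughly as follows. This assumes the crawl results are kept in a plain object keyed by topic guid and that each result has the shape of the query above; both are assumptions, and buildReverseIndex is a made-up name. Note that the "namespace" value is whatever identifier the service returns for it, so mapping it back to a guid may need an extra lookup that is not shown.

```javascript
// Builds an index from the target of each link back to its source topic.
// resultsByGuid: { topicGuid: resultObject } (assumed storage format).
function buildReverseIndex(resultsByGuid) {
  var reverse = {};

  function add(target, entry) {
    if (!reverse[target]) reverse[target] = [];
    reverse[target].push(entry);
  }

  Object.keys(resultsByGuid).forEach(function (source) {
    var result = resultsByGuid[source];

    // For any_master, the opposite side is named in "guid".
    (result["/type/reflect/any_master"] || []).forEach(function (m) {
      add(m.guid, { from: source, property: m.link.master_property });
    });

    // For keys, the opposite side is named in "namespace".
    (result["/type/object/key"] || []).forEach(function (k) {
      add(k.namespace, { from: source, key: k.value });
    });
  });

  return reverse;
}
```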
>
>
>
>
>> Kavitha
>
>> var readEverythingQuery = "[{\"/type/reflect/any_master\":[{"
>>   + "\"link\":{\"master_property\":{\"name\":null}},"
>>   + "\"name\":null,\"type\":\"/common/topic\"}],"
>>   + "\"name\":null,\"type\":\"/common/topic\"}]";
>>
>> var mqlreadURI = "http://sandbox.freebase.com/api/service/mqlread";
>>
>> function composeQueryURL(query, cursor) {
>>   return encodeURI(mqlreadURI + "?queries={\"qname\":{\"query\":"
>>     + query + ",\"cursor\":" + cursor + "}}");
>> }
>>
>> var limit = 5;
>>
>> var cursor = "true";
>>
>> while (limit > 0) {
>>   var url = composeQueryURL(readEverythingQuery, cursor);
>>
>>   var resultStr = readUrl( url );
>>
>>   print(resultStr);
>>
>>   var result = eval( '(' + resultStr + ')' );
>>
>>   print(result);
>>
>>   cursor = '"' + result.qname.cursor + '"';
>>
>>   print(cursor);
>>
>>   var objects = result.qname.result;
>>   for(var i = 0; i < objects.length; i++) {
>>     var object = objects[i];
>>     print(object.name);
>>     var links = object["/type/reflect/any_master"];
>>     for(var j = 0; j < links.length; j++) {
>>       print("  -- " + links[j].link.master_property.name + " --> "
>>         + links[j].name);
>>     }
>>   }
>>
>>   limit--;
>> }
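
A variation on the loop above, offered only as a sketch. It assumes the service returns a falsy cursor once nothing is left, takes the HTTP call as a caller-supplied function (fetchText is a hypothetical name standing in for readUrl), and uses JSON.parse, which may not exist in older JavaScript shells. It also swaps encodeURI for encodeURIComponent on the query value, since encodeURI leaves characters such as "&" and "#" unescaped.

```javascript
var MQLREAD_URI = "http://sandbox.freebase.com/api/service/mqlread";

// Builds the request URL with the whole envelope escaped as one parameter.
function composeQueryURL(query, cursor) {
  var envelope = { qname: { query: query, cursor: cursor } };
  return MQLREAD_URI + "?queries=" + encodeURIComponent(JSON.stringify(envelope));
}

// Reads pages until the cursor comes back falsy or maxPages is reached.
// fetchText: hypothetical function taking a URL and returning the body text.
function readAll(query, fetchText, maxPages) {
  var results = [];
  var cursor = true; // true asks the service to open a cursor
  for (var page = 0; page < maxPages && cursor; page++) {
    var response = JSON.parse(fetchText(composeQueryURL(query, cursor)));
    results = results.concat(response.qname.result || []);
    cursor = response.qname.cursor; // assumed falsy when the result set is exhausted
  }
  return results;
}
```

Passing the cursor through as a value, rather than wrapping it in quotes by hand, keeps a final false from being sent back as the string "false".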
>>
>
>>
>> On Oct 9, 2007, at 9:26 PM, Tim Kientzle wrote:
>>
>>> Thank you! Thank you!
>>>
>>> I've been looking for a good test case for the Parallel HTTP requester
>>> I'm playing with.  This fills my requirements quite nicely.  ;-)
>>>
>>> TBKK
>>>
>>> Colin Evans wrote:
>>>> Hi Kavitha,
>>>> If you posted the script and queries that you're using, it would help
>>>> me to try to debug the issues that you're encountering.
>>>>
>>>> The approach that we're currently using for dumping large chunks of
>>>> the graph is to do a simple cursored query for all guids of topics,
>>>> like this:
>>>>
>>>> ---
>>>> {
>>>>  "cursor":true,
>>>>  "query":[{
>>>>    "guid":null,
>>>>    "limit":1000,
>>>>    "type":"/common/topic"
>>>>  }]
>>>> }
>>>> ---
>>>>
>>>> .. and then run this query for each guid that we get back:
>>>>
>>>> ---
>>>> {
>>>>  "query":[{
>>>>    "guid":<guid>,
>>>>    "/type/namespace/keys":[{
>>>>      "limit":1000,
>>>>      "link":{
>>>>        "creator":null,
>>>>        "timestamp":null
>>>>      },
>>>>      "namespace":null,
>>>>      "optional":true,
>>>>      "value":null
>>>>    }],
>>>>    "/type/object/key":[{
>>>>      "limit":1000,
>>>>      "link":{
>>>>        "creator":null,
>>>>        "timestamp":null
>>>>      },
>>>>      "namespace":null,
>>>>      "optional":true,
>>>>      "value":null
>>>>    }],
>>>>    "/type/reflect/any_master":[{
>>>>      "guid":null,
>>>>      "limit":1000,
>>>>      "link":{
>>>>        "creator":null,
>>>>        "master_property":null,
>>>>        "timestamp":null
>>>>      },
>>>>      "name":null,
>>>>      "optional":true,
>>>>      "type":[]
>>>>    }],
>>>>    "/type/reflect/any_reverse":[{
>>>>      "guid":null,
>>>>      "limit":1000,
>>>>      "link":{
>>>>        "creator":null,
>>>>        "master_property":null,
>>>>        "timestamp":null
>>>>      },
>>>>      "name":null,
>>>>      "optional":true,
>>>>      "type":[]
>>>>    }],
>>>>    "t:/type/reflect/any_value":[{
>>>>      "lang":null,
>>>>      "limit":1000,
>>>>      "link":{
>>>>        "creator":null,
>>>>        "master_property":null,
>>>>        "timestamp":null
>>>>      },
>>>>      "optional":true,
>>>>      "type":"/type/text",
>>>>      "value":null
>>>>    }],
>>>>    "v:/type/reflect/any_value":[{
>>>>      "limit":1000,
>>>>      "link":{
>>>>        "creator":null,
>>>>        "master_property":null,
>>>>        "timestamp":null
>>>>      },
>>>>      "optional":true,
>>>>      "type":null,
>>>>      "value":null
>>>>    }]
>>>>  }]
>>>> }
>>>> ---
>>>>
>>>> The above query uses some not-formally-supported features in order to
>>>> get every inbound and outbound property, so don't be surprised if we
>>>> remove and replace some of the above structures as MQL progresses.
>>>> Note that the above query will time out for some guids -- this is
>>>> because there are too many inbound and outbound properties for the
>>>> query to finish in time.  At this point, I recommend that you skip
>>>> those guids -- the system should get faster with time.
>>>>
>>>> Pulling down one topic at a time will be slow.  I recommend making full
>>>> use of HTTP pipelining, and also that you set up multiple threads to
>>>> query the system in parallel.  The latency should stay fairly constant,
>>>> but you should be able to get better throughput with multiple
>>>> simultaneous queries.
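
One rough way to combine the two recommendations above (skip guids that time out, and keep several queries in flight) is sketched below. It uses concurrent asynchronous requests rather than threads and does not attempt HTTP pipelining. It needs a JavaScript runtime with async/await, and fetchTopic is a hypothetical caller-supplied function that takes a guid and resolves to the parsed result or rejects on a timeout.

```javascript
// Runs fetchTopic over all guids with at most `concurrency` requests in flight.
// Guids whose request fails (for example on a timeout) are collected in
// `skipped` instead of stopping the crawl.
async function crawl(guids, fetchTopic, concurrency) {
  var results = {};
  var skipped = [];
  var next = 0; // index of the next guid to hand out

  async function worker() {
    while (next < guids.length) {
      var guid = guids[next++];
      try {
        results[guid] = await fetchTopic(guid);
      } catch (err) {
        skipped.push(guid); // retry later, once the service is faster
      }
    }
  }

  var workers = [];
  var count = Math.min(concurrency, guids.length);
  for (var i = 0; i < count; i++) workers.push(worker());
  await Promise.all(workers);

  return { results: results, skipped: skipped };
}
```

Keeping the skipped guids in a list makes it easy to rerun just those later with a smaller limit or a cursor.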
>>>>
>>>> We're working through a lot of these issues ourselves right now, and so
>>>> we don't currently have better answers on how to do this.  Given that,
>>>> we're open to making the data available, and are just trying to figure
>>>> out the best way to do it.
>>>>
>>>> Thanks
>>>> Colin Evans
>>>>
>>>>
>>>> Kavitha Srinivas wrote:
>>>>> Hello
>>>>>    We wrote some Javascript to read every instance of a topic using
>>>>> cursors from sandbox.freebase.com.  We managed to read successfully a
>>>>> few times from the cursor using this technique.  However, after that
>>>>> we keep getting timeouts in the read.  We amended our script to (a)
>>>>> try multiple times, which did not work because it appeared to get
>>>>> stuck at the same point, (b) try using limits of 10 to get data, but
>>>>> then this will simply not scale.  Any help is appreciated.
>>>>>
>>>>> Thanks!
>>>>> Kavitha
>>>>>
>>>>>
>>>>> On Oct 8, 2007, at 3:53 PM, Tim Kientzle wrote:
>>>>>
>>>>>
>>>>>> If you can come up with a good way to do this
>>>>>> and performance really turns out to be a problem,
>>>>>> we *may* be able to host it internally and provide
>>>>>> the dump as a downloadable file.  (Have the dump
>>>>>> regenerated once a week or so, perhaps?)
>>>>>>
>>>>>> No promises, but it's a possibility.
>>>>>>
>>>>>> You should, of course, try to make it work
>>>>>> externally first.  Our internal connections
>>>>>> are faster, but may not be as much faster as you
>>>>>> think.
>>>>>>
>>>>>> TBKK
>>>>>>
>>>>>> Shawn Simister wrote:
>>>>>>
>>>>>>> I'm eager to see how this turns out. Even with a cursor returning 100
>>>>>>> instances at a time it could take a while to get every instance in
>>>>>>> Freebase. Maybe there's some way that Metaweb could run your query
>>>>>>> locally and just let us download the JSON results as one file. Then
>>>>>>> we can convert those query results to RDF locally. You might also
>>>>>>> consider contacting the DBpedia.org folks. If I remember correctly,
>>>>>>> they were also interested in getting RDF dumps of the Freebase data.
>>>>>>>
>>>>>>> Shawn
>>>>>>>
>>>>>>> John Giannandrea wrote:
>>>>>>>
>>>>>>>> Kavitha Srinivas wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> Here's what we tried -- we tried to get any instance of
>>>>>>>>> /common/topic and dump all of its links to other instances of
>>>>>>>>> /common/topic.  When we try this with no explicit limits set, this
>>>>>>>>> gives us some randomly selected instances (within a default limit,
>>>>>>>>> which I guess is 100).
>>>>>>>>>
>>>>>>>>>
>>>>>>>> To succeed at this you will need to use the "cursor" feature of MQL
>>>>>>>> documented here.
>>>>>>>>
>>>>>>>> http://www.freebase.com/view/helptopic?id=%239202a8c04000641f800000000544e139#cursors
>>>>>>>>
>>>>>>>> -jg
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Developers mailing list
>>>>>>>> Developers at freebase.com
>>>>>>>> http://lists.freebase.com/mailman/listinfo/developers
>>>>>>>>
>>>>>>>>
>>>>>>>>


