[Developers] Bulk download from sandbox.freebase.com
Kavitha Srinivas
ksrinivs at gmail.com
Wed Oct 10 13:21:36 UTC 2007
Yes it does! Thanks all for your suggestions.
Kavitha
On Oct 10, 2007, at 12:04 AM, Kurt Bollacker wrote:
> On Tue, Oct 09, 2007 at 10:39:45PM -0400, Kavitha Srinivas wrote:
>> Hi Tim
>> Thanks for offering to help us. The key issue might be timeouts
>> on objects with many properties. We played with different limits in
>> this query, but with no luck.
>> Here's the query file we used, as a sample.
>
> There are only a few cases where a topic has too many properties to ask
> for in a single query:
>
> - Getting the instances of common types
> - Getting the keys of large namespaces
> - Getting all of the properties of literal languages (e.g. /lang/en)
>
> While there are ways to use cursors to get these, there is a simpler
> solution. Since you are planning a full crawl, (almost) all property
> instances will be accessed twice (once from each direction) with the
> query that Colin sent. You can speed up your crawl and likely avoid
> (most of) the timeouts by only asking for the property instances once
> each as in:
>
>
> {
> "guid":TOPIC_GUID_GOES_HERE,
> "/type/reflect/any_master":[{
> "optional":true,
> "limit":1000,
> "guid":null,
> "link":
> {"creator":null,"timestamp":null,"master_property":null},
> "name":null,
> "type":[]
> }],
> "v:/type/reflect/any_value":[{
> "optional":true,
> "limit":1000,
> "link":
> {"creator":null,"timestamp":null,"master_property":null},
> "type":null,
> "value":null
> }],
> "/type/object/key":[{
> "optional":true,
> "limit":1000,
> "link":{"creator":null,"timestamp":null},
> "value":null,
> "namespace":null
> }]
> }
>
> After the crawl, you will have to go back and reattach the links from
> the other direction (the opposite side will be named in the "guid" for
> "any_master" and "namespace" for "key"s).
>
> Hope this helps..... Kurt :-)
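
The reattachment step described above could be sketched roughly as follows.
The record fields (guid, /type/reflect/any_master, /type/object/key) follow
Kurt's query; the function name buildReverseIndex, the shape of the returned
index, and the use of whatever identifier the crawl returned in "namespace"
as a lookup key are assumptions made for illustration only.

```javascript
// Hypothetical helper: rebuild inbound links from crawl records that only
// stored each link once, on its outbound side.
function buildReverseIndex(records) {
  var inbound = {}; // target identifier -> list of {from, via, kind}

  function add(target, entry) {
    if (!target) return; // nothing to attach to
    if (!inbound[target]) inbound[target] = [];
    inbound[target].push(entry);
  }

  records.forEach(function (rec) {
    (rec["/type/reflect/any_master"] || []).forEach(function (l) {
      // The opposite side of an any_master link is named in its "guid".
      add(l.guid, { from: rec.guid, via: l.link.master_property, kind: "master" });
    });
    (rec["/type/object/key"] || []).forEach(function (k) {
      // The opposite side of a key is named in its "namespace".
      add(k.namespace, { from: rec.guid, via: k.value, kind: "key" });
    });
  });
  return inbound;
}
```

After the crawl, inbound[x] lists every record that pointed at x, which is
the information the second direction of the original query would have
returned.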
>
>
>
>
>> Kavitha
>
>> var readEverythingQuery = "[{\"/type/reflect/any_master\":[{\"link\":" +
>>   "{\"master_property\":{\"name\":null}},\"name\":null,\"type\":" +
>>   "\"/common/topic\"}],\"name\":null,\"type\":\"/common/topic\"}]";
>>
>> var mqlreadURI = "http://sandbox.freebase.com/api/service/mqlread";
>>
>> function composeQueryURL(query, cursor) {
>> return encodeURI(mqlreadURI + "?queries={\"qname\":{\"query\":"
>> + query + ",\"cursor\":" + cursor + "}}");
>> }
>>
>> var limit = 5;
>>
>> var cursor = "true";
>>
>> while (limit > 0) {
>> var url = composeQueryURL(readEverythingQuery, cursor);
>>
>> var resultStr = readUrl( url );
>>
>> print(resultStr);
>>
>> var result = eval( '(' + resultStr + ')' );
>>
>> print(result);
>>
>> cursor = '"' + result.qname.cursor + '"';
>>
>> print(cursor);
>>
>> var objects = result.qname.result;
>> for(var i = 0; i < objects.length; i++) {
>> var object = objects[i];
>> print(object.name);
>> var links = object["/type/reflect/any_master"];
>> for(var j = 0; j < links.length; j++) {
>> print(" -- " + links[j].link.master_property.name + " --> "
>> + links[j].name);
>> }
>> }
>>
>> limit--;
>> }
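
One possible way to make the URL construction in this script less fragile is
sketched below. It assumes a JavaScript engine with a built-in JSON object,
which the original environment may not have had. It builds the query
envelope from plain objects and escapes it with encodeURIComponent. The
script above uses encodeURI, which leaves characters such as "#" and "&"
unescaped, so a guid beginning with "#" inside the query would be read as a
URL fragment. The envelope layout (qname, query, cursor) is taken from the
script; everything else is illustrative.

```javascript
var mqlreadURI = "http://sandbox.freebase.com/api/service/mqlread";

// Build the mqlread URL from plain objects rather than hand-escaped strings.
// cursor is either true (first page) or the string returned by the last page.
function composeQueryURL(query, cursor) {
  var envelope = { qname: { query: query, cursor: cursor } };
  return mqlreadURI + "?queries=" + encodeURIComponent(JSON.stringify(envelope));
}

// Example: the same "read everything" query, written as an object.
var readEverythingQuery = [{
  "/type/reflect/any_master": [{
    "link": { "master_property": { "name": null } },
    "name": null,
    "type": "/common/topic"
  }],
  "name": null,
  "type": "/common/topic"
}];
var firstPageURL = composeQueryURL(readEverythingQuery, true);
```

With this shape the cursor returned by one response can be passed straight
back in as a string, without wrapping it in quotes by hand.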
>>
>
>>
>> On Oct 9, 2007, at 9:26 PM, Tim Kientzle wrote:
>>
>>> Thank you! Thank you!
>>>
>>> I've been looking for a good test case for the Parallel HTTP requester
>>> I'm playing with. This fills my requirements quite nicely. ;-)
>>>
>>> TBKK
>>>
>>> Colin Evans wrote:
>>>> Hi Kavitha,
>>>> If you posted the script and queries that you're using, it would help
>>>> me to try to debug the issues that you're encountering.
>>>>
>>>> The approach that we're currently using for dumping large chunks of the
>>>> graph is to do a simple cursored query for all guids of topics, like
>>>> this:
>>>>
>>>> ---
>>>> {
>>>> "cursor":true,
>>>> "query":[{
>>>> "guid":null,
>>>> "limit":1000,
>>>> "type":"/common/topic"
>>>> }]
>>>> }
>>>> ---
>>>>
>>>> .. and then run this query for each guid that we get back:
>>>>
>>>> ---
>>>> {
>>>> "query":[{
>>>> "guid":<guid>,
>>>> "/type/namespace/keys":[{
>>>> "limit":1000,
>>>> "link":{
>>>> "creator":null,
>>>> "timestamp":null
>>>> },
>>>> "namespace":null,
>>>> "optional":true,
>>>> "value":null
>>>> }],
>>>> "/type/object/key":[{
>>>> "limit":1000,
>>>> "link":{
>>>> "creator":null,
>>>> "timestamp":null
>>>> },
>>>> "namespace":null,
>>>> "optional":true,
>>>> "value":null
>>>> }],
>>>> "/type/reflect/any_master":[{
>>>> "guid":null,
>>>> "limit":1000,
>>>> "link":{
>>>> "creator":null,
>>>> "master_property":null,
>>>> "timestamp":null
>>>> },
>>>> "name":null,
>>>> "optional":true,
>>>> "type":[]
>>>> }],
>>>> "/type/reflect/any_reverse":[{
>>>> "guid":null,
>>>> "limit":1000,
>>>> "link":{
>>>> "creator":null,
>>>> "master_property":null,
>>>> "timestamp":null
>>>> },
>>>> "name":null,
>>>> "optional":true,
>>>> "type":[]
>>>> }],
>>>> "t:/type/reflect/any_value":[{
>>>> "lang":null,
>>>> "limit":1000,
>>>> "link":{
>>>> "creator":null,
>>>> "master_property":null,
>>>> "timestamp":null
>>>> },
>>>> "optional":true,
>>>> "type":"/type/text",
>>>> "value":null
>>>> }],
>>>> "v:/type/reflect/any_value":[{
>>>> "limit":1000,
>>>> "link":{
>>>> "creator":null,
>>>> "master_property":null,
>>>> "timestamp":null
>>>> },
>>>> "optional":true,
>>>> "type":null,
>>>> "value":null
>>>> }]
>>>> }]
>>>> }
>>>> ---
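
The two-step approach above (page through the guids, then query each one)
might be driven by something like the following sketch. The names
collectGuids, queryForGuid, readPage and maxPages are hypothetical. readPage
stands in for whatever HTTP call sends the cursored query and returns the
parsed response. The sketch also assumes that the service reports a falsy
cursor once the result set is exhausted, and that JSON is available in the
engine.

```javascript
// Phase 1: page through the topic guids with a cursor.
// readPage(cursor) is a hypothetical caller-supplied function that sends the
// cursored query and returns the parsed response {result: [...], cursor: ...}.
function collectGuids(readPage, maxPages) {
  var guids = [];
  var cursor = true; // true asks the service to start a new cursor
  for (var page = 0; page < maxPages && cursor; page++) {
    var response = readPage(cursor);
    response.result.forEach(function (topic) {
      guids.push(topic.guid);
    });
    cursor = response.cursor; // assumed falsy once there are no more pages
  }
  return guids;
}

// Phase 2: fill the <guid> placeholder of a per-topic query template.
function queryForGuid(template, guid) {
  var query = JSON.parse(JSON.stringify(template)); // deep copy, template untouched
  query[0].guid = guid;
  return query;
}
```

The maxPages bound is only a safety stop for experiments; a full crawl would
set it high enough to never trigger.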
>>>>
>>>> The above query uses some not-formally-supported features in order to
>>>> get every inbound and outbound property, so don't be surprised if we
>>>> remove and replace some of the above structures as MQL progresses. Note
>>>> that the above query will time out for some guids -- this is because
>>>> there are too many inbound and outbound properties for the query to
>>>> finish in time. At this point, I recommend that you skip those guids --
>>>> the system should get faster with time.
>>>>
>>>> Pulling down one topic at a time will be slow. I recommend making full
>>>> use of HTTP pipelining, and also that you set up multiple threads to
>>>> query the system in parallel. The latency should stay fairly constant,
>>>> but you should be able to get better throughput with multiple
>>>> simultaneous queries.
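
A rough sketch of the "several queries in flight, skip the guids that time
out" advice is given below, using Promises in place of threads. It covers
only the concurrency side; HTTP pipelining depends on the HTTP client and is
not shown. crawlInParallel, fetchTopic and width are hypothetical names, and
fetchTopic is assumed to return a Promise that rejects on a timeout.

```javascript
// Run fetchTopic(guid) for every guid with at most `width` requests in flight.
// fetchTopic is a hypothetical caller-supplied function returning a Promise;
// a rejected Promise (for example a timeout) marks that guid as skipped.
function crawlInParallel(guids, width, fetchTopic) {
  var results = {};  // guid -> fetched topic
  var skipped = [];  // guids whose query failed
  var next = 0;      // index of the next guid to hand out

  function worker() {
    if (next >= guids.length) return Promise.resolve();
    var guid = guids[next++];
    return fetchTopic(guid).then(
      function (topic) { results[guid] = topic; },
      function () { skipped.push(guid); } // skip and move on
    ).then(worker); // pick up the next guid when this one settles
  }

  var workers = [];
  for (var i = 0; i < Math.min(width, guids.length); i++) {
    workers.push(worker());
  }
  return Promise.all(workers).then(function () {
    return { results: results, skipped: skipped };
  });
}
```

The skipped list can be retried later, once the slow guids are expected to
complete within the timeout.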
>>>>
>>>> We're working through a lot of these issues ourselves right now, and so
>>>> we don't currently have better answers on how to do this. Given that,
>>>> we're open to making the data available, and are just trying to figure
>>>> out the best way to do it.
>>>>
>>>> Thanks
>>>> Colin Evans
>>>>
>>>>
>>>> Kavitha Srinivas wrote:
>>>>> Hello
>>>>> We wrote some JavaScript to read every instance of a topic using
>>>>> cursors from sandbox.freebase.com. We managed to read successfully a
>>>>> few times from the cursor using this technique. However, after that
>>>>> we keep getting timeouts in the read. We amended our script to (a)
>>>>> try multiple times, which did not work because it appeared to get
>>>>> stuck at the same point, and (b) try using limits of 10 to get data,
>>>>> but this will simply not scale. Any help is appreciated.
>>>>>
>>>>> Thanks!
>>>>> Kavitha
>>>>>
>>>>>
>>>>> On Oct 8, 2007, at 3:53 PM, Tim Kientzle wrote:
>>>>>
>>>>>
>>>>>> If you can come up with a good way to do this
>>>>>> and performance really turns out to be a problem,
>>>>>> we *may* be able to host it internally and provide
>>>>>> the dump as a downloadable file. (Have the dump
>>>>>> regenerated once a week or so, perhaps?)
>>>>>>
>>>>>> No promises, but it's a possibility.
>>>>>>
>>>>>> You should, of course, try to make it work
>>>>>> externally first. Our internal connections
>>>>>> are faster, but may not be as much faster as you
>>>>>> think.
>>>>>>
>>>>>> TBKK
>>>>>>
>>>>>> Shawn Simister wrote:
>>>>>>
>>>>>>> I'm eager to see how this turns out. Even with a cursor returning 100
>>>>>>> instances at a time it could take a while to get every instance in
>>>>>>> Freebase. Maybe there's some way that Metaweb could run your query
>>>>>>> locally and just let us download the JSON results as one file. Then we
>>>>>>> can convert those query results to RDF locally. You might also consider
>>>>>>> contacting the DBpedia.org folks. If I remember correctly, they were
>>>>>>> also interested in getting RDF dumps of the Freebase data.
>>>>>>>
>>>>>>> Shawn
>>>>>>>
>>>>>>> John Giannandrea wrote:
>>>>>>>
>>>>>>>> Kavitha Srinivas wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> Here's what we tried -- we tried to get any instance of
>>>>>>>>> /common/topic and dump all of its links to other instances of
>>>>>>>>> /common/topic. When we try this with no explicit limits set, this
>>>>>>>>> gives us some randomly selected instances (within a default limit,
>>>>>>>>> which I guess is 100).
>>>>>>>>>
>>>>>>>>>
>>>>>>>> To succeed at this you will need to use the "cursor" feature of MQL
>>>>>>>> documented here:
>>>>>>>>
>>>>>>>> http://www.freebase.com/view/helptopic?id=%239202a8c04000641f800000000544e139#cursors
>>>>>>>>
>>>>>>>> -jg
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Developers mailing list
>>>>>>>> Developers at freebase.com
>>>>>>>> http://lists.freebase.com/mailman/listinfo/developers
>>>>>>>>
>>>>>>>>
>>>>>>>>