[Developers] Bulk download from sandbox.freebase.com

Colin Evans colin at metaweb.com
Tue Oct 9 23:02:15 UTC 2007


Hi Kavitha,
If you post the script and queries that you're using, it will help me 
debug the issues that you're encountering.

The approach that we're currently using for dumping large chunks of the 
graph is to do a simple cursored query for all guids of topics, like this:

---
{
  "cursor":true,
  "query":[{
    "guid":null,
    "limit":1000,
    "type":"/common/topic"
  }]
}
---
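
A minimal sketch of the paging loop around that query, in Python. It 
assumes that the response envelope carries the matches under "result" 
and the next cursor value under "cursor", and that "cursor" comes back 
as false once the result set is exhausted; please check the cursor 
documentation before relying on that. The `send` callable is a 
hypothetical stand-in for whatever HTTP call posts the envelope to the 
read service and returns the decoded JSON.

```python
import copy

# Envelope from the message above; "cursor": True requests the first page.
TOPIC_GUID_QUERY = {
    "cursor": True,
    "query": [{
        "guid": None,
        "limit": 1000,
        "type": "/common/topic",
    }],
}


def iter_topic_guids(send, envelope=TOPIC_GUID_QUERY):
    """Yield every topic guid, one page at a time.

    `send` is a hypothetical callable: it takes an envelope dict, posts it
    to the read service, and returns the decoded response dict.
    Assumption: the response carries "result" (a list of matches) and
    "cursor" (a string for the next page, or False when there are no more).
    """
    envelope = copy.deepcopy(envelope)  # leave the shared template untouched
    while True:
        response = send(envelope)
        for item in response.get("result", []):
            yield item["guid"]
        cursor = response.get("cursor")
        if not cursor:
            break
        # Pass the returned cursor back to resume where the last page ended.
        envelope["cursor"] = cursor
```

Because this is a generator, each guid can be handed to the next step as 
soon as its page arrives, without holding the full list in memory.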

.. and then run this query for each guid that we get back:

---
{
  "query":[{
    "guid":<guid>,
    "/type/namespace/keys":[{
      "limit":1000,
      "link":{
        "creator":null,
        "timestamp":null
      },
      "namespace":null,
      "optional":true,
      "value":null
    }],
    "/type/object/key":[{
      "limit":1000,
      "link":{
        "creator":null,
        "timestamp":null
      },
      "namespace":null,
      "optional":true,
      "value":null
    }],
    "/type/reflect/any_master":[{
      "guid":null,
      "limit":1000,
      "link":{
        "creator":null,
        "master_property":null,
        "timestamp":null
      },
      "name":null,
      "optional":true,
      "type":[]
    }],
    "/type/reflect/any_reverse":[{
      "guid":null,
      "limit":1000,
      "link":{
        "creator":null,
        "master_property":null,
        "timestamp":null
      },
      "name":null,
      "optional":true,
      "type":[]
    }],
    "t:/type/reflect/any_value":[{
      "lang":null,
      "limit":1000,
      "link":{
        "creator":null,
        "master_property":null,
        "timestamp":null
      },
      "optional":true,
      "type":"/type/text",
      "value":null
    }],
    "v:/type/reflect/any_value":[{
      "limit":1000,
      "link":{
        "creator":null,
        "master_property":null,
        "timestamp":null
      },
      "optional":true,
      "type":null,
      "value":null
    }]
  }]
}
---

The above query uses some not-formally-supported features in order to 
get every inbound and outbound property, so don't be surprised if we 
remove and replace some of the above structures as MQL progresses.  Note 
that the above query will time out for some guids -- this is because 
there are too many inbound and outbound properties for the query to 
finish in time.  At this point, I recommend that you skip those guids -- 
the system should get faster with time.
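
Skipping the guids that time out could look roughly like the sketch 
below. `fetch_topic` is a hypothetical callable that runs the per-guid 
query and returns the decoded result. I assume here that it raises 
Python's built-in TimeoutError when the query does not finish in time; 
the actual exception will depend on the HTTP client you use.

```python
def dump_topics(guids, fetch_topic):
    """Run the per-guid query for each guid, skipping the ones that time out.

    `fetch_topic` is a hypothetical callable (guid -> decoded result).
    Assumption: it raises TimeoutError when the query does not finish in time.
    Returns (results, skipped): a dict of guid -> result, and a list of the
    guids that were skipped so they can be retried at a later date.
    """
    results = {}
    skipped = []
    for guid in guids:
        try:
            results[guid] = fetch_topic(guid)
        except TimeoutError:
            # Too many inbound/outbound properties; move on to the next guid.
            skipped.append(guid)
    return results, skipped
```

Keeping the skipped guids in a list means they can be retried once the 
system gets faster.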

Pulling down one topic at a time will be slow.  I recommend making full 
use of HTTP pipelining, and also that you set up multiple threads to 
query the system in parallel.  The latency should stay fairly constant, 
but you should be able to get better throughput with multiple 
simultaneous queries.
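
As a rough sketch of the parallel approach, a thread pool can keep 
several queries in flight at once. The worker count of 8 is an arbitrary 
guess, and `fetch_topic` is a hypothetical callable that runs the 
per-guid query. This does not show HTTP pipelining, which depends on the 
client library.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_parallel(guids, fetch_topic, max_workers=8):
    """Run `fetch_topic` for many guids using a pool of worker threads.

    `fetch_topic` is a hypothetical callable (guid -> decoded result);
    max_workers=8 is an arbitrary starting point, not a recommendation.
    Returns (results, failed): guid -> result, and guid -> exception for
    queries that raised (for example on a timeout).
    """
    results = {}
    failed = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything up front; the pool caps how many run at once.
        futures = {pool.submit(fetch_topic, guid): guid for guid in guids}
        for future in as_completed(futures):
            guid = futures[future]
            try:
                results[guid] = future.result()
            except Exception as exc:
                failed[guid] = exc
    return results, failed
```

Since the per-query latency stays roughly constant, throughput should 
scale with the number of workers until the server or the network becomes 
the bottleneck.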

We're working through a lot of these issues ourselves right now, so we 
don't yet have better answers on how to do this.  That said, we're open 
to making the data available, and are just trying to figure out the best 
way to do it.

Thanks
Colin Evans


Kavitha Srinivas wrote:
> Hello
>     We wrote some Javascript to read every instance of a topic using  
> cursors from sandbox.freebase.com.  We managed to read successfully a  
> few times from the cursor using this technique.  However, after that  
> we keep getting timeouts in the read.  We amended our script to (a)  
> try multiple times, which did not work because it appeared to get  
> stuck at the same point, (b) try using limits of 10 to get data, but  
> then this will simply not scale.  Any help is appreciated.
>
> Thanks!
> Kavitha
>
>
> On Oct 8, 2007, at 3:53 PM, Tim Kientzle wrote:
>
>   
>> If you can come up with a good way to do this
>> and performance really turns out to be a problem,
>> we *may* be able to host it internally and provide
>> the dump as a downloadable file.  (Have the dump
>> regenerated once a week or so, perhaps?)
>>
>> No promises, but it's a possibility.
>>
>> You should, of course, try to make it work
>> externally first.  Our internal connections
>> are faster, but may not be as much faster as you
>> think.
>>
>> TBKK
>>
>> Shawn Simister wrote:
>>     
>>> I'm eager to see how this turns out. Even with a cursor returning 100
>>> instances at a time it could take a while to get every instance in
>>> Freebase. Maybe there's some way that Metaweb could run your query
>>> locally and just let us download the JSON results as one file. Then
>>> we can convert those query results to RDF locally. You might also
>>> consider contacting the DBpedia.org folks. If I remember correctly,
>>> they were also interested in getting RDF dumps of the Freebase data.
>>>
>>> Shawn
>>>
>>> John Giannandrea wrote:
>>>       
>>>> Kavitha Srinivas wrote:
>>>>
>>>>         
>>>>> Here's what we tried -- we tried to get any instance of
>>>>> /common/topic and dump all of its links to other instances of
>>>>> /common/topic. When we try this with no explicit limits set, this
>>>>> gives us some randomly selected instances (within a default limit,
>>>>> which I guess is 100).
>>>>>
>>>>>           
>>>> To succeed at this you will need to use the "cursor" feature of MQL
>>>> documented here.
>>>>
>>>> http://www.freebase.com/view/helptopic?id=%239202a8c04000641f800000000544e139#cursors
>>>>
>>>> -jg
>>>>
>>>>
>>>> _______________________________________________
>>>> Developers mailing list
>>>> Developers at freebase.com
>>>> http://lists.freebase.com/mailman/listinfo/developers
>>>>
>>>>
>>>>         
