[Developers] Bulk download from sandbox.freebase.com
Tim Kientzle
tim at metaweb.com
Wed Oct 10 01:26:50 UTC 2007
Thank you! Thank you!
I've been looking for a good test case for the Parallel HTTP requester
I'm playing with. This fills my requirements quite nicely. ;-)
TBKK
Colin Evans wrote:
> Hi Kavitha,
> If you posted the script and queries that you're using, it would help me
> to try to debug the issues that you're encountering.
>
> The approach that we're currently using for dumping large chunks of the
> graph is to do a simple cursored query for all guids of topics, like this:
>
> ---
> {
> "cursor":true,
> "query":[{
> "guid":null,
> "limit":1000,
> "type":"/common/topic"
> }]
> }
> ---
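
A minimal Python sketch of how the cursored query above might be driven from page to page. It is not taken from the thread: `iter_topic_guids` and the `fetch` callable are hypothetical names, and the response shape (a "result" list plus a "cursor" value that becomes false on the last page) is an assumption about the read service.

```python
from typing import Callable, Iterator


def iter_topic_guids(fetch: Callable[[dict], dict],
                     page_size: int = 1000) -> Iterator[str]:
    """Yield every topic guid by following the cursor from page to page.

    `fetch` is a hypothetical callable that sends one query envelope to the
    read service and returns the decoded response envelope as a dict.
    """
    cursor = True  # True asks the service to open a new cursor
    while cursor:
        envelope = {
            "cursor": cursor,
            "query": [{
                "guid": None,
                "limit": page_size,
                "type": "/common/topic",
            }],
        }
        response = fetch(envelope)
        for row in response.get("result", []):
            yield row["guid"]
        # Assumption: the response carries the cursor for the next page,
        # and a false value means there are no more pages.
        cursor = response.get("cursor", False)
```

Passing the transport in as `fetch` keeps the paging logic separate from the HTTP code, so the loop can be exercised against canned responses before it is pointed at the sandbox.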
>
> ... and then run this query for each guid that we get back:
>
> ---
> {
> "query":[{
> "guid":<guid>,
> "/type/namespace/keys":[{
> "limit":1000,
> "link":{
> "creator":null,
> "timestamp":null
> },
> "namespace":null,
> "optional":true,
> "value":null
> }],
> "/type/object/key":[{
> "limit":1000,
> "link":{
> "creator":null,
> "timestamp":null
> },
> "namespace":null,
> "optional":true,
> "value":null
> }],
> "/type/reflect/any_master":[{
> "guid":null,
> "limit":1000,
> "link":{
> "creator":null,
> "master_property":null,
> "timestamp":null
> },
> "name":null,
> "optional":true,
> "type":[]
> }],
> "/type/reflect/any_reverse":[{
> "guid":null,
> "limit":1000,
> "link":{
> "creator":null,
> "master_property":null,
> "timestamp":null
> },
> "name":null,
> "optional":true,
> "type":[]
> }],
> "t:/type/reflect/any_value":[{
> "lang":null,
> "limit":1000,
> "link":{
> "creator":null,
> "master_property":null,
> "timestamp":null
> },
> "optional":true,
> "type":"/type/text",
> "value":null
> }],
> "v:/type/reflect/any_value":[{
> "limit":1000,
> "link":{
> "creator":null,
> "master_property":null,
> "timestamp":null
> },
> "optional":true,
> "type":null,
> "value":null
> }]
> }]
> }
> ---
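
Since the per-guid query above repeats the same few clauses, one way to fill in the `<guid>` placeholder is to build the envelope programmatically. The following is only a sketch: `build_topic_query` and its helper functions are hypothetical names, and the clause contents are copied from the query quoted above.

```python
import json


def _link(master_property: bool) -> dict:
    """Build a "link" clause; the two key clauses omit master_property."""
    clause = {"creator": None, "timestamp": None}
    if master_property:
        clause["master_property"] = None
    return clause


def _keys_clause() -> list:
    """Clause shared by /type/namespace/keys and /type/object/key."""
    return [{"limit": 1000, "link": _link(False), "namespace": None,
             "optional": True, "value": None}]


def _node_clause() -> list:
    """Clause shared by any_master and any_reverse."""
    return [{"guid": None, "limit": 1000, "link": _link(True),
             "name": None, "optional": True, "type": []}]


def _value_clause(value_type) -> list:
    """Clause for any_value; the /type/text variant also asks for lang."""
    clause = {"limit": 1000, "link": _link(True), "optional": True,
              "type": value_type, "value": None}
    if value_type == "/type/text":
        clause["lang"] = None
    return [clause]


def build_topic_query(guid: str) -> str:
    """Return the JSON envelope for one guid, ready to send."""
    query = {
        "guid": guid,
        "/type/namespace/keys": _keys_clause(),
        "/type/object/key": _keys_clause(),
        "/type/reflect/any_master": _node_clause(),
        "/type/reflect/any_reverse": _node_clause(),
        "t:/type/reflect/any_value": _value_clause("/type/text"),
        "v:/type/reflect/any_value": _value_clause(None),
    }
    return json.dumps({"query": [query]})
```

Going through `json.dumps` rather than string substitution means the guid is always quoted and escaped correctly.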
>
> The above query uses some not-formally-supported features in order to
> get every inbound and outbound property, so don't be surprised if we
> remove and replace some of the above structures as MQL progresses. Note
> that the above query will time out for some guids -- this is because
> there are too many inbound and outbound properties for the query to
> finish in time. At this point, I recommend that you skip those guids --
> the system should get faster with time.
>
> Pulling down one topic at a time will be slow. I recommend making full
> use of HTTP pipelining, and also that you set up multiple threads to
> query the system in parallel. The latency should stay fairly constant,
> but you should be able to get better throughput with multiple
> simultaneous queries.
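
A rough Python sketch of the multi-threaded part of that advice, which also skips guids whose query times out. It does not attempt HTTP pipelining. `fetch_topics` and `fetch_topic` are hypothetical names; `fetch_topic` is assumed to take one guid, return the decoded result, and raise `TimeoutError` when the query does not finish in time.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, List, Tuple


def fetch_topics(guids: Iterable[str],
                 fetch_topic: Callable[[str], dict],
                 workers: int = 8) -> Tuple[Dict[str, dict], List[str]]:
    """Query many guids in parallel.

    Returns (results keyed by guid, list of guids skipped after a timeout).
    """
    results: Dict[str, dict] = {}
    skipped: List[str] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to the guid it was submitted for.
        futures = {pool.submit(fetch_topic, guid): guid for guid in guids}
        for future in as_completed(futures):
            guid = futures[future]
            try:
                results[guid] = future.result()
            except TimeoutError:
                # Too many inbound/outbound properties; skip this guid.
                skipped.append(guid)
    return results, skipped
```

The worker count is a guess; since per-request latency is expected to stay roughly constant, throughput should scale with the number of simultaneous queries until the server or the network becomes the limit.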
>
> We're working through a lot of these issues ourselves right now, and so
> we don't currently have better answers on how to do this. Given that,
> we're open to making the data available, and are just trying to figure
> out the best way to do it.
>
> Thanks
> Colin Evans
>
>
> Kavitha Srinivas wrote:
>> Hello
>> We wrote some JavaScript to read every instance of a topic from
>> sandbox.freebase.com using cursors. We managed to read from the
>> cursor successfully a few times with this technique. However, after
>> that we keep getting timeouts on the read. We amended our script to
>> (a) retry multiple times, which did not work because it appeared to
>> get stuck at the same point, and (b) use limits of 10 to get data,
>> but that will simply not scale. Any help is appreciated.
>>
>> Thanks!
>> Kavitha
>>
>>
>> On Oct 8, 2007, at 3:53 PM, Tim Kientzle wrote:
>>
>>
>>> If you can come up with a good way to do this
>>> and performance really turns out to be a problem,
>>> we *may* be able to host it internally and provide
>>> the dump as a downloadable file. (Have the dump
>>> regenerated once a week or so, perhaps?)
>>>
>>> No promises, but it's a possibility.
>>>
>>> You should, of course, try to make it work
>>> externally first. Our internal connections
>>> are faster, but may not be as much faster as you
>>> think.
>>>
>>> TBKK
>>>
>>> Shawn Simister wrote:
>>>
>>>> I'm eager to see how this turns out. Even with a cursor returning 100
>>>> instances at a time it could take a while to get every instance in
>>>> Freebase. Maybe there's some way that Metaweb could run your query
>>>> locally and just let us download the JSON results as one file. Then
>>>> we can convert those query results to RDF locally. You might also
>>>> consider contacting the DBpedia.org folks. If I remember correctly,
>>>> they were also interested in getting RDF dumps of the Freebase data.
>>>>
>>>> Shawn
>>>>
>>>> John Giannandrea wrote:
>>>>
>>>>> Kavitha Srinivas wrote:
>>>>>
>>>>>
>>>>>> Here's what we tried -- we tried to get any instance of
>>>>>> /common/topic and dump all of its links to other instances of
>>>>>> /common/topic. When we try this with no explicit limits set, this
>>>>>> gives us some randomly selected instances (within a default limit,
>>>>>> which I guess is 100).
>>>>>>
>>>>>>
>>>>> To succeed at this you will need to use the "cursor" feature of MQL
>>>>> documented here.
>>>>>
>>>>> http://www.freebase.com/view/helptopic?id=%239202a8c04000641f800000000544e139#cursors
>>>>>
>>>>> -jg
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> Developers mailing list
>>>>> Developers at freebase.com
>>>>> http://lists.freebase.com/mailman/listinfo/developers
>>>>>
>>>>>
>>>>>