[Developers] Bulk download from sandbox.freebase.com
Colin Evans
colin at metaweb.com
Tue Oct 9 23:02:15 UTC 2007
Hi Kavitha,
If you post the script and queries that you're using, it would help me
debug the issues that you're encountering.
The approach that we're currently using for dumping large chunks of the
graph is to do a simple cursored query for all guids of topics, like this:
---
{
"cursor":true,
"query":[{
"guid":null,
"limit":1000,
"type":"/common/topic"
}]
}
---
... and then run this query for each guid that we get back (substituting
the guid, as a quoted string, for the <guid> placeholder):
---
{
"query":[{
"guid":<guid>,
"/type/namespace/keys":[{
"limit":1000,
"link":{
"creator":null,
"timestamp":null
},
"namespace":null,
"optional":true,
"value":null
}],
"/type/object/key":[{
"limit":1000,
"link":{
"creator":null,
"timestamp":null
},
"namespace":null,
"optional":true,
"value":null
}],
"/type/reflect/any_master":[{
"guid":null,
"limit":1000,
"link":{
"creator":null,
"master_property":null,
"timestamp":null
},
"name":null,
"optional":true,
"type":[]
}],
"/type/reflect/any_reverse":[{
"guid":null,
"limit":1000,
"link":{
"creator":null,
"master_property":null,
"timestamp":null
},
"name":null,
"optional":true,
"type":[]
}],
"t:/type/reflect/any_value":[{
"lang":null,
"limit":1000,
"link":{
"creator":null,
"master_property":null,
"timestamp":null
},
"optional":true,
"type":"/type/text",
"value":null
}],
"v:/type/reflect/any_value":[{
"limit":1000,
"link":{
"creator":null,
"master_property":null,
"timestamp":null
},
"optional":true,
"type":null,
"value":null
}]
}]
}
---
The above query uses some features that are not formally supported in
order to get every inbound and outbound property, so don't be surprised
if we remove and replace some of the above structures as MQL progresses.
Note that the above query will time out for some guids, because they
have too many inbound and outbound properties for the query to finish in
time. For now, I recommend that you skip those guids; the system should
get faster with time.
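A minimal Python sketch of the two steps above, offered as an illustration
under stated assumptions rather than as the actual dump script. The names
run_query and build_topic_query are hypothetical: run_query stands for
whatever function sends one query envelope to the read service and returns
the decoded JSON reply, and build_topic_query returns the large per-guid
envelope with <guid> filled in. The sketch also assumes that the reply
carries the matches under "result" and the next cursor under "cursor"
(false once the results are exhausted), and that a timeout surfaces as
Python's TimeoutError.

```python
import copy

# First-page envelope, copied from the cursored guid query above.
GUID_QUERY = {
    "cursor": True,
    "query": [{"guid": None, "limit": 1000, "type": "/common/topic"}],
}


def iter_topic_guids(run_query):
    """Yield every topic guid, following the cursor page by page.

    run_query is a hypothetical callable: envelope dict in, reply dict out.
    Assumed reply shape: {"result": [...], "cursor": <string or False>}.
    """
    envelope = copy.deepcopy(GUID_QUERY)
    while True:
        reply = run_query(envelope)
        for row in reply.get("result") or []:
            yield row["guid"]
        cursor = reply.get("cursor")
        if not cursor:  # assumed: a false or missing cursor means no more pages
            return
        envelope["cursor"] = cursor  # resume where the last page stopped


def dump_topics(run_query, build_topic_query):
    """Run the per-guid query for each topic, skipping guids that time out.

    build_topic_query is a hypothetical helper that returns the per-guid
    envelope for one guid.
    Returns (replies keyed by guid, list of skipped guids).
    """
    results, skipped = {}, []
    for guid in iter_topic_guids(run_query):
        try:
            results[guid] = run_query(build_topic_query(guid))
        except TimeoutError:
            skipped.append(guid)  # too many links to finish in time
    return results, skipped
```

Keeping the skipped guids in a list, instead of retrying them in place,
avoids getting stuck on the same topic and leaves a record of what to
retry later.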
Pulling down one topic at a time will be slow. I recommend making full
use of HTTP pipelining, and also that you set up multiple threads to
query the system in parallel. The latency should stay fairly constant,
but you should be able to get better throughput with multiple
simultaneous queries.
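The parallel part of this advice could look roughly like the following
Python sketch, which uses a thread pool from the standard library. It
covers only the multiple-threads suggestion, not HTTP pipelining.
fetch_topic is a hypothetical callable that runs the per-guid query for
one guid and is assumed to raise TimeoutError when the query does not
finish; the default of 8 workers is an arbitrary starting point, not a
recommended setting.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(guids, fetch_topic, workers=8):
    """Fetch many topics in parallel; return (replies by guid, skipped guids).

    fetch_topic is a hypothetical callable: guid in, decoded reply out.
    It is assumed to raise TimeoutError when the query does not finish.
    """
    def attempt(guid):
        try:
            return guid, fetch_topic(guid), True
        except TimeoutError:
            return guid, None, False

    results, skipped = {}, []
    # Each worker blocks on its own request, so per-query latency is
    # unchanged while up to `workers` queries are in flight at once.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for guid, reply, ok in pool.map(attempt, guids):
            if ok:
                results[guid] = reply
            else:
                skipped.append(guid)
    return results, skipped
```

Threads suit this workload because each query spends most of its time
waiting on the network, so the waits overlap even though no single query
gets faster.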
We're working through a lot of these issues ourselves right now, so we
don't yet have better answers on how to do this. That said, we're open
to making the data available, and are just trying to figure out the best
way to do it.
Thanks
Colin Evans
Kavitha Srinivas wrote:
> Hello
> We wrote some Javascript to read every instance of a topic using
> cursors from sandbox.freebase.com. We managed to read successfully a
> few times from the cursor using this technique. However, after that
> we keep getting timeouts in the read. We amended our script to (a)
> try multiple times, which did not work because it appeared to get
> stuck at the same point, (b) try using limits of 10 to get data, but
> then this will simply not scale. Any help is appreciated.
>
> Thanks!
> Kavitha
>
>
> On Oct 8, 2007, at 3:53 PM, Tim Kientzle wrote:
>
>
>> If you can come up with a good way to do this
>> and performance really turns out to be a problem,
>> we *may* be able to host it internally and provide
>> the dump as a downloadable file. (Have the dump
>> regenerated once a week or so, perhaps?)
>>
>> No promises, but it's a possibility.
>>
>> You should, of course, try to make it work
>> externally first. Our internal connections
>> are faster, but may not be as much faster as you
>> think.
>>
>> TBKK
>>
>> Shawn Simister wrote:
>>
>>> I'm eager to see how this turns out. Even with a cursor returning 100
>>> instances at a time it could take a while to get every instance in
>>> Freebase. Maybe there's some way that Metaweb could run your query
>>> locally and just let us download the JSON results as one file. Then we
>>> can convert those query results to RDF locally. You might also consider
>>> contacting the DBpedia.org folks. If I remember correctly, they were
>>> also interested in getting RDF dumps of the Freebase data.
>>>
>>> Shawn
>>>
>>> John Giannandrea wrote:
>>>
>>>> Kavitha Srinivas wrote:
>>>>
>>>>
>>>>> Here's what we tried -- we tried to get any instance of /common/topic
>>>>> and dump all of its links to other instances of /common/topic. When
>>>>> we try this with no explicit limits set, this gives us some randomly
>>>>> selected instances (within a default limit, which I guess is 100).
>>>>>
>>>>>
>>>> To succeed at this you will need to use the "cursor" feature of MQL
>>>> documented here.
>>>>
>>>> http://www.freebase.com/view/helptopic?id=%239202a8c04000641f800000000544e139#cursors
>>>>
>>>> -jg
>>>>
>>>>
>>>> _______________________________________________
>>>> Developers mailing list
>>>> Developers at freebase.com
>>>> http://lists.freebase.com/mailman/listinfo/developers