[Developers] Bulk download from sandbox.freebase.com
Kurt Bollacker
kurt at metaweb.com
Wed Oct 10 04:04:09 UTC 2007
On Tue, Oct 09, 2007 at 10:39:45PM -0400, Kavitha Srinivas wrote:
> Hi Tim
> Thanks for offering to help us. The key issue might be timeouts
> on objects with many properties. We played with different limits in
> this query, but with no luck.
> Here's the query file we used, as a sample.
There are only a few cases of topics with too many properties to ask
for in a single query:
- Getting the instances of common types
- Getting the keys of large namespaces
- Getting all of the properties of literal languages (e.g. /lang/en)
While there are ways to use cursors to get these, there is a simpler
solution. Since you are planning a full crawl, (almost) all property
instances will be accessed twice (once from each direction) with the
query that Colin sent. You can speed up your crawl and likely avoid
(most of) the timeouts by only asking for the property instances once
each, as in:
{
  "guid":TOPIC_GUID_GOES_HERE,
  "/type/reflect/any_master":[{
    "optional":true,
    "limit":1000,
    "guid":null,
    "link":{"creator":null,"timestamp":null,"master_property":null},
    "name":null,
    "type":[]
  }],
  "v:/type/reflect/any_value":[{
    "optional":true,
    "limit":1000,
    "link":{"creator":null,"timestamp":null,"master_property":null},
    "type":null,
    "value":null
  }],
  "/type/object/key":[{
    "optional":true,
    "limit":1000,
    "link":{"creator":null,"timestamp":null},
    "value":null,
    "namespace":null
  }]
}
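As a rough sketch, the query above could be filled in per topic and turned
into a request URL along these lines. The function names buildTopicQuery and
composeTopicQueryURL and the sample guid are made up for illustration; the
endpoint and the {"qname":{"query":...}} envelope are taken from the script
quoted below. encodeURIComponent is used here instead of encodeURI so that
characters such as "#" in a guid are escaped.

```javascript
// Endpoint as used in the script quoted later in this thread.
var mqlreadURI = "http://sandbox.freebase.com/api/service/mqlread";

// Hypothetical helper: build the one-direction query for a single topic guid.
function buildTopicQuery(guid) {
  return {
    "guid": guid,
    "/type/reflect/any_master": [{
      "optional": true,
      "limit": 1000,
      "guid": null,
      "link": {"creator": null, "timestamp": null, "master_property": null},
      "name": null,
      "type": []
    }],
    "v:/type/reflect/any_value": [{
      "optional": true,
      "limit": 1000,
      "link": {"creator": null, "timestamp": null, "master_property": null},
      "type": null,
      "value": null
    }],
    "/type/object/key": [{
      "optional": true,
      "limit": 1000,
      "link": {"creator": null, "timestamp": null},
      "value": null,
      "namespace": null
    }]
  };
}

// Hypothetical helper: wrap the query in the "qname" envelope and escape it.
function composeTopicQueryURL(guid) {
  var envelope = {"qname": {"query": buildTopicQuery(guid)}};
  return mqlreadURI + "?queries=" + encodeURIComponent(JSON.stringify(envelope));
}
```

For example, composeTopicQueryURL("#0123") (a made-up guid) gives a URL whose
"queries" parameter decodes back to the envelope with that guid filled in.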
After the crawl, you will have to go back and reattach the links from
the other direction (the opposite side will be named in the "guid" for
"any_master" and "namespace" for "key"s).
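The reattachment step might look something like the following. This is only a
sketch under assumptions: it supposes the crawl results have been collected
into an array of per-topic result objects shaped like the query above, and the
name reattachReverseLinks and the two index names are invented for the example.

```javascript
// Hypothetical helper: build the reverse-direction indexes after the crawl.
// "topics" is assumed to be an array of per-topic results shaped like the
// one-direction query (guid, any_master entries, key entries).
function reattachReverseLinks(topics) {
  var inbound = Object.create(null);       // target guid -> [{from, property}]
  var namespaceKeys = Object.create(null); // namespace -> [{value, topic}]

  topics.forEach(function (topic) {
    // The opposite side of each master link is named in its "guid".
    (topic["/type/reflect/any_master"] || []).forEach(function (m) {
      if (!inbound[m.guid]) { inbound[m.guid] = []; }
      inbound[m.guid].push({from: topic.guid, property: m.link.master_property});
    });
    // The opposite side of each key is named in its "namespace".
    (topic["/type/object/key"] || []).forEach(function (k) {
      if (!namespaceKeys[k.namespace]) { namespaceKeys[k.namespace] = []; }
      namespaceKeys[k.namespace].push({value: k.value, topic: topic.guid});
    });
  });

  return {inbound: inbound, namespaceKeys: namespaceKeys};
}
```

The "|| []" guards are there because the clauses are marked optional, so a
topic may come back without any entries for them.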
Hope this helps..... Kurt :-)
> Kavitha
> var readEverythingQuery = "[{\"/type/reflect/any_master\":[{\"link\":{\"master_property\":{\"name\":null}},\"name\":null,\"type\":\"/common/topic\"}],\"name\":null,\"type\":\"/common/topic\"}]";
>
> var mqlreadURI = "http://sandbox.freebase.com/api/service/mqlread";
>
> function composeQueryURL(query, cursor) {
>   return encodeURI(mqlreadURI + "?queries={\"qname\":{\"query\":" + query + ",\"cursor\":" + cursor + "}}");
> }
>
> var limit = 5;
>
> var cursor = "true";
>
> while (limit > 0) {
>   var url = composeQueryURL(readEverythingQuery, cursor);
>
>   var resultStr = readUrl( url );
>
>   print(resultStr);
>
>   var result = eval( '(' + resultStr + ')' );
>
>   print(result);
>
>   cursor = '"' + result.qname.cursor + '"';
>
>   print(cursor);
>
>   var objects = result.qname.result;
>   for(var i = 0; i < objects.length; i++) {
>     var object = objects[i];
>     print(object.name);
>     var links = object["/type/reflect/any_master"];
>     for(var j = 0; j < links.length; j++) {
>       print(" -- " + links[j].link.master_property.name + " --> " + links[j].name);
>     }
>   }
>
>   limit--;
> }
>
>
> On Oct 9, 2007, at 9:26 PM, Tim Kientzle wrote:
>
> >Thank you! Thank you!
> >
> >I've been looking for a good test case for the Parallel HTTP requester
> >I'm playing with. This fills my requirements quite nicely. ;-)
> >
> >TBKK
> >
> >Colin Evans wrote:
> >>Hi Kavitha,
> >>If you posted the script and queries that you're using, it would help me
> >>to try to debug the issues that you're encountering.
> >>
> >>The approach that we're currently using for dumping large chunks of the
> >>graph is to do a simple cursored query for all guids of topics, like this:
> >>
> >>---
> >>{
> >>  "cursor":true,
> >>  "query":[{
> >>    "guid":null,
> >>    "limit":1000,
> >>    "type":"/common/topic"
> >>  }]
> >>}
> >>---
> >>
> >>.. and then run this query for each guid that we get back:
> >>
> >>---
> >>{
> >>  "query":[{
> >>    "guid":<guid>,
> >>    "/type/namespace/keys":[{
> >>      "limit":1000,
> >>      "link":{
> >>        "creator":null,
> >>        "timestamp":null
> >>      },
> >>      "namespace":null,
> >>      "optional":true,
> >>      "value":null
> >>    }],
> >>    "/type/object/key":[{
> >>      "limit":1000,
> >>      "link":{
> >>        "creator":null,
> >>        "timestamp":null
> >>      },
> >>      "namespace":null,
> >>      "optional":true,
> >>      "value":null
> >>    }],
> >>    "/type/reflect/any_master":[{
> >>      "guid":null,
> >>      "limit":1000,
> >>      "link":{
> >>        "creator":null,
> >>        "master_property":null,
> >>        "timestamp":null
> >>      },
> >>      "name":null,
> >>      "optional":true,
> >>      "type":[]
> >>    }],
> >>    "/type/reflect/any_reverse":[{
> >>      "guid":null,
> >>      "limit":1000,
> >>      "link":{
> >>        "creator":null,
> >>        "master_property":null,
> >>        "timestamp":null
> >>      },
> >>      "name":null,
> >>      "optional":true,
> >>      "type":[]
> >>    }],
> >>    "t:/type/reflect/any_value":[{
> >>      "lang":null,
> >>      "limit":1000,
> >>      "link":{
> >>        "creator":null,
> >>        "master_property":null,
> >>        "timestamp":null
> >>      },
> >>      "optional":true,
> >>      "type":"/type/text",
> >>      "value":null
> >>    }],
> >>    "v:/type/reflect/any_value":[{
> >>      "limit":1000,
> >>      "link":{
> >>        "creator":null,
> >>        "master_property":null,
> >>        "timestamp":null
> >>      },
> >>      "optional":true,
> >>      "type":null,
> >>      "value":null
> >>    }]
> >>  }]
> >>}
> >>---
> >>
> >>The above query uses some not-formally-supported features in order to
> >>get every inbound and outbound property, so don't be surprised if we
> >>remove and replace some of the above structures as MQL progresses. Note
> >>that the above query will time out for some guids -- this is because
> >>there are too many inbound and outbound properties for the query to
> >>finish in time. At this point, I recommend that you skip those guids --
> >>the system should get faster with time.
> >>
> >>Pulling down one topic at a time will be slow. I recommend making full
> >>use of HTTP pipelining, and also that you set up multiple threads to
> >>query the system in parallel. The latency should stay fairly constant,
> >>but you should be able to get better throughput with multiple
> >>simultaneous queries.
> >>
> >>We're working through a lot of these issues ourselves right now, and so
> >>we don't currently have better answers on how to do this. Given that,
> >>we're open to making the data available, and are just trying to figure
> >>out the best way to do it.
> >>
> >>Thanks
> >>Colin Evans
> >>
> >>
> >>Kavitha Srinivas wrote:
> >>>Hello
> >>> We wrote some Javascript to read every instance of a topic using
> >>>cursors from sandbox.freebase.com. We managed to read successfully a
> >>>few times from the cursor using this technique. However, after that
> >>>we keep getting timeouts in the read. We amended our script to (a)
> >>>try multiple times, which did not work because it appeared to get
> >>>stuck at the same point, (b) try using limits of 10 to get data, but
> >>>then this will simply not scale. Any help is appreciated.
> >>>
> >>>Thanks!
> >>>Kavitha
> >>>
> >>>
> >>>On Oct 8, 2007, at 3:53 PM, Tim Kientzle wrote:
> >>>
> >>>
> >>>>If you can come up with a good way to do this
> >>>>and performance really turns out to be a problem,
> >>>>we *may* be able to host it internally and provide
> >>>>the dump as a downloadable file. (Have the dump
> >>>>regenerated once a week or so, perhaps?)
> >>>>
> >>>>No promises, but it's a possibility.
> >>>>
> >>>>You should, of course, try to make it work
> >>>>externally first. Our internal connections
> >>>>are faster, but may not be as much faster as you
> >>>>think.
> >>>>
> >>>>TBKK
> >>>>
> >>>>Shawn Simister wrote:
> >>>>
> >>>>>I'm eager to see how this turns out. Even with a cursor returning 100
> >>>>>instances at a time it could take a while to get every instance in
> >>>>>Freebase. Maybe there's some way that Metaweb could run your query
> >>>>>locally and just let us download the JSON results as one file. Then we
> >>>>>can convert those query results to RDF locally. You might also consider
> >>>>>contacting the DBpedia.org folks. If I remember correctly, they were
> >>>>>also interested in getting RDF dumps of the Freebase data.
> >>>>>
> >>>>>Shawn
> >>>>>
> >>>>>John Giannandrea wrote:
> >>>>>
> >>>>>>Kavitha Srinivas wrote:
> >>>>>>
> >>>>>>
> >>>>>>>Here's what we tried -- we tried to get any instance of /common/topic
> >>>>>>>and dump all of its links to other instances of /common/topic. When
> >>>>>>>we try this with no explicit limits set, this gives us some randomly
> >>>>>>>selected instances (within a default limit, which I guess is 100).
> >>>>>>>
> >>>>>>>
> >>>>>>To succeed at this you will need to use the "cursor" feature of MQL
> >>>>>>documented here.
> >>>>>>
> >>>>>>http://www.freebase.com/view/helptopic?id=%239202a8c04000641f800000000544e139#cursors
> >>>>>>
> >>>>>>-jg
> >>>>>>
> >>>>>>
> >>>>>>_______________________________________________
> >>>>>>Developers mailing list
> >>>>>>Developers at freebase.com
> >>>>>>http://lists.freebase.com/mailman/listinfo/developers
> >>>>>>
> >>>>>>
> >>>>>>