[Developers] Looping behavior with cursors and (very) large data sets?

Will Fitzgerald will at powerset.com
Tue Jun 10 21:05:13 UTC 2008


At Powerset, to gather the data we need to calculate a lot of named-entity
information, I've been trying to execute this query with a cursor (the short
URL below opens it in the MQL editor):

http://is.gd/uzz


[
  {
    "/common/topic/alias" : [],
    "/people/person/gender" : null,
    "a:type|=" : [
      "/business/company",
      "/location/us_state",
      "/location/province",
      "/location/country",
      "/people/person",
      "/education/education",
      "/military/military_conflict",
      "/location/citytown",
      "/medicine/disease",
      "/sports/sports_team",
      "/government/political_party"
    ],
    "guid" : null,
    "id" : null,
    "key" : [
      {
        "namespace" : null,
        "optional" : true,
        "value" : null
      }
    ],
    "name" : [],
    "type" : []
  }
]

Apparently, this query eventually loops, because after a while I start
getting duplicates. In one run, the cycle length seems to be 280,812. At some
point, a monitoring process on the Metaweb side kills off the query. I have
reason to believe that the query should return about 1,000,742 unique results.
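
One way to notice the loop early, rather than waiting for the query to be killed, is to track the guids already seen and stop at the first repeat. The following is only a minimal sketch under stated assumptions: `fetch_page` is a hypothetical callable (not part of any Metaweb client library) that takes the previous cursor, or `None` for the first page, and returns a list of result rows plus the next cursor, with a falsy cursor meaning there are no more pages.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# Hypothetical page shape: (rows, next_cursor). A falsy next_cursor
# is assumed to mean "no more pages".
Page = Tuple[List[Dict], Optional[str]]


def iterate_unique(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[Dict]:
    """Yield rows page by page, stopping at the first repeated guid."""
    seen = set()
    cursor = None
    while True:
        rows, cursor = fetch_page(cursor)
        for row in rows:
            guid = row["guid"]
            if guid in seen:
                # Assumption: a repeated guid means the cursor has cycled.
                return
            seen.add(guid)
            yield row
        if not cursor:
            # Normal termination: the server reported no further pages.
            return
```

Stopping at the first repeat limits the extra load on the server to at most one page beyond the start of the cycle, at the cost of holding every guid seen so far in memory.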

Obviously, this isn't a query we execute very often, but it certainly was
surprising to see the cycle occur. I've coded around it by returning to an
older method we used -- downloading all guids for a type, and then merging;
but I thought the query above would result in less traffic -- how wrong I
was!
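
The merge step of that older method can be sketched roughly as follows. This is an illustration only, not the code we actually run: it assumes the guids for each type have already been downloaded into a mapping from type id to a collection of guids, and the function name and the sample guids in the usage lines are made up.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def merge_guids_by_type(guids_by_type: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Invert {type: guids} into {guid: set of types} so that each
    entity appears once, however many of the types it carries."""
    merged: Dict[str, Set[str]] = defaultdict(set)
    for type_id, guids in guids_by_type.items():
        for guid in guids:
            merged[guid].add(type_id)
    return dict(merged)


# Usage with made-up guids: "#1" carries both types and is counted once.
sample = {
    "/people/person": ["#1", "#2"],
    "/business/company": ["#1", "#3"],
}
merged = merge_guids_by_type(sample)
print(len(merged), "unique entities")
```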

Is there any obvious thing I'm doing wrong in the query above that would
cause a loop? Do you think it's a problem with the MQL handler? I'd be glad
to investigate further, but would rather not pound the Metaweb servers.

Will Fitzgerald
Sr. Scientist
Powerset


