[Developers] Entities referenced to "/user/tsegaran/"

Vijay Ramesh vijay.ramesh at powerset.com
Tue Aug 12 16:19:15 UTC 2008


Hello all,

We ran into a strange issue where we weren't picking up certain Freebase entities from a multi-step parse of both "freebase-datadump-quadruples.tsv" and the type-specific data tsvs. We eventually tracked down the "what", although I am still very confused as to the "why".  The issue, in short, is with a set of 227 entities (see the attached txt file) that are all of type /business/company or /business/industry.  For this set of entities, the way they are referenced in "freebase-datadump-quadruples.tsv" (I've checked both the March and July 2008 data) is different from the way they are referenced in the type-specific data dumps (here, /business/company.tsv & /business/industry.tsv).
To illustrate with an example, first the MQL result:
[
  {
    "guid" : "#9202a8c04000641f800000000017cb7a",
    "id" : "/business/cik/0000019617",
    "name" : "JPMorgan Chase & Co."
  }
]

This entity in freebase-datadump-quadruples.tsv:
/business/cik   /type/namespace/keys    /user/tsegaran/sec/0000019617   0000019617
...
/user/tsegaran/sec/0000019617   /business/company/founded               2000
/user/tsegaran/sec/0000019617   /business/company/headquarters  /guid/9202a8c04000641f8000000007249da1
/user/tsegaran/sec/0000019617   /common/topic/alias     /lang/en        JP Morgan Chase
/user/tsegaran/sec/0000019617   /common/topic/alias     /lang/en        JPMorgan Chase
/user/tsegaran/sec/0000019617   /type/object/name       /lang/en        JPMorgan Chase & Co.
...
/wikipedia/en   /type/namespace/keys    /user/tsegaran/sec/0000019617   J$002EP$002E_MorganChase
/wikipedia/en   /type/namespace/keys    /user/tsegaran/sec/0000019617   J$002EP$002E_Morgan_$0026_Co$002E
etc...

& in /business/company.tsv:
JPMorgan Chase & Co.    /business/cik/0000019617                                                                /guid/9202a8c04000641f8000000007249da1,/guid/9202a8c04000641f8000000007a29a83                                         Xign                    2000                                            National Commercial Banks       /guid/9202a8c04000641f8000000003ce7778,/guid/9202a8c04000641f8000000005c91fa8

The problem is that (only for this set of 227 entities, all of which seem to be related to /user/tsegaran, although the actual Freebase pages don't seem to have been created by him) the entity is referenced by "/user/tsegaran/sec/0000019617" in the quadruples tsv, and by its "normal" fid in the type-specific tsv (and in the MQL result). Even stranger, I can get the following by MQL query:

Query:
[
  {
    "guid" : null,
    "id" : "/user/tsegaran/sec/0000019617",
    "name" : null
  }
]

Result:
[
  {
    "guid" : "#9202a8c04000641f800000000017cb7a",
    "id" : "/user/tsegaran/sec/0000019617",
    "name" : "JPMorgan Chase & Co."
  }
]

But a query for all ids of that guid does not return the "/user/tsegaran/***" id:

Query:
[
  {
    "guid" : "#9202a8c04000641f800000000017cb7a",
    "id" : [],
    "name" : null
  }
]

Result:
[
  {
    "guid" : "#9202a8c04000641f800000000017cb7a",
    "id" : [
      "/business/cik/0000019617"
    ],
    "name" : "JPMorgan Chase & Co."
  }
]

Now this might actually be a feature rather than a bug, but it seems strangely inconsistent with how the vast majority of Freebase entities are referenced in the big quadruples tsv and the type-specific tsvs.  It is pretty apparent that this line - /business/cik   /type/namespace/keys    /user/tsegaran/sec/0000019617   0000019617 - in the quadruples tsv is where the "magic" is: it links "/user/tsegaran/sec/0000019617" to "/business/cik/0000019617" by registering the key 0000019617 in the /business/cik namespace. My question still remains, though - why are these 227 entities treated differently from the rest of Freebase? They don't seem to be "user-created" or anything like that, and as the MQL queries above seem to indicate, "/user/tsegaran/sec/0000019617" does not appear to be the "primary" fid (with "/business/cik/0000019617" serving as an alias of sorts) - and yet the big quadruples tsv treats it as such.

The solution on our end, in terms of using this data, is simple enough, but it seems like an extra and unwarranted step that might be avoided if this strange mode of referencing is indeed a bug and is subsequently cleared up. Any information would be appreciated (in particular, whether this is how the data is supposed to be, so that we need to work around it, or whether this is the result of a hitherto unknown mistake that will be corrected).
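
For anyone hitting the same thing, here is a minimal sketch of the kind of workaround I mean. It is not our actual code, and it rests on assumptions: each row of the quadruples tsv has exactly four tab-separated columns (source, property, destination, value), and a /type/namespace/keys row means "namespace + '/' + value" is another id for the destination. The function names and the in-memory SAMPLE rows are made up for illustration.

```python
KEYS_PROPERTY = "/type/namespace/keys"


def read_quads(lines):
    """Yield (source, property, destination, value) from tab-separated lines.

    Assumption: every row has exactly four columns, some possibly empty.
    Rows with any other column count are skipped.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 4:
            yield tuple(fields)


def build_alias_map(quads, namespaces):
    """Map each id that receives a key in one of `namespaces` to
    '<namespace>/<key>', e.g.
    '/user/tsegaran/sec/0000019617' -> '/business/cik/0000019617'.
    """
    alias = {}
    for source, prop, destination, value in quads:
        if prop == KEYS_PROPERTY and source in namespaces and destination and value:
            # Keep the first key seen if an entity has several in the namespace.
            alias.setdefault(destination, source + "/" + value)
    return alias


def canonicalize(quads, alias):
    """Rewrite source and destination ids through the alias map."""
    for source, prop, destination, value in quads:
        yield (alias.get(source, source), prop,
               alias.get(destination, destination), value)


# Hypothetical sample rows modelled on the excerpt above (real tabs assumed).
SAMPLE = [
    "\t".join(["/business/cik", KEYS_PROPERTY,
               "/user/tsegaran/sec/0000019617", "0000019617"]),
    "\t".join(["/user/tsegaran/sec/0000019617",
               "/business/company/founded", "", "2000"]),
    "\t".join(["/user/tsegaran/sec/0000019617", "/type/object/name",
               "/lang/en", "JPMorgan Chase & Co."]),
]

quads = list(read_quads(SAMPLE))
alias = build_alias_map(quads, {"/business/cik"})
for row in canonicalize(quads, alias):
    print("\t".join(row))
```

On the real dump this would presumably be done in two passes over the file (one to collect the key rows, one to rewrite), since the quadruples tsv is too large to hold in a list the way the sample is here.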

Thanks,
 Vijay Krishna Ramesh <vijay.ramesh at powerset.com>
