[Developers] Dbpedia types vs FB types
Paul Houle
paul at ontology2.com
Fri Jul 31 19:35:24 UTC 2009
Paul Houle wrote:
> I'm looking at my sample some more. Here's the distribution of
> toplevel types from the dbpedia ontology
>
> +-----------------------------------+----------+
> | type | count(*) |
> +-----------------------------------+----------+
> | SupremeCourtOfTheUnitedStatesCase | 3 |
> | Website | 4 |
> | Event | 21 |
> | Infrastructure | 47 |
> | Work | 525 |
> | Organisation | 649 |
> | Place | 712 |
> | Person | 2208 |
> | NULL | 6961 |
> +-----------------------------------+----------+
>
>
I used the new simplified dump from Metaweb to do the same thing
with Freebase. Lacking a proper schema dump, I simply assumed that the
toplevel type of a topic is the most prevalent of the types (other than
/common/topic) that apply to it:
+---------------------------------------------------+----------+
| url | count(*) |
+---------------------------------------------------+----------+
| /people/person | 4066 |
| NULL | 3756 |
| /location/location | 1211 |
| /business/employer | 827 |
| /film/film | 427 |
| /projects/project_focus | 268 |
| /time/event | 46 |
| /organization/organization | 46 |
| /transportation/road | 44 |
| /architecture/museum | 41 |
| /broadcast/broadcast | 40 |
| /music/artist | 33 |
| /time/recurring_event | 30 |
| /music/album | 27 |
| /book/written_work | 25 |
| /book/periodical | 22 |
| /education/educational_institution | 14 |
| /base/dance/topic | 12 |
| /business/business_location | 11 |
| /tv/tv_program | 11 |
| /sports/sports_team | 9 |
| /boats/ship | 9 |
| /metropolitan_transit/transit_line | 7 |
| /base/amusementparks/topic | 7 |
| /business/company | 7 |
| /book/author | 6 |
| /visual_art/artwork | 5 |
| /user/robert/area_codes/topic | 5 |
| /book/book_subject | 5 |
| /food/dish | 4 |
| /architecture/structure | 4 |
| /transportation/bridge | 4 |
| /business/shopping_center | 4 |
| /sports/sports_facility | 3 |
| /film/film_location | 3 |
| /medicine/hospital | 3 |
| /music/genre | 3 |
| /award/award | 3 |
| /music/composition | 3 |
| /award/award_winner | 3 |
| /protected_sites/protected_site | 3 |
| /award/award_category | 2 |
| /government/government_agency | 2 |
| /tv/tv_network | 2 |
| /base/disaster2/topic | 2 |
| /user/skud/legal/topic | 2 |
| /education/school | 2 |
| /internet/website | 2 |
| /base/dance/dance_company | 2 |
| /government/governmental_body | 2 |
| /architecture/landscape_project | 2 |
| /biology/organism | 2 |
| /geography/body_of_water | 2 |
| /theater/theater_company | 2 |
| /book/school_or_movement | 2 |
| /user/skud/names/namesake | 2 |
| /military/armed_force | 1 |
| /projects/project | 1 |
| /user/iubookgirl/default_domain/academic_library | 1 |
| /geography/island | 1 |
| /influence/influence_node | 1 |
| /base/fblinux/topic | 1 |
| /film/writer | 1 |
| /user/rcheramy/default_domain/nickname | 1 |
| /award/award_presenting_organization | 1 |
| /architecture/unrealized_design | 1 |
| /base/americancomedy/comedy_venue | 1 |
| /base/collectives/topic | 1 |
| /games/game | 1 |
| /broadcast/radio_station | 1 |
| /cvg/cvg_developer | 1 |
| /base/omgfun/festival_series | 1 |
| /award/award_nominee | 1 |
| /user/petroleumj/default_domain/subway_station | 1 |
| /business/job_title | 1 |
| /user/skud/flags/topic | 1 |
| /visual_art/art_subject | 1 |
| /user/tsegaran/random/topic | 1 |
| /book/magazine | 1 |
| /user/techgnostic/default_domain/periodical | 1 |
| /food/brewery_brand_of_beer | 1 |
| /geography/bay | 1 |
| /metropolitan_transit/transit_system | 1 |
| /internet/website_owner | 1 |
| /visual_art/art_owner | 1 |
| /computer/software_developer | 1 |
| /fictional_universe/fictional_character_creator | 1 |
| /venture_capital/venture_investor | 1 |
| /base/omgfun/topic | 1 |
| /award/hall_of_fame | 1 |
| /base/exhibitions/topic | 1 |
| /base/symbols/topic | 1 |
| /architecture/architectural_structure_owner | 1 |
| /aviation/airliner_accident | 1 |
| /guid/9202a8c04000641f800000000af896ba | 1 |
| /user/guidewire/default_domain/online_music_store | 1 |
| /library/public_library_system | 1 |
| /user/gogza/default_domain/recurring_event | 1 |
| /base/americancomedy/topic | 1 |
+---------------------------------------------------+----------+
(Note that this is over a list of about 11k topics whose classification
I'm working to improve before I feed it into the next stage of my
production pipeline.)
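
For anyone who wants to try the same heuristic, here is a minimal sketch of how it could be written. It is not the query I actually ran: the dict-of-lists input, the function name and the sample topics are made up for illustration, and it assumes "most prevalent" means "carried by the most topics in the sample".

```python
from collections import Counter

COMMON_TOPIC = "/common/topic"


def toplevel_types(topic_types):
    """Map each topic to its most prevalent type, or None if it has none.

    topic_types: dict of topic id -> iterable of Freebase type ids
    (a hypothetical in-memory form of the dump).
    """
    # Count how many topics carry each type, ignoring /common/topic.
    prevalence = Counter(
        t
        for types in topic_types.values()
        for t in set(types)
        if t != COMMON_TOPIC
    )
    result = {}
    for topic, types in topic_types.items():
        candidates = [t for t in set(types) if t != COMMON_TOPIC]
        # Pick the most prevalent candidate; ties are broken alphabetically
        # so the output is deterministic (an assumption, not from the post).
        result[topic] = (
            min(candidates, key=lambda t: (-prevalence[t], t))
            if candidates
            else None
        )
    return result


# Made-up sample topics, for illustration only.
sample = {
    "topic_a": ["/common/topic", "/people/person", "/book/author"],
    "topic_b": ["/common/topic", "/people/person"],
    "topic_c": ["/common/topic"],
}
tops = toplevel_types(sample)
distribution = Counter(tops.values())  # the analogue of the count(*) column
```

A topic that only has /common/topic ends up as None, which corresponds to the NULL row in the table.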
Freebase has types for about twice as many people as DBpedia (4066 vs.
2208) and about half as many untyped topics (3756 vs. 6961). The Freebase
"toplevels" I'm generating are completely uncontrolled, so you get some
strange ones towards the bottom. The "prevalence" filter has gotten rid of
a large number of references to certain common junk types, such as the
"Jungle" type that you find all over the place in Freebase.
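
The "about twice" and "about half" figures can be checked directly against the two tables. A small sketch using only the Person and NULL counts shown above (the variable names are just for illustration):

```python
# Counts copied from the two tables in this message.
dbpedia = {"Person": 2208, "untyped": 6961}
freebase = {"Person": 4066, "untyped": 3756}

person_ratio = freebase["Person"] / dbpedia["Person"]
untyped_ratio = freebase["untyped"] / dbpedia["untyped"]

print(f"people typed, Freebase/DBpedia: {person_ratio:.2f}")  # 1.84
print(f"untyped, Freebase/DBpedia: {untyped_ratio:.2f}")      # 0.54
```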
Note that the URL structure of "commons" types on FB tends to be
/{problem_domain}/{type}
so you tend to see things like "/book/author" where there is no
inheritance relation between book and author. You also see "/base/..."
types and "/user/..." types, which represent namespaces inside FB.
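
As a rough illustration of that structure, the sketch below splits a type id into its parts. The function name and the returned field names are hypothetical, and the /base/{domain}/{type} and /user/{username}/{domain}/{type} layouts are inferred from the rows in the table above rather than taken from any Freebase schema documentation.

```python
def describe_type(type_id):
    """Split a Freebase type id into namespace, owner, domain and type.

    The layouts handled here are assumptions based on the ids seen in
    the table, not an official description of Freebase ids.
    """
    parts = type_id.strip("/").split("/")
    if parts[0] == "user" and len(parts) == 4:
        # /user/{username}/{domain}/{type}
        return {"namespace": "user", "owner": parts[1],
                "domain": parts[2], "type": parts[3]}
    if parts[0] == "base" and len(parts) == 3:
        # /base/{domain}/{type}
        return {"namespace": "base", "owner": None,
                "domain": parts[1], "type": parts[2]}
    if len(parts) == 2 and parts[0] != "guid":
        # /{problem_domain}/{type}, e.g. /book/author
        return {"namespace": "commons", "owner": None,
                "domain": parts[0], "type": parts[1]}
    # Anything else (e.g. the raw /guid/... row) is left unclassified.
    return {"namespace": "other", "owner": None,
            "domain": None, "type": type_id}


info = describe_type("/book/author")
# "book" is only the problem domain; it says nothing about inheritance.
```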
I'm going to look at the double-untyped topics (those with no type in
either dataset) a bit more and also merge the FB types into the DBpedia
toplevels.
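
One possible shape for that merge is sketched below. Nothing here is settled: the mapping table is a hypothetical handful of pairings chosen for illustration, and letting the DBpedia type win when both sources have one is an assumption, not a decision I've made.

```python
# Hypothetical mapping from Freebase toplevels to DBpedia toplevels.
# These pairings are illustrative guesses, not a reviewed alignment.
FB_TO_DBPEDIA = {
    "/people/person": "Person",
    "/location/location": "Place",
    "/film/film": "Work",
    "/organization/organization": "Organisation",
}


def merged_toplevel(dbpedia_type, fb_type):
    """Return a DBpedia toplevel, falling back to the mapped FB type.

    None stands for NULL (no type) in either source.
    """
    if dbpedia_type is not None:
        return dbpedia_type  # assumption: DBpedia wins when present
    return FB_TO_DBPEDIA.get(fb_type)  # None if unmapped or untyped


def is_double_untyped(dbpedia_type, fb_type):
    """True when a topic has no type in either dataset."""
    return dbpedia_type is None and fb_type is None


merged = merged_toplevel(None, "/people/person")
```

Freebase types with no entry in the mapping stay untyped after the merge, so the mapping would need to grow to cover the long tail in the table.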