[Developers] Dbpedia types vs FB types

Paul Houle paul at ontology2.com
Fri Jul 31 19:35:24 UTC 2009


Paul Houle wrote:
>       I'm looking at my sample some more.  Here's the distribution of 
> toplevel types from the dbpedia ontology
>
> +-----------------------------------+----------+
> | type                              | count(*) |
> +-----------------------------------+----------+
> | SupremeCourtOfTheUnitedStatesCase |        3 |
> | Website                           |        4 |
> | Event                             |       21 |
> | Infrastructure                    |       47 |
> | Work                              |      525 |
> | Organisation                      |      649 |
> | Place                             |      712 |
> | Person                            |     2208 |
> | NULL                              |     6961 |
> +-----------------------------------+----------+
>
>   
    I used the new simplified dump from Metaweb to do the same thing 
with Freebase.  Lacking a proper schema dump,  I simply assumed that a 
topic's toplevel type is the most prevalent of the types that apply to 
it (other than /common/topic):

+---------------------------------------------------+----------+
| url                                               | count(*) |
+---------------------------------------------------+----------+
| /people/person                                    |     4066 |
| NULL                                              |     3756 |
| /location/location                                |     1211 |
| /business/employer                                |      827 |
| /film/film                                        |      427 |
| /projects/project_focus                           |      268 |
| /time/event                                       |       46 |
| /organization/organization                        |       46 |
| /transportation/road                              |       44 |
| /architecture/museum                              |       41 |
| /broadcast/broadcast                              |       40 |
| /music/artist                                     |       33 |
| /time/recurring_event                             |       30 |
| /music/album                                      |       27 |
| /book/written_work                                |       25 |
| /book/periodical                                  |       22 |
| /education/educational_institution                |       14 |
| /base/dance/topic                                 |       12 |
| /business/business_location                       |       11 |
| /tv/tv_program                                    |       11 |
| /sports/sports_team                               |        9 |
| /boats/ship                                       |        9 |
| /metropolitan_transit/transit_line                |        7 |
| /base/amusementparks/topic                        |        7 |
| /business/company                                 |        7 |
| /book/author                                      |        6 |
| /visual_art/artwork                               |        5 |
| /user/robert/area_codes/topic                     |        5 |
| /book/book_subject                                |        5 |
| /food/dish                                        |        4 |
| /architecture/structure                           |        4 |
| /transportation/bridge                            |        4 |
| /business/shopping_center                         |        4 |
| /sports/sports_facility                           |        3 |
| /film/film_location                               |        3 |
| /medicine/hospital                                |        3 |
| /music/genre                                      |        3 |
| /award/award                                      |        3 |
| /music/composition                                |        3 |
| /award/award_winner                               |        3 |
| /protected_sites/protected_site                   |        3 |
| /award/award_category                             |        2 |
| /government/government_agency                     |        2 |
| /tv/tv_network                                    |        2 |
| /base/disaster2/topic                             |        2 |
| /user/skud/legal/topic                            |        2 |
| /education/school                                 |        2 |
| /internet/website                                 |        2 |
| /base/dance/dance_company                         |        2 |
| /government/governmental_body                     |        2 |
| /architecture/landscape_project                   |        2 |
| /biology/organism                                 |        2 |
| /geography/body_of_water                          |        2 |
| /theater/theater_company                          |        2 |
| /book/school_or_movement                          |        2 |
| /user/skud/names/namesake                         |        2 |
| /military/armed_force                             |        1 |
| /projects/project                                 |        1 |
| /user/iubookgirl/default_domain/academic_library  |        1 |
| /geography/island                                 |        1 |
| /influence/influence_node                         |        1 |
| /base/fblinux/topic                               |        1 |
| /film/writer                                      |        1 |
| /user/rcheramy/default_domain/nickname            |        1 |
| /award/award_presenting_organization              |        1 |
| /architecture/unrealized_design                   |        1 |
| /base/americancomedy/comedy_venue                 |        1 |
| /base/collectives/topic                           |        1 |
| /games/game                                       |        1 |
| /broadcast/radio_station                          |        1 |
| /cvg/cvg_developer                                |        1 |
| /base/omgfun/festival_series                      |        1 |
| /award/award_nominee                              |        1 |
| /user/petroleumj/default_domain/subway_station    |        1 |
| /business/job_title                               |        1 |
| /user/skud/flags/topic                            |        1 |
| /visual_art/art_subject                           |        1 |
| /user/tsegaran/random/topic                       |        1 |
| /book/magazine                                    |        1 |
| /user/techgnostic/default_domain/periodical       |        1 |
| /food/brewery_brand_of_beer                       |        1 |
| /geography/bay                                    |        1 |
| /metropolitan_transit/transit_system              |        1 |
| /internet/website_owner                           |        1 |
| /visual_art/art_owner                             |        1 |
| /computer/software_developer                      |        1 |
| /fictional_universe/fictional_character_creator   |        1 |
| /venture_capital/venture_investor                 |        1 |
| /base/omgfun/topic                                |        1 |
| /award/hall_of_fame                               |        1 |
| /base/exhibitions/topic                           |        1 |
| /base/symbols/topic                               |        1 |
| /architecture/architectural_structure_owner       |        1 |
| /aviation/airliner_accident                       |        1 |
| /guid/9202a8c04000641f800000000af896ba            |        1 |
| /user/guidewire/default_domain/online_music_store |        1 |
| /library/public_library_system                    |        1 |
| /user/gogza/default_domain/recurring_event        |        1 |
| /base/americancomedy/topic                        |        1 |
+---------------------------------------------------+----------+


(Note that this is over a list of about 11k topics whose classification 
I'm working to improve before I feed it into the next stage of my 
production pipeline.)

Freebase has types for about twice as many people,  and leaves about 
half as many topics untyped as DBpedia does.  The Freebase "toplevels" 
I'm generating are completely uncontrolled,  so you get some strange 
ones towards the bottom.  The "prevalence" filter has gotten rid of a 
large number of references to certain common junk types,  such as the 
"Jungle" type that you find all over the place in Freebase.
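
The "most prevalent type" heuristic described above could be sketched 
roughly as follows.  This is only an illustration, not the code that 
produced the table:  it assumes that prevalence is counted over the 
sample itself, and the function name, the input layout (a dict from 
topic to its list of type ids) and the tie-breaking rule are all 
made up for the example.

```python
from collections import Counter

# Types that apply to (almost) everything and carry no information.
IGNORED_TYPES = {"/common/topic"}


def toplevel_types(topic_types):
    """Pick one "toplevel" type per topic: the candidate type that is
    most prevalent across the whole sample, or None if the topic has
    no type other than the ignored ones.

    topic_types is a hypothetical layout: {topic: [type_id, ...]}.
    """
    # Count how many topics in the sample carry each type.
    prevalence = Counter(
        t
        for types in topic_types.values()
        for t in set(types) - IGNORED_TYPES
    )
    result = {}
    for topic, types in topic_types.items():
        candidates = set(types) - IGNORED_TYPES
        if candidates:
            # Ties are broken by type id (an assumption) so that the
            # result is deterministic.
            result[topic] = max(candidates, key=lambda t: (prevalence[t], t))
        else:
            result[topic] = None
    return result


# Tiny made-up sample, just to show the shape of the data.
sample = {
    "topic_a": ["/common/topic", "/people/person", "/book/author"],
    "topic_b": ["/common/topic", "/people/person"],
    "topic_c": ["/common/topic"],
}
toplevels = toplevel_types(sample)
distribution = Counter(toplevels.values())
```

Here topic_a gets "/people/person" rather than "/book/author" because 
the former applies to more topics in the sample, and topic_c ends up 
in the NULL row.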

Note that the URL structure of "commons" types on FB tends to be

/{problem_domain}/{type}

so you tend to see things like "/book/author" where there is no 
inheritance relation between book and author.  You also see "/base/..." 
types and "/user/..." types,  whose prefixes represent namespaces 
inside FB.
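
To make the id structure concrete, here is a small sketch that splits 
a type id into its parts.  The segment layout for "/base/..." and 
"/user/..." ids is inferred from the listing above rather than from 
any Freebase documentation, and the names TypeId, parse_type_id and 
the "commons"/"other" labels are invented for the example.

```python
from typing import NamedTuple, Optional


class TypeId(NamedTuple):
    namespace: str            # "commons", "base", "user" or "other"
    owner: Optional[str]      # user name for /user/... ids, else None
    domain: Optional[str]     # problem domain, e.g. "book"
    name: str                 # the type itself, e.g. "author"


def parse_type_id(type_id: str) -> TypeId:
    """Split a type id such as "/book/author" into its segments.

    Assumed layouts (inferred from the sample, not authoritative):
      /{domain}/{type}
      /base/{domain}/{type}
      /user/{user}/{domain}/{type}
    Anything else (e.g. /guid/...) is returned as "other".
    """
    parts = type_id.strip("/").split("/")
    if parts[0] == "user" and len(parts) == 4:
        return TypeId("user", parts[1], parts[2], parts[3])
    if parts[0] == "base" and len(parts) == 3:
        return TypeId("base", None, parts[1], parts[2])
    if len(parts) == 2 and parts[0] != "guid":
        return TypeId("commons", None, parts[0], parts[1])
    return TypeId("other", None, None, type_id)


author = parse_type_id("/book/author")
dance = parse_type_id("/base/dance/dance_company")
legal = parse_type_id("/user/skud/legal/topic")
```

Note that the domain is only a grouping:  nothing in the parsed result 
implies that "author" is a kind of "book".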

I'm going to look at the double-untyped topics (those with no type in 
either DBpedia or Freebase) a bit more and also merge the FB types 
into the DBpedia toplevels.



