[Data-modeling] keeping Freebase topics and Wikipedia pages in sync; uncertainty in who is the composer
Raymond Yee
raymond.yee at gmail.com
Tue Jul 21 22:50:07 UTC 2009
I've been putting in some work on the J. S. Bach base that I started a
while ago (http://jsbach.freebase.com/) -- specifically, identifying
Freebase topics corresponding to Bach cantatas and associating them with
the type /music/composition, the composer /en/johann_sebastian_bach, and
a BWV number (/base/jsbach/bach_composition/bwv). To help in the
reconciliation process, I scraped the Wikipedia page
http://en.wikipedia.org/wiki/List_of_cantatas_by_Johann_Sebastian_Bach
looking for cantatas that have their own Wikipedia pages, and figured out
their "curid", with which I could then identify them in Freebase.
For example:
a) BWV 1 corresponds to
http://en.wikipedia.org/wiki/Wie_sch%C3%B6n_leuchtet_der_Morgenstern
, which has a curid of 1505635 (the curid is discoverable in the page
source, i.e., var wgArticleId = "1505635"; or via the Wikipedia API)
b) with the curid, you can then look up
http://www.freebase.com/view/wikipedia/en_id/1505635 -- which is the
same as http://www.freebase.com/view/en/wie_schon_leuchtet_der_morgenstern
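Steps a) and b) can be sketched in Python roughly as follows. This is
only an illustration: the function names are made up for this example,
the regex assumes the page source contains the wgArticleId assignment
in the form quoted above, and nothing here fetches pages or talks to
Freebase.

```python
import re
from typing import Optional


def extract_curid(page_source: str) -> Optional[str]:
    """Pull the curid out of Wikipedia page source.

    Assumes the source contains something like: var wgArticleId = "1505635";
    Returns None when no such assignment is found.
    """
    match = re.search(r'wgArticleId\s*=\s*"?(\d+)"?', page_source)
    return match.group(1) if match else None


def freebase_view_url(curid: str) -> str:
    """Build the Freebase view URL keyed by the English Wikipedia curid."""
    return "http://www.freebase.com/view/wikipedia/en_id/" + curid


# Example with the BWV 1 snippet quoted above.
snippet = 'var wgArticleId = "1505635";'
curid = extract_curid(snippet)
if curid is not None:
    print(freebase_view_url(curid))
```

In practice the page source would come from fetching each cantata page
linked from the list article.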
For those approximately 80 cantatas with Wikipedia pages, I've now made
the ties to Bach and the BWV field
(http://www.freebase.com/view/base/jsbach/views/bach_composition). Of
course, that still leaves a lot of cantatas that don't have either
Wikipedia pages or Freebase topics (e.g.,
http://www.jsbach.org/bwv45.html). Ideally (in my mind at least), there
should be a Wikipedia page for each Bach work and a Freebase topic and a
clear tie between them. There is, in fact, a proposal to create at
least Wikipedia stubs for each cantata
(http://en.wikipedia.org/wiki/Talk:List_of_cantatas_by_Johann_Sebastian_Bach#Proposal_to_write_stubs_on_each_cantata).
How should I proceed? My proposed course of action is:
1) Go ahead with creating Freebase topics for the cantatas that
currently have no Freebase IDs.
2) Start filling in the data as I find it, or as I can recruit help to
fill it in, for all the cantatas.
3) As I gather enough data to create Bach stub articles on Wikipedia,
do so.
4) Wait for Freebase to discover the new Bach cantata pages and then
flag them for merging.
Does that make sense? (I was also thinking of focusing on creating the
Wikipedia articles first, hoping that they don't get deleted, and then
waiting for Freebase to pick them up....)
In doing this upload of Bach cantata data into Freebase, I ran into the
issue of how to deal with works misattributed to J.S. Bach. According to
http://en.wikipedia.org/wiki/BWV: "The BWV catalogue is occasionally
updated, with newly discovered works added at its end, though spurious
works do not have their numbers removed." An example is BWV 15:
http://www.freebase.com/view/en/denn_du_wirst_meine_seele_nicht_in_der_holle_lassen
-- "BWV 15, is a church cantata spuriously attributed to Johann
Sebastian Bach but most likely composed by Johann Ludwig Bach." What I
did to model this is:
1) still have /base/jsbach/bach_composition/bwv = 15 for
/en/denn_du_wirst_meine_seele_nicht_in_der_holle_lassen -- but not the
type /base/jsbach/bach_composition
and
2) go ahead with setting /music/composition/composer to Johann
Ludwig Bach
What do you think?
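To make the modeling concrete, here is a rough sketch of the two cases
as plain Python dicts. This is not the Freebase write API, just an
illustration of which properties and types end up on each topic. The
helper name is invented for this example, treating the spurious work
as still typed /music/composition is an assumption, and Johann Ludwig
Bach is given by name only because I haven't listed his Freebase ID
here.

```python
BACH_TYPE = "/base/jsbach/bach_composition"
BWV_PROP = "/base/jsbach/bach_composition/bwv"
COMPOSER_PROP = "/music/composition/composer"

# A genuine J. S. Bach cantata (BWV 1).
bwv1 = {
    "id": "/en/wie_schon_leuchtet_der_morgenstern",
    "type": ["/music/composition", BACH_TYPE],
    BWV_PROP: 1,
    COMPOSER_PROP: "/en/johann_sebastian_bach",
}

# A spuriously attributed work (BWV 15): keeps its BWV number,
# but is not typed as a Bach composition.
bwv15 = {
    "id": "/en/denn_du_wirst_meine_seele_nicht_in_der_holle_lassen",
    "type": ["/music/composition"],  # assumption: still a composition
    BWV_PROP: 15,
    COMPOSER_PROP: "Johann Ludwig Bach",  # name only, not a Freebase id
}


def has_bwv_but_not_bach_type(topic: dict) -> bool:
    """True for topics that carry a BWV number without the Bach type."""
    return BWV_PROP in topic and BACH_TYPE not in topic.get("type", [])


print(has_bwv_but_not_bach_type(bwv1))
print(has_bwv_but_not_bach_type(bwv15))
```

One consequence of this scheme is that a query for everything with a
BWV number would still return the spurious works, while a query on the
type /base/jsbach/bach_composition would not.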
Thanks,
-Raymond