[Data-modeling] keeping Freebase topics and Wikipedia pages in sync; uncertainty in who is the composer

Raymond Yee raymond.yee at gmail.com
Tue Jul 21 22:50:07 UTC 2009


I've been putting in some work on the J. S. Bach base that I started a while 
ago (http://jsbach.freebase.com/) -- specifically identifying Freebase 
topics corresponding to Bach cantatas, and associating them with the type 
/music/composition, the composer /en/johann_sebastian_bach, and a BWV number 
(/base/jsbach/bach_composition/bwv).  To help in the reconciliation 
process, I scraped the Wikipedia page 
http://en.wikipedia.org/wiki/List_of_cantatas_by_Johann_Sebastian_Bach 
looking for cantatas that have their own Wikipedia pages, and worked out 
each page's "curid", with which I could then identify it in Freebase.

For example:

a) BWV 1 corresponds to 
http://en.wikipedia.org/wiki/Wie_sch%C3%B6n_leuchtet_der_Morgenstern, 
which has a curid of 1505635.  (The curid is discoverable in the page 
source, i.e., var wgArticleId = "1505635"; or via the Wikipedia API.)
b) With the curid, you can then look up 
http://www.freebase.com/view/wikipedia/en_id/1505635 -- which is the 
same as http://www.freebase.com/view/en/wie_schon_leuchtet_der_morgenstern
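
Steps (a) and (b) can be sketched in Python as follows.  This is only a 
minimal illustration, not the script I actually ran: the function names 
are hypothetical, and the regular expression assumes the page source 
contains a wgArticleId assignment in roughly the form quoted above.

```python
import re
from typing import Optional

# Assumed shape of the assignment in the page source, e.g.
#   var wgArticleId = "1505635";
# The quotes are treated as optional in case the markup varies.
_ARTICLE_ID = re.compile(r'wgArticleId\s*=\s*"?(\d+)"?')


def extract_curid(page_source: str) -> Optional[int]:
    """Return the curid found in the page source, or None if absent."""
    match = _ARTICLE_ID.search(page_source)
    return int(match.group(1)) if match else None


def freebase_url_for_curid(curid: int) -> str:
    """Build the Freebase lookup URL for an English Wikipedia curid."""
    return f"http://www.freebase.com/view/wikipedia/en_id/{curid}"


if __name__ == "__main__":
    sample = 'var wgArticleId = "1505635";'
    curid = extract_curid(sample)
    if curid is not None:
        print(freebase_url_for_curid(curid))
```

Fetching the page source itself (or asking the Wikipedia API for the 
page ID) is left out so that the sketch runs without network access.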

For those approximately 80 cantatas with Wikipedia pages, I've now made 
the ties to Bach and the BWV field 
(http://www.freebase.com/view/base/jsbach/views/bach_composition).  Of 
course, that still leaves a lot of cantatas that have neither 
Wikipedia pages nor Freebase topics (e.g., 
http://www.jsbach.org/bwv45.html).  Ideally (in my mind at least), there 
should be a Wikipedia page and a Freebase topic for each Bach work, with a 
clear tie between them.  There is, in fact, a proposal to create at 
least Wikipedia stubs for each cantata 
(http://en.wikipedia.org/wiki/Talk:List_of_cantatas_by_Johann_Sebastian_Bach#Proposal_to_write_stubs_on_each_cantata).  


How should I proceed?  My proposed course of action is:

1) Go ahead with creating Freebase topics for the cantatas that currently 
have no Freebase IDs.

2) Start filling in the data for all the cantatas, as I can find it or 
recruit help to fill it in.

3) As I gather enough data to create Bach stub articles on 
Wikipedia, do so.

4) Wait for Freebase to discover the new Bach cantata pages and then 
flag them for merging.

Does that make sense?  (I was also thinking of focusing on creating the 
Wikipedia articles first, hoping that they don't get deleted, and then 
waiting for Freebase to pick them up....)

In doing this upload of Bach cantata data into Freebase, I ran into the 
issue of how to deal with works misattributed to J.S. Bach.  According to 
http://en.wikipedia.org/wiki/BWV:  "The BWV catalogue is occasionally 
updated, with newly discovered works added at its end, though spurious 
works do not have their numbers removed."  An example is BWV 15: 
http://www.freebase.com/view/en/denn_du_wirst_meine_seele_nicht_in_der_holle_lassen  
-- "BWV 15, is a church cantata spuriously attributed to Johann 
Sebastian Bach but most likely composed by Johann Ludwig Bach."  What I 
did to model this is:

1) Keep /base/jsbach/bach_composition/bwv = 15 for 
/en/denn_du_wirst_meine_seele_nicht_in_der_holle_lassen, but do not 
assign it the type /base/jsbach/bach_composition

and

2) Set /music/composition/composer to Johann 
Ludwig Bach
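
To make the modeling choice concrete, here is a rough Python sketch of 
the resulting topic.  It does not use any Freebase API: the Topic class 
and the helper function are hypothetical stand-ins, the composer is given 
by name because I have not quoted a Freebase ID for Johann Ludwig Bach, 
and the presence of the /music/composition type on this topic is an 
assumption.

```python
from dataclasses import dataclass, field
from typing import Optional, Set

BACH_COMPOSITION = "/base/jsbach/bach_composition"


@dataclass
class Topic:
    """Hypothetical in-memory stand-in for a Freebase topic."""
    id: str
    types: Set[str] = field(default_factory=set)
    bwv: Optional[int] = None       # /base/jsbach/bach_composition/bwv
    composer: Optional[str] = None  # /music/composition/composer


def is_typed_as_bach_composition(topic: Topic) -> bool:
    """True only if the topic carries the bach_composition type."""
    return BACH_COMPOSITION in topic.types


# BWV 15: the catalogue number is kept, the bach_composition type is
# withheld, and the composer is set to the more likely author.
bwv15 = Topic(
    id="/en/denn_du_wirst_meine_seele_nicht_in_der_holle_lassen",
    types={"/music/composition"},  # assumed type
    bwv=15,
    composer="Johann Ludwig Bach",
)

if __name__ == "__main__":
    print(bwv15.bwv, bwv15.composer, is_typed_as_bach_composition(bwv15))
```

The point of the split is that a lookup by BWV number still finds the 
work, while anything that filters on the bach_composition type leaves 
it out.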

What do you think?

Thanks,
-Raymond


