[Data-modeling] Multi-disc release normalization prototype

Christopher R. Maden crism at metaweb.com
Thu Apr 17 05:40:04 UTC 2008


OK: 1.7 million triples later, a whole bunch of stuff has changed on sandbox.

<URL: 
http://sandbox.freebase.com/view/guid/9202a8c04000641f8000000007e4a109 > 
is a Mass Data Operation with a brief description of what happened.

I took all the releases which MusicBrainz gave a title ending in “(disc 
1)” or “(bonus disc)” or similar things and juggled them around.  See 
for example Pink Floyd’s _The Wall_ <URL: 
http://sandbox.freebase.com/view/en/the_wall > or The Beatles’ _The 
Beatles_ <URL: 
http://sandbox.freebase.com/view/guid/9202a8c04000641f8000000002f7114f >.
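
For anyone curious how the selection step might look, here is a rough
sketch in Python.  It is not the code that ran; the pattern and the
function names are hypothetical, and it only handles the two suffixes
named above, not the "similar things".  Note that base titles are
compared exactly, so slightly different titles stay in separate groups.

```python
import re
from collections import defaultdict

# Hypothetical pattern: a trailing "(disc N)" or "(bonus disc)" suffix.
DISC_SUFFIX = re.compile(
    r"\s*\((?:disc\s+(?P<number>\d+)|bonus\s+disc)\)\s*$",
    re.IGNORECASE,
)


def split_disc_title(title):
    """Return (base_title, disc_label), or (title, None) if no suffix."""
    match = DISC_SUFFIX.search(title)
    if match is None:
        return title, None
    base = title[:match.start()]
    # A numbered disc keeps its number; anything else is the bonus disc.
    label = match.group("number") or "bonus"
    return base, label


def group_by_base_title(titles):
    """Group disc-suffixed titles under their exact base title."""
    groups = defaultdict(list)
    for title in titles:
        base, label = split_disc_title(title)
        if label is not None:
            groups[base].append((label, title))
    return dict(groups)
```

Exact matching is the conservative choice here: it misses discs whose
titles differ slightly, but it avoids merging unrelated albums that
happen to have similar names.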

Take a look at your favorite multi-disc album or box set.  What you 
should see is a single album; its releases should be similarly simple. 
However, those releases should also be Multi-Part Musical Release 
instances, and should have Musical Release Components that are the 
individual discs.
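
As a rough illustration of that shape, here is a sketch in Python.  The
class names echo the type names above, but the fields, the `position`
attribute, and the `build_album` helper are my assumptions for the
example, not the actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MusicalReleaseComponent:
    """One disc of a multi-part release."""
    name: str
    position: int  # assumed field: 1-based order within the release


@dataclass
class MultiPartMusicalRelease:
    """A release that is made up of several discs."""
    name: str
    components: List[MusicalReleaseComponent] = field(default_factory=list)


@dataclass
class Album:
    """The single album that all the releases hang off."""
    name: str
    releases: List[MultiPartMusicalRelease] = field(default_factory=list)


def build_album(title, disc_titles):
    """Hypothetical helper: one album, one release, one component per disc."""
    components = [
        MusicalReleaseComponent(name=disc_title, position=index)
        for index, disc_title in enumerate(disc_titles, start=1)
    ]
    release = MultiPartMusicalRelease(name=title, components=components)
    return Album(name=title, releases=[release])
```

Calling `build_album("The Wall", ["The Wall (disc 1)", "The Wall (disc
2)"])` gives one album with one release and two components, with the
disc names left exactly as they came in.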

Please let me know if you find any serious problems!

There are two things I already know about:

1) When MusicBrainz has slightly different titles for the discs of one 
release, we did not reconcile them.  There are too many similarly-named 
albums out there, and I wanted to err on the conservative side.  As 
MusicBrainz’s data gets cleaner, rerunning this reconciliation will 
pick those releases up.

2) I did not change the names of discs.  In some cases, a past 
reconciliation merged a disc with the Wikipedia article for the album; 
this happened to disc 2 of both _The Wall_ and _The Beatles_.  The 
resulting shorter name has persisted through a bunch of gardening 
operations.  A future version of this process will correct the disc 
names in most cases.

~Chris
-- 
Christopher R. Maden
Data Architect
Metaweb Technologies, Inc.
<URL: http://www.metaweb.com/ >

