[Data-modeling] Library of Congress and Dewey Classifications

Jeff Prucher jeff at metaweb.com
Tue Mar 18 17:41:16 UTC 2008


I'd love to get some input on whether, and how, we should be storing Dewey
Decimal and Library of Congress classifications for books in Freebase; there
are currently properties for these values on the "book edition" type. I have
several questions about this.

1) Is "book edition" the right place for this data? Would it make more sense
to put it on "book" instead?  The argument for edition seems to be that
there is some variation between classifications (especially Dewey) for
different editions of the same book. However, the more data I look at, the
more this seems to be pretty arbitrary. That is to say, it's very common for
the same edition of a book to have different classification codes at
different libraries, especially as you get further to the right in the
classification.  The high-level classification (the first three-to-six
digits in Dewey, a somewhat longer string in LC) is, however, pretty stable
(with some variation) even across editions. This suggests that maybe a
non-unique property on "book" is the better way to do this.

2) To what degree of precision should we be storing these values? Since
there is a great deal of variation at the more precise levels, would it make
sense to only capture the higher-level values?

Here are some example Dewey numbers for the 1994 Modern Library edition of
Adam Smith's "An inquiry into the nature and causes of the wealth of
nations" from different libraries:
330.153 
330.15
330.1
330
330 S642i 1994
330.153 SMI 1994
330.15 SMITH 1994
330.153 S642w, 1994
330.15 Sm51i
330.153 Sm51 1994 

The obvious thing to do would be to truncate the value at the first space,
since the additional data seems largely to serve to disambiguate the edition
from others in that library. That would still leave us with four different
values (330, 330.1, 330.15, and 330.153), which in this case are at least
hierarchical, but I've seen others (can't find them right now,
unfortunately) where different libraries had values like "330" and "808".
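
To make the above concrete, here is a rough Python sketch of the
normalization being discussed. It is only an illustration, not anything
Freebase does today: the function names (class_number, is_hierarchical,
top_level) are made up for this example, the input list is just the ten
values quoted above, and "hierarchical" is assumed to mean simply that each
shorter number is a string prefix of the next longer one.

```python
# Example values copied from the list above (one library's record per entry).
RAW_VALUES = [
    "330.153",
    "330.15",
    "330.1",
    "330",
    "330 S642i 1994",
    "330.153 SMI 1994",
    "330.15 SMITH 1994",
    "330.153 S642w, 1994",
    "330.15 Sm51i",
    "330.153 Sm51 1994",
]


def class_number(raw):
    """Keep only the text before the first space (the classification part)."""
    parts = raw.split()
    return parts[0] if parts else ""


def is_hierarchical(numbers):
    """True if, ordered by length, each number is a prefix of the next one."""
    ordered = sorted(numbers, key=len)
    return all(b.startswith(a) for a, b in zip(ordered, ordered[1:]))


def top_level(number):
    """Hypothetical coarse form: only the digits before the decimal point."""
    return number.split(".")[0]


# Drop the library-specific suffixes, then collapse duplicates.
distinct = sorted({class_number(v) for v in RAW_VALUES})

print(distinct)                              # the four values that remain
print(is_hierarchical(distinct))             # the Adam Smith case
print(is_hierarchical(["330", "808"]))       # the conflicting case
print({top_level(n) for n in distinct})      # coarsest level only
```

On the sample data this leaves the four distinct values, which pass the
prefix check, while a pair like "330" and "808" fails it. Cutting down to
the digits before the decimal point (question 2) would collapse the Smith
example to a single value, but would not help with the "330" versus "808"
kind of disagreement.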

3) Is it worth even trying to capture these values?  What advantages are
there that would make it worth trying to wrangle this rather messy dataset?

Thanks,
Jeff Prucher



