[Data-modeling] Library of Congress and Dewey Classifications

Jeff Prucher jeff at metaweb.com
Wed Mar 26 23:57:04 UTC 2008


Benjamin Good wrote:
> Regarding the discussion so far, creating topics for each of 
> the classifications makes sense to me as it makes it possible 
> to start linking the concepts represented by the terms and 
> the codes together in meaningful ways.  I think I don't 
> really understand the alternative very well, or why it would 
> be unfeasible to enter every DD classification as a unique 
> topic in freebase.  Could you elaborate a bit ?

There are a few issues. One is simply that of scale: even if we decided to
model DDC numbers only down to the xxx.xxx level, that's still probably over a
million topics. ("Probably" because not every number is used, but we would
presumably be modeling these in a phylogeny pattern, where 900 contains 910,
which contains 912, etc., so every intermediate level becomes a topic too.)
I couldn't guess what the corresponding number for LoC classifications is.
So that's a lot of data, especially since we're considering adding links
down and up the tree, and across to other classification systems. Making
lots of links is what Freebase is about, so this is not necessarily a bad
thing, and a generic "classification system" schema that could be used
across disciplines is very intriguing; I'm just trying to make sure that
it's the right thing to do in this case before creating a data-monster.
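
To put rough numbers on the scale question, here is a back-of-the-envelope
sketch in Python. It assumes every digit position down to the third decimal
place is its own level and that every slot is filled, so the figures are
upper bounds (plenty of numbers are never assigned). The function names are
made up for illustration and are not part of any Freebase API.

```python
def slots_per_level(max_decimals=3):
    """Possible values at each level if every slot were in use.

    10 main classes, 100 divisions, 1000 sections, then a factor of ten
    for each decimal digit that is modeled.
    """
    return [10 ** k for k in range(1, 3 + max_decimals + 1)]


def ddc_ancestors(code):
    """Containment chain for a code, broadest class first.

    Assumes a well-formed code such as "912.123": three digits,
    optionally followed by a dot and decimal digits.
    """
    whole, _, frac = code.partition(".")
    chain = [whole[:i].ljust(3, "0") for i in range(1, 4)]
    chain += [f"{whole}.{frac[:i]}" for i in range(1, len(frac) + 1)]
    # "900" would otherwise appear as its own class, division and section.
    return list(dict.fromkeys(chain))


print(slots_per_level())
print(ddc_ancestors("912.123"))
```

The deepest level alone has a million possible values, and each code drags
in up to five ancestors that need topics and "contains" links of their own.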

Another issue is what kind of data to collect for these systems. Both DDC
and LoC map down to very fine levels of detail; LoC seems to map very
specifically to individual editions of books; DDC seems a bit more haphazard
at the edition level, depending on the coding library's needs. But in both
cases, it seems like there's a point in the code after which the numbers and
letters are for disambiguation and alphabetization, rather than
categorization. So we could (potentially) decide to only create topics for
the more "meaningful" levels, for some possibly arbitrary meaning of
"meaningful". This is what I think Scott was suggesting downthread -- encode
the top 3-digit DDC codes as topics, and then either do something else to
capture the more detailed values, which might still be of value as foreign
keys, say, or just omit the finer-grained classifications altogether.
WorldCat, for example, shows detail up to a certain point and then omits the
rest. Part of this has implications in the data-model for books -- if we
truncate at an arbitrarily high level, the properties should be on "book";
if we don't truncate, the properties would go on edition. If we denormalize
and do both, then there would be properties on both book and book edition.
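
As a rough illustration of the truncate-and-keep-the-rest idea, here is a
small Python sketch. The split point (the first three digits) follows the
suggestion above; the `split_ddc` helper and its field names are invented
here and don't correspond to any existing Freebase property.

```python
from typing import NamedTuple


class DdcParts(NamedTuple):
    topic_code: str  # coarse class that would become a topic, e.g. "813"
    remainder: str   # everything after it, kept verbatim as a string
    full: str        # the original value, usable as a foreign key


def split_ddc(raw):
    """Split a DDC string after its first three digits.

    Assumes the value starts with three digits. Anything else is kept
    whole as a string, with no topic code, so no data is lost.
    """
    value = raw.strip()
    head = value[:3]
    if len(head) == 3 and head.isdigit():
        return DdcParts(head, value[3:], value)
    return DdcParts("", value, value)


print(split_ddc("813.54"))
```

Under this split, the coarse `topic_code` is the part that could live on
"book", while `full` stays on the edition as a plain string.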

The other option is to treat these classifications essentially as foreign
keys and encode them only as text strings, which nobody seems to like much,
but I mention it for completeness.

I don't think MeSH has (based on a very brief perusal of their site) the
second problem that I mentioned for LoC and DDC, so as far as encoding the
MeSH hierarchy goes, it would only be a matter of scale and schema design
(i.e., design for MeSH explicitly, or create a generic classification schema).
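
For the generic option, here is a rough Python sketch of what such a schema
might need to hold. The relation names echo the Equals, Overlaps, Contains,
and Contained by properties on Ed's Classification code type (quoted below);
everything else, including the class and field names and the sample codes,
is a placeholder made up for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Relation(Enum):
    EQUALS = "equals"
    OVERLAPS = "overlaps"
    CONTAINS = "contains"
    CONTAINED_BY = "contained_by"


@dataclass
class ClassificationCode:
    system: str  # name of the classification system the code belongs to
    code: str    # the notation within that system
    parent: Optional["ClassificationCode"] = None  # link up the tree
    # Links across to codes in other systems, as (relation, code) pairs.
    cross_refs: list = field(default_factory=list)

    def ancestors(self):
        """Walk up the tree, nearest parent first."""
        node, out = self.parent, []
        while node is not None:
            out.append(node)
            node = node.parent
        return out


# Placeholder systems and codes; these are not real mappings.
top = ClassificationCode("system-a", "900")
mid = ClassificationCode("system-a", "910", parent=top)
other = ClassificationCode("system-b", "G")
mid.cross_refs.append((Relation.OVERLAPS, other))

print([node.code for node in mid.ancestors()])
```

A schema designed for one system explicitly would presumably look much the
same minus the `system` field and the cross-references.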

Clear as mud, right?

Jeff P.


> best regards
> -Ben
> 
> 
> On Mar 24, 2008, at 4:15 PM, Jeff Prucher wrote:
> 
> > Here is my current thinking on these classifications:
> >
> > 1) DDC and LCC properties should be non-unique, and should stay on 
> > "book edition". (Even though Dewey doesn't uniquely map to editions, 
> > keeping at the book edition level makes it easier to import data from 
> > MARC and MARC-like sources, which don't collapse editions into 
> > individual books.)
> >
> > 2) There is some interest in having at least the high-level codes as 
> > topics, rather than strings, but we haven't come to any agreement 
> > about how this might work. It seems that making EVERY possible Dewey 
> > or LoC classification into a topic is not practical, and may get into 
> > copyright issues anyway. But only making the 1000 three digit Dewey #s 
> > or however-many two-letter LoC numbers doesn't really help Tim K. with 
> > his discovery of books on similar topics. However, it could be used 
> > for Ed's proposed cross-mapping of subjects, if such a thing turns out 
> > to be feasible.
> >
> > Further thoughts?
> >
> > Jeff
> >
> >> -----Original Message-----
> >> From: data-modeling-bounces at freebase.com
> >> [mailto:data-modeling-bounces at freebase.com] On Behalf Of Ed Laurent
> >> Sent: Wednesday, March 19, 2008 11:22 AM
> >> To: Freebase data modeling mailing list
> >> Subject: Re: [Data-modeling] Library of Congress and Dewey 
> >> Classifications
> >>
> >> Understood. Maybe I was getting a little carried away.
> >> However, querying a library for a particular book could require 
> >> knowledge of its complete Dewey code and version depending on how it 
> >> is referenced by the library.
> >>
> >> I guess my potentially useful point was that anyone's concept of 
> >> <subject topic> may differ from someone else's. Defining and cross 
> >> referencing similar subject ontologies (e.g., Dewey versions, land 
> >> cover classification systems, species concepts) is therefore very 
> >> important so that people can find <other books on subject> even if 
> >> the book topics are linked to a subject only through a similar but 
> >> different ontology.
> >> This will likely require that categories or subject topics (e.g., 
> >> topics of "Book subject" type) are listed as topics of a defined 
> >> ontology type (see topics listed in Classification system 
> >> <http://www.freebase.com/view/user/spatialed/default_domain/classification_system> ) 
> >> and that the subject topics are cross referenced to similar subject 
> >> topics of different ontologies (see Equals, Overlaps, Contains, and 
> >> Contained by properties of Classification code 
> >> <http://www.freebase.com/view/schema/user/spatialed/default_domain/classification_code> ). 
> >> This is one way that "higher-order semantics" that Robert referred 
> >> to in the Events thread can be defined.
> >>
> >> -Ed
> >>
> >>
> >>
> >> On Wed, Mar 19, 2008 at 1:59 PM, Jeff Prucher <jeff at metaweb.com> 
> >> wrote:
> >>
> >>
> >>
> >>
> >> 	> -----Original Message-----
> >> 	> From: data-modeling-bounces at freebase.com
> >> 	
> >> 	> [mailto:data-modeling-bounces at freebase.com] On Behalf Of Ed 
> >> Laurent
> >> 	
> >> 	> I'm wondering how useful the browsing option of "All books in
> >> 	> Dewey Decimal 303" would be on a day-to-day basis compared to
> >> 	> "I'm looking for The Catcher in the Rye at my local library
> >> 	> and want to know where to find it". Is finding a book in your
> >> 	> local library an appropriate use of Freebase? It's not much
> >> 	> different than asking "I'm looking for car manufacturers in
> >> 	> my city and want to know where to find them".  That seems to
> >> 	> be appropriate.
> >> 	
> >> 	
> >> 	I'd say that "I'm looking for library branches in my city" is more
> >> 	analogous to "I'm looking for car manufacturers in my city".
> >> 	Finding a book in your local library is more akin to querying the
> >> 	current stock of an auto-parts store. Since Freebase can never hope
> >> 	to be as good as your local library's actual website in terms of
> >> 	being able to find out what books they have, I don't think it's an
> >> 	appropriate use. What would be appropriate, though, is to have a way
> >> 	either to query which libraries have a particular item, or to query
> >> 	a specific library for an item. We don't currently have this
> >> 	capability, but it would be very, very cool.
> >> 	
> >>
> >>
> >> 	> Freebase could be very useful for not just linking topics but
> >> 	> also for linking the ways they are categorized. As users
> >> 	> become more aware of the various ways that topics are
> >> 	> categorized and the reasons behind differences in systems
> >> 	> that define the categories (especially systems that are well
> >> 	> used and well defined) they should be able to develop more
> >> 	> and more comprehensive and mutually exclusive type properties
> >> 	> and topic attributes.
> >> 	
> >> 	
> >> 	This is a very good point, and well worth keeping in mind as we deal
> >> 	not only with book data, but many other types of data as well.
> >> 	
> >> 	Jeff
> >> 	
> >>
> >> 	_______________________________________________
> >> 	Data-modeling mailing list
> >> 	Data-modeling at freebase.com
> >> 	http://lists.freebase.com/mailman/listinfo/data-modeling
> >> 	
> >>
> >>
> >>
> >
> 


