[Data-modeling] Library of Congress and Dewey Classifications
Jonathan W. Lowe
jlowe at giswebsite.com
Thu Mar 27 07:01:34 UTC 2008
Regarding:
"I'm just trying to make sure that it's the right thing to do in this
case before creating a data-monster."
...and...
"It would be helpful when considering the creation of such data monsters
to know what the boundaries really are of the freebase platform."
I'd like to extend Ben's technical question to include business
considerations.
Even with generous VC funding, Metaweb has finite resources with which
to assimilate mountains of data. Other than technical "boundaries," what
criteria do Metaweb's executives weigh before loading or rejecting any
large data collection?
Is future business value a consideration? If so, how does Metaweb
estimate and quantify such a potential return?
For instance, if some madman proposed loading all 8.7 million US census
block boundaries and their associated statistics on population, age,
race and households, what value would Metaweb weigh against the
infrastructure costs and staff time to accomplish such a data load? How
would that value be estimated and by whom?
- Jonathan
On Wed, 2008-03-26 at 17:47 -0700, Benjamin Good wrote:
> It would be helpful when considering the creation of such data
> monsters to know what the boundaries really are of the freebase
> platform. What would the consequences be if we added a million new
> topics? Might be nice to have a set of guidelines for developers
> looking to do large-scale integration work. This information would
> also help in making decisions about what level of detail is best.
> Naively, it seems that if there were no limits, the more detailed the
> classifications that were represented in freebase the better - the
> applications that used them could then decide on the desired level of
> granularity to present.
>
> Regarding the adoption of a generic classification system, I think
> this would be a really useful activity to accomplish. From a very
> casual inspection, I found 2 same, same but different models for this
> in freebase already:
>
> for the gene ontology classification
> http://www.freebase.com/view/guid/9202a8c04000641f800000000522373f
>
> for the more generic "classification code" discussed already in this
> thread
> http://www.freebase.com/view/guid/9202a8c04000641f8000000006c86fe0
>
> What do you think of basing a unified, upper classification system
> Type on the SKOS model?
> http://www.w3.org/2004/02/skos/intro
>
> -Ben
>
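One way to picture a unified classification Type along the lines Ben suggests is sketched below. It is a minimal illustration, not Freebase schema: `Code`, `Crosswalk`, `INVERSE` and `SKOS_EQUIVALENT` are hypothetical names, and the pairing of the Equals / Overlaps / Contains / Contained by properties (from the "classification code" type quoted further down) with SKOS mapping properties is an assumption. SKOS has no direct "overlaps" property, so `skos:relatedMatch` stands in loosely.

```python
from dataclasses import dataclass

# Assumed pairing of the thread's four relations with SKOS mapping properties.
# "overlaps" has no exact SKOS counterpart; skos:relatedMatch is a loose stand-in.
SKOS_EQUIVALENT = {
    "equals": "skos:exactMatch",
    "overlaps": "skos:relatedMatch",
    "contains": "skos:narrowMatch",
    "contained_by": "skos:broadMatch",
}

# Each relation's inverse, so a link can be read from either side.
INVERSE = {
    "equals": "equals",
    "overlaps": "overlaps",
    "contains": "contained_by",
    "contained_by": "contains",
}


@dataclass(frozen=True)
class Code:
    """One code in one classification scheme (hypothetical type)."""
    scheme: str    # e.g. "DDC" or "LCC"
    notation: str  # e.g. "910" or "G"


class Crosswalk:
    """Stores typed links between codes, within or across schemes."""

    def __init__(self) -> None:
        self._links: dict[Code, set[tuple[str, Code]]] = {}

    def link(self, a: Code, relation: str, b: Code) -> None:
        if relation not in INVERSE:
            raise ValueError(f"unknown relation: {relation!r}")
        # Record the link in both directions using the inverse relation.
        self._links.setdefault(a, set()).add((relation, b))
        self._links.setdefault(b, set()).add((INVERSE[relation], a))

    def related(self, code: Code, relation: str) -> list[Code]:
        matches = (c for rel, c in self._links.get(code, set()) if rel == relation)
        return sorted(matches, key=lambda c: (c.scheme, c.notation))


# Illustrative pairing only, not an authoritative DDC-to-LCC mapping.
crosswalk = Crosswalk()
crosswalk.link(Code("DDC", "910"), "overlaps", Code("LCC", "G"))
crosswalk.link(Code("DDC", "900"), "contains", Code("DDC", "910"))
```

Because every link is stored with its inverse, an application starting from either scheme can walk to the other, which is the kind of cross-referencing the generic type is meant to support.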
>
> On Mar 26, 2008, at 4:57 PM, Jeff Prucher wrote:
>
> >
> > Benjamin Good wrote:
> >> Regarding the discussion so far, creating topics for each of
> >> the classifications makes sense to me as it makes it possible
> >> to start linking the concepts represented by the terms and
> >> the codes together in meaningful ways. I think I don't
> >> really understand the alternative very well, or why it would
> >> be unfeasible to enter every DD classification as a unique
> >> topic in freebase. Could you elaborate a bit?
> >
> > There are a few issues. One is simply that of scale -- even if we decided to
> > model only the xxx.xxx level of DDC numbers, it's still probably over a
> > million topics ("probably" because, while not every number is used, we would
> > presumably be modeling these in a phylogeny pattern: 900 contains 910
> > contains 912 etc.). I couldn't guess what the corresponding number for LoC
> > classifications is. So that's a lot of data, especially since we're
> > considering adding links down and up the tree, and across to other
> > classification systems. Making lots of links is what Freebase is about, so
> > this is not necessarily a bad thing, and a generic "classification system"
> > schema that could be used across disciplines is very intriguing; I'm just
> > trying to make sure that it's the right thing to do in this case before
> > creating a data-monster.
> >
> > Another issue is what kind of data to collect for these systems. Both DDC
> > and LoC map down to very fine levels of detail; LoC seems to map very
> > specifically to individual editions of books; DDC seems a bit more haphazard
> > at the edition level, depending on the coding library's needs. But in both
> > cases, it seems like there's a point in the code after which the numbers and
> > letters are for disambiguation and alphabetization, rather than
> > categorization. So we could (potentially) decide to only create topics for
> > the more "meaningful" levels, for some possibly arbitrary meaning of
> > "meaningful". This is what I think Scott was suggesting downthread -- encode
> > the top 3-digit DDC codes as topics, and then either do something else to
> > capture the more detailed values, which might still be of value as foreign
> > keys, say, or just omit the finer-grained classifications altogether.
> > WorldCat, for example, shows detail up to a certain point and then omits the
> > rest. Part of this has implications in the data-model for books -- if we
> > truncate at an arbitrarily high level, the properties should be on "book";
> > if we don't truncate, the properties would go on edition. If we denormalize
> > and do both, then there would be properties on both book and book edition.
> >
> > The other option is to treat these classifications essentially as foreign
> > keys and encode them only as text strings, which nobody seems to much like,
> > but I mention it for completeness.
> >
> > I don't think MeSH has (based on a very brief perusal of their site) the
> > second problem that I mentioned for LoC and DDC, so as far as encoding the
> > MeSH hierarchy, it would only be a matter of scale and schema design
> > (i.e. -- design for MeSH explicitly, or create a generic classification
> > schema).
> >
> > Clear as mud, right?
> >
> > Jeff P.
> >
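The containment pattern and the truncation option described in the message above can be sketched in a few lines. This is only an illustration under stated assumptions: it takes DDC numbers to be written as three digits with an optional decimal suffix (for example "912.3"), and the helper names `ddc_ancestors` and `split_ddc` are hypothetical, not part of any Freebase schema or library.

```python
def ddc_ancestors(code: str) -> list[str]:
    """Return the three-digit classes that contain `code`, broadest first.

    Follows the "900 contains 910 contains 912" pattern from the thread.
    """
    digits = code.split(".")[0]
    if len(digits) != 3 or not digits.isdigit():
        raise ValueError(f"expected a number like '912' or '912.3', got {code!r}")
    chain = [digits[0] + "00", digits[:2] + "0", digits]
    # A code such as "900" is its own main class, so drop repeats in order.
    unique: list[str] = []
    for level in chain:
        if level not in unique:
            unique.append(level)
    return unique


def split_ddc(code: str) -> tuple[str, str]:
    """Split a number into the three-digit part that would become a topic
    and the finer-grained remainder that would stay a plain string."""
    head, _, tail = code.partition(".")
    return head, tail


chain = ddc_ancestors("912.3")
topic_part, string_part = split_ddc("912.3")
```

Under the truncation option only the values returned by `ddc_ancestors` would exist as topics, while the remainder from `split_ddc` could still be stored as a foreign-key style string on the book edition.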
> >
> >> best regards
> >> -Ben
> >>
> >>
> >> On Mar 24, 2008, at 4:15 PM, Jeff Prucher wrote:
> >>
> >>> Here is my current thinking on these classifications right now:
> >>>
> >>> 1) DDC and LCC properties should be non-unique, and should stay on
> >>> "book edition". (Even though Dewey doesn't uniquely map to editions,
> >>> keeping them at the book edition level makes it easier to import data
> >>> from MARC and MARC-like sources, which don't collapse editions into
> >>> individual books.)
> >>>
> >>> 2) There is some interest in having at least the high-level codes as
> >>> topics, rather than strings, but we haven't come to any agreement
> >>> about how this might work. It seems that making EVERY possible Dewey
> >>> or LoC classification into a topic is not practical, and may get into
> >>> copyright issues anyway. But only making the 1000 three-digit Dewey #s
> >>> or however-many two-letter LoC numbers doesn't really help Tim K. with
> >>> his discovery of books on similar topics. However, it could be used
> >>> for Ed's proposed cross-mapping of subjects, if such a thing turns out
> >>> to be feasible.
> >>>
> >>> Further thoughts?
> >>>
> >>> Jeff
> >>>
> >>>> -----Original Message-----
> >>>> From: data-modeling-bounces at freebase.com
> >>>> [mailto:data-modeling-bounces at freebase.com] On Behalf Of Ed Laurent
> >>>> Sent: Wednesday, March 19, 2008 11:22 AM
> >>>> To: Freebase data modeling mailing list
> >>>> Subject: Re: [Data-modeling] Library of Congress and Dewey
> >>>> Classifications
> >>>>
> >>>> Understood. Maybe I was getting a little carried away.
> >>>> However, querying a library for a particular book could require
> >>>> knowledge of its complete Dewey code and version depending on how it
> >>>> is referenced by the library.
> >>>>
> >>>> I guess my potentially useful point was that anyone's concept of
> >>>> <subject topic> may differ from someone else's. Defining and cross
> >>>> referencing similar subject ontologies (e.g., Dewey versions, land
> >>>> cover classification systems, species concepts) is therefore very
> >>>> important so that people can find <other books on subject> even if
> >>>> the book topics are linked to a subject only through a similar but
> >>>> different ontology.
> >>>> This will likely require that categories or subject topics (e.g.,
> >>>> topics of "Book subject" type) are listed as topics of a defined
> >>>> ontology type (see topics listed in Classification system
> >>>> <http://www.freebase.com/view/user/spatialed/default_domain/classification_system>)
> >>>> and that the subject topics are cross referenced to similar subject
> >>>> topics of different ontologies (see Equals, Overlaps, Contains, and
> >>>> Contained by properties of Classification code
> >>>> <http://www.freebase.com/view/schema/user/spatialed/default_domain/classification_code>).
> >>>> This is one way that "higher-order semantics" that Robert referred
> >>>> to in the Events thread can be defined.
> >>>>
> >>>> -Ed
> >>>>
> >>>>
> >>>>
> >>>> On Wed, Mar 19, 2008 at 1:59 PM, Jeff Prucher <jeff at metaweb.com>
> >>>> wrote:
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> > -----Original Message-----
> >>>> > From: data-modeling-bounces at freebase.com
> >>>>
> >>>> > [mailto:data-modeling-bounces at freebase.com] On Behalf Of Ed
> >>>> Laurent
> >>>>
> >>>> > I'm wondering how useful the browsing option of "All books in
> >>>> > Dewey Decimal 303" would be on a day-to-day basis compared to
> >>>> > "I'm looking for The Catcher in the Rye at my local library
> >>>> > and want to know where to find it". Is finding a book in your
> >>>> > local library an appropriate use of Freebase? It's not much
> >>>> > different than asking "I'm looking for car manufacturers in
> >>>> > my city and want to know where to find them". That seems to
> >>>> > be appropriate.
> >>>>
> >>>>
> >>>> I'd say that "I'm looking for library branches in my city" is more
> >>>> analogous to "I'm looking for car manufacturers in my city".
> >>>> Finding a book in your local library is more akin to querying the
> >>>> current stock of an auto-parts store. Since Freebase can never hope
> >>>> to be as good as your local library's actual website in terms of
> >>>> being able to find out what books they have, I don't think it's an
> >>>> appropriate use. What would be appropriate, though, is to have a way
> >>>> either to query which libraries have a particular item, or to query
> >>>> a specific library for an item. We don't currently have this
> >>>> capability, but it would be very, very cool.
> >>>>
> >>>>
> >>>>
> >>>> > Freebase could be very useful for not just linking topics but
> >>>> > also for linking the ways they are categorized. As users
> >>>> > become more aware of the various ways that topics are
> >>>> > categorized and the reasons behind differences in systems
> >>>> > that define the categories (especially systems that are well
> >>>> > used and well defined) they should be able to develop more
> >>>> > and more comprehensive and mutually exclusive type properties
> >>>> > and topic attributes.
> >>>>
> >>>>
> >>>> This is a very good point, and well worth keeping in mind as we deal
> >>>> not only with book data, but many other types of data as well.
> >>>>
> >>>> Jeff
> >>>>
> >>>>
> >>>> _______________________________________________
> >>>> Data-modeling mailing list
> >>>> Data-modeling at freebase.com
> >>>> http://lists.freebase.com/mailman/listinfo/data-modeling
> >>>>
> >>>>
> >>>>
> >>>>