[Data-modeling] Complete rethinking of the publishing schemas, or oops!

Jeff Prucher jeff at metaweb.com
Thu Mar 27 22:37:28 UTC 2008


Freebase is working on a large import of data for books, and in the
process people have discovered some problems with the current models for
books, book editions, and authors. (By "current models", I mean the ones
that we pushed out last week, sad to say.) Fortunately, I think that the
revision I'm proposing here is actually better, anyway. I'm just sorry I
didn't think of it earlier.

The biggest problem is the CVT I introduced between "author" and "written
work". This turns out to burn a large amount of data for comparatively little
gain; it's the sort of thing that would be fine if we were only going to
have books numbering in the thousands, but the current load is probably
going to be much larger, and as Freebase expands, it will just keep growing.
So saving a little space now seems like a good option.  The new model I'm
proposing keeps the "author" and "written work" types, but links them
directly via two simple properties, rather than one property with a CVT: On
"written work", the properties are "author" and "editor"; the reverse
properties on "author" are "works written" and "works edited". This
maintains the distinction between the two roles, but without the CVT. The
other authorial types we were linking from the CVT (poet, reviewer,
playwright, etc.) refer more to the final product than the type of person
doing the writing, so I don't think they're strictly necessary. (It was a
fairly arbitrary set, anyway -- the difference between a poet and a
playwright is probably not significantly greater than that between a
novelist and journalist, say.) So in the newest model, the mode of
authorship is entirely omitted (except for editor) from the author/written
work relationship. The mode of authorship in a given instance can be
determined by the cotypes on the written work, if so desired.
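
As a rough illustration of the difference between the two shapes, here is a minimal Python sketch. It is not Freebase code: the function names, the dictionary layout, and the sample people and works ("Person A", "Work 1") are all hypothetical, and the only things taken from the proposal are the property names "author", "editor", "works written", and "works edited".

```python
# Hypothetical in-memory sketch of the two models; nothing here is a Freebase API.

def cvt_model(authorships):
    """Old shape: one mediator (CVT) node per (person, work, role) triple."""
    return [
        {"author": person, "written_work": work, "role": role}
        for person, work, role in authorships
    ]


def direct_model(authorships):
    """New shape: 'author' and 'editor' properties directly on the written work."""
    works = {}
    for person, work, role in authorships:
        props = works.setdefault(work, {"author": [], "editor": []})
        # Only the editor role stays distinct; every other mode becomes "author".
        key = "editor" if role == "editor" else "author"
        props[key].append(person)
    return works


def reverse_links(works):
    """Derive the reverse properties on 'author' from the written works."""
    people = {}

    def entry(person):
        return people.setdefault(
            person, {"works_written": [], "works_edited": []}
        )

    for work, props in works.items():
        for person in props["author"]:
            entry(person)["works_written"].append(work)
        for person in props["editor"]:
            entry(person)["works_edited"].append(work)
    return people


# Made-up sample data, for illustration only.
sample = [
    ("Person A", "Work 1", "novelist"),
    ("Person B", "Work 1", "editor"),
    ("Person A", "Work 2", "poet"),
]

cvt_nodes = cvt_model(sample)        # one extra node per authorship
works = direct_model(sample)         # no extra nodes, just properties
people = reverse_links(works)
```

In this sketch the CVT version creates one additional node for every authorship, while the direct version stores the same author/editor distinction as plain properties on the work. The dropped roles ("novelist", "poet") would be recovered from the cotypes on the written work, which the sketch does not model.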

To accommodate the data-load better, I've also added a new property to "book
edition", which will link it directly to the "author" type, rather than
having the only link to "author" be through the "book" type. This is a bit
of a denormalization, but allows us to accurately associate book editions to
authors without necessarily having to reconcile different editions of the
same book together. (This reconciliation is desirable, don’t get me wrong,
but it can be very difficult.) This property essentially mimics the way
"musical artist" and "musical track" are related in Freebase -- artists are
linked directly to both albums (which contain tracks) and the tracks
themselves. It also makes the book schema more easily compatible with MARC,
and probably other bibliographic schemata, which do not reconcile separate
book editions together, and will hopefully make it easier for naïve users
to input the books on their shelves directly without having to
figure out the whole book/book edition relationship. Ideally, of course, I'd
like to see all book editions reconciled to their books, but this can be
done post-hoc either by automated processes or geeky bibliophiles -- I mean,
the community. :)
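
The denormalized link can be pictured with a small Python sketch. The class and function names below are assumptions made for illustration, not part of the Freebase schema or any API; the point is only that an edition carries its own authors and that the link to a book can be filled in later.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Book:
    """Hypothetical stand-in for the "book" type."""
    title: str
    authors: List[str] = field(default_factory=list)


@dataclass
class BookEdition:
    """Hypothetical stand-in for the "book edition" type."""
    title: str
    # The new direct link from edition to author.
    authors: List[str] = field(default_factory=list)
    # The link to the book may be missing until someone reconciles it.
    book: Optional[Book] = None


def edition_authors(edition: BookEdition) -> List[str]:
    """Authors are available from the edition itself, reconciled or not."""
    return list(edition.authors)


def reconcile(edition: BookEdition, book: Book) -> BookEdition:
    """Post-hoc step: attach an edition to its book once a match is found."""
    edition.book = book
    return edition


# Made-up sample data: an edition entered straight from someone's shelf.
shelf_copy = BookEdition(title="Edition X", authors=["Person A"])
authors_before = edition_authors(shelf_copy)   # works with no book attached

# Later, an automated process or a volunteer finds the matching book.
reconcile(shelf_copy, Book(title="Book X", authors=["Person A"]))
```

The edition is usable as soon as it is entered, and reconciliation becomes an optional later step rather than a precondition for recording who wrote it.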

A final property being added to book, which is completely incidental to the
other problems, is "contributing authors". This is another denormalization
of sorts -- in the original schema, the only way to indicate that someone
had contributed to a book was via the "contents" property that connects
"published work" and "publication". This requires that the user know the
name of the work that is collected in the book, which is not always
available, and in some cases (such as textbooks and some reference books)
not really applicable. 

I'm thinking about leaving the illustrated work/illustration
instance/illustrator CVT relationship as it is, since I'm not convinced
removing the CVT will really help matters that much, but I'd like to hear
people's thoughts on that as well.

The affected types are:
http://sandbox.freebase.com/view/schema/book/author
http://sandbox.freebase.com/view/schema/book/written_work
http://sandbox.freebase.com/view/schema/book/book_edition
http://sandbox.freebase.com/view/schema/book/book

I put in some sample data to show the new relationships:
http://sandbox.freebase.com/view/en/jonathan_lethem

Please let me know what you think.

Thanks,

Jeff Prucher
Type Librarian & Ontologist
Metaweb Technologies, Inc. 


