[Data-modeling] keeping Freebase topics and Wikipedia pages in sync; uncertainty in who is the composer
Brian Karlak
zenkat at metaweb.com
Wed Jul 22 17:51:23 UTC 2009
On Jul 22, 2009, at 7:44 AM, Iain Sproat wrote:
> And more frequent wikipedia imports to freebase can only help. Is
> it possible to use the wikipedia api to get more frequent updates -
> what are the issues involved here?
The first issue is spam/vandalism detection. These are usually found
by the wikipedia community process within a few days, so we wait at
least 3 days before creating a new topic in freebase.
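As a rough illustration of that hold-back rule, here is a minimal sketch. It assumes we record the time each new article was first seen; the names HOLD_BACK, ready_for_creation and select_ready are made up for this example and are not part of any real pipeline.

```python
from datetime import datetime, timedelta, timezone

# From the post: wait at least 3 days before creating a topic.
HOLD_BACK = timedelta(days=3)


def ready_for_creation(first_seen, now, hold_back=HOLD_BACK):
    """Return True once an article has existed for at least the hold-back period."""
    return now - first_seen >= hold_back


def select_ready(candidates, now):
    """candidates maps article title -> first-seen datetime (hypothetical structure).

    Returns the titles that are old enough, sorted for stable output.
    """
    return sorted(
        title for title, seen in candidates.items()
        if ready_for_creation(seen, now)
    )


if __name__ == "__main__":
    now = datetime(2009, 7, 22, tzinfo=timezone.utc)
    candidates = {
        "Old article": now - timedelta(days=5),
        "Fresh article": now - timedelta(days=1),
    }
    print(select_ready(candidates, now))
```

Anything younger than the window simply stays in the candidate set until a later run, which gives the wikipedia community time to revert spam or vandalism first.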
The second issue is the size and complexity of the data processing. We
use the wikipedia API to track every edit event to every page in
wikipedia, in near real-time. We get between 700K and 1M of these
every two weeks.
To run the data pipeline, we need to scrape the entire HTML and XML
dump for each touched article, parse them all through WEX, run the
key-based reconciliation to freebase, and then update/create the topics.
This takes a lot of time, and is most effectively done batch-wise on
hadoop. We think that we can reasonably run it every week -- more
than that and we spend more time scraping and processing churn than we
do getting new articles.
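To make the churn trade-off concrete, here is a small sketch under simplifying assumptions: edit events are (page title, revision id) pairs in time order, and each batch fetches every page touched in that batch exactly once. The function names are hypothetical and the numbers are toy data, not measurements.

```python
def touched_pages(edit_events):
    """Collapse edit events to the latest revision id seen for each page."""
    latest = {}
    for title, rev_id in edit_events:
        if title not in latest or rev_id > latest[title]:
            latest[title] = rev_id
    return latest


def fetches_needed(edit_events, batches):
    """Split the events into `batches` consecutive chunks and count page fetches.

    A page edited in several chunks is fetched once per chunk, so more
    frequent runs mean more fetches for the same set of edits.
    """
    events = list(edit_events)
    if not events:
        return 0
    size = -(-len(events) // batches)  # ceiling division
    return sum(
        len(touched_pages(events[i:i + size]))
        for i in range(0, len(events), size)
    )


if __name__ == "__main__":
    events = [("A", 1), ("B", 2), ("A", 3), ("A", 4), ("B", 5), ("C", 6)]
    for n in (1, 2, 6):
        print(n, "batch(es):", fetches_needed(events, n), "page fetches")
```

Frequently edited pages get re-scraped in every run that sees an edit, which is why running more often mostly adds reprocessing of already-known articles.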
One thing we have discussed is creating a "current awareness" queue
for new articles. This would be a list of wikipedia article names
which we know are not spam and which should be created ASAP. This would
be a new pipeline running in parallel with the current batch pipeline,
so it would require a fair amount of work to implement. However, it
could be useful both for "current awareness" articles and for
syncing community contributions.
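One possible shape for such a queue, purely as a sketch: it assumes some set of article names already known not to be spam (how that set gets populated is left open here), and the class and function names are invented for illustration.

```python
from collections import deque


class CurrentAwarenessQueue:
    """FIFO queue of article names to create ASAP, ignoring duplicates."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def add(self, title):
        """Enqueue a title once; return False if it was already queued."""
        if title in self._seen:
            return False
        self._seen.add(title)
        self._queue.append(title)
        return True

    def drain(self):
        """Return all pending titles in arrival order and empty the queue."""
        items = list(self._queue)
        self._queue.clear()
        return items


def route(titles, trusted, fast_queue):
    """Send trusted titles to the fast queue; return the rest for the batch run.

    `trusted` is a hypothetical set of names known not to be spam.
    """
    batch = []
    for title in titles:
        if title in trusted:
            fast_queue.add(title)
        else:
            batch.append(title)
    return batch


if __name__ == "__main__":
    queue = CurrentAwarenessQueue()
    leftover = route(["X", "Y", "X", "Z"], {"X", "Z"}, queue)
    print("fast:", queue.drain(), "batch:", leftover)
```

Everything not routed to the fast queue still goes through the regular batch pipeline, so the two paths run side by side.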
Brian