[Developers] WEX and Text Search

Vivek Puri vp at startupsquad.com
Thu Jun 5 09:38:57 UTC 2008


I don't think Wikipedia provides text-only dumps. The dumps contain either
wiki markup or HTML tags. So yes, run strip_tags on the Wikipedia output (if
you are using PHP).
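
For anyone not on PHP, here is a minimal sketch of the same idea in Python,
using only the standard library. The function name strip_tags is borrowed
from PHP for familiarity and the helper class is hypothetical; this is not a
port of PHP's implementation. It only removes HTML tags and decodes entities.
It does nothing about wiki markup such as [[links]] or {{templates}}, which
would need a separate pass.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Hypothetical helper: keeps character data and ignores every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags; entities are already decoded.
        self.parts.append(data)


def strip_tags(html):
    """Return the text content of an HTML fragment with tags removed."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.parts)


if __name__ == "__main__":
    print(strip_tags("<p>Hello <b>world</b> &amp; more</p>"))
```

Text inside script or style elements is kept as ordinary text by this sketch,
so filter those out first if the input is a full HTML page.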


On Thu, Jun 5, 2008 at 3:15 AM, John Giannandrea <jg at metaweb.com> wrote:

>
> Winton Davies wrote:
> > I'm looking for the quickest way to create a full text inverted index
> > search of Wikipedia. I've not been hearing good things about MySQL
> > Text search, and don't see a really easy way to load the Page dumps.
>
> The easiest way to do this is probably to write an XML parser for the
> Wikipedia-supplied text dumps and load the result into Lucene or Nutch.
> In fact Wikimedia must have already done that, and since they are
> mostly open source....
>
> -jg
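
A rough sketch of the approach John describes, in Python with only the
standard library. Assumptions: the dump is a MediaWiki XML export with
<page>, <title> and <text> elements, and a small in-memory inverted index
stands in for Lucene or Nutch, which this sketch does not use. The sample
document, its namespace URI, and the function names are made up for
illustration.

```python
import io
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, text) for each <page>, reading the XML as a stream."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = "", ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text or ""
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # release the page's text and child elements


def build_index(pages):
    """Map each lowercased word to the set of page titles containing it."""
    index = defaultdict(set)
    for title, text in pages:
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(title)
    return index


# Tiny stand-in for a real dump; the namespace URI is illustrative only.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/">
  <page><title>Lucene</title>
    <revision><text>Lucene is a search library.</text></revision></page>
  <page><title>Nutch</title>
    <revision><text>Nutch is a crawler built on Lucene.</text></revision></page>
</mediawiki>"""

index = build_index(iter_pages(io.BytesIO(SAMPLE)))

if __name__ == "__main__":
    print(sorted(index["lucene"]))
```

For a real dump, pass an open file (for example from bz2.open on the
compressed dump) to iter_pages instead of the BytesIO sample, and replace
build_index with whatever feeds your search engine. The page text is still
wiki markup at this point, so it would need cleaning before indexing.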



-- 
Vivek Puri
GTalk: vp at vivekpuri.com