Forage 0.1 Released - Offline Wikipedia Example

24th February 2008

Today the first release of Forage is out. It comes with core support for Solr, Xapian and ZSL and faceting support for Solr. Faceting support for Xapian will hopefully be arriving with the release of Xapian 1.1 in the near future. You can download Forage here. In this post I'm going to walk through the main example found in the Forage source code, offline Wikipedia.

This example is a bit more involved than the previous examples, as it shows the need for backends other than ZSL when dealing with large sets of data. To run these examples you're going to need between 15GB and 30GB of disk space; the uncompressed Wikipedia download alone is 14GB. You're also going to need a computer which you can leave chugging away for a few hours at a time, because we're dealing with some pretty serious amounts of data here.

The Wikipedia Example

The example I'm going to walk through today is to produce an offline, vertical Wikipedia search engine for a particular field. You could do a full Wikipedia search engine but that would take quite a bit more time and space. As you may know, I'm also quite interested in food and cooking so my vertical is going to be food related, but we'll come to that later.

Step 1. Getting Wikipedia

The first step in indexing Wikipedia is to download its entire content. Go to the Wikipedia data dump download page and have a look at what's there; the one we'll be using is enwiki-latest-pages-articles.xml.bz2. This file contains all Wikipedia article and redirect pages in an easy-to-parse XML format. This is a big file (the one I downloaded was 3.2GB), so don't even try to download it through your browser. Use something like wget, and make sure the tool you use can resume downloads: you don't want to get 80% of the way through and then have to start again. With wget it's as easy as the command shown below.

 
rob@home:~/wikipedia# wget http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
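
If the download does get interrupted, wget can pick up where it left off with its -c (--continue) option. The wrapper below is only a small sketch of that idea; fetch_dump is a name I've made up for illustration and is not part of Forage.

```shell
# Sketch: resume an interrupted download instead of starting over.
# fetch_dump is a hypothetical helper name, not part of Forage or wget.
fetch_dump() {
    # -c (--continue) resumes a partially downloaded file where it stopped
    wget -c "$1"
}

# Usage (run the same command again after an interruption):
# fetch_dump http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
```

Running the same command a second time in the same directory is what triggers the resume, so don't delete the partial file first.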

This will take a while. Even if you have a fast connection, Wikipedia won't serve it up at lightning speed; I left it overnight and it was done in the morning.

Once you have the XML dump of Wikipedia you need to unpack it and prepare it. It is compressed with bzip2, so you will need to unpack it, which again is as easy as calling bunzip2.

 
rob@home:~/wikipedia# bunzip2 enwiki-latest-pages-articles.xml.bz2
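
Since the download took all night, you may prefer to check the archive and keep the compressed copy while unpacking, assuming you have the extra 3.2GB to spare. This is a cautious variant rather than a required step; unpack_dump is a hypothetical helper name.

```shell
# Sketch: verify the archive first, then unpack while keeping the .bz2 file.
# unpack_dump is a hypothetical helper name used only for illustration.
unpack_dump() {
    # -t tests the archive's integrity without writing any output
    bzip2 -t "$1" || return 1
    # -k keeps the compressed file instead of deleting it after unpacking
    bunzip2 -k "$1"
}

# Usage:
# unpack_dump enwiki-latest-pages-articles.xml.bz2
```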

This will take a while and hammer the CPU so go make a cup of tea and read the paper for a few minutes...

Once it has unpacked you will have one enormous XML file (about 14GB). Unless you have configured PHP specifically, it will not be able to handle files this big (if you have, you probably know about it and can skip to the next section). The easiest solution is to chop the file up into lots of smaller files, and UNIX provides a very good tool for this: split. We need files smaller than 1GB, so let's split it into 900MB chunks. The following command will give us 15-16 files named enwiki-chunks-aa, enwiki-chunks-ab, etc.

 
rob@home:~/wikipedia# split --bytes=900m enwiki-latest-pages-articles.xml enwiki-chunks-
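
The 15-16 figure is just the file size divided by the chunk size, rounded up. Here is that arithmetic as a quick sketch; chunk_count is a made-up helper, and I'm assuming a dump of exactly 14GB counted as 14 * 1024 MB.

```shell
# Sketch: estimate how many chunks split will produce.
# chunk_count is a hypothetical helper; both arguments must use the same unit.
chunk_count() {
    total=$1
    chunk=$2
    # integer ceiling division: a final partial chunk still needs its own file
    echo $(( (total + chunk - 1) / chunk ))
}

# A 14GB file (assumed to be 14 * 1024 MB) cut into 900MB chunks
chunk_count $((14 * 1024)) 900
```

Once split has finished, ls enwiki-chunks-* | wc -l will tell you how many files you actually got.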

Step 2. Preparing The Data

Now that we have our Wikipedia data in a form that PHP can read, we can move to the Forage install directory and start playing with the example. Change directory to the examples/wikipedia directory of your Forage install. First we want to filter the full Wikipedia dump down to our vertical, which is done with the writer.php script. If you want to index Wikipedia in its entirety you can skip straight to the Indexing section. This script will parse the split Wikipedia dump, extracting articles by their category and inserting those which match your patterns into a stripped-down XML file. You can get a brief manual for this script with the following command.

 
rob@home:~/forage/examples/wikipedia# php writer.php --help

For this script to work you need to have a categories.lst file containing PCRE regular expressions, one per line. The categories.lst I'm using is in the Forage release and looks like this.

 
#food#i
#cook#i
#recipe#i
#meat#i
#vegetable#i
#poultry#i
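
Each pattern uses # as the PCRE delimiter and the i flag for case-insensitive matching, and because none of them are anchored they match anywhere inside a category name (so #cook#i will also catch "Cookies"). The sketch below imitates that behaviour with grep so you can try patterns out quickly. It is not how writer.php works internally, matches_vertical is a made-up helper, and grep's basic regular expressions only approximate PCRE for simple patterns like these.

```shell
# Sketch: test one category name against patterns written as #regex#i.
# matches_vertical is a hypothetical helper, not part of writer.php.
matches_vertical() {
    category=$1
    shift
    for pattern in "$@"; do
        # strip the leading '#' delimiter and the trailing '#i' (delimiter plus flag)
        body=${pattern#\#}
        body=${body%\#i}
        # -i mirrors the PCRE 'i' flag; -q makes grep report only via exit status
        if printf '%s\n' "$category" | grep -qi -e "$body"; then
            return 0
        fi
    done
    return 1
}

matches_vertical "Cookies" '#food#i' '#cook#i' && echo "kept: Cookies"
matches_vertical "Scottish cuisine" '#food#i' '#cook#i' || echo "skipped: Scottish cuisine"
```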

So, to run this script with verbose output and write the result to the same directory as the Wikipedia dump, I would use the command below. It gives you a much more manageable XML file containing just the data we're going to index, for the subset of articles in the main dump that match our categories. This will take a seriously long time: the script has to parse over 4 million documents, extract their categories and match them against the set of regular expressions before deciding whether or not each one should go into the vertical search. When I ran it on my laptop it took almost 9 hours!

 
rob@home:~/forage/examples/wikipedia# php writer.php --verbose --categories=./categories.lst --target=~/wikipedia/vertical-food.xml ~/wikipedia/enwiki-chunks*

Step 3. Indexing

At this point we have a stripped-down XML document containing Wikipedia articles for our chosen vertical. The next step is to get them into an index so that we can search over them. For this example I'm going to use Solr because it is the only backend to support faceting at the moment. I will not discuss the installation and configuration of Solr beyond providing the required schema.xml. To run the indexer we need to use the provided parser, which works the same way as writer.php, but rather than writing to another XML file it writes to an index through Forage. The following command will index our newly created food vertical with Solr (we have to use source='mini' because writer.php uses a stripped XML format; if we were indexing straight from the raw Wikipedia dump we would use source='full').

 
rob@home:~/forage/examples/wikipedia# php parser.php --verbose --index='solr:127.0.0.1:8080' --source='mini' ~/wikipedia/vertical-food.xml

This is pretty damned quick actually. On my little laptop again, it managed to index almost 8000 documents in just under 4 minutes, which works out at about 35 documents per second.

Step 4. Searching

This tutorial wouldn't be much if it didn't get on to searching, so here we are. We have extracted almost 8000 articles that we are interested in from Wikipedia and indexed them, via Forage, in Solr; now it's time to search over them. Let's start off with a simple query for biscuits.

 
rob@home:~/forage/examples# php searcher.php --index='solr:127.0.0.1:8080' --limit=5 biscuit
Broken biscuits (2.7441695)
Bourbon biscuit (2.553938)
Empire biscuit (2.4821944)
Malted milk (biscuit) (2.3644874)
Category:United Biscuits brands (2.3402355)

It comes back in no time at all! There are a couple of other features shown in the examples, so let's take a look at them now. We can sort the results by something other than the score: in the Solr schema.xml we've got a special field, sort_title, to help with sorting by title.

 
rob@home:~/forage/examples# php searcher.php --index='solr:127.0.0.1:8080' --limit=10 --sortasc=sort_title biscuit
2007 pet food recalls (0.24132447)
5-in-1 ration (0.5515988)
AFC Enterprises (0.6894985)
ANZAC biscuit (2.1496434)
Alfajor (0.8359725)
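
If you're wondering why "ANZAC biscuit" comes before "Alfajor", the ordering above looks like plain byte order: digits first, then uppercase letters, then lowercase. That is my reading of the output rather than a statement about how sort_title is defined in schema.xml, but you can reproduce the same order with sort in the C locale.

```shell
# Sketch: byte-order sorting puts digits first, then A-Z, then a-z.
# The titles are the five from the output above, deliberately shuffled.
sorted=$(printf '%s\n' \
    "Alfajor" \
    "ANZAC biscuit" \
    "2007 pet food recalls" \
    "AFC Enterprises" \
    "5-in-1 ration" | LC_ALL=C sort)

# Print the titles in byte order
printf '%s\n' "$sorted"
```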

Finally we're going to take a look at the faceting example. With the Wikipedia data we're faceting on the category that appears at the bottom of all Wikipedia pages using the example script faceting.php. Let's look at a quick default example with our search term, biscuit. Note that this example doesn't output the actual document titles because it would clutter the output but they are still available in the ForageResponse object as in all the other examples.

 
rob@home:~/forage/examples# php faceting.php --index='solr:127.0.0.1:8080' biscuit
Querying for: biscuit
Total: 262
Cookies (38)
Brand name cookies (27)
British snack foods (23)
Food manufacturers of the United Kingdom (14)
 
Filtering on 'Cookies'
Total: 38
Christmas food (4)
Scottish cuisine (3)
Australian snack foods (2)
Brand name cookies (2)
 
Filtering on 'Christmas food' and 'Cookies'
Total: 4
Austrian cuisine (1)
Danish cuisine (1)
Dutch cuisine (1)
French desserts (1)

You can see here that the first search for 'biscuit' returned a total of 262 documents, and the top four facets are shown. We then take the first category and filter on it, which reduces the result set to 38 documents. We then take the top category again and add it to our filter, and we now have only 4 documents. A perfect example of how faceting lets us drill down through a large number of results really quickly.
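
To make the drill-down idea concrete, here is a toy version of it using awk over a handful of rows. Solr does all of this inside the index, so this is only a sketch of the concept and not of Forage's or Solr's implementation; facet_counts and filter_by are made-up helpers, and the category assignments in the sample rows are invented for illustration.

```shell
# Sketch: facet counting and drill-down over lines of "title<TAB>cat1|cat2|...".
# facet_counts and filter_by are hypothetical helpers for illustration only.
TAB=$(printf '\t')

# Print "count<TAB>category" for every category, most frequent first.
facet_counts() {
    awk -F'\t' '{ n = split($2, cats, "|"); for (i = 1; i <= n; i++) count[cats[i]]++ }
        END { for (c in count) printf "%d\t%s\n", count[c], c }' |
        sort -t "$TAB" -k1,1nr -k2,2
}

# Keep only the documents that carry the given category.
filter_by() {
    awk -F'\t' -v want="$1" '{ n = split($2, cats, "|")
        for (i = 1; i <= n; i++) if (cats[i] == want) { print; break } }'
}

# Invented sample rows: the titles are real, the category lists are made up.
docs=$(printf '%s\t%s\n' \
    "Shortbread" "Cookies|Scottish cuisine|Christmas food" \
    "Bourbon biscuit" "Cookies|British snack foods" \
    "Empire biscuit" "Cookies|Scottish cuisine" \
    "Ship's biscuit" "Military food")

echo "All documents:"
printf '%s\n' "$docs" | facet_counts

echo "Filtering on 'Cookies':"
printf '%s\n' "$docs" | filter_by "Cookies" | facet_counts

echo "Filtering on 'Cookies' and 'Scottish cuisine':"
printf '%s\n' "$docs" | filter_by "Cookies" | filter_by "Scottish cuisine" | cut -f1
```

Each filter only ever narrows the set that the next facet count runs over, which is why the totals shrink at every step.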

I hope you enjoyed this little walk through a real-world example of using Forage and its features. This example is bundled with the release so, with a few minutes' work and a few hours waiting for your computer, you could have your own offline vertical Wikipedia search engine, and have learnt how to implement search on your site to boot!
