Rob's Blog Coding, cooking and, well, not much else

24 Feb 08

Forage 0.1 Released – Offline Wikipedia Example

Today the first release of Forage is out. It comes with core support for Solr, Xapian and ZSL (Zend Search Lucene), and faceting support for Solr. Faceting support for Xapian will hopefully arrive with the release of Xapian 1.1 in the near future. You can download Forage here. In this post I'm going to walk through the main example found in the Forage source code: offline Wikipedia.

This example is a bit more involved than the previous examples, as it shows the need for backends other than ZSL when dealing with large sets of data. To run it you're going to need between 15GB and 30GB of disk space; the uncompressed Wikipedia download alone is 14GB. You're also going to need a computer which you can leave chugging away for a few hours at a time, because we're dealing with some pretty serious amounts of data here.

The Wikipedia Example

The example I'm going to walk through today is to produce an offline, vertical Wikipedia search engine for a particular field. You could do a full Wikipedia search engine but that would take quite a bit more time and space. As you may know, I'm also quite interested in food and cooking so my vertical is going to be food related, but we'll come to that later.

Step 1. Getting Wikipedia

The first step in indexing Wikipedia is to download its entire content. Go to the Wikipedia data dump download page and have a look at what's there; the one we'll be using is enwiki-latest-pages-articles.xml.bz2. This file contains all Wikipedia article and redirect pages in an easy to parse XML format. This is a big file (the one I downloaded was 3.2GB) so don't even try to download it through your browser. Use something like wget, and make sure the tool you use can resume downloads: you don't want to get 80% through the download and then have to start again. With wget it's as easy as the command below; the -c flag tells it to continue a partially downloaded file.

rob@home:~/wikipedia# wget -c http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

This will take a while. Even if you have a fast connection, Wikipedia won't serve it up at lightning speed; I left it overnight and it was done in the morning.

Once you have the XML dump of Wikipedia you need to unpack it and prepare it. It is compressed with bzip2, so unpacking is, again, as easy as calling bunzip2.

rob@home:~/wikipedia# bunzip2 enwiki-latest-pages-articles.xml.bz2

This will take a while and hammer the CPU so go make a cup of tea and read the paper for a few minutes...

Once it has unpacked you will have one enormous XML file (about 14GB). Unless you have configured PHP specifically, it will not be able to handle files this big (if you have, you probably know about it and can skip to the next section). The easiest solution to this problem is to chop the file up into lots of smaller files, and UNIX provides a very good tool for this: split. We need files smaller than 1GB, so let's split it into 900MB chunks. The following command will give us 15-16 files named enwiki-chunks-aa, enwiki-chunks-ab, etc.

rob@home:~/wikipedia# split --bytes=900m enwiki-latest-pages-articles.xml enwiki-chunks-
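The "15-16 files" estimate can be checked with a little arithmetic. Here is a rough sketch in PHP, assuming GNU split (where the m suffix means 1,048,576 bytes) and its default two-letter suffixes. The function names chunkCount and chunkNames are made up for this illustration and are not part of Forage.

```php
<?php
// Hypothetical helpers for illustration only; not part of Forage.

// Number of chunks split produces for a file of $totalBytes.
function chunkCount(int $totalBytes, int $chunkBytes): int
{
    return (int) ceil($totalBytes / $chunkBytes);
}

// File names split generates with its default two-letter suffixes (aa, ab, ...).
function chunkNames(int $count, string $prefix): array
{
    $names = [];
    for ($i = 0; $i < $count; $i++) {
        $names[] = $prefix . chr(97 + intdiv($i, 26)) . chr(97 + $i % 26);
    }
    return $names;
}

$chunk = 900 * 1024 * 1024;                  // --bytes=900m, assuming GNU split
$low   = chunkCount(14 * 1000 ** 3, $chunk); // if "14GB" means 14 x 10^9 bytes
$high  = chunkCount(14 * 1024 ** 3, $chunk); // if it means 14 x 2^30 bytes

echo "$low to $high chunks\n";
echo implode(' ', chunkNames($high, 'enwiki-chunks-')), "\n";
```

Which end of the range you land on depends on how big your dump really is, so treat the numbers as a sanity check rather than a prediction.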

Step 2. Preparing The Data

Now that we have our Wikipedia data in a form that PHP can read we can move to the Forage install directory and start playing with the example. Change directory to the examples/wikipedia directory of your Forage install. First we want to filter the full Wikipedia dump down to our vertical, which is done with the writer.php script. If you want to index Wikipedia in its entirety you can skip straight to the Indexing section. This script parses the split Wikipedia dump, extracts articles by their category and inserts those which match your patterns into a stripped down XML file. You can get a brief manual of this script with the following command.

rob@home:~/forage/examples/wikipedia# php writer.php --help

For this script to work you need to have a categories.lst file containing PCRE regular expressions, one per line. The categories.lst I'm using is in the Forage release and looks like this.

#food#i
#cook#i
#recipe#i
#meat#i
#vegetable#i
#poultry#i
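To give a feel for how patterns like these select articles, here is a minimal sketch using PHP's preg_match. It is not the code from writer.php; the function name matchesVertical is invented for illustration, and the category names are just sample inputs.

```php
<?php
// Illustration only: not the actual writer.php logic.
// Returns true if any category matches any of the PCRE patterns.
function matchesVertical(array $patterns, array $categories): bool
{
    foreach ($categories as $category) {
        foreach ($patterns as $pattern) {
            if (preg_match($pattern, $category) === 1) {
                return true;
            }
        }
    }
    return false;
}

// The same patterns as the categories.lst listing above.
$patterns = ['#food#i', '#cook#i', '#recipe#i', '#meat#i', '#vegetable#i', '#poultry#i'];

// Sample category names used as input.
var_dump(matchesVertical($patterns, ['British snack foods']));   // bool(true)
var_dump(matchesVertical($patterns, ['Cookies']));               // bool(true), via #cook#i
var_dump(matchesVertical($patterns, ['Programming languages'])); // bool(false)
```

Because the patterns are unanchored and case-insensitive, they match anywhere inside a category name, which keeps the list short but also makes it fairly greedy.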

So, to run this script with verbose output and write the result to the same path as the Wikipedia dump, I would use the command below. It produces a much more manageable XML file containing just the data we're going to index, for the subset of articles in the main dump that match. This will take a seriously long time: it has to parse over 4 million documents, extract their categories and match them against the set of regular expressions before deciding whether or not each one should go into the vertical search. When I ran it on my laptop it took almost 9 hours!

rob@home:~/forage/examples/wikipedia# php writer.php --verbose --categories=./categories.lst --target=~/wikipedia/vertical-food.xml ~/wikipedia/enwiki-chunks*

Step 3. Indexing

At this point we have a stripped down XML document containing Wikipedia articles for our chosen vertical. The next step is to get them into an index so that we can search over them. For this example I'm going to use Solr because it is the only backend to support faceting at the moment. I will not discuss the installation and configuration of Solr beyond providing the required schema.xml. To run the indexer we use the provided parser.php, which works the same way as writer.php but, rather than writing to another XML file, writes to an index through Forage. The following command will index our newly created food vertical with Solr. We have to use source='mini' because writer.php uses a stripped XML format; if we were indexing straight from the raw Wikipedia dump we would use source='full'.

rob@home:~/forage/examples/wikipedia# php parser.php --verbose --index='solr:127.0.0.1:8080' --source='mini' ~/wikipedia/vertical-food.xml

This is pretty damned quick actually. On my little laptop again, it managed to index almost 8000 documents in just under 4 minutes, which works out at about 35 documents per second.

Step 4. Searching

This tutorial wouldn't be much if it didn't get on to searching, so here we are. We have extracted almost 8000 articles from Wikipedia which we are interested in and we have indexed them, via Forage, in Solr. Now it's time to search over them. Let's start off with a simple query for biscuits.

rob@home:~/forage/examples# php searcher.php --index='solr:127.0.0.1:8080' --limit=5 biscuit
Broken biscuits (2.7441695)
Bourbon biscuit (2.553938)
Empire biscuit (2.4821944)
Malted milk (biscuit) (2.3644874)
Category:United Biscuits brands (2.3402355)

Comes back in no time at all! There are a couple of other features shown in the examples, so let's take a look at them now. We can sort the results by something other than the score: in the Solr schema.xml we've got a special field to help sorting by title, called sort_title.

rob@home:~/forage/examples# php searcher.php --index='solr:127.0.0.1:8080' --limit=10 --sortasc=sort_title biscuit
2007 pet food recalls (0.24132447)
5-in-1 ration (0.5515988)
AFC Enterprises (0.6894985)
ANZAC biscuit (2.1496434)
Alfajor (0.8359725)
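Notice that 'ANZAC biscuit' comes before 'Alfajor' in that list. That ordering is what you get from a plain case-sensitive string sort, where digits come first and all uppercase letters come before lowercase ones. Whether sort_title is actually configured that way in the schema is an assumption on my part; this sketch only reproduces the ordering in plain PHP.

```php
<?php
// The five titles from the output above, deliberately shuffled.
$titles = [
    'Alfajor',
    'ANZAC biscuit',
    '5-in-1 ration',
    'AFC Enterprises',
    '2007 pet food recalls',
];

// Byte-wise, case-sensitive comparison: digits < uppercase < lowercase.
$caseSensitive = $titles;
usort($caseSensitive, 'strcmp');

// Case-insensitive comparison, for contrast.
$caseInsensitive = $titles;
usort($caseInsensitive, 'strcasecmp');

print_r($caseSensitive);   // matches the order in the search output
print_r($caseInsensitive); // 'Alfajor' now sorts before 'ANZAC biscuit'
```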

Finally we're going to take a look at the faceting example. With the Wikipedia data we're faceting on the category that appears at the bottom of all Wikipedia pages using the example script faceting.php. Let's look at a quick default example with our search term, biscuit. Note that this example doesn't output the actual document titles because it would clutter the output but they are still available in the ForageResponse object as in all the other examples.

rob@home:~/forage/examples# php faceting.php --index='solr:127.0.0.1:8080' biscuit
Querying for: biscuit
Total: 262
Cookies (38)
Brand name cookies (27)
British snack foods (23)
Food manufacturers of the United Kingdom (14)

Filtering on 'Cookies'
Total: 38
Christmas food (4)
Scottish cuisine (3)
Australian snack foods (2)
Brand name cookies (2)

Filtering on 'Christmas food' and 'Cookies'
Total: 4
Austrian cuisine (1)
Danish cuisine (1)
Dutch cuisine (1)
French desserts (1)

You can see here that the first search for 'biscuit' returned a total of 262 documents, and the top four facets are shown. We then take the first category and filter on that, which reduces the result set to 38 documents. We then take the top category again and add it to our filter, and we now have only 4 documents. A perfect example of how faceting lets us drill down through large numbers of results really quickly.
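The drill-down logic itself is simple enough to sketch in plain PHP without a search engine. The snippet below is not how Solr or Forage implement faceting. It just counts category values over a small in-memory array and then filters on the top category, mirroring the steps above. The sample documents and the function names facetCounts and filterByCategory are made up for illustration.

```php
<?php
// Illustration only: in-memory faceting, not Solr's or Forage's implementation.

// Count how many documents each category appears in, most frequent first.
function facetCounts(array $documents): array
{
    $counts = [];
    foreach ($documents as $doc) {
        foreach (array_unique($doc['categories']) as $category) {
            $counts[$category] = ($counts[$category] ?? 0) + 1;
        }
    }
    arsort($counts);
    return $counts;
}

// Keep only the documents that carry the given category.
function filterByCategory(array $documents, string $category): array
{
    return array_values(array_filter(
        $documents,
        fn(array $doc): bool => in_array($category, $doc['categories'], true)
    ));
}

// Made-up sample documents.
$documents = [
    ['title' => 'Doc A', 'categories' => ['Cookies', 'Scottish cuisine']],
    ['title' => 'Doc B', 'categories' => ['Cookies', 'Christmas food']],
    ['title' => 'Doc C', 'categories' => ['Cookies', 'Christmas food']],
    ['title' => 'Doc D', 'categories' => ['British snack foods']],
];

$facets = facetCounts($documents);
echo "Total: " . count($documents) . "\n";

// Filter on the most frequent category, then facet again on what is left.
$top       = array_key_first($facets);
$filtered  = filterByCategory($documents, $top);
$remaining = facetCounts($filtered);
unset($remaining[$top]); // the applied filter is not offered as a facet again

echo "Filtering on '$top'\nTotal: " . count($filtered) . "\n";
foreach ($remaining as $category => $count) {
    echo "$category ($count)\n";
}
```

A real engine does this counting inside the index rather than by looping over documents, which is why it stays fast on large result sets.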

I hope you enjoyed this little walk through a real-world example of using Forage and its features. This example is bundled with the release so, with a few minutes' work and a few hours waiting for your computer, you could have your own offline vertical Wikipedia search engine, and have learnt how to implement search on your site to boot!

15 Feb 08

Faceted Search With Forage

Today I added support for faceted search to Forage. Faceted search is a way of drilling down into search results by filtering on particular fields or categories. The example below is relatively simple in that it only has one facet, category, but there is no reason why you can't have multiple different 'facets'. This is, in fact, done quite often and very successfully in product searches.

And on to the example

This example is an extension of the previous example, only this time we'll be using one of the BBC's news feeds. The reason we're going to use the BBC's news feeds this time is that they have an extra element in the feed, category. We're going to use this category as a facet field and then list the most popular categories in the feed.

<?php
require_once dirname(__FILE__) . '/../lib/Forage.php';
require_once 'Zend/Feed.php';

$feed   = Zend_Feed::import('http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml');

// we're now using Solr as the engine, more about this later...
$forage = Forage::create('solr:127.0.0.1:8080');

foreach ($feed as $item) {
  $document = new ForageDocument();
  $document->add('title',       (string)$item->title())
           ->add('link',        (string)$item->link(), array('indexed'=>false))
           ->add('description', (string)$item->content(), array('stored'=>false))
           // we mark the facet field as such at index time
           ->add('category',    (string)$item->category(), array('facet'=>true));

  $forage->add($document);
}
$forage->flush();

// create an empty query and tell it that we want to
// facet on 'category' before searching
$query    = $forage->getQuery();
$query->setFacetFields(array('category'));
$response = $forage->search($query);

echo "Total Results: " . $response->getTotal() . "\n";

// get the category facet from the response
$facet = $response->getFacet('category');
$i=0;
// loop over the category values showing the top three
// along with the number of documents each appears in.
foreach ($facet->values as $value) {
  // $i counts the values printed so far; stop once three are shown
  if ($i++ >= 3) {
    break;
  }
  echo $value->value . " (" . $value->count . ")\n";
}

At the moment faceting is only supported by the Solr engine, but Xapian will be adding faceting support in version 1.1. You can check the code out of subversion, and more detailed documentation is in the wiki.

6 Feb 08

Introducing Forage – Search Abstraction for PHP

Recently I've been working on a search abstraction library for PHP called Forage. The idea is
to bring to search what we've had for relational databases for quite a while: abstraction. On Friday I put up a preview release with three
backends: Solr, Xapian and Zend Search Lucene (ZSL). At the moment it has the bare minimum of features but there will be more soon. In this post
I'm going to talk a little about the motivation for the project and then walk through a short example.

So why do we need search abstraction?

The reasons for wanting an abstraction library for search are pretty much the same as for databases. Ease of integration and resilience to change.

Ease of integration

If you have one interface which provides access to multiple backends then a framework (or other application) can use this interface and then allow
the user to choose which backend to use depending on their needs and abilities. It also allows the users of the framework to scale their solutions
as they grow, although this is really the second point.

Resilience to change

If you have one interface which provides access to multiple backends then once you've implemented your solution you can change the backend if you
need to. With relational databases this is rarely done but with search, certainly in PHP at the moment, there is a bit more of a need for it. Let's
say you have a small site which does something cool. You need a search solution up and running very quickly without rocking the boat too much so
you use ZSL and it works very well. However, your site starts to get more popular (as sites which do cool things do) and it starts to creak,
you decide you need to scale up to a more capable solution such as Solr. If you're not using an abstraction layer, at this point you have to
re-implement your search module. With Forage you just need to set up your Solr server and change the DSN from 'zsl:/path/to/index' to 'solr:host:port/path'
and re-index. Job done!
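The DSNs shown here are just a backend name, a colon, and a backend-specific location. As a rough illustration of why swapping backends can be a one-line change, here is a hypothetical sketch of splitting such a string. splitDsn is a made-up helper, not Forage's own parser, and the config array is an assumption too.

```php
<?php
// Hypothetical helper, not part of Forage: split "backend:location".
function splitDsn(string $dsn): array
{
    // Limit of 2 so colons inside the location (host:port) are kept intact.
    $parts = explode(':', $dsn, 2);
    if (count($parts) !== 2 || $parts[0] === '' || $parts[1] === '') {
        throw new InvalidArgumentException("Malformed DSN: $dsn");
    }
    return ['backend' => $parts[0], 'location' => $parts[1]];
}

// Keeping the DSN in one place means switching backends touches one line.
$config = ['search_dsn' => 'zsl:/path/to/index'];
// $config = ['search_dsn' => 'solr:127.0.0.1:8080'];

print_r(splitDsn($config['search_dsn']));
```

With the DSN in a single config value, the rest of the application only ever talks to the common interface and never needs to know which backend is behind it.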

Enough talk, let's play!

To show you how easy it is implementing search with Forage let's run through a little example. For this example I'm going to index some data
out of an RSS feed. I'll be using Zend_Feed from the Zend Framework and for the backend
to Forage I'm going to use Xapian. I'm just going to index all the items and then run a search over the index.

<?php
require_once 'Zend/Feed.php';
require_once 'Forage/Forage.php';

// import the feed
$feed   = Zend_Feed::import('http://rss.slashdot.org/Slashdot/slashdot');

// initialise forage
$forage = Forage::create('xapian:/var/xapian/slashdot');

// iterate over the feed items
foreach ($feed as $item) {
  // create a new document
  $document = new ForageDocument();

  // add some fields to it
  $document->add('title', (string)$item->title()) // will be both indexed and stored
           // won't be indexed but will be stored
           ->add('link', (string)$item->link(), array('indexed'=>false))
           // will be indexed but won't be stored
           ->add('description', (string)$item->content(), array('stored'=>false));

  // add the document to the index
  $forage->add($document);
}
// flush the changes to the index
$forage->flush();

// search over the index
$results = $forage->search('yahoo microsoft');
foreach ($results as $document) {
  echo $document['title'] . "\n";
}

That's not bad is it? A feed indexing program in under 70 lines of code. If you're interested then get over to the
Forage download page and give it a whirl, and if you can, get involved.
