Rob's Blog Coding, cooking and, well, not much else

6Oct/098

Chunking Large Queries with Iterators in PHP

When executing large queries it's usually best not to load the whole result set in one go. Memory isn't infinite and PHP isn't renowned for handling it very well. So the obvious answer is to chunk the large query in to lots of smaller queries. This is a simple method I use for hiding the fact the query is being chunked behind an iterator.

We want something to which we can just provide a PDO object, an SQL query and the chunk size. We should then be able to iterate over the resulting object as though it were a single result set. Check out the code for ChunkedQueryIterator.

$sql    = "SELECT * FROM test LIMIT :offset, :limit";
$limit  = 1000;
$result = new ChunkedQueryIterator($pdo, $sql, $limit);
foreach ($result as $row) {
  // do stuff
}

This works well but the interface is a little clunky. What would be really nice is if we could execute chunked queries directly from our PDO object, like in the example below. We can achieve this with another common design pattern, the decorator. We only need to add the one method used above, here's the source for ChunkedQueryDecorator.

$pdo = new ChunkedQueryDecorator($pdo);
foreach ($pdo->chunkedQuery($sql, $limit) as $row) {
  // do stuff
}

How do you handle large database queries? Let me know in the comments...

15Mar/080

Playing with Carrot2 Clustering in PHP

The Carrot2 clustering engine has been on my radar for a couple of months now. It calls itself a 'search results clustering engine' which means that provided with a set of search results (titles and snippets) it will give back that same set grouped into clusters. In this post I'm going to show you how you can use Carrot2 and PHP to cluster your search results.

Carrot2 Clustering Beer

You can see a demo of Carrot2 running on their site which clusters results from a variety of popular search engines. The online demo, however, isn't what interests me about this project, there are other, better, clustering search engines out there like Clusty. What I'm interested in is the fact that Carrot2 is an open source project which can be downloaded and run locally. For integrating with PHP it is probably easiest to use the Document Clustering Server available from the download page. The DCS exposes the interface over REST or XML-RPC. Although there is an example of using RPC with PHP, for this post I'm going to integrate using the REST interface.

Setting Up The Document Clustering Server

Before I get any further I should point out that you need to have a recent version of Java, I'm using Java 1.5. The first step to getting the DCS up and running is downloading it from the download page and unpacking it. The zip file doesn't contain a top level directory so you'll need to unpack it into one. Once you have it unpacked start up the server as directed in the readme.txt and test with the index.html found in the root directory.

Carrot2 DCS

I've knocked together a quick and dirty class for interacting with the REST interface to show you how easy Carrot2 is to work with. The example below is much like the Forage faceting example. A bunch of feeds are downloaded and fed into Carrot2, then we show a summary of the first 5 clusters.

<?php
require 'Carrot2.class.php';
require 'Zend/Feed.php';

$urls = array(
  'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/business/rss.xml',
  'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/entertainment/rss.xml',
  'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/education/rss.xml'
);

$carrot = Carrot2::createDefault();

foreach ($urls as $url) {
  echo "Loading field: " . $url . "\n";
  $feed = Zend_Feed::import($url);
  foreach ($feed as $item) {
    $carrot->addDocument(
      (string)$item->link(),
      (string)$item->title(),
      (string)$item->description()
    );
  }
}
$i=0;
foreach ($carrot->clusterQuery() as $cluster) {
  if ($i++>5) break;
  echo $cluster . "\n";
}
?>
Gives:
Loading field: http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/business/rss.xml
Loading field: http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/entertainment/rss.xml
Loading field: http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/education/rss.xml
Schools (9)
Head Teachers (6)
Funds (3)
Language (2)
Gold Hits (2)
Savings Scheme (2)

It seems to work really well for these feeds. I have found that it doesn't always work so well. If you're dealing with small result sets or sets with too much diversity Carrot2 won't be able to extract clusters and you'll end up with a few clusters with only one item in and loads of documents in the '(other)' group.

Tagged as: , No Comments
24Feb/080

Forage 0.1 Released – Offline Wikipedia Example

Today the first release of Forage is out. It comes with core support for Solr, Xapian and ZSL and faceting support for Solr. Faceting support for Xapian will hopefully be arriving with the release of Xapian 1.1 in the near future. You can download Forage here. In this post I'm going to walk through the main example found in the Forage source code, offline Wikipedia.

This example is a bit more involved than the previous examples as it shows the need for using backends other than ZSL when dealing with large sets of data. To run these examples you're going to need between 15Gb and 30Gb of disk space, the uncompressed wikipedia download alone is 14Gb. You're also going to need a computer which you can leave chugging away for a few hours at a time, we're dealing with some pretty serious amounts of data here.

The Wikipedia Example

The example I'm going to walk through today is to produce an offline, vertical Wikipedia search engine for a particular field. You could do a full Wikipedia search engine but that would take quite a bit more time and space. As you may know, I'm also quite interested in food and cooking so my vertical is going to be food related, but we'll come to that later.

Step 1. Getting Wikipedia

The first step in indexing Wikipedia is to download it's entire content. Go to the Wikipedia data dump download page and have a look at what's there, the one we'll be using is enwiki-latest-pages-articles.xml.bz2. This file contains all Wikipedia article and redirect pages in an easy to parse XML format. This is a big file (the one I downloaded was 3.2Gb) so don't even try to download it through your browser, use something like wget, also make sure the tool you use has the ability to restart downloads, you don't want to get 80% through the download and then have to restart. With wget it's as easy shown below.

rob@home:~/wikipedia# wget -c http://download.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

This will take a while, even if you have a fast connection Wikipedia won't serve it up at lightning speeds, I left it over night and it was done in the morning.

Once you have the XML dump of Wikipedia you need to unpack it and prepare it. It is compressed with bzip2 so you will need to do unpack it, which again, is as easy as calling bunzip2.

rob@home:~/wikipedia# bunzip enwiki-latest-pages-articles.xml.bz2

This will take a while and hammer the CPU so go make a cup of tea and read the paper for a few minutes...

Once it has unpacked you will have one enormous XML file (about 14Gb). Unless you have configured PHP specifically it will not be able to handle files this big (if you have then you probably know about it so you can skip to the next section). The easiest solution to this problem is to just chop the file up into lots of smaller files, UNIX provides a very good tool for this, split. We need files less than 1Gb so let's split it into 900Mb chunks. The following command will give us 15-16 files named enwiki-chunks-aa, enwiki-chunks-ab, etc.

rob@home:~/wikipedia# split --bytes=900m enwiki-latest-pages-articles.xml enwiki-chunks-

Step 2. Preparing The Data

Now that we have our Wikipedia data in a form that PHP can read we can move to the Forage install directory and start playing with the example. Change directory to the examples/wikipedia directory of your Forage install. First we want to filter down the full Wikipedia dump to our vertical, this is done with the writer.php script. If you want to index Wikipedia in it's entirety you can skip straight to the Indexing section This script will parse the split Wikipedia dump extracting articles by their category and inserting those which match your patterns into a stripped down xml file. You can get a brief manual of this script with the following command.

rob@home:~/forage/examples/wikipedia# php writer.php --help

For this script to work you need to have a categories.lst file containing PCRE regular expressions, one per line. The categories.lst I'm using is in the Forage release and looks like this.

#food#i
#cook#i
#recipe#i
#meat#i
#vegetable#i
#poultry#i

So, for me to run this script with verbose output and write the output to the same path as the wikipedia dump I would use the following command. Which will give you a much more manageable XML file containing just the data we're going to index on a subset of the articles in the main dump. This will take a seriously long time, it has to parse over 4 million documents, extract their categories and match them against the set of regular expressions before deciding whether or not it should go into the vertical search. When I ran it on my laptop it took almost 9 hours!

rob@home:~/forage/examples/wikipedia# php writer.php --verbose --categories=./categories.lst --target=~/wikipedia/vertical-food.xml ~/wikipedia/enwiki-chunks*

Step 4: Indexing

At this point we have a stripped down XML document containing Wikipedia articles for our chosen vertical. Next step is to get them into an index so that we can search over them. For this example I'm going to use Solr because it is the only backend to support faceting at the moment. I will not discuss the installation and configuration of Solr beyond providing the required schema.xml. To run the indexer we need to use the provided parser which works the same way as writer.php but rather than writing to another XML file it writes to an index through Forage. The following command will index our newly created food vertical with Solr (we have to use source='mini' because writer.php uses a stripped XML format, if we were indexing straight from the raw Wikipedia dump we would use source='full').

rob@home:~/forage/examples/wikipedia# php parser.php --verbose --index='solr:127.0.0.1:8080' --source='mini' ~/wikipedia/vertical-food.xml

This is pretty damned quick actually, on my little laptop again, it managed to index almost 8000 documents just under 4 minutes, that works out at about 35 documents per second.

Step 5: Searching

This tutorial wouldn't be much if it didn't get on to searching so here we are, we have extracted almost 8000 articles from Wikipedia which we are interested in and we have indexed them, via Forage, in Solr, now it's time to search over them. Let's start off with a simple query for biscuits.

rob@home:~/forage/examples# php searcher.php --index='solr:127.0.0.1:8080' --limit=5 biscuit
Broken biscuits (2.7441695)
Bourbon biscuit (2.553938)
Empire biscuit (2.4821944)
Malted milk (biscuit) (2.3644874)
Category:United Biscuits brands (2.3402355)

Comes back in no time at all! There are a couple of other features shown in the examples, let's take a look at them now. We can sort the results by something other than the score, in the Solr schema.xml we've got a special field to help sorting by title called sort_title.

rob@home:~/forage/examples# php searcher.php --index='solr:127.0.0.1:8080' --limit=10 --sortasc=sort_title biscuit
2007 pet food recalls (0.24132447)
5-in-1 ration (0.5515988)
AFC Enterprises (0.6894985)
ANZAC biscuit (2.1496434)
Alfajor (0.8359725)

Finally we're going to take a look at the faceting example. With the Wikipedia data we're faceting on the category that appears at the bottom of all Wikipedia pages using the example script faceting.php. Let's look at a quick default example with our search term, biscuit. Note that this example doesn't output the actual document titles because it would clutter the output but they are still available in the ForageResponse object as in all the other examples.

rob@home:~/forage/examples# php faceting --index='solr:127.0.0.1:8080' biscuit
Querying for: biscuit
Total: 262
Cookies (38)
Brand name cookies (27)
British snack foods (23)
Food manufacturers of the United Kingdom (14)

Filtering on 'Cookies'
Total: 38
Christmas food (4)
Scottish cuisine (3)
Australian snack foods (2)
Brand name cookies (2)

Filtering on 'Christmas food' and 'Cookies'
Total: 4
Austrian cuisine (1)
Danish cuisine (1)
Dutch cuisine (1)
French desserts (1)

You can see here that the first search for 'biscuit' returned a total of 262 documents, and the top four facets are shown. We then take the first category and filter on that which reduces the result set to 38 documents. We then take the top category again and add it to our filter and we now only have 4 documents. A perfect example of how faceting can allow us to drill down through large number of results really quickly.

I hope you enjoyed this little walk through a real-world example of using Forage and it's features. This example is bundled with the release so, with a few minutes work and a few hours waiting for your computer you could have your own offline vertical Wikipedia search engine, and have learnt how to implement search on your site to boot!

Tagged as: , , No Comments
15Feb/080

Faceted Search With Forage

Today I added support for faceted search to Forage. Faceted search is a way of drilling down into search results by filtering on particular fields or categories. The example below is relatively simple in that it only has one facet, category, but there is no reason why you can't have multiple different 'facets'. This is, in fact, done quite often and very successfully in product searches.

And on to the example

This example is an extension of the previous example, only this time we'll be using one of the BBC's news feeds. The reason we're going to use BBC's new feeds this time is that they have an extra element in the feed, category. We're going to use this category as a facet field and then list the most popular categories in the feed.

<?php
require_once dirname(__FILE__) . '/../lib/Forage.php';
require_once 'Zend/Feed.php';

$feed   = Zend_Feed::import('http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/front_page/rss.xml');

// we're now using Solr as the engine, more about this later...
$forage = Forage::create('solr:127.0.0.1:8080');

foreach ($feed as $item) {
  $document = new ForageDocument();
  $document->add('title',       (string)$item->title())
           ->add('link',        (string)$item->link(), array('indexed'=>false))
           ->add('description', (string)$item->content(), array('stored'=>false))
           // we mark the facet field as such at index time
           ->add('category',    (string)$item->category(), array('facet'=>true));

  $forage->add($document);
}
$forage->flush();

// create an empty query and tell it that we want to
// facet on 'category' before searching
$query    = $forage->getQuery();
$query->setFacetFields(array('category'));
$response = $forage->search($query);

echo "Total Results: " . $response->getTotal() . "\n";

// get the category facet from the response
$facet = $response->getFacet('category');
$i=0;
// loop over the category values showing the top three
// along with the number of documents it apears in.
foreach ($facet->values as $value) {
  if ($i++>3) {
    break;
  }
  echo $value->value . " (" . $value->count . ")\n";
}

At the moment it's only supported by the Solr engine but Xapian will be adding faceting support in version 1.1. You can check the code out of subversion and more detailed documentation is in the wiki.

Tagged as: , , No Comments
6Feb/080

Introducing Forage – Search Abstraction for PHP

Recently I've been working on a search abstraction library for PHP called Forage. The idea is
to bring to search what we've had for relational databases for quite a while, abstraction. On Friday I put up a preview release with three
backends; Solr, Xapian and Zend Search Lucene. At the moment it has the bare minimum of features but there will be more soon. In this post
I'm going to talk a little about the motivation for the project and then walk through a short example.

So why do we need search abstraction?

The reasons for wanting an abstraction library for search are pretty much the same as for databases. Ease of integration and resilience to change.

Ease of integration

If you have one interface which provides access to multiple backends then a framework (or other application) can use this interface and then allow
the user to choose which backend to use depending on their needs and abilities. It also allows the users of the framework to scale their solutions
as they grow, this is really the second point though.

Resilience to change

If you have one interface which provides access to multiple backends then once you've implemented your solution you can change the backend if you
need to. With relational databases this is rarely done but with search, certainly in PHP at the moment, there is a bit more of a need for it. Let's
say you have a small site which does something cool. You need a search solution up and running very quickly without rocking the boat too much so
you use ZSL and it works very well. However, your site starts to get more popular (as sites which do cool things do) and it starts to creak,
you decide you need to scale up to a more capable solution such as Solr. If you're not using an abstraction layer, at this point you have to
re-implement your search module. With Forage you just need to set up your Solr server and change the DSN from 'zsl:/path/to/index' to 'solr:host:port/path'
and re-index. Job done!

Enough talk, let's play!

To show you how easy it is implementing search with Forage let's run through a little example. For this example I'm going to index some data
out of an RSS feed. I'll be using Zend_Feed from the Zend Framework and for the backend
to Forage I'm going to use Xapian. I'm just going to index all the items and then run a search over the index.

<?php
require_once 'Zend/Feed.php';
require_once 'Forage/Forage.php';

// import the feed
$feed   = Zend_Feed::import('http://rss.slashdot.org/Slashdot/slashdot');

// initialise forage
$forage = Forage::create('xapian:/var/xapian/slashdot');

// iterate over the feed items
foreach ($feed as $item) {
  // create a new document
  $document = new ForageDocument();

  // add some fields to it
  $document->add('title', (string)$item->title()) // will be both indexed and stored
           // won't be indexed but will be stored
           ->add('link', (string)$item->link(), array('indexed'=>false))
           // will be indexed but won't be stored
           ->add('description', (string)$item->content(), array('stored'=>false); 

  // add the document to the index
  $forage->add($document);
}
// flush the changes to the index
$forage->flush();

// search over the index
$results = $forage->search('yahoo microsoft');
foreach ($results as $document) {
  echo $document['title'] . "\n";
}

That's not bad is it? A feed indexing program in under 70 lines of code. If you're interested then get over to the
Forage download page and give it a whirl, and if you can, get involved.

Tagged as: , , No Comments
20Jan/080

Playing with Project Euler and PHP – Part Two

This is the second instalment of me playing with Project Euler and PHP, you can check out problem one
here. In this post I'll be using the same format as before, three different
solutions to the same problem, a simple loop oriented one, a loopless one and an over engineered, object oriented one.



Problem 2: Find the sum of all the even-valued terms in the Fibonacci sequence which do not exceed one million.

As you can see from the Problem link above, in the Fibonacci sequence, the next number is the sum of previous two numbers. This is very easy to do with
a loop, we just provide it with the first two numbers and continue with the loop until the maximum is reached, keeping track of the sum of even terms as
we go. One piece of code which may not be familiar here is the variable swapping, to avoid using a temporary variable I have chosen to use
list and array.

<?php
$target = 1000000;
$sum    = 0;
for ($one=1, $two=2; $two<$target;) {
  list($one, $two) = array($two, $one+$two);
  if ($one%2==0) {
    $sum += $one;
  }
}
echo $sum . "\n";
?>

The second solution to this problem is the loopless one. This is actually one that a lot of you will be familiar with as it's one of those
sequences which is often used as an example of recursion; exactly what I'm going to use it for here. It's essentialy the same process as in the first
solution. I have used the same variable names to make it easier to see what it's doing.

<?php
function fibonacci($one, $two, $target, $sum=0)
{
  if ($two>=$target) {
    return $sum;
  }
  if ($two%2==0) {
    $sum += $target;
  }
  return fibonacci($two, $one+$two, $target, $sum);
}
echo fibonacci(1, 2, 1000000) . "\n";
?>

The third solution is the over engineered one, for this solution I'm going to use some of the same classes and interfaces used for the first problem.
Specifically I'll be using the FactorCondition and SumAction classes and the
Action and Condition interfaces. This problem doesn't lend itself to an
OO solution as well as problem one, so I'm going to have to handle this slightly differently. The executor
is going to accept two actions, one to sum values (as in the first problem) and one to encapsulate the Fibonacci sequence; and two conditions, one
to control when to stop and one to control the Fibonacci action and two conditions, one to control the fibonacci action. Let's get down to
implementing the components. I will not show the code for previously implemented classes or interfaces.

<?php
class FibonacciAction implements Action
{
  private $one = 1;
  private $two = 2;
  public function execute($value)
  {
    list($this->one, $this->two) = array($this->two, $this->one+$this->two);
  }
  public function result()
  {
    return $this->two;
  }
}
class LimitCondition implements Condition
{
  private $limit;
  public function __construct($limit)
  {
    $this->limit = $limit;
  }
  public function test($value)
  {
    return $this->limit>$value;
  }
}
?>

Once we have all the components we need to develop the executor. This executor is a little more flexible than the one used for project one as it
seperates the stop condition from the action condition and seperates the two actions out. Hopefully this model will be more useful with later problems.

<?php
class FibonacciAction implements Action
{
  private $one = 1;
  private $two = 2;
  public function execute($value)
  {
    list($this->one, $this->two) = array($this->two, $this->one+$this->two);
  }
  public function result()
  {
    return $this->two;
  }
}
class LimitCondition extends Condition
{
  private $limit;
  public function __condition($limit)
  {
    $this->limit = $limit;
  }
  public function test($value)
  {
    return $this->limit>$value;
  }
}
?>

Finally all there is to do is tie everything together and execute it.

<?php
$executor = new Executor();
$executor->setStopper(new LimitCondition(1000000));
$executor->setCondition(new FactorCondition(2));
$executor->setAction(new SumAction());
$executor->setAlgorithm(new FibonacciAction());
echo $executor->execute() . "\n";
?>

So there we have it, three more solutions to one of the Euler Project's maths problems.

Tagged as: , No Comments
20Jan/080

Playing with Project Euler and PHP

I recently discovered Project Euler and fell in love. It is absolutely brilliant, concise, interesting and often fiendishly difficult mathematical problems which can be solved programatically. A little later I heard about Jorge Ortiz's post using the Project Euler problems to introduce functional programming techniques in Scala, which got me wondering how many different ways I could solve some of these problems with PHP.

So far I've tackled the problems in three different ways, a traditional PHPesque way, probably the most familiar to most PHP developers; a loopless way, using functional programming paradigms; and finally an absurdly over-engineered solution.

Problem 1: Add all the natural numbers below 1000 that are multiples of 3 or 5.

This is a pretty easy problem all round, in a traditional 'loopy' way it might look like this. We just loop through all the numbers adding those which are factors of 3 or five to the sum.

<?php
$sum = 0;
for ($i=1; $i<100000; $i++) {
  if ($i%3==0 || $i%5==0) {
    $sum+=$i;
  }
}
echo $sum . "\n";

Doing this without loops, although not immediately obvious is also quite easy to understand. We create a range, fitler out any elements which aren't factors of 3 or 5 and sum those which are left.

<?php
$sum = array_sum(
  array_filter(
    range(1, 99999),
    create_function('$i', 'return $i%3==0 || $i%5==0;')
  )
);
echo $sum . "\n";
[php]

<p>Here's another loopless solution using array reduce and a function which keeps track of the sum.</p>
[php]<?php
$sum = array_reduce(
  range(1, 99999),
  create_function('$sum, $i', 'return ($i%3==0 || $i%5==0) ? $sum+$i : $sum;')
);
echo $sum . "\n";

OK, now that we've done two sensible (and hopefully interesting ones) I think it's time for something a little tounge-in-cheek. Yes, it's time for the over-engineered solution. This solution is based around three interfaces, Iterator (from the SPL), Condition and Action.

<?php
interface Condition
{
  public function test($value);
}

interface Action
{
  public function apply($value);
  public function result();
}

So, first we need a basic Iterator to generate our numbers (from 1 to 100000 in increments of 1).

<?php
class LinearGenerator implements Iterator
{
  private $start, $end, $increment;
  private $current;

  public function __construct($start, $end, $increment)
  {
    $this->start     = $start;
    $this->end       = $end;
    $this->increment = $increment;
  }

  public function current()
  {
    return $this->current;
  }

  public function key()
  {
    return $this->current;
  }

  public function next()
  {
    $this->current += $this->increment;
  }

  public function rewind()
  {
    $this->current = $this->start;
  }

  public function valid()
  {
    return isset($this->current) && $this->current <= $this->end;
  }
}

Next we need to sort out our conditions. This is a little more complex as we need two different types of condition, one for testing whether a number is a multiple of of another number and one for tying together two other conditions in an 'OR'.

<?php
class FactorCondition implements Condition
{
  private $factor;
  public function __construct($factor)
  {
    $this->factor = $factor;
  }

  public function test($value)
  {
    return $value % $this->factor == 0;
  }
}

class OrCondition implements Condition
{
  private $conditions = array();
  public function add(Condition $condition)
  {
    $this->conditions[] = $condition;
  }

  public function test($value)
  {
    foreach ($this->conditions as $condition) {
      if ($condition->test($value)) {
        return true;
      }
    }
    return false;
  }
}

Almost there now, next we need the component which will keep track of the sum so far, the Action.

<?php
class SumAction implements Action
{
  private $sum = 0;
  public function apply($value)
  {
    $this->sum += $value;
  }
  public function result()
  {
    return $this->sum;
  }
}

Finally we need something to pull it all together, the Executor.

<?php
class Executor
{
  private $generator, $condition, $action;
  public function setGenerator(Iterator $generator)
  {
    $this->generator = $generator;
  }
  public function setCondition(Condition $condition)
  {
    $this->condition = $condition;
  }
  public function setAction(Action $action)
  {
    $this->action = $action;
  }
  public function execute()
  {
    foreach ($this->generator as $value) {
      if ($this->condition->test($value)) {
        $this->action->apply($value);
      }
    }
    return $this->action->result();
  }
}

Now that we've put together our enterprise-ready, future-proof, dynamic solution; let's use it.

<?php
$executor = new Executor();
$executor->setGenerator(new LinearGenerator(1, 1000, 1));
$condition = new OrCondition();
$condition->add(new FactorCondition(3));
$condition->add(new FactorCondition(5));
$executor->setCondition($condition);
$executor->setAction(new SumAction());

echo $executor->execute();

Have you got any other solutions to the problem?

Tagged as: , No Comments