Rob's Blog Coding, cooking and, well, not much else

15 Mar 2008

Playing with Carrot2 Clustering in PHP

The Carrot2 clustering engine has been on my radar for a couple of months now. It calls itself a 'search results clustering engine', which means that, given a set of search results (titles and snippets), it will give back that same set grouped into clusters. In this post I'm going to show you how you can use Carrot2 and PHP to cluster your search results.

[Image: Carrot2 Clustering Beer]

You can see a demo of Carrot2 running on their site, which clusters results from a variety of popular search engines. The online demo, however, isn't what interests me about this project; there are other, better clustering search engines out there, like Clusty. What I'm interested in is the fact that Carrot2 is an open source project which can be downloaded and run locally. For integrating with PHP it is probably easiest to use the Document Clustering Server (DCS) available from the download page. The DCS exposes the interface over REST or XML-RPC. Although there is an example of using RPC with PHP, for this post I'm going to integrate using the REST interface.
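To give a rough idea of what gets sent to the server, here is a minimal sketch of building an XML payload from a list of search results. The element names (searchresult, query, document, title, url, snippet) are my assumption about the input format the DCS expects, and buildCarrot2Xml() is a made-up helper name, so check both against the documentation bundled with your download.

```php
<?php
// Sketch only: the element names below are an assumption about the XML
// input format the DCS accepts; buildCarrot2Xml() is a hypothetical helper.
function buildCarrot2Xml(array $documents, $query = '')
{
    $dom = new DOMDocument('1.0', 'UTF-8');
    $root = $dom->createElement('searchresult');
    $dom->appendChild($root);

    // createTextNode() takes care of escaping &, < and >
    $queryNode = $dom->createElement('query');
    $queryNode->appendChild($dom->createTextNode($query));
    $root->appendChild($queryNode);

    foreach ($documents as $id => $doc) {
        $docNode = $dom->createElement('document');
        $docNode->setAttribute('id', (string)$id);
        foreach (array('title', 'url', 'snippet') as $field) {
            $child = $dom->createElement($field);
            $child->appendChild($dom->createTextNode($doc[$field]));
            $docNode->appendChild($child);
        }
        $root->appendChild($docNode);
    }

    return $dom->saveXML();
}

// Example data, invented for illustration
$xml = buildCarrot2Xml(array(
    array(
        'title'   => 'Schools & funding',
        'url'     => 'http://example.com/1',
        'snippet' => 'Head teachers warn over funds',
    ),
), 'education');

echo $xml;
```

Building the document with DOMDocument rather than string concatenation means titles and snippets containing ampersands or angle brackets don't break the XML.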

Setting Up The Document Clustering Server

Before I get any further I should point out that you need to have a recent version of Java; I'm using Java 1.5. The first step to getting the DCS up and running is downloading it from the download page and unpacking it. The zip file doesn't contain a top level directory, so you'll need to unpack it into one. Once you have it unpacked, start up the server as directed in readme.txt and test it with the index.html found in the root directory.

[Image: Carrot2 DCS]

I've knocked together a quick and dirty class for interacting with the REST interface to show you how easy Carrot2 is to work with. The example below is much like the Forage faceting example. A bunch of feeds are downloaded and fed into Carrot2, then we show a summary of the first six clusters.

<?php
require 'Carrot2.class.php';
require 'Zend/Feed.php';

$urls = array(
  'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/business/rss.xml',
  'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/entertainment/rss.xml',
  'http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/education/rss.xml'
);

$carrot = Carrot2::createDefault();

foreach ($urls as $url) {
  echo "Loading feed: " . $url . "\n";
  $feed = Zend_Feed::import($url);
  foreach ($feed as $item) {
    // Each feed item becomes one document: URL, title and snippet
    $carrot->addDocument(
      (string)$item->link(),
      (string)$item->title(),
      (string)$item->description()
    );
  }
}

// Print the first six clusters ($i runs from 0 to 5 before the break)
$i = 0;
foreach ($carrot->clusterQuery() as $cluster) {
  if ($i++ > 5) break;
  echo $cluster . "\n";
}
?>
Gives:
Loading feed: http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/business/rss.xml
Loading feed: http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/entertainment/rss.xml
Loading feed: http://newsrss.bbc.co.uk/rss/newsonline_uk_edition/education/rss.xml
Schools (9)
Head Teachers (6)
Funds (3)
Language (2)
Gold Hits (2)
Savings Scheme (2)
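
For the curious, turning the server's XML response into a list like the one above might look something like this. It is only a sketch: the response layout (group elements holding a title/phrase and document refid entries) is my assumption about what the DCS returns, parseClusters() is a hypothetical helper, and the sample XML is invented.

```php
<?php
// Sketch only: the group/title/phrase and document[refid] layout is an
// assumption about the DCS response; parseClusters() is hypothetical.
function parseClusters($responseXml)
{
    $xml = simplexml_load_string($responseXml);
    if ($xml === false) {
        return array(); // unparseable response, nothing to show
    }

    $clusters = array();
    foreach ($xml->group as $group) {
        // Collect the ids of the documents assigned to this cluster
        $docIds = array();
        foreach ($group->document as $doc) {
            $docIds[] = (string)$doc['refid'];
        }
        $clusters[] = array(
            'label'     => (string)$group->title->phrase,
            'documents' => $docIds,
        );
    }

    return $clusters;
}

// Invented sample response, for illustration only
$sample = '<searchresult>'
    . '<group><title><phrase>Schools</phrase></title>'
    . '<document refid="0"/><document refid="2"/></group>'
    . '<group><title><phrase>Funds</phrase></title>'
    . '<document refid="1"/></group>'
    . '</searchresult>';

// Print each cluster as "label (number of documents)"
foreach (parseClusters($sample) as $cluster) {
    echo $cluster['label'] . ' (' . count($cluster['documents']) . ")\n";
}
```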

It seems to work really well for these feeds, but I have found that it doesn't always work so well. If you're dealing with small result sets, or sets with too much diversity, Carrot2 won't be able to extract meaningful clusters and you'll end up with a few clusters containing only one item each and loads of documents in the '(other)' group.
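
One way to soften this is to hide the weak clusters before showing anything to the user. The sketch below assumes each cluster is an array with a 'label' and a list of 'documents'; that shape, the usefulClusters() name and the default minimum size of two are all my own choices for illustration, not something Carrot2 provides.

```php
<?php
// Sketch only: the cluster array shape and the '(other)' label check are
// assumptions; usefulClusters() is a hypothetical helper.
function usefulClusters(array $clusters, $minSize = 2)
{
    $kept = array();
    foreach ($clusters as $cluster) {
        $isOther = strtolower($cluster['label']) === '(other)';
        // Drop the catch-all group and anything smaller than $minSize
        if (!$isOther && count($cluster['documents']) >= $minSize) {
            $kept[] = $cluster;
        }
    }
    return $kept;
}

// Invented data, for illustration only
$clusters = array(
    array('label' => 'Schools', 'documents' => array('0', '2', '5')),
    array('label' => 'Gold',    'documents' => array('3')),
    array('label' => '(other)', 'documents' => array('1', '4', '6', '7')),
);

foreach (usefulClusters($clusters) as $cluster) {
    echo $cluster['label'] . ' (' . count($cluster['documents']) . ")\n";
}
```

If nothing survives the filter, it is probably better to fall back to a plain, unclustered result list.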
