Extending Lucene's Scoring
21st November 2008
Lucene's tf-idf scoring algorithm is fast and effective, and is undeniably one of the features that have made Lucene the most popular text search library around today. Not only does it provide really effective text ranking, it also lets us boost different parts of the process. We can boost documents, fields and even query components. This is great when we know at index time that particular documents or fields are more important than others, with premium results or a title field for example. And boosting query components can be even more powerful. However, sometimes we need even more.
Sometimes we need to affect a document's score with an external variable. Take blog or news search for example. We want the documents to be scored by relevancy, obviously, but we would also like a document's age to have an effect. It isn't really possible to achieve this with just boosts, so what can we do?
Since Lucene 2.2 there has been a little-documented package that can be really helpful here. org.apache.lucene.search.function allows us to build queries that affect the score of a document in ways that we define. By the end of this post we're going to have a simple score modifier that takes a document's age into account.
Jiles van Gurp wrote a great post a few months back about using FieldScoreQuery and CustomScoreQuery to bring a document's age into play. FieldScoreQuery interprets a field as a float and uses it to derive a score. We can create a simple 'agescore' field of the form "0.yyyyMMddhhmm" and use the FieldScoreQuery to give newer documents a higher score.
    TermQuery termQuery = new TermQuery(new Term("title", "foo"));
    FieldScoreQuery scoreQuery = new FieldScoreQuery("agescore", FieldScoreQuery.Type.FLOAT);
    Query query = new CustomScoreQuery(termQuery, scoreQuery);
In the example above we combine a simple TermQuery with our FieldScoreQuery using the CustomScoreQuery.
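To get a feel for what the 'agescore' field does to a score, here is a small standalone sketch in plain Java with no Lucene involved. It assumes the default CustomScoreQuery behaviour of multiplying the sub-query score by the field value, which is worth checking against your Lucene version. The class and method names are made up for illustration, and the 24-hour "HH" pattern and UTC time zone are assumptions of the sketch rather than part of the original recipe.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

class AgeScoreSketch {

    // Hypothetical helper: renders a creation date as "0.yyyyMMddHHmm".
    // HH (24-hour clock) and UTC are assumptions made for this sketch.
    static String ageScoreField(Date created) {
        SimpleDateFormat format = new SimpleDateFormat("'0.'yyyyMMddHHmm", Locale.ROOT);
        format.setTimeZone(TimeZone.getTimeZone("UTC"));
        return format.format(created);
    }

    // Assumes the relevance score and the field value are simply multiplied.
    static float combinedScore(float relevance, String ageScoreField) {
        return relevance * Float.parseFloat(ageScoreField);
    }

    public static void main(String[] args) {
        long dayMillis = 24L * 60 * 60 * 1000;
        Date older = new Date(0L);              // 1970-01-01 00:00 UTC
        Date newer = new Date(400 * dayMillis); // 400 days later

        String olderField = ageScoreField(older);
        String newerField = ageScoreField(newer);

        // Same relevance for both, so any difference comes from the age field.
        System.out.println(olderField + " -> " + combinedScore(1.0f, olderField));
        System.out.println(newerField + " -> " + combinedScore(1.0f, newerField));
    }
}
```

The two field values only start to differ around the fourth decimal place, so under this assumption age behaves more like a gentle tie-breaker than a strong signal. A float also keeps only about seven significant digits, so most of the hour and minute detail is lost once the field is parsed.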
Although this model works well, it's not perfect. The difference between a document that's a week old and one that's two weeks old should not be the same as the difference between a document that's a year old and one that's a year and a week old. We could improve on this model by basing the derived score on the log of the document's age. To achieve this we need to dive a little deeper into the function package. FieldScoreQuery is based on the class ValueSourceQuery, which gets values for documents using a ValueSource. In the example below we create a custom ValueSource similar to the ones used by FieldScoreQuery. However, rather than just returning the field's value, we return a weight based on the reciprocal of the log of the document's age.
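    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.search.function.DocValues;
    import org.apache.lucene.search.function.FieldCacheSource;

    public class AgeFieldSource extends FieldCacheSource {

        // Current time in seconds since the epoch, fixed when the source is created.
        private final int now;

        public AgeFieldSource(String field) {
            super(field);
            now = (int) (System.currentTimeMillis() / 1000);
        }

        @Override
        public boolean cachedFieldSourceEquals(FieldCacheSource other) {
            return other.getClass() == AgeFieldSource.class;
        }

        @Override
        public int cachedFieldSourceHashCode() {
            return Integer.class.hashCode();
        }

        @Override
        public DocValues getCachedFieldValues(FieldCache cache, String field, IndexReader reader) throws IOException {
            // The field is expected to hold creation times as seconds since the epoch.
            int[] times = cache.getInts(reader, field);
            final float[] weights = new float[times.length];
            for (int i = 0; i < times.length; i++) {
                // Here be the nuts and bolts: the reciprocal of the log of the age in hours.
                // Adding e keeps the log at 1 or above; a bare 1/log(age) is infinite at
                // one hour old and meaningless for anything newer.
                double ageHours = Math.max(0, now - times[i]) / 3600.0;
                weights[i] = (float) (1 / Math.log(Math.E + ageHours));
            }
            return new DocValues() {
                public float floatVal(int doc) { return weights[doc]; }
                public int intVal(int doc) { return (int) weights[doc]; }
                public String toString(int doc) { return description() + '=' + floatVal(doc); }
            };
        }
    }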
And the example below shows how we can use this class.
    TermQuery termQuery = new TermQuery(new Term("title", "foo"));
    ValueSourceQuery valueQuery = new ValueSourceQuery(new AgeFieldSource("created"));
    CustomScoreQuery query = new CustomScoreQuery(termQuery, valueQuery);
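To see why a log-based weight treats recent documents differently from old ones, here is a rough standalone sketch in plain Java, again with no Lucene involved. It uses 1 / ln(e + ageHours) as the weight. The offset by e is an assumption of this sketch that keeps the weight finite and between 0 and 1 for very new documents, and the class and method names are hypothetical.

```java
class AgeWeightSketch {

    // Weight for a document that is ageHours old: 1 / ln(e + ageHours).
    // The "+ e" offset is an assumption of this sketch; it keeps the weight
    // finite and in (0, 1] even for documents less than an hour old.
    static double weight(double ageHours) {
        return 1.0 / Math.log(Math.E + Math.max(0.0, ageHours));
    }

    public static void main(String[] args) {
        double week = 7 * 24;   // hours in a week
        double year = 365 * 24; // hours in a (non-leap) year

        // How much weight a document loses over one extra week,
        // starting from a week old and from a year old.
        double recentGap = weight(week) - weight(2 * week);
        double oldGap = weight(year) - weight(year + week);

        System.out.println("week vs two weeks:       " + recentGap);
        System.out.println("year vs year and a week: " + oldGap);
    }
}
```

An extra week costs a week-old document far more weight than it costs a year-old one, which is the behaviour we were after.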
This model seems to work pretty well, but your mileage may vary. I'd love to hear other people's experiences with this package, and thoughts on this solution, in the comments.
