Combined time and relevance sorting is the "magic solution" for ranking results. We believe it delivers the best result lists, especially for projects where the most recent items are usually more important than older information.

Sorting by time and relevance

The results of a given query are sorted by relevance score. The relevance score is computed from factors such as term frequency, document frequency, and where the query terms matched in the document (body, title, summary, ...). The limitation of sorting purely by relevance is that it ignores the importance of up-to-date content such as news articles.

When you use the combination of time and relevance in your project definition, Squirro also sorts the result list using the created_at field of each item, i.e. the most recent documents are put first in the result list. The limitation of sorting purely by the time field is that the most recent documents are often poor matches for the original query.

How Squirro calculates time and relevance

To combine time and relevance in the ranking, we use a score called time_relevance, which folds both the time and relevance factors into the final score using the following formula:

time_relevance_score formula
time_relevance_score = relevance_score * (base + range / (range + decay * age_days^2))

Where:

  • relevance_score: the original relevance score of the document with respect to the query.
  • base: adjusts the impact of relevance. The higher this value, the higher the impact of the relevance score (if you set it very high, e.g. 100, the time factor has only a minor impact and the relevance score contributes most of the final score).
  • range: the range of the decay. The higher this value, the longer the decay range (if range = 0, the final score depends only on the relevance score).
  • decay: the speed of the decay. The higher this value, the quicker the time factor decays (if decay = 0, the final score depends only on the relevance score).
  • age_days: the number of days between the created_at time of the document and now.
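The formula and its parameters can be sketched in a few lines of Python. This is a minimal illustration rather than Squirro's actual implementation: the function names are made up for this example, and `range_` carries a trailing underscore only to avoid shadowing Python's built-in `range`.

```python
def time_factor(age_days, base=0.05, range_=30.0, decay=0.15):
    """Time part of the formula: base + range / (range + decay * age_days^2).

    Note: the expression is undefined (division by zero) when both
    range_ and decay * age_days**2 are zero.
    """
    return base + range_ / (range_ + decay * age_days ** 2)


def time_relevance_score(relevance_score, age_days,
                         base=0.05, range_=30.0, decay=0.15):
    """Scale the original relevance score by the time factor."""
    return relevance_score * time_factor(age_days, base, range_, decay)
```

With decay = 0 the fraction collapses to 1 for every age, so all documents are scaled by the same constant (base + 1) and the ordering is driven by relevance alone, which matches the description of the decay parameter above.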

Here is an example of the Squirro decay curve for varying age_days with base=0.05, range=30, decay=0.15 (the default values configured in Squirro).

Looking at this curve, you can see how time affects the result list:

  • for very recent items (1 to 3 days old) the time factor is very high and significantly boosts the final score
  • the time factor then drops quickly, indicating that older documents are less important
  • the tail of the curve (after 90 days) is low and flat: very old documents are ranked low, and within this range the age of a document has nearly the same impact on the score. In other words, after 90 days documents are treated equally, independent of their age, so relevance becomes the dominating factor in ranking results.
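The shape described above can be checked numerically by evaluating the time factor of the formula at a few ages with the default values. The helper name `time_factor` and the chosen sample ages are assumptions made for this sketch.

```python
def time_factor(age_days, base=0.05, range_=30.0, decay=0.15):
    # base + range / (range + decay * age_days^2), using the documented defaults
    return base + range_ / (range_ + decay * age_days ** 2)


ages = [0, 1, 3, 7, 30, 90, 180, 365]
factors = {age: time_factor(age) for age in ages}

for age, factor in factors.items():
    print(f"{age:>4} days -> {factor:.3f}")
```

With the defaults, the factor starts at about 1.05 for a brand-new item, is still about 1.01 after 3 days, falls to about 0.23 after 30 days and about 0.07 after 90 days, and then levels off towards the value of base (about 0.05 after a year).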

Besides the parameters above, we also introduce the low_relevance and old_period parameters to handle edge cases: a very new document with very low relevance, or a very old document with very high relevance, should always be put at the end of the result list. The default value is 180 days for old_period and 0.25 for low_relevance (looking at the curve of relevance scores, you can see that the scores are quite flat after the value 0.25).
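The page does not spell out how these two thresholds are applied internally, so the following is only one possible reading, written as a hypothetical sort key: a hit is demoted to the end of the list if its relevance is below low_relevance or its age exceeds old_period, and the remaining hits are ordered by the time_relevance score. The `Hit` class, `sort_key` and `rank` are names invented for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    item_id: str
    relevance_score: float
    age_days: float


def time_factor(age_days, base=0.05, range_=30.0, decay=0.15):
    return base + range_ / (range_ + decay * age_days ** 2)


def sort_key(hit, low_relevance=0.25, old_period=180):
    # Assumption: either condition alone is enough to demote a hit.
    demoted = hit.relevance_score < low_relevance or hit.age_days > old_period
    score = hit.relevance_score * time_factor(hit.age_days)
    # False sorts before True, so demoted hits end up last;
    # within each group, higher scores come first.
    return (demoted, -score)


def rank(hits):
    return sorted(hits, key=sort_key)


hits = [
    Hit("fresh-weak", relevance_score=0.1, age_days=1),
    Hit("old-strong", relevance_score=5.0, age_days=400),
    Hit("recent-good", relevance_score=1.0, age_days=5),
]
ranked_ids = [hit.item_id for hit in rank(hits)]
```

In this sketch the recent, reasonably relevant item is ranked first, while the fresh but weak match and the strong but very old match both land at the end of the list.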

How to Configure

Squirro search results can be sorted by time_relevance by default on a per-project basis by setting this in the Advanced Options.

The factors of the ranking formula (base, range, decay, low_relevance, old_period) can be tuned in common.ini, in the [ranking] section.
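To illustrate what such a section might look like, the sketch below parses a sample [ranking] section with Python's standard configparser module and falls back to the documented defaults for anything that is not set. The exact option names in common.ini are an assumption here (they simply mirror the parameter names listed above), and `load_ranking_params` is a hypothetical helper, not part of Squirro.

```python
import configparser

# Hypothetical file content; option names are assumed to match the parameter names.
SAMPLE_INI = """
[ranking]
base = 0.05
range = 30
decay = 0.15
low_relevance = 0.25
old_period = 180
"""

# Default values as documented on this page.
DEFAULTS = {
    "base": 0.05,
    "range": 30.0,
    "decay": 0.15,
    "low_relevance": 0.25,
    "old_period": 180.0,
}


def load_ranking_params(ini_text):
    """Return the ranking parameters, using defaults for missing options."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    params = dict(DEFAULTS)
    if parser.has_section("ranking"):
        for name in DEFAULTS:
            if parser.has_option("ranking", name):
                params[name] = parser.getfloat("ranking", name)
    return params


params = load_ranking_params(SAMPLE_INI)
```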

But why not use Elasticsearch's decay function?

Expert Elasticsearch users may be familiar with its decay function, which scores a document with a function that decays depending on the distance of a numeric field value of the document from a user-given origin. This function is useful for use cases such as finding a hotel close to a geo point or finding a restaurant with low prices. However, it is not easy to apply this function to Squirro results, for two main reasons:

  • the decay function limits the number of parameters you can use in the search, which conflicts with Squirro's faceted search
  • the decay value drops too fast and is difficult to adapt to the relevance scores produced by Squirro

The figure below shows the Squirro relevance score with respect to the rank of documents in the result list for the query "interest rate" on 100,000 news articles from The New York Times collection. The curve indicates that there is no clear score separation between highly relevant and less relevant documents. Applying the Elasticsearch decay function may therefore bring less relevant (but very new) documents to the top of the result list.
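As a rough illustration of the "decays too fast" point, the sketch below compares the Gaussian decay described in the Elasticsearch function_score documentation with the Squirro time factor. The scale of 30 days and the decay of 0.5 used for Elasticsearch are assumptions picked for this comparison only; other settings shift the curve, so treat the numbers as indicative rather than as a benchmark.

```python
import math


def es_gauss_decay(distance, scale, decay=0.5, offset=0.0):
    """Gaussian decay following the formula in the Elasticsearch docs:
    exp(-max(0, |distance| - offset)^2 / (2 * sigma^2)),
    with sigma^2 = -scale^2 / (2 * ln(decay)).
    """
    effective = max(0.0, abs(distance) - offset)
    sigma_sq = -scale ** 2 / (2.0 * math.log(decay))
    return math.exp(-effective ** 2 / (2.0 * sigma_sq))


def squirro_time_factor(age_days, base=0.05, range_=30.0, decay=0.15):
    return base + range_ / (range_ + decay * age_days ** 2)


comparison = {
    age: (es_gauss_decay(age, scale=30.0), squirro_time_factor(age))
    for age in (0, 30, 90, 180)
}

for age, (es_value, squirro_value) in comparison.items():
    print(f"{age:>4} days  es_gauss={es_value:.5f}  squirro={squirro_value:.5f}")
```

Under these assumed settings the Gaussian multiplier is practically zero after 90 days, whereas the Squirro factor never drops below base. Because the relevance scores themselves are not well separated, a multiplier that collapses this quickly lets age decide the ranking almost on its own.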
