Smart Filter Tutorial

 

This tutorial walks through an example of working with Smart Filters. It starts by creating a simple Smart Filter and then expands it step by step.

Setup

Download Example Data

Before starting, download the file news.xls to your computer.

Set up Example Project

First set up a project that can be used to test the Smart Filter logic.

  1. Log into Squirro with a user that is allowed to create projects.

  2. Create a new project and specify a project title (e.g. "News").

  3. On the "Data" tab of the new project choose "Add Data". Select the "Data Import" tab and click on "CSV/Excel" to launch the import wizard.

  4. In the Map Fields step all the fields can be left at the default mapping. Simply press "Next" again. On the final screen confirm that Squirro is about to import 179 items, then press "Import".

  5. The items will now start to show up on the "Search" tab.

Get Started

Copy & Paste Text

When creating a new Smart Filter, it's often a good idea to start with the encyclopedia definition of the concept. In this example we are going to create a Smart Filter for profit & loss information about companies. To start with, we use the introductory paragraphs of the Income statement Wikipedia article - copied below for easier reference:

Income statement (source Wikipedia)

An income statement (US English) or profit and loss account (UK English) (also referred to as a profit and loss statement (P&L), statement of profit or loss, revenue statement, statement of financial performance, earnings statement, operating statement, or statement of operations) is one of the financial statements of a company and shows the company’s revenues and expenses during a particular period. It indicates how the revenues (money received from the sale of products and services before expenses are taken out, also known as the “top line”) are transformed into the net income (the result after all revenues and expenses have been accounted for, also known as “net profit” or the “bottom line”). It displays the revenues recognized for a specific period, and the cost and expenses charged against these revenues, including write-offs (e.g., depreciation and amortization of various assets) and taxes. The purpose of the income statement is to show managers and investors whether the company made or lost money during the period being reported.

One important thing to remember about an income statement is that it represents a period of time like the cash flow statement. This contrasts with the balance sheet, which represents a single moment in time.

Charitable organizations that are required to publish financial statements do not produce an income statement. Instead, they produce a similar statement that reflects funding sources compared against program expenses, administrative costs, and other operating commitments. This statement is commonly referred to as the statement of activities. Revenues and expenses are further categorized in the statement of activities by the donor restrictions on the funds received and expended.

The income statement can be prepared in one of two methods. The Single Step income statement takes the simpler approach, totaling revenues and subtracting expenses to find the bottom line. The more complex Multi-Step income statement (as the name implies) takes several steps to find the bottom line, starting with the gross profit. It then calculates operating expenses and, when deducted from the gross profit, yields income from operations. Adding to income from operations is the difference of other revenues and other expenses. When combined with income from operations, this yields income before taxes. The final step is to deduct taxes, which finally produces the net income for the period measured.

To start with, copy this text to the clipboard and paste it directly into the Squirro search field. Because this pasted text is quite large, Squirro does not use it to execute a simple full text search, but instead creates an ad-hoc Smart Filter.

Note: the number of words that triggers the Smart Filter creation is configurable and set by default to 10.
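The word-count rule in the note above can be sketched as follows. This is only an illustration: the function name is hypothetical, and whether the real threshold check is "at least 10" or "more than 10" words is an assumption.

```python
# Hypothetical helper, not part of Squirro: decides whether pasted text is
# handled as a plain full text query or as the seed of an ad-hoc Smart Filter.
DEFAULT_WORD_THRESHOLD = 10  # default named in the tutorial; configurable


def triggers_smart_filter(text, threshold=DEFAULT_WORD_THRESHOLD):
    """Return True if the text has at least `threshold` whitespace-separated words.

    Assumption: the comparison is inclusive (>=).
    """
    return len(text.split()) >= threshold
```

A short query such as "wells fargo profit" stays a normal search under this rule, while the pasted Wikipedia paragraphs are long enough to create a Smart Filter.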

To produce these results, Squirro analyzed the pasted content, extracted the most relevant terms, scored those terms, then compared every document in the index to these concepts and returned the most relevant results.
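The steps described here (extract terms, score them, compare documents) can be illustrated with a deliberately simple sketch. It is not Squirro's actual algorithm: the frequency-based weighting, the stop word list and all function names are assumptions made for this example.

```python
import re
from collections import Counter

# Small illustrative stop word list (assumption, not Squirro's list).
STOPWORDS = {"the", "a", "an", "of", "and", "or", "is", "to", "in", "it",
             "for", "as", "that", "are", "by", "on", "with", "this", "be"}


def tokenize(text):
    """Lowercase the text and return its words without stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def extract_terms(text, max_terms=30):
    """Return {term: weight}; weight is the frequency relative to the top term."""
    top = Counter(tokenize(text)).most_common(max_terms)
    if not top:
        return {}
    highest = top[0][1]
    return {term: count / highest for term, count in top}


def score_document(terms, document):
    """Sum the weights of all terms that occur in the document."""
    words = set(tokenize(document))
    return sum(weight for term, weight in terms.items() if term in words)


def rank(terms, documents):
    """Return the matching documents, best score first."""
    scored = [(score_document(terms, doc), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored if score > 0]
```

Calling extract_terms() on the pasted definition and then rank() on a list of item texts gives a rough feel for how a term list turns into a ranked result set.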

Tip: You can also drag and drop a PDF file into the search box of the Squirro UI to create an ad-hoc Smart Filter from the text in that file. The file must be smaller than 10 MB.

Understanding the Tag Cloud

The tag cloud at the top shows the extracted terms for the Smart Filter. There are a number of things going on in this tag cloud:

  • The color indicates whether the terms were found in any of the matched results. Orange terms matched in at least one of the documents, whereas grey terms didn't contribute any matches.

  • The size of the term is relative to how many matches there were. The bigger a term is, the more documents contain the given term.

  • When hovering over a term, a tooltip will appear with information about how many documents match the term.

The terms also have a weight attached to them, which drives how much priority is given to the term when matching it in documents. That weight is not visible in the tag cloud, but it can be seen in the Advanced screen in the Fingerprint tab.
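The color and size rules of the tag cloud can be expressed as a small sketch. The function below is hypothetical and uses plain substring matching, which is a simplification of whatever matching the product performs.

```python
def tag_cloud(terms, documents):
    """Build tag cloud entries for a list of terms.

    Each entry gets the number of matching documents, a color (orange if the
    term matched at least one document, grey otherwise) and a size relative
    to the most frequently matched term.
    """
    entries = []
    for term in terms:
        # Simplification: case-insensitive substring match.
        matches = sum(1 for doc in documents if term.lower() in doc.lower())
        entries.append({
            "term": term,
            "matches": matches,
            "color": "orange" if matches else "grey",
        })
    most = max((entry["matches"] for entry in entries), default=0)
    for entry in entries:
        entry["size"] = entry["matches"] / most if most else 0.0
    return entries
```

The "matches" value corresponds to the number shown in the tooltip. The term weight is intentionally left out, since the tag cloud does not display it.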

Noise Level

This filter will return about 6 results that talk about the defined concept. This is based on the default noise level of "0.1", which means that only results that match very precisely are returned.

Click on "Noise Level" to reveal the noise level slider. By changing the value there, the precision of the results returned can be changed. Change the noise level up to 0.5 and notice how that changes the number of results.
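One way to picture the noise level is as a cut-off relative to the best match. The sketch below assumes exactly that; the formula is an illustration and not the scoring Squirro uses internally.

```python
def filter_by_noise(scores, noise_level):
    """Keep items whose score is at least (1 - noise_level) of the best score.

    scores: dict mapping an item id to a positive relevance score.
    Assumption: a noise level of 0.1 keeps only items within 10% of the top
    score, and higher noise levels admit weaker matches.
    """
    if not 0.0 <= noise_level <= 1.0:
        raise ValueError("noise_level must be between 0 and 1")
    if not scores:
        return []
    cutoff = (1.0 - noise_level) * max(scores.values())
    kept = [item for item, score in scores.items() if score >= cutoff]
    return sorted(kept, key=lambda item: scores[item], reverse=True)
```

With this model, raising the noise level never removes a result. It only adds lower-scoring ones below the results that were already shown.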

Verifying Results

To verify that the right results are returned, the default sorting method isn't very useful: results are sorted by relevance out of the box, so even if the noise level is increased, the same results will always be on top. A better way to see how the noise level affects result quality is to sort the results by date.
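The effect can be reproduced with a few made-up items. Titles, scores and dates below are hypothetical, and the minimum score stands in for the noise level (a lower minimum score corresponds to a higher noise level).

```python
from datetime import date

# Hypothetical result items for illustration only.
RESULTS = [
    {"title": "A", "score": 0.95, "date": date(2014, 1, 5)},
    {"title": "B", "score": 0.90, "date": date(2014, 1, 2)},
    {"title": "C", "score": 0.60, "date": date(2014, 1, 9)},
    {"title": "D", "score": 0.55, "date": date(2014, 1, 8)},
]


def top_titles(results, min_score, sort_field, count=2):
    """Titles of the first `count` visible results, sorted descending by sort_field."""
    visible = [r for r in results if r["score"] >= min_score]
    ordered = sorted(visible, key=lambda r: r[sort_field], reverse=True)
    return [r["title"] for r in ordered[:count]]
```

Sorted by "score", the top of the list is the same for a minimum score of 0.9 and of 0.5. Sorted by "date", the newly admitted items C and D appear at the top, which makes them easy to inspect.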

Additionally, the explain mode can be enabled to understand why a result is included in the result set.

With explain mode enabled, every item shows which terms led to its inclusion in the search result.
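Conceptually, the explain output is the list of Smart Filter terms found in an item. A minimal sketch, assuming single-word terms and exact word matches (the function name and the example weights are hypothetical):

```python
import re


def explain(terms, text):
    """Return the (term, weight) pairs found in the text, highest weight first.

    terms: dict mapping a single-word term to its weight.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    matched = [(term, weight) for term, weight in terms.items() if term in words]
    return sorted(matched, key=lambda pair: pair[1], reverse=True)
```

For an item text such as "Quarterly profit and net income both rose" and the terms profit, income and taxes, only profit and income would be listed.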

Excluding

The Smart Filter may have identified terms that are not relevant to your concept. You don't have to worry about every single term, especially if they are of low weight, because the noise level helps with only returning relevant matches. But irrelevant terms can be excluded.

There are two ways of doing this:

  • In the tag cloud click on a term and in the appearing menu click Exclude.

  • Alternatively in the Advanced screen hover over a term and press Exclude.

In both cases the exclusion can be undone by clicking on the term in the list of excludes.

For this tutorial try to exclude a few of the terms and see how this affects the results.

Advanced Smart Filters

Create Empty Smart Filter

So far we've been working with an ad-hoc Smart Filter. While it would be possible to save this Smart Filter, convert it into a permanent one and give it a name, the system name would always remain anonymous (something like "smart-filter-7"). This name is visible in the query (in the form of, e.g., "smartfilter:smart-filter-7:0.5"). This isn't a big problem, but when creating a Smart Filter that is to be used for search tagging later, a descriptive name helps.

The best way to achieve this is to click the Create Smart Filter link in the left navigation.

In a dialog you are now prompted for a name for the Smart Filter. For this example call it "Profit & Loss" and confirm the dialog with Next.

An empty Smart Filter is created. Notice how the search field now contains a more readable query of the form "smartfilter:profit-loss:0.1".
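The query format "smartfilter:NAME:NOISE_LEVEL" is easy to build and take apart in code. The helpers below are hypothetical. In particular, the rule that turns the title "Profit & Loss" into "profit-loss" is an assumption inferred from this one example, not a documented rule.

```python
import re


def system_name(title):
    """Assumed slug rule: lowercase, runs of other characters become '-'."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def build_query(name, noise_level):
    """Build a query string of the form smartfilter:NAME:NOISE_LEVEL."""
    return f"smartfilter:{name}:{noise_level}"


def parse_query(query):
    """Split a Smart Filter query into its name and noise level."""
    parts = query.split(":")
    if len(parts) != 3 or parts[0] != "smartfilter":
        raise ValueError(f"not a Smart Filter query: {query!r}")
    return parts[1], float(parts[2])
```

For example, build_query(system_name("Profit & Loss"), 0.1) yields the query shown in the search field above.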

You can now paste the same definition that was used earlier into the big text box and confirm by pressing Upload. Note that you can also upload documents such as PDFs or Microsoft Office documents and use those as training content.

After pressing Upload, the tag cloud appears once more and the content upload form is moved to the right side.

Additional Training

You can now add documents to the Smart Filter. The form is still available and you can copy & paste content from other definitions - or still upload documents as well. But we'll now take a different route and start training from the items that are already in the Squirro index.

Select a news story, such as "Wells Fargo Posts 14% Profit Increase". In the detail view there is a link "Add to Smart Filter". Click on that, then press "As positive definition".

The tag cloud will change and more results are displayed. This process can now be repeated with other items to fine tune the Smart Filter.

In the advanced screen you can change the number of entities that are included in the Smart Filter. This defaults to 30 terms. But if a concept is smaller or bigger than this default, you can change this setting. Look at the weight-sorted list of terms and check if there is a cut-off at some point. By reducing the number of entities, you can get rid of the low-quality terms. By increasing it, you can get a broader match.
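Looking for a cut-off in the weight-sorted list can be approximated by finding the largest drop between neighbouring weights. This heuristic and the function names are assumptions for illustration. The first three weights in the usage example below come from this tutorial, the last two are made up.

```python
def rank_terms(weights):
    """Return (term, weight) pairs sorted by weight, highest first."""
    return sorted(weights.items(), key=lambda pair: pair[1], reverse=True)


def limit_entities(weights, max_entities=30):
    """Keep only the max_entities highest-weighted terms (default 30)."""
    return dict(rank_terms(weights)[:max_entities])


def suggest_cutoff(weights):
    """Suggest how many terms to keep: those before the largest weight drop."""
    ranked = rank_terms(weights)
    if len(ranked) < 2:
        return len(ranked)
    drops = [ranked[i][1] - ranked[i + 1][1] for i in range(len(ranked) - 1)]
    return drops.index(max(drops)) + 1
```

Given the weights {"income statement": 5.6, "bottom line": 4.5, "amortization": 4.1, "period": 0.9, "name": 0.7}, the largest drop is between "amortization" and "period", so keeping three entities removes the two low-quality terms.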

By default only the latest 20 training documents are used to train the Smart Filter. This can be changed in the configuration with the max_contributing_items setting (see fingerprint.ini).
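The effect of max_contributing_items can be pictured as a simple "newest N" selection. The item structure and the "added_at" field in this sketch are hypothetical; only the setting name and its default of 20 come from the text above.

```python
def contributing_items(training_items, max_contributing_items=20):
    """Return the most recently added training items, newest first.

    training_items: list of dicts with a sortable "added_at" value
    (hypothetical field, e.g. a timestamp).
    """
    newest_first = sorted(training_items, key=lambda item: item["added_at"], reverse=True)
    return newest_first[:max_contributing_items]
```

With 25 training documents and the default setting, the five oldest documents would no longer influence the Smart Filter in this model.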

Make sure to Save the Smart Filter once you are relatively happy with it.

Negative Training

Next to the "As positive definition" link there is also the negative version. This allows training of Smart Filters based on negative examples that should be excluded. So if there is a bad result in the content, the Smart Filter can be trained to not include that kind of result.
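The idea of positive and negative training can be shown with a toy model in which positive examples add to a term's weight and negative examples subtract from it. This is an illustrative assumption and not how Squirro computes weights.

```python
import re
from collections import Counter


def term_counts(text):
    """Count the lowercase words in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def train(positive, negative):
    """Return {term: weight} from positive and negative example texts.

    Terms whose weight is not above zero after subtracting the negative
    examples are dropped.
    """
    weights = Counter()
    for text in positive:
        weights.update(term_counts(text))
    for text in negative:
        weights.subtract(term_counts(text))
    return {term: weight for term, weight in weights.items() if weight > 0}
```

In this model a term that appears equally often in good and bad examples disappears from the Smart Filter, while terms that only appear in good examples keep their weight.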

For example, after creating the initial Smart Filter, you can train it by excluding the document "U.S. Trust Study Reveals Disconnects in Philanthropic Conversations Between HNW Individuals and Professional Advisors". The result is then as follows:

[Screenshots: the Smart Filter before and after the negative training]

Manual Smart Filter

The automated training of Smart Filters has its limits. At some point adding more training documents doesn't improve the quality of the Smart Filter, because most terms have been covered and are on topic. At this point, the Smart Filter can be converted into a manual one.

In the Properties tab in the Advanced Smart Filter settings, press "Switch to manual training". This will convert the Smart Filter into a taxonomy. It is possible to later get back to the automated Smart Filter - but any manual changes are irrevocably lost in that case.

For this example, replace the automatically generated content of the manual Smart Filter with this taxonomy:

"income statement",5.6 "income from",4.6 "bottom line",4.5 amortization,4.1 "annual report" "quarterly report" earning profit loss "beat estimate"~3

A lot of the irrelevant terms have been removed in this example. Instead we have added a few new ones, such as "annual report" and others. To validate the results, close the dialog and press Save. You will now see results based on this new Smart Filter definition.

The "beat estimate" example shows a common Smart Filter technique: proximity search. In this example the Smart Filter returns any items that contain the two words "beat" and "estimate" within three words of each other. Notice how the terms have to be stemmed manually. The reference section contains a helper script for this, and the next iteration of Smart Filters will extend the product to stem terms automatically.
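To make the taxonomy syntax and the proximity rule more concrete, here is a small sketch that parses entries of the form term,weight and "phrase"~distance and checks the proximity condition. The parser, the default weight of 1.0 and the interpretation of "within three words" as a position difference of at most three are assumptions. The sketch does not stem words; it compares exact lowercase tokens.

```python
import re
import shlex
from itertools import product

ENTRY = re.compile(r"^(?P<term>.+?)(?:~(?P<distance>\d+))?(?:,(?P<weight>\d+(?:\.\d+)?))?$")


def parse_taxonomy(text):
    """Parse whitespace-separated taxonomy entries into dicts."""
    entries = []
    # shlex.split keeps a quoted phrase together and drops the quotes.
    for token in shlex.split(text):
        match = ENTRY.match(token)
        entries.append({
            "term": match.group("term"),
            # Assumption: entries without a weight default to 1.0.
            "weight": float(match.group("weight") or 1.0),
            "distance": int(match.group("distance")) if match.group("distance") else None,
        })
    return entries


def within_distance(phrase, text, distance):
    """True if all words of the phrase occur within `distance` positions."""
    tokens = re.findall(r"[a-z]+", text.lower())
    positions = [[i for i, token in enumerate(tokens) if token == word]
                 for word in phrase.lower().split()]
    return any(max(combo) - min(combo) <= distance for combo in product(*positions))
```

Under these assumptions, "Profits beat the analyst estimate again" matches "beat estimate"~3, while a sentence with six words between "beat" and "estimate" does not.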

Using the Smart Filter

The Smart Filter created above can be used in a number of ways.

First of all it is available in the user interface to be selected by any user. To use the Smart Filter, users don't need to know about all the work that went into creating the Smart Filter.

Alert

As with any search result, an alert can be created to automatically notify about new items that match the concept.

Dashboards

The Smart Filter can serve as a foundation for a dashboard. In the result view with a Smart Filter selected, use the Create Dashboard option in the Save dropdown. This creates a new dashboard visualizing the results for the selected Smart Filter. This can also be done manually by copying the query and using it as the query of a dashboard. The query will have the format "smartfilter:SMARTFILTER_NAME:NOISE_LEVEL" - for example "smartfilter:profit-loss:0.9".

Search Tagging

When selecting a Smart Filter, the query changes so that the Smart Filter is included there - for example to "smartfilter:profit-loss:0.9". This same query can be used for search tagging. For example to tag all new news stories with Profit & Loss, paste the query and then assign a keyword such as "Topic:Profit & Loss".
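What search tagging does with such a query and keyword can be sketched in a few lines. Everything here is hypothetical: the item structure, the function names and the `matches` callback, which stands in for evaluating the Smart Filter query against an item.

```python
def parse_keyword(assignment):
    """Split 'Topic:Profit & Loss' into ('Topic', 'Profit & Loss')."""
    name, _, value = assignment.partition(":")
    return name.strip(), value.strip()


def tag_items(items, matches, assignment):
    """Add the keyword to every item for which matches(item) is true.

    items: list of dicts; keywords are stored under item["keywords"].
    matches: callable standing in for the Smart Filter query.
    """
    name, value = parse_keyword(assignment)
    for item in items:
        if matches(item):
            values = item.setdefault("keywords", {}).setdefault(name, [])
            if value not in values:
                values.append(value)
    return items
```

Items that do not match are left untouched, and tagging the same item twice does not duplicate the keyword value.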

Conclusion

In this tutorial you have created a simple ad-hoc Smart Filter, improved its quality through training, converted it to a manual Smart Filter, and finally used it in search tagging.

Smart Filters are a powerful technology that can be used for concept search in projects. There are many use cases for which Smart Filters are a good solution, for example:

  • Tagging documents with topics that are relevant in a project. This can be used for providing different selections to users - especially when combined with search tagging.

  • By creating Smart Filters through the API you can also use them as a kind of "more like this" search. For example, based on a support ticket, find similar support tickets from the past.

The Smart Filter reference covers some of the topics discussed here in more detail.