Intro

For the Squirro SmartFilters to produce accurate results, a list of term frequencies across all documents in your various indexes has to be maintained.
This list is called the GDF (Global Document Frequency), also referred to as GDFS.

Squirro offers a good starting GDF set for all supported languages.

If you're indexing generic news items, the starting GDF will yield good results.
If, however, you are indexing very specific content, it is highly recommended to recalculate the GDF frequently.
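
To make the idea of a document frequency list concrete, here is a minimal sketch in Python. It is an illustration only: the whitespace tokenization and the `document_frequency` helper are assumptions made for this example, whereas the actual utility reads its statistics from the Elasticsearch shard folders.

```python
from collections import Counter


def document_frequency(documents):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for text in documents:
        # A set ensures a term counts once per document, however often it repeats.
        df.update(set(text.lower().split()))
    return df


docs = [
    "interest rates rise",
    "rates fall as markets rise",
    "markets react to interest rates",
]
df = document_frequency(docs)
print(df["rates"], df["rise"], df["fall"])  # prints: 3 2 1
```

A term such as "rates" that appears in nearly every document of a specialised index carries little signal there, which is why a GDF computed from generic news can be a poor fit for very specific content.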

Download

Squirro Version    Download
Squirro 2.4.3      squirro_gdfs_util_2.4.3-1.zip


Usage Example

The utility is configured through an ini file.
It needs to run on a Squirro storage node, which is where the Elasticsearch process is running.
If you have multiple storage nodes, you only need to run it on one of them.

Here is an example:

#location of the es data folder
elasticsearch_data_folder = /var/lib/elasticsearch
 
#space separated list of indexes, or all
indexes = all
 
#where the data will be saved
target_folder = /tmp
 
#how many files per language should be created
files_per_language = 8
 
#should numbers and floats be removed?
remove_numbers = true
 
#terms found in fewer than this number of documents will be removed from the gdfs list
frequency_lower_limit = 10
 
#languages to extract
languages = en

Once this is set up, invoke the utility like so:

./create_gdfs.py

Sample Output:

2016-09-14 09:08:17,476 create_gdfs.py[22561] INFO     Starting process (version 2.4.3).
2016-09-14 09:08:17,476 create_gdfs.py[22561] INFO     Looking for shards in '/apps/squirro/elasticsearch/'
2016-09-14 09:08:17,476 create_gdfs.py[22561] INFO     Blacklisted: ['squirro_v7_fp', 'squirro_v7_filter']
2016-09-14 09:08:17,495 create_gdfs.py[22561] INFO     Using all indexes to create gfds files
2016-09-14 09:08:17,495 create_gdfs.py[22561] INFO     Using these shard folders:
2016-09-14 09:08:17,495 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7/0/index
2016-09-14 09:08:17,495 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_6c3apqtdt1eg64djsbi3nw/2/index
2016-09-14 09:08:17,495 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_6c3apqtdt1eg64djsbi3nw/5/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/0/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/1/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/2/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/3/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/4/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/5/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gcabla/2/index
...
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     Using these languages:
2016-09-14 09:08:17,497 create_gdfs.py[22561] INFO     -> en
2016-09-14 09:08:17,497 create_gdfs.py[22561] INFO     Executing: java -jar global_dfs.jar /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7/0/index en -o /tmp/en > /dev/null
2016-09-14 09:08:18,042 create_gdfs.py[22561] INFO     Processed 42696 documents
2016-09-14 09:08:18,042 create_gdfs.py[22561] INFO     Found 1046 terms
2016-09-14 09:08:18,042 create_gdfs.py[22561] INFO     Executing: java -jar global_dfs.jar /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_6c3apqtdt1eg64djsbi3nw/2/index en -o /tmp/en > /dev/null
2016-09-14 09:08:18,680 create_gdfs.py[22561] INFO     Processed 42696 documents
2016-09-14 09:08:18,680 create_gdfs.py[22561] INFO     Found 1046 terms
2016-09-14 09:08:18,681 create_gdfs.py[22561] INFO     Executing: java -jar global_dfs.jar /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_6c3apqtdt1eg64djsbi3nw/5/index en -o /tmp/en > /dev/null
...
2016-09-14 09:08:39,535 create_gdfs.py[22561] INFO     Processed 1769 documents
2016-09-14 09:08:39,536 create_gdfs.py[22561] INFO     Found 19733 terms
2016-09-14 09:08:39,549 create_gdfs.py[22561] INFO     Found 168327 unfiltered terms
2016-09-14 09:08:39,597 create_gdfs.py[22561] INFO     8368 terms left after filtering
2016-09-14 09:08:39,597 create_gdfs.py[22561] INFO      - 155060 got removed due to too low frequenzy
2016-09-14 09:08:39,597 create_gdfs.py[22561] INFO      - 4899 got removed due to being numbers
2016-09-14 09:08:39,597 create_gdfs.py[22561] INFO     Creating 8 files per language
2016-09-14 09:08:39,597 create_gdfs.py[22561] INFO     Up to 1046 terms per file
2016-09-14 09:08:39,599 create_gdfs.py[22561] INFO     Writing 1046 terms into /tmp/en0.json
2016-09-14 09:08:39,601 create_gdfs.py[22561] INFO     Writing 1046 terms into /tmp/en1.json
2016-09-14 09:08:39,604 create_gdfs.py[22561] INFO     Writing 1046 terms into /tmp/en2.json
2016-09-14 09:08:39,607 create_gdfs.py[22561] INFO     Writing 1046 terms into /tmp/en3.json
2016-09-14 09:08:39,610 create_gdfs.py[22561] INFO     Writing 1046 terms into /tmp/en4.json
2016-09-14 09:08:39,612 create_gdfs.py[22561] INFO     Writing 1046 terms into /tmp/en5.json
2016-09-14 09:08:39,615 create_gdfs.py[22561] INFO     Writing 1046 terms into /tmp/en6.json
2016-09-14 09:08:39,618 create_gdfs.py[22561] INFO     Writing 1046 terms into /tmp/en7.json
2016-09-14 09:08:39,620 create_gdfs.py[22561] INFO     All done!
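
The numbers at the end of the log follow from the configuration: 168327 unfiltered terms minus 155060 low-frequency terms minus 4899 numeric terms leaves 8368 terms, and 8368 terms spread over 8 files gives 1046 terms per file. The sketch below reproduces that logic in Python. It is not the utility's actual code: `filter_terms`, `split_terms`, the number pattern and the order of the two filters are assumptions made for illustration.

```python
import math
import re

# Assumed definition of "numbers and floats": digits with an optional decimal part.
NUMBER_PATTERN = re.compile(r"\d+(\.\d+)?")


def filter_terms(term_counts, frequency_lower_limit=10, remove_numbers=True):
    """Drop terms that occur in too few documents and, optionally, numeric terms."""
    kept = {}
    for term, count in term_counts.items():
        if count < frequency_lower_limit:
            continue  # mirrors frequency_lower_limit
        if remove_numbers and NUMBER_PATTERN.fullmatch(term):
            continue  # mirrors remove_numbers
        kept[term] = count
    return kept


def split_terms(term_counts, files_per_language):
    """Split the terms into at most files_per_language chunks of near-equal size."""
    items = sorted(term_counts.items())
    per_file = max(1, math.ceil(len(items) / files_per_language))
    return [dict(items[i:i + per_file]) for i in range(0, len(items), per_file)]


counts = {"squirro": 120, "index": 45, "2016": 300, "3.14": 12, "typo": 2, "cluster": 10}
kept = filter_terms(counts)
chunks = split_terms(kept, files_per_language=2)
print(sorted(kept))              # prints: ['cluster', 'index', 'squirro']
print([len(c) for c in chunks])  # prints: [2, 1]
```

Note that "cluster" survives with a count of exactly 10, because only terms below the limit are removed.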

Update the SmartFilter (aka Fingerprint) Service

The final step is to update the GDF files used by the fingerprint service on all Squirro cluster nodes.

The default location for the files is:

/var/lib/squirro/fingerprint/gdfs

Always back up the existing files before overwriting them.
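
The backup and copy step could be scripted along the following lines. This is a hedged sketch, not part of the utility: the `backup_and_install` helper and the backup folder naming are hypothetical, and it assumes that the generated files (for example `en0.json`) are meant to replace files of the same name in the fingerprint folder.

```python
import shutil
import time
from pathlib import Path


def backup_and_install(new_files_dir, gdfs_dir, language="en"):
    """Copy gdfs_dir to a timestamped sibling folder, then install the new files."""
    gdfs_dir = Path(gdfs_dir)
    # Hypothetical naming scheme: <gdfs folder>-backup-YYYYMMDD-HHMMSS
    backup_dir = gdfs_dir.with_name(gdfs_dir.name + time.strftime("-backup-%Y%m%d-%H%M%S"))
    shutil.copytree(gdfs_dir, backup_dir)

    installed = []
    # Matches en0.json, en1.json, ... for the chosen language.
    for src in sorted(Path(new_files_dir).glob(f"{language}[0-9]*.json")):
        shutil.copy2(src, gdfs_dir / src.name)
        installed.append(src.name)
    return backup_dir, installed


# Example call with the paths used on this page (requires write permission):
# backup_and_install("/tmp", "/var/lib/squirro/fingerprint/gdfs", language="en")
```

The function would have to be run on every cluster node. Existing files whose names do not match a generated file are left untouched, so check the folder contents afterwards.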


Once you've updated the files, you need to restart the fingerprint service:

service sqfingerprintd restart




