Intro
For the Squirro SmartFilters to produce accurate results, a list of term frequencies across all documents in your indexes has to be maintained.
This list is called the GDF (Global Document Frequency), also referred to as GDFS.
Squirro offers a good starting GDF set for all supported languages.
If you're indexing generic news items, then the starting GDF will yield great results.
If, however, you are indexing very specific content, it is highly recommended to recalculate the GDF regularly.
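To illustrate what a document frequency list contains, here is a minimal sketch in Python. It is not how Squirro or Elasticsearch computes the GDF: the function name `document_frequencies` and the plain whitespace tokenization are assumptions made only for this illustration.

```python
from collections import Counter


def document_frequencies(documents):
    """Count, for each term, the number of documents that contain it."""
    counts = Counter()
    for text in documents:
        # Using a set counts each term at most once per document.
        counts.update(set(text.lower().split()))
    return counts


docs = [
    "the bank raised interest rates",
    "the central bank met today",
    "interest in the product grew",
]
df = document_frequencies(docs)
print(df["the"], df["bank"], df["product"])
```

A term such as "the" appears in every document and therefore carries little information, while a term with a low document frequency is more distinctive. This is why a GDF built from generic news can misjudge terms in very specific content.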
Download
Squirro Version | Download |
---|---|
Squirro 2.4.3 | squirro_gdfs_util_2.4.3.zip |
Usage Example
The utility is configured through an ini file.
It needs to run on a Squirro storage node. This is where the Elasticsearch process is running.
If you have multiple storage nodes, you only need to run it on one.
Here is an example:
# location of the Elasticsearch data folder
elasticsearch_data_folder = /var/lib/elasticsearch
# space-separated list of indexes, or all
indexes = all
# where the data will be saved
target_folder = /tmp
# how many files per language should be created
files_per_language = 8
# should numbers and floats be removed?
remove_numbers = true
# terms with fewer than this number of documents will be deleted from the gdfs list
frequency_lower_limit = 10
# languages to extract
languages = en
Once this is set up, invoke the utility like so:
./create_gdfs.py
Sample Output:
2016-09-14 09:08:17,476 create_gdfs.py[22561] INFO Starting process (version 2.4.3).
2016-09-14 09:08:17,476 create_gdfs.py[22561] INFO Looking for shards in '/apps/squirro/elasticsearch/'
2016-09-14 09:08:17,476 create_gdfs.py[22561] INFO Blacklisted: ['squirro_v7_fp', 'squirro_v7_filter']
2016-09-14 09:08:17,495 create_gdfs.py[22561] INFO Using all indexes to create gfds files
2016-09-14 09:08:17,495 create_gdfs.py[22561] INFO Using these shard folders:
2016-09-14 09:08:17,495 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7/0/index
2016-09-14 09:08:17,495 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_6c3apqtdt1eg64djsbi3nw/2/index
2016-09-14 09:08:17,495 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_6c3apqtdt1eg64djsbi3nw/5/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/0/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/1/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/2/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/3/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/4/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/5/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gcabla/2/index
...
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO Using these languages:
2016-09-14 09:08:17,497 create_gdfs.py[22561] INFO -> en
2016-09-14 09:08:17,497 create_gdfs.py[22561] INFO Executing: java -jar global_dfs.jar /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7/0/index en -o /tmp/en > /dev/null
2016-09-14 09:08:18,042 create_gdfs.py[22561] INFO Processed 42696 documents
2016-09-14 09:08:18,042 create_gdfs.py[22561] INFO Found 1046 terms
2016-09-14 09:08:18,042 create_gdfs.py[22561] INFO Executing: java -jar global_dfs.jar /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_6c3apqtdt1eg64djsbi3nw/2/index en -o /tmp/en > /dev/null
2016-09-14 09:08:18,680 create_gdfs.py[22561] INFO Processed 42696 documents
2016-09-14 09:08:18,680 create_gdfs.py[22561] INFO Found 1046 terms
2016-09-14 09:08:18,681 create_gdfs.py[22561] INFO Executing: java -jar global_dfs.jar /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_6c3apqtdt1eg64djsbi3nw/5/index en -o /tmp/en > /dev/null
...
2016-09-14 09:08:39,535 create_gdfs.py[22561] INFO Processed 1769 documents
2016-09-14 09:08:39,536 create_gdfs.py[22561] INFO Found 19733 terms
2016-09-14 09:08:39,549 create_gdfs.py[22561] INFO Found 168327 unfiltered terms
2016-09-14 09:08:39,597 create_gdfs.py[22561] INFO 8368 terms left after filtering
2016-09-14 09:08:39,597 create_gdfs.py[22561] INFO - 155060 got removed due to too low frequenzy
2016-09-14 09:08:39,597 create_gdfs.py[22561] INFO - 4899 got removed due to being numbers
2016-09-14 09:08:39,597 create_gdfs.py[22561] INFO Creating 8 files per language
2016-09-14 09:08:39,597 create_gdfs.py[22561] INFO Up to 1046 terms per file
2016-09-14 09:08:39,599 create_gdfs.py[22561] INFO Writing 1046 terms into /tmp/en0.json
2016-09-14 09:08:39,601 create_gdfs.py[22561] INFO Writing 1046 terms into /tmp/en1.json
2016-09-14 09:08:39,604 create_gdfs.py[22561] INFO Writing 1046 terms into /tmp/en2.json
2016-09-14 09:08:39,607 create_gdfs.py[22561] INFO Writing 1046 terms into /tmp/en3.json
2016-09-14 09:08:39,610 create_gdfs.py[22561] INFO Writing 1046 terms into /tmp/en4.json
2016-09-14 09:08:39,612 create_gdfs.py[22561] INFO Writing 1046 terms into /tmp/en5.json
2016-09-14 09:08:39,615 create_gdfs.py[22561] INFO Writing 1046 terms into /tmp/en6.json
2016-09-14 09:08:39,618 create_gdfs.py[22561] INFO Writing 1046 terms into /tmp/en7.json
2016-09-14 09:08:39,620 create_gdfs.py[22561] INFO All done!
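The numbers at the end of the sample output follow from the configuration options: 168327 unfiltered terms minus 155060 low-frequency terms minus 4899 numeric terms leaves 8368 terms, and 8368 terms spread over 8 files gives 1046 terms per file. The sketch below shows how such a filter and split could work. It is not the code of create_gdfs.py: the function names, the number pattern, and the order in which the two filters are applied are assumptions for illustration only.

```python
import math
import re

# Assumed pattern for "numbers and floats", e.g. 2016, 3.14 or 1,000.50.
NUMBER_RE = re.compile(r"[+-]?\d+(?:[.,]\d+)*")


def filter_terms(term_counts, frequency_lower_limit=10, remove_numbers=True):
    """Drop terms found in too few documents and, optionally, numeric terms."""
    kept = {}
    for term, count in term_counts.items():
        if count < frequency_lower_limit:
            continue  # fewer documents than the lower limit
        if remove_numbers and NUMBER_RE.fullmatch(term):
            continue  # the term is a number
        kept[term] = count
    return kept


def split_terms(terms, files_per_language=8):
    """Distribute the terms over a fixed number of chunks, one per file."""
    items = sorted(terms.items())
    per_file = math.ceil(len(items) / files_per_language)
    return [
        dict(items[i * per_file:(i + 1) * per_file])
        for i in range(files_per_language)
    ]


counts = {"bank": 120, "rates": 45, "2016": 300, "3.14": 12, "squirro": 9, "filter": 10}
kept = filter_terms(counts)
chunks = split_terms(kept, files_per_language=2)
print(kept)
print(chunks)

# Arithmetic taken from the sample output above.
print(168327 - 155060 - 4899, math.ceil(8368 / 8))
```

A lower `frequency_lower_limit` keeps more rare terms at the cost of larger files. The right value depends on the size of your indexes.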
Update the SmartFilter (aka Fingerprint) Service:
The final step is to copy the generated files to the fingerprint service on all Squirro cluster nodes.
The default location for the files is:
/var/lib/squirro/fingerprint/gdfs
Always back up the existing files before overwriting them.
Once you've updated the files, you need to restart the fingerprint service:
service sqfingerprintd restart
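The backup and copy steps can be scripted. The following is a minimal sketch, not a tool shipped with Squirro: the function name `install_gdfs`, the backup folder naming, and the assumption that the generated files keep their names (`en0.json`, `en1.json`, ...) in the target folder are all hypothetical.

```python
import shutil
import time
from pathlib import Path


def install_gdfs(source_dir, gdfs_dir, language="en"):
    """Back up the existing GDF folder, then copy the new files into it."""
    source_dir, gdfs_dir = Path(source_dir), Path(gdfs_dir)

    # Files written by the utility are named <language><n>.json.
    new_files = sorted(source_dir.glob(f"{language}[0-9]*.json"))
    if not new_files:
        raise FileNotFoundError(f"no {language}*.json files in {source_dir}")

    # Keep a timestamped copy of the current folder next to the original.
    backup_name = gdfs_dir.name + time.strftime("-backup-%Y%m%d-%H%M%S")
    backup_dir = gdfs_dir.with_name(backup_name)
    shutil.copytree(gdfs_dir, backup_dir)

    for path in new_files:
        shutil.copy2(path, gdfs_dir / path.name)
    return backup_dir
```

With the example configuration above, a call would look like `install_gdfs("/tmp", "/var/lib/squirro/fingerprint/gdfs")`. The sketch does not restart the service, so the restart command shown above still has to be run on each node.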