...

If you are indexing generic news items, the starting GDFs will yield great results.
If, however, you are indexing very specific content, it is highly recommended to recalculate the GDFs frequently.

Download

Squirro Version | Download
Squirro 2.4.3 | squirro_gdfs_util_2.4.3.zip


Usage Example

The utility is configured through an ini file.
It needs to run on a Squirro storage node, which is where the Elasticsearch process runs.
If you have multiple storage nodes, you only need to run it on one of them.

...

2016-09-14 09:08:17,476 create_gdfs.py[22561] INFO     Starting process (version 2.4.3).
2016-09-14 09:08:17,476 create_gdfs.py[22561] INFO     Looking for shards in '/apps/squirro/elasticsearch/'
2016-09-14 09:08:17,476 create_gdfs.py[22561] INFO     Blacklisted: ['squirro_v7_fp', 'squirro_v7_filter']
2016-09-14 09:08:17,495 create_gdfs.py[22561] INFO     Using all indexes to create gfds files
2016-09-14 09:08:17,495 create_gdfs.py[22561] INFO     Using these shard folders:
2016-09-14 09:08:17,495 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7/0/index
2016-09-14 09:08:17,495 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_6c3apqtdt1eg64djsbi3nw/2/index
2016-09-14 09:08:17,495 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_6c3apqtdt1eg64djsbi3nw/5/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/0/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/1/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/2/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/3/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/4/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/5/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gcabla/2/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_hyhrz4m6s3obirnow3zilq/0/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_hyhrz4m6s3obirnow3zilq/1/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_hyhrz4m6s3obirnow3zilq/2/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_hyhrz4m6s3obirnow3zilq/3/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_hyhrz4m6s3obirnow3zilq/4/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_hyhrz4m6s3obirnow3zilq/5/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_lxhtoeuzrzo8wk5pzuh9qg/0/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_lxhtoeuzrzo8wk5pzuh9qg/1/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_lxhtoeuzrzo8wk5pzuh9qg/2/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_lxhtoeuzrzo8wk5pzuh9qg/3/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_lxhtoeuzrzo8wk5pzuh9qg/4/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_lxhtoeuzrzo8wk5pzuh9qg/5/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_xauxoau6sju_wj35r5ywra/0/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_xauxoau6sju_wj35r5ywra/1/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_xauxoau6sju_wj35r5ywra/2/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_xauxoau6sju_wj35r5ywra/3/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_xauxoau6sju_wj35r5ywra/4/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_xauxoau6sju_wj35r5ywra/5/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO     Using these languages:
2016-09-14 09:08:17,497 create_gdfs.py[22561] INFO     -> en
2016-09-14 09:08:17,497 create_gdfs.py[22561] INFO     Executing: java -jar global_dfs.jar /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7/0/index en -o /tmp/en > /dev/null
2016-09-14 09:08:18,042 create_gdfs.py[22561] INFO     Processed 42696 documents
2016-09-14 09:08:18,042 create_gdfs.py[22561] INFO     Found 1046 terms
2016-09-14 09:08:18,042 create_gdfs.py[22561] INFO     Executing: java -jar global_dfs.jar /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_6c3apqtdt1eg64djsbi3nw/2/index en -o /tmp/en > /dev/null
2016-09-14 09:08:18,680 create_gdfs.py[22561] INFO     Processed 42696 documents
2016-09-14 09:08:18,680 create_gdfs.py[22561] INFO     Found 1046 terms
2016-09-14 09:08:18,681 create_gdfs.py[22561] INFO     Executing: java -jar global_dfs.jar /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_6c3apqtdt1eg64djsbi3nw/5/index en -o /tmp/en > /dev/null
...
...
2016-09-14 09:08:39,535 create_gdfs.py[22561] INFO     Processed 1769 documents
2016-09-14 09:08:39,536 create_gdfs.py[22561] INFO     Found 19733 terms
2016-09-14 09:08:39,549 create_gdfs.py[22561] INFO     Found 168327 unfiltered terms
2016-09-14 09:08:39,597 create_gdfs.py[22561] INFO     8368 terms left after filtering
2016-09-14 09:08:39,597 create_gdfs.py[22561] INFO      - 155060 got removed due to too low frequenzy
2016-09-14 09:08:39,597 create_gdfs.py[22561] INFO      - 4899 got removed due to being numbers
2016-09-14 09:08:39,597 create_gdfs.py[22561] INFO     Creating 8 files per language
2016-09-14 09:08:39,597 create_gdfs.py[22561] INFO     Up to 1046 terms per file
2016-09-14 09:08:39,599 create_gdfs.py[22561] INFO     Writing 1046 terms into /tmp/en0.json
2016-09-14 09:08:39,601 create_gdfs.py[22561] INFO     Writing 1046 terms into /tmp/en1.json
2016-09-14 09:08:39,604 create_gdfs.py[22561] INFO     Writing 1046 terms into /tmp/en2.json
2016-09-14 09:08:39,607 create_gdfs.py[22561] INFO     Writing 1046 terms into /tmp/en3.json
2016-09-14 09:08:39,610 create_gdfs.py[22561] INFO     Writing 1046 terms into /tmp/en4.json
2016-09-14 09:08:39,612 create_gdfs.py[22561] INFO     Writing 1046 terms into /tmp/en5.json
2016-09-14 09:08:39,615 create_gdfs.py[22561] INFO     Writing 1046 terms into /tmp/en6.json
2016-09-14 09:08:39,618 create_gdfs.py[22561] INFO     Writing 1046 terms into /tmp/en7.json
2016-09-14 09:08:39,620 create_gdfs.py[22561] INFO     All done!
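
The last part of the log can be read as a filter-then-split step: 168327 unfiltered terms, minus 155060 low-frequency terms and 4899 numeric terms, leaves 8368 terms, which are written to 8 files of up to 1046 terms each. The following is only a minimal sketch of that arithmetic, not the actual create_gdfs.py code. The function names, the `min_freq` threshold, and the use of `str.isdigit()` to detect numbers are all assumptions made for illustration.

```python
import math
from typing import Dict, List


def filter_terms(doc_freqs: Dict[str, int], min_freq: int) -> Dict[str, int]:
    """Keep only non-numeric terms with a document frequency >= min_freq.

    Hypothetical helper: the real utility's thresholds and rules are not
    shown in the log, only the number of terms each rule removed.
    """
    kept = {}
    for term, freq in doc_freqs.items():
        if freq < min_freq:
            continue  # removed due to too low frequency
        if term.isdigit():
            continue  # removed due to being a number
        kept[term] = freq
    return kept


def split_terms(terms: List[str], num_files: int) -> List[List[str]]:
    """Split the term list into num_files chunks of near-equal size."""
    per_file = math.ceil(len(terms) / num_files)  # "Up to N terms per file"
    return [terms[i * per_file:(i + 1) * per_file] for i in range(num_files)]


if __name__ == "__main__":
    # Same term count and file count as in the log above.
    terms = [f"term{i}" for i in range(8368)]
    for index, chunk in enumerate(split_terms(terms, 8)):
        print(f"Writing {len(chunk)} terms into en{index}.json")
```

With 8368 terms and 8 files, every chunk holds 1046 terms, which matches the "Writing 1046 terms into /tmp/en0.json" to "/tmp/en7.json" lines in the log.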

...

service sqfingerprintd restart

...