...
If you are indexing generic news items, the starting GDF will yield great results.
If, however, you are indexing very specific content, it is highly recommended to recalculate the GDF frequently.
Download
Squirro Version | Download |
---|---|
Squirro 2.4.3 | squirro_gdfs_util_2.4.3.zip |
Usage Example
The utility is configured through an ini file.
It needs to run on a Squirro storage node, which is where the Elasticsearch process is running.
If you have multiple storage nodes, you only need to run it on one.
...
Code Block |
---|
2016-09-14 09:08:17,476 create_gdfs.py[22561] INFO Starting process (version 2.4.3).
2016-09-14 09:08:17,476 create_gdfs.py[22561] INFO Looking for shards in '/apps/squirro/elasticsearch/'
2016-09-14 09:08:17,476 create_gdfs.py[22561] INFO Blacklisted: ['squirro_v7_fp', 'squirro_v7_filter']
2016-09-14 09:08:17,495 create_gdfs.py[22561] INFO Using all indexes to create gfds files
2016-09-14 09:08:17,495 create_gdfs.py[22561] INFO Using these shard folders:
2016-09-14 09:08:17,495 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7/0/index
2016-09-14 09:08:17,495 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_6c3apqtdt1eg64djsbi3nw/2/index
2016-09-14 09:08:17,495 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_6c3apqtdt1eg64djsbi3nw/5/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/0/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/1/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/2/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/3/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/4/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gca/5/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_df-wuwgpqu-cnbxope4gcabla/2/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_hyhrz4m6s3obirnow3zilq/0/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_hyhrz4m6s3obirnow3zilq/1/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_hyhrz4m6s3obirnow3zilq/2/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_hyhrz4m6s3obirnow3zilq/3/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_hyhrz4m6s3obirnow3zilq/4/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_hyhrz4m6s3obirnow3zilq/5/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_lxhtoeuzrzo8wk5pzuh9qg/0/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_lxhtoeuzrzo8wk5pzuh9qg/1/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_lxhtoeuzrzo8wk5pzuh9qg/2/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_lxhtoeuzrzo8wk5pzuh9qg/3/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_lxhtoeuzrzo8wk5pzuh9qg/4/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_lxhtoeuzrzo8wk5pzuh9qg/5/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_xauxoau6sju_wj35r5ywra/0/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_xauxoau6sju_wj35r5ywra/1/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_xauxoau6sju_wj35r5ywra/2/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_xauxoau6sju_wj35r5ywra/3/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_xauxoau6sju_wj35r5ywra/4/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO -> /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_xauxoau6sju_wj35r5ywra/5/index
2016-09-14 09:08:17,496 create_gdfs.py[22561] INFO Using these languages:
2016-09-14 09:08:17,497 create_gdfs.py[22561] INFO -> en
2016-09-14 09:08:17,497 create_gdfs.py[22561] INFO Executing: java -jar global_dfs.jar /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7/0/index en -o /tmp/en > /dev/null
2016-09-14 09:08:18,042 create_gdfs.py[22561] INFO Processed 42696 documents
2016-09-14 09:08:18,042 create_gdfs.py[22561] INFO Found 1046 terms
2016-09-14 09:08:18,042 create_gdfs.py[22561] INFO Executing: java -jar global_dfs.jar /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_6c3apqtdt1eg64djsbi3nw/2/index en -o /tmp/en > /dev/null
2016-09-14 09:08:18,680 create_gdfs.py[22561] INFO Processed 42696 documents
2016-09-14 09:08:18,680 create_gdfs.py[22561] INFO Found 1046 terms
2016-09-14 09:08:18,681 create_gdfs.py[22561] INFO Executing: java -jar global_dfs.jar /apps/squirro/elasticsearch/data/squirro-prod2/nodes/0/indices/squirro_v7_6c3apqtdt1eg64djsbi3nw/5/index en -o /tmp/en > /dev/null
...
...
2016-09-14 09:08:39,535 create_gdfs.py[22561] INFO Processed 1769 documents
2016-09-14 09:08:39,536 create_gdfs.py[22561] INFO Found 19733 terms
2016-09-14 09:08:39,549 create_gdfs.py[22561] INFO Found 168327 unfiltered terms
2016-09-14 09:08:39,597 create_gdfs.py[22561] INFO 8368 terms left after filtering
2016-09-14 09:08:39,597 create_gdfs.py[22561] INFO - 155060 got removed due to too low frequenzy
2016-09-14 09:08:39,597 create_gdfs.py[22561] INFO - 4899 got removed due to being numbers
2016-09-14 09:08:39,597 create_gdfs.py[22561] INFO Creating 8 files per language
2016-09-14 09:08:39,597 create_gdfs.py[22561] INFO Up to 1046 terms per file
2016-09-14 09:08:39,599 create_gdfs.py[22561] INFO Writing 1046 terms into /tmp/en0.json
2016-09-14 09:08:39,601 create_gdfs.py[22561] INFO Writing 1046 terms into /tmp/en1.json
2016-09-14 09:08:39,604 create_gdfs.py[22561] INFO Writing 1046 terms into /tmp/en2.json
2016-09-14 09:08:39,607 create_gdfs.py[22561] INFO Writing 1046 terms into /tmp/en3.json
2016-09-14 09:08:39,610 create_gdfs.py[22561] INFO Writing 1046 terms into /tmp/en4.json
2016-09-14 09:08:39,612 create_gdfs.py[22561] INFO Writing 1046 terms into /tmp/en5.json
2016-09-14 09:08:39,615 create_gdfs.py[22561] INFO Writing 1046 terms into /tmp/en6.json
2016-09-14 09:08:39,618 create_gdfs.py[22561] INFO Writing 1046 terms into /tmp/en7.json
2016-09-14 09:08:39,620 create_gdfs.py[22561] INFO All done! |
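The final lines of the log are consistent with each other: 168327 unfiltered terms minus 155060 low-frequency terms minus 4899 numeric terms leaves 8368 terms, and 8368 terms spread over 8 files gives 1046 terms per file. The following Python sketch illustrates that kind of filtering and splitting. It is not the utility's actual code: the function names, the `min_df` threshold, the use of `str.isdigit` to detect numbers, and the order of the two filters are all assumptions made for illustration.

```python
import math
from typing import Dict, List


def filter_terms(doc_freqs: Dict[str, int], min_df: int) -> Dict[str, int]:
    """Drop purely numeric terms and terms seen in fewer than min_df documents.

    Hypothetical helper: the real utility's filter rules and threshold
    are not documented here.
    """
    kept = {}
    for term, df in doc_freqs.items():
        if term.isdigit():
            continue  # term consists only of digits
        if df < min_df:
            continue  # document frequency is too low
        kept[term] = df
    return kept


def split_terms(terms: Dict[str, int], n_files: int) -> List[Dict[str, int]]:
    """Split terms into chunks of at most ceil(len(terms) / n_files) entries."""
    per_file = max(1, math.ceil(len(terms) / n_files))
    items = sorted(terms.items())  # sort for a deterministic split
    return [dict(items[i:i + per_file]) for i in range(0, len(items), per_file)]


if __name__ == "__main__":
    sample = {"squirro": 12, "index": 7, "2016": 30, "rare": 1}
    kept = filter_terms(sample, min_df=2)
    for number, chunk in enumerate(split_terms(kept, n_files=2)):
        print(number, chunk)
```

With the numbers from the log, `math.ceil(8368 / 8)` evaluates to 1046, which matches the "Up to 1046 terms per file" line.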
...
Code Block |
---|
service sqfingerprintd restart |
...