The built-in “Nlp Keyphrase Tagger” pipes items through a configurable SpaCy Pipeline to perform Key-Phrase Extraction and additionally Named Entity Recognition as well as Rule-Based Sentiment Analysis.
Table of Contents |
---|
General Configuration
The Pipelet is configurable within the pipeline Editor.
Input Fields
What fields should be considered for further analysis.
fields_to_consider :
Comma separated list of fields (default:title,body
)
Reduce processing time for large documents
To reduce processing time of big PDFs, consider only a subset of pages.
process_pages
:dynamic
: Chosen relative to document size (default)
Take at least10
pages, but at most√total_pages
all
: Take all pages.int
: Take firstN
pages.
Additionally, it is possible to specify a hard limit of characters to be processed at most. This helps to reduce processing time especially for large non-binary documents like HTML, or Emails (“flat-items”).
max_characters_to_process
:all
: Analyse full contentint
: Take firstN
characters, default is50000
Language Support
Per default english (en_core_web_sm) and german (de_core_news_sm) models are installed on Squirro instances.
Install additional language models, for example Japanese (see available Spacy Models)
python -m spacy download ja_core_news_sm
language_models
: Update SpaCy language model mapping (the language code is expected to be found in facetlanguage
, see Language Detection ) .
Key-Phrase Extraction
Extract highest ranked key-phrases based on the TextRank algorithm.
Key phrases are selected and ranked from a pool of recognised Noun Chunks and recognised Named Entities per item.
Configuration
tag_phrases
: Enable / Disable key-phrase taggingtag_top_k_phrases
: Amount of phrases to tagdynamic
: Total amount of phrases selected relative to document size (between 20 - 70)10
: Take N highest ranked phrases as specified
tag_topics
: Enable simple topic-tagging based on key-phrases
Enrichment
Key phrases are stored within the nlp_tag__phrases
facet.
The item’s Title
is also added to the nlp_tag__phrases
facet (as-is, without processing).
Application
Content based auto-completion (type-ahead)
Significant-terms aggregation on search results
Simple Topic Detection
With configuration tag_topics:True
, the pool of ranked key-phrases is used to extract cleaned, deduplicated phrases referred to as “topics” (stored in the nlp_tag__topics
facet).
Concept
Code Block | ||
---|---|---|
| ||
1) Cleaning Steps:
- Remove terms with specific Part-of-Speech (POS) tag, like `adjectives`, `determiners` or `punctuation`.
- Remove terms containing (almost) only number characters, like `33120x`
- De-Duplicate:
- Do not use phrases that belong to a specific Named Entity, like ["PRODUCT", "EVENT", "PERSON"] (configurable)
- Do not use phrases that have overlapping terms as already stored "topics"
2) Select 20 phrases evenly across all ranks (as determined via TextRank) |
Named Entity Recognition
(Optional)
Store recognised entities within their corresponding facet, like .
Configuration
tag_entities
: Enable entity (NER) tagging.collect_entities
: Specify NER tags to be added. (Check support on installed Label Scheme).tag_entities_per_type
: Amount of entities (per type) to be added to their corresponding facet.
Enrichment
One facet per entity, like Location = [Europe, London]
Sentiment Analysis
Applies rule based sentiment analysis (vaderSentiment) that is specifically attuned to sentiments expressed in social media or domains like NY Times editorials, movie reviews, and product reviews.
It doesn’t require any training data but is constructed from a generalizable, valence-based, human-curated gold standard sentiment lexicon.
Configuration
tag_sentiment
: Enable rule-based sentiment tagging (for english language only)
Enrichment
Overall Sentiment Label
facet:sentiment_pretrained
One sentiment label (neutral, positive, negative
) per document.Sentiment analysis is applied per sentence
Sentences with neutral sentiment are skipped
Overall Sentiment Score
facet:nlp_tag__sentiment_score
Float value within [-1,+1]Sentiment Assessment
facet:positive_terms, facet:negative_terms
A sentiment phrase consists of the valence-term and it’s context. \
Examples
Positive Product Feedback
Input
...
Output
Code Block |
---|
{
'sentiment_pretrained': ['positive'],
'positive_terms': ['truly understand', 'insight gained'],
'negative_terms': [],
'nlp_tag__phrases': ['structured data analysis', 'unstructured email content' ]
} |
→ That review showcases the combined insights gained through sentiment-assessment and key-phrase extraction.
Negative Feedback
Input
“This was not a good experience”
Output
...
This page can now be found at Discover Steps on the Squirro Docs site.