Executive Summary

The data processing pipeline can be instructed to convert The content conversion enrichment converts incoming content to HTML. This is used to extract textual content from PDF, office and other binary file formats.

Enrichment name	content-conversion
Stage	content
Enabled by default	Yes

Architecture

...

Overview

The content-conversion step is used to convert incoming content to HTML. For supported document formats, incoming documents are split into individual pages, each represented as a separate HTML document.

The converted content is used to set the body attribute.

By default the processing step is enabled.

Image Removed

Image Added

Supported Content MIME Types

The following content MIME types are supported for conversion.

File Extension	Mime Type	Pages Support	Display Support
`.pdf`	`application/pdf`	Yes	Full
`.doc`	`application/msword`	No	HTML only
`.docx`	`application/vnd.openxmlformats-officedocument.wordprocessingml.document`	No	HTML only
`.xls`	`application/vnd.ms-excel`	No	HTML only
`.xlsx`	`application/vnd.openxmlformats-officedocument.spreadsheetml.sheet`	No	HTML only
`.ppt`	`application/vnd.ms-powerpoint`	No	HTML only
`.pptx`	`application/vnd.openxmlformats-officedocument.presentationml.presentation`	No	HTML only
`.rtf`	`text/rtf`	No	HTML only
`.odt`	`application/vnd.oasis.opendocument.text`	No	HTML only
`.ods`	`application/vnd.oasis.opendocument.spreadsheet`	No	HTML only
`.odp`	`application/vnd.oasis.opendocument.presentation`	No	HTML only
`.sxw`	`application/vnd.sun.xml.writer`	No	HTML only

Configuration

Examples

...

There are no configuration options for this enrichment, with the exception of the enabled property to enable and disable it.

Examples

The following examples reference all use the Python SDK.

Item Uploader

The following example details how to enable content conversion.

Code Block

language	py
linenumbers	true

from squirro_client import ItemUploader
# processing config to convert content
processing_config = {
    'content-conversion': {
        'enabled': True,
    },
}
uploader = ItemUploader(..., processing_config=config)
text_body = """
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Cras aliquet
venenatis blandit. Phasellus dapibus mi eu metus maximus, nec malesuada urna
congue. Vivamus in cursus risus. Sed neque ligula, lobortis in sollicitudin
quis, efficitur eu metus. Pellentesque eu nunc sit amet turpis bibendum
volutpat eu ac ante. Nam posuere eleifend rhoncus. Vivamus purus tellus,
interdum ac semper euismod, scelerisque ut ipsum. Phasellus ut convallis nunc,
quis finibus velit. Class aptent taciti sociosqu ad litora torquent per
conubia nostra, per inceptos himenaeos. Maecenas euismod placerat diam, at
pellentesque quam eleifend ac. Nunc quis est laoreet, hendrerit dui vel,
ornare sem. Integer volutpat ullamcorper orci quis accumsan. Proin
pellentesque vulputate pellentesque. Sed sapien ante, elementum sed lorem vel,
bibendum tristique arcu.
"""
items = [
    {
        'body': html_body,
        'title': 'Item 01',
    },
]
uploader.upload(items)

In the example above the processing pipeline is instructed to convert the textual content to HTML and use it as the item body to show how the content conversion enrichment step can be used.

Document Uploader

The following example details how to enable content conversion and upload binary documents (original example document can be found hereand use the content conversion step (which is on in this example as its enabled by default).

Code Block

language	py
linenumbers	true

from squirro_client import DocumentUploader

uploader = DocumentUploader(...…)

# upload a .pdf document which is split into individual pages, the document is
# fully displayed
uploader.upload('connected_campus_using_ews_api.pdf')

# upload a .docx document for which the HTML output will be displayed
uploader.upload('connected_campus_using_ews_api.docx')

uploader.flush()

In the example above the processing pipeline is instructed to convert the binary content to HTML and use it as the item body. The original document display and Squirro document display are depicted below.

...

The following example details how to enable disable content conversion for a new feed data source.

Code Block

language	py
linenumbers	true

from squirro_client import SquirroClient

client = SquirroClient(None, None, cluster='https://next.squirro.net/')
client.authenticate(refresh_token='293d…a13b')

# processing config to convert content
processing_config = {
    'content-conversion': {
        'enabled': TrueFalse,
    },
}
# source configuration
config = {
    'url': 'http://newsfeed.zeit.de/index',
    'processing': processing_config
}
# create new source subscription
client.new_subscription(
    project_id='...…', object_id='default', provider='feed', config=config)

...

Code Block

language	py
linenumbers	true

from squirro_client import SquirroClient

client = SquirroClient(None, None, cluster='https://next.squirro.net/')
client.authenticate(refresh_token='293d…a13b')

# getGet existing source configuration (including processing configuration)
source = client.get_project_source(project_id='...…', source_id='...…')
config = source.get('config', {})
processing_config = config.get('processing_config', {})

# modifyModify processing configuration
processing_config['content-conversion'] = {
    'enabled': True,
}
config['processing'] = processing_config
client.modify_project_source(project_id='...…', source_id='...…', config=config)

In the example above the processing pipeline is instructed to convert content for every new incoming item and use it as the item body.

Versions Compared

Old Version 3

New Version 4

Key

Executive Summary

Table of Contents

Architecture

Overview

Supported Content MIME Types

Configuration

Examples

Examples

Item Uploader

Document Uploader

Page Comparison

Versions Compared

Old Version 3

New Version 4

Key

Executive Summary

Table of Contents

Architecture

Overview

Supported Content MIME Types

Configuration

Examples

Examples

Item Uploader

Document Uploader