Process Files
The process files step converts various document formats into machine-readable text. Most flows begin with this step, but you can place the step anywhere in a flow. If the preceding step outputs .ibdoc files, the OCR operation reverts to the original source files.
You can define processing configuration options using an existing digitization profile, connected as a module in the process files step, or you can specify processing options directly in the step.
If you’re developing a solution that includes digitizing documents with non-default options, reuse the digitization profile from the Documents stage of your Solution Builder project. Reusing this profile ensures that production documents processed by your flow are optimally digitized.
Parameters and settings
Use these parameters and settings to control how the process files step operates.
Use Reader module
If you want to use a digitization profile that you developed in Solution Builder or Reader, select True and then select the digitization profile in the Reader Module field.
To use default digitization options, or to specify configuration within the process files step, select False.
Processing functions
Processing function indicates the document type for input files. To automatically process documents based on file extension, select Auto-extract. Otherwise, select the option that matches your input files:
- PDF documents (.pdf) with machine-readable text
- Images/Scanned documents (.pdf, .png, .jpeg, .tiff)
- Microsoft Excel documents (.xlsx)
- Microsoft Word documents (.docx)
- Microsoft Rich Text Format documents (.rtf)
- Microsoft PowerPoint documents (.pptx)
- Email messages (.eml, .msg)
- Web pages (.html, .mht)
- Text-based files (.csv, .txt)
For more details, see Supported file types and Supported settings by file type.
OCR model type
OCR model type specifies the processing engine to use for digitization. The Default (Microsoft OCR) model is suitable for most use cases, but you can select a different model depending on your circumstances.
- Abbyy – Deprecated and scheduled for removal in 24.04. Use Microsoft OCR instead.
- Google Vision (Cloud) – Appropriate for lower-quality scanned input documents, including third- and fourth-generation documents with fuzzy images, shadows, and folds.
- Microsoft Read OCR – Appropriate for high-accuracy extraction and for handwritten or foreign-language documents. This engine returns a character-based OCR confidence score that can be used in Refiner.
- Tesseract – Appropriate for high-quality scanned input documents.
Scripts directory
Scripts directories contain custom Python scripts intended to be run as part of the process files step. Most commonly, scripts are used to register custom image filters. Each .py file must contain a custom registration function. For detailed information, see Custom image filters.
Layout scope
Layout scope defines how layout analysis is performed on each document.
- Per page – Performs analysis once per page. This option is appropriate for documents with relatively self-contained pages, and it works well with a variety of layouts, including mixed vertical and horizontal orientation, different page sizes, and formatting that varies per page.
- Per document – Performs analysis on the entire document. This option can be used to extract information from tables that span multiple pages of text, because the output alignment preserves layout across multiple pages.
Layout algorithm
Layout algorithm defines how text is arranged in the output document.
- Spatial – Arranges output text from top to bottom, left to right. This option is appropriate for documents such as letters, forms, news articles, and invoices. The default, V2.0, is suitable for most standard documents. For documents with a mix of vertical and horizontal text, use V3.0.
- By Paragraph – Arranges output text according to inferred paragraph flow. This option is best for multi-column documents.
Page range
Page range defines which pages in a multipage document to digitize, for example 1 or 2–10. Wildcards aren’t supported. If left blank, all pages are digitized.
Encryption config
If your input documents are password-protected PDFs, specify passwords in a runtime_config JSON input object, provided as a value to the passwords key. The runtime_config is a JSON object of type Dict[Text, Any].
To provide passwords in the runtime configuration:
1. Create a dictionary with keys that specify filenames of encrypted PDFs and values that specify the corresponding passwords.
2. Provide the dictionary as a value to the passwords key in the runtime_config JSON file.
3. Define the instabase and pdf namespaces.
For example:
{
"instabase": {
"pdf": {
"passwords": {
"input_file1.pdf": "password1",
"input_file2.pdf": "password2"
}
}
}
}
OCR config
OCR configuration specifies general processing, pre-processing, and OCR processing options for digitization.
General processing
To configure options in JSON format, select Show advanced view.
Setting | Description | JSON | Value |
---|---|---|---|
Force Image OCR | Specifies whether to treat the document as an image. Required to be true when using visual extraction functions. If you set force_image_ocr to true, you can’t set extract_all_pdf_layers to true. | force_image_ocr | false (default), true |
Write Converted Image | Specifies whether to save per-page image files to disk. | write_converted_image | true (default), false |
Write Thumbnail | Specifies whether to generate thumbnail images, which can speed page loading while annotating documents. | write_thumbnail | true (default), false |
Write Model Training Image | Specifies whether to save a grayscale, resized version of the original page images to disk in a new model_training_images directory. Images are saved as a version of the original image, scaled down to fit in a 1024 x 1024 box. These smaller images are more suitable for training machine learning models. | write_model_training_image | false (default), true |
Extract All PDF Layers | Specifies whether to extract text elements from the machine-readable PDF page and any text within the image layer of the same PDF page. Not supported with the paragraph layout algorithm. Extracting all layers guarantees the highest fidelity text results, but it’s resource intensive because it runs OCR on all pages. If you set force_image_ocr to true, you can’t set extract_all_pdf_layers to true. | extract_all_pdf_layers | false (default), true |
Produce Metadata List | Specifies whether page layouts and metadata are set within .ibdoc files. | produce_metadata_list | true (default), false |
Produce Word Metadata | Specifies whether to include word confidence and position information in the .ibmsg file. The word metadata is required for confidence calculations and overlaying words onto the doc in the Review OCR app. | produce_word_metadata | true (default), false |
Remove Space Wordpolys from IBDOC | Specifies whether to remove empty spaces and words with no text from .ibdoc files. | remove_space_wordpolys | true (default), false |
Cache PDF Results | Specifies whether to cache results for PDF files. | cache_pdf_results | false (default), true |
Output Formats | Specifies the file format for output OCR text files. | output_formats | .csv, .txt |
Repair PDFs | Specifies whether to rewrite a PDF before processing to remove possible PDF corruption. | repair_pdfs | false (default), true |
Skip Text Extraction | Specifies whether to skip text extraction, which reduces runtime. Useful for flows that focus on entity detection. | skip_text_extraction | false (default), true |
Force Column Width CSV to PDF | Specifies whether to dynamically size columns to prevent truncation in PDFs generated from CSV. | force_column_width_csv_to_pdf | true (default), false |
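As an illustrative sketch only, a few of the flags from the table above could be combined in the advanced view like this (the values shown are arbitrary choices, not recommendations, and assume the flat key layout used by the other advanced-view examples in this topic):

```json
{
  "repair_pdfs": true,
  "cache_pdf_results": true,
  "write_thumbnail": false
}
```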
Preprocessing
Setting | Description | JSON | Value |
---|---|---|---|
Remove Boxes | Specifies whether to attempt to remove boxes from the document before processing text, which can sometimes improve OCR. | remove_boxes | false (default), true |
Remove Boxes over Height % | Specifies the minimum height, as a percentage of the page, of vertical lines to remove when performing box removal. | remove_boxes_over_height_percent | float; default: 0.2 (20 percent) |
Remove Boxes over Width % | Specifies the minimum width, as a percentage of the page, of horizontal lines to remove when performing box removal. | remove_boxes_over_width_percent | float; default: 0.2 (20 percent) |
Correct Color Inversion | Specifies whether to correct color-inverted images. | correct_inversion | false (default), true |
Detect Blurry Files | Specifies whether to detect blurry input files and return their blur factors. | detect_blurry_files | false (default), true |
Image Filters | Specifies any built-in or custom image filters to run and the arguments to pass in. | image_filters | JSON dictionary of filters to run. |
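For instance, to enable box removal with a higher line-length threshold, a sketch of the corresponding advanced-view JSON might look like this (values are illustrative and assume the same flat key layout as the other advanced-view examples in this topic):

```json
{
  "remove_boxes": true,
  "remove_boxes_over_height_percent": 0.3,
  "remove_boxes_over_width_percent": 0.3
}
```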
OCR processing
Available OCR options are displayed based on the selected model type.
Setting | Description | JSON | Value |
---|---|---|---|
OCR Timeout (secs) | Specifies the duration in seconds to wait for a response from the OCR service before timing out. | ocr_timeout | integer; default: 300 seconds |
Correct Resolution | Specifies whether to attempt to resize the image for OCR processing. Usually, this method is inferior to correct_resolution_auto. | correct_resolution | false (default), true |
Auto-Correct Resolution | Specifies whether to attempt to automatically change the image resolution for optimal OCR processing. Usually, this method is preferable to correct_resolution. | correct_resolution_auto | false (default), true |
Correct Orientation | Specifies whether to attempt to correct page rotations of 90, 180, and 270 degrees. | correct_orientation | false (default), true |
Page Dewarp | Specifies whether to attempt to correct skew and warp in the image. | dewarp_page | false (default), true |
Reorient Words | Specifies whether to attempt to transform the coordinates of the words so that the formatted text output is correct under rotation. | reorient_words | false (default), true |
Languages | Specifies which language models the OCR uses, which helps OCR efficiency and accuracy. | languages | A list of language models to use when running OCR. See Supported language codes. Default: en, the English language model. |
Fonts | For the deprecated Abbyy OCR model only, specifies which text types the OCR reads. By default, the text types are automatically selected. Available fonts are Normal, Typewriter, Matrix, Index, OCR-A, OCR-B, MICR-E13B, MICR-CMC7, Gothic, and Receipt. | fonts | A comma-separated list of strings specifying the text types. |
Detect Barcodes | Specifies whether to attempt to extract barcode information from the document. | detect_barcodes | false (default), true |
Find Lines | Specifies whether to detect lines in the documents. | find_lines | false (default), true |
Model-Specific Settings | Specifies what model-specific settings to use, if any. See Model-specific settings for details. | model_specific_settings | Default: none. hq_v1: settings for the Abbyy OCR model (deprecated). lq_v1: settings for the Google Vision (Cloud) OCR model. marx_v1: settings for the Microsoft OCR model. |
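As a sketch of how these flags might appear together in the advanced view (timeout and language codes are arbitrary illustrations; see Supported language codes for valid values):

```json
{
  "ocr_timeout": 600,
  "correct_orientation": true,
  "languages": ["en", "fr"]
}
```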
Model-specific settings
Options to configure Model-Specific Settings depend on the OCR model type.
To configure OCR settings for Abbyy, use the hq_v1 JSON object with these keys:
- Profile Name – A named profile that corresponds to a set of flags to optimize the OCR model for a specific purpose. Supported profile names are:
  - doc_style_accurate – Includes the structure, style, and text in the document.
  - doc_style_fast – Includes the structure, style, and text in the document, optimized for speed over accuracy.
  - doc_text_accurate – Includes text embedded in logos and standard text detection. Excludes style and document structure information.
  - doc_text_fast – Includes text embedded in logos and standard text detection. Excludes style and document structure information, optimized for speed over accuracy.
  - text_only_accurate – Includes maximal text detection, including small text areas of low quality. Tables, photos, style, and document structure aren’t analyzed.
  - text_only_fast – Tables, photos, style, and document structure aren’t analyzed. Optimized for speed over accuracy.
To configure OCR settings for Google OCR, use the lq_v1 JSON object with these keys:
- Feature Type – Controls the types of images the low-quality OCR model is optimized for.
  - general – Detects and extracts text from any image.
  - document – Optimized for dense text and documents.
To configure OCR settings for Microsoft, use the marx_v1 JSON object with these keys:
- Version – Controls the Microsoft OCR version.
  - v2 – Uses the lite deployment.
  - v3 – Uses the max deployment.
For example, entering the following parameters in Model-Specific Settings configures the model to use Microsoft OCR Max:
{
"marx_v1": {
"version": "v3"
}
}
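By analogy, a Google Vision configuration optimized for dense text might look like the following sketch (the key name feature_type is an assumption mirroring the Feature Type option above; verify the exact key in your environment):

```json
{
  "lq_v1": {
    "feature_type": "document"
  }
}
```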
Native PDF settings
Native PDF settings let you control the generation of native PDFs for image, PDF, or .tiff documents.
Specify native PDF settings in JSON only, in the OCR configuration advanced view.
- native_pdf – A JSON object that specifies options for generating native PDFs.
  - write_native_pdf – Specifies whether to generate native PDFs for the given input documents. Setting to false generates no PDF.
  - resolution_dpi – An integer between 72 and 300 that sets the resolution DPI of the output PDFs. Applicable only if write_native_pdf is true.
For example, in the following code sample, native PDFs are generated with a 300 DPI resolution:
"native_pdf": {
"write_native_pdf": true,
"resolution_dpi": 300
}
Filter settings
Filter settings enable you to reprocess records of a specified class using different processing settings. You can specify filter settings only if your flow includes a previous step that outputs .ibdoc files. Page range settings take precedence over filter settings.
Specify filter settings in JSON only, in the OCR configuration advanced view.
- filter_settings – A JSON object that specifies whether to reprocess records with a specified class using different processing settings.
  - class_type – Specifies a list of class labels you want to reprocess.
  - skip_other_classes – Specifies whether to exclude records from other class labels from the output .ibdoc.
  - merge_pages – Specifies whether a reprocessed record has a single record in the output .ibdoc or multiple records based on the number of pages in that record.
For example, in the following code sample, records belonging to class 1040 and W-4 are reprocessed. Other records aren’t reprocessed, but because skip_other_classes is false, these records get an entry in the resulting .ibdoc without any change. The reprocessed pages from a record are merged to create a single labeled record in the resulting .ibdoc because merge_pages is true.
"filter_settings": {
"class_type": [
"1040", "W-4"
],
"skip_other_classes": false,
"merge_pages": true
}
Custom image filters
Use a scripts directory to register custom image filters in a Python file, for example scripts.py.
Implementing a custom image filter
An image filter is a Python class that implements the following interface:
from typing import Dict, Text, Tuple

from PIL import Image
from typing_extensions import TypedDict

FilterConfigDict = TypedDict('FilterConfigDict',
    params=Dict[Text, Text]
)

class ImageFilter(object):
    """
    Interface for an image filter.
    """

    def __init__(self, filter_config):
        # type: (Dict[Text, FilterConfigDict]) -> None
        raise NotImplementedError(u'To be implemented.')

    def execute(self, img):
        # type: (Image.Image) -> Tuple[Image.Image, Text]
        raise NotImplementedError(u'To be implemented.')
Your custom image filter is responsible for taking a Pillow image, running any processing on it, and returning a tuple containing the processed image and any errors.
Passing parameters
The image filter constructor takes in a filter_config of type FilterConfigDict using this logic:
1. Read from the image filters config in the OCR config.
2. Overwrite what was passed in from the image filters config with any overlapping key in the runtime_config. This information is stored in the params part of FilterConfigDict.
Image filters are enabled in a flow by specifying them in the OCR config of a process files step. An image_filters flag specifies the filters and passes additional parameters.
{
"image_filters": [
{
"filter_name": "example_filter_1",
"parameters": {
"additional_parameter_1": 1,
"additional_parameter_2": true,
"additional_parameter_3": "test"
}
},
{
"filter_name": "example_filter_2"
}
]
}
Image filters that are passed in the image_filters list (example_filter_1 and example_filter_2 in this example) are applied in the process files step. The parameters are passed to their respective image filter objects in the params key of the filter_config dictionary. For example, you can access additional_parameter_1 in the example_filter_1 object through self.filter_config['params']['additional_parameter_1'].
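To make the parameter flow concrete, here is a minimal sketch of a filter that exposes a parameter without transforming the image. The class name NoOpFilter and its attribute names are illustrative only, not part of the product API:

```python
from typing import Any, Dict, Text, Tuple

class NoOpFilter(object):
    """Illustrative filter showing how parameters reach filter_config."""

    def __init__(self, filter_config):
        # type: (Dict[Text, Any]) -> None
        self.filter_config = filter_config
        # Values supplied under "parameters" in the image_filters config
        # arrive in the 'params' key of filter_config.
        params = filter_config.get('params', {})
        self.param_1 = params.get('additional_parameter_1')

    def execute(self, img):
        # type: (Any) -> Tuple[Any, Text]
        # A real filter would transform the Pillow image here; this one
        # returns it unchanged, with no error.
        return img, None

def register_image_filter():
    # type: () -> Dict
    return {'example_filter_1': {'class': NoOpFilter}}
```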
Registering a custom image filter
To make your custom image filter available, register it in the same Python module that defines it by creating a special register_image_filter function. This function returns a Python dictionary of the following form:
def register_image_filter():
    # type: () -> Dict
    return {
        '<FILTER-NAME>': {
            'class': <IMAGE-FILTER-CLASS>
        }
    }
Example: Background removal
This example image filter removes the background from an image.
from typing import Dict

import cv2
import numpy as np
from PIL import Image

class BackgroundRemoval(object):

    def __init__(self, filter_config):
        self.filter_config = filter_config
        # intensities of interest
        self.ioi = self.filter_config['params']['ioi']

    def _execute_bg_removal(self, img_rgb):
        if len(img_rgb.shape) > 2:
            gray = cv2.cvtColor(img_rgb, cv2.COLOR_BGR2GRAY)
        else:
            gray = img_rgb
        for intensity in self.ioi:
            ret, thresh = cv2.threshold(gray, intensity, 255, cv2.THRESH_BINARY_INV)
            bw = thresh > 0
            w, h = bw.shape[::-1]
            if (bw.sum() * 100) / (w * h) > 2:
                break
        # noise removal
        kernel = np.ones((1, 1), np.uint8)
        opening = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=2)
        # sure background area
        sure_fg = cv2.dilate(opening, kernel, iterations=3)
        # marker labelling
        ret, markers = cv2.connectedComponents(sure_fg)
        # add one to all labels so that sure foreground is not 0, but 1
        markers = markers + 1
        # now, mark the region of unknown with zero
        markers[sure_fg == 0] = 0
        markers[markers > 0] = 255
        # apply: whiten everything outside the marked foreground
        img_rgb[markers == 0] = [255, 255, 255]
        img_rgb = cv2.dilate(img_rgb, kernel, borderValue=255, iterations=2)
        img_rgb = cv2.morphologyEx(img_rgb, cv2.MORPH_CLOSE, kernel, iterations=2)
        return img_rgb

    def execute(self, img):
        rgb_img = img.convert('RGB')
        cv_img = np.array(rgb_img)
        cv_img = cv2.cvtColor(cv_img, cv2.COLOR_RGB2BGR)
        proc_img = self._execute_bg_removal(cv_img)
        ret_img = Image.fromarray(cv2.cvtColor(proc_img, cv2.COLOR_BGR2RGB))
        return ret_img, None

def register_image_filter():
    # type: () -> Dict
    return {'background-removal': {'class': BackgroundRemoval}}
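To enable this filter in a process files step, reference its registered name in the image_filters flag of the OCR config and pass the ioi parameter it reads. The intensity values below are arbitrary illustrations; tune them for your documents:

```json
{
  "image_filters": [
    {
      "filter_name": "background-removal",
      "parameters": {
        "ioi": [120, 160, 200]
      }
    }
  ]
}
```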