About Refiner
Note: this product was called “Refiner 5” until the 21.4.0 release.
Refiner is a multi-modal document extraction platform that lets you work with documents directly to extract text and visual information in a unified, intuitive, and extensible interface.
Extraction is a diverse activity. Documents can be long or short, structured or unstructured, text-heavy, table-heavy, or barcode-covered, and everything in between.
Most extraction tools don’t manage diversity well. The Refiner platform manages the full diversity of extraction with these capabilities:
- Uses open file formats for describing images, documents, training data, and extraction.
- Includes a plugin system for extraction models to interact with these file formats. Instabase ships with about 40 pre-written extraction models.
- Establishes rules for how these models behave to preserve audit trails, explainability, and composition.
This platform approach provides a stable base for all extraction and is extensible. Many extraction plugins are provided, including Refiner functions and selected visual functions.
Supported extraction forms
Refiner supports these main forms of extraction:
- Spatial Extraction: Extract the phrase to the right of the text “First Name:”
- Regular Expressions: Extract the phrase captured by ^Amount: \$([0-9.]+)$
- Semantic: Extract the ADDRESS below the text “Shipped To”
- Visual: Is the CHECKBOX associated with “Married - Filing Jointly” checked?
- Metadata: What is the FILENAME associated with this data?
- Structure: Generate a LIST of every bullet point under the heading Risks
- Plugins (UDFs): Run my custom model and return its result.
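As a rough illustration of the regular-expression form, the sketch below uses Python's standard re module rather than Refiner itself. The pattern is an assumption: it is a conventional rendering of the Amount example, with ^ at the start of the line, $ at the end, and the literal dollar sign escaped. The function name extract_amount and the sample text are hypothetical.

```python
import re

# Assumed pattern: anchors in conventional order, literal "$" escaped.
# MULTILINE lets ^ and $ match at the start and end of each line.
AMOUNT_PATTERN = re.compile(r"^Amount: \$([0-9.]+)$", re.MULTILINE)


def extract_amount(text):
    """Return the first captured amount as a string, or None if absent."""
    match = AMOUNT_PATTERN.search(text)
    return match.group(1) if match else None


sample = "Invoice 1001\nAmount: $42.50\nThank you"
print(extract_amount(sample))
```

The captured group is returned as text; converting it to a number would be a separate step.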
For hands-on guidance to working with Refiner, see the text extraction guide and visual extraction guide.
When to use Refiner
Use Refiner to do additional refinement on data extracted by deep learning models, from post-processing model output to adding custom business logic.
Refiner features:
- In-app visualization of the development dataset
- Detailed in-app documentation of Refiner functions
- Support for visual processing (checkboxes, signature, image crop, and so on)
- Full UDF authoring, executing, and debugging support
Getting started
Creating Refiner programs
For Solution Builder projects, a Refiner program will be created for you through the Solution Builder UI.
To create a Refiner program outside of a Solution Builder project:
- Open the Refiner app.
- Click Create Classic Refiner.
- Fill out the fields in the creation dialog.
Navigating Refiner
Layouts provide different ways to view documents, records, fields, and output in Refiner.
- Document: Shows the document viewer and field list.
- Split: Shows the output table, document viewer, and field list.
- Table: Shows the output table.
- Custom: Drag to show or hide panels, then save the layout.
When the document viewer is shown, you can toggle between the document image and the extracted OCR text.
Select records to run
Use the records drawer to control which records appear in the output table and the run process.
- To open the records drawer, select the Select Records button. The Available Records drawer opens with all the available records.
- Click a record to preview the first 3 pages of the document.
- Select All or select one or more record checkboxes to include the selected records in the output table and the run process.
Selecting only a subset of records is useful to speed up your development because unselected records are excluded from the run process.
Refiner functions
When you start typing a Refiner function, in-product documentation is shown. To view an in-app Formula list, use Help > Formula list.
Extracting text fields
To extract a text field:
- In the right panel, click + New Field.
- Replace the provided field_ name with a unique, self-describing field name, then press Enter or click outside of the field to apply the name change. Duplicate field names are not supported. For helper fields, prefix the field name with a double underscore (__). The double underscore is a naming convention that prevents helper fields from being generated in the output and downstream applications.
- In the bottom panel, define your field.
  - Leave the Field type set to the default, Text Field.
  - Optional: Select an Output type. The supported types are No type, Text, Float, Integer, List, Image, Table, and Dict. Defining a type lets you filter the output display by the selected output type.
  - To use the Target Comparison feature, select a Target name to map the new field to a field in the targets file.
  - Optional: Add a Field description.
- Optional: To enable Target Comparison, move the Run with targets slider to the right.
- In the bottom panel, enter the Refiner formula, and then click Run Field.
- The results show for each field. If the Target Comparison feature is enabled for the run, the mapped fields are indicated with a purple bar in the records list and in the fields pane.
- Click Save to save the new field to your program.
Unsaved Changes displays below the Save button if Refiner program changes are not saved.
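The double-underscore convention for helper fields can be pictured as a simple name filter. The sketch below is plain Python, not Refiner's implementation; the field names and the visible_fields helper are hypothetical.

```python
HELPER_PREFIX = "__"


def visible_fields(fields):
    """Drop helper fields (names starting with a double underscore)."""
    return {
        name: value
        for name, value in fields.items()
        if not name.startswith(HELPER_PREFIX)
    }


extracted = {
    "__anchor_first_name": "First Name:",  # helper field, kept out of output
    "first_name": "Ada",
    "amount": "42.50",
}
print(visible_fields(extracted))
```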
Extracting visual fields
Extraction is supported for these visual field types:
- Image Crop: Returns a base64-encoded image section of the document. Set the field Type to Image and Run Field to render the image in the Output tab.
- Checkbox: Returns a boolean indicating whether the checkbox or radio button is ticked.
- Signature: Returns a boolean indicating whether a signature is present.
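An Image Crop result is described above as a base64-encoded image section. If such a value is handled outside Refiner, it could be decoded with Python's standard base64 module, as in this minimal sketch. It assumes the value is bare base64 with no data URI prefix. The sample value is only the 8-byte PNG signature encoded for demonstration, not real Refiner output, and decode_image_crop is a hypothetical helper name.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_image_crop(encoded):
    """Decode a base64 string into raw image bytes."""
    # validate=True rejects characters outside the base64 alphabet.
    return base64.b64decode(encoded, validate=True)


# Stand-in value: only the PNG signature, encoded for demonstration.
sample_value = base64.b64encode(PNG_SIGNATURE).decode("ascii")
image_bytes = decode_image_crop(sample_value)
print(image_bytes.startswith(PNG_SIGNATURE))
```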
Visual fields require one or more anchor fields.
- Anchor fields are Refiner functions with a consistent location in the documents.
- Visual extraction uses anchor fields to figure out where in a record to locate the image, checkbox, or signature you want to extract.
- Anchor fields must exist on the same page as the visual field.
- You must run the anchor fields before you create the corresponding visual fields.
To add a visual field:
- In the right panel, click + New Field to add an anchor field.
- In the bottom panel, enter the field name using a double underscore (__) prefix. For example, __anchor_signedcheckbox
  - The double underscore is a naming convention that prevents helper fields from being generated in the output and downstream applications.
- Click Run Field to run the anchor field.
- In the right panel, click + New Field and change the field type to Visual Field.
- For Visual Function Type, select the type of visual field to extract from the list: Image Crop, Checkbox, or Signature.
- Click in Anchor Fields and select the previously defined anchor field.
- Click Run Field to run the field on all documents.
- Click Save to save the new field to your program.
Using text fields to process visual fields
After you extract a visual field, you can use visual Refiner functions in text fields to process the visual field. Start typing a visual Refiner function to view the in-product documentation.
- image_crop_relative
- image_decode_checkbox
- image_decode_signature
Field execution
Fields in the field panel are processed in order with Run All.
Right-click a field to perform these actions:
- Move field to the top
- Move field up
- Move field down
- Move field to the bottom
- Duplicate the field
- Create a field above
- Create a field below
You can run all fields or just the selected field:
- Click Run All to run the Refiner functions on all fields.
- Click Run Field to run only the selected field.
View options
Use the View menu to filter views.
- Show hidden fields toggles the display in the output table. Hidden fields follow the field name prefix convention of a double underscore (__).
- Show annotations for selected fields only toggles the display for annotations.
Integrating a completed Refiner program
To integrate a completed program into a Flow, add the Apply Refiner step to your Flow and select the .ibrefiner file.
Keyboard shortcuts
Refiner supports the following keyboard shortcuts to turbocharge your development process.
To view in-app keyboard shortcuts, use Help > Keyboard shortcuts.
Name | Shortcut | Description |
---|---|---|
Save | Command+S / Control+S | Save the program |
Run Current Field | Command+Enter / Control+Enter | Run the current field only |
Run (all fields) | Command+Shift+Enter / Control+Shift+Enter | Run all fields |
Formula List | Command+/ / Control+/ | Display searchable formula list |
Next Record | Down Arrow Key | Go to the next record row |
Previous Record | Up Arrow Key | Go to the previous record row |
Next Field | Right Arrow Key | Go to the next field |
Previous Field | Left Arrow Key | Go to the previous field |
Fixed structure documents
The structure of fixed structure documents is known in advance and doesn’t change. Fixed structure documents have labels associated with fields to extract.
Extraction from these documents reaches high accuracy quickly, tolerates OCR errors, and has built-in provenance tracking support.
Examples: ADP Paystub, Bank of America Bank Statement, CA Driver License. Pre-built solutions handle fixed structure documents well and are available in the Marketplace.
Action | Example | Method | Limitations |
---|---|---|---|
Extract text | Extract the phrase to the right of the text First Name: | Spatial functions, such as scan_right, scan_below, or scan_box. Or use regular expressions | Doesn’t support scan_above, scan_left |
Detect visual feature | Is the CHECKBOX associated with Married - Filing Jointly checked? | Visual functions | Doesn’t support faces or logos. |
Crop image | Extract the image corresponding to signature above Authorizer | Visual functions | Doesn’t support faces or logos. |
Extract structured information | Generate a list of every bullet point underneath the heading Risks | Use spatial functions to extract region, then split by delimiter, such as new line or comma | Doesn’t support special view to handle structured information |
Specify different output format | Ensure dates are in mm-dd-yy format | UDF | |
Validate | Ensure expiry_date is after today’s date | UDF | |
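The last two rows of the table rely on UDFs. The functions below sketch the kind of logic such a UDF might contain, in plain Python with the standard datetime module. How a function is registered as a UDF is not shown. The function names and the assumed input format (YYYY-MM-DD) are hypothetical.

```python
from datetime import date, datetime


def to_mm_dd_yy(value, input_format="%Y-%m-%d"):
    """Reformat a date string into mm-dd-yy."""
    return datetime.strptime(value, input_format).strftime("%m-%d-%y")


def is_after_today(value, input_format="%Y-%m-%d", today=None):
    """Return True if the date in `value` falls after today."""
    # `today` can be passed in explicitly, which makes the check testable.
    today = today or date.today()
    return datetime.strptime(value, input_format).date() > today


print(to_mm_dd_yy("2024-03-09"))
print(is_after_today("2024-03-09", today=date(2024, 1, 1)))
```

Both functions raise ValueError when the input does not match the assumed format, so a real UDF would likely need to decide how to report such values.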
Variable structure documents
The structure of variable structure documents can change between documents. These documents have labels associated with the fields to extract.
Extraction from these documents reaches reasonable accuracy, tolerates OCR errors, and has built-in provenance tracking support.
Examples: US W2, Bill of Lading, Invoices, and so on.
Action | Example | Method | Limitations |
---|---|---|---|
Extract text around label | Extract the phrase around the text Invoice Number: or Invoice No. | Use regular expressions or more flexible spatial functions, such as scan_near or scan_box | Is text-based; does not support spatial scanning in the document image domain |
Extract text semantically | Extract the first address on the document as shipper_address | Use nlp_token_find methods with Instabase’s default entities, such as name, global-address, and date; with entities created from uploaded datasets; or with entities you define yourself | Existing Token Matchers are continually being improved, and new ones are being added |
Crop image | Extract the image corresponding to signature above Authorizer | Visual functions | Can’t be done in a long-tail manner; relative direction from anchor is fixed |
Extract structured information | Generate a list of every bullet point underneath the heading Risks | First use the text-based high-variability technique above to find the right labels, then scan_below | More to come on table extraction |
Specify different output format | Ensure dates are in mm-dd-yy format | UDF | More native output format selection coming |
Validate | Ensure expiry_date is after today’s date | UDF | More native validation formulas coming |
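For labels that vary between documents, one regular expression can cover the known variants. The sketch below uses Python's re module with a pattern assumed for the Invoice Number: and Invoice No. example. The pattern, sample text, and function name are illustrative and are not taken from Refiner.

```python
import re

# Matches "Invoice Number:" or "Invoice No." followed by the value.
INVOICE_LABEL = re.compile(
    r"Invoice\s+(?:Number:|No\.)\s*(\S+)", re.IGNORECASE
)


def find_invoice_number(text):
    """Return the token after either label variant, or None."""
    match = INVOICE_LABEL.search(text)
    return match.group(1) if match else None


print(find_invoice_number("Invoice Number: INV-001"))
print(find_invoice_number("invoice no. 7788 dated 01/02/24"))
```

Each new label variant means another alternative in the pattern.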
Advanced extraction
Advanced extraction includes provenance tracking and UDFs.
Provenance tracking
Provenance tracking traces each output back to its origin in the input, so you can determine where in the input a given output came from. It maps output text coordinates to INPUT_COL text coordinates.
Provenance Tracking is automatically set up when your Refiner programs are created from the Recipe Book or Training Projects.
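Conceptually, provenance can be pictured as keeping the character offsets of each extracted value in the input text. The sketch below illustrates that idea with Python's re module. It is not Refiner's provenance mechanism, and the function name and sample text are hypothetical.

```python
import re


def extract_with_provenance(pattern, input_text):
    """Return the captured value plus its (start, end) offsets in the input."""
    match = re.search(pattern, input_text)
    if match is None:
        return None
    start, end = match.span(1)  # offsets of the first capture group
    return {"value": match.group(1), "start": start, "end": end}


text = "First Name: Ada"
result = extract_with_provenance(r"First Name: (\w+)", text)
print(result)
# The offsets point back at the same characters in the input.
print(text[result["start"]:result["end"]] == result["value"])
```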
Adding a UDF
Although Refiner is powerful, it might not support all of your extraction requirements, particularly if you want to integrate specific business logic. You can extend capabilities by writing UDFs, and referencing the script directory in the Settings panel. See UDFs to provide custom code in Flow and Refiner.
Troubleshooting
Tips on isolating and resolving problems with Refiner.
What to do when files don’t load?
- Refresh the page.
- Reselect the IBOCR/IBDOC folder with File > Open Folder to make sure the path is still valid.
- When selecting the input folder, make sure that the folder contains valid .ibdoc (IBDOC) files that do not contain refined_phrases. The input folder is typically in the project out/s2_map_records folder.
- Try creating a new Refiner program by right-clicking the IBOCR/IBDOC folder that you want included. Creating a new program verifies the file system and the .ibdoc files. Make sure that the upstream resources are in the expected location.
Note: You might get an “An unexpected error occurred” warning if files aren’t in the designated location.
What to do when the page goes blank?
- Open the JavaScript console, take a screenshot, and attach it to a bug report. Provide details about what you were doing and enough information to help us reproduce the problem.
- Refresh the page.
Debug formulas and UDFs when you receive an error field in cell
You can see the error message directly in the cell, and that message can give you a clue.
Common errors:
- Make sure your Refiner formula does not contain double quotation marks ("). Use single quotation marks (') instead.
- Make sure your parentheses are balanced.
- Make sure your regular expressions are valid and match what you intend. You can use an external website such as regexr.com to test your expressions.
- Make sure you’re providing the correct values for the parameters that the Refiner functions accept. Use the in-app documentation for function usage information.
- If any UDFs are involved, make sure there are no errors in them.
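Several of these checks can be done before running a formula. The sketch below is a hypothetical pre-check in plain Python, not a Refiner feature. It flags double quotation marks and unbalanced parentheses in a formula string, and uses re.compile to confirm that a regular expression is valid under Python's regex syntax (assumed here to be close enough for a first check). The formula strings use a placeholder function name, example_fn, and the check does not account for parentheses inside quoted strings.

```python
import re


def check_formula(formula):
    """Return a list of likely problems in a formula string."""
    problems = []
    if '"' in formula:
        problems.append("contains double quotation marks")
    depth = 0
    for char in formula:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:  # a ")" appeared before its "("
                break
    if depth != 0:
        problems.append("unbalanced parentheses")
    return problems


def is_valid_regex(pattern):
    """Return True if the pattern compiles under Python's re module."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(check_formula("example_fn('First Name:')"))
print(check_formula('example_fn("First Name:"'))
print(is_valid_regex(r"Amount: \$([0-9.]+)"))
print(is_valid_regex(r"Amount: ([0-9.+"))
```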
Log messages in the UDF log
To isolate problems in UDFs, you can log messages using the logging module:
import logging
logging.info('my message here')
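A slightly fuller sketch of the same idea: a hypothetical UDF-style function that logs its input and result so that intermediate values can be inspected. The function name and logic are illustrative only. How the function is registered as a UDF is not shown, and where log output appears depends on the environment.

```python
import logging


def normalize_amount(raw_value):
    """Strip currency formatting and return a float, logging each step."""
    logging.info("normalize_amount input: %r", raw_value)
    cleaned = raw_value.replace("$", "").replace(",", "").strip()
    try:
        result = float(cleaned)
    except ValueError:
        # Log the failure with a traceback, then signal "no value".
        logging.exception("could not parse %r as a number", raw_value)
        return None
    logging.info("normalize_amount result: %s", result)
    return result


print(normalize_amount("$1,234.50"))
print(normalize_amount("n/a"))
```

Logging the raw input with %r makes stray whitespace and unexpected characters visible in the log.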
Known problems
- Error messages are not precise or readable. For example, a quotation issue produces “not all text consumed” errors, which can be confusing.