Libraries for UDFs
Use Instabase libraries and objects to read files and manipulate data in UDFs.
Reading the Flow output
Use the load_from_str function from the ParsedIBOCRBuilder class to read the intermediary Flow output:
from instabase.ocr.client.libs import ibocr

def custom_function_fn(content, input_filepath, clients,
                       root_output_folder, *args, **kwargs):
    builder, err = ibocr.ParsedIBOCRBuilder.load_from_str(input_filepath, content)
    if err:
        raise IOError(u'Could not load file: {}'.format(input_filepath))
The ParsedIBOCRBuilder provides access to the underlying records using these interfaces:
def custom_function_fn(content, input_filepath, clients,
                       root_output_folder, *args, **kwargs):
    builder, err = ibocr.ParsedIBOCRBuilder.load_from_str(input_filepath, content)
    if err:
        raise IOError(u'Could not load file: {}'.format(input_filepath))
    for ibocr_record in builder.get_ibocr_records():
        text = ibocr_record.get_text()
        lines = ibocr_record.get_lines()
        # get_refined_phrases returns a (phrases, error) tuple.
        refined_phrases, err = ibocr_record.get_refined_phrases()
        # ExternalFunction is a placeholder for your own processing logic.
        ExternalFunction(text, lines, refined_phrases)
Write and modify ibocr_records
Use the ParsedIBOCRBuilder library to mutate the ibocr_records with the IBOCRRecordBuilder class. After the records are mutated, use the serialize_to_string function to get the serialized string that can be returned in the UDF response.
The general use pattern is:
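# load_from_str returns a (builder, error) tuple.
parsed_builder, err = ParsedIBOCRBuilder.load_from_str(input_filepath, content)
if err:
    raise IOError(u'Could not load file: {}'.format(input_filepath))
for i, record in enumerate(parsed_builder.get_ibocr_records()):
    ibocr_record_builder = record.as_builder()
    # <mutate_ibocr_record>, for example (modify_text is a placeholder):
    text = modify_text(record.get_text())
    ibocr_record_builder.set_text(text)
    # Store the mutated record at the same index so the change is serialized
    # (this assumes as_builder() does not modify the record in place).
    new_record, err = ibocr_record_builder.as_record()
    if not err:
        parsed_builder.set_ibocr_record(i, new_record)
return parsed_builder.serialize_to_string()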
Classes and objects
The library, classes, objects, and methods are:
- ParsedIBOCRBuilder
- IBOCRRecord
- IBOCRPageMetadata
- IBOCRRecordLayout
- IBOCRRecordBuilder
- WordPolyDict
- RefinedPhrases
- ParsedIBOCR
- Runtime Config
- IBFile
ParsedIBOCRBuilder
Use the ParsedIBOCRBuilder library to read and modify the contents of the IBOCR. The ParsedIBOCRBuilder library provides convenience functions to serialize the parsed-ibocr to string.
The ParsedIBOCRBuilder reads the contents of the file and can read all the underlying records.
- Use the IBOCRRecord interface to read the records
- Use the IBOCRRecordBuilder interface to modify the records
class ParsedIBOCRBuilder(object):

    @staticmethod
    def load_from_str(filepath, content):
        # type: (Text, bytes) -> Tuple[ParsedIBOCRBuilder, Text]
        """Given a filepath and its corresponding contents, construct a new
        `ParsedIBOCRBuilder`.
        Returns an error in case the construction fails.
        """
        pass

    def __len__(self):
        # type: () -> int
        """
        Returns the number of IBOCRRecords present in this builder.
        """
        pass

    def get_ibocr_records(self):
        # type: () -> List[IBOCRRecord]
        """
        Returns the list of IBOCRRecords present in this builder.
        """
        pass

    def get_ibocr_record(self, index):
        # type: (int) -> IBOCRRecord
        """
        Returns the IBOCRRecord present at 'index'. If the index is out of
        bounds, this function returns None.
        """
        pass

    def set_ibocr_record(self, index, record):
        # type: (int, IBOCRRecord) -> ParsedIBOCRBuilder
        """
        Sets the IBOCRRecord at 'index'. If the index is out of bounds, this
        function is a no-op.
        """
        pass

    def add_ibocr_records(self, ibocr_records):
        # type: (List[IBOCRRecord]) -> ParsedIBOCRBuilder
        """
        Adds the list of IBOCRRecords to the builder.
        """
        pass

    def serialize_to_string(self):
        # type: () -> bytes
        """Serializes the content of the ParsedIBOCRBuilder."""
        pass
IBOCRRecord
Provides accessors to the underlying records in the Flow output.
class IBOCRRecord(object):

    def get_text(self):
        # type: () -> Text
        """
        Returns the text stored in the IBOCRRecord.
        """

    def get_lines(self):
        # type: () -> List[List[WordPolyDict]]
        """
        Returns the lines stored in the IBOCRRecord.
        """

    def get_metadata_list(self):
        # type: () -> List[IBOCRPageMetadata]
        """
        Returns the metadata list associated with the IBOCRRecord.
        """

    def get_refined_phrases(self):
        # type: () -> Tuple[List[RefinedPhrase], Text]
        """
        Returns the refined phrases stored inside the IBOCRRecord.
        """

    def get_class_label(self):
        # type: () -> Text
        """
        Returns the classification label this record was classified with.
        If no classification took place, this function returns None.
        """
IBOCRPageMetadata
Provides access to the metadata in a page of IBOCR output. This object is one element of the list returned from a call to .get_metadata_list() on an IBOCRRecord object.
class IBOCRPageMetadata(object):

    def get_layout(self):
        # type: () -> IBOCRRecordLayout
        return IBOCRRecordLayout(self._d['layout'])
IBOCRRecordLayout
Provides access to the record layout of a page of IBOCR output, including image dimensions and path to the processed image. This object is returned from a call to .get_layout() on an IBOCRPageMetadata object.
class IBOCRRecordLayout(object):

    def __init__(self, d):
        # type: (_LayoutDict) -> None
        self._d = d

    def get_width(self):
        # type: () -> float
        return self._d.get('width')

    def get_height(self):
        # type: () -> float
        return self._d.get('height')

    def get_processed_image_path(self):
        # type: () -> Text
        return self._d.get('processed_image_path')

    def get_is_image_page(self):
        # type: () -> bool
        return self._d.get('is_image_page')

    def as_dict(self):
        # type: () -> _LayoutDict
        return self._d
IBOCRRecordBuilder
Use the IBOCRRecordBuilder class to modify the Flow records. A common usage pattern is:
# load_from_str returns a (builder, error) tuple.
parsed_builder, err = ParsedIBOCRBuilder.load_from_str(input_filepath, content)
for record in parsed_builder.get_ibocr_records():
    ibocr_record_builder = record.as_builder()
    # <mutate_ibocr_record>
You can set these attributes using IBOCRRecordBuilder:
class IBOCRRecordBuilder(object):

    def set_refined_phrases(self, refined_phrases):
        # type: (List[RefinedPhrase]) -> IBOCRRecordBuilder
        """
        Sets the refined phrases inside the IBOCRRecordBuilder. Overrides any
        pre-existing refined phrases.
        """

    def add_refined_phrases(self, refined_phrases):
        # type: (List[RefinedPhrase]) -> IBOCRRecordBuilder
        """
        Appends the refined phrases to the pre-existing refined phrases
        inside the IBOCRRecordBuilder.
        """

    def set_text(self, text):
        # type: (Text) -> IBOCRRecordBuilder
        """
        Sets the text for this record.
        """

    def set_lines(self, lines):
        # type: (List[List[WordPolyDict]]) -> IBOCRRecordBuilder
        """
        Sets the lines for this record.
        """

    def set_from_deepcopy_of_record(self, record):
        # type: (IBOCRRecord) -> IBOCRRecordBuilder
        """
        Populates the builder with a copy of the attributes present in the
        record.
        This is a deepcopy, which means that the caller can safely modify the
        builder without affecting the original record.
        """

    def as_record(self):
        # type: () -> Tuple[IBOCRRecord, Text]
        """
        Returns the IBOCRRecord with the attributes set inside the builder.
        If an error happens, reports it to the user.
        """
WordPolyDict
The WordPolyDict dictionary describes the metadata for extracted words with the following keys:
WordPolyDict = TypedDict('WordPolyDict', {
    'raw_word': Text,
    'word': Text,
    'line_height': float,
    'word_width': float,
    'char_width': float,
    'start_x': float,
    'end_x': float,
    'start_y': float,
    'end_y': float,
    'page': int,
    'confidence': IBWordConfidenceDict,
    'style': StyleData
})
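As an illustration of working with these keys, the following sketch joins the words of one line and computes the box that encloses them. The helper names line_text and line_bounding_box and the sample values are hypothetical, and plain dictionaries stand in for the entries that get_lines() would return; only keys listed above are used.

```python
def line_text(line):
    """Join the 'word' value of every entry in one line."""
    return ' '.join(word['word'] for word in line)


def line_bounding_box(line):
    """Return (start_x, start_y, end_x, end_y) enclosing every word.

    Assumes start_* <= end_* for each word, which is not stated in the
    reference above.
    """
    return (
        min(word['start_x'] for word in line),
        min(word['start_y'] for word in line),
        max(word['end_x'] for word in line),
        max(word['end_y'] for word in line),
    )


# Hypothetical sample line with two words; real entries carry more keys.
sample_line = [
    {'word': 'Subtotal', 'start_x': 10.0, 'end_x': 90.0,
     'start_y': 100.0, 'end_y': 112.0, 'page': 0},
    {'word': '$42.00', 'start_x': 300.0, 'end_x': 360.0,
     'start_y': 99.0, 'end_y': 113.0, 'page': 0},
]

print(line_text(sample_line))
print(line_bounding_box(sample_line))
```

In a UDF, the same helpers could be applied to each inner list returned by ibocr_record.get_lines().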
RefinedPhrases
RefinedPhrases are phrases that are set by Refiner. You can use IBOCRRecordBuilder to create your own phrases and append them to the record.
class RefinedPhrase(object):

    def __init__(self, json_dict):
        # type: (Dict) -> None
        """
        Initialize the refined phrase with a dictionary. Usually the clients
        use an empty dictionary.
        """

    def get_column_name(self):
        # type: () -> Text
        """
        Get the column name for the refined phrase.
        """

    def set_column_name(self, val):
        # type: (Text) -> None
        """
        Set the column name for the refined phrase.
        """

    def get_column_value(self):
        # type: () -> Text
        """
        Get the column value for the refined phrase.
        """

    def set_column_value(self, new_val):
        # type: (Text) -> None
        """
        Set the column value for the refined phrase.
        """

    def get_is_edited(self):
        # type: () -> bool
        """
        Whether this phrase was manually edited.
        """

    def get_formula_text(self):
        # type: () -> Text
        """
        The formula used to generate the refined phrase.
        """

    def get_registered_return_type(self):
        # type: () -> Text
        """
        The registered return type for the refined phrase.
        """

    def get_page_index(self):
        # type: () -> int
        """
        The page from which the refined phrase was extracted.
        """

    def get_extracted_pos(self):
        # type: () -> Dict[Text, List[List[int]]]
        """
        A dictionary with one key, 'pixels', which is a list of regions in
        the image that contain the extracted value, each represented as
        [top left X coordinate, top left Y coordinate, bottom right X
        coordinate, bottom right Y coordinate, page number]
        """

    def get_information_pos(self):
        # type: () -> Dict[Text, List[List[int]]]
        """
        A dictionary with one key, 'pixels', which is a list of regions in
        the image that contain information used to get the extracted value,
        each represented as [top left X coordinate, top left Y coordinate,
        bottom right X coordinate, bottom right Y coordinate, page number]
        """

    def get_char_confidence(self):
        # type: () -> float
        """
        The confidence score for the refined phrase.
        """

    def get_has_unsure_ex(self):
        # type: () -> bool
        pass

    def get_has_unsure_info(self):
        # type: () -> bool
        pass

    def get_was_frozen(self):
        # type: () -> bool
        pass

    def get_was_best_effort_tracked(self):
        # type: () -> bool
        pass
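A common need is to turn a record's refined phrases into a name-to-value mapping. The sketch below shows one way to do that using only get_column_name and get_column_value. StubRefinedPhrase and phrases_to_dict are hypothetical names introduced here so the example runs without the Instabase library; the stub is not the real RefinedPhrase class.

```python
class StubRefinedPhrase(object):
    """Hypothetical stand-in exposing the two getters used below."""

    def __init__(self, name, value):
        self._name = name
        self._value = value

    def get_column_name(self):
        return self._name

    def get_column_value(self):
        return self._value


def phrases_to_dict(refined_phrases):
    """Map each column name to its value; later phrases win on duplicates."""
    return {p.get_column_name(): p.get_column_value()
            for p in refined_phrases}


phrases = [StubRefinedPhrase('total', '42.00'),
           StubRefinedPhrase('vendor', 'Acme')]
fields = phrases_to_dict(phrases)
print(fields)
```

In a UDF, the list would come from the first element of the tuple returned by get_refined_phrases(), per the type annotation on IBOCRRecord.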
ParsedIBOCR object
To reference the ParsedIBOCR object in UDFs, use:
from instabase.ocr.client.libs.ibocr import ParsedIBOCR
The interface for the object looks like:
class ParsedIBOCR(object):

    @staticmethod
    def load_from_str(full_path, txt):
        # type: (Text, Union[bytes, str]) -> Tuple[ParsedIBOCR, Text]
        pass

    def get_num_records(self):
        # type: () -> int
        pass

    def get_ibocr_records(self):
        # type: () -> List[IBOCRRecord]
        pass
Each record within the IBOCR has the following interface:
IBOCRRecord
class IBOCRRecord(object):

    def get_text(self):
        # type: () -> Text
        pass

    def get_lines(self):
        # type: () -> List[List[WordPolyDict]]
        pass
Each word per line contains the following information:
WordPolyDict = TypedDict('WordPolyDict', {
'raw_word': Text,
'word': Text,
'line_height': float,
'word_width': float,
'char_width': float,
'start_x': float,
'end_x': float,
'start_y': float,
'end_y': float,
'page': int
})
Runtime Config
Runtime Config is a Dict[Text, Text] set of key-value pairs that are passed at runtime into a flow binary. These variables can then be used to dynamically change behavior in your Flow. An example runtime config is:
{"key1": "val1", "key2": "val2"}
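A UDF that depends on a runtime config key usually needs a fallback for runs where the key is absent. The following sketch treats the config as a plain dictionary; get_config_value is a hypothetical helper, and how the config reaches your UDF is not covered here, so the example simply parses the JSON shown above.

```python
import json


def get_config_value(runtime_config, key, default):
    """Return runtime_config[key], or default if the config or key is missing."""
    if not runtime_config:
        return default
    return runtime_config.get(key, default)


# The example config from the text, parsed into a dict.
config = json.loads('{"key1": "val1", "key2": "val2"}')

print(get_config_value(config, 'key1', 'unset'))
print(get_config_value(config, 'key3', 'unset'))
```

Because the values are Text, any numeric or boolean setting has to be converted by the UDF after lookup.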
IBFile
The ibfile object is an Instabase FileHandle reference that provides pre-authenticated access to file operations. All operations done with the ibfile object have the same permissions as the user that invoked the operation.
The methods of the ibfile object are:
is_file(complete_path)
Input type: Text
Output type: bool
is_dir(complete_path)
Input type: Text
Output type: bool
exists(complete_path)
Input type: Text
Output type: bool
list_dir(path, start_page_token)
Input type: Text, Text
Output type: Tuple[ListDirInfo, str]
mkdir(complete_path)
Input type: Text
Output type: Tuple[MkdirResp, str]
copy(complete_path, new_complete_path, force=False)
Input type: Text, Text, bool
Output type: Tuple[CopyResp, str]
rm(complete_path, recursive=True, force=True)
Input type: Text, bool, bool
Output type: Tuple[RmResp, str]
open(path, mode='r')
Input type: Text, str
Output type: IBFileBase
read_file(file_path)
Input type: Text
Output type: Tuple[str, Text]
write_file(file_path, content)
Input type: Text, str
Output type: Tuple[bool, Text]
Here are the object definitions for the returned objects:
enum StatusCode {
    OK = 1
    # connect exceptions
    MISSING_PARAM = 2
    # file errors
    READ_ERROR = 4
    WRITE_ERROR = 5
    # missing file or directory
    NONEXISTENT = 7
    # general exceptions
    FAILURE = 3
    NO_MOUNT_DETAILS = 6
    ACCESS_DENIED = 8
}
class Status(object):
    def __init__(self, code, msg):
        # type: (StatusCode, str) -> None
        self.code = code
        self.msg = msg

class MkdirResp(object):
    def __init__(self, status):
        # type: (Status) -> None
        self.status = status

class RmResp(object):
    def __init__(self, status):
        # type: (Status) -> None
        self.status = status

class CopyResp(object):
    def __init__(self, status):
        # type: (Status) -> None
        self.status = status

class NodeInfo(object):
    def __init__(self, name, path, full_path, node_type):
        # type: (Text, Text, Text, Text) -> None
        self.name = name            # Name of the file or folder resource
        self.path = path            # Path relative to the mounted repo
        self.full_path = full_path  # Path including the location of the mounted repo
        self._type = node_type      # Type of node, such as 'file' or 'folder'

class ListDirInfo(object):
    def __init__(self, nodes, start_page_token, next_page_token, has_more):
        # type: (List[NodeInfo], Text, Text, bool) -> None
        self.nodes = nodes                        # List of nodes in the directory
        self.start_page_token = start_page_token  # Number representing the start page
        self.next_page_token = next_page_token    # Number representing the start of the next page
        self.has_more = has_more                  # Is true if not all directory contents have been listed
VALID_MODES = frozenset([
# Read only modes
'r',
'rU',
'rb',
'rbU',
# Writeable
'r+',
'rb+',
'r+c',
'rb+c',
'w',
'w+',
'wb',
'wb+',
'wc',
'w+c',
'wbc',
'wb+c',
'a',
'a+',
'ab',
'ab+',
'ac',
'a+c',
'abc',
'ab+c'
])
class IBFileBase(object):
    def __init__(self, path, mode):
        # type: (Text, Text) -> None
        self.path = path   # Relative path to the file
        self._mode = mode  # One of VALID_MODES
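Because list_dir returns results in pages, listing a large folder means following next_page_token until has_more is false. The sketch below shows that loop. StubListDirInfo and StubIBFile are hypothetical in-memory stand-ins (the stub returns plain names rather than NodeInfo objects, and its token format is invented), and list_all is a helper name introduced here; the empty string as the first token mirrors the sample UDF on this page.

```python
class StubListDirInfo(object):
    """Hypothetical stand-in with the same fields as ListDirInfo."""

    def __init__(self, nodes, start_page_token, next_page_token, has_more):
        self.nodes = nodes
        self.start_page_token = start_page_token
        self.next_page_token = next_page_token
        self.has_more = has_more


class StubIBFile(object):
    """Hypothetical in-memory ibfile that serves names in fixed-size pages."""

    def __init__(self, names, page_size=2):
        self._names = names
        self._page_size = page_size

    def list_dir(self, path, start_page_token):
        # The token format here is invented for the stub.
        start = int(start_page_token) if start_page_token else 0
        end = start + self._page_size
        has_more = end < len(self._names)
        next_token = str(end) if has_more else ''
        info = StubListDirInfo(self._names[start:end], start_page_token,
                               next_token, has_more)
        return info, None


def list_all(ibfile, path):
    """Collect every node under path, one page at a time."""
    nodes = []
    token = ''
    while True:
        info, err = ibfile.list_dir(path, token)
        if err:
            return None, err
        nodes.extend(info.nodes)
        if not info.has_more:
            return nodes, None
        token = info.next_page_token


stub = StubIBFile(['a.txt', 'b.txt', 'c.txt', 'd.txt', 'e.txt'])
all_nodes, err = list_all(stub, 'some/folder')
print(all_nodes)
```

With the real object, the call would be clients.ibfile.list_dir(path, token), and each node would expose name, path, and full_path.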
Sample UDF
This sample UDF accesses ibfile as a member of the clients variable:
import logging

from instabase.ocr.client.libs import ibocr

def do_nothing_udf(content, input_filepath, **kwargs):
    builder, err = ibocr.ParsedIBOCRBuilder.load_from_str(input_filepath, content)
    if err:
        raise IOError(u'Could not load file: {}'.format(input_filepath))
    out = builder.serialize_to_string()
    return out

# out_dir and clients are named as parameters so the body can use them.
def do_nothing_out_dir_udf(content, input_filepath, out_dir, clients, **kwargs):
    builder, err = ibocr.ParsedIBOCRBuilder.load_from_str(input_filepath, content)
    if err:
        raise IOError(u'Could not load file: {}'.format(input_filepath))
    logging.info('Writing outdir file...')
    outdir_path = out_dir + '/outdir.txt'
    resp, err = clients.ibfile.write_file(outdir_path, out_dir)
    if err:
        raise IOError(u'Could not write file: {}'.format(outdir_path))
    out = builder.serialize_to_string()
    return out
def test_is_file(clients, complete_path):
    is_file = clients.ibfile.is_file(complete_path)
    logging.info('Is file at path {}: {}'.format(complete_path, is_file))

def test_is_dir(clients, complete_path):
    is_dir = clients.ibfile.is_dir(complete_path)
    logging.info('Is dir at path {}: {}'.format(complete_path, is_dir))

def test_exists(clients, complete_path):
    exists = clients.ibfile.exists(complete_path)
    logging.info('Exists at path {}: {}'.format(complete_path, exists))

def test_list_dir(clients, path, start_page_token):
    list_dir_info, err = clients.ibfile.list_dir(path, start_page_token)
    if err:
        logging.info('ERROR list_dir at path {}: {}'.format(path, err))
        return []
    logging.info('List dir at path {}: {}'.format(path, list_dir_info))
    for node in list_dir_info.nodes:
        logging.info('Node {}'.format(node.as_dict()))
    return list_dir_info.nodes

def test_mkdir(clients, complete_path):
    mkdir, err = clients.ibfile.mkdir(complete_path)
    logging.info('Mkdir at path {}: {}'.format(complete_path, mkdir.status))

def test_copy(clients, complete_path, new_complete_path):
    copy, err = clients.ibfile.copy(complete_path, new_complete_path)
    logging.info('Copy at path {}: {}'.format(complete_path, copy.status))

def test_rm(clients, complete_path):
    rm, err = clients.ibfile.rm(complete_path)
    logging.info('Rm at path {}: {}'.format(complete_path, rm.status))

def test_read_file(clients, complete_path):
    # read_file returns a (content, error) tuple.
    content, err = clients.ibfile.read_file(complete_path)
    if err:
        logging.info('ERROR read_file at path {}: {}'.format(complete_path, err))
        return None
    logging.info('Read file at path {}: {}'.format(complete_path, len(content)))
    return content

def test_write_file(clients, complete_path, data):
    # write_file returns a (success, error) tuple.
    ok, err = clients.ibfile.write_file(complete_path, data)
    logging.info('Write file at path {}: {}'.format(complete_path, ok))

def test_file_ops_fn(val, root_out_folder, clients, **kwargs):
    test_is_file(clients, root_out_folder)
    test_is_dir(clients, root_out_folder)
    test_exists(clients, root_out_folder)
    nodes = test_list_dir(clients, root_out_folder, '')
    if not nodes:
        # Nothing to copy or read if the folder is empty or could not be listed.
        return val
    example_file_path = nodes[0].full_path
    example_file_name = nodes[0].name
    test_mkdir(clients, root_out_folder + '/test-folder/')
    test_copy(clients, example_file_path, root_out_folder + '/{}.copy.1'.format(example_file_name))
    test_copy(clients, example_file_path, root_out_folder + '/{}.copy.2'.format(example_file_name))
    test_rm(clients, root_out_folder + '/{}.copy.2'.format(example_file_name))
    example_text = test_read_file(clients, example_file_path)
    if example_text is not None:
        test_write_file(clients, example_file_path, example_text)
    return val
def register(name_to_fn):
    more_fns = {
        'do_nothing_udf': {
            'fn': do_nothing_udf,
            'ex': '',
            'desc': ''
        },
        'do_nothing_out_dir_udf': {
            'fn': do_nothing_out_dir_udf,
            'ex': '',
            'desc': ''
        },
        'test_file_ops': {
            'fn': test_file_ops_fn,
            'ex': '',
            'desc': ''
        }
    }
    name_to_fn.update(more_fns)
REFINER_FNS object
The REFINER_FNS object provides an API for executing Refiner functions within a UDF. The object supports the following methods:
call(fn_name, *args, **kwargs)
Input type: Text, *Any, **Any
Output type: Any, Text
call_v(fn_name, *args, **kwargs)
Input type: Text, *Value[Any], **Any
Output type: Value, Text
The call method takes the case-sensitive name of a Refiner function, followed by the positional and keyword arguments for that function. The result is a tuple: the first item is the result, and the second item is an error (if one occurred).
The call_v method is similar to call, but provides provenance tracking. The first argument is the case-sensitive name of a Refiner function, and the remaining arguments are the positional and keyword arguments for that provenance-tracked function (such as Value objects). The result is a tuple: the first item is the provenance-tracked result, and the second item is an error (if one occurred).
Here is a sample UDF that uses this object:
import json

# This function applies SCAN_RIGHT and CLEAN on the input IBOCR record
def demo_refiner_fns(val, out_dir, refiner, **kwargs):
    # type: (Text, Text, RefinerFns, **Any) -> Text
    # Example: Load text of the first record
    text = json.loads(val)[0]['text']
    # Use refiner functions to extract subtotal (scan then clean)
    scan_result, err = refiner.call('scan_right', text, 'Subtotal', ignorecase=True)
    if err:
        return err
    result, err = refiner.call('clean', scan_result)
    if err:
        return err
    return result

def demo_refiner_fns_v(text, refiner_fns, **kwargs):
    # type: (Value[Text], Value[RefinerFns], **Any) -> Value[Text]
    # Use refiner functions to extract subtotal (scan then clean)
    scan_result, err = refiner_fns.value().call_v('scan_right', text, Value('Subtotal'), ignorecase=Value(True))
    if err:
        return Value(err)
    result, err = refiner_fns.value().call_v('clean', scan_result)
    if err:
        return Value(err)
    return result
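When several Refiner functions are applied in sequence, the repeated "call, then check err" steps can be folded into a small helper. The sketch below is one possible shape for that helper. StubRefinerFns is a hypothetical stand-in that only mimics the (result, error) return convention of call, the functions 'strip' and 'upper' are invented for the stub and are not Refiner functions, and call_chain is a name introduced here.

```python
class StubRefinerFns(object):
    """Hypothetical stand-in: call() returns (result, error) like REFINER_FNS."""

    def __init__(self, fns):
        self._fns = fns

    def call(self, fn_name, *args, **kwargs):
        fn = self._fns.get(fn_name)
        if fn is None:
            return None, u'Unknown function: {}'.format(fn_name)
        return fn(*args, **kwargs), None


def call_chain(refiner, value, steps):
    """Pass value through each (fn_name, extra_args, kwargs) step in order.

    The output of one step becomes the first argument of the next.
    Stops and returns (None, error) at the first step that reports an error.
    """
    for fn_name, extra_args, kwargs in steps:
        value, err = refiner.call(fn_name, value, *extra_args, **kwargs)
        if err:
            return None, err
    return value, None


stub = StubRefinerFns({
    'strip': lambda text: text.strip(),
    'upper': lambda text: text.upper(),
})
result, err = call_chain(stub, '  subtotal  ',
                         [('strip', (), {}), ('upper', (), {})])
bad_result, bad_err = call_chain(stub, 'x', [('missing_fn', (), {})])
print(result, err)
print(bad_result, bad_err)
```

With the real object, the scan-then-clean sequence from the sample above would be expressed as the steps ('scan_right', ('Subtotal',), {'ignorecase': True}) followed by ('clean', (), {}).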
TokenFrameworkRegistry object
The TokenFrameworkRegistry object provides the following methods for looking up TokenMatcher and Tokenizer objects by name.
class TokenFrameworkRegistry(object):

    def lookup_matcher(self, name):
        # type: (Text) -> TokenMatcher
        return self.token_matcher_registry.lookup_matcher(name)

    def lookup_tokenizer(self, name):
        # type: (Text) -> Tokenizer
        return self.tokenizer_registry.lookup_tokenizer(name)
Logger object
The Logger object can be used to log within the UDF. These logs can be accessed via the Flow Dashboard.
Note: Starting with version 21.9.0, you can use Python's logging library directly to log within a UDF.
class Logger(object):

    def debug(self, msg, *args, **kwargs):
        # type: (Text, *Any, **Any) -> None
        pass

    def info(self, msg, *args, **kwargs):
        # type: (Text, *Any, **Any) -> None
        pass

    def warning(self, msg, *args, **kwargs):
        # type: (Text, *Any, **Any) -> None
        pass

    def error(self, msg, *args, **kwargs):
        # type: (Text, *Any, **Any) -> None
        pass

    def critical(self, msg, *args, **kwargs):
        # type: (Text, *Any, **Any) -> None
        pass
FnContext object
The FnContext object contains references to input variables and certain clients that can be accessed while the UDF executes.
To access the FnContext from your UDF, use kwargs.get('_FN_CONTEXT_KEY'):
def custom_fn(val, out_dir, clients, **kwargs):
    fn_context = kwargs.get('_FN_CONTEXT_KEY')
    logger, err = fn_context.get_by_col_name('LOGGER')
    if logger:
        logger.info('Out dir is {}'.format(out_dir))
    return u'final val'
In addition to the LOGGER keyword, you can also access the REFINER_FNS object and the CLIENTS object from the FnContext object.
The get_cached_object and set_cached_object functions read from and write to an object cache on the FnContext. The cache is shared across all columns of one record within Refiner.
Use these functions on the FnContext object to pass information to UDFs:
def get_cached_object(self, obj_id):
    # type: (Text) -> Tuple[Any, Text]
    if not self._object_cache:
        return None, u'No object cache was provided to this function context'
    return self._object_cache.get(obj_id), None

def set_cached_object(self, obj_id, obj):
    # type: (Text, Any) -> Tuple[bool, Text]
    if not self._object_cache:
        return False, u'No object cache was provided to this function context'
    self._object_cache.set(obj_id, obj)
    return True, None
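One way to use this cache is to compute an expensive value once per record and reuse it from the UDFs of later columns. The sketch below illustrates that pattern. StubCacheContext is a hypothetical stand-in that mirrors only the two methods shown above, and get_or_compute and expensive_parse are names introduced for the example.

```python
class StubCacheContext(object):
    """Hypothetical stand-in mirroring the two cache methods shown above."""

    def __init__(self):
        self._cache = {}

    def get_cached_object(self, obj_id):
        return self._cache.get(obj_id), None

    def set_cached_object(self, obj_id, obj):
        self._cache[obj_id] = obj
        return True, None


def get_or_compute(fn_context, obj_id, compute):
    """Return the cached object for obj_id, computing and storing it on a miss."""
    cached, err = fn_context.get_cached_object(obj_id)
    if err is None and cached is not None:
        return cached
    value = compute()
    # Best effort: if no cache is available, the value is still returned.
    fn_context.set_cached_object(obj_id, value)
    return value


calls = []


def expensive_parse():
    """Placeholder for costly work; records each invocation."""
    calls.append(1)
    return {'rows': 3}


ctx = StubCacheContext()
first = get_or_compute(ctx, 'parsed_table', expensive_parse)
second = get_or_compute(ctx, 'parsed_table', expensive_parse)
print(first, len(calls))
```

Note that a cached value of None is treated as a miss in this sketch, so it suits objects that are never None.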
The FnContext object looks like:
class FnContext(object):

    def get_ibfile(self):
        # type: () -> IBFile
        # Returns a reference to the IBFile object for performing file operations.
        pass

    def get_by_col_name(self, col_name):
        # type: (Text) -> Tuple[Any, Text]
        # Returns a tuple. If the column name exists as part of the
        # function call, then it returns col_val, None
        # If the column name does not exist, it returns None,
        # u'Could not find column names "LOGGER"'
        #
        # This function is useful when you are writing UDFs which need to
        # work across many versions of Instabase.
        # As Instabase adds more features, we sometimes add input columns
        # which are available in a new version which are not available in
        # older versions. To get around this you can use:
        #
        #   fn_context = kwargs.get('_FN_CONTEXT_KEY')
        #   logger, err = fn_context.get_by_col_name('LOGGER')
        #   if logger:
        #       # perform some operation now that the logger is available.
        pass
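The version-compatibility pattern described in the comments above can be wrapped in a helper that falls back to Python's logging library when no LOGGER column is available. This is a sketch under assumptions: resolve_logger and StubContext are hypothetical names, the stub only imitates the (value, error) return of get_by_col_name, and the fallback logger name 'udf' is arbitrary.

```python
import logging


def resolve_logger(kwargs):
    """Return the LOGGER from the FnContext if present, else a stdlib logger."""
    fn_context = kwargs.get('_FN_CONTEXT_KEY')
    if fn_context is not None:
        logger, err = fn_context.get_by_col_name('LOGGER')
        if logger is not None and not err:
            return logger
    return logging.getLogger('udf')


class StubContext(object):
    """Hypothetical stand-in imitating get_by_col_name."""

    def __init__(self, cols):
        self._cols = cols

    def get_by_col_name(self, col_name):
        if col_name in self._cols:
            return self._cols[col_name], None
        return None, u'Could not find column names "{}"'.format(col_name)


custom = logging.getLogger('custom')
from_context = resolve_logger({'_FN_CONTEXT_KEY': StubContext({'LOGGER': custom})})
fallback = resolve_logger({'_FN_CONTEXT_KEY': StubContext({})})
no_context = resolve_logger({})
print(from_context.name, fallback.name, no_context.name)
```

Inside a UDF, the helper would be called as resolve_logger(kwargs), and the returned object used through info, warning, and the other methods that both the Logger object and Python loggers provide.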