Provenance Tracking
What is provenance tracking
Provenance tracking in Refiner functions and Refiner UDFs allows us to map the output of a function to where it came from in the original input document.
In Refiner, you can pass the output of one function as input to another function, such as clean(scan_right(INPUT_COL)). Provenance tracking allows us to map the final output of these functions to the corresponding text in the input document.
You can execute Refiner programs with or without provenance tracking (File > Settings > Enable provenance tracking). Enabling provenance tracking is recommended and is the default configuration, but it might add processing time and memory usage, depending on the complexity of your UDFs.
Writing a provenance-tracked UDF
Provenance-tracked UDFs have a different function signature from normal UDFs. All input and output arguments must be Value objects. Value is our special class for tracking provenance information alongside the actual variable value.
Typically, this means there are two versions of a UDF: the provenance-tracked version, which expects input and output to be Value objects, and the non-provenance-tracked version, which expects input and output to be normal Python data types such as string, list, or dict. If your UDFs will only be used with provenance tracking, you don’t have to implement a non-provenance-tracked version.
Register your provenance-tracked UDF by adding the register_fn decorator above your function. In the example below, adding the decorator makes the function provenance-tracked and registers it under the function name, custom_function_fn_v.
from instabase.provenance.registration import register_fn

@register_fn
def custom_function_fn_v(content_value_obj, *args, **kwargs):
    pass
For more information on UDF registration, see Registering UDFs.
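The two-version pattern described above can be sketched as follows. The real Value class is only available inside Instabase, so this sketch uses a small hypothetical stand-in, StubValue, that mimics only value(), tracker(), and set_tracker(). The function names are illustrative, the tracker is a plain dict copied with copy.deepcopy rather than a real ProvenanceTracker, and the register_fn decorator is omitted.

```python
import copy
from typing import Any, Optional


class StubValue:
    """Hypothetical stand-in for instabase.provenance.tracking.Value."""

    def __init__(self, raw: Any) -> None:
        self._raw = raw
        self._tracker: Optional[dict] = None

    def value(self) -> Any:
        return self._raw

    def tracker(self) -> Optional[dict]:
        return self._tracker

    def set_tracker(self, tracker: Optional[dict]) -> None:
        self._tracker = tracker


def _normalize(text: str) -> str:
    # Shared core logic used by both versions of the UDF.
    return ' '.join(text.split()).upper()


def normalize_fn(text: str, **kwargs) -> str:
    # Non-provenance-tracked version: plain Python types in and out.
    return _normalize(text)


def normalize_fn_v(text_value: StubValue, **kwargs) -> StubValue:
    # Provenance-tracked version: Value-like objects in and out.
    result = StubValue(_normalize(text_value.value()))
    # Copy the tracker so the input's provenance is never shared or mutated.
    result.set_tracker(copy.deepcopy(text_value.tracker()))
    return result
```

Keeping the core logic in one helper means the two versions differ only in how they unwrap inputs and wrap outputs.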
Compatibility with previously written UDFs
Provenance-tracked functions from Instabase versions earlier than July 2020 must be updated to use Value object methods instead of directly manipulating provenance information, because we no longer guarantee that those underlying APIs won’t change. See Update earlier provenance tracked functions for details and examples of how to convert your functions.
instabase.provenance.tracking: Provenance APIs
These classes are used to access data and provenance information in provenance-tracked Refiner UDFs.
instabase.provenance.tracking.Value class
The Value class is a wrapper around a variable used in a provenance-tracked Refiner UDF that tracks provenance information so the final value can be mapped to text or regions in the original input document. The Value class consists of two parts:
- the actual value the class holds, and
- the tracker information, represented as a ProvenanceTracker class for text values or as an ImageProvenanceTracker class for image regions.
Basic usage in Refiner UDF:
from instabase.provenance.tracking import Value

def my_udf(name, **kwargs):
    name_string = name.value()
    greeting = 'Welcome ' + name_string
    greeting_value = Value(greeting)
    greeting_value.set_tracker(name.tracker().deepcopy())
    return greeting_value
The Value class has the following methods:
tracker
def tracker(self) -> ProvenanceTracker:
""" Access provenance tracker.
Returns a ProvenanceTracker object.
"""
set_tracker
def set_tracker(self, tracker: ProvenanceTracker) -> None:
""" Set provenance tracker.
Args:
tracker: provenance tracker to set on this Value object
"""
image_tracker
def image_tracker(self) -> ImageProvenanceTracker:
""" Access image provenance tracker.
Returns an ImageProvenanceTracker object.
"""
set_image_tracker
def set_image_tracker(self, tracker: ImageProvenanceTracker) -> None:
""" Set image provenance tracker.
Args:
tracker: image provenance tracker to set on this Value object
"""
value
def value(self) -> Any:
""" Returns the raw value that this Value class holds.
Return type depends on what the Value object holds.
"""
get_copy
def get_copy(self) -> Value:
""" Returns a deep copy of this Value object and its provenance information.
Returns a Value object.
"""
Call this method if you plan to make any modifications to the Value object. Modifying the input arguments of a Refiner UDF can cause unexpected side effects to subsequent fields in your Refiner program.
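The side effect in question is ordinary Python aliasing. The sketch below uses plain lists in place of Value objects and copy.deepcopy in place of get_copy(); both function names are hypothetical and exist only to contrast the two behaviors.

```python
import copy
from typing import List


def unsafe_udf(words: List[str]) -> List[str]:
    # Mutates the caller's object: every other holder of `words` sees the change.
    words.append('EXTRA')
    return words


def safe_udf(words: List[str]) -> List[str]:
    # Works on a private deep copy, leaving the caller's object untouched.
    local = copy.deepcopy(words)
    local.append('EXTRA')
    return local


shared = ['Net', 'Pay']
safe_result = safe_udf(shared)
shared_after_safe = list(shared)
unsafe_result = unsafe_udf(shared)
```

After the safe call, shared is unchanged; after the unsafe call, shared itself has grown, which is the kind of change that would leak into later fields.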
freeze_tracker
def freeze_tracker(self) -> None:
""" Freezes provenance tracking, so that any additional operations have no
effect on the tracked regions.
"""
instabase.provenance.tracking.ProvenanceTracker class
The ProvenanceTracker class stores provenance information that connects any text values to text in the original document. Keep in mind that the Instabase document format (IBOCRRecord) represents text as both a string containing all the text in the document and as a list of words with bounding box information.
The ProvenanceTracker class preserves the relationship between text in the current Value object and the IBOCRRecord text. There is a separate ImageProvenanceTracker for tracking provenance for non-text values such as checkboxes.
When displaying provenance information in Refiner and Flow Review, we distinguish between “information” provenance and “extracted” provenance:
- Information: regions of the document used to determine what to extract. For example, this might be the anchor word used for a scan_right function.
- Extracted: regions of the document that directly reflect the extracted information. For example, if the Refiner field returns a date ‘November 2020’, extracted provenance would point to the piece of text in the document containing the date.
Basic usage in Refiner UDF:
from instabase.provenance.tracking import Value

def pay_period(start_date, end_date, **kwargs):
    # Compute pay period.
    pay_period = end_date.value() - start_date.value()
    pay_period_value = Value(pay_period)
    # Store provenance from start_date and end_date as information provenance on
    # pay_period_value.
    pay_period_tracker = start_date.tracker().deepcopy()
    pay_period_tracker.convert_to_informational()
    pay_period_tracker.insert_information_from(end_date.tracker())
    pay_period_value.set_tracker(pay_period_tracker)
    # Return the Value object, not the raw result, so the provenance is kept.
    return pay_period_value
The ProvenanceTracker class has the following methods:
convert_to_informational
def convert_to_informational(self) -> None:
""" Converts all extracted provenance to information provenance.
"""
insert_information_from
def insert_information_from(self, other_tracker: ProvenanceTracker) -> None:
""" Adds the information and extracted provenance from other_tracker to this
tracker as information provenance.
Args:
other_tracker: provenance tracker whose provenance information you want
to add to this tracker
"""
deepcopy
def deepcopy(self) -> ProvenanceTracker:
""" Makes a deep copy of this tracker.
Returns copied ProvenanceTracker object.
"""
Always make a copy of a ProvenanceTracker before modifying it.
instabase.provenance.tracking.ImageProvenanceTracker class
The ImageProvenanceTracker class stores provenance information for any rectangular regions in the original document image. The information is not tied to IBOCRRecord text. This is useful for tracking provenance for non-text values like checkboxes or signatures.
ImageProvenanceTracker has the same methods as ProvenanceTracker, except that input and output arguments use ImageProvenanceTracker instead.
string Value objects
If a Value object contains a string, there are special string methods available on Value that automatically update the provenance information to reflect the string operation performed. All string operations that modify the string return a copy, similar to Python string methods.
substring
def substring(self, start: int, end: int) -> Value:
""" Returns a copy of the string Value containing the string from
start (inclusive) to end (exclusive)
Args:
start: start of string
end: end of string
"""
substring also works with Python slicing syntax, so the following code snippet is equivalent to using substring:
string_value[1:3] # Equivalent to string_value.substring(1, 3)
Examples:
my_value = my_value.substring(2, 10)
my_value = my_value[2:10]
my_value = my_value[3]
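One way to picture how these string methods keep provenance aligned is to carry, for every character, the index it came from in the source text. The class below is a toy model written for illustration, not the Instabase implementation: ToyTracked, from_source, and the origins field are all hypothetical.

```python
from typing import List


class ToyTracked:
    """Toy tracked string: text plus the source index of each character."""

    def __init__(self, text: str, origins: List[int]) -> None:
        assert len(text) == len(origins)
        self.text = text
        self.origins = origins

    @classmethod
    def from_source(cls, source: str, start: int, end: int) -> 'ToyTracked':
        # Each character remembers its position in the source document.
        return cls(source[start:end], list(range(start, end)))

    def substring(self, start: int, end: int) -> 'ToyTracked':
        # Returns a new object; start is inclusive and end is exclusive.
        return ToyTracked(self.text[start:end], self.origins[start:end])

    def __getitem__(self, key) -> 'ToyTracked':
        if isinstance(key, slice):
            return ToyTracked(self.text[key], self.origins[key])
        # A single index yields a one-character tracked string.
        return ToyTracked(self.text[key], [self.origins[key]])


document = 'Pay date: November 2020'
date = ToyTracked.from_source(document, 10, 23)
month = date.substring(0, 8)
year = date[9:13]
```

Because text and origins are always sliced together, month and year still point at the exact characters of the document they were cut from.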
length
def length(self) -> int:
"""Get the length of the value if it's a string.
"""
You can also pass the Value object as an argument to len() to get the string length. For example, len(my_value) returns 3 for a text-based Value holding a three-character string. length and len are also implemented for Value objects holding other data types.
Examples:
var = my_value.length()
var = len(my_value)
delete
def delete(self, start: int, end: int) -> Value[Text]:
"""Delete slice from Value from start index to end index.
Args:
start (int): start index of substring to remove.
end (int): end index of substring to remove.
"""
If either index is negative, or if the start index is greater than the end index, delete raises a ValueError.
Examples:
my_value = my_value.delete(2, 10)
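The index checks described above can be sketched on a plain string. This is not the Value.delete implementation: delete_slice is a hypothetical helper, and it assumes the end index is exclusive, as it is for substring, which the source does not state.

```python
def delete_slice(text: str, start: int, end: int) -> str:
    """Remove text[start:end], validating indexes the way delete is described."""
    if start < 0 or end < 0:
        raise ValueError('start and end must be non-negative')
    if start > end:
        raise ValueError('start must not be greater than end')
    # Assumption: end is exclusive, mirroring substring.
    return text[:start] + text[end:]


trimmed = delete_slice('0123456789', 2, 5)
unchanged = delete_slice('0123456789', 4, 4)
```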
concatenate
def concatenate(self, string: Value[Text]) -> Value[Text]:
"""Appends `string` to the string in Value.
Args:
string (Value[Text]): string to concatenate to Value-string.
"""
Use concatenate to add one text-based Value to another. You can also concatenate a regular string to a provenance-tracked Value. __add__ and __radd__ are implemented so you can use + syntax to concatenate Values just like regular strings.
Examples:
my_new_value = first_value.concatenate(second_value)
my_new_value = first_value.concatenate('second value')
my_new_value = first_value + second_value
my_new_value = 'hello ' + second_value
my_new_value = first_value + ' world'
replace
def replace(self,
old: Union[Value[Text], Text],
new: Union[Value[Text], Text],
count: int = -1) -> Value[Text]:
"""Replace `old` in Value with `new`.
Args:
old (Value[Text] or Text): string in Value to be replaced.
new (Value[Text] or Text): string to replace occurrences of `old` in Value with.
count (int): -1 by default. Number of occurrences of `old` to replace with `new`.
"""
Use replace to replace specific substrings with a new string. A regex-based equivalent is available as regex_sub.
Examples:
my_value = my_value.replace(old_value, new_value)
my_value = my_value.replace('old value', new_value)
my_value = my_value.replace(old_value, 'new value')
my_value = my_value.replace('old value', 'new value')
lstrip, rstrip, strip
def lstrip(self, chars_to_strip: Text = None) -> Value:
"""Strip characters from start of Value.
Args:
chars_to_strip (Text): None by default, strips leading whitespace.
If set to specific characters, strips these from start of string.
"""
def rstrip(self, chars_to_strip: Text = None) -> Value:
"""Strip characters from end of Value.
Args:
chars_to_strip (Text): None by default, strips trailing whitespace.
If set to specific characters, strips these from end of string.
"""
def strip(self, chars_to_strip: Text = None) -> Value:
"""Strip characters from start and end of Value.
Args:
chars_to_strip (Text): None by default, strips leading and trailing
whitespace. If set to specific characters, strips these from
start and end of string.
"""
Use these functions to strip leading or trailing characters from a Value.
Examples:
my_value = my_value.strip()
my_value = my_value.rstrip()
my_value = my_value.lstrip()
my_value = my_value.strip('\n')
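The signatures above mirror Python's built-in str.strip family. Assuming the Value methods follow the same semantics, which the source implies but does not spell out, the plain-string behavior looks like this. Note that the argument is treated as a set of characters, not as a prefix or suffix.

```python
raw = '\n  Net Pay  \n'

# No argument: all leading and trailing whitespace is removed.
default_strip = raw.strip()

# With an argument: only the listed characters are removed.
newline_only = raw.strip('\n')

# lstrip and rstrip work on one side only.
left_only = raw.lstrip()
right_only = raw.rstrip()

# The argument is a character set: any mix of 'a', 'b', 'c' is stripped.
char_set = 'abctotalcba'.strip('abc')
```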
insert
def insert(self, position: int, insert_str: Union[Value[Text], Text]) -> Value[Text]:
"""Insert given string `insert_str` in specified `position`.
Args:
position (int): index at which to insert given string into Value.
insert_str (Value[Text] or Text): string to insert at given index into Value.
"""
Use insert to insert a string at a place other than the start or end.
Examples:
my_value = my_value.insert(3, 'my string')
my_value = my_value.insert(3, my_value_wrapped_string)
split
def split(self, split_string: Text = None, maxsplit: int = -1) -> Value[Text]:
"""Split string based on given separator or whitespace by default. `maxsplit`
by default is -1, which means that there is no maximum on the number of splits.
If `maxsplit` is set to 0, no split is performed.
Args:
split_string (Text): None by default. String to split on. If None,
splits on whitespace.
maxsplit (int): -1 by default. Number of splits to perform.
"""
Use split to split a string. A regex-based equivalent is available as regex_split.
Examples:
my_value = my_value.split()
my_value = my_value.split('\n', 10)
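The maxsplit rules in the docstring match those of Python's str.split. As a sketch on plain strings, assuming the Value method follows the built-in behavior:

```python
# No separator: split on runs of whitespace.
on_whitespace = 'Net   Pay\t1200'.split()

# maxsplit=1: at most one split, so the remainder stays in one piece.
limited = 'a\nb\nc'.split('\n', 1)

# maxsplit=0: no split is performed.
no_split = 'a\nb\nc'.split('\n', 0)

# maxsplit=-1 (the default): no limit on the number of splits.
unlimited = 'a\nb\nc'.split('\n', -1)
```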
join
@staticmethod
def join(join_char: Union[Value[Text], Text], vals: Sequence[Union[Value[Text], Text]]) -> Value[Text]:
"""Joins the given Value objects with the given joining character.
Args:
join_char (Value[Text] or Text): character to insert between joined values.
vals (Sequence of Value[Text] or Text): Value objects or strings to join.
"""
Use the join function to join a list of text-based Values and strings with a specified join character.
Examples:
my_value = Value.join('\n', [my_value_1, my_value_2, my_value_3])
my_value = Value.join('\n', ['my value 1', 'my value 2', 'my value 3'])
Regex-based helper functions
The Value class includes static methods that mimic the re or regex library. Each function follows the documentation and interface provided by the regex Python library.
Note that where re and regex return Match objects, provenance tracking uses a TrackedMatchProxy class to hold both the Match and provenance information. See TrackedMatchProxy.
regex_search
@staticmethod
def regex_search(pattern: Union[Value[Text], Text, Pattern],
string: Union[Value[Text], Text],
flags: int = 0,
pos: int = None,
endpos: int = None,
partial: bool = False,
**kwargs: Any) -> TrackedMatchProxy:
"""Searches for the given pattern in the given string using `regex.search` internally.
Args:
pattern (Value[Text], Text, or Pattern): string or regex pattern to search for.
string (Value[Text] or Text): string to search in.
flags (int): 0 by default. Regex flags for the given pattern.
pos (int): None by default. Start index of substring to search in if not at the start of `string`.
endpos (int): None by default. End index of substring to search in if not at the end of `string`.
partial (bool): Allow partial matches.
Returns a TrackedMatchProxy object.
"""
Use regex_search to find a match for the given pattern in provenance-tracked text when you need the functionality of regex.search. Instead of a Match object, regex_search returns a TrackedMatchProxy object, which has a similar interface and carries provenance information. See TrackedMatchProxy.
Examples:
my_match = Value.regex_search(r'\w+', my_input_str)
my_match.start() -> 2
my_match.end() -> 3
my_match.group() -> Value[...] (returns a tracked Value object containing given group's match text)
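The start, end, group, pos, and endpos semantics are those of ordinary regex matching. The sketch below uses the standard-library re module on a plain string so that it runs anywhere. In re, pos and endpos are accepted only by compiled-pattern methods, whereas the source states that regex_search wraps regex.search.

```python
import re

text = 'Net Pay: 1200'
digits = re.compile(r'\d+')

# Plain search: the first run of digits in the text.
match = digits.search(text)
span = (match.start(), match.end())
matched_text = match.group()

# pos moves the starting point of the search forward.
from_pos = digits.search(text, 10).group()

# endpos cuts the searched region off early.
up_to_endpos = digits.search(text, 0, 11).group()
```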
regex_findall
@staticmethod
def regex_findall(pattern: Union[Value[Text], Text, Pattern],
string: Union[Value[Text], Text],
flags: int = 0) -> Iterator[Union[Value[Text], Tuple[Value[Text], ...]]]:
"""Returns all matches of the given pattern in string as an iterator of Value objects and
tuples containing Value objects using `regex.finditer` internally.
Args:
pattern (Value[Text], Text, or Pattern): string or regex pattern to search for.
string (Value[Text] or Text): string to search in.
flags (int): 0 by default. Regex flags for the given pattern.
"""
Use the regex_findall function if you need the functionality of regex.finditer. regex_findall returns an iterator of Value objects or tuples of Value objects. Unlike regex_finditer, it does not return TrackedMatchProxy objects.
Examples:
my_matches = Value.regex_findall(r'\w+', 'my string')
my_matches = Value.regex_findall(r'\w+', my_value_obj)
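Whether each result is a single item or a tuple depends on the number of capture groups in the pattern. The standard-library re.findall follows the same convention, shown here on plain strings. It returns a list of strings, whereas regex_findall is documented as returning an iterator of Value objects.

```python
import re

text = 'a1 b22'

# No capture groups: each result is the whole match.
no_groups = re.findall(r'\d+', text)

# One capture group: each result is that group's text.
one_group = re.findall(r'([a-z])\d+', text)

# Several capture groups: each result is a tuple of group texts.
two_groups = re.findall(r'([a-z])(\d+)', text)
```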
regex_finditer
@staticmethod
def regex_finditer(pattern: Union[Value[Text], Text, Pattern],
string: Union[Value[Text], Text],
flags: int = 0) -> Iterator[TrackedMatchProxy]:
"""Returns all matches of the given pattern in string as an iterator of
TrackedMatchProxy objects using regex.finditer internally.
Args:
pattern (Value[Text], Text, or Pattern): string or regex pattern to search for.
string (Value[Text] or Text): string to search in.
flags (int): 0 by default. Regex flags for the given pattern.
"""
regex_finditer is similar to regex_findall, except it returns TrackedMatchProxy objects and does not group matches in tuples.
Examples:
my_matches = Value.regex_finditer(r'\w+', 'my string')
my_matches = Value.regex_finditer(r'\w+', my_value_obj)
regex_sub
@staticmethod
def regex_sub(pattern: Union[Value[Text], Text, Pattern],
repl: Union[Value[Text], Text],
string: Union[Value[Text], Text],
count: int = 0,
flags: int = 0,
pos: int = None,
endpos: int = None,
**kwargs: Any) -> 'Value[Text]':
"""Replaces matches of the given pattern in given string with `repl`. Returns
a tracked Value object.
Args:
pattern (Value[Text], Text, or Pattern): string or regex pattern to search for.
repl (Value[Text], Text): string to replace matches with.
string (Value[Text] or Text): string to search in.
count (int): 0 by default. Number of replacements to make. If 0, every match
is replaced.
flags (int): 0 by default. Regex flags for the given pattern.
pos (int): None by default. Start index of substring to search in if not at the start of `string`.
endpos (int): None by default. End index of substring to search in if not at the end of `string`.
"""
Use regex_sub if you need the functionality of regex.sub. regex_sub returns a Value object that contains the new string created by replacing occurrences of pattern in the string with the replacement repl.
Examples:
my_new_value = Value.regex_sub(my_pattern_value, 'replacement', my_string)
my_new_value = Value.regex_sub(r'\s+', ' ', my_input_text_val)
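The count parameter behaves as in the standard-library re.sub, where 0 means every match is replaced. A sketch on a plain string, using re rather than the regex library that regex_sub is documented to use:

```python
import re

messy = 'a  b \n c'

# count=0 (the default): every run of whitespace becomes one space.
collapsed = re.sub(r'\s+', ' ', messy)

# count=1: only the first run of whitespace is replaced.
first_only = re.sub(r'\s+', ' ', messy, count=1)
```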
regex_split
@staticmethod
def regex_split(pattern: Union[Value[Text], Text],
string: Union[Value[Text], Text],
maxsplit: int = 0,
flags: int = 0,
**kwargs: Any) -> 'List[Value[Text]]':
"""Splits given string on given pattern. Returns a list of Value objects.
Args:
pattern (Value[Text] or Text): string or pattern to search for.
string (Value[Text] or Text): string to search in.
maxsplit (int): 0 by default. Number of splits to make. If 0, string
is split for every occurrence of pattern. If -1, no splits
are made.
flags (int): 0 by default. Regex flags for the given pattern.
"""
Use regex_split if you need the functionality of regex.split, or if you want to split on a regular expression, which split cannot do.
Examples:
my_list = Value.regex_split(r'\s+', my_input_text, maxsplit=3)
TrackedMatchProxy
class TrackedMatchProxy(object):
    """An object that mimics the interface of the re.Match object. Used in cases where we want to
    use a Match object which we have control over during creation. Includes provenance tracking.
    """

    def __init__(self, original_text: Value[Text], match: Any) -> None:
        self._original_text = original_text
        self.match = match

    def start(self, group: int = 0) -> int:
        return self.match.start(group)

    def end(self, group: int = 0) -> int:
        return self.match.end(group)

    def group(self, group: int = 0) -> Value[Text]:
        return self._original_text[self.start(group):self.end(group)]

    def to_group_tuple(self) -> Union[Value[Text], Tuple[Value[Text], ...]]:
        # If there are no groups, return the list of matches as a list
        # If there is one group, return a list of first group matches as a list
        # If there is more than one group, return a list of tuples with groups
        ...

    def __str__(self) -> Text:
        return u'<ib.TrackedMatchProxy object; span=({}, {}), match=\'{}\'>'.format(
            self.start(), self.end(), Value.unwrap_value(self._original_text))
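To experiment with this interface outside Instabase, a simplified analogue can wrap a standard re match over a plain string. ToyMatchProxy is hypothetical. Its to_group_tuple is only one plausible reading of the comments above, since the real body is elided, and it assumes every capture group took part in the match.

```python
import re
from typing import Any, Tuple, Union


class ToyMatchProxy:
    """Hypothetical plain-string analogue of TrackedMatchProxy."""

    def __init__(self, original_text: str, match: Any) -> None:
        self._original_text = original_text
        self.match = match

    def start(self, group: int = 0) -> int:
        return self.match.start(group)

    def end(self, group: int = 0) -> int:
        return self.match.end(group)

    def group(self, group: int = 0) -> str:
        # Slicing the original text (rather than asking the match for its
        # text) is what would let a tracked string keep its provenance.
        return self._original_text[self.start(group):self.end(group)]

    def to_group_tuple(self) -> Union[str, Tuple[str, ...]]:
        group_count = self.match.re.groups
        if group_count == 0:
            return self.group(0)
        if group_count == 1:
            return self.group(1)
        return tuple(self.group(i) for i in range(1, group_count + 1))


text = 'Pay period: 2020-11'
proxy = ToyMatchProxy(text, re.search(r'(\d{4})-(\d{2})', text))
whole = proxy.group()
parts = proxy.to_group_tuple()
```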
Collection Value objects
Value objects can also contain collections such as lists and dictionaries. However, the elements of the collection must also be Value objects if you want to track provenance correctly for them (see usage example below). Built-in Refiner functions that return lists (such as regex_get_all) value-wrap each item of the list, recursively for nested lists. This allows the UI to properly display provenance information for all the nested elements.
If the result of a Refiner field is a collection, the extracted provenance of the collection elements is marked as information provenance. If a subsequent field accesses a string item in the collection, the extracted provenance of the item shows as extracted provenance.
Basic usage:
# list Value object
list_example: List[Value[Text]] = [a, b, c]
# Extracted provenance of a, b, and c shows up as informational in the UI.
prov_tracked_list = Value(list_example)

# dictionary Value object
dict_example = {
    'a': Value(...),
    'b': Value([Value(...), ...]),
    Value('c'): 'hello'
}
# Extracted provenance of all the nested Value objects shows up as
# informational in the UI.
prov_tracked_dict = Value(dict_example)
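The recursive value-wrapping of nested lists mentioned above can be sketched with a toy wrapper. ToyValue and wrap_nested are hypothetical names: the real Value class also carries tracker information, which this sketch leaves out.

```python
from typing import Any


class ToyValue:
    """Hypothetical stand-in for Value that only holds a raw value."""

    def __init__(self, raw: Any) -> None:
        self._raw = raw

    def value(self) -> Any:
        return self._raw


def wrap_nested(item: Any) -> ToyValue:
    # Already wrapped: leave as is so nothing is double-wrapped.
    if isinstance(item, ToyValue):
        return item
    # Lists are wrapped themselves, and so is each element, recursively.
    if isinstance(item, list):
        return ToyValue([wrap_nested(element) for element in item])
    return ToyValue(item)


wrapped = wrap_nested(['a', ['b', 'c'], ToyValue('d')])
```

Every level of the result is a ToyValue, so each nested element could carry its own provenance.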
Modifying provenance-tracked values
If you are writing a UDF that takes in tracked values and modifies them, be careful not to modify the underlying attributes of those Value objects. Otherwise, you might accidentally modify the results of other Refiner fields.
Both the Value and provenance tracker classes have methods to create deep copies of themselves. The Value class has the method get_copy(), which returns a new Value object with a deep copy of the original Value object’s underlying value and tracker information. If you want just a deep copy of the Value object’s tracker information, you can call value_object.tracker().deepcopy(). See below for some examples.
def modify_input_value(input_val: Value[Text], **kwargs) -> Value[Text]:
    new_val = input_val.get_copy()
    # new_val has a deep-copy of input_val's underlying value and tracker information
    # can now modify new_val safely
    return new_val

def update_input_value(input_val: Value[Text], **kwargs) -> Value[Text]:
    updated_value = ...  # ex. an updated/cleaned value based on input_val
    new_val = Value(updated_value)
    new_val.set_tracker(input_val.tracker().deepcopy())
    # new_val has a deep-copy of input_val's underlying tracker information
    # can now modify new_val safely
    return new_val
Advanced provenance tracking
Advanced provenance tracking concepts.
Freezing
When tracking provenance, it is sometimes helpful to save and terminate the tracking process as values propagate through the function being evaluated. For instance, imagine the following scenarios:
- You extract an unsanitized value, but the cleaning process produces a final output whose form is unrelated to the original value (for example, going from 10/02/2018 to October 2, 2018). In these cases, you would like to freeze the tracking at the point before sanitization is completed.
- During redaction and fraud detection, you can track the information and extracted region for multiple “checkpoints” throughout the formula. For instance, if you have SCAN_RIGHT(INPUT_COL, 'Net Pay'), you can tag the Net Pay tracking information so that you can reference it later.
You can use the following Refiner function to freeze provenance:
freeze(val) - Causes the tracker on the given Value object to be frozen (that is, no changes take effect). NOTE: This method essentially causes val.tracker() to become None, and the final tracker used during output generation is the frozen instance.
Note: A frozen value is frozen forever, including in other columns that use that value. Therefore, if you want to use a field that was frozen, you must wrap that entire column in a freeze() call to replace the originally frozen tracker.
Accessing provenance information
You might want to access the provenance information of a Value object; for example, to find the line numbers of the words in a phrase so that you can write a custom function to group words by line.
You can use the following Refiner function:
provenance_get(val) - Returns the tracked provenance information for this field.
The return format is a list of words and their bounding boxes:
[{
    "original_characters": string,
    "image_start": {
        "x": float,
        "y": float
    },
    "image_end": {
        "x": float,
        "y": float
    },
    "original_start_1d": integer,
    "original_end_1d": integer,
    "original_start_2d": {
        "x": float,
        "y": float
    },
    "original_end_2d": {
        "x": float,
        "y": float
    },
    "unsures": [
        boolean,
        ...
    ],
    "confidences": [
        float,
        ...
    ]
},
...
]
Parameter | Description |
---|---|
original_characters | Original word text |
image_start | Top-left corner of text bounding box in image pixel space |
image_end | Bottom-right corner of text bounding box in image pixel space |
original_start_1d | Start index (inclusive) of word in 1D text space |
original_end_1d | End index (exclusive) of word in 1D text space |
original_start_2d | Top-left corner of bounding box in 2D text space |
original_end_2d | Bottom-right corner of bounding box in 2D text space |
destinations_1d | List of index of each character in the word in the current string in 1D text space |
destinations_2d | List of index of each character in the word in the current string in 2D text space |
unsures | List of OCR engine unsureness for each character of word |
confidences | List of OCR confidence for each character of word |
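The group-words-by-line use case mentioned earlier could be built on this return format roughly as follows. group_words_by_line is a hypothetical helper, the sample records are invented and include only the fields the helper reads, and the sketch assumes that the y coordinate of original_start_2d identifies the text line.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_words_by_line(words: List[Dict[str, Any]]) -> List[List[str]]:
    """Group word records by line, top to bottom, left to right."""
    lines = defaultdict(list)
    for word in words:
        # Assumption: in 2D text space, y identifies the line of the word.
        lines[word['original_start_2d']['y']].append(word)
    grouped = []
    for y in sorted(lines):
        row = sorted(lines[y], key=lambda w: w['original_start_2d']['x'])
        grouped.append([w['original_characters'] for w in row])
    return grouped


# Invented sample data in the shape of the return format.
sample_words = [
    {'original_characters': 'Pay', 'original_start_2d': {'x': 4.0, 'y': 0.0}},
    {'original_characters': '1200', 'original_start_2d': {'x': 0.0, 'y': 1.0}},
    {'original_characters': 'Net', 'original_start_2d': {'x': 0.0, 'y': 0.0}},
]
lines_of_words = group_words_by_line(sample_words)
```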
Auto provenance tracking
The Refiner program can perform automatic provenance tracking so that even if a function does not support provenance tracking and the extracted results have no tracked information, Refiner can do a best-effort guess of where this value came from. Automatic provenance tracking works in the following ways:
- If the output appears in the input record, mark the first occurrence of that output as the extracted region.
- If the output, ignoring extra whitespace, appears in the input record, mark the first occurrence as the extracted region.
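The two rules above can be approximated on plain strings as follows. This is a sketch, not Refiner's implementation: guess_extracted_span is a hypothetical name, and "ignoring extra whitespace" is interpreted here as treating any run of whitespace between tokens as equivalent.

```python
import re
from typing import Optional, Tuple


def guess_extracted_span(output: str, record_text: str) -> Optional[Tuple[int, int]]:
    """Best-effort (start, end) of output in record_text, or None."""
    # Rule 1: the output appears verbatim; take the first occurrence.
    index = record_text.find(output)
    if output and index != -1:
        return (index, index + len(output))
    # Rule 2: the output appears once whitespace differences are ignored.
    tokens = output.split()
    if not tokens:
        return None
    pattern = r'\s+'.join(re.escape(token) for token in tokens)
    match = re.search(pattern, record_text)
    return (match.start(), match.end()) if match else None


record = 'Name:  John   Smith'
exact_span = guess_extracted_span('John', record)
loose_span = guess_extracted_span('John Smith', record)
missing_span = guess_extracted_span('Jane', record)
```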
Switching between OCR and INPUT_COL domain
It is sometimes useful to work in OCR space (for example, WordPolys), and then take those OCR results and display them as text within Refiner. The WordPolyInputColMapper helper class provides functionality for getting provenance-tracked Values from WordPolys that came from an IBOCRRecord:
# Import via...
from instabase.ocr.client.libs.algorithms import WordPolyInputColMapper

class WordPolyInputColMapper:

    def __init__(self, record: IBOCRRecord):
        ...

    def get_index(self, word_poly: WordPolyDict) -> int:
        """Returns the string index of the given word_poly within INPUT_COL for this record.
        """
        ...

    def get_value(self, word_poly: WordPolyDict) -> Value[Text]:
        """Returns a provenance-tracked Value object for the given word_poly.
        """
        ...

    def get_cluster(self, word_polys: List[List[WordPolyDict]]) -> Value[Text]:
        """Returns a provenance-tracked Value object for the given word_polys. The input
        is a list of list of WordPolys, where each internal list represents the word
        polys to be included in one line, and the collection of lists is the collection
        of lines.
        """
        ...
Troubleshooting provenance tracking
The return value of my UDF shows different provenance information than I would expect.
Check whether any of the input arguments of your UDF have frozen provenance information.
Frozen provenance information from input arguments takes precedence over any existing provenance information on the output value.
I used freeze() on a list or dictionary, but it’s not showing any provenance information.
freeze() does not work on collection objects (such as list or dictionary), because freeze() freezes only the top-level provenance tracker, which is None for collection objects. When displaying provenance for collection objects, we aggregate the provenance information of the elements.