Parsing functions
left_pos
left_pos(text, label=None, label_any=None, e=0, ignorecase=false, default=None)
Finds leftmost character position of a word in a given text
Args:
text (str): original text
label (str, optional): string whose leftmost character will be used for determining left position
label_any (List<str>, optional): will search for each label in order and
return the position of the first matching label.
e (int, optional): number of errors allowed in the match.
ignorecase (bool, optional): Whether casing should be ignored. By default false.
default (int, optional): value to return if no match is found.
Returns:
Returns the 0-based position of the leftmost character of the matched label, or default if no match is found.
Examples:
left_pos('hello world', 'world') -> 6
left_pos('hello! whole wide world', 'wide') -> 13
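The lookup described above can be approximated in plain Python. This is a minimal sketch, not the library's implementation: left_pos_sketch is a hypothetical name, it handles exact matches only (the e parameter is ignored), and it assumes label takes precedence over label_any when both are given.

```python
def left_pos_sketch(text, label=None, label_any=None, ignorecase=False, default=None):
    """Hypothetical stand-in for left_pos(); exact matching only (e=0)."""
    # Assumption: use `label` when given, otherwise try label_any in order.
    candidates = [label] if label is not None else list(label_any or [])
    haystack = text.lower() if ignorecase else text
    for candidate in candidates:
        needle = candidate.lower() if ignorecase else candidate
        index = haystack.find(needle)
        if index != -1:
            return index  # 0-based position of the label's first character
    return default  # no candidate matched


print(left_pos_sketch('hello world', 'world'))                      # 6
print(left_pos_sketch('hello! whole wide world', 'wide'))           # 13
print(left_pos_sketch('hello world', label_any=['mars', 'world']))  # 6
print(left_pos_sketch('hello world', 'mars', default=-1))           # -1
```

The first two calls reproduce the documented examples; the last two illustrate label_any ordering and the default fallback.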
right_pos
right_pos(text, label=None, label_any=None, e=0, ignorecase=false, default=None)
Finds rightmost character position of a word in a given text
Args:
text (str): original text
label (str, optional): string whose rightmost character will be used for determining right position
label_any (List<str>, optional): will search for each label in order
and return the position of the first matching label.
e (int, optional): number of errors allowed in the match.
ignorecase (bool, optional): Whether casing should be ignored. By default false.
default (int, optional): value to return if no match is found.
Returns:
Returns the 0-based position of the rightmost character of the matched label, or default if no match is found.
Examples:
right_pos('hello world', 'hello') -> 4
right_pos('hello! whole wide world', 'whole') -> 11
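As the examples show, the returned index points at the last character of the label itself (inclusive), not one past it. A minimal sketch of that arithmetic, using a hypothetical right_pos_sketch helper that supports exact matches only and is not the library's code:

```python
def right_pos_sketch(text, label, default=None):
    """Hypothetical stand-in for right_pos(); exact, case-sensitive matching."""
    start = text.find(label)
    if start == -1:
        return default  # label not present
    # The last character of the label sits len(label) - 1 places after its start.
    return start + len(label) - 1


print(right_pos_sketch('hello world', 'hello'))              # 4
print(right_pos_sketch('hello! whole wide world', 'whole'))  # 11
```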
scan
scan(text, starts_after=None, starts_after_any=None, ends_before=None, ends_before_any=None, left_pos=None, right_pos=None, num_lines=None, e=0, ignorecase=false)
Returns a region of text that matches the bounding criteria.
Args:
text (str): original text
starts_after (str, optional): narrows search space for finding the label.
Only text following starts_after will be used for searching the label.
Defaults to beginning of the original text.
starts_after_any (List<str>, optional): will search for each label in order and return
the position of the first matching label.
ends_before (str, optional): narrows original text, only text before
ends_before string will be used for searching the label. Defaults
to end of the original text.
ends_before_any (List<str>, optional): will search for each label in order and return
the position of the first matching label.
left_pos (int, optional): left position index.
right_pos (int, optional): right position index.
num_lines (int, optional): number of lines to consider from starts_after;
defaults to all the lines.
e (int, optional): number of errors allowed in the match.
ignorecase (bool, optional): Whether casing should be ignored. By default false.
Returns:
returns a region of text that matches the bounding criteria.
Examples:
scan(INPUT_COL, starts_after='Net Pay')
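To illustrate how the starts_after, ends_before, and num_lines bounds could combine, here is a rough sketch in plain Python. scan_sketch is a hypothetical helper and only one plausible reading of the description: it omits left_pos, right_pos, e, and ignorecase, and it assumes a missing starts_after label yields no result.

```python
def scan_sketch(text, starts_after=None, ends_before=None, num_lines=None):
    """Hypothetical approximation of scan(); exact matching only."""
    start = 0
    if starts_after is not None:
        found = text.find(starts_after)
        if found == -1:
            return None  # assumption: no match means no region
        start = found + len(starts_after)
    end = len(text)  # documented default: end of the original text
    if ends_before is not None:
        found = text.find(ends_before, start)
        if found != -1:
            end = found
    region = text[start:end]
    if num_lines is not None:
        # Keep only the first num_lines lines of the bounded region.
        region = '\n'.join(region.split('\n')[:num_lines])
    return region


sample = 'Gross Pay 1200.00\nNet Pay 950.00\nYTD 5000.00'
print(repr(scan_sketch(sample, starts_after='Net Pay', ends_before='YTD')))  # ' 950.00\n'
print(repr(scan_sketch(sample, starts_after='Net Pay', num_lines=1)))        # ' 950.00'
```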
scan_below
scan_below(text, label=None, label_any=None, left_pos=None, right_pos=None, left_pad=None, right_pad=None, ends_before=None, ends_before_any=None, num_lines=None, e=0, ignorecase=false)
Returns value below the label, with provided padding on each side
Args:
text (str): original text.
label (str, optional): string used for determining position.
label_any (List<str>, optional): will search for each label in order and return
the position of the first matching label.
left_pos (int, optional): left position index.
right_pos (int, optional): right position index.
left_pad (int, optional): extends the left position index towards left.
right_pad (int, optional): extends the right position index towards right.
ends_before (str, optional): narrows original text, only text before
ends_before string will be used for searching the label. Defaults to
end of the original text.
ends_before_any (List<str>, optional): will search for each label in order and return
the position of the first matching label.
num_lines (int, optional): number of lines to consider below the label;
defaults to all the lines.
e (int, optional): number of errors allowed in the match.
ignorecase (bool, optional): Whether casing should be ignored. By default false.
Returns:
returns value below the label.
Examples:
scan_below(INPUT_COL, 'NET PAY', num_lines=1, left_pad=3, right_pad=5)
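The effect of left_pad and right_pad is easiest to see on column-aligned text. The sketch below is a hypothetical approximation (scan_below_sketch is not the library's function): it assumes the column window is the label's own extent widened by the padding, and that the window is applied to each line below the label.

```python
def scan_below_sketch(text, label, num_lines=None, left_pad=0, right_pad=0):
    """Hypothetical approximation of scan_below(); exact matching only."""
    lines = text.split('\n')
    for row, line in enumerate(lines):
        col = line.find(label)
        if col == -1:
            continue
        # Column window: the label's extent, widened by the padding.
        left = max(col - left_pad, 0)
        right = col + len(label) + right_pad  # exclusive slice end
        below = lines[row + 1:]
        if num_lines is not None:
            below = below[:num_lines]
        return '\n'.join(candidate[left:right] for candidate in below)
    return None  # label not found


header = 'GROSS PAY'.ljust(15) + 'NET PAY'.ljust(13) + 'TAXES'
row = '1,200.00'.ljust(15) + '950.00'.ljust(13) + '250.00'
sample = header + '\n' + row
value = scan_below_sketch(sample, 'NET PAY', num_lines=1, left_pad=3, right_pad=5)
print(value.strip())  # 950.00
```

Here the padding widens the window enough to tolerate a value that is not perfectly aligned under the label, without reaching into the neighboring columns.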
scan_below_repeated
scan_below_repeated(text, label=None, label_any=None, left_pos=None, right_pos=None, left_pad=None, right_pad=None, ends_before=None, ends_before_any=None, num_lines=None, e=0, ignorecase=false, max_scans=10000)
Finds a list of matches by running scan_below() repeatedly on the remaining text after each match.
Args:
text (str): original text.
label (str, optional): string used for determining position.
label_any (List<str>, optional): will search for each label in order and return
the position of the first matching label.
left_pad (int, optional): extends the left position index towards left.
right_pad (int, optional): extends the right position index towards right.
ends_before (str, optional): narrows original text, only text before
ends_before string will be used for searching the label. Defaults to
end of the original text.
ends_before_any (List<str>, optional): will search for each label in order and return
the position of the first matching label.
left_pos (int, optional): left position index.
right_pos (int, optional): right position index.
num_lines (int, optional): number of lines to consider below the label;
defaults to all the lines.
e (int, optional): number of errors allowed in the match.
ignorecase (bool, optional): Whether casing should be ignored. By default false.
max_scans (int, optional): maximum number of results that should be populated. If the
value is not set, or is less than 0, we default to 10000.
Returns:
a list of matches by running scan_below() repeatedly on the remaining text after each match.
Examples:
scan_below_repeated(INPUT_COL, 'NET PAY', num_lines=1, left_pad=3, right_pad=5)
scan_box
scan_box(text, label=None, label_any=None, starts_after=None, starts_after_any=None, ends_before=None, ends_before_any=None, left_pos=None, right_pos=None, e=0, ignorecase=false, pixel_tolerance=2, exclude_label_line=false)
Returns the contents of the box containing the search term.
Requires provenance tracking and line detection used in Process Files.
Args:
text (str): The text to search in.
label (str): Search term contained in a visual box with the content to extract.
label_any (List<str>, optional): will search for each label in order and
return the position of the first matching label.
starts_after (str, optional): Narrows search space for finding the label.
Only text following starts_after will be used for searching the label.
Defaults to beginning of the original text.
starts_after_any (List<str>, optional): Will search for each label in order and return
the position of the first matching label.
ends_before (str, optional): Narrows original text, only text before
ends_before string will be used for searching the label. Defaults
to end of the original text.
ends_before_any (List<str>, optional): Will search for each label in order and return
the position of the first matching label.
left_pos (int, optional): Left position index.
right_pos (int, optional): Right position index.
e (int, optional): Number of errors allowed in the match. By default 0.
ignorecase (bool, optional): Whether casing should be ignored. By default false.
pixel_tolerance (int, optional): Sets a number of pixels that a word can
be past the side of a rectangle it is contained in. By default 2.
exclude_label_line (bool, optional): If set to true, will remove the label
used to find the rectangle from the output.
Returns:
Returns content of the visual box containing the search term.
scan_line
scan_line(text, label=None, label_any=None, starts_after=None, starts_after_any=None, ends_before=None, ends_before_any=None, left_pos=None, right_pos=None, e=0, ignorecase=false)
Returns the line that contains the label, bounded by the left_pos and right_pos params.
Args:
text (str): original text
label (str, optional): string used for determining position
label_any (List<str>, optional): will search for each label in order and return
the position of the first matching label.
starts_after (str, optional): narrows search space for finding the label.
Only text following starts_after will be used for searching the label.
Defaults to beginning of the original text.
starts_after_any (List<str>, optional): will search for each label in order and return
the position of the first matching label.
ends_before (str, optional): narrows original text, only text before
ends_before string will be used for searching the label. Defaults to
end of the original text.
ends_before_any (List<str>, optional): will search for each label in order and return
the position of the first matching label.
left_pos (int, optional): left position index.
right_pos (int, optional): right position index.
e (int, optional): number of errors allowed in the match.
ignorecase (bool, optional): Whether casing should be ignored. By default false.
Returns:
returns the line that contains the label, bounded by left_pos and right_pos.
Examples:
scan_line(INPUT_COL, 'Tax Year', right_pos=left_pos(INPUT_COL, 'Tax Year'))
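A rough picture of the line lookup with column bounds, written as a hypothetical scan_line_sketch helper. This reference does not state whether right_pos is inclusive; the sketch treats both bounds as Python slice indices (right bound exclusive), which is an assumption.

```python
def scan_line_sketch(text, label, left_pos=None, right_pos=None):
    """Hypothetical approximation of scan_line(); exact matching only."""
    for line in text.split('\n'):
        if label in line:
            # Assumption: the bounds behave like slice indices on the line.
            return line[left_pos:right_pos]
    return None  # no line contains the label


sample = 'Form 1040\nTax Year 2020 Filing Status Single'
print(scan_line_sketch(sample, 'Tax Year'))                             # Tax Year 2020 Filing Status Single
print(scan_line_sketch(sample, 'Tax Year', right_pos=13))               # Tax Year 2020
print(scan_line_sketch(sample, 'Tax Year', left_pos=9, right_pos=13))   # 2020
```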
scan_line_repeated
scan_line_repeated(text, label=None, label_any=None, starts_after=None, starts_after_any=None, ends_before=None, ends_before_any=None, left_pos=None, right_pos=None, e=0, ignorecase=false, max_scans=10000)
Finds a list of matches by running scan_line() repeatedly on the remaining text after each match.
Args:
text (str): original text
label (str, optional): string used for determining position
label_any (List<str>, optional): will search for each label in order and return
the position of the first matching label.
starts_after (str, optional): narrows search space for finding the label.
Only text following starts_after will be used for searching the label.
Defaults to beginning of the original text.
starts_after_any (List<str>, optional): will search for each label in order and return
the position of the first matching label.
ends_before (str, optional): narrows original text, only text before
ends_before string will be used for searching the label. Defaults to
end of the original text.
ends_before_any (List<str>, optional): will search for each label in order and return
the position of the first matching label.
left_pos (int, optional): left position index.
right_pos (int, optional): right position index.
e (int, optional): number of errors allowed in the match.
ignorecase (bool, optional): Whether casing should be ignored. By default false.
max_scans (int, optional): The maximum number of times we'll repeat the scan_line()
function. If the value is not set, or is < 0, we default to 10000.
Returns:
a list of matches by running scan_line() repeatedly on the remaining text after each match.
Examples:
scan_line_repeated(INPUT_COL, 'Tax Year')
scan_near
scan_near(text, label, target, max_distance=10, include_info=false, direction=None, max_distance_x=None, max_distance_y=None)
Finds a label within a piece of text, and returns desired targets around found labels.
See the `regex()` and `token_matcher()` functions to use regexes and special tokens
as labels and targets. Otherwise, pass a provenance-tracked value as a target or
label to scan from a specific extraction. If the value is not provenance-tracked, the string
literal is interpreted as a regex to search for.
Args:
text (str): The text to search in
label (Union<str, List<str>, Regex, List<Regex>, Matcher, List<Matcher>>): A previously computed value, regex, list of regexes, or a token matcher to search for within the given text
target (Union<str, List<str>, Regex, List<Regex>, Matcher, List<Matcher>>): A previously computed value, regex, list of regexes, or a token matcher to search for within the found region
max_distance (float, optional): The maximum distance allowed between the label and target. Distance is the minimum distance from one piece of text to another, with columns and lines each being a unit distance of 1.
For instance, [label][target] would be a distance of 0, while [label] [target] is a distance of 1,
and the following example is a distance of 1 (indicated by the pipe):
[label]
|
[target]
include_info (bool, optional): If false, only returns the list of found targets. Otherwise, returns a dictionary including the label, target, locations, distance, and angle from label to target. Defaults to false.
direction (str, optional): Can be 'left', 'right', 'above', or 'below', indicating what direction from the label to the target is prioritized. This weighted distance is included as 'heuristic_distance' when info is included. Defaults to None.
max_distance_x (float, optional): The maximum distance allowed between the label and target in the x direction (see above for info on how this is computed). If not None, `max_distance` is ignored. Defaults to None.
max_distance_y must also be set, or this will be ignored.
max_distance_y (float, optional): The maximum distance allowed between the label and target in the y direction (see above for info on how this is computed). If not None, `max_distance` is ignored. Defaults to None.
max_distance_x must also be set, or this will be ignored.
Returns:
Desired targets around found labels
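The distance rule for max_distance can be made concrete with a small calculation. The sketch below is an illustration under assumptions, not the library's formula: grid_distance is a hypothetical helper, spans are given as (line, first_col, last_col), the horizontal component counts the empty columns between two spans, the vertical component is the difference in line number (reading the pipe in the diagram as a one-line step), and the two components are combined with a Euclidean norm.

```python
import math


def grid_distance(a, b):
    """Distance between two spans given as (line, first_col, last_col), columns inclusive."""
    line_a, start_a, end_a = a
    line_b, start_b, end_b = b
    # Empty columns between the spans; 0 when they touch or overlap horizontally.
    gap_x = max(0, max(start_a, start_b) - min(end_a, end_b) - 1)
    # Assumption: one line apart counts as a vertical distance of 1.
    gap_y = abs(line_a - line_b)
    return math.hypot(gap_x, gap_y)


label = (0, 0, 6)                         # "[label]" on line 0, columns 0-6
print(grid_distance(label, (0, 7, 14)))   # "[label][target]"      -> 0.0
print(grid_distance(label, (0, 8, 15)))   # "[label] [target]"     -> 1.0
print(grid_distance(label, (1, 0, 7)))    # target one line below  -> 1.0
```

Under these assumptions the three documented cases come out as 0, 1, and 1.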
scan_repeated
scan_repeated(text, starts_after=None, starts_after_any=None, ends_before=None, ends_before_any=None, left_pos=None, right_pos=None, num_lines=None, e=0, ignorecase=false, max_scans=10000)
Finds a list of matches by running scan() repeatedly on the remaining text after each match.
Args:
text (str): original text
starts_after (str, optional): narrows search space for finding the label.
Only text following starts_after will be used for searching the label.
Defaults to beginning of the original text.
starts_after_any (List<str>, optional): will search for each label in order and return
the position of the first matching label.
ends_before (str, optional): narrows original text, only text before
ends_before string will be used for searching the label. Defaults
to end of the original text.
ends_before_any (List<str>, optional): will search for each label in order and return
the position of the first matching label.
left_pos (int, optional): left position index.
right_pos (int, optional): right position index.
num_lines (int, optional): number of lines to consider from starts_after;
defaults to all the lines.
e (int, optional): number of errors allowed in the match.
ignorecase (bool, optional): Whether casing should be ignored. By default false.
max_scans (int, optional): the maximum number of repeated scans performed. By default 10000.
Returns:
a list of matches by running scan() repeatedly on the remaining text after each match.
Examples:
scan_repeated(INPUT_COL, starts_after='Net Pay')
scan_right
scan_right(text, label=None, label_any=None, ends_before=None, ends_before_any=None, left_pos=None, right_pos=None, e=0, ignorecase=false)
Finds value for a label by scanning right of the label in the same line
Args:
text (str): original text
label (str, optional): string used for determining position
label_any (List<str>, optional): will search for each label in order and return
the position of the first matching label.
ends_before (str, optional): narrows original text, only text before
ends_before string will be used for searching the label. Defaults to
end of the original text.
ends_before_any (List<str>, optional): will search for each label in order and return
the position of the first matching label.
left_pos (int, optional): left position index.
right_pos (int, optional): right position index.
e (int, optional): number of errors allowed in the match.
ignorecase (bool, optional): Whether casing should be ignored. By default false.
Returns:
returns value for a label by scanning right of the label in the same line
Examples:
scan_right(INPUT_COL, 'NET PAY')
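Scanning right of a label amounts to taking the remainder of the line after the label. A minimal sketch with a hypothetical scan_right_sketch helper; it assumes exact matching and strips surrounding whitespace from the value, which the reference does not specify.

```python
def scan_right_sketch(text, label):
    """Hypothetical approximation of scan_right(); exact matching only."""
    for line in text.split('\n'):
        col = line.find(label)
        if col != -1:
            # Everything on the same line after the label's last character.
            return line[col + len(label):].strip()
    return None  # label not found


sample = 'GROSS PAY 1,200.00\nNET PAY 950.00'
print(scan_right_sketch(sample, 'NET PAY'))  # 950.00
```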
scan_right_repeated
scan_right_repeated(text, label=None, label_any=None, ends_before=None, ends_before_any=None, left_pos=None, right_pos=None, e=0, ignorecase=false, max_scans=10000)
Finds a list of matches by running scan_right() repeatedly on the remaining text after each match.
Args:
text (str): original text
label (str, optional): string used for determining position
label_any (List<str>, optional): will search for each label in order and return
the position of the first matching label.
ends_before (str, optional): narrows original text, only text before
ends_before string will be used for searching the label. Defaults to
end of the original text.
ends_before_any (List<str>, optional): will search for each label in order and return
the position of the first matching label.
left_pos (int, optional): left position index.
right_pos (int, optional): right position index.
e (int, optional): number of errors allowed in the match.
ignorecase (bool, optional): Whether casing should be ignored. By default false.
max_scans (int, optional): maximum number of results that should be populated. If the
value is not set, or is less than 0, we default to 10000.
Returns:
a list of matches by running scan_right() repeatedly on the remaining text after each match.
Examples:
scan_right_repeated(INPUT_COL, 'NET PAY')
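The repeated variants follow a common pattern: match, record the value, then continue on the text that remains. The sketch below shows that loop for the scan-right case, with the documented max_scans fallback. scan_right_repeated_sketch is a hypothetical helper with exact matching only; it assumes the remaining text begins where the matched line ends.

```python
def scan_right_repeated_sketch(text, label, max_scans=10000):
    """Hypothetical approximation of scan_right_repeated(); exact matching only."""
    if max_scans is None or max_scans < 0:
        max_scans = 10000  # documented fallback for unset or negative values
    results = []
    remaining = text
    while len(results) < max_scans:
        index = remaining.find(label)
        if index == -1:
            break  # no further occurrences of the label
        after = remaining[index + len(label):]
        line_end = after.find('\n')
        value = after if line_end == -1 else after[:line_end]
        results.append(value.strip())
        # Continue scanning on the text after the matched value.
        remaining = after[len(value):]
    return results


sample = 'NET PAY 950.00\nTAXES 250.00\nNET PAY 975.00'
print(scan_right_repeated_sketch(sample, 'NET PAY'))               # ['950.00', '975.00']
print(scan_right_repeated_sketch(sample, 'NET PAY', max_scans=1))  # ['950.00']
```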