Parsing Functions
left_pos
left_pos(text, label=None, label_any=None, e=0, ignorecase=false, default=None)
Finds the leftmost character position of a word in a given text.

Args:
    text (str): original text.
    label (str, optional): string whose leftmost character will be used for determining the left position.
    label_any (List<str>, optional): will search for each label in order and return the position of the first matching label.
    e (int, optional): number of errors allowed in the match.
    ignorecase (bool, optional): whether casing should be ignored. By default false.
    default (int, optional): value to return if no match is found.

Returns:
    The leftmost character position of the word in the given text.

Examples:
    left_pos('hello world', 'world') -> 6
    left_pos('hello! whole wide world', 'wide') -> 13
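The two documented examples are consistent with a 0-based index of the label's first character. The sketch below reproduces them in plain Python. `left_pos_sketch` is a hypothetical stand-in, not the actual implementation; it assumes single-line text and exact matching (`e=0`), and it does not model fuzzy matching.

```python
from typing import List, Optional


def left_pos_sketch(text: str,
                    label: Optional[str] = None,
                    label_any: Optional[List[str]] = None,
                    ignorecase: bool = False,
                    default: Optional[int] = None) -> Optional[int]:
    """Illustrative only: 0-based index of the first character of the first matching label."""
    # A single label takes the place of the list; otherwise try label_any in order.
    candidates = [label] if label is not None else list(label_any or [])
    haystack = text.lower() if ignorecase else text
    for candidate in candidates:
        needle = candidate.lower() if ignorecase else candidate
        idx = haystack.find(needle)
        if idx != -1:
            return idx
    # No candidate matched: fall back to the caller-supplied default.
    return default


print(left_pos_sketch('hello world', 'world'))             # 6
print(left_pos_sketch('hello! whole wide world', 'wide'))  # 13
```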
right_pos
right_pos(text, label=None, label_any=None, e=0, ignorecase=false, default=None)
Finds the rightmost character position of a word in a given text.

Args:
    text (str): original text.
    label (str, optional): string whose rightmost character will be used for determining the right position.
    label_any (List<str>, optional): will search for each label in order and return the position of the first matching label.
    e (int, optional): number of errors allowed in the match.
    ignorecase (bool, optional): whether casing should be ignored. By default false.
    default (int, optional): value to return if no match is found.

Returns:
    The rightmost character position of the word in the given text.

Examples:
    right_pos('hello world', 'hello') -> 4
    right_pos('hello! whole wide world', 'whole') -> 11
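Both examples return the index of the label's last character, so the position appears to be inclusive rather than one past the end. A minimal sketch of that arithmetic follows; `right_pos_sketch` is a hypothetical helper that assumes exact, case-sensitive matching on single-line text.

```python
from typing import Optional


def right_pos_sketch(text: str, label: str, default: Optional[int] = None) -> Optional[int]:
    """Illustrative only: inclusive 0-based index of the label's last character."""
    start = text.find(label)
    if not label or start == -1:
        return default
    # Inclusive end index: start of the label plus its length, minus one.
    return start + len(label) - 1


print(right_pos_sketch('hello world', 'hello'))              # 4
print(right_pos_sketch('hello! whole wide world', 'whole'))  # 11
```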
scan
scan(text, starts_after=None, starts_after_any=None, ends_before=None, ends_before_any=None, left_pos=None, right_pos=None, num_lines=None, e=0, ignorecase=false)
Returns a region of text that matches the bounding criteria.

Args:
    text (str): original text.
    starts_after (str, optional): narrows the search space. Only text following starts_after will be used. Defaults to beginning of the original text.
    starts_after_any (List<str>, optional): will search for each label in order and return the position of the first matching label.
    ends_before (str, optional): narrows original text, only text before the ends_before string will be used. Defaults to end of the original text.
    ends_before_any (List<str>, optional): will search for each label in order and return the position of the first matching label.
    left_pos (int, optional): left position index.
    right_pos (int, optional): right position index.
    num_lines (int, optional): number of lines to consider from starts_after; defaults to all the lines.
    e (int, optional): number of errors allowed in the match.
    ignorecase (bool, optional): whether casing should be ignored. By default false.

Returns:
    The region of text that matches the bounding criteria.

Examples:
    scan(INPUT_COL, starts_after='Net Pay')
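The bounding behavior of starts_after and ends_before can be pictured as slicing the text between two markers. The sketch below is a rough approximation under stated assumptions: `scan_sketch` is a hypothetical name, matching is exact, column bounds and num_lines are ignored, and returning None when starts_after is missing is a guess rather than documented behavior. The sample text is made up.

```python
from typing import Optional


def scan_sketch(text: str,
                starts_after: Optional[str] = None,
                ends_before: Optional[str] = None) -> Optional[str]:
    """Illustrative only: text between starts_after and ends_before."""
    start = 0
    if starts_after is not None:
        idx = text.find(starts_after)
        if idx == -1:
            return None  # assumption: no region when the start marker is absent
        start = idx + len(starts_after)
    end = len(text)  # default: run to the end of the original text
    if ends_before is not None:
        idx = text.find(ends_before, start)
        if idx != -1:
            end = idx
    return text[start:end]


sample = "Gross Pay 2,000.00\nNet Pay 1,234.56\nYTD 9,999.00"
print(repr(scan_sketch(sample, starts_after='Net Pay', ends_before='YTD')))  # ' 1,234.56\n'
```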
scan_below
scan_below(text, label=None, label_any=None, left_pos=None, right_pos=None, left_pad=None, right_pad=None, ends_before=None, ends_before_any=None, num_lines=None, e=0, ignorecase=false)
Returns the value below the label, with the provided padding on each side.

Args:
    text (str): original text.
    label (str, optional): string used for determining position.
    label_any (List<str>, optional): will search for each label in order and return the position of the first matching label.
    left_pos (int, optional): left position index.
    right_pos (int, optional): right position index.
    left_pad (int, optional): extends the left position index towards the left.
    right_pad (int, optional): extends the right position index towards the right.
    ends_before (str, optional): narrows original text, only text before the ends_before string will be used for searching the label. Defaults to end of the original text.
    ends_before_any (List<str>, optional): will search for each label in order and return the position of the first matching label.
    num_lines (int, optional): number of lines to consider below the label; defaults to all the lines.
    e (int, optional): number of errors allowed in the match.
    ignorecase (bool, optional): whether casing should be ignored. By default false.

Returns:
    The value below the label.

Examples:
    scan_below(INPUT_COL, 'NET PAY', num_lines=1, left_pad=3, right_pad=5)
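The padding parameters matter when the value under a label is wider than the label itself or is aligned differently. The following sketch shows one plausible reading: take the label's column span, widen it by left_pad and right_pad, and cut that column window out of the lines below. `scan_below_sketch` and the sample rows are hypothetical, and stripping whitespace from the result is an assumption.

```python
from typing import Optional


def scan_below_sketch(text: str, label: str, num_lines: Optional[int] = None,
                      left_pad: int = 0, right_pad: int = 0) -> Optional[str]:
    """Illustrative only: column window under the label, widened by the pads."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        start = line.find(label)
        if start == -1:
            continue
        left = max(0, start - left_pad)               # pad extends the window leftward
        right = start + len(label) - 1 + right_pad    # inclusive right edge
        below = lines[i + 1:]
        if num_lines is not None:
            below = below[:num_lines]
        return "\n".join(row[left:right + 1] for row in below).strip()
    return None


# Made-up sample: two right-aligned columns built explicitly to fix the spacing.
sample = ("GROSS PAY" + " " * 6 + "NET PAY\n"
          "2,000.00" + " " * 6 + "1,234.56\n")

print(scan_below_sketch(sample, 'NET PAY', num_lines=1))                            # ,234.56
print(scan_below_sketch(sample, 'NET PAY', num_lines=1, left_pad=3, right_pad=5))   # 1,234.56
```

Without padding the window is exactly as wide as the label and clips the leading digit; with left_pad=3 the full value is captured.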
scan_below_repeated
scan_below_repeated(text, label=None, label_any=None, left_pos=None, right_pos=None, left_pad=None, right_pad=None, ends_before=None, ends_before_any=None, num_lines=None, e=0, ignorecase=false, max_scans=10000)
Finds a list of matches by running scan_below() repeatedly on the remaining text after each match.

Args:
    text (str): original text.
    label (str, optional): string used for determining position.
    label_any (List<str>, optional): will search for each label in order and return the position of the first matching label.
    left_pos (int, optional): left position index.
    right_pos (int, optional): right position index.
    left_pad (int, optional): extends the left position index towards the left.
    right_pad (int, optional): extends the right position index towards the right.
    ends_before (str, optional): narrows original text, only text before the ends_before string will be used for searching the label. Defaults to end of the original text.
    ends_before_any (List<str>, optional): will search for each label in order and return the position of the first matching label.
    num_lines (int, optional): number of lines to consider below the label; defaults to all the lines.
    e (int, optional): number of errors allowed in the match.
    ignorecase (bool, optional): whether casing should be ignored. By default false.
    max_scans (int, optional): maximum number of results that should be populated. If the value is not set, or is less than 0, we default to 10000.

Returns:
    A list of matches found by running scan_below() repeatedly on the remaining text after each match.

Examples:
    scan_below_repeated(INPUT_COL, 'NET PAY', num_lines=1, left_pad=3, right_pad=5)
scan_box
scan_box(text, label=None, label_any=None, starts_after=None, starts_after_any=None, ends_before=None, ends_before_any=None, left_pos=None, right_pos=None, e=0, ignorecase=false, pixel_tolerance=2, exclude_label_line=false)
Returns the contents of the box containing the search term. Requires provenance tracking and line detection used in Process Files.

Args:
    text (str): the text to search in.
    label (str): search term contained in a visual box with the content to extract.
    label_any (List<str>, optional): will search for each label in order and return the position of the first matching label.
    starts_after (str, optional): narrows search space for finding the label. Only text following starts_after will be used for searching the label. Defaults to beginning of the original text.
    starts_after_any (List<str>, optional): will search for each label in order and return the position of the first matching label.
    ends_before (str, optional): narrows original text, only text before the ends_before string will be used for searching the label. Defaults to end of the original text.
    ends_before_any (List<str>, optional): will search for each label in order and return the position of the first matching label.
    left_pos (int, optional): left position index.
    right_pos (int, optional): right position index.
    e (int): number of errors allowed in the match. By default 0.
    ignorecase (bool, optional): whether casing should be ignored. By default false.
    pixel_tolerance (int, optional): sets a number of pixels that a word can be past the side of a rectangle it is contained in. By default 2.
    exclude_label_line (bool, optional): if set to true, will remove the label used to find the rectangle from the output.

Returns:
    The content of the visual box containing the search term.
scan_line
scan_line(text, label=None, label_any=None, starts_after=None, starts_after_any=None, ends_before=None, ends_before_any=None, left_pos=None, right_pos=None, e=0, ignorecase=false)
Returns the line that has the label, bounded by the left_pos and right_pos params.

Args:
    text (str): original text.
    label (str, optional): string used for determining position.
    label_any (List<str>, optional): will search for each label in order and return the position of the first matching label.
    starts_after (str, optional): narrows search space for finding the label. Only text following starts_after will be used for searching the label. Defaults to beginning of the original text.
    starts_after_any (List<str>, optional): will search for each label in order and return the position of the first matching label.
    ends_before (str, optional): narrows original text, only text before the ends_before string will be used for searching the label. Defaults to end of the original text.
    ends_before_any (List<str>, optional): will search for each label in order and return the position of the first matching label.
    left_pos (int, optional): left position index.
    right_pos (int, optional): right position index.
    e (int, optional): number of errors allowed in the match.
    ignorecase (bool, optional): whether casing should be ignored. By default false.

Returns:
    The line that contains the label, bounded by left_pos and right_pos.

Examples:
    scan_line(INPUT_COL, 'Tax Year', right_pos=left_pos(INPUT_COL, 'Tax Year'))
scan_line_repeated
scan_line_repeated(text, label=None, label_any=None, starts_after=None, starts_after_any=None, ends_before=None, ends_before_any=None, left_pos=None, right_pos=None, e=0, ignorecase=false, max_scans=10000)
Finds a list of matches by running scan_line() repeatedly on the remaining text after each match.

Args:
    text (str): original text.
    label (str, optional): string used for determining position.
    label_any (List<str>, optional): will search for each label in order and return the position of the first matching label.
    starts_after (str, optional): narrows search space for finding the label. Only text following starts_after will be used for searching the label. Defaults to beginning of the original text.
    starts_after_any (List<str>, optional): will search for each label in order and return the position of the first matching label.
    ends_before (str, optional): narrows original text, only text before the ends_before string will be used for searching the label. Defaults to end of the original text.
    ends_before_any (List<str>, optional): will search for each label in order and return the position of the first matching label.
    left_pos (int, optional): left position index.
    right_pos (int, optional): right position index.
    e (int, optional): number of errors allowed in the match.
    ignorecase (bool, optional): whether casing should be ignored. By default false.
    max_scans (int, optional): the maximum number of times we'll repeat the scan_line() function. If the value is not set, or is less than 0, we default to 10000.

Returns:
    A list of matches found by running scan_line() repeatedly on the remaining text after each match.

Examples:
    scan_line_repeated(INPUT_COL, 'Tax Year')
scan_near
scan_near(text, label, target, max_distance=10, include_info=false, direction=None, max_distance_x=None, max_distance_y=None)
Finds a label within a piece of text, and returns desired targets around found labels. See the `regex()` and `token_matcher()` functions to use regexes and special tokens as labels and targets. Otherwise, pass a provenance-tracked value as a target or label to scan from a specific extraction. If the value is not provenance-tracked, the string literal will be interpreted as a regex to search for.

Args:
    text (str): the text to search in.
    label (Union<str, List<str>, Regex, List<Regex>, Matcher, List<Matcher>>): a previously computed value, regex, list of regexes, or a token matcher to search for within the given text.
    target (Union<str, List<str>, Regex, List<Regex>, Matcher, List<Matcher>>): a previously computed value, regex, list of regexes, or a token matcher to search for within the found region.
    max_distance (float, optional): the maximum distance allowed between the label and target. Distance is the minimum distance from one piece of text to another, with columns and lines each being a unit distance of 1. For instance, [label][target] would be a distance of 0, while [label] [target] is a distance of 1, and the following example is a distance of 1 (indicated by the pipe):

        [label]
           |
        [target]

    include_info (bool): if false, only returns the list of found targets. Otherwise, returns a dictionary including the label, target, locations, distance, and angle from label to target. Defaults to false.
    direction (str): can be 'left', 'right', 'above', or 'below', indicating what direction from the label to the target is prioritized. This weighted distance is included as 'heuristic_distance' when info is included.
    max_distance_x (float, optional): the maximum distance allowed between the label and target in the x direction (see above for info on how this is computed). If not None, `max_distance` is ignored. Defaults to None. max_distance_y must also be set, or this will be ignored.
    max_distance_y (float, optional): the maximum distance allowed between the label and target in the y direction (see above for info on how this is computed). If not None, `max_distance` is ignored. Defaults to None. max_distance_x must also be set, or this will be ignored.

Returns:
    Desired targets around found labels.
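Two rules in the parameter notes lend themselves to a small sketch: the horizontal distance examples ([label][target] is 0, [label] [target] is 1) and the requirement that max_distance_x and max_distance_y be set together. The helpers below are hypothetical and cover only those two rules, with plain substring matching on one line; they do not model regexes, token matchers, vertical distance, or direction weighting.

```python
from typing import Dict, Optional


def horizontal_gap(line: str, label: str, target: str) -> Optional[int]:
    """Illustrative only: number of columns separating label and target on one line."""
    l0, t0 = line.find(label), line.find(target)
    if l0 == -1 or t0 == -1:
        return None
    l1, t1 = l0 + len(label), t0 + len(target)  # exclusive end columns
    # Gap is measured from the end of whichever piece comes first.
    return max(0, t0 - l1, l0 - t1)


def resolve_limits(max_distance: float = 10,
                   max_distance_x: Optional[float] = None,
                   max_distance_y: Optional[float] = None) -> Dict[str, float]:
    """Illustrative only: which limits apply, per the parameter notes."""
    # Per-axis limits apply only when BOTH are set; max_distance is then ignored.
    if max_distance_x is not None and max_distance_y is not None:
        return {"x": max_distance_x, "y": max_distance_y}
    return {"any": max_distance}


print(horizontal_gap("[label][target]", "[label]", "[target]"))   # 0
print(horizontal_gap("[label] [target]", "[label]", "[target]"))  # 1
print(resolve_limits(max_distance_x=5))                           # {'any': 10}
```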
scan_repeated
scan_repeated(text, starts_after=None, starts_after_any=None, ends_before=None, ends_before_any=None, left_pos=None, right_pos=None, num_lines=None, e=0, ignorecase=false, max_scans=10000)
Finds a list of matches by running scan() repeatedly on the remaining text after each match.

Args:
    text (str): original text.
    starts_after (str, optional): narrows the search space. Only text following starts_after will be used. Defaults to beginning of the original text.
    starts_after_any (List<str>, optional): will search for each label in order and return the position of the first matching label.
    ends_before (str, optional): narrows original text, only text before the ends_before string will be used. Defaults to end of the original text.
    ends_before_any (List<str>, optional): will search for each label in order and return the position of the first matching label.
    left_pos (int, optional): left position index.
    right_pos (int, optional): right position index.
    num_lines (int, optional): number of lines to consider from starts_after; defaults to all the lines.
    e (int, optional): number of errors allowed in the match.
    ignorecase (bool, optional): whether casing should be ignored. By default false.
    max_scans (int, optional): the maximum number of repeated scans performed.

Returns:
    A list of matches found by running scan() repeatedly on the remaining text after each match.

Examples:
    scan_repeated(INPUT_COL, starts_after='Net Pay')
scan_right
scan_right(text, label=None, label_any=None, ends_before=None, ends_before_any=None, left_pos=None, right_pos=None, e=0, ignorecase=false)
Finds the value for a label by scanning right of the label in the same line.

Args:
    text (str): original text.
    label (str, optional): string used for determining position.
    label_any (List<str>, optional): will search for each label in order and return the position of the first matching label.
    ends_before (str, optional): narrows original text, only text before the ends_before string will be used for searching the label. Defaults to end of the original text.
    ends_before_any (List<str>, optional): will search for each label in order and return the position of the first matching label.
    left_pos (int, optional): left position index.
    right_pos (int, optional): right position index.
    e (int, optional): number of errors allowed in the match.
    ignorecase (bool, optional): whether casing should be ignored. By default false.

Returns:
    The value for the label, found by scanning right of the label in the same line.

Examples:
    scan_right(INPUT_COL, 'NET PAY')
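As a rough picture of "scanning right of the label in the same line", the sketch below returns whatever follows the label on the first line that contains it. `scan_right_sketch` is a hypothetical helper with exact matching; stripping the surrounding whitespace is an assumption, and the sample text is invented for illustration.

```python
from typing import Optional


def scan_right_sketch(text: str, label: str, default: Optional[str] = None) -> Optional[str]:
    """Illustrative only: text to the right of the label on the label's own line."""
    for line in text.splitlines():
        idx = line.find(label)
        if idx != -1:
            # Everything after the label, up to the end of that line.
            return line[idx + len(label):].strip()
    return default


sample = "EMPLOYEE  Jane Doe\nNET PAY   1,234.56\n"
print(scan_right_sketch(sample, 'NET PAY'))   # 1,234.56
print(scan_right_sketch(sample, 'EMPLOYEE'))  # Jane Doe
```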
scan_right_repeated
scan_right_repeated(text, label=None, label_any=None, ends_before=None, ends_before_any=None, left_pos=None, right_pos=None, e=0, ignorecase=false, max_scans=10000)
Finds a list of matches by running scan_right() repeatedly on the remaining text after each match.

Args:
    text (str): original text.
    label (str, optional): string used for determining position.
    label_any (List<str>, optional): will search for each label in order and return the position of the first matching label.
    ends_before (str, optional): narrows original text, only text before the ends_before string will be used for searching the label. Defaults to end of the original text.
    ends_before_any (List<str>, optional): will search for each label in order and return the position of the first matching label.
    left_pos (int, optional): left position index.
    right_pos (int, optional): right position index.
    e (int, optional): number of errors allowed in the match.
    ignorecase (bool, optional): whether casing should be ignored. By default false.
    max_scans (int, optional): maximum number of results that should be populated. If the value is not set, or is less than 0, we default to 10000.

Returns:
    A list of matches found by running scan_right() repeatedly on the remaining text after each match.

Examples:
    scan_right_repeated(INPUT_COL, 'NET PAY')
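The "repeatedly on the remaining text" pattern shared by the *_repeated functions can be sketched as a loop that consumes the text up to the end of each match and stops after max_scans results. `scan_right_repeated_sketch` is a hypothetical helper with exact matching; how the real functions define "remaining text" is an assumption here (the rest of the text after the matched line), and the sample is invented.

```python
from typing import List, Optional


def scan_right_repeated_sketch(text: str, label: str,
                               max_scans: Optional[int] = 10000) -> List[str]:
    """Illustrative only: collect the right-of-label value for every occurrence."""
    # Documented fallback: an unset or negative max_scans becomes 10000.
    if max_scans is None or max_scans < 0:
        max_scans = 10000
    results: List[str] = []
    if not label:
        return results
    remaining = text
    while len(results) < max_scans:
        idx = remaining.find(label)
        if idx == -1:
            break
        after = remaining[idx + len(label):]
        # Keep the rest of the matched line; continue scanning after it.
        line_rest, _, remaining = after.partition("\n")
        results.append(line_rest.strip())
    return results


sample = "NET PAY 100.00\nTAX 5.00\nNET PAY 200.00\n"
print(scan_right_repeated_sketch(sample, 'NET PAY'))               # ['100.00', '200.00']
print(scan_right_repeated_sketch(sample, 'NET PAY', max_scans=1))  # ['100.00']
```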