Parsing functions

left_pos

left_pos(text, label=None, label_any=None, e=0, ignorecase=false, default=None)

Finds leftmost character position of a word in a given text

Args:
    text (str): original text
    label (str, optional): string whose leftmost character will be used for determining left position
    label_any (List<str>, optional): will search for each label in order and
        return the position of the first matching label.
    e (int, optional): number of errors allowed in the match.
    ignorecase (bool, optional): Whether casing should be ignored. By default false.
    default (int, optional): value to return if no match is found.

Returns:
    int: 0-based position of the label's leftmost character, or default if no match is found.

Examples:
    left_pos('hello world', 'world') -> 6
    left_pos('hello! whole wide world', 'wide') -> 13
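
The two examples above can be reproduced with a small exact-match stand-in. This is only a sketch under stated assumptions: left_pos_sketch is a hypothetical name, not the library function; it ignores the fuzzy-match parameter e; and it assumes positions are 0-based, as the examples imply. Python's False stands in for the documented false.

```python
from typing import List, Optional


def left_pos_sketch(text: str,
                    label: Optional[str] = None,
                    label_any: Optional[List[str]] = None,
                    ignorecase: bool = False,
                    default: Optional[int] = None) -> Optional[int]:
    """Hypothetical exact-match stand-in for left_pos (no support for `e`)."""
    # A single label takes precedence; otherwise try label_any in order.
    candidates = [label] if label is not None else list(label_any or [])
    haystack = text.lower() if ignorecase else text
    for candidate in candidates:
        needle = candidate.lower() if ignorecase else candidate
        index = haystack.find(needle)
        if index != -1:
            return index  # 0-based index of the label's first character
    return default


print(left_pos_sketch('hello world', 'world'))                      # 6
print(left_pos_sketch('hello! whole wide world', 'wide'))           # 13
print(left_pos_sketch('hello world', label_any=['moon', 'world']))  # 6
print(left_pos_sketch('hello world', 'WORLD', ignorecase=True))     # 6
print(left_pos_sketch('hello world', 'moon', default=-1))           # -1
```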

right_pos

right_pos(text, label=None, label_any=None, e=0, ignorecase=false, default=None)

Finds rightmost character position of a word in a given text

Args:
    text (str): original text
    label (str, optional): string whose rightmost character will be used for determining right position
    label_any (List<str>, optional): will search for each label in order
        and return the position of the first matching label.
    e (int, optional): number of errors allowed in the match.
    ignorecase (bool, optional): Whether casing should be ignored. By default false.
    default (int, optional): value to return if no match is found.

Returns:
    int: 0-based position of the label's rightmost character, or default if no match is found.

Examples:
    right_pos('hello world', 'hello') -> 4
    right_pos('hello! whole wide world', 'whole') -> 11
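
The examples above suggest that the returned index is inclusive, that is, it points at the label's last character rather than one past it. The sketch below illustrates that reading. right_pos_sketch is a hypothetical exact-match stand-in, not the library function, and it ignores the fuzzy-match parameter e.

```python
def right_pos_sketch(text, label=None, label_any=None,
                     ignorecase=False, default=None):
    """Hypothetical exact-match stand-in for right_pos (inclusive index)."""
    candidates = [label] if label is not None else list(label_any or [])
    haystack = text.lower() if ignorecase else text
    for candidate in candidates:
        needle = candidate.lower() if ignorecase else candidate
        index = haystack.find(needle)
        if needle and index != -1:
            # Last character of the match, not one past it.
            return index + len(needle) - 1
    return default


text = 'hello! whole wide world'
right = right_pos_sketch(text, 'whole')
left = text.find('whole')
print(right)                 # 11
print(text[left:right + 1])  # whole
```

Because the index is inclusive, slicing the label back out of the text needs right + 1 as the slice end.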

scan

scan(text, starts_after=None, starts_after_any=None, ends_before=None, ends_before_any=None, left_pos=None, right_pos=None, num_lines=None, e=0, ignorecase=false)

Returns a region of text that matches the bounding criteria.

Args:
    text (str): original text
    starts_after (str, optional): marks where the region starts. Only text
        following starts_after is considered. Defaults to beginning of the
        original text.
    starts_after_any (List<str>, optional): will search for each string in order
        and use the first one that matches as starts_after.
    ends_before (str, optional): marks where the region ends. Only text before
        ends_before is considered. Defaults to end of the original text.
    ends_before_any (List<str>, optional): will search for each string in order
        and use the first one that matches as ends_before.
    left_pos (int, optional): left position index.
    right_pos (int, optional): right position index.
    num_lines (int, optional): number of lines to consider from starts_after;
        defaults to all the lines.
    e (int, optional): number of errors allowed in the match.
    ignorecase (bool, optional): Whether casing should be ignored. By default false.

Returns:
    returns a region of text that matches the bounding criteria.

Examples:
    scan(INPUT_COL, starts_after='Net Pay')
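
To make the bounding criteria concrete, here is a minimal sketch of how starts_after, ends_before, and num_lines could combine. scan_sketch is a hypothetical stand-in, not the library function. It assumes exact matching, returns None when starts_after is missing, runs to the end of the text when ends_before is missing, and leaves out left_pos, right_pos, e, and ignorecase.

```python
def scan_sketch(text, starts_after=None, ends_before=None, num_lines=None):
    """Hypothetical stand-in for scan using exact string matching."""
    start = 0
    if starts_after is not None:
        found = text.find(starts_after)
        if found == -1:
            return None  # assumption: no region without the opening bound
        start = found + len(starts_after)
    end = len(text)
    if ends_before is not None:
        found = text.find(ends_before, start)
        if found != -1:
            end = found
    region = text[start:end]
    if num_lines is not None:
        # Keep only the first num_lines lines of the region.
        region = '\n'.join(region.split('\n')[:num_lines])
    return region


PAYSTUB = 'Gross Pay 1,200.00\nNet Pay 950.00\nYTD 4,000.00\nEmployer ACME'
print(repr(scan_sketch(PAYSTUB, starts_after='Net Pay', ends_before='Employer')))
# ' 950.00\nYTD 4,000.00\n'
print(repr(scan_sketch(PAYSTUB, starts_after='Net Pay', num_lines=1)))
# ' 950.00'
```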

scan_below

scan_below(text, label=None, label_any=None, left_pos=None, right_pos=None, left_pad=None, right_pad=None, ends_before=None, ends_before_any=None, num_lines=None, e=0, ignorecase=false)

Returns value below the label, with provided padding on each side

Args:
    text (str): original text.
    label (str, optional): string used for determining position.
    label_any (List<str>, optional): will search for each label in order and return
        the position of the first matching label.
    left_pos (int, optional): left position index.
    right_pos (int, optional): right position index.
    left_pad (int, optional): extends the left position index towards left.
    right_pad (int, optional): extends the right position index towards right.
    ends_before (str, optional): narrows original text, only text before
        ends_before string will be used for searching the label. Defaults to
        end of the original text.
    ends_before_any (List<str>, optional): will search for each string in order and
        use the first one that matches as ends_before.
    num_lines (int, optional): number of lines to consider below the label;
        defaults to all the lines.
    e (int, optional): number of errors allowed in the match.
    ignorecase (bool, optional): Whether casing should be ignored. By default false.

Returns:
    returns value below the label.

Examples:
    scan_below(INPUT_COL, 'NET PAY', num_lines=1, left_pad=3, right_pad=5)
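
The effect of left_pad and right_pad is easiest to see on a fixed-width table. The sketch below assumes that the label's column span, widened by the padding, is sliced out of each line below the label. scan_below_sketch is a hypothetical stand-in, not the library function, and the table data is made up for illustration.

```python
def scan_below_sketch(text, label, left_pad=0, right_pad=0, num_lines=None):
    """Hypothetical stand-in: slice the label's columns out of the lines below it."""
    lines = text.split('\n')
    for row, line in enumerate(lines):
        col = line.find(label)
        if col == -1:
            continue
        left = max(0, col - left_pad)
        right = col + len(label) + right_pad  # exclusive slice end
        below = lines[row + 1:]
        if num_lines is not None:
            below = below[:num_lines]
        return '\n'.join(entry[left:right] for entry in below)
    return None


# Illustrative fixed-width table: columns start at 0, 14, and 27.
rows = [('GROSS PAY', 'NET PAY', 'TAX'),
        ('1,200.00', '950.00', '250.00'),
        ('1,300.00', '1,010.00', '290.00')]
TABLE = '\n'.join(a.ljust(14) + b.ljust(13) + c for a, b, c in rows)

print(scan_below_sketch(TABLE, 'NET PAY', num_lines=1).strip())
# 950.00
print([v.strip() for v in scan_below_sketch(TABLE, 'NET PAY').split('\n')])
# ['950.00', '1,010.0']   the second value is wider than the label and gets cut
print([v.strip() for v in scan_below_sketch(TABLE, 'NET PAY', right_pad=2).split('\n')])
# ['950.00', '1,010.00']
```

In this reading, padding matters whenever a value is wider than the label above it.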

scan_below_repeated

scan_below_repeated(text, label=None, label_any=None, left_pos=None, right_pos=None, left_pad=None, right_pad=None, ends_before=None, ends_before_any=None, num_lines=None, e=0, ignorecase=false, max_scans=10000)

Finds a list of matches by running scan_below() repeatedly on the remaining text after each match.

Args:
    text (str): original text.
    label (str, optional): string used for determining position.
    label_any (List<str>, optional): will search for each label in order and return
        the position of the first matching label.
    left_pad (int, optional): extends the left position index towards left.
    right_pad (int, optional): extends the right position index towards right.
    ends_before (str, optional): narrows original text, only text before
        ends_before string will be used for searching the label. Defaults to
        end of the original text.
    ends_before_any (List<str>, optional): will search for each string in order and
        use the first one that matches as ends_before.
    left_pos (int, optional): left position index.
    right_pos (int, optional): right position index.
    num_lines (int, optional): number of lines to consider below the label;
        defaults to all the lines.
    e (int, optional): number of errors allowed in the match.
    ignorecase (bool, optional): Whether casing should be ignored. By default false.
    max_scans (int, optional): maximum number of results that should be populated.  If the
        value is not set, or is less than 0, we default to 10000.

Returns:
    a list of matches by running scan_below() repeatedly on the remaining text after each match.

Examples:
    scan_below_repeated(INPUT_COL, 'NET PAY', num_lines=1, left_pad=3, right_pad=5)

scan_box

scan_box(text, label=None, label_any=None, starts_after=None, starts_after_any=None, ends_before=None, ends_before_any=None, left_pos=None, right_pos=None, e=0, ignorecase=false, pixel_tolerance=2, exclude_label_line=false)

Returns the contents of the box containing the search term.

Requires provenance tracking and line detection used in Process Files.

Args:
  text (str): The text to search in.
  label (str): Search term contained in a visual box with the content to extract.
  label_any (List<str>, optional): will search for each label in order and
      return the position of the first matching label.
  starts_after (str, optional): Narrows search space for finding the label.
      Only text following starts_after will be used for searching the label.
      Defaults to beginning of the original text.
  starts_after_any (List<str>, optional): Will search for each string in order and
      use the first one that matches as starts_after.
  ends_before (str, optional): Narrows original text, only text before
      ends_before string will be used for searching the label. Defaults
      to end of the original text.
  ends_before_any (List<str>, optional): Will search for each string in order and
      use the first one that matches as ends_before.
  left_pos (int, optional): Left position index.
  right_pos (int, optional): Right position index.
  e (int, optional): Number of errors allowed in the match. By default 0.
  ignorecase (bool, optional): Whether casing should be ignored. By default false.
  pixel_tolerance (int, optional): Sets a number of pixels that a word can
      be past the side of a rectangle it is contained in. By default 2.
  exclude_label_line (bool, optional): If set to true, will remove the label
      used to find the rectangle from the output.

Returns:
    Returns content of the visual box containing the search term.

scan_line

scan_line(text, label=None, label_any=None, starts_after=None, starts_after_any=None, ends_before=None, ends_before_any=None, left_pos=None, right_pos=None, e=0, ignorecase=false)

Returns the line that has the label bounded by left_pos and right_pos params

Args:
    text (str): original text
    label (str, optional): string used for determining position
    label_any (List<str>, optional): will search for each label in order and return
        the position of the first matching label.
    starts_after (str, optional): narrows search space for finding the label.
        Only text following starts_after will be used for searching the label.
        Defaults to beginning of the original text.
    starts_after_any (List<str>, optional): will search for each string in order and
        use the first one that matches as starts_after.
    ends_before (str, optional): narrows original text, only text before
        ends_before string will be used for searching the label. Defaults to
        end of the original text.
    ends_before_any (List<str>, optional): will search for each string in order and
        use the first one that matches as ends_before.
    left_pos (int, optional): left position index.
    right_pos (int, optional): right position index.
    e (int, optional): number of errors allowed in the match.
    ignorecase (bool, optional): Whether casing should be ignored. By default false.

Returns:
    returns the line that contains the label, bounded by left_pos and right_pos.

Examples:
    scan_line(INPUT_COL, 'Tax Year', right_pos=left_pos(INPUT_COL, 'Tax Year'))
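
As a rough illustration of "the line that has the label bounded by left_pos and right_pos", the sketch below returns the first line containing the label and trims it to the given columns. scan_line_sketch is a hypothetical stand-in, not the library function. It assumes exact matching and treats right_pos as inclusive, in line with the right_pos() examples earlier in this section. The sample text is made up.

```python
def scan_line_sketch(text, label, left_pos=None, right_pos=None):
    """Hypothetical stand-in: first line containing label, trimmed to columns."""
    for line in text.split('\n'):
        if label in line:
            start = 0 if left_pos is None else left_pos
            # right_pos is assumed inclusive, so the slice ends one past it.
            stop = len(line) if right_pos is None else right_pos + 1
            return line[start:stop]
    return None


FORM = 'Form 1040\nTax Year 2023   Filing Status Single\nName Jane Doe'
print(scan_line_sketch(FORM, 'Tax Year'))
# Tax Year 2023   Filing Status Single
print(scan_line_sketch(FORM, 'Tax Year', right_pos=12))
# Tax Year 2023
print(scan_line_sketch(FORM, 'Tax Year', left_pos=9, right_pos=12))
# 2023
```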

scan_line_repeated

scan_line_repeated(text, label=None, label_any=None, starts_after=None, starts_after_any=None, ends_before=None, ends_before_any=None, left_pos=None, right_pos=None, e=0, ignorecase=false, max_scans=10000)

Finds a list of matches by running scan_line() repeatedly on the remaining text after each match.

Args:
    text (str): original text
    label (str, optional): string used for determining position
    label_any (List<str>, optional): will search for each label in order and return
        the position of the first matching label.
    starts_after (str, optional): narrows search space for finding the label.
        Only text following starts_after will be used for searching the label.
        Defaults to beginning of the original text.
    starts_after_any (List<str>, optional): will search for each string in order and
        use the first one that matches as starts_after.
    ends_before (str, optional): narrows original text, only text before
        ends_before string will be used for searching the label. Defaults to
        end of the original text.
    ends_before_any (List<str>, optional): will search for each string in order and
        use the first one that matches as ends_before.
    left_pos (int, optional): left position index.
    right_pos (int, optional): right position index.
    e (int, optional): number of errors allowed in the match.
    ignorecase (bool, optional): Whether casing should be ignored. By default false.
    max_scans (int, optional): The maximum number of times we'll repeat the scan_line()
        function.  If the value is not set, or is < 0, we default to 10000.

Returns:
    a list of matches by running scan_line() repeatedly on the remaining text after each match.

Examples:
    scan_line_repeated(INPUT_COL, 'Tax Year')
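
The "repeat on the remaining text" behavior shared by the *_repeated functions can be sketched as a loop that restarts the search just after each match. scan_line_repeated_sketch is a hypothetical stand-in, not the library function. It uses exact matching, returns whole lines, and applies the documented fallback to 10000 when max_scans is unset or negative. The sample text is made up.

```python
def scan_line_repeated_sketch(text, label, max_scans=10000):
    """Hypothetical stand-in: collect every line containing label, in order."""
    if max_scans is None or max_scans < 0:
        max_scans = 10000  # documented fallback
    matches = []
    remaining = text.split('\n')
    while remaining and len(matches) < max_scans:
        for i, line in enumerate(remaining):
            if label in line:
                matches.append(line)
                # Continue the search on the text after this match.
                remaining = remaining[i + 1:]
                break
        else:
            break  # no further match in the remaining text
    return matches


RETURNS = 'Tax Year 2021\nIncome 10\nTax Year 2022\nIncome 12\nTax Year 2023'
print(scan_line_repeated_sketch(RETURNS, 'Tax Year'))
# ['Tax Year 2021', 'Tax Year 2022', 'Tax Year 2023']
print(scan_line_repeated_sketch(RETURNS, 'Tax Year', max_scans=2))
# ['Tax Year 2021', 'Tax Year 2022']
```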

scan_near

scan_near(text, label, target, max_distance=10, include_info=false, direction=None, max_distance_x=None, max_distance_y=None)

Finds a label within a piece of text, and returns desired targets around found labels.

  See the `regex()` and `token_matcher()` functions to use regexes and special tokens
  as labels and targets. To scan from a specific extraction, pass a provenance-tracked
  value as the label or target. A string literal that is not provenance-tracked is
  interpreted as a regex to search for.

  Args:
    text (str): The text to search in
    label (Union<str, List<str>, Regex, List<Regex>, Matcher, List<Matcher>>): A previously computed value, regex, list of regexes, or a token matcher to search for within the given text
    target (Union<str, List<str>, Regex, List<Regex>, Matcher, List<Matcher>>): A previously computed value, regex, list of regexes, or a token matcher to search for within the found region
    max_distance (float, optional): The maximum distance allowed between the label and target. Distance is the minimum distance from one piece of text to another, with columns and lines each being a unit distance of 1.
                  For instance, [label][target] would be a distance of 0, while [label] [target] is a distance of 1,
                  and the following example is a distance of 1 (indicated by the pipe):

                  [label]
                      |
                      [target]

    include_info (bool, optional): If false, only returns the list of found targets. Otherwise, returns a dictionary including the label, target, locations, distance, and angle from label to target. Defaults to false.
    direction (str, optional): Can be 'left', 'right', 'above', or 'below', indicating which direction from the label to the target is prioritized. This weighted distance is included as 'heuristic_distance' when include_info is true. Defaults to None.
    max_distance_x (float, optional): The maximum distance allowed between the label and target in the x direction (see above for info on how this is computed). If not None, `max_distance` is ignored. Defaults to None.
                                      max_distance_y must also be set, or this will be ignored.
    max_distance_y (float, optional): The maximum distance allowed between the label and target in the y direction (see above for info on how this is computed). If not None, `max_distance` is ignored. Defaults to None.
                                      max_distance_x must also be set, or this will be ignored.

  Returns:
    Desired targets around found labels
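
  The three distance cases given for max_distance (0, 1, and 1) are consistent with counting the empty columns and empty lines between the two pieces of text. The sketch below encodes that reading. It is an assumption, not the library's formula: Span, gap, and distance_sketch are hypothetical names, and combining the two gaps with a Euclidean norm is a guess that happens to reproduce the documented cases.

```python
import math
from typing import NamedTuple


class Span(NamedTuple):
    line: int   # 0-based line index
    start: int  # first column, inclusive
    end: int    # last column, inclusive


def gap(a_start, a_end, b_start, b_end):
    """Empty cells between two inclusive ranges; 0 if they touch or overlap."""
    if b_start > a_end:
        return b_start - a_end - 1
    if a_start > b_end:
        return a_start - b_end - 1
    return 0


def distance_sketch(label: Span, target: Span) -> float:
    dx = gap(label.start, label.end, target.start, target.end)
    dy = gap(label.line, label.line, target.line, target.line)
    return math.hypot(dx, dy)  # assumed combination of the two gaps


label = Span(line=0, start=0, end=6)                # "[label]" is 7 columns wide
print(distance_sketch(label, Span(0, 7, 14)))       # [label][target]   -> 0.0
print(distance_sketch(label, Span(0, 8, 15)))       # [label] [target]  -> 1.0
print(distance_sketch(label, Span(2, 4, 11)))       # one empty line between -> 1.0
```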

scan_repeated

scan_repeated(text, starts_after=None, starts_after_any=None, ends_before=None, ends_before_any=None, left_pos=None, right_pos=None, num_lines=None, e=0, ignorecase=false, max_scans=10000)

Finds a list of matches by running scan() repeatedly on the remaining text after each match.

Args:
    text (str): original text
    starts_after (str, optional): marks where the region starts. Only text
        following starts_after is considered. Defaults to beginning of the
        original text.
    starts_after_any (List<str>, optional): will search for each string in order
        and use the first one that matches as starts_after.
    ends_before (str, optional): marks where the region ends. Only text before
        ends_before is considered. Defaults to end of the original text.
    ends_before_any (List<str>, optional): will search for each string in order
        and use the first one that matches as ends_before.
    left_pos (int, optional): left position index.
    right_pos (int, optional): right position index.
    num_lines (int, optional): number of lines to consider from starts_after;
        defaults to all the lines.
    e (int, optional): number of errors allowed in the match.
    ignorecase (bool, optional): Whether casing should be ignored. By default false.
    max_scans (int, optional): the maximum number of repeated scans performed. Defaults to 10000.

Returns:
    a list of matches by running scan() repeatedly on the remaining text after each match.

Examples:
    scan_repeated(INPUT_COL, starts_after='Net Pay')

scan_right

scan_right(text, label=None, label_any=None, ends_before=None, ends_before_any=None, left_pos=None, right_pos=None, e=0, ignorecase=false)

Finds value for a label by scanning right of the label in the same line

Args:
    text (str): original text
    label (str, optional): string used for determining position
    label_any (List<str>, optional): will search for each label in order and return
        the position of the first matching label.
    ends_before (str, optional): narrows original text, only text before
        ends_before string will be used for searching the label. Defaults to
        end of the original text.
    ends_before_any (List<str>, optional): will search for each string in order and
        use the first one that matches as ends_before.
    left_pos (int, optional): left position index.
    right_pos (int, optional): right position index.
    e (int, optional): number of errors allowed in the match.
    ignorecase (bool, optional): Whether casing should be ignored. By default false.

Returns:
    returns value for a label by scanning right of the label in the same line

Examples:
    scan_right(INPUT_COL, 'NET PAY')
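
A minimal sketch of "scanning right of the label in the same line": find the first line that contains the label and keep what follows it. scan_right_sketch is a hypothetical stand-in, not the library function. It assumes exact matching, strips surrounding whitespace from the result (an assumption), and leaves out ends_before, left_pos, right_pos, e, and ignorecase. The sample text is made up.

```python
def scan_right_sketch(text, label):
    """Hypothetical stand-in: text to the right of label on the label's line."""
    for line in text.split('\n'):
        col = line.find(label)
        if col == -1:
            continue
        # Everything after the label, with surrounding whitespace removed.
        return line[col + len(label):].strip()
    return None


PAYSTUB = 'GROSS PAY  1,200.00\nNET PAY    950.00\nTAX        250.00'
print(scan_right_sketch(PAYSTUB, 'NET PAY'))    # 950.00
print(scan_right_sketch(PAYSTUB, 'GROSS PAY'))  # 1,200.00
print(scan_right_sketch(PAYSTUB, 'BONUS'))      # None
```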

scan_right_repeated

scan_right_repeated(text, label=None, label_any=None, ends_before=None, ends_before_any=None, left_pos=None, right_pos=None, e=0, ignorecase=false, max_scans=10000)

Finds a list of matches by running scan_right() repeatedly on the remaining text after each match

Args:
    text (str): original text
    label (str, optional): string used for determining position
    label_any (List<str>, optional): will search for each label in order and return
        the position of the first matching label.
    ends_before (str, optional): narrows original text, only text before
        ends_before string will be used for searching the label. Defaults to
        end of the original text.
    ends_before_any (List<str>, optional): will search for each string in order and
        use the first one that matches as ends_before.
    left_pos (int, optional): left position index.
    right_pos (int, optional): right position index.
    e (int, optional): number of errors allowed in the match.
    ignorecase (bool, optional): Whether casing should be ignored. By default false.
    max_scans (int, optional): maximum number of results that should be populated.  If the
        value is not set, or is less than 0, we default to 10000.

Returns:
    a list of matches by running scan_right() repeatedly on the remaining text after each match.

Examples:
    scan_right_repeated(INPUT_COL, 'NET PAY')