TokenMatchers and Tokenizers (Legacy)
You can use TokenMatchers and Tokenizers with some Refiner functions:
- Tokenizers break text into multiple pieces.
- TokenMatchers score and clean each piece of text using knowledge of the semantic category that it belongs to.
For example, without context, it is difficult for a computer to interpret the value `1o Novembr 200B`, but knowing that this value is supposed to be a date changes everything: it is clearly 10 November 2008.
TokenMatcher and Tokenizer usage
TokenMatchers and Tokenizers can be used with the following Refiner functions (documented in the Refiner/Sheet Functions::NLP Functions section):
- `nlp_token_clean` cleans a piece of text.
- `nlp_token_score` provides a semantic validation score, from 0 to 1.0.
- `nlp_token_find` finds the best scoring token in a string.
- `nlp_token_find_all` finds all the tokens in a string with a score higher than 0.8.
- `nlp_token_select` finds the best scoring string among its argument list.
Each of these Refiner functions takes a `model_config` argument, which is a dictionary of parameters that can be used by your chosen TokenMatcher. The `nlp_token_find` and `nlp_token_find_all` functions also take `tokenizer` and `tokenizer_config` arguments that define how text should be broken up before each token is scored (the default is unigrams, that is, splitting text at whitespace).
At minimum, these `nlp_token` functions require the `model` argument. Most TokenMatchers are automatically paired with a default Tokenizer, usually the recommended Tokenizer for that TokenMatcher.
Here are a few examples of how these functions can be used:
```
nlp_token_find('Due on 20IB-0A-1B',
               model='only-digits',
               separator=' ')  # -> '20IB-0A-1B'
```
Here is an example of how to pass a file path to the Dataset TokenMatcher:
```
nlp_token_find('Your input string',
               model='dataset',
               tokenizer_config=map_create([['dataset', 'your_dataset.csv']]))
```
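Assuming the same value-first calling convention as `nlp_token_find` (this is a sketch, and the score shown is illustrative rather than an exact value), the scoring function can be used to validate a candidate string against a semantic category:

```
# Sketch: score how likely the string is a US social security number.
# The exact score depends on the matcher implementation.
nlp_token_score('444-12-2812', model='social-security-number')  # -> a score close to 1.0
```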
Available TokenMatchers
The following token matchers can be used within Refiner formulas and UDFs.
Category | Identifier | Matcher Name | Functionality | Description | Find Example | Notes |
---|---|---|---|---|---|---|
Geography | iso-3-country-code | ISO 3 Country Code | score, clean | Matches tokens which conform to the ISO 3 country code standard. | "sdf usa foo" → "usa" | |
Geography | us-state | US State | score, clean, generate, generate_obvious | Matches any state name or abbreviation. | "Heading home to Virginia" → "Virginia" | Prioritizes state names over abbreviations, and upper case over lower case. |
Geography | us-address | Address (USA) | score | Scores tokens with respect to how likely they are to be a US-based address. | (Score example) "123 Something Street\nTownsville, CA 111111" → 0.9; "Hello, world" → 0.1 | It is recommended that this is used with the tokenizer:us-address tokenizer. |
Geography | global-address-word | Geography - Address Word (Global) | score | Matches words that are commonly used within local addresses. | "I live on Main Street" → "Street" | Must be used with the tokenizer:global-address-word tokenizer. |
Geography | global-city | Geography - City (Global) | score | Matches various cities around the world, in English spelling. | "I live in San Francisco, CA" → "San Francisco" | Must be used with the tokenizer:global-city tokenizer. |
Geography | global-state | Geography - State (Global) | score | Matches various state and province names around the world, in English spelling. | "I live in Bengaluru, Karnataka" → "Karnataka" | Must be used with the tokenizer:global-state tokenizer. |
Geography | global-zipcode | Geography - Zipcode (Global) | score | Matches various zipcodes used by different countries and territories. | "I went to Douglas, Isle of Man IM1 1EG to watch the race." → "IM1 1EG" | Must be used with the tokenizer:global-zipcode tokenizer. |
Geography | country | Geography - Country | score | Matches country names and some country name abbreviations. | "I grew up in the United States." → "United States" | Must be used with the tokenizer:country tokenizer. |
Geography | global-address | Geography - Address (Global) | score | Matches addresses for various regions. | "My address:\n123 Some Street, London UK" → "123 Some Street, London UK" | Must be used with the tokenizer:global-address tokenizer. |
ID Numbers | passport-number | Passport Number | score, clean, generate, generate_obvious | Matches passport numbers. | "sdf 12 122309df fd" → "12" | Mostly just matches a sequence of digits. |
ID Numbers | social-security-number | Social Security Number | score, clean, generate, generate_obvious | Matches US social security numbers. | "sdf 444-12-2812" → "444-12-2812" | Can handle when digits get OCR'd as letters. |
Numbers | credit-card-number | Credit Card Number | score, clean, generate, generate_obvious | Matches the major credit card types. | "sdf 4444125562813212" → "4444125562813212" | Missing numbers and format mismatches will cause a score of 0. |
Core | floating-point-us | Floating Point (US) | score, clean | Matches numbers of the form ###,###,###.##. | "Ted 1.0 2.0 foo" → "1.0" | Will pick the first in the case of ties. Score is lower but not 0 when letters are found (ex: B may be an 8). |
Core | positive-integer | Positive integer | score, clean | Matches numbers of the form ##### (no decimal points). | "Ted 1.0 10 foo" → "10" | |
Core | only-az | Only AZ | score, clean | Matches only the letters A-Za-z. | "Ted 1.0 10 foo" → "Ted" | |
Core | only-digits | Only Digits | score, clean | Matches only digits. | "Ted 1.0 10 foo" → "10" | Can include decimal points and letters that look like numbers, but prioritizes pure digits. |
Core | passport-field-sex | Gender (M/F) | score, clean | Matches only the letters M and F. | "Z M F" → "M" | Picks the first item when there are ties. |
Core | mrz | MRZ | score, clean | Matches a string containing a machine-readable zone (Type 1 and Type 3). | | |
Names | person-name | Person - Name | score, clean, generate, generate_obvious | Matches person names. | "Barack Obama was born on August 4, 1961" → "Barack Obama" | |
Time | dd-month-yyyy | DD-Month-YYYY | score, clean | Matches DD-Month-YYYY. | "The date is 13-Jan-2019" → "13-Jan-2019" | Finds the month using the matcher:month-name TokenMatcher. |
Time | month-name | Month Name | score, clean | Matches English-language month names and their abbreviations. | "I'll see you in March" → "March" | |
Time | passport-field-date | US Passport Date | score, clean | Matches the date format that appears in passport fields: ## MonthName #### | (Clean example) "fxx12 March 2018" → "12 MARCH 2018" | |
Time | passport-mrz-date | US Passport Date (MRZ) | score, clean | Matches the date format that appears in passport MRZ: YYMMDD | "sdf 122309 fd" → "122309" | |
Time | date | Date | score, clean, generate, generate_obvious | Matches dates of a variety of different formats. | "Some text on 11-JUN-2018" → "11-JUN-2018" | By default uses the date-regex tokenizer. |
Currency | currency | Currency | score | Matches currency and amount pairs, both currency-first and amount-first. | "Total transaction amount USD 1,447,329.11" → "USD 1,447,329.11" | By default uses the currency tokenizer. Must be used with this tokenizer. |
Dataset | dataset | Dataset | score, clean | Returns matches from a list of terms provided as a filepath in tokenizer_config. | Dataset: ['UK', 'US', 'EU'], "An EU regulation is a legal act which applies directly at the national level." → "EU" | Must be used with the dataset tokenizer. |
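For example, because the currency matcher is paired with the currency tokenizer by default, a Refiner formula can call it with only the `model` argument. A minimal sketch based on the Find Example in the table above:

```
# Sketch: the default currency tokenizer is applied automatically.
nlp_token_find('Total transaction amount USD 1,447,329.11',
               model='currency')  # -> 'USD 1,447,329.11'
```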
Available Tokenizers
The TokenMatcher framework allows for different ways to break up text before searching for tokens. The following tokenizers can be used. (The default for most operations is Unigram).
Identifier | Tokenizer Name | Description | Example | Notes |
---|---|---|---|---|
unigram | Unigram | Splits input text into single-token pieces based on a split regex. | "Tokenize this text!" → ["Tokenize", "this", "text!"] | Equivalent to Python's split method. |
bigram | Bigram | Creates bigrams by splitting input text into two-token pieces based on a split regex. | "Tokenize this text!" → ["Tokenize this", "this text!"] | |
n-gram | n-gram | Breaks up text into single tokens, then returns a list generated using a sliding window of size n across the entire list of single tokens. n defaults to 1. | "a b c d e f g", ngram_size=3 → ["a b c", "b c d", "c d e", "d e f", "e f g"] | |
n-gram-range | n-gram range | Equivalent to tokenizer:n-gram, but for a range of n. The range defaults to 1:T+1, where T is the total number of single tokens created from this text. The upper bound is exclusive. | "a b c d", ngram_range="1:3" → ["a", "b", "c", "d", "a b", "b c", "c d"] | |
us-address | Address (USA) | Returns a list of tokens from an input text that are deemed to be likely US addresses. | "Welcome to Instabase\n123 Street Ave.\nCity, ST 123456" → ["123 Street Ave.\nCity, ST 123456"] | Is not guaranteed to return a list that covers the entire input text. |
global-address-word | Geography - Address Word (Global) | Returns a list of tokens from an input text that are deemed to be likely local address clarifiers, such as "street" or "road". | "I live at Main Street near Instabase Square" → ["Street", "Square"] | Current state is experimental. |
global-city | Geography - City (Global) | Returns a list of tokens from an input text that are deemed to be likely city names. | "I grew up in Bristol, UK but moved to Boston, MA" → ["Bristol", "Boston"] | Current state is experimental. |
global-state | Geography - State (Global) | Returns a list of tokens from an input text that are deemed to be likely state names. | "I've been to New York, CT, and ME, and now I am on my way to Karnataka." → ["New York", "CT", "ME", "Karnataka"] | Current state is experimental. |
global-zipcode | Geography - Zipcode (Global) | Returns a list of tokens from an input text that are deemed to be likely zipcodes. | "I visited Boston, MA with a zip code of 02116 and Douglas, Isle of Man IM1 1EG." → ["02116", "IM1 1EG"] | Current state is experimental. |
country | Geography - Country | Returns a list of tokens from an input text that are deemed to be likely country names. | "I grew up in the U.K., but moved to the United States and then Canada." → ["U.K.", "United States", "Canada"] | Current state is experimental. |
global-address | Geography - Address (Global) | Returns a list of tokens from an input text that are deemed to be likely addresses from various regions of the world. | "My address:\n123 Some Street, London UK" → ["123 Some Street, London UK"] | Current state is experimental. |
regex | Basic - Regex | Returns matches from the provided regex. | Regex used: r'\b[A-Za-z]+\b', "Jun 1992 2019-01-01 1st january 2018" → ["Jun", "january"] | If no regex is provided, returns an empty list. Is not guaranteed to return output if there are no matches in the input text. |
date-regex | Regex - Date | Returns matches from a preset date regex. | "Jun 1992 2019-01-01 1st january 2018" → ["Jun 1992", "2019-01-01", "1st january 2018"] | Is not guaranteed to return output if there are no matches in the input text. |
currency-regex | Basic - Currency | Returns matches from a preset currency regex. Only matches amounts in US-style formatting. | "USD 199.99 USD 1,447,329.11 EUR 1.447,98" → ["USD 199.99", "USD 1,447,329.11"] | Is not guaranteed to return output if there are no matches in the input text. |
dataset | Dataset | Returns matches from a list of terms provided as a filepath in tokenizer_config. | Dataset: ['UK', 'US', 'EU'], "An EU regulation is a legal act which applies directly at the national level." → "EU" | Is not guaranteed to return output if there are no matches in the input text, or if no filepath has been passed to the Tokenizer. |
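Any of these tokenizers can be selected explicitly through the `tokenizer` argument. As a sketch, this call makes the `date` matcher's default pairing with the `date-regex` tokenizer explicit (the result mirrors the Find Example in the matcher table above):

```
# Sketch: explicitly selecting the date-regex tokenizer,
# which is already the default for the 'date' matcher.
nlp_token_find('Some text on 11-JUN-2018',
               model='date',
               tokenizer='date-regex')  # -> '11-JUN-2018'
```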
Creating custom TokenMatchers and Tokenizers
Creating custom TokenMatchers and Tokenizers is similar to the process of registering custom Refiner formulas. Import the `TokenMatcher` and `Tokenizer` interfaces, and create subclasses that implement the required functionality. The custom classes are then registered by defining `register_token_matchers` and `register_tokenizers` functions. See the example below:
```python
from instabase.model_utils.tokenizer import Tokenizer
from instabase.model_utils.token_matcher import TokenMatcher
from instabase.provenance.tracking import Value

class MyCustomTokenizer(Tokenizer):
  TOKEN_TYPE = u'my-custom-tokenizer'
  HUMAN_NAME = u'Custom - From a UDF'

  def tokenize(self, text, **kwargs):
    # type: (Value[Text]) -> List[Value[Text]]
    # Just return the first word
    return [Value(text.value().split(' ')[0])]

class MyCustomTokenMatcher(TokenMatcher):
  MATCHER_TYPE = u'my-custom-token-matcher'
  HUMAN_NAME = u'Custom - From a UDF'
  DEFAULT_TOKENIZER = u'my-custom-tokenizer'

  def score(self, token, **kwargs):
    # type: (Value[Text], **Any) -> float
    # For this example, give all tokens the same score
    return 0.8

def register_token_matchers(name_to_fn):
  name_to_fn.update({
    'MyCustomTokenMatcher': {
      'class': MyCustomTokenMatcher
    }
  })

def register_tokenizers(name_to_fn):
  name_to_fn.update({
    'MyCustomTokenizer': {
      'class': MyCustomTokenizer
    }
  })
```
The custom tokenizer takes in a `Value` object of type `Text` and returns a list of `Value` objects. Once registered, the custom matcher and tokenizer can be used in a Refiner formula:
```
nlp_token_find('Some example text',
               model='my-custom-token-matcher',
               tokenizer='my-custom-tokenizer')
```
or within a UDF using REFINER_FNS:
```python
scan_result, err = REFINER_FNS.call('nlp_token_find_all',
                                    val,
                                    model='my-custom-token-matcher',
                                    tokenizer='my-custom-tokenizer',
                                    threshold=0.7,
                                    **kwargs)
```
Make sure to include the `**kwargs` parameter that comes from the calling UDF function. The `**kwargs` parameter includes important context for running TokenMatchers.
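As a minimal sketch (the wrapper function name and the error-handling convention here are illustrative assumptions, not a prescribed API), a UDF might forward its `**kwargs` like this:

```python
# Hypothetical UDF body: forwards **kwargs so the TokenMatcher
# receives the execution context supplied by the caller.
def find_all_custom_tokens(val, **kwargs):
  scan_result, err = REFINER_FNS.call('nlp_token_find_all',
                                      val,
                                      model='my-custom-token-matcher',
                                      tokenizer='my-custom-tokenizer',
                                      threshold=0.7,
                                      **kwargs)
  if err:
    return None, err
  return scan_result, None
```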