TokenMatchers and Tokenizers (Legacy)
You can use TokenMatchers and Tokenizers with some Refiner functions:
- Tokenizers break text into multiple pieces.
- TokenMatchers score and clean each piece of text using knowledge of the semantic category that it belongs to.
For example, without context, it is difficult for a computer to interpret the value `1o Novembr 200B`, but knowing that this value is supposed to be a date changes everything: it is clearly 10 November 2008.
TokenMatcher and Tokenizer usage
TokenMatchers and Tokenizers can be used with the following Refiner functions (documented in the Refiner/Sheet Functions::NLP Functions section):
- `nlp_token_clean` cleans a piece of text.
- `nlp_token_score` provides a semantic validation score, from 0 to 1.0.
- `nlp_token_find` finds the best scoring token in a string.
- `nlp_token_find_all` finds all the tokens in a string with a score higher than 0.8.
- `nlp_token_select` finds the best scoring string among its argument list.
Each of these Refiner functions takes a `model_config` argument, which is a dictionary of parameters that can be used by your chosen TokenMatcher. The `nlp_token_find` and `nlp_token_find_all` functions also take `tokenizer` and `tokenizer_config` arguments that define how text should be broken up before each token is scored (the default is unigrams, that is, splitting text at whitespace).
At minimum, these `nlp_token` functions require the `model` argument. Most TokenMatchers are automatically paired with a default Tokenizer, usually the recommended Tokenizer for that TokenMatcher.
Here are a few examples of how these functions can be used:
```
nlp_token_find('Due on 20IB-0A-1B',
               model='only-digits',
               separator=' ')  # -> '20IB-0A-1B'
```
Here is an example of how to pass a file path to the Dataset TokenMatcher:
```
nlp_token_find('Your input string',
               model='dataset',
               tokenizer_config=map_create([['dataset', 'your_dataset.csv']]))
```
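Assuming the same value-first calling convention as `nlp_token_find` (this is a sketch, and the score shown is illustrative rather than an exact value), the scoring function can be used to validate a candidate string against a semantic category:

```
# Sketch: score how likely the string is a US social security number.
# The exact score depends on the matcher implementation.
nlp_token_score('444-12-2812', model='social-security-number')  # -> a score close to 1.0
```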
Available TokenMatchers
The following token matchers can be used within Refiner formulas and UDFs.
Category | Identifier | Matcher Name | Functionality | Description | Find Example | Notes |
---|---|---|---|---|---|---|
Geography | iso-3-country-code | ISO 3 Country Code | score, clean | Matches tokens which conform to the ISO 3 country code standard. | "sdf usa foo" → "usa" | |
Geography | us-state | US State | score, clean, generate, generate_obvious | Matches any state name or abbreviation. | "Heading home to Virginia" → "Virginia" | Prioritizes state names over abbreviations, and upper case over lower case. |
Geography | us-address | Address (USA) | score | Scores tokens with respect to how likely they are to be a US-based address. | (Score example) "123 Something Street\nTownsville, CA 111111" → 0.9; "Hello, world" → 0.1 | It is recommended that this is used with the tokenizer:us-address tokenizer. |
Geography | global-address-word | Geography - Address Word (Global) | score | Matches words that are commonly used within local addresses. | "I live on Main Street" → "Street" | Must be used with the tokenizer:global-address-word tokenizer. |
Geography | global-city | Geography - City (Global) | score | Matches various cities around the world, in English spelling. | "I live in San Francisco, CA" → "San Francisco" | Must be used with the tokenizer:global-city tokenizer. |
Geography | global-state | Geography - State (Global) | score | Matches various state and province names around the world, in English spelling. | "I live in Bengaluru, Karnataka" → "Karnataka" | Must be used with the tokenizer:global-state tokenizer. |
Geography | global-zipcode | Geography - Zipcode (Global) | score | Matches various zipcodes used by different countries and territories. | "I went to Douglas, Isle of Man IM1 1EG to watch the race." → "IM1 1EG" | Must be used with the tokenizer:global-zipcode tokenizer. |
Geography | country | Geography - Country | score | Matches country names and some country name abbreviations. | "I grew up in the United States." → "United States" | Must be used with the tokenizer:country tokenizer. |
Geography | global-address | Geography - Address (Global) | score | Matches addresses for various regions. | "My address:\n123 Some Street, London UK" → "123 Some Street, London UK" | Must be used with the tokenizer:global-address tokenizer. |
ID Numbers | passport-number | Passport Number | score, clean, generate, generate_obvious | Matches passport numbers. | "sdf 12 122309df fd" → "12" | Mostly just matches a sequence of digits. |
ID Numbers | social-security-number | Social Security Number | score, clean, generate, generate_obvious | Matches US social security numbers. | "sdf 444-12-2812" → "444-12-2812" | Can handle when digits get OCR'd as letters. |
Numbers | credit-card-number | Credit Card Number | score, clean, generate, generate_obvious | Matches the major credit card types. | "sdf 4444125562813212" → "4444125562813212" | Missing numbers and format mismatches will cause a score of 0. |
Core | floating-point-us | Floating Point (US) | score, clean | Matches numbers of the form ###,###,###.##. | "Ted 1.0 2.0 foo" → "1.0" | Will pick the first in the case of ties. Score is lower but not 0 when letters are found (ex: B may be an 8). |
Core | positive-integer | Positive integer | score, clean | Matches numbers of the form ##### (no decimal points). | "Ted 1.0 10 foo" → "10" | |
Core | only-az | Only AZ | score, clean | Matches only the letters A-Za-z. | "Ted 1.0 10 foo" → "Ted" | |
Core | only-digits | Only Digits | score, clean | Matches only digits. | "Ted 1.0 10 foo" → "10" | Can include decimal points and letters that look like numbers, but prioritizes pure digits. |
Core | passport-field-sex | Gender (M/F) | score, clean | Matches only the letters M and F. | "Z M F" → "M" | Picks the first item when there are ties. |
Core | mrz | MRZ | score, clean | Matches a string containing a machine-readable zone (Type 1 and Type 3). | | |
Names | person-name | Person - Name | score, clean, generate, generate_obvious | Matches person names. | "Barack Obama was born on August 4, 1961" → "Barack Obama" | |
Time | dd-month-yyyy | DD-Month-YYYY | score, clean | Matches DD-Month-YYYY. | "The date is 13-Jan-2019" → "13-Jan-2019" | Finds the month using the matcher:month-name TokenMatcher. |
Time | month-name | Month Name | score, clean | Matches English-language month names and their abbreviations. | "I'll see you in March" → "March" | |
Time | passport-field-date | US Passport Date | score, clean | Matches the date format that appears in passport fields: ## MonthName #### | (Clean example) "fxx12 March 2018" → "12 MARCH 2018" | |
Time | passport-mrz-date | US Passport Date (MRZ) | score, clean | Matches the date format that appears in passport MRZ: YYMMDD | "sdf 122309 fd" → "122309" | |
Time | date | Date | score, clean, generate, generate_obvious | Matches dates of a variety of different formats. | "Some text on 11-JUN-2018" → "11-JUN-2018" | By default uses the date-regex tokenizer. |
Currency | currency | Currency | score | Matches currency and amount pairs, both currency-first and amount-first. | "Total transaction amount USD 1,447,329.11" → "USD 1,447,329.11" | By default uses the currency tokenizer. Must be used with this tokenizer. |
Dataset | dataset | Dataset | score, clean | Returns matches from a list of terms provided as a filepath in tokenizer_config. | Dataset: ['UK', 'US', 'EU'], "An EU regulation is a legal act which applies directly at the national level." → "EU" | Must be used with the dataset tokenizer. |
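For example, because the currency matcher is paired with the currency tokenizer by default, a Refiner formula can call it with only the `model` argument. A minimal sketch based on the Find Example in the table above:

```
# Sketch: the default currency tokenizer is applied automatically.
nlp_token_find('Total transaction amount USD 1,447,329.11',
               model='currency')  # -> 'USD 1,447,329.11'
```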
Available Tokenizers
The TokenMatcher framework allows for different ways to break up text before searching for tokens. The following tokenizers can be used. (The default for most operations is Unigram).
Identifier | Tokenizer Name | Description | Example | Notes |
---|---|---|---|---|
unigram | Unigram | Splits input text into single-token pieces based on a split regex. | "Tokenize this text!" → ["Tokenize", "this", "text!"] | Equivalent to Python's split method. |
bigram | Bigram | Creates bigrams by splitting input text into two-token pieces based on a split regex. | "Tokenize this text!" → ["Tokenize this", "this text!"] | |
n-gram | n-gram | Breaks up text into single tokens, then returns a list generated using a sliding window of size n across the entire list of single tokens. n defaults to 1. | "a b c d e f g", ngram_size=3 → ["a b c", "b c d", "c d e", "d e f", "e f g"] | |
n-gram-range | n-gram range | Equivalent to tokenizer:n-gram, but for a range of n. The range defaults to 1:T+1, where T is the total number of single tokens created from this text. The upper bound is exclusive. | "a b c d", ngram_range="1:3" → ["a", "b", "c", "d", "a b", "b c", "c d"] | |
us-address | Address (USA) | Returns a list of tokens from an input text that are deemed to be likely US addresses. | "Welcome to Instabase\n123 Street Ave.\nCity, ST 123456" → ["123 Street Ave.\nCity, ST 123456"] | Is not guaranteed to return a list that covers the entire input text. |
global-address-word | Geography - Address Word (Global) | Returns a list of tokens from an input text that are deemed to be likely local address clarifiers, such as "street" or "road". | "I live at Main Street near Instabase Square" → ["Street", "Square"] | Current state is experimental. |
global-city | Geography - City (Global) | Returns a list of tokens from an input text that are deemed to be likely city names. | "I grew up in Bristol, UK but moved to Boston, MA" → ["Bristol", "Boston"] | Current state is experimental. |
global-state | Geography - State (Global) | Returns a list of tokens from an input text that are deemed to be likely state names. | "I've been to New York, CT, and ME, and now I am on my way to Karnataka." → ["New York", "CT", "ME", "Karnataka"] | Current state is experimental. |
global-zipcode | Geography - Zipcode (Global) | Returns a list of tokens from an input text that are deemed to be likely zipcodes. | "I visited Boston, MA with a zip code of 02116 and Douglas, Isle of Man IM1 1EG." → ["02116", "IM1 1EG"] | Current state is experimental. |
country | Geography - Country | Returns a list of tokens from an input text that are deemed to be likely country names. | "I grew up in the U.K., but moved to the United States and then Canada." → ["U.K.", "United States", "Canada"] | Current state is experimental. |
global-address | Geography - Address (Global) | Returns a list of tokens from an input text that are deemed to be likely addresses from various regions of the world. | "My address:\n123 Some Street, London UK" → ["123 Some Street, London UK"] | Current state is experimental. |
regex | Basic - Regex | Returns matches from the provided regex. | Regex used: r'\b[A-Za-z]+\b', "Jun 1992 2019-01-01 1st january 2018" → ["Jun", "january"] | If no regex is provided, returns an empty list. Is not guaranteed to return output if there are no matches in the input text. |
date-regex | Regex - Date | Returns matches from a preset date regex. | "Jun 1992 2019-01-01 1st january 2018" → ["Jun 1992", "2019-01-01", "1st january 2018"] | Is not guaranteed to return output if there are no matches in the input text. |
currency-regex | Basic - Currency | Returns matches from a preset currency regex. Only matches amounts in US-style formatting. | "USD 199.99 USD 1,447,329.11 EUR 1.447,98" → ["USD 199.99", "USD 1,447,329.11"] | Is not guaranteed to return output if there are no matches in the input text. |
dataset | Dataset | Returns matches from a list of terms provided as a filepath in tokenizer_config. | Dataset: ['UK', 'US', 'EU'], "An EU regulation is a legal act which applies directly at the national level." → "EU" | Is not guaranteed to return output if there are no matches in the input text, or if no filepath has been passed to the Tokenizer. |
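Any of these tokenizers can be selected explicitly through the `tokenizer` argument. As a sketch, this call makes the `date` matcher's default pairing with the `date-regex` tokenizer explicit (the result mirrors the Find Example in the matcher table above):

```
# Sketch: explicitly selecting the date-regex tokenizer,
# which is already the default for the 'date' matcher.
nlp_token_find('Some text on 11-JUN-2018',
               model='date',
               tokenizer='date-regex')  # -> '11-JUN-2018'
```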
Creating custom TokenMatchers and Tokenizers
Creating custom TokenMatchers and Tokenizers is similar to the process of registering custom Refiner formulas. Import the `TokenMatcher` and `Tokenizer` interfaces, and create subclasses that implement the required functionality. The custom classes are then registered by defining `register_token_matchers` and `register_tokenizers` functions. See the example below:
```python
from instabase.model_utils.tokenizer import Tokenizer
from instabase.model_utils.token_matcher import TokenMatcher
from instabase.provenance.tracking import Value

class MyCustomTokenizer(Tokenizer):
  TOKEN_TYPE = u'my-custom-tokenizer'
  HUMAN_NAME = u'Custom - From a UDF'

  def tokenize(self, text, **kwargs):
    # type: (Value[Text]) -> List[Value[Text]]
    # Just return the first word
    return [Value(text.value().split(' ')[0])]

class MyCustomTokenMatcher(TokenMatcher):
  MATCHER_TYPE = u'my-custom-token-matcher'
  HUMAN_NAME = u'Custom - From a UDF'
  DEFAULT_TOKENIZER = u'my-custom-tokenizer'

  def score(self, token, **kwargs):
    # type: (Value[Text], **Any) -> float
    # For this example, give all tokens the same score
    return 0.8

def register_token_matchers(name_to_fn):
  name_to_fn.update({
    'MyCustomTokenMatcher': {
      'class': MyCustomTokenMatcher
    }
  })

def register_tokenizers(name_to_fn):
  name_to_fn.update({
    'MyCustomTokenizer': {
      'class': MyCustomTokenizer
    }
  })
```
The custom tokenizer takes in a `Value` object of type `Text` and returns a list of `Value` objects. Once registered, the custom matcher and tokenizer can be used in a Refiner formula:
```
nlp_token_find('Some example text',
               model='my-custom-token-matcher',
               tokenizer='my-custom-tokenizer')
```
or within a UDF using REFINER_FNS:
```python
scan_result, err = REFINER_FNS.call('nlp_token_find_all',
                                    val,
                                    model='my-custom-token-matcher',
                                    tokenizer='my-custom-tokenizer',
                                    threshold=0.7,
                                    **kwargs)
```
Make sure to include the `**kwargs` parameter that comes from the calling UDF function. The `**kwargs` parameter includes important context for running TokenMatchers.
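As a minimal sketch (the wrapper function name and the error-handling convention here are illustrative assumptions, not a prescribed API), a UDF might forward its `**kwargs` like this:

```python
# Hypothetical UDF body: forwards **kwargs so the TokenMatcher
# receives the execution context supplied by the caller.
def find_all_custom_tokens(val, **kwargs):
  scan_result, err = REFINER_FNS.call('nlp_token_find_all',
                                      val,
                                      model='my-custom-token-matcher',
                                      tokenizer='my-custom-tokenizer',
                                      threshold=0.7,
                                      **kwargs)
  if err:
    return None, err
  return scan_result, None
```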