NLP functions

nlp_get_entities

nlp_get_entities(text, label=None)

Extracts entities from natural language text.

Args:
    text (str): the text of interest
    label (str): filters for a specific kind of entity, such as PERSON or ORG. Defaults to None, which gets all entity types.

Returns:
    A dictionary containing the entities extracted from the text.

Examples:
    nlp_get_entities('The Massachusetts Institute of Technology is a private research university in Cambridge, Massachusetts, United States.') ->

    {
      'entities': [
        {'char_pos': {'end': 41, 'start': 0},
         'entity': u'The Massachusetts Institute of Technology',
         'label': u'ORG',
         'word_pos': {'end': 5, 'start': 0}},
        {'char_pos': {'end': 87, 'start': 78},
         'entity': u'Cambridge',
         'label': u'GPE',
         'word_pos': {'end': 12, 'start': 11}},
        {'char_pos': {'end': 102, 'start': 89},
         'entity': u'Massachusetts',
         'label': u'GPE',
         'word_pos': {'end': 14, 'start': 13}},
        {'char_pos': {'end': 117, 'start': 104},
         'entity': u'United States',
         'label': u'GPE',
         'word_pos': {'end': 17, 'start': 15}}
      ],
      'status': 'OK'
    }
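
The returned dictionary can be post-processed with ordinary Python. The sketch below is only an illustration and is not part of the NLP functions: `entities_by_label` is a hypothetical helper, `result` is a hand-copied subset of the output shown above (without `word_pos`), and the slice check assumes that `char_pos` offsets are end-exclusive, which is what the sample values suggest.

```python
from collections import defaultdict

# Input text and a trimmed copy of the documented result (word_pos omitted).
text = ('The Massachusetts Institute of Technology is a private research '
        'university in Cambridge, Massachusetts, United States.')
result = {
    'entities': [
        {'char_pos': {'start': 0, 'end': 41}, 'label': 'ORG',
         'entity': 'The Massachusetts Institute of Technology'},
        {'char_pos': {'start': 78, 'end': 87}, 'label': 'GPE',
         'entity': 'Cambridge'},
        {'char_pos': {'start': 89, 'end': 102}, 'label': 'GPE',
         'entity': 'Massachusetts'},
        {'char_pos': {'start': 104, 'end': 117}, 'label': 'GPE',
         'entity': 'United States'},
    ],
    'status': 'OK',
}


def entities_by_label(result):
    """Group entity strings by label (hypothetical helper, not a library call)."""
    if result.get('status') != 'OK':
        return {}
    grouped = defaultdict(list)
    for ent in result.get('entities', []):
        grouped[ent['label']].append(ent['entity'])
    return dict(grouped)


grouped = entities_by_label(result)

# Assumption: char_pos is end-exclusive, so a plain slice recovers the entity.
slices_match = all(
    text[e['char_pos']['start']:e['char_pos']['end']] == e['entity']
    for e in result['entities']
)
print(grouped['GPE'])   # ['Cambridge', 'Massachusetts', 'United States']
print(slices_match)     # True
```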

nlp_token_clean

nlp_token_clean(text, model=None, model_config=None)

Cleans a token according to the provided model.

Args:
    text (str): the token of interest
    model (str): the name of a valid token model
    model_config (map): A map of options to configure the model

Returns:
    The input token, cleaned according to the logic of the token model.

Examples:
    nlp_token_clean('20IB0A1B', model='matcher:only-digits') -> '20180418'
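
To make the example above concrete, here is a rough sketch of what an only-digits cleaner might do. Everything in it is an assumption: the `LOOKALIKES` table is inferred solely from the single documented example (I to 1, B to 8, A to 4), `clean_only_digits` is a hypothetical name, and the real `matcher:only-digits` model may apply different or additional rules.

```python
# Hypothetical look-alike table, inferred only from the documented example.
# The actual 'matcher:only-digits' model may behave differently.
LOOKALIKES = {'I': '1', 'B': '8', 'A': '4'}


def clean_only_digits(token):
    """Swap look-alike letters for digits; leave every other character as is."""
    return ''.join(LOOKALIKES.get(ch, ch) for ch in token)


cleaned = clean_only_digits('20IB0A1B')
print(cleaned)  # 20180418
```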

nlp_token_find

nlp_token_find(text, model=None, separator=None, tokenizer=None, model_config=None, tokenizer_config=None)

Tokenizes the input string and returns the best scoring token according to the provided matcher.

If a tokenizer is given, it is used. Otherwise, the default tokenizer
specified in the tokenmatcher class is used. If no tokenizer is specified or
set as default, the unigram tokenizer is used.

Args:
    text (str): the text assumed to contain the token of interest
    model (str): the name of a valid token matcher
    separator (str): the string on which to split text into tokens
    tokenizer (str): the name of a valid tokenizer
    model_config (map): A map of options to configure the model
    tokenizer_config (map): A map of options to configure the tokenizer

Returns:
    The best scoring token according to the provided matcher logic.

Examples:
    nlp_token_find('Due on 20IB-0A-1B', model='matcher:only-digits', separator=' ') -> '20IB-0A-1B'
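
The split-then-pick-best behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: `find_best_token` and `score_only_digits` are hypothetical names, and the scoring rule (full credit for a digit, half credit for the look-alike letters I, B and A) is a guess that happens to be consistent with the documented examples.

```python
DIGITS = set('0123456789')
LOOKALIKES = set('IBA')  # assumed look-alike letters, not confirmed by the docs


def score_only_digits(token):
    """Hypothetical score: 1.0 per digit, 0.5 per look-alike, averaged over length."""
    credit = sum(1.0 if c in DIGITS else 0.5 if c in LOOKALIKES else 0.0
                 for c in token)
    return credit / len(token) if token else 0.0


def find_best_token(text, separator=' '):
    """Split text on the separator and return the highest-scoring token."""
    tokens = [t for t in text.split(separator) if t]
    if not tokens:
        return None
    # max() keeps the first token when several share the top score.
    return max(tokens, key=score_only_digits)


best = find_best_token('Due on 20IB-0A-1B')
print(best)  # 20IB-0A-1B
```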

nlp_token_find_all

nlp_token_find_all(text, model=None, separator=None, threshold=None, tokenizer=None, model_config=None, tokenizer_config=None)

Tokenizes the input string and returns all tokens with score above
threshold, according to the provided matcher.

If a tokenizer is given, it is used. Otherwise, the default tokenizer
specified in the tokenmatcher class is used. If no tokenizer is specified or
set as default, the unigram tokenizer is used.

Args:
    text (str): the text assumed to contain the token of interest
    model (str): the name of a valid token matcher
    separator (str): the string on which to split text into tokens
    threshold (float): the score threshold for determining whether a token
      fits the model. Defaults to 0.8.
    tokenizer (str): the name of a valid tokenizer
    model_config (map): A map of options to configure the model
    tokenizer_config (map): A map of options to configure the tokenizer

Returns:
    A list of all tokens that score above the threshold according to the provided matcher logic.

Examples:
    nlp_token_find_all('ID: 20I80A1B', model='matcher:only-digits', separator=' ') -> ['20I80A1B']
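
Threshold filtering can be sketched in a few lines. The code below is a hedged illustration only: `find_all_tokens` and `score_only_digits` are hypothetical names, the scoring rule (1.0 per digit, 0.5 per look-alike letter I, B or A) is an assumption rather than the documented matcher, and "above threshold" is read as a strict comparison.

```python
DIGITS = set('0123456789')
LOOKALIKES = set('IBA')  # assumed look-alike letters, not confirmed by the docs


def score_only_digits(token):
    """Hypothetical score: 1.0 per digit, 0.5 per look-alike, averaged over length."""
    credit = sum(1.0 if c in DIGITS else 0.5 if c in LOOKALIKES else 0.0
                 for c in token)
    return credit / len(token) if token else 0.0


def find_all_tokens(text, separator=' ', threshold=None):
    """Return every token whose score is strictly above the threshold."""
    if threshold is None:
        threshold = 0.8  # documented default
    tokens = [t for t in text.split(separator) if t]
    return [t for t in tokens if score_only_digits(t) > threshold]


matches = find_all_tokens('ID: 20I80A1B')
print(matches)  # ['20I80A1B']
```

Under this assumed rule, '20I80A1B' scores 6.5 / 8 = 0.8125 and clears the default threshold, while 'ID:' scores 0.5 / 3 and is dropped.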

nlp_token_score

nlp_token_score(text, model=None, model_config=None)

Scores a token from 0 to 1.0 according to the provided matcher.

Args:
    text (str): the token of interest
    model (str): the name of a valid token matcher
    model_config (map): A map of options to configure the model

Returns:
    A score for the input token, from 0 to 1.0, according to the logic
    of the token matcher.

Examples:
    nlp_token_score('20IB0A1B', model='matcher:only-digits') -> 0.75
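
One scoring rule that reproduces the value above gives full credit to digits and half credit to letters that resemble digits. This is purely a guess made for illustration: `score_only_digits` is a hypothetical function, the look-alike set (I, B, A) is inferred from the examples in this reference, and the real `matcher:only-digits` model may compute its score in another way.

```python
DIGITS = set('0123456789')
LOOKALIKES = set('IBA')  # assumed look-alike letters, not confirmed by the docs


def score_only_digits(token):
    """Hypothetical score in [0, 1]: average per-character credit."""
    if not token:
        return 0.0
    credit = 0.0
    for ch in token:
        if ch in DIGITS:
            credit += 1.0   # a real digit earns full credit
        elif ch in LOOKALIKES:
            credit += 0.5   # a digit look-alike earns half credit
    return credit / len(token)


score = score_only_digits('20IB0A1B')
print(score)  # 0.75
```

Here four digits and four look-alikes give (4 * 1.0 + 4 * 0.5) / 8 = 0.75.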

nlp_token_select

nlp_token_select(*args: Any)

Returns the best scoring token, among provided inputs, according to the provided matcher.

Args:
    args: dict containing:
        text1 .. textN (str): the tokens of interest.
        model (str): the name of a valid token matcher
        model_config (map): A map of options to configure the model

Returns:
    The best scoring token according to the provided matcher logic.

Examples:
    nlp_token_select('20IB-0A-1B', '2018-01-20', model='matcher:only-digits') -> '2018-01-20'
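
Selecting among candidate tokens amounts to taking the maximum by score. The sketch below mirrors the call shape of the example (tokens passed positionally), but it is not the actual implementation: `select_best_token` and `score_only_digits` are hypothetical names, and the scoring rule (1.0 per digit, 0.5 per look-alike letter I, B or A) is an assumption.

```python
DIGITS = set('0123456789')
LOOKALIKES = set('IBA')  # assumed look-alike letters, not confirmed by the docs


def score_only_digits(token):
    """Hypothetical score: 1.0 per digit, 0.5 per look-alike, averaged over length."""
    credit = sum(1.0 if c in DIGITS else 0.5 if c in LOOKALIKES else 0.0
                 for c in token)
    return credit / len(token) if token else 0.0


def select_best_token(*tokens):
    """Return the highest-scoring candidate, or None if no candidates are given."""
    if not tokens:
        return None
    return max(tokens, key=score_only_digits)


selected = select_best_token('20IB-0A-1B', '2018-01-20')
print(selected)  # 2018-01-20
```

Under the assumed rule, '20IB-0A-1B' scores 0.6 and '2018-01-20' scores 0.8, so the second token is selected.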