Custom Classifier guide
Most document processing projects require the incoming documents to be sorted, or classified. Instabase provides an app called Classifier to perform this sorting operation.
The automatic Instabase Classifier works with word frequency. Sometimes, your documents are complex or similar enough that you need more specific logic.
In this guide, we will learn to extend Classifier with custom Python, so that we can label small, medium, or large input documents.
Prerequisites
Before this exercise, complete the Flow guide, the Refiner guide, and the Classifier guide. All of these are in Core concepts.
For this exercise, we’ll be working with ADP and Gusto paystubs. You can use your Metaflow paystub project created from previous guides if you want to.
Or, you can start from our Metaflow paystub solution:
- Create a new workspace and remove any initialized directories. Unzip the custom-classifier-starter.zip on your computer and drag your unzipped `paystub-extraction` folder into your blank workspace.
1. Why Use Custom Classifiers
Sometimes, Instabase’s default Classifier is not robust enough to catch all the types of documents that we want. In this case, we can extend the Classifier by building a custom Classifier. Building a custom Classifier makes the most sense in two cases:
- Case 1: The documents are extremely simple to classify by hand. For example, maybe every document you'll sort begins with some standard phrase, like `DEPARTMENT: 4`. In this situation, rather than use a machine learning model, it might make sense to build a Classifier that extracts the department number and returns something like `department_4` as the result.
- Case 2: You have a sophisticated classifier from your engineering team that you would like to use. Maybe your engineering department has just developed the latest deep learning technology that knocks the socks off everything out there, and you'd like to use it with Instabase.
Custom Classifiers can implement heuristic models to label documents.
Custom Classifiers are implemented in Python, placed inside a special `scripts` directory in Instabase, and then registered for use with our platform. This guide will walk you through the required steps to create one and attach it to your Flow.
2. Implementing a Custom Classifier
A Classifier is a Python class that implements the following interface:
```python
class Classifier(object):
  """
  Interface for an Instabase Classifier.
  """

  def get_type(self):
    # type: () -> Text
    raise NotImplementedError(u'To be implemented.')

  def get_version(self):
    # type: () -> Text
    raise NotImplementedError(u'To be implemented.')

  def train(self, training_context, reporting_context):
    # type: (ClassifierTrainingContext, ClassifierReportingContext) -> Tuple[Dict, Text]
    """
    Deprecated method for training a classifier.
    """
    raise NotImplementedError(u'Deprecated, no need to implement.')

  def predict(self, datapoint):
    # type: (ClassifierInput) -> Tuple[ClassifierPrediction, Text]
    raise NotImplementedError(u'To be implemented.')

  def export_parameters_to_string(self):
    # type: () -> Tuple[Text, Text]
    """
    Deprecated method for returning a representation of the trained model as a string.
    """
    raise NotImplementedError(u'Deprecated, no need to implement.')

  def load_parameters_from_string(self, model_string, model_metadata=None):
    # type: (Text, Dict) -> Tuple[bool, Text]
    """
    Deprecated method for loading a trained model from a string.
    """
    raise NotImplementedError(u'Deprecated, no need to implement.')
```
Your custom Classifier is responsible for:
- Reporting its own type and version. In this case, type means the string returned from `get_type()`. You can give your Classifier any type you want, but it is good practice to make it specific and descriptive, like `"ib:heuristic-classifier"`.
- Serializing and de-serializing its model to a string.
- Predicting a label, given a datapoint.
- Any custom progress, status, or error logging.
Serializing and de-serializing your model
When asked to save, your Classifier must encode its model as a UTF-8 string.
This string is stored by Instabase inside a JSON wrapper containing more metadata about your Classifier. If your model contains data that cannot be represented as a JSON-safe UTF-8 string, we recommend using Base64 encoding.
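For example, a Base64 round trip keeps arbitrary binary model data (a pickled model, packed weights, and so on) safe to embed in a UTF-8 JSON string. A minimal sketch of the two halves (the helper names are illustrative, not part of the platform API):

```python
import base64


def model_bytes_to_string(model_bytes):
  # Encode arbitrary binary model data as a JSON-safe UTF-8 string,
  # suitable to return from export_parameters_to_string().
  return base64.b64encode(model_bytes).decode('utf-8')


def model_string_to_bytes(model_string):
  # Recover the original bytes inside load_parameters_from_string().
  return base64.b64decode(model_string.encode('utf-8'))
```

The encoded string contains only ASCII characters, so it survives the JSON wrapper unchanged.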
Predicting
The `predict()` method returns a `ClassifierPrediction` object, which implements the following interface:
```python
class ClassifierPrediction(object):
  """
  Wrapper for the result of a classification.
  """

  def __init__(self, best_match, debugging_data=None):
    # type: (Text, Dict) -> None
    self.best_match = best_match  # type: Text
    # For anything else.
    self.debugging_data = debugging_data  # type: Dict
```
The only required field is `best_match`, which is the string label your Classifier has predicted. The `debugging_data` field can be used to store any additional information (such as a distribution over labels).
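For instance, a classifier could return its best label together with a score distribution for debugging. A small sketch (the `ClassifierPrediction` class is restated from the interface above so the snippet runs on its own; the score values are made up):

```python
class ClassifierPrediction(object):
  """Wrapper for the result of a classification."""

  def __init__(self, best_match, debugging_data=None):
    self.best_match = best_match
    self.debugging_data = debugging_data or {}


# A prediction carrying a label distribution for later inspection.
scores = {u'paystub': 0.91, u'invoice': 0.07, u'other': 0.02}
prediction = ClassifierPrediction(u'paystub', debugging_data={u'scores': scores})
```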
Activity
- Create a new folder in your `classifiers` folder. Call the new folder `scripts`.
- Create a new file. Name it `new_custom_classifier.py`.
- Open `new_custom_classifier.py` in a text editor.
- We are going to make a custom Classifier that sorts documents based on their size. First, copy and paste these helper functions into `new_custom_classifier.py`:
```python
from typing import Text, Dict, List, Union, Tuple, Any, Callable, Set

_PAGE_IN_CHARS = 80 * 48
_DOC_MEDIUM = 2 * _PAGE_IN_CHARS
_DOC_LARGE = 10 * _PAGE_IN_CHARS


def _ibocr_to_text(ibocr):  # extract text from an IBOCR
  # type: (List[Dict]) -> Tuple[Text, Text]
  """
  Transform an IBOCR file in Python representation into a string containing
  the concatenation of each page.
  """
  page_texts = []
  for i in range(ibocr.get_num_records()):
    cur_txt, err = ibocr.get_text_at_record(i)
    if err:
      return None, err
    page_texts.append(cur_txt)
  return u'\n'.join(page_texts), None


class DocsizePrediction(object):  # You don't need to subclass Prediction
  def __init__(self, best_match):
    self.best_match = best_match
    self.debugging_data = dict()
```
- Copy and paste this example code for a custom Classifier:
```python
class DocsizeDemoClassifier(object):  # You don't need to subclass Classifier
  """
  This is a demo heuristic Classifier.
  """

  def get_type(self):
    return u'ib:heuristic-demo'

  def get_version(self):
    return u'1.0.0'

  def train(self, training_context, reporting_context):
    """
    No training is necessary; this is a heuristic model.
    """
    return dict(), None

  def export_parameters_to_string(self):
    """
    Returns an empty string; this is a heuristic model.
    """
    return u'', None

  def load_parameters_from_string(self, model_string, model_metadata=None):
    """
    No-op; this is a heuristic model.
    """
    return True, None
```
This `DocsizeDemoClassifier` will be used to classify documents by size, but at this point it doesn't do anything: it doesn't have a `predict()` method.
- We want to be able to predict a document's type based on whether it is small, medium, or large. Add a `predict()` method to your `DocsizeDemoClassifier`. It could look like this example:
```python
def predict(self, datapoint):
  """
  Classifies a document into categories EMPTY, SMALL, MEDIUM, and LARGE
  based on arbitrary thresholds defined in the constants at the top of the
  file.
  """
  if datapoint.get_ibocr() and not datapoint.get_text():
    text_content, ibocr_error = _ibocr_to_text(datapoint.get_ibocr())
    if ibocr_error:
      return None, u'Could not transform IBOCR file to text'
    datapoint.set_text(text_content)

  best_match = u'EMPTY'
  the_text = datapoint.get_text()
  if the_text:
    if len(the_text) > _DOC_LARGE:
      best_match = u'LARGE'
    elif len(the_text) > _DOC_MEDIUM:
      best_match = u'MEDIUM'
    else:
      best_match = u'SMALL'
  return DocsizePrediction(best_match), None
```
This example model uses a size heuristic: it sorts incoming documents by text length into small, medium, and large.
- Save the changes to `new_custom_classifier.py`.
- Go to the Classifier app. Create a new Classifier and give it the Classifier type Custom Classifier. Select the `scripts` folder inside your `classifiers` folder.
Uh oh! Expect an error here. We haven’t registered our new Classifier, so Instabase doesn’t know yet that it exists. We’ll fix this in the next section.
3. Registering your Classifier
To make your custom Classifier available to Instabase, you must register it inside the same Python module that defines it. This is done by creating a special `register_classifiers` function that Instabase will look for.
This function returns a Python dictionary of the following form:
```python
def register_classifiers():
  return {
    'company:classifier-name': {
      'class': MyClassifierClass
    }
  }
```
For example:
```python
def register_classifiers():
  return {
    'instabase:size-heuristic': {
      'class': SizeHeuristicClassifier
    }
  }
```
This function tells Instabase to use the `SizeHeuristicClassifier` class when a user selects the `'instabase:size-heuristic'` custom Classifier option.
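Because `register_classifiers()` returns a plain dictionary, you can sanity-check the mapping locally before uploading. A sketch (the `SizeHeuristicClassifier` stub below stands in for a real classifier class so the snippet runs on its own):

```python
class SizeHeuristicClassifier(object):
  # Stub standing in for a real custom Classifier class.
  def get_type(self):
    return u'instabase:size-heuristic'


def register_classifiers():
  return {
    'instabase:size-heuristic': {
      'class': SizeHeuristicClassifier
    }
  }


# Instabase looks up the class under its registered name and instantiates it;
# the same lookup can be exercised locally.
registry = register_classifiers()
classifier = registry['instabase:size-heuristic']['class']()
print(classifier.get_type())  # instabase:size-heuristic
```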
- Open `new_custom_classifier.py` with the Text Editor.
- Copy this code to the end of your file:
```python
def register_classifiers():
  return {
    'my-example-classifier': {
      'class': DocsizeDemoClassifier
    }
  }
```
- Save the changes.
- Go to the Classifier app. Create a new Classifier and give it the Classifier type Custom Classifier. Select the `scripts` folder inside your `classifiers` folder. Now, you can see `my-example-classifier` as an option. Select that one.
Awesome! Now, you’ve created a Classifier that has functionality beyond the normal bigram classification. This one sorts by size, but you can create a Classifier that classifies by any number of characteristics. If a document has the word “PAYSTUB” in it, for example, you could catch that in a custom Classifier.
When you get the hang of using heuristics to classify documents with your specific structure, you’ll be able to solve any classification project that comes your way.
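The "PAYSTUB" idea above could look like this minimal sketch (`KeywordClassifier`, its labels, and the `Doc` and `Prediction` test doubles are illustrative; the deprecated interface methods are elided):

```python
class Prediction(object):
  # Simple result wrapper, same shape as DocsizePrediction above.
  def __init__(self, best_match):
    self.best_match = best_match
    self.debugging_data = dict()


class Doc(object):
  # Minimal stand-in for ClassifierInput, for local testing only.
  def __init__(self, text):
    self._text = text

  def get_text(self):
    return self._text


class KeywordClassifier(object):
  """
  Hypothetical custom Classifier: labels a document 'paystub' when a
  marker word appears anywhere in its text, and 'other' otherwise.
  """

  def get_type(self):
    return u'example:keyword-classifier'

  def get_version(self):
    return u'1.0.0'

  def predict(self, datapoint):
    text = datapoint.get_text() or u''
    if u'PAYSTUB' in text.upper():
      return Prediction(u'paystub'), None
    return Prediction(u'other'), None
```

The same pattern generalizes to any marker your documents reliably contain: department headers, form numbers, bank routing lines, and so on.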
4. Why Use Ensemble Classifiers
Let’s say we have a Classifier that correctly predicts 9 out of 10 document types, but it continually gets the 10th document type mixed up with the 8th document type. Instead of starting over and trying to create a Classifier that predicts all types correctly, we can chain our existing Classifier with another Classifier that would provide specific logic to handle that last case.
When we combine multiple Classifiers into one classification process, we call it an Ensemble Classifier.
On Instabase, you can create an Ensemble Classifier that:
- Applies multiple Classifiers, including custom ones, to a piece of data
- Applies custom logic to decide which result to select for each datapoint
- Applies custom pre-processing on Classifier inputs before prediction
5. Implementing an Ensemble Classifier
An Ensemble Classifier is a Python class that implements the same interface that our singular custom Classifiers implement.
An Ensemble Classifier has an extra method called `get_ensemble_types()`:
```python
def get_ensemble_types(self):
  # type: () -> List[Text]
  """
  Return the types of the classifiers to form an ensemble from. Classifier
  types must come from the custom Classifiers occupying the same scripts/
  folder. An Ensemble Classifier can't request to contain another copy of
  itself, such as the type returned by self.get_type().
  """
  return ['ib:custom-classifier']
```
Your Ensemble Classifier is responsible for:
- Reporting its type and version
- Reporting the Classifier types it wraps
- Making a decision about the Ensemble's final output
- Optional: Manipulating incoming data-points for prediction
Classifier Types
The `get_ensemble_types` function in an Ensemble Classifier can request the type of any custom Classifier in the same `scripts` folder as the Ensemble Classifier.
Modifying the input
Ensemble Classifiers allow us to do preprocessing on a data-point before classification is performed. We put the preprocessing inside a function called `classifiers_will_predict()`, which takes in one `datapoint`.
```python
def classifiers_will_predict(self, datapoint):
  # type: (ClassifierInput) -> ClassifierInput
  """
  Called before prediction. Provides an opportunity to modify the input.
  """
  return datapoint
```
For example, we might want to transform all text to lowercase and take out common words before we look for regex matches.
In that case, our `classifiers_will_predict` function would look something like:
```python
def classifiers_will_predict(self, datapoint):
  # type: (ClassifierInput) -> ClassifierInput
  """
  Called before prediction. Provides an opportunity to modify the input.
  """
  text = datapoint.get_text()
  text = text.lower()
  # Remove some common words:
  common_words_to_remove = ["the", "and", "a", "an", "of"]
  text_words = text.split()
  text_list = [word for word in text_words if word not in common_words_to_remove]
  text = ' '.join(text_list)
  datapoint.set_text(text)
  return datapoint
```
Classify with Additional Logic
After all of the Classifiers have run, we can add conventional logic to sort the documents. We put this logic into the function `classifiers_did_predict()`, which runs after classification occurs.
```python
def classifiers_did_predict(self, original_datapoint, modified_datapoint, predictions):
  # type: (ClassifierInput, ClassifierInput, List[ClassifierPrediction]) -> Tuple[ClassifierPrediction, Text]
  """
  Called after prediction. Requires the implementor to decide which output is
  correct. Users can additionally add their own heuristic output as an
  override for special cases.
  """
  return predictions[0], None
```
For example, you could use a keyword search to see if any of the documents classified as `"other"` are actually paystubs.

To do this, in `classifiers_did_predict()` you would check whether the document's text contains the strings `"paystub"`, `"pay check"`, `"pay cheque"`, or any other strings you have specifically identified in paystub documents. If the text contains any of these words, you could return the label `"paystub"` instead of the original `"other"`.
In this case, the `classifiers_did_predict` function would look something like:
```python
def classifiers_did_predict(self, original_datapoint, modified_datapoint, predictions):
  # type: (ClassifierInput, ClassifierInput, List[ClassifierPrediction]) -> Tuple[ClassifierPrediction, Text]
  """
  Called after prediction. Requires the implementor to decide which output is
  correct. Users can additionally add their own heuristic output as an
  override for special cases.
  """
  prediction = predictions[0]
  text = modified_datapoint.get_text()
  keywords = ["paystub", "pay check", "pay cheque"]
  # If the Classifier called the document 'other', check whether it's actually a paystub.
  if prediction.best_match == "other":
    for keyword in keywords:
      if keyword in text:
        return ClassifierPrediction("paystub"), None
  return prediction, None
```
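To see the override in action outside Instabase, you can call the same logic with stand-in objects (a sketch: the `ClassifierPrediction` and `Doc` stubs below mimic just enough of the platform API, and `self` is dropped so the function runs standalone):

```python
class ClassifierPrediction(object):
  # Same shape as the platform's prediction wrapper.
  def __init__(self, best_match, debugging_data=None):
    self.best_match = best_match
    self.debugging_data = debugging_data or {}


class Doc(object):
  # Minimal stand-in for ClassifierInput.
  def __init__(self, text):
    self._text = text

  def get_text(self):
    return self._text


def classifiers_did_predict(original_datapoint, modified_datapoint, predictions):
  # Override 'other' when a paystub keyword appears in the (preprocessed) text.
  prediction = predictions[0]
  text = modified_datapoint.get_text()
  keywords = ["paystub", "pay check", "pay cheque"]
  if prediction.best_match == "other":
    for keyword in keywords:
      if keyword in text:
        return ClassifierPrediction("paystub"), None
  return prediction, None


doc = Doc(u'employee paystub net pay 1,234.56')
result, err = classifiers_did_predict(doc, doc, [ClassifierPrediction("other")])
print(result.best_match)  # paystub
```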
Conclusion
Great! Now you have seen how to create and use a custom Classifier to sort documents that you know the structure of.
Together, we saw how to:
- Create a custom Classifier
- Create an Ensemble Classifier
- Register your new Classifiers with the Instabase platform
If anything was unclear, reach out to us at training@instabase.com. We'd love to chat about any questions, comments, or concerns you might have had in completing this guide.