Classifier (Legacy)
Classifier runs in the Classifier Flow step and uses processed text files from an OCR step to sort documents by document type. A Classifier is trained to recognize the structure of each document type so that it can assign incoming documents to the correct type.
For hands-on guides to working with Classifier, see the Custom Classifier guide.
Classifier types
- Custom Classifier: Custom Classifiers can inherit from DocumentSplitter to become custom split documents Classifiers.
- Automatic features (deprecated): Built-in default classifier.
- Naive Bayes (deprecated): Built-in Naive Bayes classifier.
- Split Documents (deprecated): The split documents Classifiers solve the use case of document separation: separating and classifying the documents in an input bundle. A few Split Documents models are available for specific uses:
  - Split documents (fixed templates): Best for applications where documents tend to match a fixed or consistent template. This Classifier type can be trained with only one instance per label class, under the assumption that the bundled data will look very similar. Includes confidence scores in the Split Result based on a joint confidence over template continuity and document class. Scores range from [0, 1], where higher scores are better; they are not probabilities.
  - Split documents (by first page): Best for applications where documents are of unknown length but contain a consistently identifiable first page. For example, bank statements tend to have an account overview page followed by some number of transaction pages. Instabase recommends training with at least three examples per label class. Includes confidence scores, which are the classification model’s probability of the predicted class given a split range of pages.
Custom classifiers
Custom Classifiers extend the sorting capabilities of the Classifier functionality. Custom Classifiers can implement heuristic models to label documents. They are implemented as Python files in the scripts directory and must be registered for use with the platform.
Implementing a custom Classifier
A custom Classifier is responsible for:
- Reporting its type and version
- Predicting a label based on a data point
A custom Classifier is a Python class that implements the following interface:
class Classifier(object):
  """
  Interface for an Instabase classifier.
  """

  def get_type(self) -> Text:
    raise NotImplementedError('To be implemented.')

  def get_version(self) -> Text:
    raise NotImplementedError('To be implemented.')

  def train(self, training_context: ClassifierTrainingContext,
            reporting_context: ClassifierReportingContext) -> Tuple[Dict, Text]:
    """
    Deprecated method for training the classifier.
    """
    raise NotImplementedError(u'Deprecated, no need to implement.')

  def predict(self, datapoint: ClassifierInput) -> Tuple[ClassifierPrediction, Text]:
    # This method is used for traditional per-record classification and returns
    # a single label for the given datapoint.
    raise NotImplementedError('To be implemented.')

  def split_doc(self, datapoint: ClassifierInput) -> Tuple[DocumentSplitResult, Text]:
    # WARNING: This method takes precedence over 'predict()'.
    # This method is used for labeling subsections of the document and returns
    # an object as specified in the "Splitting documents" section below.
    raise NotImplementedError('To be implemented.')

  def export_parameters_to_string(self) -> Tuple[Text, Text]:
    """
    Deprecated method for returning a representation of the trained model as a string.
    """
    raise NotImplementedError(u'Deprecated, no need to implement.')

  def load_parameters_from_string(self, model_string: Text, model_metadata: Dict = None) -> Tuple[bool, Text]:
    """
    Deprecated method for loading a trained model from a string.
    """
    raise NotImplementedError(u'Deprecated, no need to implement.')

  def get_feature_types(self) -> List[Text]:
    """
    Deprecated method for selecting feature types.
    """
    raise NotImplementedError(u'Deprecated, no need to implement.')
If the split_doc method is implemented, it takes priority over predict.
Accessing runtime variables
When running a Flow, you can pass in a runtime_config, which is a map<string, string>. To access this information, use the model_metadata passed in to the load_parameters_from_string method.
ModelMetadataDict is of type:
ModelMetadataDict = TypedDict('ModelMetadataDict', {
  'runtime_config': Dict[Text, Text]
})
The key is runtime_config. For example, if the runtime_config passed into the Flow is {"key1": "val1"}, then the model_metadata will be:
{
  "runtime_config": {
    "key1": "val1"
  }
}
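For example, a custom Classifier can read the runtime_config out of model_metadata in load_parameters_from_string. The following is a minimal sketch; the attribute name self._key1 is illustrative:
def load_parameters_from_string(self, model_string: Text, model_metadata: Dict = None) -> Tuple[bool, Text]:
  # model_metadata follows ModelMetadataDict; guard against it being absent.
  runtime_config = (model_metadata or {}).get('runtime_config', {})
  # For a Flow started with runtime_config {"key1": "val1"}, this reads "val1".
  self._key1 = runtime_config.get('key1')
  return True, None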
Accessing FnContext
To access the FnContext object in a custom Classifier, call datapoint.get_fn_ctx(). The returned FnContext object holds values that can be accessed using the get_by_col_name method. The following columns are available:
- INPUT_COL is the input text. This is the same as datapoint.get_text().
- PARSED_IBOCR is the input IBOCR. This is the same as datapoint.get_ibocr().
- INPUT_FILEPATH is the path to the input IBDOC file.
- ROOT_OUTPUT_FOLDER is the path to the Run Flow/Flows/Metaflow operation’s output directory.
- CONFIG is the runtime_config (see the example after this list).
- CLIENTS contains the clients that can be used to access various resources. This consists of ibfile, which allows for filesystem access from the custom classifier.
- TOKEN_FRAMEWORK_REGISTRY allows for use of the TokenFrameworkRegistry object.
- REFINER_FNS allows for use of the REFINER_FNS object.
- LOGGER allows for use of the LOGGER object.
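For example, a predict or split_doc implementation can pull the runtime_config and logger out of the FnContext using the columns listed above. This is a minimal sketch based on the documented get_by_col_name call:
fn_ctx = datapoint.get_fn_ctx()
config, err = fn_ctx.get_by_col_name('CONFIG')
if err:
  config = {}
logger, err = fn_ctx.get_by_col_name('LOGGER')
if not err:
  logger.info('runtime_config: {}'.format(config))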
Accessing the ibfile object
To access the ibfile object, follow this example:
fn_ctx = datapoint.get_fn_ctx()
clients, err = fn_ctx.get_by_col_name('CLIENTS')
if err:
  logger.info('No clients exist')
ibfile = clients.ibfile
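As an illustration only, the ibfile client can then be used for filesystem access. The method names below (read_file, write_file) are assumptions and might differ from the API available in your environment:
# Hypothetical sketch: method names are assumptions, not confirmed API.
root_out, err = fn_ctx.get_by_col_name('ROOT_OUTPUT_FOLDER')
contents, err = ibfile.read_file(root_out + '/debug/notes.txt')
if not err:
  ibfile.write_file(root_out + '/debug/notes_copy.txt', contents)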
Predicting
Predictions return a ClassifierPrediction object that implements the following interface:
class ClassifierPrediction(object):
  """
  Wrapper for the result of a classification.
  """

  def __init__(self, best_match: Text, debugging_data=None) -> None:
    self.best_match = best_match  # type: Text
    # For anything else.
    self.debugging_data = debugging_data  # type: Dict
The only required field is best_match, a string label the Classifier has predicted. Use the debugging_data field to store additional information, such as a distribution over labels.
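For example, a predict implementation might return the top label along with a label distribution in debugging_data. This is a minimal sketch; the labels and scores are illustrative:
def predict(self, datapoint: ClassifierInput) -> Tuple[ClassifierPrediction, Text]:
  # Illustrative scores only; a real model would compute these from the datapoint.
  scores = {'invoice': 0.7, 'receipt': 0.3}
  best_match = max(scores, key=scores.get)
  prediction = ClassifierPrediction(best_match, debugging_data={'scores': scores})
  return prediction, None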
Splitting documents
To split a document by its constituent classes, get the records and word dicts from the datapoint using the get_ibocr method, then specify the split ranges as a list of page ranges per class. The predictions implement the following interface:
class DocumentSplitResult(object):
  """
  Wrapper for the result of a document split.
  """

  def __init__(self, doc_splits: Dict[Text, List[Tuple]], split_type: Text = 'page-ranges', debugging_data: Dict = None) -> None:
    self.doc_splits = doc_splits
    self.split_type = split_type
    # For anything else.
    self.debugging_data: Dict = debugging_data
The doc_splits field is a dictionary mapping each class name to a list of page ranges with an optional confidence score. The first two items in each tuple are the start and end page number of a record. Page ranges are 1-indexed and inclusive, so (1, 3) specifies pages 1, 2, and 3 (there is no page 0). A classifier confidence score, a float, can be returned as the optional third item in the tuple.
The split_type field specifies how to split the document. page-ranges specifies that documents are split using page ranges. For example, to split by lists of pages:
DocumentSplitResult({
  'class1': [(1, 3), (5, 8)],
  'class2': [(4, 4)],
  'class3': [(9, 10, 90.0)]  # the record spans pages 9 and 10 in the file, and the confidence score is 90.0
}, split_type='page-ranges')
Registering a custom Classifier
To make your custom Classifier available, register it inside the same Python module that defines it by creating a special register_classifiers function.
This function returns a Python dictionary of the following form:
def register_classifiers():
  return {
    'company:classifier-name': {
      'class': MyClassifierClass
    }
  }
Using a custom Classifier
When you create a classifier in the Classifier app, select Custom as the type, and select the scripts folder that contains your custom Classifier implementations. New Classifiers appear as options in the creation dialog.
Example: a heuristic model
Let’s create an example model that uses heuristics. Suppose we have incoming documents that we’d like to sort by length: small, medium, and large. We can create a custom Classifier that maps the length of these documents to a size category.
from typing import Text, Dict, List, Union, Tuple, Any, Callable, Set

# Note: Your custom code will not extend the classes here;
# instead, simply follow their structure as defined above.

_PAGE_IN_CHARS = 80 * 48
_DOC_MEDIUM = 2 * _PAGE_IN_CHARS
_DOC_LARGE = 10 * _PAGE_IN_CHARS


def _ibocr_to_text(ibocr: List[Dict]) -> Tuple[Text, Text]:
  # Copy this function to extract text from an IBOCR.
  """
  Transform an IBOCR file in Python representation into a string containing
  the concatenation of each page.
  """
  page_text = []
  if len(ibocr):
    for page_num in range(len(ibocr)):
      if ibocr[page_num]['text']:
        page_text.append(ibocr[page_num]['text'])
  return '\n'.join(page_text), None


class DocsizePrediction(object):  # You don't need to subclass ClassifierPrediction

  def __init__(self, best_match: Text):
    self.best_match = best_match
    self.debugging_data = dict()


class DocsizeDemoClassifier(object):  # You don't need to subclass Classifier
  """
  This is a demo heuristic classifier.
  """

  def __init__(self) -> None:
    self._model_metadata = None

  def get_type(self) -> Text:
    return 'ib:heuristic-demo'

  def get_version(self) -> Text:
    return '1.0.0'

  def train(self, training_context: ClassifierTrainingContext,
            reporting_context: ClassifierReportingContext) -> Tuple[Dict, Text]:
    """
    No training is necessary; this is a heuristic model.
    """
    return dict(), None

  def predict(self, datapoint: ClassifierInput) -> Tuple[DocsizePrediction, Text]:
    """
    Classifies a document into categories EMPTY, SMALL, MEDIUM, and LARGE
    based on arbitrary thresholds defined in the constants at the top of the
    file.
    """
    if datapoint.get_ibocr() and not datapoint.get_text():
      text_content, ibocr_error = _ibocr_to_text(datapoint.get_ibocr())
      if ibocr_error:
        return None, 'Could not transform IBOCR file to text'
      datapoint.set_text(text_content)

    best_match = 'EMPTY'
    the_text = datapoint.get_text()
    if the_text:
      if len(the_text) > _DOC_LARGE:
        best_match = 'LARGE'
      elif len(the_text) > _DOC_MEDIUM:
        best_match = 'MEDIUM'
      else:
        best_match = 'SMALL'
    return DocsizePrediction(best_match), None

  def export_parameters_to_string(self) -> Tuple[Text, Text]:
    """
    Returns an empty string; this is a heuristic model.
    """
    return '', None

  def load_parameters_from_string(self, model_string: Text, model_metadata: Dict = None) -> Tuple[bool, Text]:
    """
    No-op; this is a heuristic model.
    """
    self._model_metadata = model_metadata
    return True, None
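To make this demo classifier available in the Classifier app, register it in the same module following the register_classifiers pattern shown earlier. The 'company:docsize-demo' key is illustrative:
def register_classifiers():
  return {
    'company:docsize-demo': {
      'class': DocsizeDemoClassifier
    }
  }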
Example: A document splitting model
This example creates a model for splitting a document into page ranges by class.
from instabase.ocr.client.libs import ibocr


class DocumentSplitResult(object):

  def __init__(self, doc_splits: Dict[Text, List[Tuple[int, int]]], split_type: Text = 'page-ranges', debugging_data: Dict = None) -> None:
    self.doc_splits = doc_splits
    self.split_type = split_type
    self.debugging_data = debugging_data


class DocSplitClassifier(object):

  def get_type(self) -> Text:
    return 'doc_split_classifier'

  def get_version(self) -> Text:
    return '1.0'

  def train(self, training_context: ClassifierTrainingContext,
            reporting_context: ClassifierReportingContext) -> Tuple[Dict, Text]:
    return dict(), None

  def split_doc(self, datapoint: ClassifierInput) -> Tuple[DocumentSplitResult, Text]:
    parsed_ibocr = datapoint.get_ibocr()
    num_records = parsed_ibocr.get_num_records()
    return DocumentSplitResult({
      'class1': [(1, 3), (5, 8)],
      'class2': [(4, 4)],
      'class3': [(9, 10)]
    }), None

  def export_parameters_to_string(self) -> Tuple[Text, Text]:
    return '', None

  def load_parameters_from_string(self, model_string: Text, model_metadata: Dict = None) -> Tuple[bool, Text]:
    return True, None


def register_classifiers():
  return {
    'doc_split_classifier': {
      'class': DocSplitClassifier
    }
  }
Ensemble Classifiers
Create custom Classifier ensembles to use in Flow.
Custom Ensemble Classifiers can apply:
- Multiple Classifiers to a piece of data, including custom Classifiers
- Custom logic to decide which result to select for each datapoint
- Custom pre-processing on Classifier inputs during prediction
Ensemble Classifiers are implemented in Python, placed in the scripts directory in Instabase, and then registered for use.
Implementing an Ensemble Classifier
An Ensemble Classifier is a Python class that implements the following interface:
class Classifier(object):
  """
  Interface for an Instabase classifier.
  """

  def get_type(self) -> Text:
    raise NotImplementedError('To be implemented.')

  def get_version(self) -> Text:
    raise NotImplementedError('To be implemented.')

  def get_ensemble_types(self) -> List[Text]:
    """
    Return the types of the classifiers to form an ensemble from. Classifier types
    must come from the Custom Classifiers occupying the same scripts/
    folder. An Ensemble Classifier can't request to contain another copy
    of itself, such as the type returned by self.get_type().
    """
    return ['ib:custom-classifier']

  def classifiers_will_predict(self, datapoint: ClassifierInput) -> ClassifierInput:
    """
    Called before prediction. Provides an opportunity to modify the
    input.
    """
    return datapoint

  def classifiers_did_predict(self, original_datapoint: ClassifierInput, modified_datapoint: ClassifierInput,
                              predictions: List[ClassifierPrediction]) -> Tuple[ClassifierPrediction, Text]:
    """
    Called after prediction. Requires the implementor to decide which output is correct.
    Users can additionally provide their own heuristic output as an override for special cases.
    """
    try:
      return predictions[0], None
    except Exception as e:
      return None, 'Error: {}'.format(e)
Your Ensemble Classifier is responsible for:
- Reporting its type and version.
- Reporting what Classifier types it wraps.
- Making a decision about the ensemble’s final output.
- (Optional) Manipulating incoming datapoints for prediction.
Classifier types
The valid types that an Ensemble Classifier can request in the get_ensemble_types function are those defined in Classifier types, plus any Custom Classifiers that share the same scripts folder as the Ensemble Classifier.
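For example, get_ensemble_types might return two custom Classifier types registered in the same scripts folder. The type strings below are taken from the example classifiers earlier in this page; confirm them against the types registered in your own scripts folder:
def get_ensemble_types(self) -> List[Text]:
  # Illustrative type strings; use the types defined by your own classifiers.
  return ['ib:heuristic-demo', 'doc_split_classifier']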
Non-ensemble usage scenarios
Ensemble Classifiers also provide a useful way to combine automatic classification with manual, heuristic logic for special cases.
Consider a typical ML Studio classifier trained on 15 different document types. Perhaps you observe good performance on 14/15 document types, but it keeps grouping the 15th type together with the 13th. An Ensemble Classifier could be created to wrap the ML Studio classifier and insert additional logic to handle that special case.
The classifiers_did_predict method might:
- Detect the problematic 13th document label as the output about to be returned
- Apply a series of regular expressions to the input document to test for situations that you, the human, know to distinguish the 13th document class from the 15th document class
- Correct the mistaken agglomeration in this one special case, defaulting to the ML Studio classifier in all other cases, as shown in the sketch below
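A minimal sketch of that override, assuming the problematic labels are 'type_13' and 'type_15'; both label names and the regular expression are illustrative:
import re

def classifiers_did_predict(self, original_datapoint: ClassifierInput, modified_datapoint: ClassifierInput,
                            predictions: List[ClassifierPrediction]) -> Tuple[ClassifierPrediction, Text]:
  try:
    prediction = predictions[0]
    # Only intervene in the one known confusion case.
    if prediction.best_match == 'type_13':
      text = original_datapoint.get_text() or ''
      # Illustrative pattern that you, the human, know appears only in type 15 documents.
      if re.search(r'Statement of Additional Coverage', text):
        prediction.best_match = 'type_15'
    return prediction, None
  except Exception as e:
    return None, 'Error: {}'.format(e)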
Updating legacy classifiers
scikit-learn, a Python dependency, was upgraded in release 23.07 and again in release 24.04. This change can result in breaking errors when you run flows that include legacy (non-ibformers) classifiers, including classifiers trained with the Classifier app. Updating all legacy classifiers avoids this risk.
If you didn’t encounter breaking errors after upgrading to release 23.07, you might still encounter breaking errors in release 24.04 or later, because the scope of the scikit-learn upgrade increased. This scope increase means legacy classifiers that were unaffected by the 23.07 scikit-learn upgrade are likely to be affected by the 24.04 scikit-learn upgrade.
To update legacy classifiers, do one of the following:
- (Recommended) Retrain your classifier model using Machine Learning Studio.
- Convert the model to ONNX.
Note: Some sklearn models, like KNearestNeighborClassifier, are difficult or impossible to convert. In these cases, you must use ML Studio to retrain the model.
To identify whether your classifier model might break in the upgrade, check the classifier script code, which is in the classifier module in a flow. The model might need to be updated if:
- The code imports sklearn or uses koala classifier models that use sklearn.
- Serialization/deserialization is performed using joblib.dump and joblib.load.
- The code is affected by breaking sklearn API changes.
Convert custom sklearn models to ONNX
Follow these steps to convert an sklearn classifier to ONNX, running inference with predict() and predict_proba(). If the inference code uses other sklearn methods like kneighbors(), the logic using those functions must be rewritten in terms of predict() and predict_proba().
1. Identify the classifier’s inference code and the sklearn components used in the classifier.

   In the inference code (located in the script.py file), identify the sklearn components that are called. Typically there is a preprocessing pipeline that transforms the input text, and a model that calls predict() on the transformed text.

   For example, the following is the inference code for a split classifier that involves four sklearn components:

   - self.page_vectorizer, which is of type TfidfVectorizer
   - self.segmenter, which is of type LinearSVC
   - self.doc_vectorizer, which is of type TfidfVectorizer
   - self.classifier, which is of type LinearSVC

   def split_doc(self, datapoint: ClassifierInput) -> Tuple[DocumentSplitResult, Text]:
     page_texts = [record.get_text() for record in datapoint.get_ibocr().get_ibocr_records()]

     # Step 1. Predict the location of the splits [Bundle Segmentation]
     # - Transform the page texts to page features with our page_vectorizer
     page_features = self.page_vectorizer.transform(page_texts)
     # - Predict location of segments
     firstpage_predictions = self.segmenter.predict(page_features)
     # - Transform sequence of [1, 0]-decisions into list of split ranges (i, j)
     split_indices = binary_firstpage_to_split_indexes(firstpage_predictions)

     # Step 2. For each split range (i, j), predict document class label
     # [Document Classification]
     splits = defaultdict(list)
     for i, j in split_indices:
       # get the text of these pages as a single 'document'
       text = join_page_texts(page_texts[i:j])
       # get the feature for this 'document'
       X = self.doc_vectorizer.transform([text])
       # predict the class label for this doc
       label = self.classifier.predict(X)[0]
       splits[label].append((i + 1, j))

     return DocumentSplitResult(splits), ''
2. For each sklearn component, determine the number of features and the data type of each feature that the sklearn component takes in.

   Sometimes, you can identify this information by looking at the inference code. For example, in the following code, self.doc_vectorizer takes in a single feature, which is a string containing the whole document’s text. Note that the entire document’s text counts as only one feature, as opposed to the number of features being equal to len(text).

   text = join_page_texts(page_texts[i:j])
   X = self.doc_vectorizer.transform([text])
   In other cases, the only way to find the number and data types of the features is to add log statements before the predict()/transform() call, which lets you print out that information. After adding the log statements, run the flow that uses the model, and then look at the logs to see the number and data types of the features.

   For example, you can use the following logging to determine the number and data types of the features:

   logging.info("[classifier] Type of input: " + str(type(X[0, 0])))
   logging.info("[classifier] Size of input: " + str(X.shape))
   label = self.classifier.predict(X)[0]  # predict the class label for this doc

   When run, the following logs might be returned. These logs indicate that self.classifier took in one data point with 1411 features, all of which are floats.

   [classifier] Type of input: <class 'numpy.float64'>
   [classifier] Size of input: (1, 1411)
3. Update your classifier to serialize the sklearn components as ONNX models.

   In your custom classifier, export_parameters_to_string is used to serialize the model. Update this function to serialize the sklearn components as ONNX models.

   sklearn components are serialized to strings using serialize_model_to_string. To serialize a component to an ONNX model, you must provide the tensor_types argument with the number and data types of the features.

   For example, to serialize an sklearn component that takes in any number of data points, each with 1411 float features, do the following:

   serialize_model_to_string(self.classifier, tensor_types=[('input', FloatTensorType([None, 1411]))])
   FloatTensorType(...) is an ONNX data type object. There are corresponding types for other kinds of data, such as:

   from skl2onnx.common.data_types import (
     StringTensorType, FloatTensorType, DoubleTensorType, Int64TensorType,
     Int32TensorType, BooleanTensorType, SequenceTensorType, DictionaryTensorType
   )

   See the sklearn-onnx documentation to learn more about these data types. You can see examples of values being passed in for an argument called initial_types. This argument takes in the same kind of object that tensor_types takes in.

   For a custom classifier model with multiple sklearn components, your new export_parameters_to_string function might resemble the following:

   from skl2onnx.common.data_types import (StringTensorType, FloatTensorType)

   def export_parameters_to_string(self) -> Tuple[Text, Text]:
     """
     Returns a representation of the trained model as a string, and an error
     string if applicable.
     """
     logging.info("calling export_parameters_to_string!")
     try:
       params = {
         'page_vectorizer': serialize_model_to_string(self.page_vectorizer),
         'doc_vectorizer': serialize_model_to_string(self.doc_vectorizer),
         'segmenter': serialize_model_to_string(self.segmenter),
         'classifier': serialize_model_to_string(self.classifier),
         'page_vectorizer_onnx': serialize_model_to_string(
           self.page_vectorizer, tensor_types=[('input', StringTensorType([None, 1]))]),
         'doc_vectorizer_onnx': serialize_model_to_string(
           self.doc_vectorizer, tensor_types=[('input', StringTensorType([None, 1]))]),
         'segmenter_onnx': serialize_model_to_string(
           self.segmenter, tensor_types=[('input', FloatTensorType([None, 1409]))]),
         'classifier_onnx': serialize_model_to_string(
           self.classifier, tensor_types=[('input', FloatTensorType([None, 1411]))])
       }
       return safe_json_dumps(params), ''
     except Exception as e:
       return '', 'Failed to export params to string: ' + str(e)
   Warning: In export_parameters_to_string, make sure to return an error string if an exception is raised. Doing so ensures that the ibclassifier file is not updated if something goes wrong; not doing so can result in your original classifier being erased.

   This example keeps the serialization logic for the existing page_vectorizer, doc_vectorizer, segmenter, and classifier components and creates new entries for the ONNX version of each component. Do this to keep your existing serialized components the same.
4. Update your classifier to deserialize the sklearn components as ONNX models.

   In your custom classifier, load_parameters_from_string is used to deserialize the model, and this function must call deserialize_model_from_string. You can use this function without any changes to deserialize a string into an ONNX model.

   If you did not serialize new components to your model in export_parameters_to_string in step 3, no changes need to be made to load_parameters_from_string. Otherwise, you must update load_parameters_from_string to deserialize the new components:

   def load_parameters_from_string(self, model_string: Text, model_metadata: Dict = None) -> Tuple[bool, Text]:
     """
     Loads a trained model from a string.
     """
     if not model_string:
       return True, ''
     try:
       params = json.loads(model_string)
       self.page_vectorizer = deserialize_model_from_string(params['page_vectorizer'])
       self.doc_vectorizer = deserialize_model_from_string(params['doc_vectorizer'])
       self.segmenter = deserialize_model_from_string(params['segmenter'])
       self.classifier = deserialize_model_from_string(params['classifier'])
       # Adding this if statement because this new component might not exist yet
       # in your classifier file.
       if 'page_vectorizer_onnx' in params:
         self.page_vectorizer_onnx = deserialize_model_from_string(params['page_vectorizer_onnx'])
         self.doc_vectorizer_onnx = deserialize_model_from_string(params['doc_vectorizer_onnx'])
         self.segmenter_onnx = deserialize_model_from_string(params['segmenter_onnx'])
         self.classifier_onnx = deserialize_model_from_string(params['classifier_onnx'])
       return True, ''
     except Exception as e:
       return False, 'Error loading parameters from string ' + str(e)
5. Run the flow to serialize the new components to your classifier’s .ibclassifier file.

   Call serialize_model_to_file in load_parameters_from_string, after all deserialization logic. For example:

   from instabase.flow.runnables.classify_record_fns import serialize_model_to_file

   def load_parameters_from_string(self, model_string: Text, model_metadata: Dict = None) -> Tuple[bool, Text]:
     """
     Loads a trained model from a string.
     """
     if not model_string:
       return True, ''
     try:
       params = json.loads(model_string)
       self.page_vectorizer = deserialize_model_from_string(params['page_vectorizer'])
       self.doc_vectorizer = deserialize_model_from_string(params['doc_vectorizer'])
       self.segmenter = deserialize_model_from_string(params['segmenter'])
       self.classifier = deserialize_model_from_string(params['classifier'])
       # Adding this if statement because this new component might not exist yet
       # in your classifier file.
       if 'page_vectorizer_onnx' in params:
         self.page_vectorizer_onnx = deserialize_model_from_string(params['page_vectorizer_onnx'])
         self.doc_vectorizer_onnx = deserialize_model_from_string(params['doc_vectorizer_onnx'])
         self.segmenter_onnx = deserialize_model_from_string(params['segmenter_onnx'])
         self.classifier_onnx = deserialize_model_from_string(params['classifier_onnx'])
       serialize_model_to_file("<path to .ibclassifier file>", self)
       return True, ''
     except Exception as e:
       return False, 'Error loading parameters from string ' + str(e)

   After saving the script.py file, run the flow. The new serialized ONNX components are written to the .ibclassifier file.

   Once the flow completes successfully, remove the call to serialize_model_to_file.
6. Update your model’s inference code to use ONNX models.

   Your new inference code resembles your old inference code, except that every component must use the .predict method for model predictions and .predict_proba for model confidence scores.

   For example, the inference code from step 1 would be changed to the following. Note that self.page_vectorizer.transform(...) is changed to self.page_vectorizer_onnx.predict(...), and likewise self.doc_vectorizer.transform(...) is changed to self.doc_vectorizer_onnx.predict(...).

   import numpy as np

   def split_doc(self, datapoint: ClassifierInput) -> Tuple[DocumentSplitResult, Text]:
     page_texts = [record.get_text() for record in datapoint.get_ibocr().get_ibocr_records()]

     # We need to reshape page_texts into a two-dimensional array because, in step 3,
     # we serialized page_vectorizer_onnx as a model with shape [None, 1].
     # This means the model expects a two-dimensional array, where each inner array is
     # the one feature for a data point.
     page_texts_reshaped = np.array(page_texts).reshape(-1, 1)

     # Step 1. Predict the location of the splits [Bundle Segmentation]
     # - Transform the page texts to page features with our page_vectorizer
     page_features = self.page_vectorizer_onnx.predict(page_texts_reshaped)
     # - Predict location of segments
     firstpage_predictions = self.segmenter_onnx.predict(page_features)
     # - Transform sequence of [1, 0]-decisions into list of split ranges (i, j)
     split_indices = binary_firstpage_to_split_indexes(firstpage_predictions)

     # Step 2. For each split range (i, j), predict document class label
     # [Document Classification]
     # map document class label -> list of split ranges (start, end)
     splits = defaultdict(list)
     for i, j in split_indices:
       # get the text of these pages as a single 'document'
       text = join_page_texts(page_texts[i:j])
       # We need to pass in a two-dimensional array because, in step 3, we serialized
       # doc_vectorizer_onnx as a model with shape [None, 1].
       # This means the model expects a two-dimensional array, where each inner array is
       # the one feature for a data point.
       X = self.doc_vectorizer_onnx.predict([[text]])
       # predict the class label for this doc
       label = self.classifier_onnx.predict(X)[0]
       # convert to 1-indexed, inclusive-end
       splits[label].append((i + 1, j))

     return DocumentSplitResult(splits), ''
7. Run the flow. The flow now uses ONNX to classify files.
Sklearn API changes
If the code used to run a legacy model contains any of the following, it might need to be updated in order to run in the new version.
- joblib is no longer part of sklearn.externals and must be directly imported as import joblib (see the example after this list).
- Importing from most submodules is deprecated. For any import statement of the format from sklearn.module.submodule import ClassName, use the new format from sklearn.module import ClassName.
- The classes_ and n_classes_ attributes of tree.DecisionTreeRegressor are now deprecated.
- The presort parameter is now deprecated in tree.DecisionTreeClassifier, tree.DecisionTreeRegressor, ensemble.GradientBoostingClassifier, and ensemble.GradientBoostingRegressor.
- The n_classes_ parameter is now deprecated in ensemble.GradientBoostingRegressor.
- The copy parameter is now deprecated in TfidfVectorizer.transform.
- The n_jobs parameter is now deprecated in cluster.KMeans, cluster.SpectralCoclustering, and cluster.SpectralBiclustering.
- The precompute_distances parameter of cluster.KMeans is deprecated.
- Most estimators now expose an n_features_in_ attribute.
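For example, the import-related changes above translate to updates like the following; the LinearSVC import path is only an illustration:
# Before (older sklearn):
from sklearn.externals import joblib
from sklearn.svm.classes import LinearSVC

# After (upgraded sklearn):
import joblib
from sklearn.svm import LinearSVC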