Classifier (Legacy)
Classifier runs in the Classifier Flow step and uses processed text files from an OCR step to sort documents by document type. A Classifier is trained to recognize the structure of each document type so that it can assign incoming documents to the correct type.
For hands-on guides to working with Classifier, see the Custom Classifier guide.
Classifier types
- Custom Classifier: Custom Classifiers can inherit from DocumentSplitter to become custom split documents Classifiers.
- Automatic features (deprecated): Built-in default classifier.
- Naive Bayes (deprecated): Built-in Naive Bayes classifier.
- Split Documents (deprecated): The split documents Classifiers solve the use case of document separation: separating and classifying the documents in an input bundle. A few Split Documents models are available for specific uses:
  - Split documents (fixed templates): Best for applications where documents tend to match a fixed or consistent template. This Classifier type can be trained with only one instance per label class, under the assumption that the bundled data will look very similar. Includes confidence scores in the Split Result based on a joint confidence over template continuity and document class. Scores range from [0, 1], where higher scores are better; they are not probabilities.
  - Split documents (by first page): Best for applications where documents are of unknown length but contain a consistently identifiable first page. For example, bank statements tend to have an account overview page followed by some number of transaction pages. Instabase recommends training with at least three examples per label class. Includes confidence scores, which are the classification model’s probability of the predicted class given a split range of pages.
Custom classifiers
Custom Classifiers extend the sorting capabilities of the Classifier functionality. Custom Classifiers can implement heuristic models to label documents. They are implemented as Python files in the scripts directory and must be registered for use with the platform.
Implementing a custom Classifier
A custom Classifier is responsible for:
- Reporting its type and version
- Predicting a label based on a data point
A custom Classifier is a Python class that implements the following interface:
class Classifier(object):
  """
  Interface for an Instabase classifier.
  """

  def get_type(self) -> Text:
    raise NotImplementedError('To be implemented.')

  def get_version(self) -> Text:
    raise NotImplementedError('To be implemented.')

  def train(self, training_context: ClassifierTrainingContext,
            reporting_context: ClassifierReportingContext) -> Tuple[Dict, Text]:
    """
    Deprecated method for training the classifier.
    """
    raise NotImplementedError(u'Deprecated, no need to implement.')

  def predict(self, datapoint: ClassifierInput) -> Tuple[ClassifierPrediction, Text]:
    # This method is used for traditional per-record classification and returns
    # a single label for the given datapoint.
    raise NotImplementedError('To be implemented.')

  def split_doc(self, datapoint: ClassifierInput) -> Tuple[DocumentSplitResult, Text]:
    # WARNING: This method takes precedence over 'predict()'.
    # This method is used for labeling subsections of the document and returns
    # an object as specified in the "Splitting documents" section below.
    raise NotImplementedError('To be implemented.')

  def export_parameters_to_string(self) -> Tuple[Text, Text]:
    """
    Deprecated method for returning a representation of the trained model as a string.
    """
    raise NotImplementedError(u'Deprecated, no need to implement.')

  def load_parameters_from_string(self, model_string: Text, model_metadata: Dict = None) -> Tuple[bool, Text]:
    """
    Deprecated method for loading a trained model from a string.
    """
    raise NotImplementedError(u'Deprecated, no need to implement.')

  def get_feature_types(self) -> List[Text]:
    """
    Deprecated method for selecting feature types.
    """
    raise NotImplementedError(u'Deprecated, no need to implement.')
If the split_doc method is implemented, it takes priority over predict.
Accessing runtime variables
When running a Flow, you can pass in a runtime_config, which is a map<string, string>. To access this information, use the model_metadata passed in to the load_parameters_from_string method.
ModelMetadataDict is of type:
ModelMetadataDict = TypedDict('ModelMetadataDict', {
  'runtime_config': Dict[Text, Text]
})
The key is runtime_config. For example, if the runtime_config passed into the Flow is {"key1": "val1"}, then the model_metadata will be:
{
  "runtime_config": {
    "key1": "val1"
  }
}
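For example, a custom Classifier can read the runtime_config out of model_metadata in load_parameters_from_string. The following is a minimal sketch; the attribute name self._key1 is illustrative:
def load_parameters_from_string(self, model_string: Text, model_metadata: Dict = None) -> Tuple[bool, Text]:
  # model_metadata follows ModelMetadataDict; guard against it being absent.
  runtime_config = (model_metadata or {}).get('runtime_config', {})
  # For a Flow started with runtime_config {"key1": "val1"}, this reads "val1".
  self._key1 = runtime_config.get('key1')
  return True, None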
Accessing FnContext
To access the FnContext object in a custom Classifier, call datapoint.get_fn_ctx(). The returned FnContext object holds values that can be accessed using the get_by_col_name method. The following columns are available:
- INPUT_COL is the input text. This is the same as datapoint.get_text().
- PARSED_IBOCR is the input IBOCR. This is the same as datapoint.get_ibocr().
- INPUT_FILEPATH is the path to the input IBDOC file.
- ROOT_OUTPUT_FOLDER is the path to the Run Flow/Flows/Metaflow operation’s output directory.
- CONFIG is the runtime_config (see the example after this list).
- CLIENTS contains the clients that can be used to access various resources. This consists of ibfile, which allows for filesystem access from the custom classifier.
- TOKEN_FRAMEWORK_REGISTRY allows for use of the TokenFrameworkRegistry object.
- REFINER_FNS allows for use of the REFINER_FNS object.
- LOGGER allows for use of the LOGGER object.
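For example, a predict or split_doc implementation can pull the runtime_config and logger out of the FnContext using the columns listed above. This is a minimal sketch based on the documented get_by_col_name call:
fn_ctx = datapoint.get_fn_ctx()
config, err = fn_ctx.get_by_col_name('CONFIG')
if err:
  config = {}
logger, err = fn_ctx.get_by_col_name('LOGGER')
if not err:
  logger.info('runtime_config: {}'.format(config))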
Accessing the ibfile object
To access the ibfile object, follow this example:
fn_ctx = datapoint.get_fn_ctx()
clients, err = fn_ctx.get_by_col_name('CLIENTS')
if err:
  logger.info('No clients exist')
ibfile = clients.ibfile
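As an illustration only, the ibfile client can then be used for filesystem access. The method names below (read_file, write_file) are assumptions and might differ from the API available in your environment:
# Hypothetical sketch: method names are assumptions, not confirmed API.
root_out, err = fn_ctx.get_by_col_name('ROOT_OUTPUT_FOLDER')
contents, err = ibfile.read_file(root_out + '/debug/notes.txt')
if not err:
  ibfile.write_file(root_out + '/debug/notes_copy.txt', contents)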
Predicting
Predictions return a ClassifierPrediction object that implements the following interface:
class ClassifierPrediction(object):
  """
  Wrapper for the result of a classification.
  """

  def __init__(self, best_match: Text, debugging_data=None) -> None:
    self.best_match = best_match  # type: Text
    # For anything else.
    self.debugging_data = debugging_data  # type: Dict
The only required field is best_match, a string label the Classifier has predicted. Use the debugging_data field to store additional information, such as a distribution over labels.
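For example, a predict implementation might return the top label along with a label distribution in debugging_data. This is a minimal sketch; the labels and scores are illustrative:
def predict(self, datapoint: ClassifierInput) -> Tuple[ClassifierPrediction, Text]:
  # Illustrative scores only; a real model would compute these from the datapoint.
  scores = {'invoice': 0.7, 'receipt': 0.3}
  best_match = max(scores, key=scores.get)
  prediction = ClassifierPrediction(best_match, debugging_data={'scores': scores})
  return prediction, None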
Splitting documents
To split a document by its constituent classes, get the records and word dicts from the datapoint using the get_ibocr method, then specify the split ranges as a list of page ranges per class. The predictions implement the following interface:
class DocumentSplitResult(object):
  """
  Wrapper for the result of a document split.
  """

  def __init__(self, doc_splits: Dict[Text, List[Tuple]], split_type: Text = 'page-ranges', debugging_data: Dict = None) -> None:
    self.doc_splits = doc_splits
    self.split_type = split_type
    # For anything else.
    self.debugging_data: Dict = debugging_data
The doc_splits field is a dictionary mapping each class name to a list of page ranges with an optional confidence score. The first two items in each tuple are the start and end page number of a record. Page ranges are 1-indexed and inclusive, so (1, 3) specifies pages 1, 2, and 3 (there is no page 0). A classifier confidence score, a float, can be returned as the optional third item in the tuple.
The split_type field specifies how to split the document. page-ranges specifies that documents are split using page ranges. For example, to split by lists of pages:
DocumentSplitResult({
  'class1': [(1, 3), (5, 8)],
  'class2': [(4, 4)],
  'class3': [(9, 10, 90.0)]  # the record spans pages 9 and 10 in the file, and the confidence score is 90.0
}, split_type='page-ranges')
Registering a custom Classifier
To make your custom Classifier available, register it inside the same Python module that defines it by creating a special register_classifiers function.
This function returns a Python dictionary of the following form:
def register_classifiers():
  return {
    'company:classifier-name': {
      'class': MyClassifierClass
    }
  }
Using a custom Classifier
When you create a classifier in the Classifier app, select Custom as the type, and select the scripts folder that contains your custom Classifier implementations. New Classifiers appear as options in the creation dialog.
Example: a heuristic model
Let’s create an example model that uses heuristics. Suppose we have incoming documents that we’d like to sort by length: small, medium, and large. We can create a custom Classifier that maps the length of these documents to a size category.
from typing import Text, Dict, List, Union, Tuple, Any, Callable, Set

# Note: Your custom code will not extend the classes here;
# instead, simply follow their structure as defined above.

_PAGE_IN_CHARS = 80 * 48
_DOC_MEDIUM = 2 * _PAGE_IN_CHARS
_DOC_LARGE = 10 * _PAGE_IN_CHARS


def _ibocr_to_text(ibocr: List[Dict]) -> Tuple[Text, Text]:
  # Copy this function to extract text from an IBOCR.
  """
  Transform an IBOCR file in Python representation into a string containing
  the concatenation of each page.
  """
  page_text = []
  if len(ibocr):
    for page_num in range(len(ibocr)):
      if ibocr[page_num]['text']:
        page_text.append(ibocr[page_num]['text'])
  return '\n'.join(page_text), None


class DocsizePrediction(object):  # You don't need to subclass ClassifierPrediction

  def __init__(self, best_match: Text):
    self.best_match = best_match
    self.debugging_data = dict()


class DocsizeDemoClassifier(object):  # You don't need to subclass Classifier
  """
  This is a demo heuristic classifier.
  """

  def __init__(self) -> None:
    self._model_metadata = None

  def get_type(self) -> Text:
    return 'ib:heuristic-demo'

  def get_version(self) -> Text:
    return '1.0.0'

  def train(self, training_context: ClassifierTrainingContext,
            reporting_context: ClassifierReportingContext) -> Tuple[Dict, Text]:
    """
    No training is necessary; this is a heuristic model.
    """
    return dict(), None

  def predict(self, datapoint: ClassifierInput) -> Tuple[DocsizePrediction, Text]:
    """
    Classifies a document into categories EMPTY, SMALL, MEDIUM, and LARGE
    based on arbitrary thresholds defined in the constants at the top of the
    file.
    """
    if datapoint.get_ibocr() and not datapoint.get_text():
      text_content, ibocr_error = _ibocr_to_text(datapoint.get_ibocr())
      if ibocr_error:
        return None, 'Could not transform IBOCR file to text'
      datapoint.set_text(text_content)

    best_match = 'EMPTY'
    the_text = datapoint.get_text()
    if the_text:
      if len(the_text) > _DOC_LARGE:
        best_match = 'LARGE'
      elif len(the_text) > _DOC_MEDIUM:
        best_match = 'MEDIUM'
      else:
        best_match = 'SMALL'
    return DocsizePrediction(best_match), None

  def export_parameters_to_string(self) -> Tuple[Text, Text]:
    """
    Returns an empty string; this is a heuristic model.
    """
    return '', None

  def load_parameters_from_string(self, model_string: Text, model_metadata: Dict = None) -> Tuple[bool, Text]:
    """
    No-op; this is a heuristic model.
    """
    self._model_metadata = model_metadata
    return True, None
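To make this demo classifier available in the Classifier app, register it in the same module following the register_classifiers pattern shown earlier. The 'company:docsize-demo' key is illustrative:
def register_classifiers():
  return {
    'company:docsize-demo': {
      'class': DocsizeDemoClassifier
    }
  }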
Example: A document splitting model
This example creates a model for splitting a document into page ranges by class.
from instabase.ocr.client.libs import ibocr


class DocumentSplitResult(object):

  def __init__(self, doc_splits: Dict[Text, List[Tuple[int, int]]], split_type: Text = 'page-ranges', debugging_data: Dict = None) -> None:
    self.doc_splits = doc_splits
    self.split_type = split_type
    self.debugging_data = debugging_data


class DocSplitClassifier(object):

  def get_type(self) -> Text:
    return 'doc_split_classifier'

  def get_version(self) -> Text:
    return '1.0'

  def train(self, training_context: ClassifierTrainingContext,
            reporting_context: ClassifierReportingContext) -> Tuple[Dict, Text]:
    return dict(), None

  def split_doc(self, datapoint: ClassifierInput) -> Tuple[DocumentSplitResult, Text]:
    parsed_ibocr = datapoint.get_ibocr()
    num_records = parsed_ibocr.get_num_records()
    return DocumentSplitResult({
      'class1': [(1, 3), (5, 8)],
      'class2': [(4, 4)],
      'class3': [(9, 10)]
    }), None

  def export_parameters_to_string(self) -> Tuple[Text, Text]:
    return '', None

  def load_parameters_from_string(self, model_string: Text, model_metadata: Dict = None) -> Tuple[bool, Text]:
    return True, None


def register_classifiers():
  return {
    'doc_split_classifier': {
      'class': DocSplitClassifier
    }
  }
Ensemble Classifiers
Create custom Classifier ensembles to use in Flow.
Custom Ensemble Classifiers can apply:
- Multiple Classifiers to a piece of data, including custom Classifiers
- Custom logic to decide which result to select for each datapoint
- Custom pre-processing on Classifier inputs during prediction
Ensemble Classifiers are implemented in Python, placed in the scripts directory in Instabase, and then registered for use.
Implementing an Ensemble Classifier
An Ensemble Classifier is a Python class that implements the following interface:
class Classifier(object):
  """
  Interface for an Instabase classifier.
  """

  def get_type(self) -> Text:
    raise NotImplementedError('To be implemented.')

  def get_version(self) -> Text:
    raise NotImplementedError('To be implemented.')

  def get_ensemble_types(self) -> List[Text]:
    """
    Return the types of the classifiers to form an ensemble from. Classifier types
    must come from the Custom Classifiers occupying the same scripts/
    folder. An Ensemble Classifier can't request to contain another copy
    of itself, such as the type returned by self.get_type().
    """
    return ['ib:custom-classifier']

  def classifiers_will_predict(self, datapoint: ClassifierInput) -> ClassifierInput:
    """
    Called before prediction. Provides an opportunity to modify the
    input.
    """
    return datapoint

  def classifiers_did_predict(self, original_datapoint: ClassifierInput, modified_datapoint: ClassifierInput,
                              predictions: List[ClassifierPrediction]) -> Tuple[ClassifierPrediction, Text]:
    """
    Called after prediction. Requires the implementor to decide which output is correct.
    Users can additionally provide their own heuristic output as an override for special cases.
    """
    try:
      return predictions[0], None
    except Exception as e:
      return None, 'Error: {}'.format(e)
Your Ensemble Classifier is responsible for:
- Reporting its type and version.
- Reporting what Classifier types it wraps.
- Making a decision about the ensemble’s final output.
- (Optional) Manipulating incoming datapoints for prediction.
Classifier types
The valid types that an Ensemble Classifier can request in the get_ensemble_types function are those defined in Classifier types, plus any Custom Classifiers that share the same scripts folder as the Ensemble Classifier.
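For example, get_ensemble_types might return two custom Classifier types registered in the same scripts folder. The type strings below are taken from the example classifiers earlier in this page; confirm them against the types registered in your own scripts folder:
def get_ensemble_types(self) -> List[Text]:
  # Illustrative type strings; use the types defined by your own classifiers.
  return ['ib:heuristic-demo', 'doc_split_classifier']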
Non-ensemble usage scenarios
Ensemble Classifiers also provide a useful way to combine automatic classification with manual, heuristic logic for special cases.
Consider a typical ML Studio classifier trained on 15 different document types. Perhaps you observe good performance on 14/15 document types, but it keeps grouping the 15th type together with the 13th. An Ensemble Classifier could be created to wrap the ML Studio classifier and insert additional logic to handle that special case.
The classifiers_did_predict method might:
- Detect the problematic 13th document label as the output about to be returned
- Apply a series of regular expressions to the input document to test for situations that you, the human, know to distinguish the 13th document class from the 15th document class
- Correct the mistaken agglomeration in this one special case, defaulting to the ML Studio classifier in all other cases, as shown in the sketch below
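A minimal sketch of that override, assuming the problematic labels are 'type_13' and 'type_15'; both label names and the regular expression are illustrative:
import re

def classifiers_did_predict(self, original_datapoint: ClassifierInput, modified_datapoint: ClassifierInput,
                            predictions: List[ClassifierPrediction]) -> Tuple[ClassifierPrediction, Text]:
  try:
    prediction = predictions[0]
    # Only intervene in the one known confusion case.
    if prediction.best_match == 'type_13':
      text = original_datapoint.get_text() or ''
      # Illustrative pattern that you, the human, know appears only in type 15 documents.
      if re.search(r'Statement of Additional Coverage', text):
        prediction.best_match = 'type_15'
    return prediction, None
  except Exception as e:
    return None, 'Error: {}'.format(e)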
Updating legacy classifiers
scikit-learn, a Python dependency, was upgraded in release 23.07 and again in release 24.04. This change can result in breaking errors when you run flows that include legacy (non-ibformers) classifiers, including classifiers trained with the Classifier app. Updating all legacy classifiers avoids this risk.
If you didn’t encounter breaking errors after upgrading to release 23.07, you might still encounter breaking errors in release 24.04 or later, because the scope of the scikit-learn upgrade increased. This scope increase means legacy classifiers that were unaffected by the 23.07 scikit-learn upgrade are likely to be affected by the 24.04 scikit-learn upgrade.
To update legacy classifiers, do one of the following:
- (Recommended) Retrain your classifier model using Machine Learning Studio.
- Convert the model to ONNX.
Note: Some sklearn models, like KNearestNeighborClassifier, are difficult or impossible to convert. In these cases, you must use ML Studio to retrain the model.
To identify whether your classifier model might break in the upgrade, check the classifier script code, which is in the classifier module in a flow. The model might need to be updated if:
- The code imports sklearn or uses koala classifier models that use sklearn.
- Serialization/deserialization is performed using joblib.dump and joblib.load.
- The code is affected by breaking sklearn API changes.
Convert custom sklearn models to ONNX
Follow these steps to convert an sklearn classifier to ONNX, running inference with predict() and predict_proba(). If the inference code uses other sklearn methods like kneighbors(), the logic using those functions must be rewritten in terms of predict() and predict_proba().
1. Identify the classifier’s inference code and the sklearn components used in the classifier.

   In the inference code (located in the script.py file), identify the sklearn components that are called. Typically there is a preprocessing pipeline that transforms the input text, and a model that calls predict() on the transformed text.

   For example, the following is the inference code for a split classifier that involves four sklearn components:

   - self.page_vectorizer, which is of type TfidfVectorizer
   - self.segmenter, which is of type LinearSVC
   - self.doc_vectorizer, which is of type TfidfVectorizer
   - self.classifier, which is of type LinearSVC

   def split_doc(self, datapoint: ClassifierInput) -> Tuple[DocumentSplitResult, Text]:
     page_texts = [record.get_text() for record in datapoint.get_ibocr().get_ibocr_records()]

     # Step 1. Predict the location of the splits [Bundle Segmentation]
     # - Transform the page texts to page features with our page_vectorizer
     page_features = self.page_vectorizer.transform(page_texts)
     # - Predict location of segments
     firstpage_predictions = self.segmenter.predict(page_features)
     # - Transform sequence of [1, 0]-decisions into list of split ranges (i, j)
     split_indices = binary_firstpage_to_split_indexes(firstpage_predictions)

     # Step 2. For each split range (i, j), predict document class label
     # [Document Classification]
     splits = defaultdict(list)
     for i, j in split_indices:
       # get the text of these pages as a single 'document'
       text = join_page_texts(page_texts[i:j])
       # get the feature for this 'document'
       X = self.doc_vectorizer.transform([text])
       # predict the class label for this doc
       label = self.classifier.predict(X)[0]
       splits[label].append((i + 1, j))

     return DocumentSplitResult(splits), ''
2. For each sklearn component, determine the number of features and the data type of each feature that the sklearn component takes in.

   Sometimes, you can identify this information by looking at the inference code. For example, in the following code, self.doc_vectorizer takes in a single feature, which is a string containing the whole document’s text. Note that the entire document’s text counts as only one feature, as opposed to the number of features being equal to len(text).

   text = join_page_texts(page_texts[i:j])
   X = self.doc_vectorizer.transform([text])
   In other cases, the only way to find the number and data types of the features is to add log statements before the predict()/transform() call, which lets you print out that information. After adding the log statements, run the flow that uses the model, and then look at the logs to see the number and data types of the features.

   For example, you can use the following logging to determine the number and data types of the features:

   logging.info("[classifier] Type of input: " + str(type(X[0, 0])))
   logging.info("[classifier] Size of input: " + str(X.shape))
   label = self.classifier.predict(X)[0]  # predict the class label for this doc

   When run, the following logs might be returned. These logs indicate that self.classifier took in one data point with 1411 features, all of which are floats.

   [classifier] Type of input: <class 'numpy.float64'>
   [classifier] Size of input: (1, 1411)
3. Update your classifier to serialize the sklearn components as ONNX models.

   In your custom classifier, export_parameters_to_string is used to serialize the model. Update this function to serialize the sklearn components as ONNX models.

   sklearn components are serialized to strings using serialize_model_to_string. To serialize a component to an ONNX model, you must provide the tensor_types argument with the number and data types of the features.

   For example, to serialize an sklearn component that takes in any number of data points, each with 1411 float features, do the following:

   serialize_model_to_string(self.classifier, tensor_types=[('input', FloatTensorType([None, 1411]))])
   FloatTensorType(...) is an ONNX data type object. There are corresponding types for other kinds of data, such as:

   from skl2onnx.common.data_types import (
     StringTensorType, FloatTensorType, DoubleTensorType, Int64TensorType,
     Int32TensorType, BooleanTensorType, SequenceTensorType, DictionaryTensorType
   )

   See the sklearn-onnx documentation to learn more about these data types. You can see examples of values being passed in for an argument called initial_types. This argument takes in the same kind of object that tensor_types takes in.

   For a custom classifier model with multiple sklearn components, your new export_parameters_to_string function might resemble the following:

   from skl2onnx.common.data_types import (StringTensorType, FloatTensorType)

   def export_parameters_to_string(self) -> Tuple[Text, Text]:
     """
     Returns a representation of the trained model as a string, and an error
     string if applicable.
     """
     logging.info("calling export_parameters_to_string!")
     try:
       params = {
         'page_vectorizer': serialize_model_to_string(self.page_vectorizer),
         'doc_vectorizer': serialize_model_to_string(self.doc_vectorizer),
         'segmenter': serialize_model_to_string(self.segmenter),
         'classifier': serialize_model_to_string(self.classifier),
         'page_vectorizer_onnx': serialize_model_to_string(
           self.page_vectorizer, tensor_types=[('input', StringTensorType([None, 1]))]),
         'doc_vectorizer_onnx': serialize_model_to_string(
           self.doc_vectorizer, tensor_types=[('input', StringTensorType([None, 1]))]),
         'segmenter_onnx': serialize_model_to_string(
           self.segmenter, tensor_types=[('input', FloatTensorType([None, 1409]))]),
         'classifier_onnx': serialize_model_to_string(
           self.classifier, tensor_types=[('input', FloatTensorType([None, 1411]))])
       }
       return safe_json_dumps(params), ''
     except Exception as e:
       return '', 'Failed to export params to string: ' + str(e)
   Warning: In export_parameters_to_string, make sure to return an error string if an exception is raised. Doing so ensures that the ibclassifier file is not updated if something goes wrong; not doing so can result in your original classifier being erased.

   This example keeps the serialization logic for the existing page_vectorizer, doc_vectorizer, segmenter, and classifier components and creates new entries for the ONNX version of each component. Do this to keep your existing serialized components the same.
4. Update your classifier to deserialize the sklearn components as ONNX models.

   In your custom classifier, load_parameters_from_string is used to deserialize the model, and this function must call deserialize_model_from_string. You can use this function without any changes to deserialize a string into an ONNX model.

   If you did not serialize new components to your model in export_parameters_to_string in step 3, no changes need to be made to load_parameters_from_string. Otherwise, you must update load_parameters_from_string to deserialize the new components:

   def load_parameters_from_string(self, model_string: Text, model_metadata: Dict = None) -> Tuple[bool, Text]:
     """
     Loads a trained model from a string.
     """
     if not model_string:
       return True, ''
     try:
       params = json.loads(model_string)
       self.page_vectorizer = deserialize_model_from_string(params['page_vectorizer'])
       self.doc_vectorizer = deserialize_model_from_string(params['doc_vectorizer'])
       self.segmenter = deserialize_model_from_string(params['segmenter'])
       self.classifier = deserialize_model_from_string(params['classifier'])
       # Adding this if statement because this new component might not exist yet
       # in your classifier file.
       if 'page_vectorizer_onnx' in params:
         self.page_vectorizer_onnx = deserialize_model_from_string(params['page_vectorizer_onnx'])
         self.doc_vectorizer_onnx = deserialize_model_from_string(params['doc_vectorizer_onnx'])
         self.segmenter_onnx = deserialize_model_from_string(params['segmenter_onnx'])
         self.classifier_onnx = deserialize_model_from_string(params['classifier_onnx'])
       return True, ''
     except Exception as e:
       return False, 'Error loading parameters from string ' + str(e)
5. Run the flow to serialize the new components to your classifier’s .ibclassifier file.

   Call serialize_model_to_file in load_parameters_from_string, after all deserialization logic. For example:

   from instabase.flow.runnables.classify_record_fns import serialize_model_to_file

   def load_parameters_from_string(self, model_string: Text, model_metadata: Dict = None) -> Tuple[bool, Text]:
     """
     Loads a trained model from a string.
     """
     if not model_string:
       return True, ''
     try:
       params = json.loads(model_string)
       self.page_vectorizer = deserialize_model_from_string(params['page_vectorizer'])
       self.doc_vectorizer = deserialize_model_from_string(params['doc_vectorizer'])
       self.segmenter = deserialize_model_from_string(params['segmenter'])
       self.classifier = deserialize_model_from_string(params['classifier'])
       # Adding this if statement because this new component might not exist yet
       # in your classifier file.
       if 'page_vectorizer_onnx' in params:
         self.page_vectorizer_onnx = deserialize_model_from_string(params['page_vectorizer_onnx'])
         self.doc_vectorizer_onnx = deserialize_model_from_string(params['doc_vectorizer_onnx'])
         self.segmenter_onnx = deserialize_model_from_string(params['segmenter_onnx'])
         self.classifier_onnx = deserialize_model_from_string(params['classifier_onnx'])
       serialize_model_to_file("<path to .ibclassifier file>", self)
       return True, ''
     except Exception as e:
       return False, 'Error loading parameters from string ' + str(e)

   After saving the script.py file, run the flow. The new serialized ONNX components are written to the .ibclassifier file.

   Once the flow completes successfully, remove the call to serialize_model_to_file.
6. Update your model’s inference code to use ONNX models.

   Your new inference code resembles your old inference code, except that every component must use the .predict method for model predictions and .predict_proba for model confidence scores.

   For example, the inference code from step 1 would be changed to the following. Note that self.page_vectorizer.transform(...) is changed to self.page_vectorizer_onnx.predict(...), and likewise self.doc_vectorizer.transform(...) is changed to self.doc_vectorizer_onnx.predict(...).

   import numpy as np

   def split_doc(self, datapoint: ClassifierInput) -> Tuple[DocumentSplitResult, Text]:
     page_texts = [record.get_text() for record in datapoint.get_ibocr().get_ibocr_records()]

     # We need to reshape page_texts into a two-dimensional array because, in step 3,
     # we serialized page_vectorizer_onnx as a model with shape [None, 1].
     # This means the model expects a two-dimensional array, where each inner array is
     # the one feature for a data point.
     page_texts_reshaped = np.array(page_texts).reshape(-1, 1)

     # Step 1. Predict the location of the splits [Bundle Segmentation]
     # - Transform the page texts to page features with our page_vectorizer
     page_features = self.page_vectorizer_onnx.predict(page_texts_reshaped)
     # - Predict location of segments
     firstpage_predictions = self.segmenter_onnx.predict(page_features)
     # - Transform sequence of [1, 0]-decisions into list of split ranges (i, j)
     split_indices = binary_firstpage_to_split_indexes(firstpage_predictions)

     # Step 2. For each split range (i, j), predict document class label
     # [Document Classification]
     # map document class label -> list of split ranges (start, end)
     splits = defaultdict(list)
     for i, j in split_indices:
       # get the text of these pages as a single 'document'
       text = join_page_texts(page_texts[i:j])
       # We need to pass in a two-dimensional array because, in step 3, we serialized
       # doc_vectorizer_onnx as a model with shape [None, 1].
       # This means the model expects a two-dimensional array, where each inner array is
       # the one feature for a data point.
       X = self.doc_vectorizer_onnx.predict([[text]])
       # predict the class label for this doc
       label = self.classifier_onnx.predict(X)[0]
       # convert to 1-indexed, inclusive-end
       splits[label].append((i + 1, j))

     return DocumentSplitResult(splits), ''
7. Run the flow. The flow now uses ONNX to classify files.
Sklearn API changes
If the code used to run a legacy model contains any of the following, it might need to be updated in order to run in the new version.
- joblib is no longer part of sklearn.externals and must be directly imported as import joblib (see the example after this list).
- Importing from most submodules is deprecated. For any import statement of the format from sklearn.module.submodule import ClassName, use the new format from sklearn.module import ClassName.
- The classes_ and n_classes_ attributes of tree.DecisionTreeRegressor are now deprecated.
- The presort parameter is now deprecated in tree.DecisionTreeClassifier, tree.DecisionTreeRegressor, ensemble.GradientBoostingClassifier, and ensemble.GradientBoostingRegressor.
- The n_classes_ parameter is now deprecated in ensemble.GradientBoostingRegressor.
- The copy parameter is now deprecated in TfidfVectorizer.transform.
- The n_jobs parameter is now deprecated in cluster.KMeans, cluster.SpectralCoclustering, and cluster.SpectralBiclustering.
- The precompute_distances parameter of cluster.KMeans is deprecated.
- Most estimators now expose an n_features_in_ attribute.
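For example, the import-related changes above translate to updates like the following; the LinearSVC import path is only an illustration:
# Before (older sklearn):
from sklearn.externals import joblib
from sklearn.svm.classes import LinearSVC

# After (upgraded sklearn):
import joblib
from sklearn.svm import LinearSVC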