Classifier guide

Most document processing projects require the incoming documents to be sorted, or classified. Instabase provides an app called Classifier to perform this sorting operation.

In this guide, we will add an Automatic Features Classifier to our paystub project, so that incoming documents can be labeled as ADP or Gusto.

By the end of this guide, you should be able to:

  • Create an Instabase Classifier
  • Train an Instabase Classifier on multiple file types
  • Run a Metaflow that uses an Instabase Classifier

Prerequisites

Before this exercise, you should have completed the Flow guide and the Refiner guide.

For this exercise, we’ll be working with a variety of paystubs. You can download them here:

The following paystubs are unlabeled, and will be used to test the effectiveness of our Classifier.

1. Why classify?

Most of the time, a Flow will be able to process one specific type of document. There may be a Flow that extracts names from paystubs, or a Flow that extracts stats from baseball cards. In a production workflow, you usually expect to receive a variety of documents. There may be one folder that stores a collection of invoices, shipping receipts, and travel orders. Each type of document has to go through a different Flow for the information to be extracted properly.

Text extraction and other document processing often rely upon knowing the specific structure of a document in order to work best. If we’re trying to find an address, we would look in a different place on a paystub vs on a driver license, for example. Because of this, the first step of an automation pipeline is to sort the documents into their types, so that future processing can be done within a known structure. The process of sorting documents algorithmically is called classification.

Instabase provides a way to build a classifier to sort the documents as a first step in an automation process.

This app is called Classifier.

How Classifier works

Classifier uses text to classify. The input image files can either be already processed with an OCR step, or you can have Classifier run OCR on them before training. If you’re using already processed files, make sure not to rerun OCR on them. The Automatic Features Classifier looks at every pair of adjacent words, or bigram, in the sample documents for each type and calculates their frequencies.
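
To make the idea of bigram frequencies concrete, here is a minimal sketch in Python. It is not Instabase’s implementation: the function name bigram_frequencies and the simple tokenization are assumptions made for illustration only.

```python
from collections import Counter
import re


def bigram_frequencies(text: str) -> Counter:
    """Count each pair of adjacent words (bigram) in a document's text."""
    # Assumption: lowercase the text and split on non-alphanumeric characters.
    # The tokenizer that Classifier actually uses is not described in this guide.
    words = re.findall(r"[a-z0-9]+", text.lower())
    # zip(words, words[1:]) pairs every word with the word that follows it.
    return Counter(zip(words, words[1:]))


sample = "Gross Pay 1,200.00 Net Pay 950.00 Gross Pay YTD"
counts = bigram_frequencies(sample)
print(counts[("gross", "pay")])
```

Bigrams that appear often in one document type and rarely in the others are what let a classifier of this kind tell the types apart.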

Then, when Classifier looks at a new, unclassified image file, it can independently determine a confidence level for each document type. In other words, for each document type X, it estimates how sure it is that the image file belongs to type X.

If you have run a prebuilt solution before, you have probably seen a predictions.json file in your output folder. The predictions.json file shows how confident the Classifier was that each file belonged to a certain type.

For example:

{
  ".../data/input/buzz-lightyear.pdf": {
    "debugging_data": {
      "similarity_scores": {
        "toy_astronaut": 0.857,
        "pet_dog": 0.118,
        "videogame": 0.065
      },
      "max_similarity": 0.8571428571428571
    },
    "best_match": "toy_astronaut"
  }
}

In this example, Classifier has assigned the highest confidence for buzz-lightyear.pdf to the toy_astronaut type.
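
If you want to inspect these results outside the Instabase UI, a short script is enough. The sketch below assumes the file has exactly the structure shown in the example above; summarize_predictions is a hypothetical helper name, and the data is embedded as a string so the example is self-contained.

```python
import json

# The example predictions shown above, embedded so the sketch runs on its own.
predictions_text = """
{
  ".../data/input/buzz-lightyear.pdf": {
    "debugging_data": {
      "similarity_scores": {
        "toy_astronaut": 0.857,
        "pet_dog": 0.118,
        "videogame": 0.065
      },
      "max_similarity": 0.8571428571428571
    },
    "best_match": "toy_astronaut"
  }
}
"""


def summarize_predictions(predictions: dict) -> list:
    """Return (file path, best match, top score) for each classified file."""
    rows = []
    for path, entry in predictions.items():
        scores = entry["debugging_data"]["similarity_scores"]
        rows.append((path, entry["best_match"], max(scores.values())))
    return rows


for path, label, score in summarize_predictions(json.loads(predictions_text)):
    print(f"{path}: {label} ({score:.3f})")
```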

In this guide, we walk through how to create a Classifier that would produce a file of predictions.

2. Creating a Classifier

Instabase Classifiers work as a part of a Metaflow, which is a program that runs multiple Flows.

From the Classifier welcome page, fill out the name for your Classifier and select the folder to save it in. After you’ve submitted the form, you’ll be redirected to the Classifier app, where you can select the Classifier type and specify the training data source. By default, Class Mappings will be selected as the training data source. If you select Class Mappings, you have the option to edit the OCR settings for the input data.

Let’s first go through the two different ways of specifying the training data source.

Using a dataset from a Metaflow Project

One way is to use or set up Flows and run them on your input data. For this, we first need example documents for each type of document we’ll want to classify.

If we’re making a program that should classify documents as Uber receipts or Amazon receipts, for example, we should make sure we have a folder filled with Uber receipts and one filled with Amazon receipts. Then, we use Instabase’s Metaflow creator to create the folder structure a Classifier requires.

Activity

  1. Create a new workspace called classification-practice, remove any default folders, and upload (New > Upload files) the two prerequisite folders.
  2. Select New > New Project > Metaflow Project, and name your project. Click Create!
  3. Now, the project creator is building a Flow for the first document type. Call the first one ADP.
  4. Select the folder of adp_paystubs that you uploaded earlier and click Install.
  5. After it successfully installs, click Install to install another.
  6. Do the same process for the gusto_paystubs.
  7. Click Create, and then View Project.
  8. Your third folder, unsorted paystubs, is not a folder that we want to train a classifier on. This is just sample data that we’ll use later to test if our classifier works as expected. Upload this to your workspace, and notice that the data is named ‘input’.

Setting up training data manually

You do not necessarily need to set up the training data with the Metaflow Project Recipebook. As long as you have the following set up, you can select the folder containing your flows as Path to flow folder:

  • A folder containing a flow for each class.
  • Data for each class in its own folder.
  • Each flow has been run on its respective input folder.

In the Classifier Edit UI you need to select the folder containing your Flows for “Path to flow folder”.

Using existing folders or files as your dataset

Another way to define the dataset for a Classifier is to use the UI to name your classes and use the file picker to select the file or folder paths containing files belonging to that class. For this, select Class mappings. Add a new class with the Add class mapping button and give your class a name. Then you can add a new input path by using the Add input path button. You can select single files as well as folders. Only the top-level processable files in a folder will be part of the dataset.

With this method, you can either add already processed files such as ibdoc or ibocr, or add raw files, such as pdfs. Make sure that you are only adding either processed files OR raw files and not a mix of both.
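
Before adding paths, it can help to check that a set of files is not a mix of processed and raw files. The following sketch is only an illustration: dataset_kind is a hypothetical helper, and treating .ibdoc and .ibocr as the only processed extensions is an assumption based on the examples in this guide.

```python
from pathlib import PurePosixPath

# Assumption: processed files are recognized by these extensions only.
PROCESSED_EXTENSIONS = {".ibdoc", ".ibocr"}


def dataset_kind(paths) -> str:
    """Return "processed", "raw", "mixed", or "empty" for a list of file paths."""
    kinds = set()
    for path in paths:
        suffix = PurePosixPath(path).suffix.lower()
        kinds.add("processed" if suffix in PROCESSED_EXTENSIONS else "raw")
    if not kinds:
        return "empty"
    if len(kinds) > 1:
        return "mixed"
    return kinds.pop()


# Hypothetical file names, for illustration only.
print(dataset_kind(["adp/stub1.pdf", "adp/stub2.PDF"]))
print(dataset_kind(["adp/stub1.pdf", "adp/stub1.ibdoc"]))
```

A result of "mixed" means the selection should be split up before it is used as a dataset.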

If you are adding processed files, then you’re done! Proceed to the next section.

For unprocessed files, check the Data requires preprocessing? checkbox, and the UI for specifying Process Files settings will be displayed. These configuration options are largely the same as the options for a Process Files step in Flow. See Process Files config options.

Good job! You’ve created a new Classifier. Next, we’ll learn how to view and train the Classifier to sort documents.

3. Training a Classifier

A Classifier must first be trained to discover the structure of the document types so that it can recognize them when it sees new documents.

Training a Classifier from a folder of Flows

You can trigger training by clicking Train. The Classifier will read in the data. If you use Flows, then the Classifier will go through each of the Flows in flows, look at the folder that each Flow is hooked up to, and extract features from the text or ibdoc of each document.

Each Flow name becomes the class name. If we have a Flow called w2.ibflow, the class for that type of document will be w2. If we have a Flow called astronaut_license.ibflow, the class will be called astronaut_license.

If we rename a Flow, we have to train the Classifier again, because renaming a Flow does not automatically retrain the Classifier.
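
The naming rule can be sketched in a few lines of Python. Both functions below are hypothetical helpers, not part of Instabase; the second one simply compares the class names a Classifier was trained with against the current Flow file names to flag a rename.

```python
from pathlib import PurePosixPath


def class_name_for_flow(flow_path: str) -> str:
    """Derive the class name from a Flow file name, e.g. w2.ibflow -> w2."""
    return PurePosixPath(flow_path).stem


def needs_retraining(trained_classes, flow_paths) -> bool:
    """Return True when the current Flow names no longer match the trained classes."""
    current_classes = {class_name_for_flow(path) for path in flow_paths}
    return current_classes != set(trained_classes)


print(class_name_for_flow("flows/astronaut_license.ibflow"))
# A Flow was renamed from w2.ibflow to w2_form.ibflow after training.
print(needs_retraining(["w2", "astronaut_license"],
                       ["flows/w2_form.ibflow", "flows/astronaut_license.ibflow"]))
```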

Training a Classifier with Class Mapping

If you use Class Mappings, then the Classifier will read in the data from the paths in the mapping. If your data is in a processed format, just click Train. If not, see the next section, which explains the Run OCR option and when you need it.

Running OCR before training

If your data is in a raw format such as pdf or jpeg, use the toggle to turn on Run OCR before clicking Train.

Before the Classifier trains on the data in the class mappings, it runs OCR on the input, writes the output of that process to the file system, and then reads in that data to train the Classifier.

The Classifier will write out the resulting files of OCR to the location you specified in OCR output folder. For each class, it will create a folder with the name of the class. Then, for each input path you specified, a numbered folder will be created containing all the output documents from that path.
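
The resulting layout can be pictured with a small sketch. This is an illustration under stated assumptions, not the actual implementation: ocr_output_folders is a hypothetical helper, the paths are made up, and the guide does not say which number the numbered folders start from, so first_index is a guess you can change.

```python
from pathlib import PurePosixPath


def ocr_output_folders(ocr_output_folder: str, class_mappings: dict,
                       first_index: int = 0) -> dict:
    """Map each (class name, input path) pair to its expected OCR output folder."""
    layout = {}
    for class_name, input_paths in class_mappings.items():
        # One folder per class, then one numbered folder per input path.
        for number, input_path in enumerate(input_paths, start=first_index):
            folder = PurePosixPath(ocr_output_folder) / class_name / str(number)
            layout[(class_name, input_path)] = str(folder)
    return layout


# Hypothetical class mappings, for illustration only.
mappings = {
    "ADP": ["data/adp_paystubs"],
    "Gusto": ["data/gusto_paystubs", "data/extra_gusto.pdf"],
}
for key, folder in ocr_output_folders("ocr_out", mappings).items():
    print(key, "->", folder)
```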

If you add new files to the existing folders, you need to run OCR again. If you are removing files, then you need to manually delete the output folder.

During training

Which features are extracted depends on the type of Classifier you are training. If you selected Automatic Features, then bigrams will be created from the text of each document. A good number of sample documents for this Classifier to train on is somewhere between 10 and 20 for each type. Adding more known documents generally improves classification, as the Classifier will have more examples of structure for each document type.

By default, the Automatic Features Classifier includes an other option. Documents that do not seem to fit any of the defined classes will fit into the other category. A document will be of the other type if it does not reach at least a 0.4 confidence level in any of the defined categories.
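
The fallback rule can be expressed as a short function. This is a sketch of the rule as described above, not the Classifier’s actual code; the function name classify_with_other and the sample scores are made up for illustration.

```python
OTHER_THRESHOLD = 0.4  # The confidence level named in this guide.


def classify_with_other(scores: dict, threshold: float = OTHER_THRESHOLD) -> str:
    """Pick the highest-scoring class, or "other" if no class reaches the threshold."""
    if not scores:
        return "other"
    best_class = max(scores, key=scores.get)
    # A document must reach at least the threshold to keep its class.
    return best_class if scores[best_class] >= threshold else "other"


print(classify_with_other({"ADP": 0.857, "Gusto": 0.118}))
print(classify_with_other({"ADP": 0.31, "Gusto": 0.22}))
```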

Note that other built-in Classifiers such as Naive Bayes or Document Splitter Classifiers do not classify documents in an other category.

Activity

  1. View your Classifier by clicking View, under View Classifier in the viewme.ibrecipebook.
  2. Press the Train button on your classifier.ibclassifier. This step is all you need to do to train the Classifier.

4. Running the Classifier

What is a Metaflow?

A Classifier can be run only as part of a Metaflow. The binaries you have run in the “How to Use Solutions” guide are examples of full Metaflows.

We typically make one Flow to handle one document type. You might need a Flow to process:

  • Paystubs
  • Driver Licenses
  • W2 Forms
  • Any other type of document with data in it.

When we need to process multiple types of documents (say we have paystubs and W2 forms), we need to make a Flow for each one, then create a Metaflow to determine which Flow should be run on a given document. A Metaflow will first have the Classifier classify the document, and then it will route it to the appropriate Flow. The Metaflow is included in the compiled binary for solutions that Instabase delivers.
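
The classify-then-route pattern can be sketched in plain Python. Nothing here is the Instabase API: run_metaflow, toy_classifier, the stand-in Flows, and the document names are all hypothetical, and a real Classifier would be trained rather than keyword-based.

```python
from typing import Callable, Dict


def run_metaflow(
    documents: Dict[str, str],
    classify: Callable[[str], str],
    flows: Dict[str, Callable[[str], dict]],
) -> Dict[str, dict]:
    """Classify each document, then route it to the Flow for its class."""
    results = {}
    for name, text in documents.items():
        label = classify(text)
        flow = flows.get(label)
        # Documents whose class has no Flow are recorded with no output.
        output = flow(text) if flow is not None else None
        results[name] = {"class": label, "output": output}
    return results


def toy_classifier(text: str) -> str:
    """Stand-in classifier: a keyword check instead of a trained model."""
    return "w2" if "wage and tax statement" in text.lower() else "paystub"


# Stand-in Flows: each one just reports which Flow handled the text.
flows = {
    "paystub": lambda text: {"flow": "paystub", "length": len(text)},
    "w2": lambda text: {"flow": "w2", "length": len(text)},
}

documents = {
    "doc1.pdf": "Net Pay 950.00",
    "doc2.pdf": "Form W-2 Wage and Tax Statement",
}
results = run_metaflow(documents, toy_classifier, flows)
print(results["doc2.pdf"]["class"])
```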

Run a Metaflow to Run the Classifier

Activity

  1. Navigate back to the viewme.ibrecipebook by clicking on the name of your project in the navigation header (Instabase Drive > YOUR PROJECT NAME).
  2. Select Edit in the Edit/Run Metaflow section.
  3. Run the Metaflow by selecting Tools > Run.
    • Set the Run Type to be Metaflow. Specify the input folder as the Data input directory, your Classifier as the Classifier file, and the Metaflow file as your Metaflow file in the flows folder. For the Post Run: step, select Merge CSVs.
  4. Click Run.

Congratulations! You’ve just run a Metaflow with the Classifier you created. You’ve already seen most of the output types in previous guides, but we’ll look a little more at the Classification results specifically in the next section.

5. Review results

The Metaflow creates an out folder when it runs. This output folder contains three things:

  • The classification results
    • classes.json: a mapping from each template this Metaflow knows about to the documents that matched that template.
    • predictions.json: the classifier’s raw prediction log for each file.
    • class_output_folders.json: a mapping from each template this Metaflow knows about to the folder in which those results were stored.
    • _class_to_processed.json: a mapping from each template this Metaflow knows about to the locations where the processed .ibdoc files are stored.
  • Each individual flow’s results
    • You will see folders such as luke-receipts. This folder contains the individual flow output for any document that matched this template.
  • Merged results
    • joined.xls contains a spreadsheet with merged results. Note that the Instabase online spreadsheet viewer does not reveal that this sheet has multiple tabs, one per template, so we recommend you download and open it.
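
For a quick sanity check of the classification results, you can count how many documents landed in each class. The sketch below assumes classes.json is a JSON object that maps each class name to a list of matching documents; the guide describes the mapping but not the exact layout, so treat that shape, the helper name, and the sample file names as assumptions.

```python
import json

# Assumed shape of classes.json, with made-up file names for illustration.
sample_classes_json = """
{
  "ADP": ["input/paystub_01.pdf", "input/paystub_03.pdf"],
  "Gusto": ["input/paystub_02.pdf"]
}
"""


def count_documents_per_class(classes_json_text: str) -> dict:
    """Return the number of matched documents for each class."""
    classes = json.loads(classes_json_text)
    return {class_name: len(documents) for class_name, documents in classes.items()}


counts = count_documents_per_class(sample_classes_json)
for class_name, count in sorted(counts.items()):
    print(f"{class_name}: {count}")
```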

After running a Classifier, you can also view the logs and stack trace if there was any error.

Conclusion

Great! Now you know how to create and use a Classifier to sort documents.

You should be able to:

  • Create an Instabase Classifier
  • Train an Instabase Classifier on multiple file types
  • Run a Metaflow that uses an Instabase Classifier

Next, you can learn how to create your own Custom Classifiers to do more complex labeling. See the Advanced section of this training manual.