23.01 Release notes

Release 23.01 is a major release that introduces new features, enhancements, and bug fixes.

Release 23.01.22

Bug fixes

Platform

  • Customers using Microsoft Authenticator for MFA couldn’t use separate tokens for multiple environments because all environments were assigned the name Instabase. With this fix, each environment’s name is assigned a unique UUID suffix.

  • Subspaces didn’t load in space selector dropdowns under certain circumstances.

Flow Review

  • After you reclassify a record, you can apply either a class schema that combines all previous extraction steps (by default) or the class schema of the most recently run extraction step. Extraction steps are any Apply Refiner step or Run Extraction Model step in the flow. The record’s fields are replaced by the fields from the new schema.

Release 23.01.10

Bug fixes

Flow Review

  • You can select the schema to apply to a record after you reclassify it by choosing either to apply the schema of the most recently run extraction step for the class or a combination of all previous extraction steps. Extraction steps are any Apply Refiner step or Run Extraction Model step in the flow. The record’s current fields will be replaced by the fields from the new schema.

ML Studio

  • Model artifacts resulting from annotation analysis jobs could be published to Marketplace or used in solutions, even though these artifacts are meant to be used only for annotation assist.

  • Erroneous dataset modification warnings appeared if data was stored on NFS drives or if users were editing different datasets in the same model.

Refiner

  • A refiner with more than one dropdown output field incorrectly contained the same value in all dropdown output fields.

  • When running flows with an extraction model step, an incorrect FieldExtraction error occurred.

Release 23.01.0

New features and enhancements

Infrastructure

  • Public preview | Inter-pod traffic is routed through service mesh, which standardizes request routing across all services within an Instabase cluster. Using service mesh provides better service discovery and load balancing, as well as improved observability. This change also provides extensibility to add mTLS support and build custom routing strategies, such as sharding (used by model-service) and hashing, enabling further performance, efficiency, and reliability improvements.

    Service mesh uses Envoy as a proxy for all traffic between pods. A custom-built control plane, mesh-manager, configures the mesh, which comprises an Envoy proxy sidecar running on each pod.

    Service mesh has the following requirements:

    • Each Envoy proxy sidecar requests 10m CPU and 100MiB memory.

    • Service mesh is only supported in deployments that use Deployment Manager.

  • OpenSearch replaces ElasticSearch as the primary search infrastructure for the Instabase platform. As part of this migration, index lifecycle management (index rollover and deletion) is enabled through the Instabase index management system. This change is backward compatible with environments running managed ElasticSearch instances.

Platform

Tip

With the public preview release of Solution Builder, the interface of some Instabase apps has changed. You can toggle Instabase’s Solution Builder mode back to Classic mode by clicking Solution Builder > Classic mode in the top menu bar.

  • Public preview | Solution Builder simplifies solution development, helping you more easily build, train, and debug complete data processing solutions. A single centralized project workspace brings together everything you need.

    Low-code tools and shortcuts streamline uploading and annotating documents, training classification and extraction models, refining and validating data, and creating an automated flow.

    Solution Builder provides you with the right artifacts—such as documents, user-defined functions, refiners, and models—at the right time, so that you no longer have to organize your files and modules manually in the filesystem.

    A guided interactive tutorial in Solution Builder walks you through the development process so that you can get started developing your solutions right away. See the Solution Builder documentation for more details.

  • Public preview | You can review diagnostics for the state of your database from the Admin Diagnostics app, including tests for connectivity, reading, inserts, updates, deletes, scans, and more.

  • The All apps display has a refreshed appearance:

    • You can scroll through the list of available apps, instead of clicking through pages.

    • Apps downloaded from the Marketplace no longer display in the All apps view.

    • App icon labels identify which apps are in preview.

File system
  • This release adds support for both Azure Blob Storage (general availability) and Google Cloud Storage (public preview) as file storage options. You can use either option as the primary storage backing your Instabase Drive, or you can mount the Azure Blob Storage or Google Cloud Storage drives to subspaces. Both Azure Blob Storage and Google Cloud Storage can be configured using the Instabase Installer.

    Note

    Support for Google Cloud Storage as a file client is in public preview as of release 23.01. While in public preview, it should only be used as a file storage option for non-production environments, such as developer environments. Azure Blob Storage as a file client is GA as of release 23.01 and supported on major releases 22.08 and later.

  • A new file service cache component improves system performance under read-heavy workloads by storing and fetching data directly from an in-memory cache. By avoiding round trips to disk, this change improves the response time for Stat and Read calls by >50%. For storage systems that lack strong consistency, such as network file systems (NFS), the file service cache also improves consistency guarantees, because contents can be served from the cache before they are fully written to disk.

    There are some limitations to using the file service cache:

    • Cached entries are only available in the cache for a short period of time, typically 5 minutes.
    • Large files are not cached. The default limit for cached files is 50MB, though this can be adjusted.
    • The file service cache does not support encrypted drives.

    To enable the file service cache component, set the environment variable ENABLE_FILE_CACHE to True in the grpc-file-service container of the file-tservice deployment. Using this feature requires a separate Redis instance, redis-file-service, dedicated to file caching.
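
    The configuration step above can be sketched as a standard Kubernetes patch. This is a hedged sketch only: the deployment and container names are taken from elsewhere in these notes, but the exact patch format and file names for your Deployment Manager setup may differ.

```yaml
# Hypothetical sketch: enable the file service cache by setting
# ENABLE_FILE_CACHE on the grpc-file-service container. Verify the
# deployment and container names against your release bundle before
# applying anything.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-file-tservice
spec:
  template:
    spec:
      containers:
        - name: grpc-file-service
          env:
            - name: ENABLE_FILE_CACHE
              value: "True"
```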

Model service
  • The model service no longer strictly requires a persistent volume (PV) to use models. While a PV is still highly recommended for performance, this change offers flexibility for environments that can’t provide PVs that meet the required PV specifications.

    To use the model service without a PV, the deployment must be patched and nodes must have sufficient disk space (>50 GB) to mount a large local volume to model-service pods. These requirements help ensure good model inference throughput. To learn more, see the Store model on local disk section of the Model training and inference requirements documentation.

  • Several new features improve model service visibility and help with debugging model service-related issues and errors:

    • Logs include a task_id value for both Refiner and Extraction steps, making it easier to identify task-specific issues.

    • The job status API returns detailed error messages when querying the status of asynchronous model service execution jobs. Previously, the API returned a generic error message when the model service encountered errors during asynchronous execution.

    • Flow trace has visibility into internal steps in model service.

  • Public preview | Open Neural Network Exchange (ONNX) runtime is supported for the layoutlm_base model (classification and extraction tasks) and instalm_base model (extraction tasks). This implementation reduces both inference time and memory usage and can be enabled by setting the enable_onnx_runtime_optimization hyperparameter to true.

  • The model service supports multiple versions of ibformers. This change means models trained on previous ibformers versions do not become obsolete after an ibformers upgrade and can still be used for model inference, alongside models trained on a newer ibformers version.

Model training
  • ibformers 2.0 is available and introduces a variety of features and improvements. With this release, ibformers now:

    • Public preview | Supports list fields: extraction models are able to predict multiple text items instead of a single text value.

    • Includes training size metadata metrics.

    • Reports the support metric for extraction tasks.

    • Writes a confusion matrix file for classification and split-classification tasks.

    • Makes the long-doc retriever and multilabel features available as separate checkboxes for all base models.

    • Reduces the base model image size by half, from 14 GB to 7 GB.

    • Improves fault tolerance in model downloading and caching.

    • Outputs extraction model inference predictions in the same order for each document and handles OCR variations better.

    This release requires a minimum ibformers version of 2.0, which also requires an update to the Marketplace base models. See the Deployment Guide section for more information.

Deployment Manager

  • Deployment Manager is the new name for the Instabase control plane. The updated name reflects the full suite of features Deployment Manager offers for installing and managing your Instabase infrastructure. As of platform release 23.01, Deployment Manager has also joined the same release cadence and versioning as platform releases.

  • The Instabase Installer has several UI updates and new features:

    • The Preflight Checks step runs a suite of validation tests before beginning installation. Available preflight checks include verifying database and file system credentials and access, confirming Kubernetes service account permissions, and ensuring listed PVCs meet requirements for installation.

    • A Platform Configuration step lets you upload your Instabase license, network policies, base configurations, and default patches as part of installation.

    • Azure Blob Storage and Google Cloud Storage are added as options in the File Storage step.

    • The post-install actions feature lets you initiate and automate key post-installation steps directly from the Deployment Manager UI. Available post-install actions include installing and configuring Instabase’s training package and base models (public preview), and configuring Marketplace for use.

  • Telescope is out of beta and available by default in all environments. Telescope also has several usability improvements, including tooltips, more intuitive UI interactions, and a button to generate a link to your query results.

  • From the new Logs page, you can view, search, and filter logs from across the Instabase platform. You can view logs for services, pods, and containers, and filter results by keyword and by date and time range. You can also generate a link to your query results with one click.

  • Single sign-on to Grafana through Deployment Manager is available. To navigate to Grafana directly from Deployment Manager, click the Monitor button on the Deployments tab of the Status Dashboard; from Grafana, you can monitor CPU and memory usage of workloads. A Monitor button is also available on each individual deployment’s details page.

  • You can access container insights from the status dashboard, and load live configurations for a running container. Select a deployment from the Deployments tab of the Dashboard, then click through to a pod to find a list of containers, each with its own Insights button.

Flow

  • Two new clients, job_client and flow_client, are available to use in user-defined functions (UDFs). With these clients, you can interact with job and Flow APIs directly from your UDFs, without needing to pass auth tokens via runtime flags or make external API calls.

    job_client can access job APIs and has methods to pause, resume, retry, cancel, list, get the status of, and update pipelines of a job. flow_client can interact with Flow APIs, letting you run new flows and query flow status. To learn more, see the UDF clients documentation.
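
    As a sketch of how these clients might be combined in a UDF helper: the method names below (get_status, retry, run_flow) are assumptions, not the documented client API; consult the UDF clients documentation for the actual signatures.

```python
# Hypothetical sketch of a helper that a UDF might call with the new
# clients. Method names (get_status, retry, run_flow) are assumptions,
# not the documented client API.
def retry_if_failed(job_client, flow_client, job_id, flow_path):
    """Retry a failed job, or launch a fresh flow run if the retry errors."""
    status = job_client.get_status(job_id)
    if status != 'FAILED':
        return status
    try:
        job_client.retry(job_id)
        return 'RETRIED'
    except Exception:
        # Fall back to kicking off a brand-new run of the flow.
        flow_client.run_flow(flow_path)
        return 'RELAUNCHED'
```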

  • The webhook handler supports client objects, including the file, job, and flow clients. This change means you can use clients to access their respective APIs without an auth token.

  • Public preview | You can use the Process Files step for skew correction. When enabled, each image is assessed for skew during processing and corrected if skew is detected. To enable skew correction, open the Process Files step’s advanced OCR config settings, and set the dewarp_page feature flag to true.

  • Public preview | Flow Review has two features in preview, inline editing and lasso mode:

    • Inline editing lets you edit TEXT, INT, and FLOAT fields in document view, through an inline editor that displays alongside the provenance of that field. Additionally, when you select a field to edit from the right-hand pane, Flow Review automatically zooms into the extracted provenance for that field and opens the inline editor.

    • Lasso mode lets you edit or add the provenance of extracted fields directly from Flow Review’s document view. To enter lasso mode, click the pen icon on the top-left of the document view, and click and drag to highlight the correct provenance for the selected field. Lasso mode currently works for TEXT, INT, FLOAT, and IMG fields.

    To enable inline editing or lasso mode, click Settings > Feature Preview in the Flow Review menu bar. From there, you can toggle either feature on or off.

  • The Grafana dashboard has a new Flow dashboard, which you can use to identify and troubleshoot Flow-related issues. To learn how to use the Flow dashboard, see Flow logging and troubleshooting.

  • The overhead when executing flow binaries has been reduced by several seconds compared to performance with release 22.08. These optimizations lower overhead for both synchronous and asynchronous execution.

  • You can see the classification confidence score for classifier and split classifier models in the Apply Classifier step. For split classifier models, you see the confidence score for each split.

    Note

    To view a classification confidence score for a classifier model, the model must have been trained on ibformers version 2.0. This requirement does not apply to split classifier models.

ML Studio

  • Public preview | ML Studio supports incremental training and model evaluation, letting you train and evaluate models using entirely new datasets. You can train a given model on a new dataset to evaluate its performance on one set of documents compared to another, or to evaluate the model’s baseline performance.

    • If an annotation set contains a class with no fields, you can import a class schema from any Marketplace model.

    • When importing a Marketplace model into ML Studio, you see a searchable, filterable list of available models. The list contains Marketplace models approved for use in evaluation or incremental training.

    Warning

    Models trained prior to 23.01 must be either retrained or republished in order to be imported into ML Studio. For instructions and additional information, see the Working with Marketplace models documentation.

  • Public preview | A new Do annotation analysis hyperparameter can help improve annotation quality by detecting inconsistencies in manual annotations and providing an annotation error probability figure. To learn more about annotation analysis, see the Improving annotation quality section of the model training documentation.

  • Public preview | A new List field type is available. From the annotation UI, you can annotate multiple items as part of a List field. For a List field prediction to count as correct, all annotated items must appear among the predicted items. When viewing a List field in Refiner, you’ll see the entire list of items in the List field.

    • If an annotation set contains List fields, the enable_list_support hyperparameter is set to true by default. You’ll see a warning if the hyperparameter is set to false or if the selected base model doesn’t support the enable_list_support hyperparameter.

    • New keyboard shortcuts are available for field annotation:

      • L: Adds a new list item to the selected field.

      • Shift+L: Deletes the currently selected item from the selected field.

      • Backspace/Delete: Clears the selected list item (previously cleared the entire field).

    • When reviewing model prediction items against annotation items for model training jobs, ML Studio evaluates the content of each item and suggests the most likely match.

  • When using automatic test record selection for model training, you can see which annotation sets were used as well as the test percentage and randomization seed for each annotation set. From the Train & evaluate tab’s Trained models list, click View details for a given model training job. You can then view the details of which annotation sets were used under the Configuration tab’s Annotation sets used section.

  • A Try again button displays alongside Error encountered while loading errors in ML Studio’s annotation view and test records view. Clicking Try again retries loading the lines whose request failed.

  • You can see both the size of the model used for a training job and what version of ibformers was used, if applicable, from the training job’s Configuration tab.

  • When you create a new extraction model from an annotation set, ML Studio automatically suggests creating a table extraction model for class schemas that contain table fields.

  • You can star successful model versions in the Trained Models list, letting you mark which model versions you may want to reference later.

  • The term “annotation set” has replaced the previously used “dataset” throughout ML Studio’s user interface and documentation.

    Warning

    Extraction model output format changed in 23.01 and ibformers 1.7+ to support list fields. Make sure you have the latest model_util version in Developer Exchange. Existing flows and solutions continue to work as before, but if you want to use a newly trained model with existing custom code that manipulates the raw __model_result output, you must update your code to the new format.

Refiner

  • You can register functions in UDFs using the register_fn decorator. This decorator replaces the previous way of registering functions with the name_to_fn registration dictionary.

    The decorator parameters include:

    • name (string): The name to register the function with. The registered name is the value used in Refiner to call the function. If no name is specified, the registered name defaults to the name of the function.
    • provenance (boolean): Defines whether the function is provenance-tracked (True) or not (False). The default value is True.

    You can register both a provenance-tracked and untracked version of a function with the same name parameter.

    For more details on how to use the decorator, including examples, see the User Defined Functions (UDFs) on Instabase documentation.

  • You can select and view the provenance of specific cells in Extracted Table outputs. Hover over a cell to highlight its provenance in the corresponding document image.

  • A new Hide readonly field toggle in Refiner lets you mark fields as read-only and hide them from flow output.

Scheduler

  • The Scheduler app has a new automatic schedule mode. When set to automatic, the scheduler job listens for changes to a specified folder in the file system. When new files are added to the folder, the scheduler job automatically triggers a new flow job for each file. To learn more, see the Scheduler automatic schedule section of the Flow integration guide.

Validations

  • Public preview | Classification validation is available in the Validations app. You can define thresholds per class, based on class_score, and flag records that have a model confidence lower than the defined threshold.
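
    Conceptually, per-class thresholding works as sketched below. The class names, threshold values, and record shape are invented for illustration and are not the Validations app’s internal format.

```python
# Hedged sketch of per-class confidence flagging. Class names,
# thresholds, and the record dict shape are illustrative only.
THRESHOLDS = {'invoice': 0.90, 'receipt': 0.75}

def flag_low_confidence(records, thresholds, default=0.80):
    """Return records whose class_score is below their class's threshold."""
    flagged = []
    for record in records:
        threshold = thresholds.get(record['class'], default)
        if record['class_score'] < threshold:
            flagged.append(record)
    return flagged
```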

  • You can link multiple conditions together in a run-all group, then run them as a single condition.

Observability

  • Grafana is only accessible through single sign-on (SSO), using your Instabase session or Deployment Manager token. You can access Grafana from the All apps page as well as from the Deployment Manager’s dashboard (click the Monitor button on the Deployments tab of the Status Dashboard). Your Grafana permissions (Admin or Viewer) are based on your existing session permissions.

    You must be running Deployment Manager 23.01 for Grafana SSO to work.

  • All API responses contain an ib_trace_id response header, to support debugging specific API requests. The trace ID is unique throughout all services and the same trace ID appears in logs across services.

  • Traces for file-service batch operations are available in Grafana.

Bug fixes

Platform

  • Site admins can change their own passwords on the All Users user search page. Previously, only non-site admins’ passwords could be changed from the All Users page.

  • This release resolves a bug that caused ZIP files to download without a file extension. This error occurred when one of the files being compressed contained a comma in its name.

  • The database initializes even when special characters are present in the database’s login credentials.

  • Python PDF libraries are used to generate images from PDFs, instead of using the PDF service. The PDF service was being used in cases where images hadn’t already been generated during a Process Files step.

ML Studio

  • A race condition that impacted ML Studio’s stability when loading long documents is fixed.

  • The letter case of a file extension no longer affects whether the file can be added to an ML Studio annotation set.

  • You no longer need to reload the page to see class annotations after applying annotations to multiple files.

  • This release resolves a bug that caused an error when trying to change the record split of files in an annotation set, after switching between annotation set tabs in an ML Studio model.

  • You can only edit OCR configurations that are properly loaded. This change helps prevent invalid OCR config errors.

  • ML Studio defines and verifies the minimum number of train or test documents based on the minimum number of documents required in the ibformers model training script. Previously, ML Studio and ibformers defined the minimum number of training samples for split classification model training differently.

  • This release resolves a bug that caused long documents to load incorrectly after a new record split was created for the document.

  • Clicking the Select checkbox in a record’s classification view no longer causes the page to auto-scroll unpredictably.

  • Files that have undergone OCR processing a second time no longer load incorrectly in annotation view.

Deployment guide

  • To use the file service cache component described in the File system section, your deployment requires a separate Redis instance, redis-file-service, dedicated to file caching.

  • As part of the migration from ElasticSearch to OpenSearch, the OpenSearch StatefulSet requires a new PVC to be provisioned, named opensearch-pvc by default.

    After deploying OpenSearch, the existing ElasticSearch StatefulSet can be scaled down to 0.

  • If upgrading to release 23.01 from release 22.10 or earlier, you might encounter the following error during the upgrade:

    {"statusCode":500,"data":{"failed_to_apply":{"deployment-file-tservice":"Deployment.apps \"deployment-file-tservice\" is invalid: spec.template.spec.containers[1].image: Required value"}...
    

    If you encounter this error, or similar, you must apply the delete-file-tservice-container-patch.yml patch that is included in the 23.01 release bundle. For instructions on applying patches with Deployment Manager, see the Configuration management with Deployment Manager documentation.

  • After upgrading to release 23.01, you must complete a database migration:

    1. From the Instabase desktop, open All apps > Admin > Configuration.

    2. Under Service setup, verify if the Database tables are set up message displays.

    3. If you don’t see the message, click Set up. If successful, you see the Database tables are set up message.

  • This release requires a minimum ibformers version of 2.0 and a Marketplace base model image update. ibformers versions less than 2.0 and their corresponding base model images are not supported. As of release 23.01, model download has been removed from the platform and instead ibformers is responsible for base model lookup and download. As part of this change, some base models have been published under new names and paths, which must be updated. After upgrading to 23.01, update Marketplace.

    Note

    For fresh installs of 23.01, manually upgrading ibformers and updating the base models might not be required. First try using the automated post-install actions feature. You can initiate post-install actions from the Deployment Manager Dashboard page, using the Check post-install actions button. If unsuccessful, update Marketplace.

  • When upgrading from release 22.10 to 23.01 or later, flows that contain extraction and refiner steps might start failing during refiner execution. To fix this:

    • If your flow is using a published model, you can use the ML Studio Utilities app to fix your published models. Go to the Migrate Published Models tab and click Migrate.
      • You need ML Studio Utilities version 2.0.4 to do this. If you don’t have this version, reach out to the Instabase team.
    • If your flow is using model projects (that is, unpublished models), you must fix them manually in the file system.
      1. In the file system, open the extraction module folder in your flow modules. This folder contains a JSON file that contains information about the model project. Go to the folder specified in the "model_fs_path" field.
      2. In that folder, open the package.json file and check whether the "result_type" field has the value "ner_result". If not, edit the file so that it does (that is, "result_type": "ner_result"). Your flow will then work correctly.
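
    The manual fix above can be scripted. This is a hedged sketch for patching one package.json file, assuming it is accessible as a local file; back up files before editing anything in your Instabase file system.

```python
# Hypothetical helper: set "result_type" to "ner_result" in a
# package.json file if it has any other value. Returns whether the
# file was modified.
import json

def fix_result_type(package_json_path):
    with open(package_json_path) as f:
        package = json.load(f)
    if package.get('result_type') == 'ner_result':
        return False  # already correct, leave the file untouched
    package['result_type'] = 'ner_result'
    with open(package_json_path, 'w') as f:
        json.dump(package, f, indent=2)
    return True
```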

Deprecations and removals

  • The NLP Search app is deprecated as of release 23.01 and will be removed from the platform in release 23.04.

  • The Classifier app is scheduled for deprecation in release 23.04 and will be removed from the platform in release 23.07. We recommend using ML Studio to train classifiers.

  • The model service run batch sync API endpoint (/api/v1/model-service/run_batch_sync) and run test sync API endpoint (/api/v1/model-service/run_test_sync) have been removed from the platform. When either API is called from a version earlier than 23.01, a deprecation error is logged but the call completes. When either API is called from version 23.01, you receive a deprecated response.

  • Libpostal protocol buffers and libpostal-related methods are deprecated in the model service and model service SDK. The deprecated libpostal-related methods include:

    • parse_address_string

    • parse_address_strings

    • parse_address_document

    • parse_address_documents

    • parse_address_directory

    • expand_address_string

    • expand_address_strings