22.08 Release notes

Release 22.08.0 is a major release that introduces new features, improvements, and bug fixes.

Release 22.08.40

Bug fixes

ML Studio

  • Annotations could be lost when moved or exported, due to timestamp mismatches in a now-obsolete backwards-compatibility check.

Release 22.08.33

Bug fixes

Refiner

  • A refiner with more than one dropdown output field incorrectly contained the same value in all dropdown output fields.

Release 22.08.0

New features and enhancements

Platform

  • We’ve upgraded Instabase’s services from Go 1.15 to Go 1.18. You might be affected by this if you are using legacy X.509 certificates. The CommonName field on X.509 certificates is no longer treated as a hostname when no Subject Alternative Names (SAN) are present. If you are using such certificates, you must generate new certificates.
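
    As a quick way to check whether an existing certificate is affected, you can inspect it for SANs. A minimal sketch using the dictionary shape returned by Python’s `ssl.SSLSocket.getpeercert()`; the certificates shown are illustrative:

```python
# Under Go 1.18's verifier, a certificate without Subject Alternative Names
# fails hostname verification even if its CommonName matches. This sketch
# checks the parsed-certificate dict shape that Python's
# ssl.SSLSocket.getpeercert() returns; both example certs are made up.

def has_sans(cert: dict) -> bool:
    """Return True if the parsed certificate lists any subjectAltName."""
    return bool(cert.get("subjectAltName"))

modern_cert = {"subject": ((("commonName", "api.example.com"),),),
               "subjectAltName": (("DNS", "api.example.com"),)}
legacy_cert = {"subject": ((("commonName", "api.example.com"),),)}

print(has_sans(modern_cert))  # True
print(has_sans(legacy_cert))  # False -> regenerate this certificate
```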

  • You can provide your own managed services for Redis caching on AWS instead of using Instabase’s prepackaged containers. Instabase supports using AWS managed Redis and RabbitMQ services.

  • We’ve upgraded RabbitMQ to release 3.10. This upgrade brings support for streams on RabbitMQ and performance improvements.

  • An Envoy sidecar improves throughput for flows that make heavy use of MSFT OCR. Envoy runs as a sidecar for all MSFT OCR deployments. You can also review statistics provided by the Envoy sidecar in a new Envoy dashboard in Grafana.

    Note: The introduction of Envoy sidecar brings dependency, resource requirement, and network policy changes. See the Deployment changes section below for more details.

  • You can now create all new database tables for a release using a single button in the Admin app. The Service Setup section of Admin alerts you if any database tables are not yet set up and clicking Set up creates all missing tables.

  • Documents in Refiner, ML Studio, and Flow Review are now searchable, adding the ability to quickly find text in long documents. To show the search tool, press Ctrl/Command + F.

  • In this release, high availability functionality is available with the Instabase license and job services. High availability (HA) avoids database failures and prevents downtime by allowing multiple database replicas to run at one time. One replica serves as the active database, serving requests, while the other replicas are passive, available to take over if the active database fails.

    For details about how to enable HA for the job and license services, see the high availability for services documentation.

File service

  • Instabase now supports Azure Blob storage as a storage provider across the entire platform. File service can use Azure Blob storage as a file system and you can mount drives backed by Azure Blob storage using the Instabase UI or Mount APIs. Any file operations performed on the drive route through the attached Azure Blob storage container. For more details, see the mount APIs documentation.

    Note: This feature is in beta and we recommend using it only in development environments.

  • When deleting subspaces, you have the option to permanently delete all the subspace’s data from the Instabase Drive. For details about purging subspace data, see the subspace docs.

  • A new version of all file-system APIs has been rolled out. The new APIs provide a number of improvements over the old ones, including:

    • Better error messaging by returning standard HTTP status codes

    • Increased reliability, especially for long-running operations such as copy, move, rename, and remove tasks, and large upload and download operations

    • More performant read/write and upload/download behavior

    Refer to the API documentation for Filesystem APIs (v2) for details. You must make changes to migrate from the old file-system v1 endpoints to the newer ones. Deprecation of the v1 endpoints will be announced at a later date.
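
    As an illustrative sketch, standard status codes let a client replace per-call error-string parsing with one shared check. The exception type and error messages below are assumptions, not the documented v2 API:

```python
# Hypothetical helper for a client of the v2 file-system APIs. Only the idea
# of standard HTTP status codes comes from the release notes; the messages
# and exception type are illustrative.

class FileSystemError(Exception):
    """Raised when a v2 file-system call returns an error status."""

def check_response(status_code: int, body: dict) -> dict:
    """Return the payload on 2xx; raise a descriptive error otherwise."""
    if 200 <= status_code < 300:
        return body
    messages = {
        400: "Bad request: check the path or parameters",
        403: "Forbidden: missing permissions on this drive",
        404: "Not found: the file or folder does not exist",
        409: "Conflict: the destination already exists",
    }
    raise FileSystemError(
        messages.get(status_code, f"Unexpected status {status_code}"))

print(check_response(200, {"nodes": []}))  # {'nodes': []}
```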

Model service

  • The model service now supports running models that are not published to Marketplace. Previously, all models had to be published to Marketplace in order to run. Now classifier, refiner, and extraction modules can be imported into a flow directly from ML Studio.

    We recommend using unpublished models only in development environments or in flows shared among a limited number of people. For models running in a production environment or models shared among many people, we recommend publishing the models to the Marketplace, because Marketplace models have shorter inference time. Additionally, only Marketplace model metadata is recorded in the database.

  • Model service now supports an unbounded number of models, with no restrictions based on persistent volume (PV) capacity. Previously, all published models were stored on the PV, making it possible to run out of PV space. Now only recently used models are cached on the PV, with others remaining available to download from the remote file storage system.

    Note: You might see a slight performance drop when running a model that hasn’t been used recently. If the model is not cached on the PV, it needs to be downloaded from the remote file system, which adds a latency period for model inference requests. To learn more about the effect of PV settings on performance, see the Infrastructure requirements documentation.

ML Studio

  • ML Studio is out of beta and available for general use. The Annotator app has been removed.

  • ML Studio now supports table annotation and extraction. You can create table fields from the dataset settings page. Feature highlights include:

    • Support for annotating tables that run across multiple pages, so you can include separate table segments in a single table field.

    • New data types for extracted tables, which you can use in UDFs to perform operations on extracted tables.

    • Detailed provenance tracking for tables extracted by a model. Flow Review and Refiner show the provenance for the table and each cell, not just the table’s text.

    To learn more about table annotation and extraction, see the ML Studio annotation documentation and Model training documentation.

  • You can generate table entity suggestions when adding or re-digitizing files to an ML Studio dataset. To do this, check the Enable Table Entity Detection checkbox in the Reader configuration and add files to the dataset normally. These entities then show up for table fields in the ML Studio dataset annotation view, where you can use them for annotating.

    To use this feature, make sure that you have the table_entity_model model published into Marketplace. This feature uses the latest version of table_entity_model.

  • You can see suggestions for how to classify your records, using the Display suggestions toggle in ML Studio. Suggestions are supported for both classification and split classification models. When using a classification model, you’ll see suggestions for the predicted class name for each test record. For split classification models, you’ll see suggestions for how the file should be split and which class should be assigned to each of the resulting records. For more details, see the Model training documentation.

  • You can now move datasets to a different location in the file system from inside ML Studio, instead of using ML Studio Utilities. You can also generate a ZIP file of your dataset, to upload to a different environment running any version of Instabase.

    Note: This is the only supported method of moving datasets between Instabase environments. Moving datasets directly in the file explorer is also not supported.

  • A new pipeline in ML Studio automatically reduces scope and improves processing speed for long documents. The layout_base_long_doc pipeline uses a document segmentation approach to processing long documents. A retriever model first filters for relevant pages. Then the extraction model is applied only to the pages selected by the retriever model.

  • Expanded ML Studio metrics provide more useful and detailed information for classification, text extraction, and table extraction. Improvements include:

    • Classification metrics include record-level metrics.

    • Extraction metrics include a new metric to track confidence score quality.

  • The interactive hyperparameter editor provides a user-friendly way to tune advanced hyperparameters and increase the accuracy of your models. Previously, hyperparameter tuning required editing a JSON configuration. Feature highlights include:

    • The hyperparameter editor dynamically populates with default hyperparameter values, which you can choose to adjust.

    • Hyperparameter values are validated before you can train the model, providing a guardrail against invalid values.

    • You can still edit the JSON configuration directly to tune advanced hyperparameters or paste in a previous configuration.

  • You can now unpublish models from model configurations in ML Studio.

  • You can now select multiple files or records and apply the same action to everything selected. To select records and files, click Select in the Files list. Then, select the checkbox next to your chosen records. From there, you can mark/unmark for testing, mark/unmark as annotated, or assign a class to all selected records.

  • The Files list now displays the test flag icon at the file level. You can use the file-level test flag icon to mark every record within a file for testing at once.

Flow

  • A streamlined flow review process lets you change pipelines from within Flow review and continue reviewing until all jobs are reviewed.

    After completing a review of a given flow job, you can automatically retrieve the next job associated with your current pipeline. Or, you can use new Get Next options to configure a different pipeline to retrieve the next job from.

  • More hotkeys in Flow review add support for more seamless navigation across records and between UI views. For a full list of hotkeys, select Help > Keyboard Shortcuts or press Shift + ? in Flow Review.

  • Changes to how flow run time is calculated give you a more accurate understanding of a flow’s active run time. The calculation now accounts for checkpoints and pauses.

  • Two improvements to flow execution give you more options for scheduling and triggering flows:

    • You can now queue a large number of jobs at one time. Changes to scheduling logic restrict the maximum number of jobs that can execute concurrently.

    • Flow now lets you launch flows using job IDs, instead of relying on complex checkpoint logic.

  • The new Solution Accuracy feature lets you view a report of the overall accuracy of a flow job’s results, compared to a golden set. You can create your golden set by exporting the results of a flow job with a similar extraction schema to the comparison job. Then, you’ll link the golden set with the comparison job. After running the job, Solution Accuracy reports several accuracy metrics compared to the golden set.
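
    To illustrate the idea, here is a minimal sketch of comparing a job’s extracted fields against a golden set using exact-match accuracy per field. The metric and the record pairing shown are assumptions for illustration, not Instabase’s exact formula:

```python
# Illustrative sketch of a golden-set comparison: records are paired in
# order, and each field's accuracy is the fraction of exact matches.
# All names and data below are made up.

def field_accuracy(golden: list, extracted: list) -> dict:
    """Return per-field exact-match accuracy across paired records."""
    totals, matches = {}, {}
    for gold_rec, ext_rec in zip(golden, extracted):
        for field, gold_value in gold_rec.items():
            totals[field] = totals.get(field, 0) + 1
            if ext_rec.get(field) == gold_value:
                matches[field] = matches.get(field, 0) + 1
    return {f: matches.get(f, 0) / totals[f] for f in totals}

golden = [{"name": "Acme Corp", "total": "120.00"},
          {"name": "Globex", "total": "75.50"}]
extracted = [{"name": "Acme Corp", "total": "120.00"},
             {"name": "Globex", "total": "75.00"}]
print(field_accuracy(golden, extracted))  # {'name': 1.0, 'total': 0.5}
```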

  • The Run Binary Sync API is out of beta and available for general use. This API lets you run a binary much faster using sync mode: sending one or more files as the flow input via API, running the flow, and receiving the extracted results as the response.

    Note: Compared to running a binary in async mode using the Run Binary Async API, sync mode is faster. However, the Run Binary Sync API doesn’t support checkpoints or processing large batches of files (more than 5) and has a max processing time of 60 seconds.

  • A new parameter in the Flow Status and Flow List APIs, run_summary, lets you see a summary of the flow run on completion. You can see the total number of records, the number of records with runtime errors, and the number of records that failed checkpoint validation.
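
    For illustration, a hypothetical run_summary payload containing the three counts described above. The field names are inferred from this description, not a documented schema:

```python
# Hypothetical shape of a Flow Status API response when run_summary is
# requested. Every field name here is an assumption for illustration.

status_response = {
    "state": "COMPLETE",
    "run_summary": {
        "total_records": 120,
        "records_with_runtime_errors": 3,
        "records_failed_checkpoint_validation": 5,
    },
}

summary = status_response["run_summary"]
clean = (summary["total_records"]
         - summary["records_with_runtime_errors"]
         - summary["records_failed_checkpoint_validation"])
print(f"{clean} of {summary['total_records']} records passed cleanly")
```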

  • Confidence scores in later steps of a flow have been improved. The word-level and character-level confidence scores produced by the process files step are now available to subsequent apply refiner steps and Refiner functions. Improvements include:

    • When the process files step returns word-level confidence scores, apply refiner steps and the provenance_get(field) Refiner function now use that word-level confidence score for each character confidence.

    • When there is no character-level confidence score from the process files step, the default confidence score for a character is now 0%, not 1%.
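
    The fallback rules above can be sketched as follows; the function name and signature are illustrative, not Refiner internals:

```python
# Sketch of the confidence fallback described above: each character takes
# the word-level confidence when one exists, and defaults to 0.0 (not the
# old 0.01) when the process files step supplies no confidence at all.
from typing import List, Optional

def char_confidences(word: str,
                     word_confidence: Optional[float]) -> List[float]:
    """Assign each character the word-level confidence, or 0.0 if absent."""
    if word_confidence is None:
        return [0.0] * len(word)  # new default: 0%, previously 1%
    return [word_confidence] * len(word)

print(char_confidences("inv", 0.93))  # [0.93, 0.93, 0.93]
print(char_confidences("inv", None))  # [0.0, 0.0, 0.0]
```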

  • The new flow-level redirection flag overrides the environment variable for individual solutions. You can decide at the flow level whether the flow should use Microsoft V2.0 or V3.2 Lite. Use the flag use_msft_lite in the advanced section of Process Files OCR configuration to control the redirection. Available settings for use_msft_lite are:

    • Not present: Uses the environment-level behavior.

    • True: Forces the flow to use Microsoft V3.2 Lite containers.

    • False: Forces the flow to use Microsoft V2.0 containers.
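
    The precedence of the flow-level flag over the environment setting can be sketched like this; the config shape is illustrative, and only the use_msft_lite name comes from these release notes:

```python
# Sketch of the tri-state resolution described above: a present
# use_msft_lite flag wins; an absent flag falls back to the environment.

def resolve_msft_version(ocr_config: dict, env_uses_lite: bool) -> str:
    """Return which MSFT OCR containers a flow will use."""
    flag = ocr_config.get("use_msft_lite")  # None when the flag is absent
    use_lite = env_uses_lite if flag is None else flag
    return "Microsoft V3.2 Lite" if use_lite else "Microsoft V2.0"

print(resolve_msft_version({}, env_uses_lite=False))         # Microsoft V2.0
print(resolve_msft_version({"use_msft_lite": True}, False))  # Microsoft V3.2 Lite
print(resolve_msft_version({"use_msft_lite": False}, True))  # Microsoft V2.0
```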

  • You can now chunk PDF files in the Flow v3 process files step to speed up processing. Chunking breaks each PDF file into smaller chunks and then processes them in parallel.

    To enable chunking, set the ENABLE_PROCESS_FILES_SPLIT environment variable to true. When enabled, you can control the size of chunks by setting the PROCESS_FILES_SPLIT_SIZE_PAGES environment variable. By default, the chunk size is set to 10.
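
    A sketch of how the two environment variables interact; the splitting logic itself is an illustration, not the Flow implementation:

```python
# Illustration of PDF chunking: when ENABLE_PROCESS_FILES_SPLIT is true,
# pages are split into fixed-size chunks (PROCESS_FILES_SPLIT_SIZE_PAGES,
# default 10) that can be processed in parallel.
import os

def chunk_pages(page_count: int) -> list:
    """Split a document's pages into chunks sized by the env variable."""
    if os.environ.get("ENABLE_PROCESS_FILES_SPLIT", "false").lower() != "true":
        return [range(page_count)]  # chunking disabled: one chunk
    size = int(os.environ.get("PROCESS_FILES_SPLIT_SIZE_PAGES", "10"))
    return [range(start, min(start + size, page_count))
            for start in range(0, page_count, size)]

os.environ["ENABLE_PROCESS_FILES_SPLIT"] = "true"
print([len(c) for c in chunk_pages(25)])  # [10, 10, 5]
```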

  • PDF utilities allow you to split PDFs, get metadata, and convert PDFs to images. These abilities are available in app tasks in addition to the PDF service. Enable the PDF utilities in app tasks by setting the ENABLE_APP_TASKS_PDF_UTILS environment variable.

  • To improve performance and memory management, the service for converting documents to PDF has been ported from Python to Java. This might cause some formatting changes in processed documents. Re-verify accuracy for all solutions that convert documents to PDFs.

Reader

  • Improvements to the Reader runtime optimize resource usage and deliver better throughput when processing a variety of file types. For example, documents longer than 10 pages are now processed in parallel, reducing the time required for digitization processing.

Refiner

  • Provenance tracking is now supported for all numerical and string functions in Refiner.

  • The File > Export option in Refiner has been removed, because the related .ibprog files are deprecated.

Scheduler

  • Scheduler is out of beta and available for general use. Along with a major UI refresh, Scheduler now supports using cron expressions. This enhancement lets you schedule jobs at more finely tuned intervals. For example, scheduling a job to run at 7 p.m. every Friday.

  • Scheduler offers more options when scheduling a flow job to run. You’ll find the same settings available in Scheduler as you do when using APIs or Flow Run to run a flow binary.

Diff Checker

  • A new API endpoint lets you export the .ibdiff files produced in Diff Checker to HTML. See the Diffs APIs documentation for more details.

Bug fixes

  • Images now load properly in Refiner when files are processed through Flow with write_converted_image set to False.

Deprecations and removals

  • The Annotator app has been removed. For annotation and deep learning model training, use the ML Studio app instead. Models that were created in Annotator and that ran on 21.10 or greater still work.

  • .ibprog files are deprecated.

  • Key service has been deprecated.

Deployment changes

  • Key service can be deleted from the deployment.

  • Envoy runs as a sidecar for all MSFT OCR deployments. MSFT OCR deployments have the following changes:

    • Dependency:

      • New sidecar: Envoy

      • Persistent volume: MSFT OCR no longer requires a persistent volume. (Remove)

      • Node local storage: Prefer flash/SSD 32GB. (Add)

    • Resource requirements:

      • Envoy sidecar requires 10m CPU and 100MB memory.

    • Network policy:

      • ocr-msft ingress to port 6001 (Add)

      • prometheus egress to pod ocr-msft port 6001 (Add)

      • ocr-msft-v3 ingress to port 6001 (Add)

      • prometheus egress to pod ocr-msft-v3 port 6001 (Add)

      • ocr-msft-lite ingress to port 6001 (Add)

      • prometheus egress to pod ocr-msft-lite port 6001 (Add)

Additions since 22.05

Since the most recent major version, 22.05, Instabase has had two minor releases that included the following changes.

Platform

  • If you are using server-side encrypted S3 storage for the Instabase drive or mounted S3 drives, you can now specify custom AWS KMS keys for encrypting data uploaded to an S3 bucket.

    For the Instabase drive, update the deployment YAML for core-platform-service as follows:

    • Set the environment variable HOSTED_S3_ENCRYPTION_TYPE to aws_sse_kms.

    • Set the environment variable HOSTED_S3_SSE_KMS_KEY_ID to the custom AWS KMS key ID.

    When mounting server-side encrypted S3 drives into a subspace, set the encryption type to S3 KMS, and specify the custom AWS KMS key ID to use.

    Note: Enabling this feature in a drive does not encrypt pre-existing content. The KMS key ID is passed in for server-side encryption on future file operations on the drive.
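
    As a sketch, the two variables might look like this in the core-platform-service container spec; the surrounding manifest structure varies by deployment, and the key ID is a placeholder:

```yaml
env:
  - name: HOSTED_S3_ENCRYPTION_TYPE
    value: "aws_sse_kms"
  - name: HOSTED_S3_SSE_KMS_KEY_ID
    value: "<your-kms-key-id>"  # placeholder for your custom AWS KMS key ID
```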

ML Studio

22.07
  • You can now unpublish models from model configurations in ML Studio.

  • You can now select multiple files or records and apply the same action to everything selected. To select records and files, click Select in the Files list. Then, select the checkbox next to your chosen records. From there, you can mark or unmark for testing, mark or unmark as annotated, or assign a class to all selected records.

  • The files list displays the test flag icon at the file level. You can use the file-level test flag icon to mark every record within a file for testing.

22.06
  • You can search, filter, and download logs on the training logs page. This functionality is the same as that found in Flow. You can search logs for specific text, filter by log level (Debug, Info, Warning, or Error), and download logs as CSV or JSON files.

  • You can now see the models configuration page for all training jobs, including in-progress, failed, and successfully completed training jobs. This page shows what base model and hyperparameters were used in each job.

  • After you update the Reader configuration, ML Studio shows an outdated label on files that need to be re-added before the new configuration will apply to them.

  • The files list in ML Studio has several improvements, including:

    • You can filter the files list by annotation status, test record status, or class.

    • Any files that have been annotated are now expanded in the files list by default. It’s also easier to see which records in the file have been included or excluded in the filter.

    • The files list shows the total number of records marked for testing and the total number of records marked as annotated.

    • You can now see new results immediately when you re-add files to the dataset after changing OCR settings.

    • When you add files containing attachments to your dataset, such as emails (.eml files), the attachments now appear as separate files in a files table.

Flow

22.07
  • Context menus are disabled in the Input Samples and Modules panels until the resources in the panels finish loading.

  • The appearance of Filter steps is now less crowded.

  • You can find classification accuracy data in the solution accuracy view at the job and class levels. This view also highlights low-accuracy fields in red.

  • Flow review error messages include actionable information.

  • Reviewers can change pipelines from within Flow review and continue reviewing until all jobs are reviewed.

    After completing a review of a given Flow job, you can automatically retrieve the next job associated with your current pipeline. You can also use the Get Next option to configure a different pipeline from which to retrieve the next job.

  • Flow review automatically revalidates the most recently edited record after you click away to another record.

22.06
  • Usability and performance improvements in Flow review include:

    • New hotkeys were added; for a full list, select Help > Keyboard shortcuts.

    • The filter location is more visible.

    • Flow review operations are faster.

  • When you open a Flow results file (.ibflowresults) from the file explorer, Flow run, or Flow dashboard, after you complete the review, you’re redirected to the Flow dashboard. If you open a results file from a direct link or from the Flow review dashboard, after you complete the review, you’re redirected to the Flow review dashboard.

Refiner

22.07
  • The behavior of the scan_line_repeated function has changed:

    • Previously, more results than the max_scans value could be returned. Now, the max_scans number of results are returned in the output.

    • Previously, the same line could be returned multiple times. Now, a given line can only be returned in the output once.

  • Resolved an issue causing bad gateway errors when opening Refiner programs that contain large documents (up to 350 pages).

22.06
  • Refiner’s document display is more consistent with ML Studio and Flow, but its functionality hasn’t changed.

  • Provenance tracking works in the list_get_range, merge_lists, scan_right_repeated, scan_below_repeated, and scan_line_repeated Refiner functions. This means you no longer have to freeze the input before calling these functions.

  • The error messages from scan_* Refiner functions list which user-specified labels were not found.

Additional app changes

22.07
  • The Reader app is now out of beta and generally available to use. See the Reader documentation for more information.

  • The Test Runner app has been redesigned.

Deprecations

  • The Annotator app (Beta) is deprecated in 22.06 and 22.07 and removed in 22.08. For annotation and deep learning model training, use the ML Studio app instead. Models that were created in Annotator and that ran on 21.10 or greater still work.