Load testing and resource management
Load testing and resource management can help you optimize flow processing.
Load testing
Load testing can help you estimate Instabase performance during real workloads. Perform load tests after upgrading to check if the new version requires additional resources.
Load test recommendations
Before each load test, run a similar flow to cache any models. When running a load test, use a flow that mirrors your production workload. Flows can have different performance characteristics depending on OCR and model requirements, flow complexity, and other factors. For an accurate throughput measurement, run a flow for approximately 30 minutes.
Load testing uses a wrapper provided by Instabase to run the following operations on your specified flow:
- Create N threads to simulate N parallel flow executions.
- On each of the N threads:
  - Trigger a flow using a set of input documents.
  - Poll for the completion of the flow.
  - On completion of the flow, record the number of pages processed.
  - Trigger a new flow using the same set of input documents.
- At the end of the load test, compute the total number of pages processed by all flow runs. Then, divide this total by the duration of the test to compute the throughput, measured in pages processed per hour.
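The steps above can be sketched in Python as follows. This is a minimal illustration, not the Instabase wrapper itself: `run_flow` is a hypothetical callable standing in for "trigger a flow and poll until it completes", `fake_flow` is a made-up stand-in for a real flow, and bounding the test by a fixed duration is an assumption.

```python
import threading
import time
from typing import Callable


def run_load_test(
    run_flow: Callable[[], int],
    num_threads: int,
    duration_seconds: float,
) -> dict:
    """Run `run_flow` repeatedly on `num_threads` threads and report throughput.

    `run_flow` is a hypothetical callable: it triggers one flow, blocks until
    the flow completes, and returns the number of pages processed.
    """
    total_pages = 0
    lock = threading.Lock()
    start = time.monotonic()
    deadline = start + duration_seconds

    def worker() -> None:
        nonlocal total_pages
        # Keep starting new runs on the same input until the test window closes.
        while time.monotonic() < deadline:
            pages = run_flow()
            with lock:
                total_pages += pages

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    # Throughput = total pages / test duration, expressed per hour.
    elapsed_hours = (time.monotonic() - start) / 3600
    return {
        "total_pages": total_pages,
        "pages_per_hour": total_pages / elapsed_hours,
    }


def fake_flow() -> int:
    """Stand-in for a real flow: pretends a 200-page run takes 10 ms."""
    time.sleep(0.01)
    return 200


result = run_load_test(fake_flow, num_threads=4, duration_seconds=0.2)
print(result)
```

In a real test, `run_flow` would call the Instabase APIs, and the duration would be closer to the 30 minutes recommended above.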
Running a load test
Using the load test flow, you can run a load test directly from Instabase. The load test flow runs your sample flow at scale using Instabase APIs, helping you test and validate performance.
- Create a directory called loadtest anywhere in the file system.
- In the loadtest directory, create a config.json file, copying the sample config.json file and modifying the values to reflect your environment.
- Download and copy the loadtest.ibflowbin file into your load test directory.
Your load test directory must look similar to this:
loadtest/
  config.json
  loadtest.ibflowbin
  bin/
    flow_1.ibflowbin
    flow_2.ibflowbin
  input/
    dataset_folder/
      file_1.jpeg
      file_2.pdf
  output/
- Set up a run of the loadtest.ibflowbin flow, specifying any directory as the input folder and providing the runtime configuration details.
- Run the flow.
You can monitor flow progress in Grafana. When the flow completes, check your specified output directory for a file named {date}_{time}_loadtest_result.txt. The file contains a message reporting runtime metrics, for example:
Total number of pages processed: {num_pages}; total time evolved: {time_elapsed}; {throughput} pages/hr; completed with {num_errors} errors
Sample config.json file
The load test configuration file contains settings related to your load test. The configuration consists of four blocks describing your environment, flow, input, and profile.
- env specifies your environment details:
  - baseURL - URI of the Instabase server, for example, https://instabase.com.
  - accessToken - OAuth access token to use with the Instabase APIs. For details, see API authorization.
  - outDir - Directory used as the flow output directory.
- flow specifies your benchmark flows. For each flow, provide either the location of a flow binary or the name and version of a Marketplace app.
  Sample binary flow configuration:
  "benchmarkName": { "bin": {LOCATION-OF-BINARY} }
  Sample Marketplace app configuration:
  "passport": { "marketplace": true, "name": "Passport", "version": "3.0.2" }
- input consists of the directory of your input dataset and the total number of pages contained within the directory. This figure is used to calculate load test throughput in pages per hour.
- profile consists of the load profile for this test. This value specifies how many flows run concurrently and the total number of flows.
Here’s an example config.json:
{
  "env": {
    "prod": {
      "baseURL": "https://instabase.com",
      "accessToken": "fzKNexrVzDv0cF87pJLoAwLrp4pJ4d",
      "outDir": "flow/my-repo/fs/Instabase Drive/loadtest/output"
    }
  },
  "flow": {
    "USBankStatement": {
      "bin": "flow/my-repo/fs/Instabase Drive/loadtest/bin/US_bank_statements-1.0.0.ibflowbin"
    },
    "passport": {
      "bin": "flow/my-repo/fs/Instabase Drive/loadtest/bin/passport_v4.ibflowbin"
    }
  },
  "input": {
    "bankStatement52": {
      "dir": "flow/my-repo/fs/Instabase Drive/loadtest/input/bank_statement_52_200",
      "numPages": 200
    },
    "passport116": {
      "dir": "flow/my-repo/fs/Instabase Drive/loadtest/input/passport_116_200",
      "numPages": 200
    }
  },
  "profile": {
    "test": {
      "numConcurrentFlows": 1,
      "totalFlowRuns": 1
    },
    "max": {
      "numConcurrentFlows": 20,
      "totalFlowRuns": 50
    }
  }
}
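Before running a load test, it can help to check that config.json has the expected shape. The following is a minimal sketch, not an Instabase utility: `validate_config` is a hypothetical helper, and treating every listed key as required is an assumption based on the block descriptions above. The access token in the sample is a placeholder.

```python
import json
from typing import List

REQUIRED_BLOCKS = ("env", "flow", "input", "profile")


def _block(config: dict, name: str) -> dict:
    """Return the named block if it is a JSON object, else an empty dict."""
    value = config.get(name)
    return value if isinstance(value, dict) else {}


def validate_config(config: dict) -> List[str]:
    """Return a list of problems found in a parsed config.json (empty if none)."""
    problems = []
    for name in REQUIRED_BLOCKS:
        if not _block(config, name):
            problems.append(f"missing or empty block: {name}")
    # Assumption: every environment needs all three keys described above.
    for name, env in _block(config, "env").items():
        for key in ("baseURL", "accessToken", "outDir"):
            if key not in env:
                problems.append(f"env.{name} is missing {key}")
    # A flow is either a binary ("bin") or a Marketplace app ("marketplace").
    for name, flow in _block(config, "flow").items():
        if "bin" not in flow and not flow.get("marketplace"):
            problems.append(f"flow.{name} needs 'bin' or 'marketplace'")
    for name, entry in _block(config, "input").items():
        if "dir" not in entry:
            problems.append(f"input.{name} is missing dir")
        pages = entry.get("numPages")
        if not isinstance(pages, int) or pages <= 0:
            problems.append(f"input.{name}.numPages must be a positive integer")
    for name, profile in _block(config, "profile").items():
        for key in ("numConcurrentFlows", "totalFlowRuns"):
            count = profile.get(key)
            if not isinstance(count, int) or count < 1:
                problems.append(f"profile.{name}.{key} must be a positive integer")
    return problems


sample = json.loads("""
{
  "env": {"prod": {"baseURL": "https://instabase.com",
                   "accessToken": "<ACCESS-TOKEN>",
                   "outDir": "flow/my-repo/fs/Instabase Drive/loadtest/output"}},
  "flow": {"passport": {"marketplace": true, "name": "Passport", "version": "3.0.2"}},
  "input": {"passport116": {"dir": "flow/my-repo/fs/Instabase Drive/loadtest/input/passport_116_200",
                            "numPages": 200}},
  "profile": {"test": {"numConcurrentFlows": 1, "totalFlowRuns": 1}}
}
""")
problems = validate_config(sample)
print(problems or "config looks well-formed")
```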
Sample runtime configuration
The runtime configuration for the loadtest.ibflowbin flow is a dictionary of key-value pairs that specify details related to your config.json file. Required values in the runtime configuration include:
- config - Absolute path to the config.json file.
- env - Environment to use from config.json.
- flow - Flow to use from config.json.
- input - Input folder to use from config.json.
- profile - Profile to use from config.json.
Here’s an example runtime configuration:
{
  "config": "flow/my-repo/fs/Instabase Drive/loadtest/config.json",
  "env": "prod",
  "flow": "USBankStatement",
  "input": "bankStatement52",
  "profile": "test",
  "out": "flow/my-repo/fs/Instabase Drive/loadtest/results/"
}
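Each of the env, flow, input, and profile values in the runtime configuration is a name that selects one entry from the matching block of config.json. The sketch below illustrates that lookup. `resolve_run` is a hypothetical helper written for this example, not part of the load test flow, and the config shown is a trimmed copy of the sample above with a placeholder token.

```python
def resolve_run(config: dict, runtime: dict) -> dict:
    """Return the config.json entries selected by a runtime configuration."""
    resolved = {}
    for block in ("env", "flow", "input", "profile"):
        if block not in runtime:
            raise KeyError(f"runtime configuration is missing '{block}'")
        name = runtime[block]
        if name not in config.get(block, {}):
            raise KeyError(f"{block} '{name}' is not defined in config.json")
        resolved[block] = config[block][name]
    return resolved


config = {
    "env": {"prod": {"baseURL": "https://instabase.com",
                     "accessToken": "<ACCESS-TOKEN>",
                     "outDir": "flow/my-repo/fs/Instabase Drive/loadtest/output"}},
    "flow": {"USBankStatement": {
        "bin": "flow/my-repo/fs/Instabase Drive/loadtest/bin/US_bank_statements-1.0.0.ibflowbin"}},
    "input": {"bankStatement52": {
        "dir": "flow/my-repo/fs/Instabase Drive/loadtest/input/bank_statement_52_200",
        "numPages": 200}},
    "profile": {"test": {"numConcurrentFlows": 1, "totalFlowRuns": 1}},
}
runtime = {
    "env": "prod",
    "flow": "USBankStatement",
    "input": "bankStatement52",
    "profile": "test",
}

resolved = resolve_run(config, runtime)
print(resolved["input"]["numPages"], resolved["profile"]["numConcurrentFlows"])
```

A misspelled name, such as a profile that does not exist in config.json, raises a `KeyError` before any flow is triggered.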
Resource management
By default, all flows in a single cluster are treated equally, scheduled in a round-robin fashion across all jobs. Multiple business units or solutions sharing the same cluster can lead to a noisy neighbor problem, where flows from one solution can impact the performance of another.
To effectively manage resources across multiple solutions, you can create groups, where each group represents a business unit or solution. You can then specify the minimum and maximum percentage of workers allocated to each group. Groups enable multiple solutions with different service-level agreements to run in a shared cluster.
Creating groups
Create a group config consisting of a set of groups. For each group, provide a JSON string with a request parameter specifying the minimum percentage of workers and a limit parameter specifying the maximum percentage of workers.
{
"group1": { "request": 10, "limit": 50 },
"group2": { "request": 10, "limit": 20 }
}
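To see what these percentages mean in practice, the sketch below converts them into worker counts. The cluster size of 40 workers is a made-up figure, `worker_bounds` is a hypothetical helper, and the rounding (minimum rounded up, maximum rounded down) is an assumption. The scheduler's actual rounding isn't described here.

```python
def worker_bounds(groups: dict, total_workers: int) -> dict:
    """Map each group to (min_workers, max_workers) for a given cluster size."""
    bounds = {}
    for name, cfg in groups.items():
        # Integer arithmetic avoids floating-point surprises.
        min_workers = -(-total_workers * cfg["request"] // 100)  # round up (assumed)
        max_workers = total_workers * cfg["limit"] // 100        # round down (assumed)
        bounds[name] = (min_workers, max_workers)
    return bounds


groups = {
    "group1": {"request": 10, "limit": 50},
    "group2": {"request": 10, "limit": 20},
}
bounds = worker_bounds(groups, total_workers=40)
print(bounds)  # {'group1': (4, 20), 'group2': (4, 8)}
```

With 40 workers, group1 is guaranteed at least 4 workers and can use up to 20, while group2 is guaranteed 4 and capped at 8.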
Update the job service hot config TS_GROUP_CONFIGS to configure groups. Here’s an example patch to update the hot config:
# target: hot-configs
apiVersion: v1
kind: ConfigMap
metadata:
  name: hot-configs
data:
  deployment-job-service: |
    TS_GROUP_CONFIGS: '{"group1": {"request": 10, "limit": 50}, "group2": {"request": 10, "limit": 20}}'
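Because the value of TS_GROUP_CONFIGS is a JSON string embedded in YAML, generating it programmatically can avoid quoting mistakes. The following is a minimal sketch. `to_hot_config_line` is a hypothetical helper, and the range check (0 ≤ request ≤ limit ≤ 100) is an assumption based on both values being percentages.

```python
import json


def to_hot_config_line(groups: dict) -> str:
    """Validate group percentages and render the TS_GROUP_CONFIGS line."""
    for name, cfg in groups.items():
        request, limit = cfg["request"], cfg["limit"]
        # Assumed sanity check: both are percentages and request <= limit.
        if not 0 <= request <= limit <= 100:
            raise ValueError(f"group '{name}': need 0 <= request <= limit <= 100")
    # Single quotes wrap the JSON so YAML treats it as one string.
    return f"TS_GROUP_CONFIGS: '{json.dumps(groups)}'"


groups = {
    "group1": {"request": 10, "limit": 50},
    "group2": {"request": 10, "limit": 20},
}
line = to_hot_config_line(groups)
print(line)
```

The printed line matches the TS_GROUP_CONFIGS entry in the patch above and can be pasted under deployment-job-service.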
There is always a default group. The default group has the settings { "request": 0, "limit": 100 }. You can override this default by adding it to your TS_GROUP_CONFIGS. Here’s an example:
{
"group1": { "request": 10, "limit": 50 },
"group2": { "request": 10, "limit": 20 },
"default": { "request": 0, "limit": 20 }
}
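The override behavior can be pictured as merging your configured groups over a built-in default entry. This sketch only illustrates the rule described above and is not the job service's implementation. `effective_groups` is a hypothetical helper.

```python
DEFAULT_GROUP = {"request": 0, "limit": 100}


def effective_groups(configured: dict) -> dict:
    """Return the configured groups plus the default group, unless overridden."""
    groups = {"default": dict(DEFAULT_GROUP)}
    groups.update(configured)  # a configured "default" entry replaces the built-in one
    return groups


without_override = effective_groups({"group1": {"request": 10, "limit": 50}})
with_override = effective_groups({
    "group1": {"request": 10, "limit": 50},
    "default": {"request": 0, "limit": 20},
})
print(without_override["default"])  # {'request': 0, 'limit': 100}
print(with_override["default"])     # {'request': 0, 'limit': 20}
```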
Running flows
When you run a flow with the Flow Binary API, optionally specify which group the job belongs to. For example:
"settings": {
  "priority": 5,
  "group": "group1"
}
If a group isn’t specified, the flow runs in the default group.