Model training and inference requirements
Model inference and training have specific requirements to ensure the best performance. Model inference relies on sufficient local volume capacity to reach production-level throughput, while model training requires robust GPUs to handle the demands of training jobs.
Model inference
Model inference functionality is provided through the model service. To support reliable inference, you must allocate sufficient resources to the model service.
Model service pods must be provisioned with at least 16 GB of memory and four cores each. Inference is supported only on CPUs, not GPUs.
To ensure production-level throughput, Instabase uses local volumes to store models, so it is essential to provision adequate disk volume on nodes where model service pods are deployed. Allocate a minimum of 50 GB of local volume capacity to each model service pod; the necessary capacity can vary based on the specific models required.
Minimum requirements:
- Model service: 16 GB of memory, 4 CPUs, and 50 GB of local volume per pod (see the example settings below).
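For reference, the following sketch shows one way these minimums could be expressed as Kubernetes resource settings and a local volume. The deployment name, container name, mount path, and use of an emptyDir volume are illustrative assumptions; match them to how the model service is deployed in your environment.

# Illustrative only: names, mount path, and volume type are assumptions.
# target: deployment-model-service
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: model-service
          resources:
            requests:
              cpu: "4"        # at least four cores per pod
              memory: 16Gi    # at least 16 GB of memory per pod
            limits:
              cpu: "4"
              memory: 16Gi
          volumeMounts:
            - name: model-volume
              mountPath: /models          # assumed mount path
      volumes:
        - name: model-volume
          emptyDir:
            sizeLimit: 50Gi   # at least 50 GB of local volume per pod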
Model training
The model training services in the Instabase platform support custom fine-tuning of deep learning models using your proprietary datasets. Model training lets you leverage massive pre-trained models to achieve high accuracy in extraction and classification tasks, even with limited datasets.
The Instabase platform provides two infrastructure options to support model training:
- Celery worker-based model training tasks
- Ray model training (introduced in public preview in Release 23.04)
Model training tasks
Model training functionality is provided through model-training-tasks-gpu. Model training is supported only in environments with GPUs. The number of concurrent training tasks is capped by the number of model training task GPU replicas. Training jobs can run for up to six hours, and all models must be trained through Instabase applications: ML Studio, Annotator, or Classifier.
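Because the concurrency cap equals the replica count, you can allow more simultaneous training jobs by increasing the replica count of deployment-model-training-tasks-gpu, provided enough GPU capacity is available to schedule the additional pods. A minimal sketch, assuming two GPUs are available (the value 2 is only an example):

# Illustrative only: two replicas allow two concurrent training tasks.
# target: deployment-model-training-tasks-gpu
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 2   # example value; requires two schedulable GPUs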
Ray model training
The functionality for Ray model training is made available via ray-head and ray-model-training-worker. In a Ray cluster, the Ray head node is a dedicated CPU node that runs the singleton processes responsible for managing the cluster and assigning training tasks to the worker node. The number of replicas for the Ray head is always set to 1. The worker node is a GPU node responsible for executing the computational tasks of model training jobs. As with model training tasks, all models must be trained through Instabase applications, such as ML Studio.
Disable Ray model training
As of release 23.07, Ray is the default infrastructure for model training. You can optionally disable Ray model training and use model training tasks instead. To make this change, use Deployment Manager to apply the following patches, in the order they are listed:
- Patch 1: Disable ENABLE_RAY_MODEL_TRAINING

This patch sets the ENABLE_RAY_MODEL_TRAINING environment variable to "False" in api-server. This change routes model training requests to model training tasks.

# target: deployment-api-server
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: api-server
          env:
            - name: ENABLE_RAY_MODEL_TRAINING
              value: "False"
- Patches 2 and 3: Invert the replica counts

These patches invert the replica counts of deployment-model-training-tasks-gpu and deployment-ray-model-training-worker: deployment-ray-model-training-worker has its replica count set to 0, while deployment-model-training-tasks-gpu has its replica count set to 1. (If binary autoscaling is enabled, the max replica count is defined.)

Without binary autoscaling (apply both patches):
# target: deployment-model-training-tasks-gpu
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 1

# target: deployment-ray-model-training-worker
apiVersion: apps/v1
kind: Deployment
spec:
  replicas: 0
With binary autoscaling (apply both patches):
# target: deployment-model-training-tasks-gpu
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-model-training-tasks-gpu
  annotations:
    autoscaling/enabled: "true"
    autoscaling/max_replicas: "1"
    autoscaling/queries: "max_over_time(rabbitmq_queue_messages_unacked{queue=\"celery-model-training-tasks-gpu\"}[30m])&max_over_time(rabbitmq_queue_messages_ready{queue=\"celery-model-training-tasks-gpu\"}[30m])"

# target: deployment-ray-model-training-worker
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-ray-model-training-worker
  annotations:
    autoscaling/enabled: "true"
    autoscaling/max_replicas: "0"
    autoscaling/queries: "max_over_time(clamp_min(sum(ray_tasks{State!~\"FINISHED|FAILED\"}[15s]), 0)[30m]) or vector(0)"
GPU requirements
List of supported GPUs and card performance:
GPU | FP16 TFLOPS | VRAM (GB) | NVLink |
---|---|---|---|
A100 | 312 | 40 | Yes |
V100 | 112 | 32 | Yes |
A10 | 125 | 24 | No |
A30 | 165 | 24 | Yes |
A40 | 150 | 48 | Yes |
The A10 is the lowest-tier GPU that Instabase supports.
Alternatively, you can provide GPU support by meeting these requirements:
- Hardware support for CUDA 11.7, including any relevant drivers.
- Ability to run the NVIDIA device plugin (see the verification sketch after this list).
- Compute performance that meets or exceeds the A10 (see the table above).
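To check that the NVIDIA device plugin is advertising GPUs to the cluster, you can run a short-lived test pod that requests one GPU and prints nvidia-smi output. This is a minimal sketch and not an Instabase component; the pod name and CUDA image tag are illustrative.

# Illustrative GPU smoke test; the pod schedules only if nvidia.com/gpu is advertised.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test                            # assumed name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:11.7.1-base-ubuntu20.04  # example CUDA 11.7 image
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1

After the pod completes, kubectl logs gpu-smoke-test should show the nvidia-smi output listing the node's GPUs; delete the pod once verified.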
To achieve faster distributed multi-GPU training, we recommend using GPUs with NVLink, which enables a direct connection between GPUs and significantly reduces the time spent on multi-GPU synchronization.
Node requirements
Model training tasks & Ray model training worker
- Memory: The amount of memory required is determined by the GPU VRAM and varies depending on the dataset and model. The minimum amount of RAM needed is either 16 GB or the GPU's VRAM size, whichever is larger.
- CPU: A minimum of four provisioned cores, used primarily for data preparation during model training. The number of CPUs must be greater than or equal to the number of GPUs.
Ray head
- Memory: The minimum amount of RAM needed is 16 GB.
- CPU: A minimum of two provisioned cores, used primarily for task orchestration and cluster management. (A resource request sketch reflecting these minimums follows.)
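The following sketch shows resource requests that reflect these minimums for a single-GPU worker and the Ray head. The Ray head deployment and container names are assumptions, and the values are starting points; size memory to the GPU's VRAM and your datasets.

# Illustrative only: values reflect the documented minimums for one GPU per worker.
# target: deployment-ray-model-training-worker
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: ray-model-training-worker
          resources:
            requests:
              nvidia.com/gpu: 1
              cpu: "4"        # CPU count must be >= GPU count
              memory: 16Gi    # or the GPU VRAM size, whichever is larger
            limits:
              nvidia.com/gpu: 1
              cpu: "4"
              memory: 16Gi

# target: deployment-ray-head   (assumed deployment name)
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: ray-head
          resources:
            requests:
              cpu: "2"        # at least two cores for orchestration
              memory: 16Gi    # at least 16 GB of RAM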
Distributed multi-GPU training
Ray model training lets you specify a GPU allocation mode at the training task level in ML Studio. There are three GPU allocation modes:
- The single GPU mode is the default mode, where each training job is allocated one GPU. This mode is suitable for most training tasks and ensures dedicated GPU resources for each job.
- The partial mode is ideal for small training tasks. In this mode, each training job utilizes 0.5 GPU, enabling the concurrent execution of two jobs on a single GPU.
- The multi_gpu mode is ideal for heavy training tasks. In this mode, all available GPUs are utilized for the training job, leveraging the full computational power to accelerate the model training job.
Specifying a GPU allocation mode provides more granular control over GPU resources, letting you optimize utilization based on the specific requirements of your training tasks.
To utilize the distributed multi-GPU training feature, use Deployment Manager to apply a patch that makes the following changes to the ray-model-training-worker configuration:
- Define the number of available GPUs.
- Define a NUM_GPUS_REQUESTED value.
- Update the memory value by multiplying its current value by the NUM_GPUS_REQUESTED value. For example, if the current memory value is 12Gi and the NUM_GPUS_REQUESTED value is 4, define the new memory value as 48Gi.
- Ensure that the CPU count is greater than or equal to the GPU count.
The following example ray-model-training-worker patch makes the required changes:
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: ray-model-training-worker
          env:
            - name: NUM_GPUS_REQUESTED
              value: "4"
          resources:
            limits:
              nvidia.com/gpu: 4
              cpu: 4000m
              memory: 48Gi
            requests:
              nvidia.com/gpu: 4
              cpu: 4000m
              memory: 48Gi
Binary autoscaling for GPU nodes
Binary autoscaling for GPU nodes is typically used in SaaS deployments. When there are no model training requests, GPU nodes are automatically scaled down to save costs. When a model training request is received, the model training infrastructure automatically scales up the number of GPU nodes to execute model training tasks.
To enable autoscaling, apply the following patches in Deployment Manager:
- Model training tasks

# This patch enables binary autoscaling for model-training-tasks-gpu
# target: deployment-model-training-tasks-gpu
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-model-training-tasks-gpu
  annotations:
    autoscaling/enabled: "true"
    autoscaling/max_replicas: "1"
    autoscaling/queries: "max_over_time(rabbitmq_queue_messages_unacked{queue=\"celery-model-training-tasks-gpu\"}[30m])&max_over_time(rabbitmq_queue_messages_ready{queue=\"celery-model-training-tasks-gpu\"}[30m])"
spec:
  replicas:
- Ray model training

# This patch enables binary autoscaling for ray-model-training-worker
# target: deployment-ray-model-training-worker
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deployment-ray-model-training-worker
  annotations:
    autoscaling/enabled: "true"
    autoscaling/max_replicas: "1"
    autoscaling/queries: "max_over_time(clamp_min(sum(ray_tasks{State!~\"FINISHED|FAILED\"}[15s]), 0)[30m]) or vector(0)"
spec:
  replicas: