You can manage model deployments in GPUStack by navigating to the Models - Deployments page. A model deployment in GPUStack contains one or multiple replicas of model instances. On deployment, GPUStack automatically computes resource requirements for the model instances from model metadata and schedules them to available workers accordingly.
Currently, models from Hugging Face, ModelScope, and local paths are supported.
Click the Deploy Model button, then select Hugging Face in the dropdown.
Search for the model by name on Hugging Face using the search bar in the top left. For example, Qwen/Qwen3-0.6B.
Adjust the Name, Cluster, Backend, Backend Version, and Replicas as needed.
Expand the Performance section for performance configurations if needed. Please refer to the Performance-Related Configuration section for more details.
Expand the Scheduling section for scheduling configurations if needed. Please refer to the Scheduling Configuration section for more details.
Expand the Advanced section for advanced configurations if needed. Please refer to the Advanced Configuration section for more details.
Click the Save button.
Click the Deploy Model button, then select ModelScope in the dropdown.
Search for the model by name on ModelScope using the search bar in the top left. For example, Qwen/Qwen3-0.6B.
Adjust the Name, Cluster, Backend, Backend Version, and Replicas as needed.
Expand the Performance section for performance configurations if needed. Please refer to the Performance-Related Configuration section for more details.
Expand the Scheduling section for scheduling configurations if needed. Please refer to the Scheduling Configuration section for more details.
Expand the Advanced section for advanced configurations if needed. Please refer to the Advanced Configuration section for more details.
Click the Save button.
You can deploy a model from a local path. The model path can be a directory (e.g., a downloaded Hugging Face model directory) or a file (e.g., a GGUF model file) located on workers. This is useful when running in an air-gapped environment.
!!! note
1. GPUStack uses the model files to estimate resource requirements. If the model path is not accessible on the server, GPUStack will attempt to access it from the workers.
2. GPUStack does not automatically synchronize model files. You must ensure the model path is accessible on the target workers (e.g., using NFS, rsync, etc.). You can also use the worker selector configuration to deploy the model to specific workers.
To deploy a local path model:
Click the Deploy Model button, then select Local Path in the dropdown.
Fill in the Name of the deployment.
Fill in the Model Path.
Adjust the Cluster, Backend, Backend Version, and Replicas as needed.
Expand the Performance section for performance configurations if needed. Please refer to the Performance-Related Configuration section for more details.
Expand the Scheduling section for scheduling configurations if needed. Please refer to the Scheduling Configuration section for more details.
Expand the Advanced section for advanced configurations if needed. Please refer to the Advanced Configuration section for more details.
Click the Save button.
GPUStack currently provides the following built-in backends: vLLM, SGLang, MindIE, and VoxBox.
For more details, please refer to the Inference Backends section.
Select a backend version. The available versions depend on the selected backend. This option is useful for ensuring compatibility or for taking advantage of features introduced in specific backend versions.
To edit a model deployment:
Click the Edit button in the Operations column.
Adjust Replicas to scale up or down.
Click the Save button.
!!! note
After editing the model deployment, the configuration will not be applied to existing model instances. You need to delete the existing model instances. GPUStack will recreate new instances based on the updated model configuration.
Stopping a model deployment will delete all model instances and release the resources. It is equivalent to scaling down the model to zero replicas.
Open the menu in the Operations column, then select Stop.
Starting a model deployment is equivalent to scaling up the model to one replica.
Open the menu in the Operations column, then select Start.
To delete a model deployment:
Open the menu in the Operations column, then select Delete.
To view the model instances of a deployment:
Click the > symbol to view the instance list of the deployment.
To delete a model instance:
Click the > symbol to view the instance list of the deployment.
Open the menu in the Operations column for the model instance, then select Delete.
!!! note
After a model instance is deleted, GPUStack will recreate a new instance to satisfy the expected replicas of the deployment if necessary.
To view the logs of a model instance:
Click the > symbol to view the instance list of the deployment.
Click the View Logs button for the model instance in the Operations column.
GPUStack provides the following configuration options to optimize model inference performance.
You can enable extended KV cache to offload the KV cache to CPU memory or remote storage. This feature is particularly useful for setups with limited GPU memory requiring long context lengths. Under the hood, GPUStack leverages LMCache to provide this functionality.
Available options:
RAM-to-VRAM Ratio.
This feature works for certain backends and frameworks only.
| Backend | Framework |
|---|---|
| vLLM | CUDA, ROCm |
| SGLang | CUDA, ROCm |
GPUStack automatically schedules model instances to appropriate GPUs/Workers based on current resource availability.
Spread: Distributes resource usage relatively evenly across all workers in the cluster. This may produce more resource fragmentation on individual workers.
Binpack: Prioritize the overall utilization of cluster resources, reducing resource fragmentation on Workers/GPUs.
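As a rough illustration of how the two placement strategies differ, the sketch below picks a worker using free VRAM as the only resource. The `pick_worker` function, the worker names, and the single-number resource model are assumptions made for this example. GPUStack's actual scheduler is not shown here and takes more into account.

```python
from typing import Dict, Optional


def pick_worker(free_vram_gib: Dict[str, int], required_gib: int, strategy: str) -> Optional[str]:
    """Pick a worker for one model instance (illustrative only, not GPUStack's scheduler)."""
    # Keep only workers that can fit the instance at all.
    candidates = {name: free for name, free in free_vram_gib.items() if free >= required_gib}
    if not candidates:
        return None
    if strategy == "spread":
        # Prefer the worker with the most free memory, which evens out usage.
        return max(candidates, key=lambda name: candidates[name])
    if strategy == "binpack":
        # Prefer the tightest fit, which leaves large free blocks on other workers.
        return min(candidates, key=lambda name: candidates[name])
    raise ValueError(f"unknown strategy: {strategy}")


workers = {"worker-a": 24, "worker-b": 16, "worker-c": 8}  # hypothetical free VRAM in GiB
print(pick_worker(workers, 8, "spread"))   # worker-a
print(pick_worker(workers, 8, "binpack"))  # worker-c
```

With Spread the 8 GiB instance lands on the emptiest worker, while Binpack fills the smallest gap that still fits and keeps the 24 GiB worker free for a larger model.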
When configured, the scheduler will deploy the model instance to a worker that has the specified labels.
Navigate to the Workers page and edit the desired worker. Assign custom labels to the worker by adding them in the labels section.
Go to the Deployments page and click on the Deploy Model button. Expand the Scheduling section and input the previously assigned worker labels in the Worker Selector configuration. During deployment, the Model Instance will be allocated to the corresponding worker based on these labels.
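The label matching described above can be sketched as follows. This assumes a worker qualifies only when it carries every key/value pair listed in the Worker Selector. The exact matching rules are not stated in this document, and the label names are hypothetical.

```python
from typing import Dict, List


def matches_selector(worker_labels: Dict[str, str], selector: Dict[str, str]) -> bool:
    """Return True when the worker carries every label in the selector (assumed semantics)."""
    return all(worker_labels.get(key) == value for key, value in selector.items())


def eligible_workers(workers: Dict[str, Dict[str, str]], selector: Dict[str, str]) -> List[str]:
    """List the names of workers whose labels satisfy the selector."""
    return [name for name, labels in workers.items() if matches_selector(labels, selector)]


# Hypothetical workers and labels.
workers = {
    "worker-a": {"zone": "lab", "gpu": "a100"},
    "worker-b": {"zone": "lab", "gpu": "l40s"},
}
print(eligible_workers(workers, {"zone": "lab", "gpu": "a100"}))  # ['worker-a']
```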
This scheduling type lets you specify which GPUs the model instance is deployed on.
Select one or more GPUs from the list. GPUStack will attempt to deploy the model instance to the selected GPUs if resources permit.
Auto: The system automatically calculates the GPU count per replica, using powers of two by default and capped by the selected GPUs.
Manual: Select the number of GPUs each replica should use from the dropdown.
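One possible reading of the Auto mode is sketched below: candidate GPU counts are the powers of two that do not exceed the number of selected GPUs, and the smallest count whose combined VRAM covers the estimated requirement is chosen. The selection rule, the function names, and the memory figures are assumptions for illustration. The document only states that powers of two are used and that the count is capped by the selected GPUs.

```python
from typing import List, Optional


def candidate_gpu_counts(selected_gpus: int) -> List[int]:
    """Powers of two (1, 2, 4, ...) that do not exceed the number of selected GPUs."""
    counts = []
    count = 1
    while count <= selected_gpus:
        counts.append(count)
        count *= 2
    return counts


def auto_gpus_per_replica(selected_gpus: int, vram_per_gpu_gib: float, required_gib: float) -> Optional[int]:
    """Smallest candidate count whose total VRAM fits the requirement (assumed rule)."""
    for count in candidate_gpu_counts(selected_gpus):
        if count * vram_per_gpu_gib >= required_gib:
            return count
    return None


print(candidate_gpu_counts(6))           # [1, 2, 4]
print(auto_gpus_per_replica(6, 24, 40))  # 2
```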
GPUStack supports tailored configurations for model deployment.
The model category helps you organize and filter models. By default, GPUStack automatically detects the model category based on the model's metadata. You can also customize the category by selecting it from the dropdown list.
Input the parameters for the backend you want to customize when running the model. Supported parameter formats:
| Method | Example | Remarks |
|---|---|---|
| Equal Sign Split | `--hf-overrides={"architectures": ["NewModel"]}` | - |
| Space Split | `--hf-overrides '{"architectures": ["NewModel"]}'` | Supports shell-like style splitting (e.g., for values containing spaces). |
| Separate Fields | `--max-model-length`, `8192` | Input parameter name and value as two separate items. |
For the full list of supported parameters, please refer to the Inference Backends section.
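To make the three input formats concrete, the sketch below turns each kind of entry into a list of argument tokens. It is not GPUStack's parser. The `split_parameter` helper is hypothetical and relies on Python's standard `shlex` module for the shell-like splitting.

```python
import shlex
from typing import List


def split_parameter(entry: str) -> List[str]:
    """Split one parameter entry into argument tokens (illustrative helper)."""
    entry = entry.strip()
    first_token = entry.split(maxsplit=1)[0] if entry else ""
    if "=" in first_token:
        # Equal sign split: everything after the first "=" is the value, kept verbatim.
        name, value = entry.split("=", 1)
        return [name, value]
    # Space split: shell-like rules, so quoted values may contain spaces.
    return shlex.split(entry)


print(split_parameter('--hf-overrides={"architectures": ["NewModel"]}'))
print(split_parameter("--hf-overrides '{\"architectures\": [\"NewModel\"]}'"))
# Both print: ['--hf-overrides', '{"architectures": ["NewModel"]}']

# Separate fields are already one token per item.
print([token for item in ["--max-model-length", "8192"] for token in split_parameter(item)])
# ['--max-model-length', '8192']
```

The equal-sign form is handled before `shlex` because shell-like splitting would strip the double quotes inside an unquoted JSON value.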
Environment variables used when running the model. These variables are passed to the backend process at startup.
!!! note
Available for custom backends only.
When CPU offloading is enabled, GPUStack will allocate CPU memory if GPU resources are insufficient. You must correctly configure the inference backend to use hybrid CPU+GPU or full CPU inference.
!!! note
Available for vLLM, SGLang, and MindIE backends.
Enable distributed inference across multiple workers. The primary Model Instance will communicate with backend instances on one or more other workers, offloading computation tasks to them.
Enable automatic restart of the model instance if it encounters an error. This feature ensures high availability and reliability of the model instance. If an error occurs, GPUStack will automatically attempt to restart the model instance using an exponential backoff strategy. The delay between restart attempts increases exponentially, up to a maximum interval of 5 minutes. This approach prevents the system from being overwhelmed by frequent restarts in the case of persistent errors.
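The restart schedule can be pictured with the short sketch below. Only the 5-minute cap comes from the description above. The 10-second starting delay and the doubling factor are assumptions chosen for illustration, not GPUStack's actual values.

```python
from typing import List

MAX_DELAY_SECONDS = 300.0  # 5-minute cap, as stated in the text


def restart_delays(attempts: int, base_seconds: float = 10.0) -> List[float]:
    """Delay before each restart attempt: doubles every time, capped at 5 minutes."""
    return [min(base_seconds * (2 ** attempt), MAX_DELAY_SECONDS) for attempt in range(attempts)]


print(restart_delays(7))  # [10.0, 20.0, 40.0, 80.0, 160.0, 300.0, 300.0]
```

Under these assumed values the delay reaches the cap on the sixth attempt and stays there, so a persistently failing instance is retried at most once every five minutes.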
While it is common practice to integrate with the OpenAI-compatible APIs, users may have different requirements for their use cases. GPUStack also supports inference APIs other than the OpenAI-compatible ones, which makes it more flexible for AI application development.
GPUStack offers two ways to address the target model when Generic Proxy is enabled. The path-based form is recommended; the header-based form is retained for backward compatibility and is deprecated.
Append the numeric model route id to the proxy prefix, giving /model/proxy/<model_route_id>/<upstream-path>, and GPUStack dispatches the request to that route's targets. No extra header is needed.

    # Assume the model route id is 42.
    curl http://<server-url>/model/proxy/42/embed \
      -X POST \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer <GPUSTACK_API_KEY>" \
      -d '{"inputs":["What is Deep Learning?", "Deep Learning is not..."]}'

The gateway strips /model/proxy/<id> before forwarding, so the upstream inference server sees:

    curl http://<inference-server-url>/embed \
      -X POST \
      -H "Content-Type: application/json" \
      -d '{"inputs":["What is Deep Learning?", "Deep Learning is not..."]}'

The model route id is stable across renames and can be found on the model detail page or retrieved from GET /v2/model-routes.
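For clients written in Python, the path-based form might be assembled as in the sketch below, which also mirrors the prefix stripping with a regular expression. The helper names, the `http://gpustack.example` URL, and the regex are assumptions for illustration. The gateway's real implementation is not shown in this document. The request object is only built here, not sent.

```python
import json
import re
import urllib.request

# Matches /model/proxy/<numeric id> followed by an optional upstream path.
PROXY_PATTERN = re.compile(r"^/model/proxy/(\d+)(/.*)?$")


def build_proxy_request(server_url: str, model_route_id: int, upstream_path: str,
                        api_key: str, payload: dict) -> urllib.request.Request:
    """Build a POST request that targets a model route through the generic proxy."""
    url = f"{server_url.rstrip('/')}/model/proxy/{model_route_id}/{upstream_path.lstrip('/')}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {api_key}"},
        method="POST",
    )


def strip_proxy_prefix(request_path: str) -> str:
    """Return the path the upstream server would see (illustrative, not the gateway's code)."""
    match = PROXY_PATTERN.match(request_path)
    if match is None:
        raise ValueError(f"not a path-based proxy request: {request_path}")
    return match.group(2) or "/"


request = build_proxy_request("http://gpustack.example", 42, "/embed", "<GPUSTACK_API_KEY>",
                              {"inputs": ["What is Deep Learning?"]})
print(request.full_url)                             # http://gpustack.example/model/proxy/42/embed
print(strip_proxy_prefix("/model/proxy/42/embed"))  # /embed
```

Passing the returned object to `urllib.request.urlopen` would send the request to a running server.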
!!! warning "Deprecated"
The `/model/proxy` + `X-GPUStack-Model` form is kept for backward compatibility and will be removed in a future release. Migrate to the path-based form above.

    curl http://<server-url>/model/proxy/embed \
      -X POST \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer <GPUSTACK_API_KEY>" \
      -H "X-GPUStack-Model: bge-m3" \
      -d '{"inputs":["What is Deep Learning?", "Deep Learning is not..."]}'

The path prefix /model/proxy is stripped before forwarding. You must provide either the X-GPUStack-Model header or the model attribute in the JSON body so the gateway can resolve the target model. On the upstream inference server the request looks like:

    curl http://<inference-server-url>/embed \
      -X POST \
      -H "Content-Type: application/json" \
      -H "X-GPUStack-Model: bge-m3" \
      -d '{"inputs":["What is Deep Learning?", "Deep Learning is not..."]}'

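The model lookup for the deprecated form can be sketched as below. The order of precedence (header first, then the `model` attribute of the JSON body) is an assumption, as is the `resolve_model_name` helper. The document only says that one of the two must be present.

```python
import json
from typing import Dict, Optional


def resolve_model_name(headers: Dict[str, str], body: bytes) -> Optional[str]:
    """Find the target model from the header or the JSON body (assumed precedence)."""
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {key.lower(): value for key, value in headers.items()}
    from_header = lowered.get("x-gpustack-model")
    if from_header:
        return from_header
    try:
        parsed = json.loads(body or b"{}")
    except ValueError:
        # Not valid JSON (or not valid UTF-8): nothing to resolve from the body.
        return None
    if isinstance(parsed, dict) and isinstance(parsed.get("model"), str):
        return parsed["model"]
    return None


print(resolve_model_name({"X-GPUStack-Model": "bge-m3"}, b"{}"))  # bge-m3
print(resolve_model_name({}, b'{"model": "bge-m3", "inputs": []}'))  # bge-m3
print(resolve_model_name({}, b'{"inputs": []}'))  # None
```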