Inference Backend Management

GPUStack allows admins to configure inference backends and backend versions.

This article serves as an operational guide for the Inference Backend page. For supported built-in backends and their capabilities, see Built-in Inference Backends.

For guidelines on configuring custom backends, along with examples that have been verified to work, see Custom Inference Backends.

Backend Sources

GPUStack supports three types of inference backends:

Built-in: Pre-configured backends maintained by GPUStack (e.g., vLLM, MindIE, VoxBox). These cannot be deleted.
Community: Backends shared through the Community Backend Marketplace. You can enable them as needed.
Custom: Backends you create with your own configurations. These can be freely added, edited, and deleted.

Parameter Description

Parameter Name	Description	Required
Name	Inference backend name	Yes
Health Check Path	Health check path used to verify the backend is up and responding. Default: /v1/models (OpenAI-compatible).	No
Default Execution Command	Container startup command and arguments. For example (vLLM): `vllm serve {{model_path}} --port {{port}} --served-model-name {{model_name}} --host {{worker_ip}}`. The placeholders `{{model_path}}`, `{{model_name}}`, `{{port}}`, `{{worker_ip}}`, and `{{VAR_NAME}}` (for environment variables) are substituted automatically when the deployment is scheduled to a worker. After substitution, the command is split into arguments using POSIX-style shell rules, so quote values that contain spaces and avoid shell operators.	No
Default Entrypoint	Container entrypoint override. If set, it replaces the image entrypoint for this backend. Arguments are split using POSIX-style shell rules.	No
Default Environment Variables	Environment variables to set for all versions of this backend. Can be referenced in commands using `{{VAR_NAME}}` syntax. Version-specific environment variables take precedence.	No
Default Backend Parameters	Parameters that pre-populate the Advanced Backend Parameters section during deployment. You can adjust them before launching.	No
Description	Free-text description of the backend	No
Version Configs	Configure available versions of this backend	Yes
Default Version	Version preselected during deployment. If you don't choose a version, the image of the default version is used.	No
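
The substitute-then-split behavior described for Default Execution Command can be sketched in a few lines of Python. This is an illustration only, not GPUStack's implementation: the helper name `render_command`, the sample values, and the use of `shlex.split` as the POSIX-style splitter are all assumptions.

```python
import re
import shlex

# Hypothetical helper, not part of GPUStack. It only illustrates
# "substitute placeholders first, then split POSIX-style".
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_command(template: str, values: dict) -> list:
    """Replace {{name}} placeholders, then split the result like a POSIX shell."""

    def lookup(match):
        name = match.group(1)
        # In this sketch, unknown placeholders are left untouched.
        return values.get(name, match.group(0))

    rendered = PLACEHOLDER.sub(lookup, template)
    return shlex.split(rendered)


template = (
    "vllm serve {{model_path}} --port {{port}} "
    "--served-model-name {{model_name}} --host {{worker_ip}}"
)
values = {  # sample values chosen for illustration
    "model_path": "/models/qwen",
    "port": "40000",
    "model_name": "qwen",
    "worker_ip": "192.168.1.10",
}
argv = render_command(template, values)

# A value containing a space becomes two arguments unless the template quotes it.
unquoted = render_command("serve {{model_path}}", {"model_path": "/models/my model"})
quoted = render_command('serve "{{model_path}}"', {"model_path": "/models/my model"})
```

Under these assumptions, `unquoted` ends up with three arguments while `quoted` keeps the path as a single argument, which is why the table recommends quoting values that contain spaces.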

Environment Variables

You can define environment variables at two levels:

Backend level (Default Environment Variables): Applied to all versions of the backend
Version level (Environment Variables): Specific to a version, overrides backend-level variables

Environment variables can be referenced in commands using {{VAR_NAME}} syntax.

Example:

backend_name: my-backend-custom
default_env:
  MODEL_CACHE_DIR: /cache
  LOG_LEVEL: info
version_configs:
  v1:
    image_name: my-image:v1
    custom_framework: cuda
    run_command: "serve {{model_path}} --cache {{MODEL_CACHE_DIR}} --log-level {{LOG_LEVEL}} --port {{port}}"
    env:
      MODEL_CACHE_DIR: /custom-cache # Overrides backend-level value

In this example:

MODEL_CACHE_DIR is set to /cache at the backend level
Version v1 overrides it to /custom-cache
LOG_LEVEL remains info for all versions
Both variables are referenced in the command using {{VAR_NAME}} syntax
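
The precedence in this example can be reproduced with a short Python sketch. It is not GPUStack code: the values for `model_path` and `port` are made up, and the sketch assumes that built-in placeholders and environment variables share one lookup table.

```python
import re
import shlex

# Values taken from the YAML example above.
backend_env = {"MODEL_CACHE_DIR": "/cache", "LOG_LEVEL": "info"}
version_env = {"MODEL_CACHE_DIR": "/custom-cache"}

# Later keys win, so version-level values override backend-level ones.
effective_env = {**backend_env, **version_env}

run_command = (
    "serve {{model_path}} --cache {{MODEL_CACHE_DIR}} "
    "--log-level {{LOG_LEVEL}} --port {{port}}"
)

# Assumed sample values for the built-in placeholders.
values = {**effective_env, "model_path": "/models/demo", "port": "8080"}

rendered = re.sub(r"\{\{(\w+)\}\}", lambda m: values[m.group(1)], run_command)
argv = shlex.split(rendered)
print(argv)
```

With these inputs the command receives `--cache /custom-cache` and `--log-level info`, matching the bullet points above.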

Version Configs parameter description:

Parameter Name	Description	Required
Version	Version name shown in the Backend Version dropdown during deployment	Yes
Image Name	Container image name for the backend (e.g., `ghcr.io/org/image:tag`)	Yes
Framework (custom_framework)	Backend framework (internal identifier: `custom_framework`). Deployment and scheduling are filtered by supported frameworks	Yes
Environment Variables	Environment variables specific to this version. Overrides backend-level `Default Environment Variables`. Can be referenced in `Execution Command` using `{{VAR_NAME}}` syntax.	No
Entrypoint	Version-specific container entrypoint override. If omitted, `Default Entrypoint` is used. Arguments are split using POSIX-style shell rules.	No
Execution Command	Version-specific startup command. If omitted, the Default Execution Command is used. Parsing and splitting rules are identical to `Default Execution Command`.	No
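
The fallback rules in this table (version-level values first, backend defaults otherwise) can be sketched as a small resolver. The function `resolve_version` is hypothetical, and the keys `default_entrypoint`, `default_run_command`, and `entrypoint` are assumed names used only for illustration; the other keys follow the YAML example earlier on this page.

```python
def resolve_version(backend: dict, version: str) -> dict:
    """Return the effective settings for one version of a backend definition."""
    cfg = backend["version_configs"][version]
    return {
        "image_name": cfg["image_name"],
        # Version-specific values win; otherwise fall back to backend defaults.
        "entrypoint": cfg.get("entrypoint") or backend.get("default_entrypoint"),
        "run_command": cfg.get("run_command") or backend.get("default_run_command"),
        # Version-level environment variables override backend-level ones.
        "env": {**backend.get("default_env", {}), **cfg.get("env", {})},
    }


backend = {
    "backend_name": "my-backend-custom",
    "default_run_command": "serve {{model_path}} --port {{port}}",
    "default_env": {"LOG_LEVEL": "info"},
    "version_configs": {
        "v1": {"image_name": "my-image:v1", "custom_framework": "cuda"},
        "v2": {
            "image_name": "my-image:v2",
            "custom_framework": "cuda",
            "run_command": "serve-v2 {{model_path}} --port {{port}}",
            "env": {"LOG_LEVEL": "debug"},
        },
    },
}

v1 = resolve_version(backend, "v1")  # inherits the default command and env
v2 = resolve_version(backend, "v2")  # uses its own command and LOG_LEVEL
```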

Add Custom Inference Backend

Click the "Add Backend" button in the top-right corner.
You can add a custom inference backend by completing the form or by pasting a YAML definition. Refer to the parameter descriptions above for field meanings.
The backend name cannot be modified after creation. Custom backend names must end with "-custom" (pre-filled in the form).
Click "Save" to submit.

There are two ways to add a custom inference backend:

Through the UI form: Navigate to the Resources > Inference Backends page and click the Add Custom Inference Backend button.
Through YAML configuration: Import a YAML file containing the backend configuration.
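
Before pasting a YAML definition, it can help to check it against the rules stated on this page. The sketch below is a hypothetical pre-check, not a GPUStack API. It works on an already-parsed definition (a plain dictionary) and covers only the required fields and the "-custom" naming rule; the key names follow the YAML example in the Environment Variables section.

```python
def validate_backend(definition: dict) -> list:
    """Return a list of problems found in a custom backend definition."""
    problems = []

    name = definition.get("backend_name", "")
    if not name:
        problems.append("backend_name is required")
    elif not name.endswith("-custom"):
        problems.append('custom backend names must end with "-custom"')

    versions = definition.get("version_configs") or {}
    if not versions:
        problems.append("version_configs is required")

    # Image Name and Framework are marked as required for each version.
    for version, cfg in versions.items():
        for key in ("image_name", "custom_framework"):
            if not cfg.get(key):
                problems.append(f"version {version!r}: {key} is required")

    return problems


good = {
    "backend_name": "my-backend-custom",
    "version_configs": {
        "v1": {"image_name": "my-image:v1", "custom_framework": "cuda"},
    },
}
bad = {
    "backend_name": "my-backend",
    "version_configs": {"v1": {"image_name": "my-image:v1"}},
}

good_problems = validate_backend(good)
bad_problems = validate_backend(bad)
```

An empty list means the definition passes these checks. GPUStack may enforce further constraints when the definition is saved.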

Enable Community Inference Backend

Community backends are essentially custom backends with a "community" source label. Enabling one lets you create a backend quickly, without manual configuration.

On the Inference Backend page, click the "Add Backend" button in the top-right corner.
Select the "Community" option to browse available community backends from the marketplace.
Locate the backend you want to use and click "Enable" from the card's action menu.
Once enabled, the backend becomes available for model deployments. To disable it, delete the backend from the Inference Backend page.

Edit Inference Backend or Add Custom Version

On the Inference Backend page, locate the target backend. From the card's top-right dropdown menu, choose "Edit".
Modify backend properties (the name cannot be changed), or add a new version.
For built-in backends, custom version names must end with "-custom" (the suffix is pre-filled in the form).
Click "Save" to submit.

Example: Add a Custom Version to the Built-in vLLM Inference Backend

On the Inference Backend page, locate the vLLM inference backend. From the card's top-right dropdown menu, choose "Edit".
In the Version Configs section, click "Add Version".
Fill in the fields as follows:

Version: 0.19.0
Image Name: vllm/vllm-openai:v0.19.0
Framework: cuda
Override Image Entrypoint: vllm serve
Execution Command: {{model_path}} --host {{worker_ip}} --port {{port}} --served-model-name {{model_name}}

Click "Save" to submit.

!!! note

vLLM changed the entrypoint of its Docker image in v0.11.1. Therefore, when adding a custom version for vLLM v0.11.1 or later, you must specify both the `Override Image Entrypoint` and `Execution Command` fields; otherwise, the model will fail to start. If you use newer versions of the `gpustack/runner` images, you don't need to set the `Execution Command` field.
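
To see how the two fields in this example fit together, the sketch below builds the argument list a container would start with, following the usual container convention that the command arguments are appended to the entrypoint. The placeholder values are made up, and the way GPUStack combines the fields internally is an assumption here.

```python
import shlex

# Field values from the vLLM example above.
entrypoint = "vllm serve"
command = (
    "{{model_path}} --host {{worker_ip}} --port {{port}} "
    "--served-model-name {{model_name}}"
)

# Assumed sample values for the placeholders.
values = {
    "model_path": "/models/demo",
    "worker_ip": "10.0.0.5",
    "port": "40000",
    "model_name": "demo",
}
for name, value in values.items():
    command = command.replace("{{" + name + "}}", value)

# Entrypoint tokens come first, followed by the command arguments.
argv = shlex.split(entrypoint) + shlex.split(command)
print(argv)
```

If the entrypoint were left unset, only the command arguments would be passed to whatever entrypoint the image defines, which is why both fields matter for images whose entrypoint has changed.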

Example: Add a Custom Version to the Built-in SGLang Inference Backend

On the Inference Backend page, locate the SGLang inference backend. From the card's top-right dropdown menu, choose "Edit".
In the Version Configs section, click "Add Version".
Fill in the fields as follows:

Version: 0.5.10
Image Name: lmsysorg/sglang:v0.5.10
Framework: cuda
Override Image Entrypoint: sglang serve
Execution Command: --model-path {{model_path}} --host {{worker_ip}} --port {{port}}

Click "Save" to submit.

Delete Custom Inference Backend

On the Inference Backend page, locate the target backend and select "Delete" from the card's top-right dropdown menu.
Built-in backends cannot be deleted.
Click "Delete" in the confirmation dialog.

List Versions of Inference Backend

On the Inference Backend page, click anywhere on the backend card (except the action buttons) to open a modal where you can browse all built-in and custom-added versions.

Flexible Testing Deployment

Use this mode to quickly verify or tweak the image and startup command without editing the backend definition.

Navigate to the Deployments page, click the "Deploy Model" button, and choose any model source.
In the Basic tab, open the "Backend" dropdown and select "Custom" under the "Built-in" section.
Two fields appear: image_name and run_command. These override the backend configuration for this deployment only.
Review the remaining required settings and submit the deployment.

Limitations of Custom Inference Backends

Custom inference backends do not support distributed inference across multiple workers.
