This guide explains how to add custom inference backends in GPUStack, including using verified community configurations and creating your own from scratch.
For parameter descriptions, see the User Guide.
GPUStack supports three types of inference backends:
Community backends provide the fastest way to add popular inference engines.
Steps:
The following uses TensorRT-LLM as an example to illustrate how to add and use an inference backend.
These examples are functional demonstrations, not performance-optimized configurations. For better performance, consult each backend’s official documentation for tuning.
trtllm-serve; otherwise, they start an interactive shell session. The run_command supports placeholders such as {{model_path}} and {{port}} (and optionally {{model_name}}, {{worker_ip}}), which are automatically replaced with the actual values when the deployment is scheduled to a worker.
Add configuration on the Inference Backend page; YAML import is supported. Example:
backend_name: TensorRT-LLM-custom
default_version: 1.2.0rc0
version_configs:
  1.2.0rc0:
    image_name: nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc0
    run_command: 'trtllm-serve {{model_path}} --host 0.0.0.0 --port {{port}}'
    custom_framework: cuda
On the Deployments page, select the newly added backend and deploy the model.

Result
After the inference backend service starts, the model_instance status changes to RUNNING.
You can then chat with the model in the Playground.
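
To make the placeholder behavior concrete, the following is a minimal Python sketch of how a run_command template could be rendered. It is not GPUStack's actual implementation: the function name render_run_command, the example model path and port, and the choice to leave unknown placeholders untouched are all assumptions made for illustration.

```python
import re

# Matches {{name}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_run_command(template: str, values: dict) -> str:
    """Replace each {{name}} in the template with values[name].

    Placeholders with no matching key are left as-is (an assumption,
    not documented GPUStack behavior).
    """
    def substitute(match):
        name = match.group(1)
        return str(values[name]) if name in values else match.group(0)

    return PLACEHOLDER.sub(substitute, template)


template = "trtllm-serve {{model_path}} --host 0.0.0.0 --port {{port}}"
# Hypothetical values a worker might supply at scheduling time.
command = render_run_command(
    template, {"model_path": "/models/example-model", "port": 40000}
)
print(command)
# prints: trtllm-serve /models/example-model --host 0.0.0.0 --port 40000
```

Rendering the template locally like this can be a quick way to check that the final command line looks right before importing the YAML.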

Environment variables provide flexible configuration without hardcoding values in commands:
backend_name: advanced-backend-custom
default_env:
  CACHE_DIR: /models/cache
  LOG_LEVEL: info
version_configs:
  v1:
    image_name: my-backend:v1
    custom_framework: cuda
    run_command: 'serve {{model_path}} --cache {{CACHE_DIR}} --log-level {{LOG_LEVEL}} --port {{port}}'
    env:
      LOG_LEVEL: debug  # Override for this version
In this example:
CACHE_DIR and LOG_LEVEL are defined at the backend level.
v1 overrides LOG_LEVEL to debug.
The run_command references these variables using the {{VAR_NAME}} syntax.

Override the container's default entrypoint when the image requires custom initialization. You can set entrypoints at both backend and version levels:
backend_name: custom-entry-backend-custom
default_entrypoint: /usr/local/bin/default-init
version_configs:
  v1:
    image_name: my-backend:v1
    custom_framework: cuda
    run_command: 'serve {{model_path}} --port {{port}}'
  v2:
    image_name: my-backend:v2
    custom_framework: cuda
    entrypoint: /usr/local/bin/v2-init  # Version-specific entrypoint overrides default
    run_command: 'serve {{model_path}} --port {{port}}'
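
The precedence described in the environment variable and entrypoint examples above (version-level settings win over backend-level defaults) can be sketched in Python as follows. This is an illustration under stated assumptions, not GPUStack's own code: resolve_version is a hypothetical helper, and the backend dictionary merges the two YAML examples into one hand-written structure purely for demonstration.

```python
def resolve_version(backend: dict, version: str) -> dict:
    """Compute the effective env and entrypoint for one version.

    Assumed rules, mirroring the examples above:
    - version-level env entries override default_env entries key by key
    - a version-level entrypoint replaces default_entrypoint
    """
    cfg = backend["version_configs"][version]
    env = {**backend.get("default_env", {}), **cfg.get("env", {})}
    entrypoint = cfg.get("entrypoint", backend.get("default_entrypoint"))
    return {"env": env, "entrypoint": entrypoint}


# Hypothetical backend combining both examples for illustration.
backend = {
    "default_env": {"CACHE_DIR": "/models/cache", "LOG_LEVEL": "info"},
    "default_entrypoint": "/usr/local/bin/default-init",
    "version_configs": {
        "v1": {"env": {"LOG_LEVEL": "debug"}},
        "v2": {"entrypoint": "/usr/local/bin/v2-init"},
    },
}

v1 = resolve_version(backend, "v1")
v2 = resolve_version(backend, "v2")
print(v1["env"]["LOG_LEVEL"], v1["entrypoint"])  # debug /usr/local/bin/default-init
print(v2["env"]["LOG_LEVEL"], v2["entrypoint"])  # info /usr/local/bin/v2-init
```

Here v1 inherits the default entrypoint but overrides LOG_LEVEL, while v2 keeps the default environment and swaps in its own entrypoint.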