This guide explains how to add custom inference backends in MaaS-Base, including using verified community configurations and creating your own from scratch.
For parameter descriptions, see the User Guide.
MaaS-Base supports three types of inference backends:
Community backends provide the fastest way to add popular inference engines.
Steps:
The following uses TensorRT-LLM as an example to illustrate how to add and use an inference backend.
These examples are functional demonstrations, not performance-optimized configurations. For better performance, consult each backend’s official documentation for tuning.
The run_command must start trtllm-serve; otherwise, containers from this image start an interactive shell session. The run_command supports placeholders such as {{model_path}} and {{port}} (and optionally {{model_name}} and {{worker_ip}}), which are automatically replaced with the actual values when the deployment is scheduled to a worker.
Add the configuration on the Inference Backend page; YAML import is supported. Example:
backend_name: TensorRT-LLM-custom
default_version: 1.2.0rc0
version_configs:
  1.2.0rc0:
    image_name: nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc0
    run_command: 'trtllm-serve {{model_path}} --host 0.0.0.0 --port {{port}}'
    custom_framework: cuda
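
The placeholder substitution described above can be pictured with a short Python sketch. This is only an illustration of the idea, not MaaS-Base's actual implementation: the helper name render_run_command and the example values for model_path and port are assumptions made up for this sketch.

```python
import re

# Matches {{name}} placeholders, allowing optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_run_command(template: str, values: dict) -> str:
    """Hypothetical helper: replace each {{name}} in template with values[name]."""

    def substitute(match):
        name = match.group(1)
        if name not in values:
            # Fail loudly rather than leave an unresolved placeholder in the command.
            raise KeyError("no value for placeholder: " + name)
        return str(values[name])

    return PLACEHOLDER.sub(substitute, template)


template = "trtllm-serve {{model_path}} --host 0.0.0.0 --port {{port}}"
# Illustrative values only; the real ones are chosen when the deployment is scheduled.
command = render_run_command(
    template, {"model_path": "/models/example-model", "port": 40000}
)
print(command)
# prints: trtllm-serve /models/example-model --host 0.0.0.0 --port 40000
```

Raising an error on an unknown placeholder is a design choice of this sketch; it makes a typo in run_command visible before a container is started.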
On the Deployments page, select the newly added backend and deploy the model.

Result
After the inference backend service starts, the model_instance status changes to RUNNING.
You can then chat with the model in the Playground.

Environment variables provide flexible configuration without hardcoding values in commands:
backend_name: advanced-backend-custom
default_env:
  CACHE_DIR: /models/cache
  LOG_LEVEL: info
version_configs:
  v1:
    image_name: my-backend:v1
    custom_framework: cuda
    run_command: 'serve {{model_path}} --cache {{CACHE_DIR}} --log-level {{LOG_LEVEL}} --port {{port}}'
    env:
      LOG_LEVEL: debug # Override for this version
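
The override behavior in the advanced-backend-custom configuration can be sketched in Python as a dictionary merge in which version-level values win. The function name resolve_env is hypothetical, and the sketch assumes a simple key-by-key override; it is not taken from the MaaS-Base source.

```python
def resolve_env(default_env: dict, version_env: dict) -> dict:
    """Hypothetical helper: merge backend-level defaults with version-level overrides.

    Keys present in version_env replace the same keys from default_env;
    all other defaults are kept unchanged.
    """
    return {**default_env, **version_env}


# Values copied from the advanced-backend-custom configuration.
default_env = {"CACHE_DIR": "/models/cache", "LOG_LEVEL": "info"}
v1_env = {"LOG_LEVEL": "debug"}

effective_env = resolve_env(default_env, v1_env)
print(effective_env)
# prints: {'CACHE_DIR': '/models/cache', 'LOG_LEVEL': 'debug'}
```
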
In the advanced-backend-custom configuration:
CACHE_DIR and LOG_LEVEL are defined at the backend level.
Version v1 overrides LOG_LEVEL to debug.
The run_command references both variables using the {{VAR_NAME}} syntax.

Override the container's default entrypoint when the image requires custom initialization. You can set entrypoints at both backend and version levels:
backend_name: custom-entry-backend-custom
default_entrypoint: /usr/local/bin/default-init
version_configs:
  v1:
    image_name: my-backend:v1
    custom_framework: cuda
    run_command: 'serve {{model_path}} --port {{port}}'
  v2:
    image_name: my-backend:v2
    custom_framework: cuda
    entrypoint: /usr/local/bin/v2-init # Version-specific entrypoint overrides default
    run_command: 'serve {{model_path}} --port {{port}}'
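
The entrypoint precedence in the configuration above can be sketched in Python. The helper resolve_entrypoint is hypothetical and only mirrors the rule stated in the YAML comment (a version-level entrypoint overrides default_entrypoint). Returning None when neither is set, to stand for "use the image's own entrypoint", is an assumption of this sketch.

```python
from typing import Optional


def resolve_entrypoint(backend: dict, version: str) -> Optional[str]:
    """Hypothetical helper: pick the entrypoint for one version of a backend.

    A version-level "entrypoint" wins; otherwise the backend-level
    "default_entrypoint" applies. None means neither is configured.
    """
    version_config = backend["version_configs"][version]
    if "entrypoint" in version_config:
        return version_config["entrypoint"]
    return backend.get("default_entrypoint")


# Trimmed copy of the custom-entry-backend-custom configuration.
backend = {
    "default_entrypoint": "/usr/local/bin/default-init",
    "version_configs": {
        "v1": {"image_name": "my-backend:v1"},
        "v2": {"image_name": "my-backend:v2", "entrypoint": "/usr/local/bin/v2-init"},
    },
}

print(resolve_entrypoint(backend, "v1"))  # prints: /usr/local/bin/default-init
print(resolve_entrypoint(backend, "v2"))  # prints: /usr/local/bin/v2-init
```
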