# Using Custom Inference Backends

This guide explains how to add custom inference backends in MaaS-Base, including using verified community configurations and creating your own from scratch.

For parameter descriptions, see the [User Guide](../user-guide/inference-backend-management.md).

## Backend Types

MaaS-Base supports three types of inference backends:

- **Built-in**: Pre-configured backends (vLLM, MindIE, VoxBox, SGLang...) maintained by MaaS-Base, automatically optimized for different hardware.
- **Community**: Pre-verified custom backend configurations. These are essentially CustomBackends labeled "community" to simplify manual setup.
- **Custom**: Backends you configure yourself with custom Docker images and commands.

## Using Community Backends

Community backends provide the fastest way to add popular inference engines.

**Steps:**

1. Navigate to Inference Backend page → Click "Add Backend"
2. Select "Community" option
3. Browse the "Community Backend Marketplace" and enable the backends you need

## Creating Custom Backends

### Core Steps

1. Prepare the Docker image for the required inference backend
2. Understand the image's ENTRYPOINT or CMD to determine the startup command
3. Add configuration on the Inference Backend page
4. Deploy models and select the newly added backend

### Example: TensorRT-LLM

The following uses TensorRT-LLM as an example to illustrate how to add and use an inference backend.

> These examples are functional demonstrations, not performance-optimized configurations. For better performance, consult each backend’s official documentation for tuning.

1. Find the required image from the [release page](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/tensorrt-llm/containers/release) linked from the TensorRT-LLM documentation.
2. TensorRT-LLM images must launch the inference service using `trtllm-serve`; otherwise, they start an interactive shell session.
   The `run_command` supports placeholders such as `{{model_path}}` and `{{port}}` (and optionally `{{model_name}}`, `{{worker_ip}}`), which are automatically replaced with the actual values when the deployment is scheduled to a worker.
3. Add configuration on the Inference Backend page; YAML import is supported. Example:

   ```yaml
   backend_name: TensorRT-LLM-custom
   default_version: 1.2.0rc0
   version_configs:
     1.2.0rc0:
       image_name: nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc0
       run_command: 'trtllm-serve {{model_path}} --host 0.0.0.0 --port {{port}}'
       custom_framework: cuda
   ```

4. On the Deployments page, select the newly added backend and deploy the model.

   ![image.png](../assets/tutorials/using-custom-backend/deploy-by-custom-backend.png)

**Result**

After the inference backend service starts, you can see the model_instance status becomes RUNNING.

![image.png](../assets/tutorials/using-custom-backend/custom-backend-running.png)

You can engage in conversations in the Playground.

![image.png](../assets/tutorials/using-custom-backend/use-custom-backend-in-playground.png)

## Advanced Configuration

### Using Environment Variables

Environment variables provide flexible configuration without hardcoding values in commands:

```yaml
backend_name: advanced-backend-custom
default_env:
  CACHE_DIR: /models/cache
  LOG_LEVEL: info
version_configs:
  v1:
    image_name: my-backend:v1
    custom_framework: cuda
    run_command: 'serve {{model_path}} --cache {{CACHE_DIR}} --log-level {{LOG_LEVEL}} --port {{port}}'
    env:
      LOG_LEVEL: debug # Override for this version
```

In this example:

- `CACHE_DIR` and `LOG_LEVEL` are defined at the backend level
- Version `v1` overrides `LOG_LEVEL` to `debug`
- Both variables are referenced in the command using `{{VAR_NAME}}` syntax

### Custom Entrypoint

Override the container's default entrypoint when the image requires custom initialization.
You can set entrypoints at both backend and version levels:

```yaml
backend_name: custom-entry-backend-custom
default_entrypoint: /usr/local/bin/default-init
version_configs:
  v1:
    image_name: my-backend:v1
    custom_framework: cuda
    run_command: 'serve {{model_path}} --port {{port}}'
  v2:
    image_name: my-backend:v2
    custom_framework: cuda
    entrypoint: /usr/local/bin/v2-init # Version-specific entrypoint overrides default
    run_command: 'serve {{model_path}} --port {{port}}'
```
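
To make the override and placeholder rules from the Advanced Configuration examples concrete, here is a minimal Python sketch of how a version's settings might be resolved and its `run_command` rendered. It is an illustration only, not MaaS-Base's implementation: the helper names `resolve_version` and `render_command`, the sample `model_path` and `port` values, the choice to let runtime values take precedence over environment variables, and the choice to leave unknown placeholders untouched are all assumptions. The sample dictionary combines `default_env` and `default_entrypoint` from the two examples above into one hypothetical backend.

```python
import re

# Matches {{name}} placeholders, tolerating inner whitespace.
_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def resolve_version(backend: dict, version: str) -> dict:
    """Merge backend-level defaults with one version's settings (version wins)."""
    cfg = backend["version_configs"][version]
    env = {**backend.get("default_env", {}), **cfg.get("env", {})}
    entrypoint = cfg.get("entrypoint", backend.get("default_entrypoint"))
    return {
        "image_name": cfg["image_name"],
        "entrypoint": entrypoint,
        "env": env,
        "run_command": cfg["run_command"],
    }


def render_command(template: str, values: dict) -> str:
    """Replace each {{name}} with its value; unknown names are left as-is (assumption)."""
    def substitute(match: re.Match) -> str:
        return str(values.get(match.group(1), match.group(0)))
    return _PLACEHOLDER.sub(substitute, template)


# Hypothetical backend definition mirroring the YAML examples above.
backend = {
    "default_env": {"CACHE_DIR": "/models/cache", "LOG_LEVEL": "info"},
    "default_entrypoint": "/usr/local/bin/default-init",
    "version_configs": {
        "v1": {
            "image_name": "my-backend:v1",
            "run_command": (
                "serve {{model_path}} --cache {{CACHE_DIR}} "
                "--log-level {{LOG_LEVEL}} --port {{port}}"
            ),
            "env": {"LOG_LEVEL": "debug"},
        },
    },
}

resolved = resolve_version(backend, "v1")
# Hypothetical values that would be supplied when scheduling to a worker.
runtime = {"model_path": "/models/example", "port": 8000}
command = render_command(resolved["run_command"], {**resolved["env"], **runtime})
print(command)
# serve /models/example --cache /models/cache --log-level debug --port 8000
```

In this sketch, `v1` inherits `CACHE_DIR` and the default entrypoint from the backend level, while its own `LOG_LEVEL: debug` replaces the backend-level `info`.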