@@ -1,6 +1,6 @@
 # Built-in Inference Backends

-MASS-Base supports the following inference backends:
+MaaS-Base supports the following inference backends:

 - [vLLM](#vllm)
 - [SGLang](#sglang)
@@ -31,13 +31,13 @@ vLLM seamlessly supports most state-of-the-art open-source models, including:
 - Embedding Models (e.g. `Qwen3-Embedding`)
 - Reranker Models (e.g. `Qwen3-Reranker`)

-By default, MASS-Base estimates the VRAM requirement for the model instance based on the model's metadata.
+By default, MaaS-Base estimates the VRAM requirement for the model instance based on the model's metadata.

 You can customize the parameters to fit your needs. The following vLLM parameters might be useful:

 - `--gpu-memory-utilization` (default: 0.9): The fraction of GPU memory to use for the model instance.
-- `--max-model-len`: Model context length. For large-context models, MASS-Base automatically sets this parameter to `8192` to simplify model deployment, especially in resource constrained environments. You can customize this parameter to fit your needs.
-- `--tensor-parallel-size`: Number of tensor parallel replicas. By default, MASS-Base sets this parameter given the GPU resources available and the estimation of the model's memory requirement. You can customize this parameter to fit your needs.
+- `--max-model-len`: Model context length. For large-context models, MaaS-Base automatically sets this parameter to `8192` to simplify model deployment, especially in resource constrained environments. You can customize this parameter to fit your needs.
+- `--tensor-parallel-size`: Number of tensor parallel replicas. By default, MaaS-Base sets this parameter given the GPU resources available and the estimation of the model's memory requirement. You can customize this parameter to fit your needs.

 For more details, please refer to [vLLM CLI Reference](https://docs.vllm.ai/en/stable/cli/serve/).

@@ -56,11 +56,11 @@ Please refer to the vLLM [documentation](https://docs.vllm.ai/en/stable/models/s
 - **Video Tasks**: Video generation and editing (e.g., `Wan2.2`)
 - **Audio Tasks**: Speech synthesis, voice cloning, and more (e.g., `Qwen3-TTS`)

-MASS-Base integrates with vLLM-Omni to deliver a seamless experience for deploying and managing omni-modal models. When a model is deployed via the vLLM backend, GPUStack automatically detects whether it is omni-modal based on its metadata and sets the required parameters for vLLM-Omni.
+MaaS-Base integrates with vLLM-Omni to deliver a seamless experience for deploying and managing omni-modal models. When a model is deployed via the vLLM backend, GPUStack automatically detects whether it is omni-modal based on its metadata and sets the required parameters for vLLM-Omni.

 #### Distributed Inference Across Workers (Experimental)

-vLLM supports distributed inference across multiple workers using [Ray](https://ray.io). You can enable a Ray cluster in MASS-Base by checking the `Allow Distributed Inference Across Workers` option when deploying a model. This allows vLLM to run distributed inference across multiple workers.
+vLLM supports distributed inference across multiple workers using [Ray](https://ray.io). You can enable a Ray cluster in MaaS-Base by checking the `Allow Distributed Inference Across Workers` option when deploying a model. This allows vLLM to run distributed inference across multiple workers.

 !!! warning "Known Limitations"

@@ -86,15 +86,15 @@ See the full list of supported parameters for vLLM [here](https://docs.vllm.ai/e

 It is designed to deliver low-latency and high-throughput inference across a wide range of setups, from a single GPU to large distributed clusters.

-By default, MASS-Base estimates the VRAM requirement for the model instance based on model metadata.
+By default, MaaS-Base estimates the VRAM requirement for the model instance based on model metadata.

-When needed, MASS-Base also sets several parameters automatically for large-context models. Common SGLang parameters include:
+When needed, MaaS-Base also sets several parameters automatically for large-context models. Common SGLang parameters include:

 - `--mem-fraction-static` (default: `0.9`): The per-GPU allocatable VRAM fraction. The scheduler uses this value for resource matching and candidate selection. You can override it via the model's `backend_parameters`.
-- `--context-length`: Model context length. For large-context models, if the automatically estimated context length exceeds device capability, MASS-Base sets this parameter to `8192` to simplify deployment in resource-constrained environments. You can customize this parameter as needed.
-- `--tp-size`: Tensor parallel size. When not explicitly provided, MASS-Base infers and sets this parameter based on the selected GPUs.
-- `--pp-size`: Pipeline parallel size. In multi-node deployments, MASS-Base determines a combination of `--tp-size` and `--pp-size` according to the model and cluster configuration.
-- Multi-node arguments: `--nnodes`, `--node-rank`, `--dist-init-addr`. When distributed inference is enabled, MASS-Base injects these arguments to initialize multi-node communication.
+- `--context-length`: Model context length. For large-context models, if the automatically estimated context length exceeds device capability, MaaS-Base sets this parameter to `8192` to simplify deployment in resource-constrained environments. You can customize this parameter as needed.
+- `--tp-size`: Tensor parallel size. When not explicitly provided, MaaS-Base infers and sets this parameter based on the selected GPUs.
+- `--pp-size`: Pipeline parallel size. In multi-node deployments, MaaS-Base determines a combination of `--tp-size` and `--pp-size` according to the model and cluster configuration.
+- Multi-node arguments: `--nnodes`, `--node-rank`, `--dist-init-addr`. When distributed inference is enabled, MaaS-Base injects these arguments to initialize multi-node communication.

 For more details, please refer to [SGLang documentation](https://docs.sglang.ai/index.html).

@@ -108,7 +108,7 @@ SGLang also supports image models. The ones we have verified include: Qwen-Image

 #### Distributed Inference Across Workers (Experimental)

-You can enable distributed SGLang inference across multiple workers in MASS-Base.
+You can enable distributed SGLang inference across multiple workers in MaaS-Base.

 !!! warning "Known Limitations"

@@ -151,7 +151,7 @@ See the full list of supported parameters for SGLang [here](https://docs.sglang.

 MindIE supports various models listed [here](https://www.hiascend.com/software/mindie/modellist).

-Within MASS-Base, support [large language models (LLMs)](https://www.hiascend.com/software/mindie/modellist) and [multimodal language models (VLMs)](https://www.hiascend.com/software/mindie/modellist).
+Within MaaS-Base, support [large language models (LLMs)](https://www.hiascend.com/software/mindie/modellist) and [multimodal language models (VLMs)](https://www.hiascend.com/software/mindie/modellist).

 However, _embedding models_ and _multimodal generation models_ are not supported yet.

@@ -159,7 +159,7 @@ However, _embedding models_ and _multimodal generation models_ are not supported

 MindIE owns a variety of features outlined [here](https://www.hiascend.com/document/detail/zh/mindie/22RC1/mindiellm/llmdev/mindie_llm0001.html).

-At present, MASS-Base supports a subset of these capabilities, including
+At present, MaaS-Base supports a subset of these capabilities, including
 [Quantization](https://www.hiascend.com/document/detail/zh/mindie/22RC1/mindiellm/llmdev/mindie_llm0279.html),
 [Extending Context Size](https://www.hiascend.com/document/detail/zh/mindie/22RC1/mindiellm/llmdev/mindie_llm0295.html),
 [Distributed Inference](https://www.hiascend.com/document/detail/zh/mindie/22RC1/mindiellm/llmdev/mindie_llm0296.html),
@@ -189,7 +189,7 @@ At present, MASS-Base supports a subset of these capabilities, including

 MindIE has configurable [parameters](https://www.hiascend.com/document/detail/zh/mindie/22RC1/mindiellm/llmdev/mindie_service0285.html) and [environment variables](https://www.hiascend.com/document/detail/zh/mindie/22RC1/mindiellm/llmdev/mindie_llm0416.html).

-To avoid directly configuring JSON, MASS-Base provides a set of command line parameters as below.
+To avoid directly configuring JSON, MaaS-Base provides a set of command line parameters as below.

 | Parameter | Default | Range | Scope | Description |
 |------------------------------------------------------|---------|--------------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
@@ -253,9 +253,9 @@ To avoid directly configuring JSON, MASS-Base provides a set of command line par

 !!! note

-    MASS-Base allows users to inject custom environment variables during model deployment, however, some variables may be conflicted with MASS-Base managment.
+    MaaS-Base allows users to inject custom environment variables during model deployment, however, some variables may be conflicted with MaaS-Base managment.

-    Hence, MASS-Base will override/prevent those variables. Please compare the model instance logs' output with your expectations.
+    Hence, MaaS-Base will override/prevent those variables. Please compare the model instance logs' output with your expectations.

 ## VoxBox
