@@ -1,6 +1,6 @@
 # Built-in Inference Backends

-GPUStack supports the following inference backends:
+MASS-Base supports the following inference backends:

 - [vLLM](#vllm)
 - [SGLang](#sglang)
@@ -31,13 +31,13 @@ vLLM seamlessly supports most state-of-the-art open-source models, including:
 - Embedding Models (e.g. `Qwen3-Embedding`)
 - Reranker Models (e.g. `Qwen3-Reranker`)

-By default, GPUStack estimates the VRAM requirement for the model instance based on the model's metadata.
+By default, MASS-Base estimates the VRAM requirement for the model instance based on the model's metadata.

 You can customize the parameters to fit your needs. The following vLLM parameters might be useful:

 - `--gpu-memory-utilization` (default: 0.9): The fraction of GPU memory to use for the model instance.
-- `--max-model-len`: Model context length. For large-context models, GPUStack automatically sets this parameter to `8192` to simplify model deployment, especially in resource constrained environments. You can customize this parameter to fit your needs.
-- `--tensor-parallel-size`: Number of tensor parallel replicas. By default, GPUStack sets this parameter given the GPU resources available and the estimation of the model's memory requirement. You can customize this parameter to fit your needs.
+- `--max-model-len`: Model context length. For large-context models, MASS-Base automatically sets this parameter to `8192` to simplify model deployment, especially in resource constrained environments. You can customize this parameter to fit your needs.
+- `--tensor-parallel-size`: Number of tensor parallel replicas. By default, MASS-Base sets this parameter given the GPU resources available and the estimation of the model's memory requirement. You can customize this parameter to fit your needs.

 For more details, please refer to [vLLM CLI Reference](https://docs.vllm.ai/en/stable/cli/serve/).

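Editor's sketch for the hunk above: one way to assemble the three vLLM flags it documents into a parameter list. The flag names come from the documentation text; the helper name `vllm_backend_parameters` and the idea of passing the result as a list of `--flag=value` strings are assumptions made for illustration, not a confirmed MASS-Base API.

```python
from typing import List, Optional


def vllm_backend_parameters(
    gpu_memory_utilization: float = 0.9,
    max_model_len: Optional[int] = None,
    tensor_parallel_size: Optional[int] = None,
) -> List[str]:
    """Build vLLM CLI flags (hypothetical helper, not part of MASS-Base).

    Values left as None are omitted so that the automatic defaults
    described in the documentation still apply.
    """
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")

    params = [f"--gpu-memory-utilization={gpu_memory_utilization}"]
    if max_model_len is not None:
        params.append(f"--max-model-len={max_model_len}")
    if tensor_parallel_size is not None:
        params.append(f"--tensor-parallel-size={tensor_parallel_size}")
    return params


# Example: override the automatic 8192 context length on a two-GPU worker.
print(vllm_backend_parameters(0.85, max_model_len=32768, tensor_parallel_size=2))
```

Leaving `max_model_len` and `tensor_parallel_size` unset keeps the automatic behavior described in the hunk; setting them overrides it.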
@@ -56,11 +56,11 @@ Please refer to the vLLM [documentation](https://docs.vllm.ai/en/stable/models/s
 - **Video Tasks**: Video generation and editing (e.g., `Wan2.2`)
 - **Audio Tasks**: Speech synthesis, voice cloning, and more (e.g., `Qwen3-TTS`)

-GPUStack integrates with vLLM-Omni to deliver a seamless experience for deploying and managing omni-modal models. When a model is deployed via the vLLM backend, GPUStack automatically detects whether it is omni-modal based on its metadata and sets the required parameters for vLLM-Omni.
+MASS-Base integrates with vLLM-Omni to deliver a seamless experience for deploying and managing omni-modal models. When a model is deployed via the vLLM backend, MASS-Base automatically detects whether it is omni-modal based on its metadata and sets the required parameters for vLLM-Omni.

 #### Distributed Inference Across Workers (Experimental)

-vLLM supports distributed inference across multiple workers using [Ray](https://ray.io). You can enable a Ray cluster in GPUStack by checking the `Allow Distributed Inference Across Workers` option when deploying a model. This allows vLLM to run distributed inference across multiple workers.
+vLLM supports distributed inference across multiple workers using [Ray](https://ray.io). You can enable a Ray cluster in MASS-Base by checking the `Allow Distributed Inference Across Workers` option when deploying a model. This allows vLLM to run distributed inference across multiple workers.

 !!! warning "Known Limitations"

@@ -86,15 +86,15 @@ See the full list of supported parameters for vLLM [here](https://docs.vllm.ai/e

 It is designed to deliver low-latency and high-throughput inference across a wide range of setups, from a single GPU to large distributed clusters.

-By default, GPUStack estimates the VRAM requirement for the model instance based on model metadata.
+By default, MASS-Base estimates the VRAM requirement for the model instance based on model metadata.

-When needed, GPUStack also sets several parameters automatically for large-context models. Common SGLang parameters include:
+When needed, MASS-Base also sets several parameters automatically for large-context models. Common SGLang parameters include:

 - `--mem-fraction-static` (default: `0.9`): The per-GPU allocatable VRAM fraction. The scheduler uses this value for resource matching and candidate selection. You can override it via the model's `backend_parameters`.
-- `--context-length`: Model context length. For large-context models, if the automatically estimated context length exceeds device capability, GPUStack sets this parameter to `8192` to simplify deployment in resource-constrained environments. You can customize this parameter as needed.
-- `--tp-size`: Tensor parallel size. When not explicitly provided, GPUStack infers and sets this parameter based on the selected GPUs.
-- `--pp-size`: Pipeline parallel size. In multi-node deployments, GPUStack determines a combination of `--tp-size` and `--pp-size` according to the model and cluster configuration.
-- Multi-node arguments: `--nnodes`, `--node-rank`, `--dist-init-addr`. When distributed inference is enabled, GPUStack injects these arguments to initialize multi-node communication.
+- `--context-length`: Model context length. For large-context models, if the automatically estimated context length exceeds device capability, MASS-Base sets this parameter to `8192` to simplify deployment in resource-constrained environments. You can customize this parameter as needed.
+- `--tp-size`: Tensor parallel size. When not explicitly provided, MASS-Base infers and sets this parameter based on the selected GPUs.
+- `--pp-size`: Pipeline parallel size. In multi-node deployments, MASS-Base determines a combination of `--tp-size` and `--pp-size` according to the model and cluster configuration.
+- Multi-node arguments: `--nnodes`, `--node-rank`, `--dist-init-addr`. When distributed inference is enabled, MASS-Base injects these arguments to initialize multi-node communication.

 For more details, please refer to [SGLang documentation](https://docs.sglang.ai/index.html).

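Editor's sketch for the hunk above: a rough illustration of the documented SGLang behavior, namely falling back to a context length of `8192` when the estimated length exceeds what the device can hold, and injecting the multi-node arguments. The flag names and the `8192` fallback come from the documentation text. The function names, the `max_supported_context_length` input, and the comparison rule are assumptions; the actual scheduler logic is not shown in this diff.

```python
from typing import List

FALLBACK_CONTEXT_LENGTH = 8192  # value stated in the documentation


def sglang_backend_parameters(
    estimated_context_length: int,
    max_supported_context_length: int,  # hypothetical device capability input
    tp_size: int,
    pp_size: int = 1,
    mem_fraction_static: float = 0.9,
) -> List[str]:
    """Build single-node SGLang flags (hypothetical helper)."""
    # Assumed rule: keep the estimate if it fits, otherwise fall back to 8192.
    if estimated_context_length <= max_supported_context_length:
        context_length = estimated_context_length
    else:
        context_length = FALLBACK_CONTEXT_LENGTH
    return [
        f"--mem-fraction-static={mem_fraction_static}",
        f"--context-length={context_length}",
        f"--tp-size={tp_size}",
        f"--pp-size={pp_size}",
    ]


def multi_node_arguments(nnodes: int, node_rank: int, dist_init_addr: str) -> List[str]:
    """Build the multi-node flags that get injected for distributed inference."""
    if not 0 <= node_rank < nnodes:
        raise ValueError("node_rank must satisfy 0 <= node_rank < nnodes")
    return [
        f"--nnodes={nnodes}",
        f"--node-rank={node_rank}",
        f"--dist-init-addr={dist_init_addr}",
    ]


# Example: a 128K-context model on hardware that can only hold 32K tokens,
# running as the second of two nodes (the address is a placeholder).
print(sglang_backend_parameters(131072, 32768, tp_size=4))
print(multi_node_arguments(2, 1, "10.0.0.1:5000"))
```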
@@ -108,7 +108,7 @@ SGLang also supports image models. The ones we have verified include: Qwen-Image

 #### Distributed Inference Across Workers (Experimental)

-You can enable distributed SGLang inference across multiple workers in GPUStack.
+You can enable distributed SGLang inference across multiple workers in MASS-Base.

 !!! warning "Known Limitations"

@@ -151,7 +151,7 @@ See the full list of supported parameters for SGLang [here](https://docs.sglang.

 MindIE supports various models listed [here](https://www.hiascend.com/software/mindie/modellist).

-Within GPUStack, support [large language models (LLMs)](https://www.hiascend.com/software/mindie/modellist) and [multimodal language models (VLMs)](https://www.hiascend.com/software/mindie/modellist).
+Within MASS-Base, MindIE supports [large language models (LLMs)](https://www.hiascend.com/software/mindie/modellist) and [multimodal language models (VLMs)](https://www.hiascend.com/software/mindie/modellist).

 However, _embedding models_ and _multimodal generation models_ are not supported yet.

@@ -159,7 +159,7 @@ However, _embedding models_ and _multimodal generation models_ are not supported

 MindIE owns a variety of features outlined [here](https://www.hiascend.com/document/detail/zh/mindie/22RC1/mindiellm/llmdev/mindie_llm0001.html).

-At present, GPUStack supports a subset of these capabilities, including
+At present, MASS-Base supports a subset of these capabilities, including
 [Quantization](https://www.hiascend.com/document/detail/zh/mindie/22RC1/mindiellm/llmdev/mindie_llm0279.html),
 [Extending Context Size](https://www.hiascend.com/document/detail/zh/mindie/22RC1/mindiellm/llmdev/mindie_llm0295.html),
 [Distributed Inference](https://www.hiascend.com/document/detail/zh/mindie/22RC1/mindiellm/llmdev/mindie_llm0296.html),
@@ -189,7 +189,7 @@ At present, GPUStack supports a subset of these capabilities, including

 MindIE has configurable [parameters](https://www.hiascend.com/document/detail/zh/mindie/22RC1/mindiellm/llmdev/mindie_service0285.html) and [environment variables](https://www.hiascend.com/document/detail/zh/mindie/22RC1/mindiellm/llmdev/mindie_llm0416.html).

-To avoid directly configuring JSON, GPUStack provides a set of command line parameters as below.
+To avoid directly configuring JSON, MASS-Base provides a set of command line parameters as below.

 | Parameter | Default | Range | Scope | Description |
 |------------------------------------------------------|---------|--------------------------|----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
@@ -253,9 +253,9 @@ To avoid directly configuring JSON, GPUStack provides a set of command line para

 !!! note

-    GPUStack allows users to inject custom environment variables during model deployment, however, some variables may be conflicted with GPUStack managment.
+    MASS-Base allows users to inject custom environment variables during model deployment; however, some variables may conflict with MASS-Base management.

-    Hence, GPUStack will override/prevent those variables. Please compare the model instance logs' output with your expectations.
+    Hence, MASS-Base will override/prevent those variables. Please compare the model instance logs' output with your expectations.

 ## VoxBox
