Răsfoiți Sursa

fix: P0 critical fixes and VRAM estimation improvements

P0 Security & Crash Fixes:
- auth: Fix OIDC CSRF by generating random state and validating on callback
- auth: Fix SAML XXE/signature bypass by using process_response() instead of manual XML parsing
- policies/utils: Add None guard for resource_claim.vram in update_allocated_vram
- policies/utils: Add None guard for worker.system_reserved (ram/vram)
- base_candidate_selector: Fix claim.vram.items() None check order
- ascend_mindie: Fix MLA d_head copy-paste bug (qk_nope_head_dim -> qk_rope_head_dim)
- gguf_resource_fit_selector: Fix _get_max_offload_layers_candidates return type ([] not 0)

VRAM Estimation Improvements:
- Add estimated_vram field to ComputedResourceClaim
- SGLang/VLLM selectors populate estimated_vram for accurate resource reporting
- evaluator: summarize_candidate_resource_claim prefers estimated_vram over allocated
- calculator: Add safety checks for GGUF estimate items

Model Catalog:
- Remove HuggingFace-only catalog, keep ModelScope catalog for China mainland access
- Update documentation links to point to ModelScope catalog
kinglee 2 zile în urmă
părinte
comite
6c9089b1f5

+ 8 - 0
docs/deployment/docker-compose.md

@@ -81,6 +81,14 @@ cd maas-base/docker-compose
 GPUSTACK_VERSION=$(git -C .. rev-parse --short HEAD) docker compose -f docker-compose.external-observability.yaml up -d --build
 ```
 
+```BASH
+ nohup sh -c 'GPUSTACK_VERSION=$(git -C .. rev-parse --short HEAD) docker compose -f docker-compose.external-observability.yaml up -d --build' > deploy.log 2>&1 &
+```
+
+
+
+
+
 该命令会启动以下四个容器:
 
 - **`maas-base-db`** — PostgreSQL 16 数据库

+ 1 - 1
docs/user-guide/model-catalog.md

@@ -14,7 +14,7 @@ You can deploy a model from the Model Catalog by clicking the model card. A mode
 
 ## Customize Model Catalog
 
-You can customize the Model Catalog by providing a YAML file via GPUStack server configuration using the `--model-catalog-file` flag. It accepts either a local file path or a URL. You can refer to the built-in model catalog file [here](https://github.com/gpustack/gpustack/blob/main/gpustack/assets/model-catalog.yaml) for the schema.
+You can customize the Model Catalog by providing a YAML file via GPUStack server configuration using the `--model-catalog-file` flag. It accepts either a local file path or a URL. You can refer to the built-in model catalog file [here](https://github.com/gpustack/gpustack/blob/main/gpustack/assets/model-catalog-modelscope.yaml) for the schema.
 
 The following is an example of a custom model catalog YAML file:
 

+ 1 - 0
gpustack/api/auth.py

@@ -42,6 +42,7 @@ logger = logging.getLogger(__name__)
 
 SESSION_COOKIE_NAME = "gpustack_session"
 OIDC_ID_TOKEN_COOKIE_NAME = "gpustack_oidc_id_token"
+OIDC_STATE_COOKIE_NAME = "gpustack_oidc_state"
 SSO_LOGIN_COOKIE_NAME = "gpustack_sso_login"
 SYSTEM_USER_PREFIX = "system/"
 SYSTEM_WORKER_USER_PREFIX = "system/worker/"

+ 0 - 2521
gpustack/assets/model-catalog.yaml

@@ -1,2521 +0,0 @@
-# YAML Variables
-.vllm_omni_ascend_stable_version: &vllm_omni_ascend_stable_version "0.14.1"
-.vllm_omni_stable_version: &vllm_omni_stable_version "0.16.0"
-
-draft_models:
-- name: Qwen3-8B-EAGLE3
-  algorithm: eagle3
-  source: huggingface
-  huggingface_repo_id: Tengyunw/qwen3_8b_eagle3
-- name: Qwen3-30B-A3B-EAGLE3
-  algorithm: eagle3
-  source: huggingface
-  huggingface_repo_id: Tengyunw/qwen3_30b_moe_eagle3
-- name: Qwen3-235B-A22B-EAGLE3
-  algorithm: eagle3
-  source: huggingface
-  huggingface_repo_id: lmsys/Qwen3-235B-A22B-EAGLE3
-- name: gpt-oss-120b-EAGLE3
-  algorithm: eagle3
-  source: huggingface
-  huggingface_repo_id: lmsys/EAGLE3-gpt-oss-120b-bf16
-model_sets:
-- name: Qwen3-0.6B
-  description: Qwen3 is a family of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support.
-  home: https://qwenlm.github.io
-  icon: /static/catalog_icons/qwen.png
-  size: 0.6
-  categories:
-    - llm
-  capabilities:
-    - context/128K
-    - tools
-  licenses:
-    - apache-2.0
-  release_date: "2025-04-19"
-  specs:
-    # Ascend NPUs
-    - mode: throughput
-      quantization: BF16
-      gpu_filters:
-        vendor: ascend
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-0.6B
-      backend: MindIE
-      backend_parameters:
-        - --max-seq-len=8192
-    # Other GPUs
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-0.6B
-      backend: vLLM
-      backend_parameters:
-        - --reasoning-parser=deepseek_r1
-        - --max-model-len=8192
-- name: Qwen3-8B
-  description: Qwen3 is a family of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support.
-  home: https://qwenlm.github.io
-  icon: /static/catalog_icons/qwen.png
-  size: 8
-  categories:
-    - llm
-  capabilities:
-    - context/128K
-    - tools
-  licenses:
-    - apache-2.0
-  release_date: "2025-04-19"
-  specs:
-    # Ascend NPUs
-    - mode: throughput
-      quantization: BF16
-      gpu_filters:
-        vendor: ascend
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-8B
-      backend: MindIE
-      backend_parameters:
-        - --max-seq-len=32768
-    # Other GPUs
-    - mode: throughput
-      quantization: FP8
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-8B-FP8
-      backend: vLLM
-      backend_parameters:
-        - --reasoning-parser=deepseek_r1
-        - --max-model-len=32768
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-8B
-      backend: vLLM
-      backend_parameters:
-        - --reasoning-parser=deepseek_r1
-        - --max-model-len=32768
-- name: Falcon-H1R-7B
-  description: Falcon-H1R-7B is a reasoning-specialized language model built on top of Falcon-H1-7B-Base, featuring a Hybrid-Head Language Model (Transformer-SSM) architecture that delivers outstanding performance in mathematics, programming, and instruction following.
-  home: https://huggingface.co/tiiuae
-  icon: /static/catalog_icons/tii.png
-  size: 7
-  categories:
-    - llm
-  capabilities:
-    - context/256K
-  licenses:
-    - falcon-llm-license
-  release_date: "2026-01-05"
-  specs:
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: tiiuae/Falcon-H1R-7B
-      backend: vLLM
-      backend_parameters:
-        - --reasoning-parser=deepseek_r1
-        - --max-model-len=65536
-- name: Qwen3-14B
-  description: Qwen3 is a family of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support.
-  home: https://qwenlm.github.io
-  icon: /static/catalog_icons/qwen.png
-  size: 14
-  categories:
-    - llm
-  capabilities:
-    - context/128K
-    - tools
-  licenses:
-    - apache-2.0
-  release_date: "2025-04-19"
-  specs:
-    # Ascend NPUs
-    - mode: throughput
-      quantization: BF16
-      gpu_filters:
-        vendor: ascend
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-14B
-      backend: MindIE
-      backend_parameters:
-        - --max-seq-len=32768
-    # Other GPUs
-    - mode: throughput
-      quantization: FP8
-      gpu_filters:
-        vendor: nvidia
-        compute_capability: ">=9.0" # Hopper or later
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-14B-FP8
-      backend: SGLang
-      backend_parameters:
-        - --reasoning-parser=qwen3
-        - --context-length=32768
-    - mode: throughput
-      quantization: FP8
-      gpu_filters:
-        vendor: nvidia
-        compute_capability: "<9.0" # Before Hopper
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-14B-FP8
-      backend: vLLM
-      backend_parameters:
-        - --reasoning-parser=deepseek_r1
-        - --max-model-len=32768
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-14B
-      backend: vLLM
-      backend_parameters:
-        - --reasoning-parser=deepseek_r1
-        - --max-model-len=32768
-- name: Qwen3-32B
-  description: Qwen3 is a family of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support.
-  home: https://qwenlm.github.io
-  icon: /static/catalog_icons/qwen.png
-  size: 32
-  categories:
-    - llm
-  capabilities:
-    - context/128K
-    - tools
-  licenses:
-    - apache-2.0
-  release_date: "2025-04-19"
-  specs:
-    # Ascend NPUs
-    - mode: throughput
-      quantization: BF16
-      gpu_filters:
-        vendor: ascend
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-32B
-      backend: MindIE
-      backend_parameters:
-        - --max-seq-len=32768
-    # Other GPUs
-    - mode: throughput
-      quantization: FP8
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-32B-FP8
-      backend: vLLM
-      backend_parameters:
-        - --reasoning-parser=deepseek_r1
-        - --max-model-len=32768
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-32B
-      backend: vLLM
-      backend_parameters:
-        - --reasoning-parser=deepseek_r1
-        - --max-model-len=32768
-- name: Qwen3-Coder-Next
-  description: Qwen3-Coder-Next is a super-efficient coding model with 80B total parameters and 3B activated parameters (MoE architecture). It achieves performance comparable to models with 10-20x more active parameters, excelling at long-horizon reasoning, complex tool usage, and IDE integration.
-  home: https://qwenlm.github.io
-  icon: /static/catalog_icons/qwen.png
-  size: 80
-  activated_size: 3
-  categories:
-    - llm
-  capabilities:
-    - context/256K
-    - tools
-  licenses:
-    - apache-2.0
-  release_date: "2026-02-03"
-  specs:
-    - mode: throughput
-      quantization: FP8
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-Coder-Next-FP8
-      backend: vLLM
-      backend_parameters:
-        - --max-model-len=65536
-        - --enable-auto-tool-choice
-        - --tool-call-parser=qwen3_coder
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-Coder-Next
-      backend: vLLM
-      backend_parameters:
-        - --max-model-len=65536
-        - --enable-auto-tool-choice
-        - --tool-call-parser=qwen3_coder
-- name: Qwen3-30B-A3B-Instruct-2507
-  description: Qwen3 is a family of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support.
-  home: https://qwenlm.github.io
-  icon: /static/catalog_icons/qwen.png
-  size: 30
-  activated_size: 3
-  categories:
-    - llm
-  capabilities:
-    - context/256K
-    - tools
-  licenses:
-    - apache-2.0
-  release_date: "2025-07-21"
-  specs:
-    # Ascend NPUs
-    - mode: throughput
-      quantization: BF16
-      gpu_filters:
-        vendor: ascend
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-30B-A3B-Instruct-2507
-      backend: MindIE
-      backend_parameters:
-        - --max-seq-len=32768
-    # Other GPUs
-    - mode: throughput
-      quantization: FP8
-      gpu_filters:
-        vendor: nvidia
-        compute_capability: ">=9.0" # Hopper or later
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-30B-A3B-Instruct-2507-FP8
-      backend: SGLang
-      backend_parameters:
-        - --tool-call-parser=qwen25
-        - --context-length=32768
-    - mode: throughput
-      quantization: FP8
-      gpu_filters:
-        vendor: nvidia
-        compute_capability: "<9.0" # Before Hopper
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-30B-A3B-Instruct-2507-FP8
-      backend: vLLM
-      backend_parameters:
-        - --tool-call-parser=hermes
-        - --enable-auto-tool-choice
-        - --max-model-len=32768
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-30B-A3B-Instruct-2507
-      backend: vLLM
-      backend_parameters:
-        - --tool-call-parser=hermes
-        - --enable-auto-tool-choice
-        - --max-model-len=32768
-- name: Qwen3-30B-A3B-Thinking-2507
-  description: Qwen3 is a family of large language models in Qwen series, offering a comprehensive suite of dense and mixture-of-experts (MoE) models. Built upon extensive training, Qwen3 delivers groundbreaking advancements in reasoning, instruction-following, agent capabilities, and multilingual support.
-  home: https://qwenlm.github.io
-  icon: /static/catalog_icons/qwen.png
-  size: 30
-  activated_size: 3 
-  categories:
-    - llm
-  capabilities:
-    - context/256K
-    - tools
-  licenses:
-    - apache-2.0
-  release_date: "2025-07-21"
-  specs:
-    # Ascend NPUs
-    - mode: throughput
-      quantization: BF16
-      gpu_filters:
-        vendor: ascend
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-30B-A3B-Thinking-2507
-      backend: MindIE
-      backend_parameters:
-        - --max-seq-len=32768
-    # Other GPUs
-    - mode: throughput
-      quantization: FP8
-      gpu_filters:
-        vendor: nvidia
-        compute_capability: ">=9.0" # Hopper or later
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-30B-A3B-Thinking-2507-FP8
-      backend: SGLang
-      backend_parameters:
-        - --reasoning-parser=deepseek-r1
-        - --tool-call-parser=qwen25
-        - --context-length=32768
-    - mode: throughput
-      quantization: FP8
-      gpu_filters:
-        vendor: nvidia
-        compute_capability: "<9.0" # Before Hopper
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-30B-A3B-Thinking-2507-FP8
-      backend: vLLM
-      backend_parameters:
-        - --reasoning-parser=deepseek_r1
-        - --tool-call-parser=hermes
-        - --enable-auto-tool-choice
-        - --max-model-len=32768
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-30B-A3B-Thinking-2507
-      backend: vLLM
-      backend_parameters:
-        - --reasoning-parser=deepseek_r1
-        - --tool-call-parser=hermes
-        - --enable-auto-tool-choice
-        - --max-model-len=32768
-- name: Qwen3-235B-A22B-Instruct-2507
-  description: The updated version of the Qwen3-235B-A22B non-thinking mode.
-  home: https://qwenlm.github.io
-  icon: /static/catalog_icons/qwen.png
-  size: 235
-  activated_size: 22
-  categories:
-    - llm
-  capabilities:
-    - context/1M
-    - tools
-  licenses:
-    - apache-2.0
-  release_date: "2025-07-21"
-  specs:
-    # Ascend NPUs
-    - mode: throughput
-      quantization: BF16
-      gpu_filters:
-        vendor: ascend
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-235B-A22B-Instruct-2507
-      backend: MindIE
-      backend_parameters:
-        - --max-seq-len=65536
-    # Other GPUs
-    - mode: throughput
-      quantization: FP8
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-235B-A22B-Instruct-2507-FP8
-      backend: vLLM
-      backend_parameters:
-        - --tool-call-parser=hermes
-        - --enable-auto-tool-choice
-        - --max-model-len=65536
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-235B-A22B-Instruct-2507
-      backend: vLLM
-      backend_parameters:
-        - --tool-call-parser=hermes
-        - --enable-auto-tool-choice
-        - --max-model-len=65536
-- name: Qwen3-235B-A22B-Thinking-2507
-  description: The updated version of the Qwen3-235B-A22B thinking mode.
-  home: https://qwenlm.github.io
-  icon: /static/catalog_icons/qwen.png
-  size: 235
-  activated_size: 22
-  categories:
-    - llm
-  capabilities:
-    - context/1M
-    - tools
-  licenses:
-    - apache-2.0
-  release_date: "2025-07-21"
-  specs:
-    # Ascend NPUs
-    - mode: throughput
-      quantization: BF16
-      gpu_filters:
-        vendor: ascend
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-235B-A22B-Thinking-2507
-      backend: MindIE
-      backend_parameters:
-        - --max-seq-len=65536
-    # Other GPUs
-    - mode: throughput
-      quantization: FP8
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-235B-A22B-Thinking-2507-FP8
-      backend: vLLM
-      backend_parameters:
-        - --reasoning-parser=deepseek_r1
-        - --tool-call-parser=hermes
-        - --enable-auto-tool-choice
-        - --max-model-len=65536
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-235B-A22B-Thinking-2507
-      backend: vLLM
-      backend_parameters:
-        - --reasoning-parser=deepseek_r1
-        - --tool-call-parser=hermes
-        - --enable-auto-tool-choice
-        - --max-model-len=65536
-- name: Qwen3.5-0.8B
-  description: Qwen3.5-0.8B is a compact language model from the Qwen family, designed for efficient reasoning, coding, and multilingual understanding across diverse tasks.
-  home: https://qwenlm.github.io
-  icon: /static/catalog_icons/qwen.png
-  size: 0.8
-  categories:
-    - llm
-  capabilities:
-    - context/256K
-    - reasoning
-    - tools
-    - vision
-  licenses:
-    - apache-2.0
-  release_date: "2026-03-02"
-  specs:
-    # Ascend NPUs
-    - mode: standard
-      quantization: BF16
-      gpu_filters:
-        vendor: ascend
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3.5-0.8B
-      backend: SGLang
-      backend_version: 0.5.9
-      backend_parameters:
-        - --context-length=32768
-        - --disable-radix-cache
-        - --chunked-prefill-size=4096
-        - --max-prefill-tokens=4096
-        - --max-total-tokens=40960
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3.5-0.8B
-      backend: vLLM
-      backend_version: 0.17.1
-      backend_parameters:
-        - --reasoning-parser=qwen3
-        - --max-model-len=32768
-- name: Qwen3.5-2B
-  description: Qwen3.5-2B is a compact language model from the Qwen family, designed for efficient reasoning, coding, and multilingual understanding across diverse tasks.
-  home: https://qwenlm.github.io
-  icon: /static/catalog_icons/qwen.png
-  size: 2
-  categories:
-    - llm
-  capabilities:
-    - context/256K
-    - reasoning
-    - tools
-    - vision
-  licenses:
-    - apache-2.0
-  release_date: "2026-03-02"
-  specs:
-    # Ascend NPUs
-    - mode: standard
-      quantization: BF16
-      gpu_filters:
-        vendor: ascend
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3.5-2B
-      backend: SGLang
-      backend_version: 0.5.9
-      backend_parameters:
-        - --context-length=32768
-        - --disable-radix-cache
-        - --chunked-prefill-size=4096
-        - --max-prefill-tokens=4096
-        - --max-total-tokens=40960
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3.5-2B
-      backend: vLLM
-      backend_version: 0.17.1
-      backend_parameters:
-        - --reasoning-parser=qwen3
-        - --max-model-len=32768
-- name: Qwen3.5-4B
-  description: Qwen3.5-4B is a compact language model from the Qwen family, designed for efficient reasoning, coding, and multilingual understanding across diverse tasks.
-  home: https://qwenlm.github.io
-  icon: /static/catalog_icons/qwen.png
-  size: 4
-  categories:
-    - llm
-  capabilities:
-    - context/256K
-    - reasoning
-    - tools
-    - vision
-  licenses:
-    - apache-2.0
-  release_date: "2026-03-02"
-  specs:
-    # Ascend NPUs
-    - mode: standard
-      quantization: BF16
-      gpu_filters:
-        vendor: ascend
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3.5-4B
-      backend: SGLang
-      backend_version: 0.5.9
-      backend_parameters:
-        - --reasoning-parser=qwen3
-        - --context-length=32768
-        - --disable-radix-cache
-        - --chunked-prefill-size=4096
-        - --max-prefill-tokens=4096
-        - --max-total-tokens=40960
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3.5-4B
-      backend: vLLM
-      backend_version: 0.17.1
-      backend_parameters:
-        - --reasoning-parser=qwen3
-        - --max-model-len=32768
-- name: Qwen3.5-9B
-  description: Qwen3.5-9B is a model from the Qwen family, designed for strong reasoning, coding, and multilingual understanding with competitive performance across a wide range of tasks.
-  home: https://qwenlm.github.io
-  icon: /static/catalog_icons/qwen.png
-  size: 9
-  categories:
-    - llm
-  capabilities:
-    - context/256K
-    - reasoning
-    - tools
-    - vision
-  licenses:
-    - apache-2.0
-  release_date: "2026-03-02"
-  specs:
-    # Ascend NPUs
-    - mode: standard
-      quantization: BF16
-      gpu_filters:
-        vendor: ascend
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3.5-9B
-      backend: SGLang
-      backend_version: 0.5.9
-      backend_parameters:
-        - --reasoning-parser=qwen3
-        - --context-length=32768
-        - --disable-radix-cache
-        - --chunked-prefill-size=4096
-        - --max-prefill-tokens=4096
-        - --max-total-tokens=40960
-    - mode: throughput
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3.5-9B
-      backend: vLLM
-      backend_version: 0.17.1
-      backend_parameters:
-        - --reasoning-parser=qwen3
-        - --max-model-len=32768
-        - --performance-mode=throughput
-        - --enable-prefix-caching
-    - mode: latency
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3.5-9B
-      backend: vLLM
-      backend_version: 0.17.1
-      backend_parameters:
-        - --reasoning-parser=qwen3
-        - --max-model-len=32768
-        - --performance-mode=interactivity
-        - --language-model-only
-      speculative_config:
-        enabled: true
-        algorithm: mtp
-        num_draft_tokens: 1
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3.5-9B
-      backend: vLLM
-      backend_version: 0.17.1
-      backend_parameters:
-        - --reasoning-parser=qwen3
-        - --max-model-len=32768
-- name: Qwen3.5-27B
-  description: Qwen3.5-27B is a model designed for strong reasoning, coding, and multilingual understanding with competitive performance across a wide range of tasks.
-  home: https://qwenlm.github.io
-  icon: /static/catalog_icons/qwen.png
-  size: 27
-  categories:
-    - llm
-  capabilities:
-    - context/256K
-    - reasoning
-    - tools
-    - vision
-  licenses:
-    - apache-2.0
-  release_date: "2026-02-24"
-  specs:
-    # Ascend NPUs
-    - mode: standard
-      quantization: BF16
-      gpu_filters:
-        vendor: ascend
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3.5-27B
-      backend: SGLang
-      backend_version: 0.5.9
-      backend_parameters:
-        - --reasoning-parser=qwen3
-        - --context-length=32768
-        - --disable-radix-cache
-        - --chunked-prefill-size=4096
-        - --max-prefill-tokens=4096
-        - --max-total-tokens=40960
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3.5-27B
-      backend: vLLM
-      backend_version: 0.17.1
-      backend_parameters:
-        - --reasoning-parser=qwen3
-        - --max-model-len=32768
-    - mode: throughput
-      quantization: FP8
-      gpu_filters:
-        vendor: nvidia
-        compute_capability: ">=9.0"
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3.5-27B-FP8
-      backend: vLLM
-      backend_version: 0.17.1
-      backend_parameters:
-        - --reasoning-parser=qwen3
-        - --max-model-len=32768
-        - --performance-mode=throughput
-        - --enable-prefix-caching
-- name: Qwen3.5-35B-A3B
-  description: Qwen3.5-35B-A3B is a 35-billion-parameter open-source large language model from the Qwen family, designed for strong reasoning, code generation, and multilingual understanding across diverse tasks.
-  home: https://qwenlm.github.io
-  icon: /static/catalog_icons/qwen.png
-  size: 35
-  activated_size: 3
-  categories:
-    - llm
-  capabilities:
-    - context/256K
-    - reasoning
-    - tools
-    - vision
-  licenses:
-    - apache-2.0
-  release_date: "2026-02-24"
-  specs:
-    # Ascend NPUs
-    - mode: standard
-      quantization: BF16
-      gpu_filters:
-        vendor: ascend
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3.5-35B-A3B
-      backend: SGLang
-      backend_version: 0.5.9
-      backend_parameters:
-        - --reasoning-parser=qwen3
-        - --context-length=32768
-        - --disable-radix-cache
-        - --chunked-prefill-size=4096
-        - --max-prefill-tokens=4096
-        - --max-total-tokens=40960
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3.5-35B-A3B
-      backend: vLLM
-      backend_version: 0.17.1
-      backend_parameters:
-        - --reasoning-parser=qwen3
-        - --max-model-len=32768
-    - mode: throughput
-      quantization: FP8
-      gpu_filters:
-        vendor: nvidia
-        compute_capability: ">=9.0"
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3.5-35B-A3B-FP8
-      backend: vLLM
-      backend_version: 0.17.1
-      backend_parameters:
-        - --reasoning-parser=qwen3
-        - --max-model-len=32768
-        - --performance-mode=throughput
-        - --enable-prefix-caching
-    - mode: latency
-      quantization: FP8
-      gpu_filters:
-        vendor: nvidia
-        compute_capability: ">=9.0"
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3.5-35B-A3B-FP8
-      backend: vLLM
-      backend_version: 0.17.1
-      backend_parameters:
-        - --reasoning-parser=qwen3
-        - --max-model-len=32768
-        - --performance-mode=interactivity
-      speculative_config:
-        enabled: true
-        algorithm: mtp
-        num_draft_tokens: 1
-- name: Qwen3.5-122B-A10B
-  description: Qwen3.5-122B-A10B is a 122-billion-parameter open-source large language model from the Qwen family, designed for strong reasoning, code generation, and multilingual understanding across diverse tasks.
-  home: https://qwenlm.github.io
-  icon: /static/catalog_icons/qwen.png
-  size: 122
-  activated_size: 10
-  categories:
-    - llm
-  capabilities:
-    - context/256K
-    - reasoning
-    - tools
-    - vision
-  licenses:
-    - apache-2.0
-  release_date: "2026-02-24"
-  specs:
-    # Ascend NPUs
-    - mode: standard
-      quantization: BF16
-      gpu_filters:
-        vendor: ascend
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3.5-122B-A10B
-      backend: SGLang
-      backend_version: 0.5.9
-      backend_parameters:
-        - --reasoning-parser=qwen3
-        - --context-length=32768
-        - --disable-radix-cache
-        - --chunked-prefill-size=4096
-        - --max-prefill-tokens=4096
-        - --max-total-tokens=40960
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3.5-122B-A10B
-      backend: vLLM
-      backend_version: 0.17.1
-      backend_parameters:
-        - --reasoning-parser=qwen3
-        - --max-model-len=32768
-    - mode: throughput
-      quantization: FP8
-      gpu_filters:
-        vendor: nvidia
-        compute_capability: ">=9.0"
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3.5-122B-A10B-FP8
-      backend: vLLM
-      backend_version: 0.17.1
-      backend_parameters:
-        - --reasoning-parser=qwen3
-        - --max-model-len=32768
-        - --performance-mode=throughput
-        - --enable-prefix-caching
-- name: Qwen3.5-397B-A17B
-  description: Qwen3.5-397B-A17B is a flagship MoE-hybrid model that delivers state-of-the-art reasoning and multimodal performance with ultra-efficient inference capabilities.
-  home: https://qwenlm.github.io
-  icon: /static/catalog_icons/qwen.png
-  size: 397
-  activated_size: 17
-  categories:
-    - llm
-  capabilities:
-    - context/256K
-    - reasoning
-    - tools
-    - vision
-  licenses:
-    - apache-2.0
-  release_date: "2026-02-16"
-  specs:
-    # Ascend NPUs
-    - mode: standard
-      quantization: BF16
-      gpu_filters:
-        vendor: ascend
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3.5-397B-A17B
-      backend: SGLang
-      backend_version: 0.5.9
-      backend_parameters:
-        - --reasoning-parser=qwen3
-        - --context-length=32768
-        - --disable-radix-cache
-        - --chunked-prefill-size=4096
-        - --max-prefill-tokens=4096
-        - --max-total-tokens=40960
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3.5-397B-A17B
-      backend: vLLM
-      backend_version: 0.17.1
-      backend_parameters:
-        - --reasoning-parser=qwen3
-        - --max-model-len=32768
-    - mode: throughput
-      quantization: FP8
-      gpu_filters:
-        vendor: nvidia
-        compute_capability: ">=9.0"
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3.5-397B-A17B-FP8
-      backend: vLLM
-      backend_version: 0.17.1
-      backend_parameters:
-        - --reasoning-parser=qwen3
-        - --max-model-len=32768
-        - --performance-mode=throughput
-        - --enable-prefix-caching
-- name: GLM-4.7
-  description: GLM-4.7 is a large language model developed by Zhipu AI, featuring advanced agentic, reasoning, and coding capabilities.
-  home: https://z.ai
-  icon: /static/catalog_icons/zai.png
-  size: 355
-  activated_size: 32
-  categories:
-    - llm
-  capabilities:
-    - context/1M
-    - reasoning
-    - tools
-  licenses:
-    - mit
-  release_date: "2025-12-22"
-  specs:
-    # TODO: tool-call-parser glm47 not yet available in the latest vLLM/SGLang release
-    - mode: throughput
-      quantization: FP8
-      gpu_filters:
-        vendor: nvidia
-        compute_capability: ">=9.0" # Hopper or later
-      source: huggingface
-      huggingface_repo_id: zai-org/GLM-4.7-FP8
-      backend: SGLang
-      backend_parameters:
-        - --reasoning-parser=glm45
-        - --context-length=65536
-    - mode: throughput
-      quantization: FP8
-      gpu_filters:
-        vendor: nvidia
-        compute_capability: "<9.0" # Before Hopper
-      source: huggingface
-      huggingface_repo_id: zai-org/GLM-4.7-FP8
-      backend: vLLM
-      backend_parameters:
-        - --reasoning-parser=glm45
-        - --max-model-len=65536
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: zai-org/GLM-4.7
-      backend: vLLM
-      backend_parameters:
-        - --reasoning-parser=glm45
-        - --max-model-len=65536
-- name: GLM-4.6
-  description: GLM-4.6 is a large language model developed by Zhipu AI, featuring advanced agentic, reasoning, and coding capabilities.
-  home: https://z.ai
-  icon: /static/catalog_icons/zai.png
-  size: 355
-  activated_size: 32
-  categories:
-    - llm
-  capabilities:
-    - context/1M
-    - reasoning
-    - tools
-  licenses:
-    - mit
-  release_date: "2025-09-30"
-  specs:
-    - mode: throughput
-      quantization: FP8
-      gpu_filters:
-        vendor: nvidia
-        compute_capability: ">=9.0" # Hopper or later
-      source: huggingface
-      huggingface_repo_id: zai-org/GLM-4.6-FP8
-      backend: SGLang
-      backend_parameters:
-        - --tool-call-parser=glm
-        - --reasoning-parser=glm45
-        - --context-length=65536
-    - mode: throughput
-      quantization: FP8
-      gpu_filters:
-        vendor: nvidia
-        compute_capability: "<9.0" # Before Hopper
-      source: huggingface
-      huggingface_repo_id: zai-org/GLM-4.6-FP8
-      backend: vLLM
-      backend_parameters:
-        - --reasoning-parser=glm45
-        - --tool-call-parser=glm45
-        - --enable-auto-tool-choice
-        - --max-model-len=65536
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: zai-org/GLM-4.6
-      backend: vLLM
-      backend_parameters:
-        - --reasoning-parser=glm45
-        - --tool-call-parser=glm45
-        - --enable-auto-tool-choice
-        - --max-model-len=65536
-- name: gpt-oss-120b
-  description: The gpt-oss series is OpenAI's family of open-weight models, designed for powerful reasoning, agentic tasks, and versatile developer use cases.
-  home: https://openai.com
-  icon: /static/catalog_icons/openai.png
-  categories:
-    - llm
-  capabilities:
-    - context/128K
-  size: 120
-  licenses:
-    - apache-2.0
-  release_date: "2025-08-05"
-  specs:
-    - mode: throughput
-      quantization: "MXFP4"
-      source: huggingface
-      huggingface_repo_id: openai/gpt-oss-120b
-      backend: vLLM
-      backend_parameters:
-        - --max-model-len=32768
-        - --tool-call-parser=openai
-        - --enable-auto-tool-choice
-        - --async-scheduling
-    - mode: standard
-      quantization: "MXFP4"
-      source: huggingface
-      huggingface_repo_id: openai/gpt-oss-120b
-      backend: vLLM
-      backend_parameters:
-        - --max-model-len=32768
-        - --tool-call-parser=openai
-        - --enable-auto-tool-choice
-- name: gpt-oss-20b
-  description: The gpt-oss series is OpenAI's family of open-weight models, designed for powerful reasoning, agentic tasks, and versatile developer use cases.
-  home: https://openai.com
-  icon: /static/catalog_icons/openai.png
-  categories:
-    - llm
-  capabilities:
-    - context/128K
-  size: 20
-  licenses:
-    - apache-2.0
-  release_date: "2025-08-05"
-  specs:
-    - mode: throughput
-      quantization: "MXFP4"
-      source: huggingface
-      huggingface_repo_id: openai/gpt-oss-20b
-      backend: vLLM
-      backend_parameters:
-        - --max-model-len=32768
-        - --tool-call-parser=openai
-        - --enable-auto-tool-choice
-        - --async-scheduling
-    - mode: standard
-      quantization: "MXFP4"
-      source: huggingface
-      huggingface_repo_id: openai/gpt-oss-20b
-      backend: vLLM
-      backend_parameters:
-        - --max-model-len=32768
-        - --tool-call-parser=openai
-        - --enable-auto-tool-choice
-- name: Deepseek-R1-0528
-  description: DeepSeek-R1-0528 is a minor version of the DeepSeek R1 model that features enhanced reasoning depth and inference capabilities. These improvements are achieved through increased computational resources and algorithmic optimizations applied during post-training. The model delivers strong performance across a range of benchmark evaluations, including mathematics, programming, and general logic, with overall capabilities approaching those of leading models such as O3 and Gemini 2.5 Pro.
-  home: https://www.deepseek.com
-  icon: /static/catalog_icons/deepseek.png
-  categories:
-    - llm
-  capabilities:
-    - context/128K
-  size: 671
-  licenses:
-    - mit
-  release_date: "2025-05-28"
-  specs:
-    - mode: throughput
-      quantization: FP8
-      gpu_filters:
-        vendor: nvidia
-        compute_capability: ">=9.0" # Hopper or later
-      source: huggingface
-      huggingface_repo_id: deepseek-ai/DeepSeek-R1-0528
-      backend: SGLang
-      backend_parameters:
-        - --enable-dp-attention
-        - --context-length=32768
-    - mode: standard
-      quantization: FP8
-      source: huggingface
-      huggingface_repo_id: deepseek-ai/DeepSeek-R1-0528
-      backend: vLLM
-      backend_parameters:
-        - --max-model-len=32768
-- name: DeepSeek-OCR
-  description: DeepSeek-OCR is an advanced optical character recognition (OCR) model developed by DeepSeek AI. It is designed to accurately extract text from images and scanned documents.
-  home: https://www.deepseek.com
-  icon: /static/catalog_icons/deepseek.png
-  size: 3
-  categories:
-    - llm
-  licenses:
-    - mit
-  release_date: "2025-10-20"
-  specs:
-    - mode: standard
-      quantization: "BF16"
-      gpu_filters:
-        vendor: 
-          - nvidia
-          - amd
-      source: huggingface
-      huggingface_repo_id: deepseek-ai/DeepSeek-OCR
-      backend: vLLM
-      backend_version: 0.11.2
-      backend_parameters:
-        - --logits_processors=vllm.model_executor.models.deepseek_ocr:NGramPerReqLogitsProcessor
-        - --no-enable-prefix-caching
-        - --mm-processor-cache-gb=0
-- name: PaddleOCR-VL-1.5
-  description: PaddleOCR-VL-1.5 is an advanced optical character recognition (OCR) vision-language model developed by PaddlePaddle. It is designed to accurately extract and understand text from images and documents.
-  home: https://www.paddleocr.com
-  icon: /static/catalog_icons/paddlepaddle.jpeg
-  size: 0.9
-  categories:
-    - llm
-  capabilities:
-    - vision
-  licenses:
-    - apache-2.0
-  release_date: "2026-01-29"
-  specs:
-    - mode: standard
-      quantization: "BF16"
-      source: huggingface
-      huggingface_repo_id: PaddlePaddle/PaddleOCR-VL-1.5
-      backend: vLLM
-      backend_parameters:
-        - --trust-remote-code
-        - --max-num-batched-tokens=16384
-        - --no-enable-prefix-caching
-        - --mm-processor-cache-gb=0
-- name: LightOnOCR-2-1B
-  description: LightOnOCR-2-1B is an efficient end-to-end vision-language model for optical character recognition (OCR), converting documents (PDFs, scans, images) into clean, naturally ordered text. It achieves state-of-the-art performance on OlmOCR-Bench while being significantly faster and more cost-effective than competitors.
-  home: https://www.lighton.ai
-  icon: /static/catalog_icons/lighton.png
-  size: 1
-  categories:
-    - llm
-  capabilities:
-    - vision
-  licenses:
-    - apache-2.0
-  release_date: "2026-01-19"
-  specs:
-    - mode: standard
-      quantization: "BF16"
-      source: huggingface
-      huggingface_repo_id: lightonai/LightOnOCR-2-1B
-      backend: vLLM
-      backend_parameters:
-        - '--limit-mm-per-prompt={"image": 1}'
-        - --mm-processor-cache-gb=0
-        - --no-enable-prefix-caching
-- name: Deepseek-V3.2
-  description: 'DeepSeek-V3.2 is a model that balances computational efficiency with strong reasoning and agent capabilities through three technical innovations: DeepSeek Sparse Attention (DSA), Scalable Reinforcement Learning Framework, Large-Scale Agentic Task Synthesis Pipeline.'
-  home: https://www.deepseek.com
-  icon: /static/catalog_icons/deepseek.png
-  categories:
-    - llm
-  capabilities:
-    - context/128K
-  size: 685
-  licenses:
-    - mit
-  release_date: "2025-12-01"
-  specs:
-    - mode: throughput
-      quantization: FP8
-      gpu_filters:
-        vendor: nvidia
-        compute_capability: ">=9.0" # Hopper or later
-      source: huggingface
-      huggingface_repo_id: deepseek-ai/DeepSeek-V3.2
-      backend: SGLang
-      backend_version: 0.5.6.post2
-      backend_parameters:
-        - --enable-dp-attention
-        - --context-length=65536
-        - --reasoning-parser=deepseek-v3
-        - --tool-call-parser=deepseek_v32
-        - --chat-template={data_dir}/chat_templates/tool_chat_template_deepseekv32.jinja
-    - mode: standard
-      quantization: FP8
-      source: huggingface
-      huggingface_repo_id: deepseek-ai/DeepSeek-V3.2
-      backend: vLLM
-      backend_version: 0.13.0
-      backend_parameters:
-        - --max-model-len=65536
-        - --tokenizer-mode=deepseek_v32
-        - --reasoning-parser=deepseek_v3
-        - --tool-call-parser=deepseek_v32
-        - --enable-auto-tool-choice
-- name: Deepseek-V3.2-Speciale
-  description: This model is the high-compute variant of DeepSeek-V3.2, surpasses GPT-5 and matches Gemini-3.0-Pro in reasoning, achieving gold-medal level performance in the 2025 IMO and IOI competitions. 
-  home: https://www.deepseek.com
-  icon: /static/catalog_icons/deepseek.png
-  categories:
-    - llm
-  capabilities:
-    - context/128K
-  size: 685
-  licenses:
-    - mit
-  release_date: "2025-12-01"
-  specs:
-    - mode: throughput
-      quantization: FP8
-      gpu_filters:
-        vendor: nvidia
-        compute_capability: ">=9.0" # Hopper or later
-      source: huggingface
-      huggingface_repo_id: deepseek-ai/DeepSeek-V3.2-Speciale
-      backend: SGLang
-      backend_version: 0.5.6.post2
-      backend_parameters:
-        - --enable-dp-attention
-        - --context-length=65536
-        - --reasoning-parser=deepseek-v3
-    - mode: standard
-      quantization: FP8
-      source: huggingface
-      huggingface_repo_id: deepseek-ai/DeepSeek-V3.2-Speciale
-      backend: vLLM
-      backend_version: 0.13.0
-      backend_parameters:
-        - --max-model-len=65536
-        - --tokenizer-mode=deepseek_v32
-        - --reasoning-parser=deepseek_v3
-- name: MiniMax-M2.1
-  description: MiniMax-M2.1 is a high-performance agentic model, optimized for robustness in coding, tool use, instruction following, and long-horizon planning. It excels in multilingual software development and complex multi-step workflows.
-  home: https://www.minimax.io
-  icon: /static/catalog_icons/minimax.png
-  size: 230
-  activated_size: 10
-  categories:
-    - llm
-  capabilities:
-    - context/192K
-    - tools
-  licenses:
-    - modified-mit
-  release_date: "2025-12-23"
-  specs:
-    - mode: standard
-      quantization: FP8
-      source: huggingface
-      huggingface_repo_id: MiniMaxAI/MiniMax-M2.1
-      backend: vLLM
-      backend_parameters:
-        - --max-model-len=65536
-        - --reasoning-parser=minimax_m2_append_think
-        - --tool-call-parser=minimax_m2
-        - --enable-auto-tool-choice
-        - --trust-remote-code
-- name: MiniMax-M2.5
-  description: MiniMax-M2.5 is a powerful MoE (Mixture-of-Experts) model that delivers exceptional performance in logical reasoning, coding, and complex agent tasks through highly efficient inference.
-  home: https://www.minimax.io/
-  icon: /static/catalog_icons/minimax.png
-  size: 230
-  activated_size: 10
-  categories:
-    - llm
-  capabilities:
-    - context/196K
-    - reasoning
-    - tools
-  licenses:
-    - modified-mit
-  release_date: "2026-02-12"
-  specs:
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: MiniMaxAI/MiniMax-M2.5
-      backend: vLLM
-      backend_parameters:
-        - --max-model-len=65536
-        - --reasoning-parser=minimax_m2_append_think
-        - --tool-call-parser=minimax_m2
-        - --enable-auto-tool-choice
-        - --trust-remote-code
-        - --enable-expert-parallel
-- name: Kimi-K2.5
-  description: Kimi-K2.5 is a multimodal mixture-of-experts model with 1T total parameters and 32B activated parameters. It features native INT4 quantization, vision support, dual operating modes (thinking/instant), agent swarm capabilities, and excels at visual reasoning, coding with vision, and complex tool orchestration.
-  home: https://www.moonshot.ai
-  icon: /static/catalog_icons/kimi.png
-  size: 1
-  size_unit: T
-  activated_size: 32
-  categories:
-    - llm
-  capabilities:
-    - context/256K
-    - vision
-    - tools
-  licenses:
-    - modified-mit
-  release_date: "2026-01-26"
-  specs:
-    - mode: standard
-      quantization: INT4
-      source: huggingface
-      huggingface_repo_id: moonshotai/Kimi-K2.5
-      backend: vLLM
-      backend_parameters:
-        - --max-model-len=65536
-        - --mm-encoder-tp-mode=data
-        - --tool-call-parser=kimi_k2
-        - --reasoning-parser=kimi_k2
-        - --trust-remote-code
-- name: Step-3.5-Flash
-  description: Step-3.5-Flash is a fast, cost-effective multimodal model with 196B total parameters and 11B active parameters (MoE), optimized for quick inference. Built on StepFun's Step3 architecture, it delivers strong performance across text and vision tasks with efficient token usage.
-  home: https://www.stepfun.com
-  icon: /static/catalog_icons/stepfun.png
-  size: 196
-  activated_size: 11
-  categories:
-    - llm
-  capabilities:
-    - context/256K
-    - tools
-  licenses:
-    - apache-2.0
-  release_date: "2026-02-02"
-  specs:
-    - mode: throughput
-      quantization: FP8
-      source: huggingface
-      huggingface_repo_id: stepfun-ai/Step-3.5-Flash-FP8
-      backend: vLLM
-      backend_parameters:
-        - --max-model-len=65536
-        - --disable-cascade-attn
-        - --reasoning-parser=step3p5
-        - --enable-auto-tool-choice
-        - --tool-call-parser=step3p5
-        - --trust-remote-code
-        - --quantization=fp8
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: stepfun-ai/Step-3.5-Flash
-      backend: vLLM
-      backend_parameters:
-        - --max-model-len=65536
-        - --disable-cascade-attn
-        - --reasoning-parser=step3p5
-        - --enable-auto-tool-choice
-        - --tool-call-parser=step3p5
-        - --trust-remote-code
-- name: Nanbeige4.1-3B
-  description: Nanbeige4.1-3B is a 3B-parameter language model from Nanbeige LLM Lab, optimized for long-context reasoning, agentic tasks, and tool use.
-  home: https://huggingface.co/Nanbeige
-  icon: /static/catalog_icons/nanbeige.png
-  size: 3
-  categories:
-    - llm
-  capabilities:
-    - context/256K
-    - reasoning
-    - tools
-  licenses:
-    - apache-2.0
-  release_date: "2026-02-13"
-  specs:
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Nanbeige/Nanbeige4.1-3B
-      backend: vLLM
-      backend_parameters:
-        - --max-model-len=32768
-# Embedding models
-- name: Qwen3-Embedding-0.6B
-  description: Qwen3-Embedding is a multilingual embedding model series optimized for retrieval, clustering, classification, and bitext mining. It supports 100+ languages, with flexible vector dimensions and instruction tuning.
-  home: https://qwenlm.github.io
-  icon: /static/catalog_icons/qwen.png
-  size: 0.6
-  categories:
-    - embedding
-  capabilities:
-    - dimensions/4096
-    - max_tokens/32K
-  licenses:
-    - apache-2.0
-  release_date: "2025-06-09"
-  specs:
-    - mode: standard
-      quantization: "BF16"
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-Embedding-0.6B
-      categories:
-        - embedding
-      backend: vLLM
-- name: Qwen3-Embedding-4B
-  description: Qwen3-Embedding is a multilingual embedding model series optimized for retrieval, clustering, classification, and bitext mining. It supports 100+ languages, with flexible vector dimensions and instruction tuning.
-  home: https://qwenlm.github.io
-  icon: /static/catalog_icons/qwen.png
-  size: 4
-  categories:
-    - embedding
-  capabilities:
-    - dimensions/4096
-    - max_tokens/32K
-  licenses:
-    - apache-2.0
-  release_date: "2025-06-09"
-  specs:
-    - mode: standard
-      quantization: "BF16"
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-Embedding-4B
-      categories:
-        - embedding
-      backend: vLLM
-- name: Qwen3-Embedding-8B
-  description: Qwen3-Embedding is a multilingual embedding model series optimized for retrieval, clustering, classification, and bitext mining. It supports 100+ languages, with flexible vector dimensions and instruction tuning.
-  home: https://qwenlm.github.io
-  icon: /static/catalog_icons/qwen.png
-  size: 8
-  categories:
-    - embedding
-  capabilities:
-    - dimensions/4096
-    - max_tokens/32K
-  licenses:
-    - apache-2.0
-  release_date: "2025-06-09"
-  specs:
-    - mode: standard
-      quantization: "BF16"
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-Embedding-8B
-      categories:
-        - embedding
-      backend: vLLM
-- name: Qwen3-VL-Embedding-2B
-  description: Qwen3-VL-Embedding is a multimodal embedding model series optimized for multimodal retrieval, clustering, and classification. It supports image-text retrieval and unified multimodal representation learning with 30+ languages support.
-  home: https://qwenlm.github.io
-  icon: /static/catalog_icons/qwen.png
-  size: 2
-  categories:
-    - embedding
-  capabilities:
-    - vision
-    - dimensions/2048
-    - max_tokens/32K
-  licenses:
-    - apache-2.0
-  release_date: "2026-01-08"
-  specs:
-    - mode: standard
-      quantization: "BF16"
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-VL-Embedding-2B
-      categories:
-        - embedding
-      backend: vLLM
-      backend_parameters:
-        - --runner=pooling
-- name: Qwen3-VL-Embedding-8B
-  description: Qwen3-VL-Embedding is a multimodal embedding model series optimized for multimodal retrieval, clustering, and classification. It supports image-text retrieval and unified multimodal representation learning with 30+ languages support.
-  home: https://qwenlm.github.io
-  icon: /static/catalog_icons/qwen.png
-  size: 8
-  categories:
-    - embedding
-  capabilities:
-    - vision
-    - dimensions/4096
-    - max_tokens/32K
-  licenses:
-    - apache-2.0
-  release_date: "2026-01-08"
-  specs:
-    - mode: standard
-      quantization: "BF16"
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-VL-Embedding-8B
-      categories:
-        - embedding
-      backend: vLLM
-      backend_parameters:
-        - --runner=pooling
-- name: BGE-M3
-  description: BGE-M3 is a new model from BAAI distinguished for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.
-  home: https://bge-model.com
-  icon: /static/catalog_icons/bge_logo.jpeg
-  categories:
-    - embedding
-  capabilities:
-    - dimensions/1024
-    - max_tokens/8192
-  size: 567
-  size_unit: M
-  licenses:
-    - mit
-  release_date: "2024-01-28"
-  specs:
-    - mode: standard
-      quantization: "BF16"
-      source: huggingface
-      huggingface_repo_id: BAAI/bge-m3
-      categories:
-        - embedding
-      backend: vLLM
-- name: BGE-Large-ZH-V1.5
-  description: BGE is short for BAAI general embedding. This is a Chinese text embedding model with more reasonable similarity distribution.
-  home: https://bge-model.com
-  icon: /static/catalog_icons/bge_logo.jpeg
-  categories:
-    - embedding
-  capabilities:
-    - dimensions/1024
-    - max_tokens/512
-  size: 335
-  size_unit: M
-  licenses:
-    - mit
-  release_date: "2023-09-12"
-  specs:
-    - mode: standard
-      quantization: "BF16"
-      source: huggingface
-      huggingface_repo_id: BAAI/bge-large-zh-v1.5
-      categories:
-        - embedding
-      backend: vLLM
-- name: BGE-Large-EN-V1.5
-  description: BGE is short for BAAI general embedding. This is an English text embedding model with more reasonable similarity distribution.
-  home: https://bge-model.com
-  icon: /static/catalog_icons/bge_logo.jpeg
-  categories:
-    - embedding
-  capabilities:
-    - dimensions/1024
-    - max_tokens/512
-  size: 335
-  size_unit: M
-  licenses:
-    - mit
-  release_date: "2023-09-12"
-  specs:
-    - mode: standard
-      quantization: "BF16"
-      source: huggingface
-      huggingface_repo_id: BAAI/bge-large-en-v1.5
-      categories:
-        - embedding
-      backend: vLLM
-- name: Nomic-Embed-Text-V1.5
-  description: Nomic-embed-text is a large context length text encoder that surpasses OpenAI text-embedding-ada-002 and text-embedding-3-small performance on short and long context tasks.
-  home: https://nomic.ai
-  icon: /static/catalog_icons/nomic.png
-  categories:
-    - embedding
-  capabilities:
-    - dimensions/768
-    - max_tokens/8192
-  size: 137
-  size_unit: M
-  licenses:
-    - apache-2.0
-  release_date: "2024-02-14"
-  specs:
-    - mode: standard
-      quantization: "BF16"
-      source: huggingface
-      huggingface_repo_id: nomic-ai/nomic-embed-text-v1.5
-      categories:
-        - embedding
-      backend: vLLM
-      backend_parameters:
-        - --trust-remote-code
-- name: Jina-Embeddings-V3
-  description: jina-embeddings-v3 is a multilingual multi-task text embedding model designed for a variety of NLP applications. Based on the Jina-XLM-RoBERTa architecture, this model supports Rotary Position Embeddings to handle long input sequences up to 8192 tokens.
-  home: https://jina.ai
-  icon: /static/catalog_icons/jina.png
-  categories:
-    - embedding
-  capabilities:
-    - dimensions/1024
-    - max_tokens/8192
-  size: 570
-  size_unit: M
-  licenses:
-    - cc-by-nc-4.0
-  release_date: "2024-09-18"
-  specs:
-    - mode: standard
-      quantization: "BF16"
-      source: huggingface
-      huggingface_repo_id: jinaai/jina-embeddings-v3
-      categories:
-        - embedding
-      backend: vLLM
-      backend_parameters:
-        - --trust-remote-code
-# Reranker models
-- name: Qwen3-Reranker-0.6B
-  description: Qwen3-Reranker is a multilingual text reranking model series optimized for retrieval, clustering, classification, and bitext mining. It supports 100+ languages, with flexible vector dimensions and instruction tuning.
-  home: https://qwenlm.github.io
-  icon: /static/catalog_icons/qwen.png
-  size: 0.6
-  categories:
-    - reranker
-  capabilities:
-    - max_tokens/32K
-  licenses:
-    - apache-2.0
-  release_date: "2025-06-09"
-  specs:
-    - mode: standard
-      quantization: "BF16"
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-Reranker-0.6B
-      categories:
-        - reranker
-      backend: vLLM
-      backend_parameters:
-        - '--hf_overrides={"architectures":["Qwen3ForSequenceClassification"],"classifier_from_token":["no","yes"],"is_original_qwen3_reranker":true}'
-- name: Qwen3-Reranker-4B
-  description: Qwen3-Reranker is a multilingual text reranking model series optimized for retrieval, clustering, classification, and bitext mining. It supports 100+ languages, with flexible vector dimensions and instruction tuning.
-  home: https://qwenlm.github.io
-  icon: /static/catalog_icons/qwen.png
-  size: 4
-  categories:
-    - reranker
-  capabilities:
-    - max_tokens/32K
-  licenses:
-    - apache-2.0
-  release_date: "2025-06-09"
-  specs:
-    - mode: standard
-      quantization: "BF16"
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-Reranker-4B
-      categories:
-        - reranker
-      env:
-        GPUSTACK_APPLY_QWEN3_RERANKER_TEMPLATES: "true"
-      backend: vLLM
-      backend_parameters:
-        - '--hf_overrides={"architectures":["Qwen3ForSequenceClassification"],"classifier_from_token":["no","yes"],"is_original_qwen3_reranker":true}'
-- name: Qwen3-Reranker-8B
-  description: Qwen3-Reranker is a multilingual text reranking model series optimized for retrieval, clustering, classification, and bitext mining. It supports 100+ languages, with flexible vector dimensions and instruction tuning.
-  home: https://qwenlm.github.io
-  icon: /static/catalog_icons/qwen.png
-  size: 8
-  categories:
-    - reranker
-  capabilities:
-    - max_tokens/32K
-  licenses:
-    - apache-2.0
-  release_date: "2025-06-09"
-  specs:
-    - mode: standard
-      quantization: "BF16"
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-Reranker-8B
-      categories:
-        - reranker
-      backend: vLLM
-      backend_parameters:
-        - '--hf_overrides={"architectures":["Qwen3ForSequenceClassification"],"classifier_from_token":["no","yes"],"is_original_qwen3_reranker":true}'
-- name: Qwen3-VL-Reranker-2B
-  description: Qwen3-VL-Reranker is a multimodal text reranking model series optimized for multimodal retrieval, clustering, classification, and bitext mining. It consistently outperforms the base embedding model and baseline rerankers.
-  home: https://qwenlm.github.io
-  icon: /static/catalog_icons/qwen.png
-  size: 2
-  categories:
-    - reranker
-  capabilities:
-    - vision
-    - max_tokens/32K
-  licenses:
-    - apache-2.0
-  release_date: "2026-01-08"
-  specs:
-    - mode: standard
-      quantization: "BF16"
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-VL-Reranker-2B
-      categories:
-        - reranker
-      backend: vLLM
-      backend_parameters:
-        - '--hf_overrides={"architectures":["Qwen3VLForSequenceClassification"],"classifier_from_token":["no","yes"],"is_original_qwen3_reranker":true}'
-- name: Qwen3-VL-Reranker-8B
-  description: Qwen3-VL-Reranker is a multimodal text reranking model series optimized for multimodal retrieval, clustering, classification, and bitext mining. It consistently outperforms the base embedding model and baseline rerankers, with the 8B model showing particularly strong results.
-  home: https://qwenlm.github.io
-  icon: /static/catalog_icons/qwen.png
-  size: 8
-  categories:
-    - reranker
-  capabilities:
-    - vision
-    - max_tokens/32K
-  licenses:
-    - apache-2.0
-  release_date: "2026-01-08"
-  specs:
-    - mode: standard
-      quantization: "BF16"
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-VL-Reranker-8B
-      categories:
-        - reranker
-      backend: vLLM
-      backend_parameters:
-        - '--hf_overrides={"architectures":["Qwen3VLForSequenceClassification"],"classifier_from_token":["no","yes"],"is_original_qwen3_reranker":true}'
-- name: BGE-Reranker-V2-M3
-  description: BGE-Reranker-V2-M3 is a reranker model from BAAI.
-  home: https://bge-model.com
-  icon: /static/catalog_icons/bge_logo.jpeg
-  categories:
-    - reranker
-  size: 568
-  size_unit: M
-  licenses:
-    - apache-2.0
-  release_date: "2024-03-19"
-  specs:
-    - mode: standard
-      quantization: "BF16"
-      source: huggingface
-      huggingface_repo_id: BAAI/bge-reranker-v2-m3
-      categories:
-        - reranker
-      backend: vLLM
-- name: Jina-Reranker-M0
-  description: Jina-Reranker-M0 is a multilingual multimodal document reranker model with 2.4B parameters. It accepts a query alongside visually rich documents and outputs ranked documents by relevance. Supports 29 languages and multimodal content including text, figures, tables, and infographics.
-  home: https://jina.ai
-  icon: /static/catalog_icons/jina.png
-  size: 2.4
-  categories:
-    - reranker
-  capabilities:
-    - max_tokens/10K
-    - vision
-  licenses:
-    - cc-by-nc-4.0
-  release_date: "2025-04-08"
-  specs:
-    - mode: standard
-      quantization: "BF16"
-      source: huggingface
-      huggingface_repo_id: jinaai/jina-reranker-m0
-      backend: vLLM
-# Image models
-- name: FLUX.1-dev
-  description: FLUX.1 [dev] is a 12 billion parameter rectified flow transformer capable of generating images from text descriptions.
-  home: https://blackforestlabs.ai
-  icon: /static/catalog_icons/blackforestlabs.png
-  size: 12
-  categories:
-    - image
-  licenses:
-    - flux-1-dev-non-commercial-license
-  release_date: "2024-08-02"
-  specs:
-    - mode: standard
-      quantization: "BF16"
-      gpu_filters:
-        vendor: nvidia
-      source: huggingface
-      huggingface_repo_id: black-forest-labs/FLUX.1-dev
-      backend: SGLang
-      backend_version: 0.5.6.post2
-      env:
-        GPUSTACK_MODEL_VRAM_CLAIM: "37580963840" # 35 GiB, observed empirically
-- name: FLUX.2-klein-4B
-  description: FLUX.2-klein-4B is a 4 billion parameter image generation model from Black Forest Labs.
-  home: https://blackforestlabs.ai
-  icon: /static/catalog_icons/blackforestlabs.png
-  size: 4
-  categories:
-    - image
-  licenses:
-    - apache-2.0
-  release_date: "2026-01-15"
-  .base_spec: &flux_2_klein_4b_base_spec
-    mode: standard
-    quantization: "BF16"
-    source: huggingface
-    huggingface_repo_id: black-forest-labs/FLUX.2-klein-4B
-    backend: vLLM
-    backend_parameters:
-      - --omni
-  specs:
-    - <<: *flux_2_klein_4b_base_spec
-      gpu_filters:
-        vendor: ascend
-      backend_version: *vllm_omni_ascend_stable_version
-    - <<: *flux_2_klein_4b_base_spec
-      backend_version: *vllm_omni_stable_version
-- name: FLUX.2-klein-9B
-  description: FLUX.2-klein-9B is a 9 billion parameter image generation model from Black Forest Labs.
-  home: https://blackforestlabs.ai
-  icon: /static/catalog_icons/blackforestlabs.png
-  size: 9
-  categories:
-    - image
-  licenses:
-    - apache-2.0
-  release_date: "2026-01-15"
-  .base_spec: &flux_2_klein_9b_base_spec
-    mode: standard
-    quantization: "BF16"
-    source: huggingface
-    huggingface_repo_id: black-forest-labs/FLUX.2-klein-9B
-    backend: vLLM
-    backend_parameters:
-      - --omni
-  specs:
-    - <<: *flux_2_klein_9b_base_spec
-      gpu_filters:
-        vendor: ascend
-      backend_version: *vllm_omni_ascend_stable_version
-    - <<: *flux_2_klein_9b_base_spec
-      backend_version: *vllm_omni_stable_version
-- name: Qwen-Image
-  description: Qwen-Image is an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing.
-  home: https://qwen.ai
-  icon: /static/catalog_icons/qwen.png
-  size: 20
-  categories:
-    - image
-  licenses:
-    - apache-2.0
-  release_date: "2025-08-04"
-  .base_spec: &qwen_image_base_spec
-    mode: standard
-    quantization: "BF16"
-    source: huggingface
-    huggingface_repo_id: Qwen/Qwen-Image
-    backend: vLLM
-    backend_parameters:
-      - --omni
-  specs:
-    - <<: *qwen_image_base_spec
-      gpu_filters:
-        vendor: ascend
-      backend_version: *vllm_omni_ascend_stable_version
-    - <<: *qwen_image_base_spec
-      backend_version: *vllm_omni_stable_version
-- name: Qwen-Image-Edit
-  description: Built upon the 20B Qwen-Image model, Qwen-Image-Edit successfully extends Qwen-Image's unique text rendering capabilities to image editing tasks, enabling precise text editing.
-  home: https://qwen.ai
-  icon: /static/catalog_icons/qwen.png
-  size: 20
-  categories:
-    - image
-  licenses:
-    - apache-2.0
-  release_date: "2025-08-19"
-  specs:
-    - mode: standard
-      quantization: "BF16"
-      gpu_filters:
-        vendor: nvidia
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen-Image-Edit
-      backend: SGLang
-      backend_version: 0.5.6.post2
-- name: Qwen-Image-2512
-  description: Qwen-Image-2512 is the December update of Qwen-Image's text-to-image foundational model, delivering enhanced image generation capabilities.
-  home: https://qwen.ai
-  icon: /static/catalog_icons/qwen.png
-  size: 20
-  categories:
-    - image
-  licenses:
-    - apache-2.0
-  release_date: "2025-12-30"
-  .base_spec: &qwen_image_2512_base_spec
-    mode: standard
-    quantization: "BF16"
-    source: huggingface
-    huggingface_repo_id: Qwen/Qwen-Image-2512
-    backend: vLLM
-    backend_parameters:
-      - --omni
-  specs:
-    - <<: *qwen_image_2512_base_spec
-      gpu_filters:
-        vendor: ascend
-      backend_version: *vllm_omni_ascend_stable_version
-    - <<: *qwen_image_2512_base_spec
-      backend_version: *vllm_omni_stable_version
-- name: Z-Image
-  description: Z-Image is the foundation model of the Z-Image family, engineered for good quality, robust generative diversity, broad stylistic coverage, and precise prompt adherence.
-  home: https://qwen.ai
-  icon: /static/catalog_icons/qwen.png
-  size: 6
-  categories:
-    - image
-  licenses:
-    - apache-2.0
-  release_date: "2026-01-28"
-  .base_spec: &z_image_base_spec
-    mode: standard
-    quantization: "BF16"
-    source: huggingface
-    huggingface_repo_id: Tongyi-MAI/Z-Image
-    backend: vLLM
-    backend_parameters:
-      - --omni
-  specs:
-    - <<: *z_image_base_spec
-      gpu_filters:
-        vendor: ascend
-      backend_version: *vllm_omni_ascend_stable_version
-    - <<: *z_image_base_spec
-      backend_version: *vllm_omni_stable_version
-- name: Z-Image-Turbo
-  description: Z-Image is a powerful and highly efficient image generation model with 6B parameters.
-  home: https://qwen.ai
-  icon: /static/catalog_icons/qwen.png
-  size: 6
-  categories:
-    - image
-  licenses:
-    - apache-2.0
-  release_date: "2025-11-27"
-  .base_spec: &z_image_turbo_base_spec
-    mode: standard
-    quantization: "BF16"
-    source: huggingface
-    huggingface_repo_id: Tongyi-MAI/Z-Image-Turbo
-    backend: vLLM
-    backend_parameters:
-      - --omni
-    env:
-      GPUSTACK_MODEL_VRAM_CLAIM: "24696061952" # 23 GiB observed. Weight file size is 33 GiB in F32 while vLLM loads in BF16.
-  specs:
-    - <<: *z_image_turbo_base_spec
-      gpu_filters:
-        vendor: ascend
-      backend_version: *vllm_omni_ascend_stable_version
-    - <<: *z_image_turbo_base_spec
-      backend_version: *vllm_omni_stable_version
-- name: Qwen3-VL-8B-Instruct
-  description: Qwen3-VL-8B-Instruct is a vision-language model that delivers comprehensive upgrades across text understanding, visual perception, and reasoning capabilities, supporting image/video/text unified understanding.
-  home: https://qwen.ai
-  icon: /static/catalog_icons/qwen.png
-  size: 8
-  categories:
-    - llm
-  capabilities:
-    - context/1M
-    - vision
-  licenses:
-    - apache-2.0
-  release_date: "2025-10-15"
-  specs:
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-VL-8B-Instruct
-      backend: vLLM
-      backend_parameters:
-        - --max-model-len=65536
-- name: Qwen3-VL-8B-Thinking
-  description: Qwen3-VL-8B-Thinking is a vision-language model that delivers comprehensive upgrades across text understanding, visual perception, and reasoning capabilities, supporting image/video/text unified understanding with thinking mode.
-  home: https://qwen.ai
-  icon: /static/catalog_icons/qwen.png
-  size: 8
-  categories:
-    - llm
-  capabilities:
-    - context/1M
-    - vision
-  licenses:
-    - apache-2.0
-  release_date: "2025-10-15"
-  specs:
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-VL-8B-Thinking
-      backend: vLLM
-      backend_parameters:
-        - --max-model-len=65536
-- name: Qwen3-VL-32B-Instruct
-  description: Qwen3-VL-32B-Instruct is a vision-language model featuring superior visual intelligence, enhanced spatial awareness capabilities, and OCR functionality.
-  home: https://qwen.ai
-  icon: /static/catalog_icons/qwen.png
-  size: 32
-  categories:
-    - llm
-  capabilities:
-    - context/1M
-    - vision
-  licenses:
-    - apache-2.0
-  release_date: "2025-10-21"
-  specs:
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-VL-32B-Instruct
-      backend: vLLM
-      backend_parameters:
-        - --max-model-len=65536
-- name: Qwen3-VL-32B-Thinking
-  description: Qwen3-VL-32B-Thinking is a vision-language model featuring superior visual intelligence, enhanced spatial awareness capabilities, and OCR functionality with thinking mode.
-  home: https://qwen.ai
-  icon: /static/catalog_icons/qwen.png
-  size: 32
-  categories:
-    - llm
-  capabilities:
-    - context/1M
-    - vision
-  licenses:
-    - apache-2.0
-  release_date: "2025-10-21"
-  specs:
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-VL-32B-Thinking
-      backend: vLLM
-      backend_parameters:
-        - --max-model-len=65536
-- name: Qwen3-VL-30B-A3B-Instruct
-  description: Qwen3-VL-30B-A3B-Instruct is a mixture-of-experts vision-language model with 30B total parameters and 3B active parameters, featuring advanced spatial perception, 2D and 3D grounding.
-  home: https://qwen.ai
-  icon: /static/catalog_icons/qwen.png
-  size: 30
-  activated_size: 3
-  categories:
-    - llm
-  capabilities:
-    - context/1M
-    - vision
-  licenses:
-    - apache-2.0
-  release_date: "2025-10-05"
-  specs:
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-VL-30B-A3B-Instruct
-      backend: vLLM
-      backend_parameters:
-        - --max-model-len=65536
-- name: Qwen3-VL-30B-A3B-Thinking
-  description: Qwen3-VL-30B-A3B-Thinking is a mixture-of-experts vision-language model with 30B total parameters and 3B active parameters, featuring advanced spatial perception, 2D and 3D grounding with thinking mode.
-  home: https://qwen.ai
-  icon: /static/catalog_icons/qwen.png
-  size: 30
-  activated_size: 3
-  categories:
-    - llm
-  capabilities:
-    - context/1M
-    - vision
-  licenses:
-    - apache-2.0
-  release_date: "2025-10-05"
-  specs:
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-VL-30B-A3B-Thinking
-      backend: vLLM
-      backend_parameters:
-        - --max-model-len=65536
-- name: Qwen3-VL-235B-A22B-Instruct
-  description: Qwen3-VL-235B-A22B-Instruct is the largest vision-language model in the Qwen3-VL series with 235B total parameters and 22B active parameters, featuring state-of-the-art visual understanding and reasoning capabilities.
-  home: https://qwen.ai
-  icon: /static/catalog_icons/qwen.png
-  size: 235
-  activated_size: 22
-  categories:
-    - llm
-  capabilities:
-    - context/1M
-    - vision
-  licenses:
-    - apache-2.0
-  release_date: "2025-09-23"
-  specs:
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-VL-235B-A22B-Instruct
-      backend: vLLM
-      backend_parameters:
-        - --max-model-len=65536
-- name: Qwen3-VL-235B-A22B-Thinking
-  description: Qwen3-VL-235B-A22B-Thinking is the largest vision-language model in the Qwen3-VL series with 235B total parameters and 22B active parameters, featuring state-of-the-art visual understanding and reasoning capabilities with thinking mode.
-  home: https://qwen.ai
-  icon: /static/catalog_icons/qwen.png
-  size: 235
-  activated_size: 22
-  categories:
-    - llm
-  capabilities:
-    - context/1M
-    - vision
-  licenses:
-    - apache-2.0
-  release_date: "2025-09-23"
-  specs:
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-VL-235B-A22B-Thinking
-      backend: vLLM
-      backend_parameters:
-        - --max-model-len=65536
-# Audio models
-- name: CosyVoice2-0.5B
-  description: CosyVoice2-0.5B is a speech generation model. It supports multilingual speech synthesis with high naturalness and expressiveness.
-  home: https://github.com/FunAudioLLM
-  icon: /static/catalog_icons/FunAudioLLM.png
-  size: 0.5
-  categories:
-    - text_to_speech
-  licenses:
-    - apache-2.0
-  release_date: "2024-12-01"
-  specs:
-    - mode: standard
-      quantization: FP16
-      source: huggingface
-      huggingface_repo_id: gpustack/CosyVoice2-0.5B
-      backend: VoxBox
-      env:
-        GPUSTACK_MODEL_VRAM_CLAIM: "3221225472" # 3 GiB, CosyVoice family empirical estimate.
-- name: CosyVoice-300M
-  description: CosyVoice is a multi-lingual large voice generation model developed by Alibaba.
-  home: https://github.com/FunAudioLLM
-  icon: /static/catalog_icons/FunAudioLLM.png
-  size: 300
-  size_unit: M
-  categories:
-    - text_to_speech
-  licenses:
-    - apache-2.0
-  release_date: "2024-07-05"
-  specs:
-    - mode: standard
-      quantization: FP16
-      source: huggingface
-      huggingface_repo_id: gpustack/CosyVoice-300M
-      backend: VoxBox
-      env:
-        GPUSTACK_MODEL_VRAM_CLAIM: "3221225472" # 3 GiB, CosyVoice family empirical estimate.
-- name: CosyVoice-300M-SFT
-  description: CosyVoice is a multi-lingual large voice generation model developed by Alibaba.
-  home: https://github.com/FunAudioLLM
-  icon: /static/catalog_icons/FunAudioLLM.png
-  size: 300
-  size_unit: M
-  categories:
-    - text_to_speech
-  licenses:
-    - apache-2.0
-  release_date: "2024-07-05"
-  specs:
-    - mode: standard
-      quantization: FP16
-      source: huggingface
-      huggingface_repo_id: gpustack/CosyVoice-300M-SFT
-      backend: VoxBox
-      env:
-        GPUSTACK_MODEL_VRAM_CLAIM: "3221225472" # 3 GiB, CosyVoice family empirical estimate.
-- name: CosyVoice-300M-Instruct
-  description: CosyVoice is a multi-lingual large voice generation model developed by Alibaba.
-  home: https://github.com/FunAudioLLM
-  icon: /static/catalog_icons/FunAudioLLM.png
-  size: 300
-  size_unit: M
-  categories:
-    - text_to_speech
-  licenses:
-    - apache-2.0
-  release_date: "2024-07-05"
-  specs:
-    - mode: standard
-      quantization: FP16
-      source: huggingface
-      huggingface_repo_id: gpustack/CosyVoice-300M-Instruct
-      backend: VoxBox
-      env:
-        GPUSTACK_MODEL_VRAM_CLAIM: "3221225472" # 3 GiB, CosyVoice family empirical estimate.
-- name: Faster-Whisper-Large-V3
-  description: Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting. This is the conversion of openai/whisper-large-v3 to the CTranslate2 model format.
-  home: https://huggingface.co/Systran
-  icon: /static/catalog_icons/systran.png
-  size: 1.55
-  categories:
-    - speech_to_text
-  licenses:
-    - mit
-  release_date: "2023-11-23"
-  specs:
-    - mode: standard
-      quantization: FP16
-      source: huggingface
-      huggingface_repo_id: Systran/faster-whisper-large-v3
-      backend: VoxBox
-      env:
-        GPUSTACK_MODEL_VRAM_CLAIM: "10737418240" # 10 GiB, per OpenAI Whisper large reference VRAM.
-- name: Faster-Whisper-Medium
-  description: Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning. This is the conversion of openai/whisper-medium to the CTranslate2 model format.
-  home: https://huggingface.co/Systran
-  icon: /static/catalog_icons/systran.png
-  size: 769
-  size_unit: M
-  categories:
-    - speech_to_text
-  licenses:
-    - mit
-  release_date: "2023-03-23"
-  specs:
-    - mode: standard
-      quantization: FP16
-      source: huggingface
-      huggingface_repo_id: Systran/faster-whisper-medium
-      backend: VoxBox
-      env:
-        GPUSTACK_MODEL_VRAM_CLAIM: "5368709120" # 5 GiB, per OpenAI Whisper medium reference VRAM.
-- name: Faster-Whisper-Small
-  description: Whisper is a pre-trained model for automatic speech recognition (ASR) and speech translation. Trained on 680k hours of labelled data, Whisper models demonstrate a strong ability to generalise to many datasets and domains without the need for fine-tuning. This is the conversion of openai/whisper-small to the CTranslate2 model format.
-  home: https://huggingface.co/Systran
-  icon: /static/catalog_icons/systran.png
-  size: 244
-  size_unit: M
-  categories:
-    - speech_to_text
-  licenses:
-    - mit
-  release_date: "2023-03-23"
-  specs:
-    - mode: standard
-      quantization: FP16
-      source: huggingface
-      huggingface_repo_id: Systran/faster-whisper-small
-      backend: VoxBox
-      env:
-        GPUSTACK_MODEL_VRAM_CLAIM: "2147483648" # 2 GiB, per OpenAI Whisper small reference VRAM.
-- name: Whisper-Large-V3-Turbo
-  description: Whisper large-v3-turbo is a finetuned version of a pruned Whisper large-v3. In other words, it's the exact same model, except that the number of decoding layers have reduced from 32 to 4. As a result, the model is way faster, at the expense of a minor quality degradation.
-  home: https://openai.com
-  icon: /static/catalog_icons/openai.png
-  size: 809
-  size_unit: M
-  categories:
-    - speech_to_text
-  licenses:
-    - mit
-  release_date: "2024-10-01"
-  specs:
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: openai/whisper-large-v3-turbo
-      backend: vLLM
-- name: Whisper-Large-V3
-  description: Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation. Trained on 5M hours of labeled data, Whisper large-v3 demonstrates strong ability to generalise to many datasets and domains in a zero-shot setting.
-  home: https://openai.com
-  icon: /static/catalog_icons/openai.png
-  size: 1.55
-  categories:
-    - speech_to_text
-  licenses:
-    - mit
-  release_date: "2023-11-06"
-  specs:
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: openai/whisper-large-v3
-      backend: vLLM
-      env:
-        GPUSTACK_MODEL_VRAM_CLAIM: "4294967296" # 4 GiB. The repo stores weight files in multiple formats so explicitly set VRAM claim to avoid over-allocation.
-- name: Voxtral-Mini-3B-2507
-  description: Voxtral-Mini-3B-2507 is a speech-to-text model from Mistral AI, designed for automatic speech recognition with high accuracy and efficiency.
-  home: https://mistral.ai
-  icon: /static/catalog_icons/mistral.png
-  size: 3
-  categories:
-    - speech_to_text
-  licenses:
-    - apache-2.0
-  release_date: "2025-07-18"
-  specs:
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: mistralai/Voxtral-Mini-3B-2507
-      backend: vLLM
-- name: Granite-Speech-3.3-2B
-  description: Granite-Speech-3.3-2B is a speech-to-text model from IBM, part of the Granite series, designed for automatic speech recognition with strong multilingual capabilities.
-  home: https://www.ibm.com
-  icon: /static/catalog_icons/ibm.png
-  size: 2
-  categories:
-    - speech_to_text
-  licenses:
-    - apache-2.0
-  release_date: "2025-06-19"
-  specs:
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: ibm-granite/granite-speech-3.3-2b
-      backend: vLLM
-- name: Granite-Speech-3.3-8B
-  description: Granite-Speech-3.3-8B is a speech-to-text model from IBM, part of the Granite series, designed for automatic speech recognition with enhanced accuracy and multilingual support.
-  home: https://www.ibm.com
-  icon: /static/catalog_icons/ibm.png
-  size: 8
-  categories:
-    - speech_to_text
-  licenses:
-    - apache-2.0
-  release_date: "2025-06-19"
-  specs:
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: ibm-granite/granite-speech-3.3-8b
-      backend: vLLM
-- name: Qwen3-ASR-1.7B
-  description: Qwen3-ASR-1.7B support language identification and ASR for 52 languages and dialects. It leverages large-scale speech training data and the strong audio understanding capability of its foundation model, Qwen3-Omni.
-  home: https://qwen.ai
-  icon: /static/catalog_icons/qwen.png
-  size: 1.7
-  categories:
-    - speech_to_text
-  licenses:
-    - apache-2.0
-  release_date: "2026-01-29"
-  specs:
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-ASR-1.7B
-      backend: vLLM
-      categories:
-        - speech_to_text
-- name: Qwen3-ASR-0.6B
-  description: Qwen3-ASR-0.6B support language identification and ASR for 52 languages and dialects. It leverages large-scale speech training data and the strong audio understanding capability of its foundation model, Qwen3-Omni.
-  home: https://qwen.ai
-  icon: /static/catalog_icons/qwen.png
-  size: 0.6
-  categories:
-    - speech_to_text
-  licenses:
-    - apache-2.0
-  release_date: "2026-01-29"
-  specs:
-    - mode: standard
-      quantization: BF16
-      source: huggingface
-      huggingface_repo_id: Qwen/Qwen3-ASR-0.6B
-      backend: vLLM
-      categories:
-        - speech_to_text
-- name: Dia-1.6B
-  description: Dia is a text-to-speech model created by Nari Labs. Dia directly generates highly realistic dialogue from a transcript. You can condition the output on audio, enabling emotion and tone control. The model can also produce nonverbal communications like laughter, coughing, clearing throat, etc.
-  home: https://narilabs.org
-  icon: /static/catalog_icons/narilabs.png
-  size: 1.6
-  categories:
-    - text_to_speech
-  licenses:
-    - apache-2.0
-  release_date: "2025-04-21"
-  specs:
-    - mode: standard
-      quantization: FP32
-      source: huggingface
-      huggingface_repo_id: nari-labs/Dia-1.6B
-      backend: VoxBox
-      env:
-        GPUSTACK_MODEL_VRAM_CLAIM: "10737418240" # 10 GiB, Dia model empirical estimate.
-- name: Qwen3-TTS-12Hz-1.7B-Base
-  description: Qwen3-TTS-12Hz-1.7B-Base is a text-to-speech model from the Qwen3-TTS series with 1.7B parameters, supporting 12kHz audio generation.
-  home: https://qwen.ai
-  icon: /static/catalog_icons/qwen.png
-  size: 1.7
-  categories:
-    - text_to_speech
-  licenses:
-    - apache-2.0
-  release_date: "2026-01-22"
-  .base_spec: &qwen3_tts_12hz_1_7b_base_base_spec
-    mode: standard
-    quantization: "BF16"
-    source: huggingface
-    huggingface_repo_id: Qwen/Qwen3-TTS-12Hz-1.7B-Base
-    backend: vLLM
-    backend_parameters:
-      - --omni
-  specs:
-    - <<: *qwen3_tts_12hz_1_7b_base_base_spec
-      gpu_filters:
-        vendor: ascend
-      backend_version: *vllm_omni_ascend_stable_version
-    - <<: *qwen3_tts_12hz_1_7b_base_base_spec
-      backend_version: *vllm_omni_stable_version
-- name: Qwen3-TTS-12Hz-1.7B-CustomVoice
-  description: Qwen3-TTS-12Hz-1.7B-CustomVoice is a text-to-speech model from the Qwen3-TTS series with 1.7B parameters, supporting custom voice cloning and 12kHz audio generation.
-  home: https://qwen.ai
-  icon: /static/catalog_icons/qwen.png
-  size: 1.7
-  categories:
-    - text_to_speech
-  licenses:
-    - apache-2.0
-  release_date: "2026-01-22"
-  .base_spec: &qwen3_tts_12hz_1_7b_customvoice_base_spec
-    mode: standard
-    quantization: "BF16"
-    source: huggingface
-    huggingface_repo_id: Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
-    backend: vLLM
-    backend_parameters:
-      - --omni
-  specs:
-    - <<: *qwen3_tts_12hz_1_7b_customvoice_base_spec
-      gpu_filters:
-        vendor: ascend
-      backend_version: *vllm_omni_ascend_stable_version
-    - <<: *qwen3_tts_12hz_1_7b_customvoice_base_spec
-      backend_version: *vllm_omni_stable_version
-- name: Qwen3-TTS-12Hz-1.7B-VoiceDesign
-  description: Qwen3-TTS-12Hz-1.7B-VoiceDesign is a text-to-speech model from the Qwen3-TTS series with 1.7B parameters, supporting voice design and 12kHz audio generation.
-  home: https://qwen.ai
-  icon: /static/catalog_icons/qwen.png
-  size: 1.7
-  categories:
-    - text_to_speech
-  licenses:
-    - apache-2.0
-  release_date: "2026-01-22"
-  .base_spec: &qwen3_tts_12hz_1_7b_voicedesign_base_spec
-    mode: standard
-    quantization: "BF16"
-    source: huggingface
-    huggingface_repo_id: Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
-    backend: vLLM
-    backend_parameters:
-      - --omni
-  specs:
-    - <<: *qwen3_tts_12hz_1_7b_voicedesign_base_spec
-      gpu_filters:
-        vendor: ascend
-      backend_version: *vllm_omni_ascend_stable_version
-    - <<: *qwen3_tts_12hz_1_7b_voicedesign_base_spec
-      backend_version: *vllm_omni_stable_version
-- name: Qwen3-TTS-12Hz-0.6B-Base
-  description: Qwen3-TTS-12Hz-0.6B-Base is a text-to-speech model from the Qwen3-TTS series with 0.6B parameters, supporting 12kHz audio generation.
-  home: https://qwen.ai
-  icon: /static/catalog_icons/qwen.png
-  size: 0.6
-  categories:
-    - text_to_speech
-  licenses:
-    - apache-2.0
-  release_date: "2026-01-22"
-  .base_spec: &qwen3_tts_12hz_0_6b_base_base_spec
-    mode: standard
-    quantization: "BF16"
-    source: huggingface
-    huggingface_repo_id: Qwen/Qwen3-TTS-12Hz-0.6B-Base
-    backend: vLLM
-    backend_parameters:
-      - --omni
-  specs:
-    - <<: *qwen3_tts_12hz_0_6b_base_base_spec
-      gpu_filters:
-        vendor: ascend
-      backend_version: *vllm_omni_ascend_stable_version
-    - <<: *qwen3_tts_12hz_0_6b_base_base_spec
-      backend_version: *vllm_omni_stable_version
-- name: Qwen3-TTS-12Hz-0.6B-CustomVoice
-  description: Qwen3-TTS-12Hz-0.6B-CustomVoice is a text-to-speech model from the Qwen3-TTS series with 0.6B parameters, supporting custom voice cloning and 12kHz audio generation.
-  home: https://qwen.ai
-  icon: /static/catalog_icons/qwen.png
-  size: 0.6
-  categories:
-    - text_to_speech
-  licenses:
-    - apache-2.0
-  release_date: "2026-01-22"
-  .base_spec: &qwen3_tts_12hz_0_6b_customvoice_base_spec
-    mode: standard
-    quantization: "BF16"
-    source: huggingface
-    huggingface_repo_id: Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice
-    backend: vLLM
-    backend_parameters:
-      - --omni
-  specs:
-    - <<: *qwen3_tts_12hz_0_6b_customvoice_base_spec
-      gpu_filters:
-        vendor: ascend
-      backend_version: *vllm_omni_ascend_stable_version
-    - <<: *qwen3_tts_12hz_0_6b_customvoice_base_spec
-      backend_version: *vllm_omni_stable_version
-- name: SenseVoice-Small
-  description: SenseVoice is a speech foundation model with multiple speech understanding capabilities, including automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and audio event detection (AED).
-  home: https://github.com/FunAudioLLM
-  icon: /static/catalog_icons/FunAudioLLM.png
-  categories:
-    - speech_to_text
-  licenses:
-    - apache-2.0
-  release_date: "2024-07-31"
-  specs:
-    - mode: standard
-      quantization: FP16
-      source: huggingface
-      huggingface_repo_id: FunAudioLLM/SenseVoiceSmall
-      backend: VoxBox
-      env:
-        GPUSTACK_MODEL_VRAM_CLAIM: "12884901888" # 12 GiB, it depends on the audio length. This value works for ~10 minutes audio input.

+ 1 - 1
gpustack/policies/candidate_selectors/ascend_mindie_resource_fit_selector.py

@@ -275,7 +275,7 @@ class AscendMindIEResourceFitSelector(ScheduleCandidatesSelector):
             elif attention_type == ModelAttentionTypeEnum.MLA:
                 d_head = (
                     self._model_params.qk_nope_head_dim
-                    + self._model_params.qk_nope_head_dim
+                    + self._model_params.qk_rope_head_dim
                 )
 
             vram_kv_cache = n_heads * d_head * n_layers * n_tokens * ct_size

+ 1 - 1
gpustack/policies/candidate_selectors/base_candidate_selector.py

@@ -513,7 +513,7 @@ class ScheduleCandidatesSelector(ABC):
             w = worker_set.get(wn)
             wa = self.get_worker_allocatable_resource(w, gpu_type)
             used_vram_claim_total += (
-                sum(v for k, v in claim.vram.items() if claim.vram is not None) or 0
+                sum(v for k, v in claim.vram.items()) if claim.vram else 0
             )
             used_ram_claim_total += claim.ram or 0
 

+ 1 - 1
gpustack/policies/candidate_selectors/gguf_resource_fit_selector.py

@@ -2501,7 +2501,7 @@ def _get_max_offload_layers_candidates(
     candidates: List[ModelInstanceScheduleCandidate],
 ) -> List[ModelInstanceScheduleCandidate]:
     if not candidates:
-        return 0
+        return []
 
     max_offload_layers = _get_max_offload_layers(candidates)
     return _filter_candidates_by_max_offload_layers(candidates, max_offload_layers)

+ 14 - 1
gpustack/policies/candidate_selectors/sglang_resource_fit_selector.py

@@ -457,6 +457,10 @@ class SGLangResourceFitSelector(ScheduleCandidatesSelector):
         for _, gpu in enumerate(worker.status.gpu_devices):
             gpu_index = gpu.index
             allocatable_vram = allocatable.vram.get(gpu_index, 0)
+
+            if gpu.memory is None or gpu.memory.total == 0:
+                continue
+
             allocatable_gpu_utilization = allocatable_vram / gpu.memory.total
 
             if allocatable_vram > largest_single_gpu_vram:
@@ -502,6 +506,7 @@ class SGLangResourceFitSelector(ScheduleCandidatesSelector):
                         vram_utilization=self._mem_fraction_static_by_gpu_type.get(
                             gpu_type
                         ),
+                        estimated_vram=self._vram_claim,
                     ),
                 )
             )
@@ -637,6 +642,7 @@ class SGLangResourceFitSelector(ScheduleCandidatesSelector):
                             vram_utilization=self._mem_fraction_static_by_gpu_type.get(
                                 gpu_type
                             ),
+                            estimated_vram=self._vram_claim,
                         ),
                     )
                 ]
@@ -728,6 +734,9 @@ class SGLangResourceFitSelector(ScheduleCandidatesSelector):
                 if not first_cnt:
                     continue
 
+                if not self._is_tp_size_divisible(first_cnt):
+                    continue
+
                 workers_candidates = self.auto_select_multi_worker_multi_gpu_candidates(
                     workers_of_type, gpu_type
                 )
@@ -806,6 +815,7 @@ class SGLangResourceFitSelector(ScheduleCandidatesSelector):
                             self._model,
                             selected_workers,
                             self._mem_fraction_static_by_gpu_type.get(gpu_type),
+                            estimated_vram=self._vram_claim,
                         )
                     ]
             if vram_sum > largest_vram:
@@ -846,6 +856,7 @@ def _create_candidate(
     model: Model,
     selected_workers: List[Worker],
     mem_fraction_static: float = 0.9,
+    estimated_vram: Optional[int] = None,
 ) -> ModelInstanceScheduleCandidate:
     """Create a candidate with SGLang-specific parameters and primary node confirmation."""
     if not selected_workers:
@@ -888,6 +899,7 @@ def _create_candidate(
                     vram=vram_claim,
                     ram=get_computed_ram_claim(model, vram_claim),
                     vram_utilization=mem_fraction_static,
+                    estimated_vram=estimated_vram,
                 ),
             )
             subordinate_workers.append(subordinate_worker)
@@ -895,6 +907,7 @@ def _create_candidate(
     computed_resource_claim = ComputedResourceClaim(
         vram=primary_vram_claim,
         ram=get_computed_ram_claim(model, primary_vram_claim),
+        estimated_vram=estimated_vram,
     )
 
     return ModelInstanceScheduleCandidate(
@@ -1000,7 +1013,7 @@ class MemFractionStaticCalculator:
             workers: List of workers used to determine GPU memory characteristics, the input workers should only contain same type GPUs.
         """
         if find_parameter(self._model.backend_parameters, ["mem-fraction-static"]):
-            return
+            return self._param_mem_fraction_static
 
         is_npu = self._is_npu(workers)
         gpu_mem_bytes = self._get_min_gpu_sum(workers)

+ 6 - 0
gpustack/policies/candidate_selectors/vllm_resource_fit_selector.py

@@ -433,6 +433,7 @@ class VLLMResourceFitSelector(ScheduleCandidatesSelector):
                     computed_resource_claim=ComputedResourceClaim(
                         vram=vram_claim,
                         ram=get_computed_ram_claim(self._model, vram_claim),
+                        estimated_vram=self._vram_claim,
                     ),
                 )
             )
@@ -563,6 +564,7 @@ class VLLMResourceFitSelector(ScheduleCandidatesSelector):
                     computed_resource_claim=ComputedResourceClaim(
                         vram=vram_claim,
                         ram=get_computed_ram_claim(self._model, vram_claim),
+                        estimated_vram=self._vram_claim,
                     ),
                 )
             ]
@@ -682,6 +684,7 @@ class VLLMResourceFitSelector(ScheduleCandidatesSelector):
                             self._model,
                             selected_workers,
                             self._gpu_memory_utilization,
+                            estimated_vram=self._vram_claim,
                         )
                     ]
             if vram_sum > largest_vram:
@@ -735,6 +738,7 @@ def _create_candidate(
     model: Model,
     selected_workers: List[Worker],
     gpu_memory_utilization: float = 0.9,
+    estimated_vram: Optional[int] = None,
 ) -> ModelInstanceScheduleCandidate:
     """
     Create a candidate with all GPUs from the selected workers.
@@ -752,6 +756,7 @@ def _create_candidate(
         computed_resource_claim=ComputedResourceClaim(
             vram=vram_claim_main,
             ram=get_computed_ram_claim(model, vram_claim_main),
+            estimated_vram=estimated_vram,
         ),
     )
     candidate.subordinate_workers = []
@@ -772,6 +777,7 @@ def _create_candidate(
                 computed_resource_claim=ComputedResourceClaim(
                     vram=vram_claim_subworker,
                     ram=get_computed_ram_claim(model, vram_claim_subworker),
+                    estimated_vram=estimated_vram,
                 ),
             )
         )

+ 22 - 6
gpustack/policies/utils.py

@@ -48,6 +48,8 @@ def get_worker_allocatable_resource(  # noqa: C901
     """
 
     def update_allocated_vram(allocated, resource_claim):
+        if not resource_claim.vram:
+            return
         for gpu_index, vram in resource_claim.vram.items():
             allocated.vram[gpu_index] = allocated.vram.get(gpu_index, 0) + vram
 
@@ -55,6 +57,14 @@ def get_worker_allocatable_resource(  # noqa: C901
     model_instances = get_worker_model_instances(all_model_instances, worker)
     allocated = Allocated(ram=0, vram={})
 
+    # 保护 system_reserved 为 None 的情况
+    system_reserved = worker.system_reserved
+    if system_reserved is None:
+        from types import SimpleNamespace
+        system_reserved = SimpleNamespace(ram=0, vram=0)
+    system_reserved_ram = getattr(system_reserved, 'ram', 0) or 0
+    system_reserved_vram = getattr(system_reserved, 'vram', 0) or 0
+
     for model_instance in model_instances:
         # Handle resource allocation for main worker
         if model_instance.worker_id == worker.id and (
@@ -100,20 +110,20 @@ def get_worker_allocatable_resource(  # noqa: C901
                 (
                     gpu.memory.total
                     - allocated.vram.get(gpu_index, 0)
-                    - worker.system_reserved.vram
+                    - system_reserved_vram
                 ),
                 0,
             )
             allocatable.vram[gpu_index] = allocatable_vram
 
     allocatable.ram = max(
-        (worker.status.memory.total - allocated.ram - worker.system_reserved.ram), 0
+        (worker.status.memory.total - allocated.ram - system_reserved_ram), 0
     )
 
     if is_unified_memory:
         allocatable.ram = max(
             allocatable.ram
-            - worker.system_reserved.vram
+            - system_reserved_vram
             - sum(allocated.vram.values()),
             0,
         )
@@ -125,8 +135,8 @@ def get_worker_allocatable_resource(  # noqa: C901
 
     logger.debug(
         f"Worker {worker.name} gpu_type {gpu_type}, "
-        f"reserved memory: {worker.system_reserved.ram}, "
-        f"reserved gpu memory: {worker.system_reserved.vram}, "
+        f"reserved memory: {system_reserved_ram}, "
+        f"reserved gpu memory: {system_reserved_vram}, "
         f"allocatable memory: {allocatable.ram}, "
         f"allocatable gpu memory: {allocatable.vram}"
     )
@@ -396,7 +406,13 @@ async def estimate_diffusion_model_vram(
     except Exception as e:
         logger.warning(f"Cannot get weight size for model {model.name}: {e}")
 
-    return weight_size
+    # Diffusion models need extra VRAM for VAE decode/encode, text encoder forward pass,
+    # UNet/Transformer activations during denoising, and intermediate latent buffers.
+    # Use 1.2x activation overhead (similar to LLM) + 1 GiB framework overhead.
+    activation_overhead_factor = 1.2
+    framework_overhead = 1 * 1024**3
+
+    return int(weight_size * activation_overhead_factor) + framework_overhead
 
 
 def get_worker_model_instances(

+ 48 - 14
gpustack/routes/auth.py

@@ -1,6 +1,8 @@
+import base64
 import json
 import os
 from pathlib import Path
+import secrets
 import httpx
 import logging
 import jwt
@@ -24,6 +26,7 @@ from gpustack import envs
 from gpustack.api.auth import (
     SESSION_COOKIE_NAME,
     OIDC_ID_TOKEN_COOKIE_NAME,
+    OIDC_STATE_COOKIE_NAME,
     SSO_LOGIN_COOKIE_NAME,
     authenticate_user,
 )
@@ -34,7 +37,6 @@ from gpustack.server.services import (
 )
 from onelogin.saml2.auth import OneLogin_Saml2_Auth
 from fastapi.responses import RedirectResponse
-from lxml import etree
 from gpustack.utils.convert import safe_b64decode, inflate_data
 from urllib.parse import urlencode
 
@@ -196,26 +198,38 @@ async def saml_login(request: Request):
 async def saml_callback(request: Request, session: SessionDep):
     logger.debug("Invoke saml callback.")
     try:
+        auth = await init_saml_auth(request)
+
         if request.method == "GET":
+            # For HTTP-Redirect binding, decode the deflated SAML response
+            # and set it in the request data for process_response to handle.
             query = dict(request.query_params)
             SAMLResponse = query['SAMLResponse']
             decoded = safe_b64decode(SAMLResponse)
             xml_bytes = inflate_data(decoded)
+            auth._request_data['post_data']['SAMLResponse'] = base64.b64encode(
+                xml_bytes
+            ).decode()
         else:
+            # For HTTP-POST binding, SAMLResponse is base64-encoded XML.
             form_data = await request.form()
-            form_dict = dict(form_data)
-            SAMLResponse = form_dict.get('SAMLResponse')
-            xml_bytes = safe_b64decode(SAMLResponse)
-
-        root = etree.fromstring(xml_bytes)
-        name_id = root.find('.//{*}NameID').text
-        ns = {'saml': 'urn:oasis:names:tc:SAML:2.0:assertion'}
+            SAMLResponse = dict(form_data).get('SAMLResponse')
+            auth._request_data['post_data']['SAMLResponse'] = SAMLResponse
+
+        # Use the SAML library to validate signatures, conditions, and prevent XXE.
+        auth.process_response()
+        errors = auth.get_errors()
+        if errors:
+            error_reason = auth.get_last_error_reason() or "Unknown error"
+            raise Exception(f"SAML validation failed: {error_reason}")
+
+        # Extract validated attributes from the SAML response.
+        name_id = auth.get_nameid()
+        saml_attributes = auth.get_attributes()
         attributes = {}
         attributes['name_id'] = name_id
-        for attr in root.xpath('//saml:Attribute', namespaces=ns):
-            attr_name = attr.get('Name')
-            values = [v.text for v in attr.xpath('saml:AttributeValue', namespaces=ns)]
-            attributes[attr_name] = values[0] if len(values) == 1 else values
+        for key, values in saml_attributes.items():
+            attributes[key] = values[0] if len(values) == 1 else values
 
         config: Config = request.app.state.server_config
 
@@ -354,14 +368,24 @@ def get_saml_attributes(
 async def oidc_login(request: Request):
     config: Config = request.app.state.server_config
     authorization_endpoint = config.openid_configuration["authorization_endpoint"]
+    state = secrets.token_urlsafe(32)
     authUrl = (
         f'{authorization_endpoint}?response_type=code&'
         f'client_id={config.oidc_client_id}&'
         f'redirect_uri={config.oidc_redirect_uri}&'
-        f'scope=openid profile email&state=random_state_string'
+        f'scope=openid profile email&state={state}'
     )
 
-    return RedirectResponse(url=authUrl)
+    response = RedirectResponse(url=authUrl)
+    response.set_cookie(
+        key=OIDC_STATE_COOKIE_NAME,
+        value=state,
+        httponly=True,
+        secure=request.url.scheme == "https",
+        samesite="lax",
+        max_age=600,
+    )
+    return response
 
 
 @router.get("/oidc/callback")
@@ -369,6 +393,15 @@ async def oidc_callback(request: Request, session: SessionDep):
     logger.debug("Invoke oidc callback.")
     config: Config = request.app.state.server_config
     query = dict(request.query_params)
+
+    # Verify OIDC state to prevent CSRF attacks
+    callback_state = query.get('state')
+    cookie_state = request.cookies.get(OIDC_STATE_COOKIE_NAME)
+    if not callback_state or not cookie_state:
+        raise BadRequestException(message="Missing OIDC state parameter")
+    if not secrets.compare_digest(callback_state, cookie_state):
+        raise BadRequestException(message="OIDC state mismatch, possible CSRF attack")
+
     code = query['code']
     data = {
         "grant_type": "authorization_code",
@@ -487,6 +520,7 @@ async def oidc_callback(request: Request, session: SessionDep):
             )
     except Exception as e:
         logger.warning(f"Failed to set id_token cookie: {str(e)}")
+    response.delete_cookie(OIDC_STATE_COOKIE_NAME)
     return response
 
 

+ 15 - 12
gpustack/scheduler/calculator.py

@@ -1141,20 +1141,23 @@ async def calculate_gguf_model_resource_claim(
         latency = time.time() - start_time
 
         if offload == GPUOffloadEnum.Full:
-            logger.trace(
-                f"Finished running parser for full offload model instance {model.name}, latency: {latency:.2f}, "
-                f"{claim.estimate.items[0].to_log_string()}"
-            )
+            if claim.estimate.items:
+                logger.trace(
+                    f"Finished running parser for full offload model instance {model.name}, latency: {latency:.2f}, "
+                    f"{claim.estimate.items[0].to_log_string()}"
+                )
         elif offload == GPUOffloadEnum.Partial:
-            logger.trace(
-                f"Finished running parser for partial offloading model instance {model.name}, latency: {latency:.2f}, at least: "
-                f"{claim.estimate.items[1].to_log_string() if len(claim.estimate.items) > 1 else claim.estimate.items[0].to_log_string()}"
-            )
+            if claim.estimate.items:
+                logger.trace(
+                    f"Finished running parser for partial offloading model instance {model.name}, latency: {latency:.2f}, at least: "
+                    f"{claim.estimate.items[1].to_log_string() if len(claim.estimate.items) > 1 else claim.estimate.items[0].to_log_string()}"
+                )
         elif offload == GPUOffloadEnum.Disable:
-            logger.trace(
-                f"Finished running parser for disabled offloading model instance {model.name}, latency: {latency:.2f}, "
-                f"{claim.estimate.items[0].to_log_string()}"
-            )
+            if claim.estimate.items:
+                logger.trace(
+                    f"Finished running parser for disabled offloading model instance {model.name}, latency: {latency:.2f}, "
+                    f"{claim.estimate.items[0].to_log_string()}"
+                )
             clear_vram_claim(claim)
 
         return ModelResourceClaim(

+ 18 - 1
gpustack/scheduler/evaluator.py

@@ -259,6 +259,10 @@ def summarize_candidate_resource_claim(
 ) -> ResourceClaim:
     """
     Summarize the computed resource claim for a schedule candidate.
+
+    For VRAM, prefer `estimated_vram` (the model's actual estimated requirement)
+    over the allocated `vram` (total_gpu_memory * utilization_rate), so the
+    reported value reflects the model's actual need rather than the GPU allocation.
     """
     computed_resource_claims = [candidate.computed_resource_claim]
 
@@ -270,13 +274,26 @@ def summarize_candidate_resource_claim(
         )
 
     ram, vram = 0, 0
+    has_estimated_vram = False
     for computed_resource_claim in computed_resource_claims:
         ram += computed_resource_claim.ram or 0
-        if computed_resource_claim.vram:
+        if computed_resource_claim.estimated_vram is not None:
+            has_estimated_vram = True
+            vram += computed_resource_claim.estimated_vram
+        elif computed_resource_claim.vram:
             vram += sum(
                 v for v in computed_resource_claim.vram.values() if v is not None
             )
 
+    # If any claim has estimated_vram, use it for all (ignore allocated vram from
+    # claims that don't have estimated_vram to avoid mixing semantics).
+    if has_estimated_vram:
+        vram = sum(
+            c.estimated_vram
+            for c in computed_resource_claims
+            if c.estimated_vram is not None
+        )
+
     return ResourceClaim(ram=ram, vram=vram)
 
 

+ 4 - 0
gpustack/schemas/models.py

@@ -365,6 +365,10 @@ class ComputedResourceClaim(BaseModel):
     vram: Optional[Dict[int, int]] = Field(default=None)  # in bytes
     tensor_split: Optional[List[int]] = Field(default=None)
     vram_utilization: Optional[float] = Field(default=None)
+    # estimated_vram is the model's actual estimated VRAM requirement in bytes,
+    # independent of the target GPU's total memory. This differs from `vram`
+    # which represents the allocated amount (total_gpu_memory * utilization_rate).
+    estimated_vram: Optional[int] = Field(default=None)  # in bytes
 
 
 class ModelInstanceSubordinateWorker(BaseModel):

+ 5 - 20
gpustack/server/catalog.py

@@ -118,23 +118,8 @@ def prepare_chat_templates(data_dir: str):
 
 
 def get_builtin_model_catalog_file() -> str:
-    huggingface_url = "https://huggingface.co"
-    modelscope_url = "https://modelscope.cn"
-
-    model_catalog_file_name = "model-catalog-modelscope.yaml"
-    if not can_access(modelscope_url) and can_access(huggingface_url):
-        model_catalog_file_name = "model-catalog.yaml"
-        logger.info(f"Cannot access {modelscope_url}, using HuggingFace model catalog.")
-
-    return str(pkg_resources.files("gpustack.assets").joinpath(model_catalog_file_name))
-
-
-def can_access(url: str) -> bool:
-    """
-    Check if the URL is accessible
-    """
-    try:
-        response = requests.get(url, timeout=3)
-        return response.status_code >= 200 and response.status_code < 300
-    except requests.RequestException:
-        return False
+    return str(
+        pkg_resources.files("gpustack.assets").joinpath(
+            "model-catalog-modelscope.yaml"
+        )
+    )