(APIServer pid=8) INFO 03-28 12:05:54 [utils.py:297]
(APIServer pid=8) INFO 03-28 12:05:54 [utils.py:297] █ █ █▄ ▄█
(APIServer pid=8) INFO 03-28 12:05:54 [utils.py:297] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.18.0
(APIServer pid=8) INFO 03-28 12:05:54 [utils.py:297] █▄█▀ █ █ █ █ model /model/Qwen3.5-122B-A10B
(APIServer pid=8) INFO 03-28 12:05:54 [utils.py:297] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=8) INFO 03-28 12:05:54 [utils.py:297]
(APIServer pid=8) INFO 03-28 12:05:54 [utils.py:233] non-default args: {'model_tag': '/model/Qwen3.5-122B-A10B', 'host': '0.0.0.0', 'port': 30000, 'api_key': ['lq123456'], 'model': '/model/Qwen3.5-122B-A10B', 'trust_remote_code': True, 'tensor_parallel_size': 2}
(APIServer pid=8) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=8) INFO 03-28 12:06:01 [model.py:533] Resolved architecture: Qwen3_5MoeForConditionalGeneration
(APIServer pid=8) INFO 03-28 12:06:01 [model.py:1582] Using max model len 262144
(APIServer pid=8) INFO 03-28 12:06:01 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=8) INFO 03-28 12:06:01 [config.py:212] Setting attention block size to 2096 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=8) INFO 03-28 12:06:01 [config.py:243] Padding mamba page size by 0.58% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=8) INFO 03-28 12:06:01 [vllm.py:754] Asynchronous scheduling is enabled.
(APIServer pid=8) :1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(APIServer pid=8) :1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(APIServer pid=8) INFO 03-28 12:06:02 [compilation.py:289] Enabled custom fusions: allreduce_rms
(EngineCore pid=413) INFO 03-28 12:06:14 [core.py:103] Initializing a V1 LLM engine (v0.18.0) with config: model='/model/Qwen3.5-122B-A10B', speculative_config=None, tokenizer='/model/Qwen3.5-122B-A10B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/model/Qwen3.5-122B-A10B, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': , 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': , 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': True}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': , 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=413) WARNING 03-28 12:06:14 [multiproc_executor.py:997] Reducing Torch parallelism from 88 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore pid=413) INFO 03-28 12:06:14 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=172.19.0.5 (local), world_size=2, local_world_size=2
(Worker pid=612) INFO 03-28 12:06:20 [parallel_state.py:1395] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:59153 backend=nccl
(Worker pid=613) INFO 03-28 12:06:20 [parallel_state.py:1395] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:59153 backend=nccl
(Worker pid=612) :1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(Worker pid=612) :1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(Worker pid=613) :1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(Worker pid=613) :1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(Worker pid=612) INFO 03-28 12:06:21 [pynccl.py:111] vLLM is using nccl==2.27.5
(Worker pid=612) INFO 03-28 12:06:22 [parallel_state.py:1717] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(Worker pid=613) INFO 03-28 12:06:22 [parallel_state.py:1717] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 1, EP rank 1, EPLB rank N/A
(Worker_TP0 pid=612) INFO 03-28 12:06:27 [gpu_model_runner.py:4481] Starting to load model /model/Qwen3.5-122B-A10B...
(Worker_TP1 pid=613) INFO 03-28 12:06:27 [cuda.py:373] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(Worker_TP1 pid=613) INFO 03-28 12:06:27 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(Worker_TP0 pid=612) INFO 03-28 12:06:27 [cuda.py:373] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(Worker_TP0 pid=612) INFO 03-28 12:06:27 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(Worker_TP1 pid=613) INFO 03-28 12:06:27 [qwen3_next.py:191] Using FlashInfer GDN prefill kernel
(Worker_TP1 pid=613) INFO 03-28 12:06:27 [qwen3_next.py:192] FlashInfer GDN prefill kernel is JIT-compiled; first run may take a while to compile. Set `--gdn-prefill-backend triton` to avoid JIT compile time.
(Worker_TP0 pid=612) INFO 03-28 12:06:27 [qwen3_next.py:191] Using FlashInfer GDN prefill kernel
(Worker_TP0 pid=612) INFO 03-28 12:06:27 [qwen3_next.py:192] FlashInfer GDN prefill kernel is JIT-compiled; first run may take a while to compile. Set `--gdn-prefill-backend triton` to avoid JIT compile time.
(Worker_TP0 pid=612) INFO 03-28 12:06:27 [unquantized.py:186] Using TRITON backend for Unquantized MoE
(Worker_TP0 pid=612) INFO 03-28 12:06:27 [cuda.py:317] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(Worker_TP0 pid=612) INFO 03-28 12:06:27 [flash_attn.py:598] Using FlashAttention version 3
(Worker_TP0 pid=612) Loading safetensors checkpoint shards: 0% Completed | 0/39 [00:00:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore pid=413) :1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(EngineCore pid=413) INFO 03-28 12:07:59 [compilation.py:289] Enabled custom fusions: allreduce_rms
(APIServer pid=8) INFO 03-28 12:07:59 [api_server.py:576] Supported tasks: ['generate']
(APIServer pid=8) WARNING 03-28 12:07:59 [model.py:1376] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=8) INFO 03-28 12:08:00 [hf.py:320] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=8) INFO 03-28 12:08:04 [base.py:216] Multi-modal warmup completed in 3.630s
(APIServer pid=8) INFO 03-28 12:08:04 [api_server.py:580] Starting vLLM server on http://0.0.0.0:30000
(APIServer pid=8) INFO 03-28 12:08:04 [launcher.py:37] Available routes are:
(APIServer pid=8) INFO 03-28 12:08:04 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=8) INFO 03-28 12:08:04 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=8) INFO 03-28 12:08:04 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=8) INFO 03-28 12:08:04 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=8) INFO 03-28 12:08:04 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=8) INFO 03-28 12:08:04 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=8) INFO 03-28 12:08:04 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=8) INFO 03-28 12:08:04 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=8) INFO 03-28 12:08:04 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=8) INFO 03-28 12:08:04 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=8) INFO 03-28 12:08:04 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=8) INFO 03-28 12:08:04 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=8) INFO 03-28 12:08:04 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=8) INFO 03-28 12:08:04 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=8) INFO 03-28 12:08:04 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=8) INFO 03-28 12:08:04 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=8) INFO 03-28 12:08:04 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=8) INFO 03-28 12:08:04 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=8) INFO 03-28 12:08:04 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=8) INFO 03-28 12:08:04 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=8) INFO 03-28 12:08:04 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=8) INFO 03-28 12:08:04 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=8) INFO 03-28 12:08:04 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=8) INFO 03-28 12:08:04 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=8) INFO 03-28 12:08:04 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=8) INFO 03-28 12:08:04 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=8) INFO: Started server process [8]
(APIServer pid=8) INFO: Waiting for application startup.
(APIServer pid=8) INFO: Application startup complete.
(APIServer pid=8) INFO: 172.19.0.1:47954 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=8) INFO 03-28 12:08:34 [loggers.py:259] Engine 000: Avg prompt throughput: 1.6 tokens/s, Avg generation throughput: 5.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=8) INFO 03-28 12:08:44 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%