qwen3_5-35b-server.log

(APIServer pid=8) INFO 03-28 12:05:54 [utils.py:297] [vLLM ASCII-art banner]
(APIServer pid=8) INFO 03-28 12:05:54 [utils.py:297] version 0.18.0
(APIServer pid=8) INFO 03-28 12:05:54 [utils.py:297] model /model/Qwen3.5-35B-A3B
(APIServer pid=8) INFO 03-28 12:05:54 [utils.py:233] non-default args: {'model_tag': '/model/Qwen3.5-35B-A3B', 'host': '0.0.0.0', 'port': 30000, 'api_key': ['lq123456'], 'model': '/model/Qwen3.5-35B-A3B', 'trust_remote_code': True}
(APIServer pid=8) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
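A launch command consistent with the non-default args above would look like the following (a reconstruction from the logged args, not taken verbatim from the deployment; the API key value is shown exactly as logged):

    vllm serve /model/Qwen3.5-35B-A3B \
        --host 0.0.0.0 \
        --port 30000 \
        --api-key lq123456 \
        --trust-remote-code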
(APIServer pid=8) INFO 03-28 12:06:01 [model.py:533] Resolved architecture: Qwen3_5MoeForConditionalGeneration
(APIServer pid=8) INFO 03-28 12:06:01 [model.py:1582] Using max model len 262144
(APIServer pid=8) INFO 03-28 12:06:01 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=8) INFO 03-28 12:06:02 [config.py:212] Setting attention block size to 1056 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=8) INFO 03-28 12:06:02 [config.py:243] Padding mamba page size by 0.76% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=8) INFO 03-28 12:06:02 [vllm.py:754] Asynchronous scheduling is enabled.
(EngineCore pid=412) INFO 03-28 12:06:14 [core.py:103] Initializing a V1 LLM engine (v0.18.0) with config: model='/model/Qwen3.5-35B-A3B', speculative_config=None, tokenizer='/model/Qwen3.5-35B-A3B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/model/Qwen3.5-35B-A3B, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=412) INFO 03-28 12:06:14 [parallel_state.py:1395] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.19.0.4:40683 backend=nccl
(EngineCore pid=412) INFO 03-28 12:06:14 [parallel_state.py:1717] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore pid=412) INFO 03-28 12:06:19 [gpu_model_runner.py:4481] Starting to load model /model/Qwen3.5-35B-A3B...
(EngineCore pid=412) INFO 03-28 12:06:20 [cuda.py:373] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=412) INFO 03-28 12:06:20 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=412) INFO 03-28 12:06:20 [qwen3_next.py:191] Using FlashInfer GDN prefill kernel
(EngineCore pid=412) INFO 03-28 12:06:20 [qwen3_next.py:192] FlashInfer GDN prefill kernel is JIT-compiled; first run may take a while to compile. Set `--gdn-prefill-backend triton` to avoid JIT compile time.
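As the log notes, the FlashInfer GDN prefill kernel is JIT-compiled on first use; relaunching with the Triton backend avoids that one-time compile cost (a sketch reusing the launch flags reconstructed above):

    vllm serve /model/Qwen3.5-35B-A3B --gdn-prefill-backend triton \
        --host 0.0.0.0 --port 30000 --api-key lq123456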
(EngineCore pid=412) INFO 03-28 12:06:20 [unquantized.py:186] Using TRITON backend for Unquantized MoE
(EngineCore pid=412) INFO 03-28 12:06:20 [cuda.py:317] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=412) INFO 03-28 12:06:20 [flash_attn.py:598] Using FlashAttention version 3
(EngineCore pid=412) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore pid=412) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(EngineCore pid=412) Loading safetensors checkpoint shards: 0% Completed | 0/14 [00:00<?, ?it/s]
(EngineCore pid=412) Loading safetensors checkpoint shards: 7% Completed | 1/14 [00:00<00:12, 1.08it/s]
(EngineCore pid=412) Loading safetensors checkpoint shards: 14% Completed | 2/14 [00:01<00:11, 1.04it/s]
(EngineCore pid=412) Loading safetensors checkpoint shards: 21% Completed | 3/14 [00:02<00:10, 1.03it/s]
(EngineCore pid=412) Loading safetensors checkpoint shards: 29% Completed | 4/14 [00:03<00:09, 1.03it/s]
(EngineCore pid=412) Loading safetensors checkpoint shards: 36% Completed | 5/14 [00:04<00:08, 1.02it/s]
(EngineCore pid=412) Loading safetensors checkpoint shards: 43% Completed | 6/14 [00:05<00:07, 1.02it/s]
(EngineCore pid=412) Loading safetensors checkpoint shards: 50% Completed | 7/14 [00:06<00:06, 1.02it/s]
(EngineCore pid=412) Loading safetensors checkpoint shards: 57% Completed | 8/14 [00:07<00:05, 1.02it/s]
(EngineCore pid=412) Loading safetensors checkpoint shards: 64% Completed | 9/14 [00:08<00:04, 1.04it/s]
(EngineCore pid=412) Loading safetensors checkpoint shards: 71% Completed | 10/14 [00:09<00:03, 1.02it/s]
(EngineCore pid=412) Loading safetensors checkpoint shards: 79% Completed | 11/14 [00:10<00:02, 1.02it/s]
(EngineCore pid=412) Loading safetensors checkpoint shards: 86% Completed | 12/14 [00:11<00:01, 1.01it/s]
(EngineCore pid=412) Loading safetensors checkpoint shards: 93% Completed | 13/14 [00:12<00:00, 1.03it/s]
(EngineCore pid=412) Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:12<00:00, 1.34it/s]
(EngineCore pid=412) Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:12<00:00, 1.09it/s]
(EngineCore pid=412)
(EngineCore pid=412) INFO 03-28 12:06:33 [default_loader.py:384] Loading weights took 12.91 seconds
(EngineCore pid=412) INFO 03-28 12:06:34 [gpu_model_runner.py:4566] Model loading took 65.53 GiB memory and 13.676205 seconds
(EngineCore pid=412) INFO 03-28 12:06:34 [gpu_model_runner.py:5488] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore pid=412) INFO 03-28 12:06:47 [backends.py:988] Using cache directory: /root/.cache/vllm/torch_compile_cache/3bdc5b50d8/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=412) INFO 03-28 12:06:47 [backends.py:1048] Dynamo bytecode transform time: 6.23 s
(EngineCore pid=412) INFO 03-28 12:06:49 [backends.py:371] Cache the graph of compile range (1, 8192) for later use
(EngineCore pid=412) INFO 03-28 12:07:04 [backends.py:387] Compiling a graph for compile range (1, 8192) takes 16.50 s
(EngineCore pid=412) INFO 03-28 12:07:05 [decorators.py:627] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/3e3bdc36d02dca2e3ee7c587a07e74de294ff2ad2f3b34ff2a5d9b2a151e831b/rank_0_0/model
(EngineCore pid=412) INFO 03-28 12:07:05 [monitor.py:48] torch.compile took 24.81 s in total
(EngineCore pid=412) WARNING 03-28 12:07:07 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=512,device_name=NVIDIA_H20-3e.json
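The missing-config warning means the fused-MoE Triton kernel falls back to a generic tile configuration for this expert shape on the H20. The vLLM source tree ships a tuning script that can generate the named JSON file; a sketch, under the assumption that benchmarks/kernels/benchmark_moe.py and its --tune flag match the installed version (check the script's --help):

    python benchmarks/kernels/benchmark_moe.py --model /model/Qwen3.5-35B-A3B --tune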
(EngineCore pid=412) INFO 03-28 12:07:08 [monitor.py:76] Initial profiling/warmup run took 2.72 s
(EngineCore pid=412) INFO 03-28 12:07:15 [kv_cache_utils.py:826] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
(EngineCore pid=412) INFO 03-28 12:07:15 [gpu_model_runner.py:5607] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=51 (largest=512)
(EngineCore pid=412) INFO 03-28 12:07:18 [gpu_model_runner.py:5686] Estimated CUDA graph memory: 0.55 GiB total
(EngineCore pid=412) INFO 03-28 12:07:18 [gpu_worker.py:456] Available KV cache memory: 55.87 GiB
(EngineCore pid=412) INFO 03-28 12:07:18 [gpu_worker.py:490] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.9000 to 0.9039 to maintain the same effective KV cache size.
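Following the hint above, opting in to the new CUDA graph memory profiling while keeping the same effective KV cache size would look like this (a sketch; remaining flags as in the original launch):

    VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 vllm serve /model/Qwen3.5-35B-A3B \
        --gpu-memory-utilization 0.9039 --host 0.0.0.0 --port 30000 --api-key lq123456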
(EngineCore pid=412) INFO 03-28 12:07:18 [kv_cache_utils.py:1316] GPU KV cache size: 731,808 tokens
(EngineCore pid=412) INFO 03-28 12:07:18 [kv_cache_utils.py:1321] Maximum concurrency for 262,144 tokens per request: 11.01x
(EngineCore pid=412) 2026-03-28 12:07:18,722 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore pid=412) 2026-03-28 12:07:18,763 - INFO - autotuner.py:268 - flashinfer.jit: [Autotuner]: Autotuning process ends
(EngineCore pid=412) Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 51/51 [00:08<00:00, 6.25it/s]
(EngineCore pid=412) Capturing CUDA graphs (decode, FULL): 100%|██████████| 51/51 [00:05<00:00, 9.97it/s]
(EngineCore pid=412) INFO 03-28 12:07:32 [gpu_model_runner.py:5746] Graph capturing finished in 14 secs, took 1.07 GiB
(EngineCore pid=412) INFO 03-28 12:07:32 [gpu_worker.py:617] CUDA graph pool memory: 1.07 GiB (actual), 0.55 GiB (estimated), difference: 0.53 GiB (49.0%).
(EngineCore pid=412) INFO 03-28 12:07:32 [core.py:281] init engine (profile, create kv cache, warmup model) took 58.68 seconds
(EngineCore pid=412) INFO 03-28 12:07:33 [vllm.py:754] Asynchronous scheduling is enabled.
(APIServer pid=8) INFO 03-28 12:07:33 [api_server.py:576] Supported tasks: ['generate']
(APIServer pid=8) WARNING 03-28 12:07:33 [model.py:1376] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'top_k': 20, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
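Per the warning above, top_k=20 and top_p=0.95 from the model's generation_config.json now act as the server-side sampling defaults. To keep vLLM's own defaults instead, relaunch as the log suggests (a sketch; other flags as before):

    vllm serve /model/Qwen3.5-35B-A3B --generation-config vllm \
        --host 0.0.0.0 --port 30000 --api-key lq123456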
(APIServer pid=8) INFO 03-28 12:07:34 [hf.py:320] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=8) INFO 03-28 12:07:38 [base.py:216] Multi-modal warmup completed in 3.646s
(APIServer pid=8) INFO 03-28 12:07:38 [api_server.py:580] Starting vLLM server on http://0.0.0.0:30000
(APIServer pid=8) INFO 03-28 12:07:38 [launcher.py:37] Available routes are:
(APIServer pid=8) INFO 03-28 12:07:38 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=8) INFO 03-28 12:07:38 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=8) INFO 03-28 12:07:38 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=8) INFO 03-28 12:07:38 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=8) INFO 03-28 12:07:38 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=8) INFO 03-28 12:07:38 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=8) INFO 03-28 12:07:38 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=8) INFO 03-28 12:07:38 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=8) INFO 03-28 12:07:38 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=8) INFO 03-28 12:07:38 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=8) INFO 03-28 12:07:38 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=8) INFO 03-28 12:07:38 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=8) INFO 03-28 12:07:38 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=8) INFO 03-28 12:07:38 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=8) INFO 03-28 12:07:38 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=8) INFO 03-28 12:07:38 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=8) INFO 03-28 12:07:38 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=8) INFO 03-28 12:07:38 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=8) INFO 03-28 12:07:38 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=8) INFO 03-28 12:07:38 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=8) INFO 03-28 12:07:38 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=8) INFO 03-28 12:07:38 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=8) INFO 03-28 12:07:38 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=8) INFO 03-28 12:07:38 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=8) INFO 03-28 12:07:38 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=8) INFO 03-28 12:07:38 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=8) INFO: Started server process [8]
(APIServer pid=8) INFO: Waiting for application startup.
(APIServer pid=8) INFO: Application startup complete.
(APIServer pid=8) INFO: 172.19.0.1:34948 - "POST /v1/chat/completions HTTP/1.1" 200 OK
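The 200 response above corresponds to an OpenAI-compatible chat completion; a request of this shape against this server would look like the following (a sketch; port, API key, and served model name are taken from the log):

    curl http://localhost:30000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -H "Authorization: Bearer lq123456" \
        -d '{"model": "/model/Qwen3.5-35B-A3B",
             "messages": [{"role": "user", "content": "Hello"}]}'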
(APIServer pid=8) INFO 03-28 12:08:28 [loggers.py:259] Engine 000: Avg prompt throughput: 1.6 tokens/s, Avg generation throughput: 5.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=8) INFO 03-28 12:08:38 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%