(APIServer pid=8) INFO 03-28 12:05:54 [utils.py:297]
(APIServer pid=8) INFO 03-28 12:05:54 [utils.py:297]        █     █     █▄   ▄█
(APIServer pid=8) INFO 03-28 12:05:54 [utils.py:297]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.18.0
(APIServer pid=8) INFO 03-28 12:05:54 [utils.py:297]   █▄█▀ █     █     █     █  model   /model/Qwen3-8B
(APIServer pid=8) INFO 03-28 12:05:54 [utils.py:297]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=8) INFO 03-28 12:05:54 [utils.py:297] 
(APIServer pid=8) INFO 03-28 12:05:54 [utils.py:233] non-default args: {'model_tag': '/model/Qwen3-8B', 'host': '0.0.0.0', 'port': 30000, 'api_key': ['lq123456'], 'model': '/model/Qwen3-8B', 'trust_remote_code': True, 'gpu_memory_utilization': 0.45}
(APIServer pid=8) INFO 03-28 12:06:00 [model.py:533] Resolved architecture: Qwen3ForCausalLM
(APIServer pid=8) INFO 03-28 12:06:00 [model.py:1582] Using max model len 40960
(APIServer pid=8) INFO 03-28 12:06:00 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=8) INFO 03-28 12:06:00 [vllm.py:754] Asynchronous scheduling is enabled.
(EngineCore pid=421) INFO 03-28 12:06:06 [core.py:103] Initializing a V1 LLM engine (v0.18.0) with config: model='/model/Qwen3-8B', speculative_config=None, tokenizer='/model/Qwen3-8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=40960, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/model/Qwen3-8B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 
'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=421) INFO 03-28 12:06:06 [parallel_state.py:1395] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.19.0.2:39923 backend=nccl
(EngineCore pid=421) INFO 03-28 12:06:06 [parallel_state.py:1717] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=421) INFO 03-28 12:06:07 [gpu_model_runner.py:4481] Starting to load model /model/Qwen3-8B...
(EngineCore pid=421) INFO 03-28 12:06:08 [cuda.py:317] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=421) INFO 03-28 12:06:08 [flash_attn.py:598] Using FlashAttention version 3
(EngineCore pid=421) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore pid=421) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(EngineCore pid=421) 
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
(EngineCore pid=421) 
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:01,  2.49it/s]
(EngineCore pid=421) 
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:00<00:01,  2.47it/s]
(EngineCore pid=421) 
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:01<00:00,  2.50it/s]
(EngineCore pid=421) 
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:01<00:00,  2.72it/s]
(EngineCore pid=421) 
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:01<00:00,  3.59it/s]
(EngineCore pid=421) 
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:01<00:00,  3.04it/s]
(EngineCore pid=421) 
(EngineCore pid=421) INFO 03-28 12:06:10 [default_loader.py:384] Loading weights took 1.64 seconds
(EngineCore pid=421) INFO 03-28 12:06:10 [gpu_model_runner.py:4566] Model loading took 15.27 GiB memory and 2.309113 seconds
(EngineCore pid=421) INFO 03-28 12:06:17 [backends.py:988] Using cache directory: /root/.cache/vllm/torch_compile_cache/fc4524cb9c/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=421) INFO 03-28 12:06:17 [backends.py:1048] Dynamo bytecode transform time: 6.40 s
(EngineCore pid=421) INFO 03-28 12:06:22 [backends.py:371] Cache the graph of compile range (1, 8192) for later use
(EngineCore pid=421) INFO 03-28 12:06:27 [backends.py:387] Compiling a graph for compile range (1, 8192) takes 10.23 s
(EngineCore pid=421) INFO 03-28 12:06:29 [decorators.py:627] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/424ae8e217a442f107c31fa8643d83be32bdfca74fd240179b6bca8b58db26ac/rank_0_0/model
(EngineCore pid=421) INFO 03-28 12:06:29 [monitor.py:48] torch.compile took 18.32 s in total
(EngineCore pid=421) INFO 03-28 12:06:30 [monitor.py:76] Initial profiling/warmup run took 1.25 s
(EngineCore pid=421) INFO 03-28 12:06:37 [kv_cache_utils.py:826] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
(EngineCore pid=421) INFO 03-28 12:06:37 [gpu_model_runner.py:5607] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=51 (largest=512)
(EngineCore pid=421) INFO 03-28 12:06:38 [gpu_model_runner.py:5686] Estimated CUDA graph memory: 0.54 GiB total
(EngineCore pid=421) INFO 03-28 12:06:38 [gpu_worker.py:456] Available KV cache memory: 45.26 GiB
(EngineCore pid=421) INFO 03-28 12:06:38 [gpu_worker.py:490] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.4500 to 0.4539 to maintain the same effective KV cache size.
(EngineCore pid=421) INFO 03-28 12:06:38 [kv_cache_utils.py:1316] GPU KV cache size: 329,536 tokens
(EngineCore pid=421) INFO 03-28 12:06:38 [kv_cache_utils.py:1321] Maximum concurrency for 40,960 tokens per request: 8.05x
(EngineCore pid=421) 2026-03-28 12:06:38,828 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore pid=421) 2026-03-28 12:06:38,839 - INFO - autotuner.py:268 - flashinfer.jit: [Autotuner]: Autotuning process ends
(EngineCore pid=421) 
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   0%|          | 0/51 [00:00<?, ?it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   4%|▍         | 2/51 [00:00<00:03, 13.00it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   8%|▊         | 4/51 [00:00<00:03, 12.93it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  12%|█▏        | 6/51 [00:00<00:03, 12.95it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  16%|█▌        | 8/51 [00:00<00:03, 13.36it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  20%|█▉        | 10/51 [00:00<00:02, 13.83it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  24%|██▎       | 12/51 [00:00<00:02, 14.22it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  27%|██▋       | 14/51 [00:01<00:02, 14.72it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  31%|███▏      | 16/51 [00:01<00:02, 15.23it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  37%|███▋      | 19/51 [00:01<00:01, 16.76it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  41%|████      | 21/51 [00:01<00:01, 17.45it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  45%|████▌     | 23/51 [00:01<00:01, 17.47it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  49%|████▉     | 25/51 [00:01<00:01, 17.84it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  55%|█████▍    | 28/51 [00:01<00:01, 18.74it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  61%|██████    | 31/51 [00:01<00:01, 19.33it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  67%|██████▋   | 34/51 [00:02<00:00, 20.38it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  73%|███████▎  | 37/51 [00:02<00:00, 21.34it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  78%|███████▊  | 40/51 [00:02<00:00, 22.13it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  84%|████████▍ | 43/51 [00:02<00:00, 22.93it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  90%|█████████ | 46/51 [00:02<00:00, 23.69it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  96%|█████████▌| 49/51 [00:02<00:00, 24.60it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 51/51 [00:02<00:00, 18.87it/s]
(EngineCore pid=421) 
Capturing CUDA graphs (decode, FULL):   0%|          | 0/51 [00:00<?, ?it/s]
Capturing CUDA graphs (decode, FULL):   4%|▍         | 2/51 [00:00<00:03, 12.55it/s]
Capturing CUDA graphs (decode, FULL):   8%|▊         | 4/51 [00:00<00:03, 12.93it/s]
Capturing CUDA graphs (decode, FULL):  12%|█▏        | 6/51 [00:00<00:03, 13.40it/s]
Capturing CUDA graphs (decode, FULL):  16%|█▌        | 8/51 [00:00<00:03, 13.77it/s]
Capturing CUDA graphs (decode, FULL):  20%|█▉        | 10/51 [00:00<00:02, 14.46it/s]
Capturing CUDA graphs (decode, FULL):  24%|██▎       | 12/51 [00:00<00:02, 15.17it/s]
Capturing CUDA graphs (decode, FULL):  27%|██▋       | 14/51 [00:00<00:02, 16.02it/s]
Capturing CUDA graphs (decode, FULL):  31%|███▏      | 16/51 [00:01<00:02, 16.96it/s]
Capturing CUDA graphs (decode, FULL):  37%|███▋      | 19/51 [00:01<00:01, 18.60it/s]
Capturing CUDA graphs (decode, FULL):  43%|████▎     | 22/51 [00:01<00:01, 19.88it/s]
Capturing CUDA graphs (decode, FULL):  49%|████▉     | 25/51 [00:01<00:01, 21.05it/s]
Capturing CUDA graphs (decode, FULL):  55%|█████▍    | 28/51 [00:01<00:01, 22.47it/s]
Capturing CUDA graphs (decode, FULL):  61%|██████    | 31/51 [00:01<00:00, 24.11it/s]
Capturing CUDA graphs (decode, FULL):  69%|██████▊   | 35/51 [00:01<00:00, 26.58it/s]
Capturing CUDA graphs (decode, FULL):  76%|███████▋  | 39/51 [00:01<00:00, 29.30it/s]
Capturing CUDA graphs (decode, FULL):  86%|████████▋ | 44/51 [00:02<00:00, 33.25it/s]
Capturing CUDA graphs (decode, FULL):  96%|█████████▌| 49/51 [00:02<00:00, 36.04it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 51/51 [00:02<00:00, 23.33it/s]
(EngineCore pid=421) INFO 03-28 12:06:44 [gpu_model_runner.py:5746] Graph capturing finished in 6 secs, took 0.67 GiB
(EngineCore pid=421) INFO 03-28 12:06:44 [gpu_worker.py:617] CUDA graph pool memory: 0.67 GiB (actual), 0.54 GiB (estimated), difference: 0.12 GiB (18.2%).
(EngineCore pid=421) INFO 03-28 12:06:45 [core.py:281] init engine (profile, create kv cache, warmup model) took 34.43 seconds
(EngineCore pid=421) INFO 03-28 12:06:45 [vllm.py:754] Asynchronous scheduling is enabled.
(APIServer pid=8) INFO 03-28 12:06:45 [api_server.py:576] Supported tasks: ['generate']
(APIServer pid=8) WARNING 03-28 12:06:46 [model.py:1376] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=8) INFO 03-28 12:06:46 [hf.py:320] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=8) INFO 03-28 12:06:46 [api_server.py:580] Starting vLLM server on http://0.0.0.0:30000
(APIServer pid=8) INFO 03-28 12:06:46 [launcher.py:37] Available routes are:
(APIServer pid=8) INFO 03-28 12:06:46 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=8) INFO 03-28 12:06:46 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=8) INFO 03-28 12:06:46 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=8) INFO 03-28 12:06:46 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=8) INFO 03-28 12:06:46 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=8) INFO 03-28 12:06:46 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=8) INFO 03-28 12:06:46 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=8) INFO 03-28 12:06:46 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=8) INFO 03-28 12:06:46 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=8) INFO 03-28 12:06:46 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=8) INFO 03-28 12:06:46 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=8) INFO 03-28 12:06:46 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=8) INFO 03-28 12:06:46 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=8) INFO 03-28 12:06:46 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=8) INFO 03-28 12:06:46 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=8) INFO 03-28 12:06:46 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=8) INFO 03-28 12:06:46 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=8) INFO 03-28 12:06:46 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=8) INFO 03-28 12:06:46 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=8) INFO 03-28 12:06:46 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=8) INFO 03-28 12:06:46 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=8) INFO 03-28 12:06:46 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=8) INFO 03-28 12:06:46 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=8) INFO 03-28 12:06:46 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=8) INFO 03-28 12:06:46 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=8) INFO 03-28 12:06:46 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=8) INFO:     Started server process [8]
(APIServer pid=8) INFO:     Waiting for application startup.
(APIServer pid=8) INFO:     Application startup complete.
(APIServer pid=8) INFO:     127.0.0.1:45826 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=8) INFO:     127.0.0.1:56290 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=8) INFO:     127.0.0.1:51600 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=8) INFO:     127.0.0.1:50080 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=8) INFO:     127.0.0.1:35316 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=8) INFO:     127.0.0.1:58764 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=8) INFO:     127.0.0.1:34480 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=8) INFO:     127.0.0.1:43894 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=8) INFO:     127.0.0.1:46488 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=8) INFO:     127.0.0.1:39672 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=8) INFO:     172.19.0.1:33516 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=8) INFO 03-28 12:08:26 [loggers.py:259] Engine 000: Avg prompt throughput: 1.4 tokens/s, Avg generation throughput: 5.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=8) INFO:     127.0.0.1:48602 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=8) INFO 03-28 12:08:36 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=8) INFO:     127.0.0.1:49418 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=8) INFO:     127.0.0.1:56068 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=8) INFO:     127.0.0.1:58122 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=8) INFO:     127.0.0.1:46568 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=8) INFO:     127.0.0.1:43008 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=8) INFO:     127.0.0.1:44352 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=8) INFO:     127.0.0.1:57556 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=8) INFO:     127.0.0.1:42470 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=8) INFO:     127.0.0.1:56392 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=8) INFO:     127.0.0.1:38934 - "GET /v1/models HTTP/1.1" 200 OK