# Optimizing Qwen3-32B Throughput on NVIDIA A100 GPUs

## Conclusion

![qwen3-32b-a100](../../assets/performance-lab/qwen3-32b-a100.png)

Recommended configuration for optimizing throughput of Qwen3-32B on A100 GPUs:

???+ tip "Serving Command"
    ```bash
    trtllm-serve Qwen/Qwen3-32B --max_seq_len=40960 --enable_chunked_prefill
    ```

Comparison of benchmark results before and after optimization:

| Benchmark Case | Baseline (vLLM without any optimizations) | Optimized |
|----------|-------------------------------------------|-----------|
| **ShareGPT** | Total TPS: 1730.98<br>Mean TPOT (ms): 149.67 | Total TPS: 2054.21 (+18.7%)<br>Mean TPOT (ms): 89.01 |
| **Short Prompt** | Total TPS: 2656.71<br>Mean TPOT (ms): 528.57 | Total TPS: 3535.54 (+33.1%)<br>Mean TPOT (ms): 1885.18 |
| **Medium Prompt** | Total TPS: 2521.38<br>Mean TPOT (ms): 143.23 | Total TPS: 2715.90 (+7.7%)<br>Mean TPOT (ms): 145.49 |
| **Long Prompt** | Total TPS: 1837.08<br>Mean TPOT (ms): 96.19 | Total TPS: 2037.67 (+10.9%)<br>Mean TPOT (ms): 100.09 |
| **Very Long Prompt** | Total TPS: 1798.45<br>Mean TPOT (ms): 67.87 | Total TPS: 1580.36 (-12.1%)<br>Mean TPOT (ms): 47.37 |

!!! note
    1. Our benchmark tests do not cover all possible optimization combinations. For example, we select the inference engine that performs best under its default configuration as the starting point for further tuning. This pruning approach yields a local optimum, which may not be the global optimum.
    2. There are other optimization methods that depend on specific user scenarios, including max batch size, schedule configuration, extended KV cache, CUDA graph, Torch Compile, etc. The conclusions in this document can serve as a starting point for more targeted optimizations.
    3. The tests are conducted on specific hardware and software setups. Advances in the inference engine may lead to new conclusions.

If there are any missing points or updates reflecting new changes, please [let us know](https://github.com/gpustack/gpustack/issues/new/choose).

## Optimization Objective

Achieve high throughput under high-concurrency request scenarios.

## Experimental Setup

### Model

Qwen3-32B

### Hardware

A single NVIDIA A100 SXM 80GB GPU.

### Engine Version

- vLLM: v0.11.0
- SGLang: v0.5.5.post1
- TensorRT-LLM: v1.2.0rc1

### Benchmark Dataset

1. ShareGPT
2. Random dataset with varying sequence lengths:
    - Very long prompt: 32000 input tokens, 100 output tokens
    - Long prompt: 4000 input tokens, 200 output tokens
    - Medium prompt: 2000 input tokens, 100 output tokens
    - Short prompt: 128 input tokens, 4 output tokens

### Benchmark Script

We use the **vLLM bench CLI** tool to benchmark the model performance.
The following commands are used to run the benchmark:

```bash
# Prepare the ShareGPT dataset
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

# Benchmark on ShareGPT dataset
vllm bench serve --model Qwen/Qwen3-32B --backend openai-chat --endpoint /v1/chat/completions --dataset-name sharegpt --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 1000

# Benchmark on random dataset (fixed seed for reproducibility)
vllm bench serve --model Qwen/Qwen3-32B --backend openai-chat --endpoint /v1/chat/completions --dataset-name random --random-input-len 4000 --random-output-len 200 --num-prompts 500 --seed 42
```

## Experiment Results

### 1. Choosing the Inference Engine

vLLM

??? info "Serving script"
    ```bash
    vllm serve Qwen/Qwen3-32B --max-model-len 32768
    ```

??? info "Benchmark result"
    ```
    ============ Serving Benchmark Result ============
    Successful requests: 1000
    Benchmark duration (s): 241.94
    Total input tokens: 217393
    Total generated tokens: 201396
    Request throughput (req/s): 4.13
    Output token throughput (tok/s): 832.43
    Peak output token throughput (tok/s): 1950.00
    Peak concurrent requests: 1000.00
    Total Token throughput (tok/s): 1730.98
    ---------------Time to First Token----------------
    Mean TTFT (ms): 105557.78
    Median TTFT (ms): 105990.93
    P99 TTFT (ms): 209863.50
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms): 149.67
    Median TPOT (ms): 123.31
    P99 TPOT (ms): 563.16
    ---------------Inter-token Latency----------------
    Mean ITL (ms): 117.98
    Median ITL (ms): 61.18
    P99 ITL (ms): 572.34
    ==================================================
    ```

SGLang

??? info "Serving script"
    ```bash
    python3 -m sglang.launch_server --model-path Qwen/Qwen3-32B
    ```

??? info "Benchmark result"
    ```
    ============ Serving Benchmark Result ============
    Successful requests: 1000
    Benchmark duration (s): 299.51
    Total input tokens: 217393
    Total generated tokens: 201439
    Request throughput (req/s): 3.34
    Output token throughput (tok/s): 672.56
    Peak output token throughput (tok/s): 1295.00
    Peak concurrent requests: 1000.00
    Total Token throughput (tok/s): 1398.38
    ---------------Time to First Token----------------
    Mean TTFT (ms): 139199.11
    Median TTFT (ms): 141086.43
    P99 TTFT (ms): 271684.04
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms): 75.11
    Median TPOT (ms): 66.68
    P99 TPOT (ms): 149.79
    ---------------Inter-token Latency----------------
    Mean ITL (ms): 69.65
    Median ITL (ms): 49.10
    P99 ITL (ms): 298.90
    ==================================================
    ```

TensorRT-LLM

??? info "Serving script"
    ```bash
    trtllm-serve Qwen/Qwen3-32B
    ```

??? info "Benchmark result"
    ```
    ============ Serving Benchmark Result ============
    Successful requests: 1000
    Benchmark duration (s): 203.88
    Total input tokens: 217393
    Total generated tokens: 201412
    Request throughput (req/s): 4.90
    Output token throughput (tok/s): 987.91
    Peak output token throughput (tok/s): 1741.00
    Peak concurrent requests: 1000.00
    Total Token throughput (tok/s): 2054.21
    ---------------Time to First Token----------------
    Mean TTFT (ms): 84105.84
    Median TTFT (ms): 84356.20
    P99 TTFT (ms): 171829.62
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms): 89.01
    Median TPOT (ms): 79.85
    P99 TPOT (ms): 265.32
    ---------------Inter-token Latency----------------
    Mean ITL (ms): 78.87
    Median ITL (ms): 54.75
    P99 ITL (ms): 318.76
    ==================================================
    ```

Result: TensorRT-LLM (2054.21 tok/s) > vLLM (1730.98 tok/s) > SGLang (1398.38 tok/s)

### 2. Quantization in TensorRT-LLM/vLLM

TensorRT-LLM FP8

??? info "Serving script"
    ```bash
    trtllm-serve Qwen/Qwen3-32B-FP8
    ```

??? info "Benchmark result"
    ```
    # RuntimeError: Unsupported SM version for FP8 block scaling GEMM
    ```

TensorRT-LLM AWQ

??? info "Serving script"
    ```bash
    trtllm-serve Qwen/Qwen3-32B-AWQ
    ```

??? info "Benchmark result"
    ```
    # KeyError: 'weight'
    ```

vLLM FP8

??? info "Serving script"
    ```bash
    vllm serve Qwen/Qwen3-32B-FP8
    ```

??? info "Benchmark result"
    ```
    ============ Serving Benchmark Result ============
    Successful requests: 1000
    Benchmark duration (s): 267.77
    Total input tokens: 217393
    Total generated tokens: 201218
    Request throughput (req/s): 3.73
    Output token throughput (tok/s): 751.46
    Peak output token throughput (tok/s): 1965.00
    Peak concurrent requests: 1000.00
    Total Token throughput (tok/s): 1563.33
    ---------------Time to First Token----------------
    Mean TTFT (ms): 103256.52
    Median TTFT (ms): 94937.77
    P99 TTFT (ms): 225680.67
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms): 339.76
    Median TPOT (ms): 310.91
    P99 TPOT (ms): 944.57
    ---------------Inter-token Latency----------------
    Mean ITL (ms): 273.16
    Median ITL (ms): 218.10
    P99 ITL (ms): 954.38
    ==================================================
    ```

vLLM AWQ

??? info "Serving script"
    ```bash
    vllm serve Qwen/Qwen3-32B-AWQ
    ```

??? info "Benchmark result"
    ```
    ============ Serving Benchmark Result ============
    Successful requests: 1000
    Benchmark duration (s): 237.22
    Total input tokens: 217393
    Total generated tokens: 201832
    Request throughput (req/s): 4.22
    Output token throughput (tok/s): 850.81
    Peak output token throughput (tok/s): 2072.00
    Peak concurrent requests: 1000.00
    Total Token throughput (tok/s): 1767.21
    ---------------Time to First Token----------------
    Mean TTFT (ms): 92528.26
    Median TTFT (ms): 85550.21
    P99 TTFT (ms): 202791.46
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms): 302.45
    Median TPOT (ms): 277.31
    P99 TPOT (ms): 812.48
    ---------------Inter-token Latency----------------
    Mean ITL (ms): 243.88
    Median ITL (ms): 203.46
    P99 ITL (ms): 817.18
    ==================================================
    ```

Result: TensorRT-LLM failed with the errors shown above for both quantized checkpoints, and vLLM with FP8 (1563.33 tok/s) or AWQ (1767.21 tok/s) stays below unquantized TensorRT-LLM (2054.21 tok/s).

### 3. Max Batched Token Numbers in TensorRT-LLM

??? info "Serving script"
    ```bash
    trtllm-serve Qwen/Qwen3-32B --max_num_tokens=16384
    ```

??? info "Benchmark result"
    ```
    # --max_num_tokens=16384
    ============ Serving Benchmark Result ============
    Successful requests: 1000
    Benchmark duration (s): 220.69
    Total input tokens: 217393
    Total generated tokens: 201360
    Request throughput (req/s): 4.53
    Output token throughput (tok/s): 912.43
    Peak output token throughput (tok/s): 1560.00
    Peak concurrent requests: 1000.00
    Total Token throughput (tok/s): 1897.50
    ---------------Time to First Token----------------
    Mean TTFT (ms): 90546.36
    Median TTFT (ms): 89254.77
    P99 TTFT (ms): 184658.38
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms): 79.75
    Median TPOT (ms): 74.69
    P99 TPOT (ms): 191.75
    ---------------Inter-token Latency----------------
    Mean ITL (ms): 74.37
    Median ITL (ms): 53.56
    P99 ITL (ms): 325.39
    ==================================================

    # --max_num_tokens=32768
    ============ Serving Benchmark Result ============
    Successful requests: 1000
    Benchmark duration (s): 257.31
    Total input tokens: 217393
    Total generated tokens: 201414
    Request throughput (req/s): 3.89
    Output token throughput (tok/s): 782.77
    Peak output token throughput (tok/s): 1266.00
    Peak concurrent requests: 1000.00
    Total Token throughput (tok/s): 1627.65
    ---------------Time to First Token----------------
    Mean TTFT (ms): 113890.79
    Median TTFT (ms): 114913.27
    P99 TTFT (ms): 225978.59
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms): 66.22
    Median TPOT (ms): 64.31
    P99 TPOT (ms): 96.69
    ---------------Inter-token Latency----------------
    Mean ITL (ms): 64.16
    Median ITL (ms): 49.74
    P99 ITL (ms): 258.34
    ==================================================
    ```

Result: both settings lower total throughput relative to the TensorRT-LLM default (2054.21 tok/s): 1897.50 tok/s with 16384 and 1627.65 tok/s with 32768.

### Summary of Optimization Options

| Optimization Option | Throughput Improvement |
|-----------------------------|-------------------------------------|
| Engine Selection | +18.7% |
| Quantization | - |
| Max Batched Token Numbers | - |

### Other Benchmark Cases

We further benchmarked the optimized configuration to evaluate its generalization under various workloads.

??? info "Baseline serving script"
    ```bash
    vllm serve Qwen/Qwen3-32B
    ```

??? info "Baseline benchmark results"
    ```bash
    # random 32K input
    ============ Serving Benchmark Result ============
    Successful requests: 100
    Benchmark duration (s): 1783.59
    Total input tokens: 3200000
    Total generated tokens: 7699
    Request throughput (req/s): 0.06
    Output token throughput (tok/s): 4.32
    Peak output token throughput (tok/s): 21.00
    Peak concurrent requests: 100.00
    Total Token throughput (tok/s): 1798.45
    ---------------Time to First Token----------------
    Mean TTFT (ms): 890711.38
    Median TTFT (ms): 892708.86
    P99 TTFT (ms): 1760024.36
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms): 67.87
    Median TPOT (ms): 67.32
    P99 TPOT (ms): 147.19
    ---------------Inter-token Latency----------------
    Mean ITL (ms): 64.54
    Median ITL (ms): 48.55
    P99 ITL (ms): 690.90
    ==================================================

    # random 4K input
    ============ Serving Benchmark Result ============
    Successful requests: 500
    Benchmark duration (s): 1141.94
    Total input tokens: 1997942
    Total generated tokens: 99893
    Request throughput (req/s): 0.44
    Output token throughput (tok/s): 87.48
    Peak output token throughput (tok/s): 180.00
    Peak concurrent requests: 500.00
    Total Token throughput (tok/s): 1837.08
    ---------------Time to First Token----------------
    Mean TTFT (ms): 566579.70
    Median TTFT (ms): 564705.97
    P99 TTFT (ms): 1116945.95
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms): 96.19
    Median TPOT (ms): 96.40
    P99 TPOT (ms): 99.35
    ---------------Inter-token Latency----------------
    Mean ITL (ms): 95.82
    Median ITL (ms): 51.55
    P99 ITL (ms): 628.61
    ==================================================

    # random 2K input
    ============ Serving Benchmark Result ============
    Successful requests: 500
    Benchmark duration (s): 415.71
    Total input tokens: 998175
    Total generated tokens: 50000
    Request throughput (req/s): 1.20
    Output token throughput (tok/s): 120.27
    Peak output token throughput (tok/s): 381.00
    Peak concurrent requests: 500.00
    Total Token throughput (tok/s): 2521.38
    ---------------Time to First Token----------------
    Mean TTFT (ms): 208397.03
    Median TTFT (ms): 209108.79
    P99 TTFT (ms): 408103.76
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms): 143.23
    Median TPOT (ms): 143.96
    P99 TPOT (ms): 166.03
    ---------------Inter-token Latency----------------
    Mean ITL (ms): 142.07
    Median ITL (ms): 51.49
    P99 ITL (ms): 590.04
    ==================================================

    # random 128 input
    ============ Serving Benchmark Result ============
    Successful requests: 1000
    Benchmark duration (s): 49.57
    Total input tokens: 127698
    Total generated tokens: 4000
    Request throughput (req/s): 20.17
    Output token throughput (tok/s): 80.69
    Peak output token throughput (tok/s): 200.00
    Peak concurrent requests: 1000.00
    Total Token throughput (tok/s): 2656.71
    ---------------Time to First Token----------------
    Mean TTFT (ms): 31177.64
    Median TTFT (ms): 31073.76
    P99 TTFT (ms): 48960.57
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms): 528.57
    Median TPOT (ms): 544.55
    P99 TPOT (ms): 547.31
    ---------------Inter-token Latency----------------
    Mean ITL (ms): 396.42
    Median ITL (ms): 543.22
    P99 ITL (ms): 554.84
    ==================================================
    ```

??? info "Optimized serving script"
    ```bash
    trtllm-serve Qwen/Qwen3-32B --max_seq_len=40960 --enable_chunked_prefill
    ```

??? info "Optimized benchmark results"
    ```bash
    # random 32K input
    ============ Serving Benchmark Result ============
    Successful requests: 100
    Benchmark duration (s): 2029.84
    Total input tokens: 3200000
    Total generated tokens: 7880
    Request throughput (req/s): 0.05
    Output token throughput (tok/s): 3.88
    Peak output token throughput (tok/s): 23.00
    Peak concurrent requests: 100.00
    Total Token throughput (tok/s): 1580.36
    ---------------Time to First Token----------------
    Mean TTFT (ms): 1040855.88
    Median TTFT (ms): 1047384.87
    P99 TTFT (ms): 2008642.88
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms): 47.37
    Median TPOT (ms): 47.37
    P99 TPOT (ms): 47.49
    ---------------Inter-token Latency----------------
    Mean ITL (ms): 46.77
    Median ITL (ms): 47.36
    P99 ITL (ms): 49.08
    ==================================================

    # random 4K input
    ============ Serving Benchmark Result ============
    Successful requests: 500
    Benchmark duration (s): 1029.52
    Total input tokens: 1997942
    Total generated tokens: 99874
    Request throughput (req/s): 0.49
    Output token throughput (tok/s): 97.01
    Peak output token throughput (tok/s): 241.00
    Peak concurrent requests: 500.00
    Total Token throughput (tok/s): 2037.67
    ---------------Time to First Token----------------
    Mean TTFT (ms): 509693.25
    Median TTFT (ms): 505676.34
    P99 TTFT (ms): 1014720.64
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms): 100.09
    Median TPOT (ms): 98.07
    P99 TPOT (ms): 115.57
    ---------------Inter-token Latency----------------
    Mean ITL (ms): 99.60
    Median ITL (ms): 50.97
    P99 ITL (ms): 2379.87
    ==================================================

    # random 2K input
    ============ Serving Benchmark Result ============
    Successful requests: 500
    Benchmark duration (s): 385.94
    Total input tokens: 998175
    Total generated tokens: 50000
    Request throughput (req/s): 1.30
    Output token throughput (tok/s): 129.55
    Peak output token throughput (tok/s): 480.00
    Peak concurrent requests: 500.00
    Total Token throughput (tok/s): 2715.90
    ---------------Time to First Token----------------
    Mean TTFT (ms): 189776.20
    Median TTFT (ms): 189321.80
    P99 TTFT (ms): 378730.86
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms): 145.49
    Median TPOT (ms): 144.79
    P99 TPOT (ms): 178.50
    ---------------Inter-token Latency----------------
    Mean ITL (ms): 144.04
    Median ITL (ms): 51.18
    P99 ITL (ms): 2256.49
    ==================================================

    # random 128 input
    ============ Serving Benchmark Result ============
    Successful requests: 1000
    Benchmark duration (s): 37.25
    Total input tokens: 127698
    Total generated tokens: 4000
    Request throughput (req/s): 26.85
    Output token throughput (tok/s): 107.38
    Peak output token throughput (tok/s): 400.00
    Peak concurrent requests: 1000.00
    Total Token throughput (tok/s): 3535.54
    ---------------Time to First Token----------------
    Mean TTFT (ms): 19761.77
    Median TTFT (ms): 19497.33
    P99 TTFT (ms): 36592.74
    -----Time per Output Token (excl. 1st token)------
    Mean TPOT (ms): 1885.18
    Median TPOT (ms): 2140.52
    P99 TPOT (ms): 2149.03
    ---------------Inter-token Latency----------------
    Mean ITL (ms): 1413.87
    Median ITL (ms): 2136.43
    P99 ITL (ms): 2170.23
    ==================================================
    ```