Open-source inference engines like vLLM and SGLang deliver excellent inference performance, but the gap between a tuned deployment and an untuned one can be larger than you might expect. The most reliable way to validate a configuration is to benchmark it with your actual traffic on your target hardware. That said, we have run numerous experiments across different inference engines, GPU devices, models, and configuration parameter combinations, and some general observations from these experiments can offer initial guidance before you dive into deep optimization.
The following observations are based on the scope of our current experiments and may be updated or supplemented as more testing is done or as the community makes progress. For optimization methods and conclusions related to specific models on specific GPUs, please refer to the corresponding experimental documentation.
The choice of inference engine is crucial. Inference optimization often involves meticulous engine-specific tuning for particular scenarios, such as specific models, quantization schemes, or GPUs. Consequently, whether an engine has been optimized for a given scenario makes a significant difference. For instance, vLLM runs gpt-oss-20b more than ten times faster than either SGLang or TensorRT-LLM on an A100 GPU (see details). However, we cannot simply state that Engine A is universally better than Engine B: in our experimental results, vLLM, SGLang, and TensorRT-LLM each achieved the best performance in specific scenarios.
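Speculative decoding is an effective method for reducing latency. However, its benefit degrades significantly as the batch size increases: at large batch sizes the GPU is typically already close to compute-bound, which leaves little spare capacity for verifying draft tokens. It is therefore generally not suitable for improving throughput.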
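To make the batch-size effect concrete, here is a toy model rather than a measurement. It assumes each draft token is accepted independently with probability `alpha`, so a step with `k` draft tokens emits (1 - alpha^(k+1)) / (1 - alpha) tokens on average, which is the standard result from the speculative decoding literature. The cost model is a hypothetical simplification introduced only for this sketch: a forward pass costs a fixed amount until the GPU saturates at `saturation_batch` tokens, then grows linearly, and each draft step costs `draft_cost` of a target-model step.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per verification step, assuming each of the
    k draft tokens is accepted independently with probability alpha."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def toy_speedup(alpha: float, k: int, batch_size: int,
                saturation_batch: int = 64, draft_cost: float = 0.05) -> float:
    """Per-sequence decoding speedup under a deliberately simple cost model.

    ASSUMPTION: saturation_batch and draft_cost are illustrative knobs,
    not values measured on any engine or GPU.
    """
    def step_cost(tokens_per_seq: int) -> float:
        # Flat (memory-bound) cost until the GPU saturates, then linear.
        return max(1.0, batch_size * tokens_per_seq / saturation_batch)

    # Without speculation: one target step yields one token per sequence.
    baseline_cost_per_token = step_cost(1)
    # With speculation: k draft steps plus one target step over k + 1 tokens.
    spec_cost_per_step = k * draft_cost * step_cost(1) + step_cost(k + 1)
    tokens_per_step = expected_tokens_per_step(alpha, k)
    return tokens_per_step * baseline_cost_per_token / spec_cost_per_step


if __name__ == "__main__":
    for batch_size in (1, 8, 32, 128, 256):
        print(batch_size, round(toy_speedup(0.8, 4, batch_size), 2))
```

Under these assumptions the speedup is well above 1 at small batch sizes and drops below 1 once verification itself becomes compute-bound. The crossover point in a real deployment depends on the model, draft method, and hardware, so it should be measured rather than predicted from this model.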
Parallelism strategies are essential for multi-GPU distributed inference.
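As a rough illustration of the search space, the sketch below enumerates tensor-parallel by pipeline-parallel layouts for a given GPU count. It assumes that the tensor-parallel degree must evenly divide the number of attention heads and should fit within a single node. These are common constraints, not universal rules, and the function and parameter names are hypothetical rather than part of any engine's API.

```python
def candidate_layouts(num_gpus: int, num_attention_heads: int,
                      gpus_per_node: int) -> list[tuple[int, int]]:
    """Return (tensor_parallel, pipeline_parallel) pairs that use all GPUs.

    ASSUMPTIONS (illustrative only):
      - tensor_parallel * pipeline_parallel == num_gpus
      - tensor_parallel divides num_attention_heads
      - tensor parallelism stays inside one node
    """
    layouts = []
    for tp in range(1, num_gpus + 1):
        if num_gpus % tp != 0:
            continue  # cannot split the GPUs evenly
        if num_attention_heads % tp != 0:
            continue  # heads cannot be sharded evenly
        if tp > gpus_per_node:
            continue  # would push tensor-parallel traffic across nodes
        layouts.append((tp, num_gpus // tp))
    return layouts


if __name__ == "__main__":
    # Two hypothetical 8-GPU nodes and a model with 64 attention heads.
    print(candidate_layouts(num_gpus=16, num_attention_heads=64, gpus_per_node=8))
```

Each candidate layout still needs to be benchmarked; the enumeration only narrows down which combinations are worth trying.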
vLLM and SGLang typically choose reasonable default attention backends and kernels based on the hardware environment. In most scenarios, the default attention backend is the most appropriate. Kernel optimizations like DeepGEMM apply only to specific precisions and GPUs. Although they are often enabled by default, there are cases where disabling them works better.
A deployment configuration that performs well generally improves performance across different Input Sequence Lengths (ISL) and Output Sequence Lengths (OSL), though the size of the improvement can differ significantly from one ISL/OSL combination to another.
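One way to check this for your own deployment is to compare throughput per ISL/OSL shape instead of relying on a single aggregate number. The following is a minimal sketch that assumes you have already measured throughput (for example, in tokens per second) for a baseline and a tuned configuration. The dictionary layout and function names are hypothetical.

```python
def speedup_by_shape(baseline: dict, tuned: dict) -> dict:
    """Both inputs map (isl, osl) -> measured throughput.

    Returns tuned / baseline for every shape present in both inputs.
    """
    ratios = {}
    for shape, base_throughput in baseline.items():
        if shape in tuned and base_throughput > 0:
            ratios[shape] = tuned[shape] / base_throughput
    return ratios


def summarize(ratios: dict) -> dict:
    """Report the spread of speedups and whether every shape improved."""
    values = list(ratios.values())
    return {
        "min": min(values),
        "max": max(values),
        "all_improved": all(v > 1.0 for v in values),
    }
```

A wide gap between `min` and `max` indicates that the tuning helps some request shapes far more than others, which matters if your real traffic is dominated by one shape.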
Some parameters require tuning based on the actual inference request patterns, such as ISL/OSL, prefix repetition in the data, and concurrency. These include: