The diagram below provides a high-level view of the GPUStack architecture.
The diagram below details the internal components and their interactions.
The GPUStack server consists of the following components:
The GPUStack worker consists of the following components:
The AI Gateway handles incoming API requests from clients. It routes requests to the appropriate model instances based on the requested model. GPUStack uses Higress for API routing and load balancing.
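
The routing idea can be illustrated with a minimal sketch. It is not Higress's implementation or configuration: the `ModelRouter` class, the round-robin policy, and the model names and addresses are all hypothetical. The sketch only shows how a request's `model` field can select one of several running instances.

```python
from collections import defaultdict
from itertools import cycle


class ModelRouter:
    """Toy router (hypothetical): maps a model name to its instances
    and rotates between them in round-robin order."""

    def __init__(self):
        self._instances = defaultdict(list)  # model name -> instance addresses
        self._cursors = {}                   # model name -> round-robin iterator

    def register(self, model, address):
        self._instances[model].append(address)
        # Rebuild the iterator so the new instance joins the rotation.
        self._cursors[model] = cycle(self._instances[model])

    def route(self, request):
        # The requested model decides which pool of instances is eligible.
        model = request["model"]
        if model not in self._cursors:
            raise LookupError(f"no running instance for model {model!r}")
        return next(self._cursors[model])


# Illustrative model names and addresses (assumptions, not GPUStack defaults).
router = ModelRouter()
router.register("qwen3", "10.0.0.1:40000")
router.register("qwen3", "10.0.0.2:40000")
router.register("whisper", "10.0.0.3:40001")

targets = [router.route({"model": "qwen3"}) for _ in range(3)]
```

Here the three `qwen3` requests alternate between the two registered instances, while a request for a model with no registered instance raises `LookupError`. A real gateway would also handle health checks, authentication, and streaming, which this sketch leaves out.
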
The GPUStack server uses a SQL database as its datastore. By default it runs an embedded PostgreSQL instance, but you can configure it to use an external PostgreSQL or MySQL database instead.
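
As a rough illustration of the three datastore options, the sketch below classifies a connection URL. It assumes the external database is given as a SQLAlchemy-style URL such as `postgresql://user:password@host:port/dbname`. The `describe_datastore` function and the example hosts are hypothetical and not part of GPUStack.

```python
from urllib.parse import urlsplit

# Backends named in the text above; the URL schemes are an assumption.
SUPPORTED = {"postgresql": "PostgreSQL", "mysql": "MySQL"}


def describe_datastore(database_url=None):
    """Return a short description of the datastore a URL would select."""
    if not database_url:
        # No URL configured: fall back to the embedded database.
        return "embedded PostgreSQL (default)"
    parts = urlsplit(database_url)
    # Drop an optional driver suffix such as "+pymysql".
    scheme = parts.scheme.split("+")[0]
    if scheme not in SUPPORTED:
        raise ValueError(f"unsupported database scheme: {parts.scheme!r}")
    return f"external {SUPPORTED[scheme]} on {parts.hostname}"


default_store = describe_datastore()
pg_store = describe_datastore("postgresql://user:secret@db.example.com:5432/gpustack")
mysql_store = describe_datastore("mysql://user:secret@mysql.example.com:3306/gpustack")
```

Check the GPUStack configuration reference for the actual option name and the exact URL format it accepts.
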
Inference servers are the backends that perform inference tasks. GPUStack supports vLLM, SGLang, Ascend MindIE, and VoxBox as built-in inference servers. You can also add custom inference backends.
Ray is a distributed computing framework that GPUStack uses to run distributed vLLM. GPUStack bootstraps a Ray cluster on demand to run distributed vLLM across multiple workers.