
Architecture

The diagram below provides a high-level view of the MaaS-Base architecture.

[Figure: gpustack-v2-architecture]

The diagram below details the internal components and their interactions.

[Figure: gpustack-v2-components]

Server

The MaaS-Base server consists of the following components:

  • API Server: Provides a RESTful interface for clients to interact with the system. It handles authentication and authorization.
  • Scheduler: Responsible for assigning model instances to workers.
  • Controllers: Manage the state of resources in the system. For example, they handle the rollout and scaling of model instances to match the desired number of replicas.
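
The split between controllers and the scheduler can be illustrated with a minimal sketch. It is not the MaaS-Base implementation: the names `Worker`, `reconcile`, and `schedule`, the `free_gpus` capacity measure, and the "most free GPUs first" placement policy are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    free_gpus: int  # hypothetical capacity measure, one GPU per instance


def reconcile(desired_replicas: int, running: list[str]) -> tuple[int, list[str]]:
    """Controller step: return (instances to create, instances to delete)."""
    diff = desired_replicas - len(running)
    if diff >= 0:
        return diff, []
    # Scale down: remove the instances listed last.
    return 0, running[diff:]


def schedule(count: int, workers: list[Worker]) -> list[str]:
    """Scheduler step: place each new instance on the worker with the most free GPUs."""
    placements = []
    for _ in range(count):
        candidates = [w for w in workers if w.free_gpus > 0]
        if not candidates:
            break  # no capacity left; remaining instances stay unscheduled
        target = max(candidates, key=lambda w: w.free_gpus)
        target.free_gpus -= 1
        placements.append(target.name)
    return placements
```

In this sketch the controller only computes the gap between desired and actual state, and the scheduler only decides placement, which mirrors the division of responsibilities described above.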

Worker

The MaaS-Base worker consists of the following components:

  • MaaS-Base Runtime: Detects GPU devices and interacts with the container runtime to deploy model instances.
  • Serving Manager: Manages the lifecycle of model instances on the worker.
  • Metric Exporter: Exports metrics about the model instances and their performance.

AI Gateway

The AI Gateway handles incoming API requests from clients. It routes requests to the appropriate model instances based on the requested model. MaaS-Base uses Higress for API routing and load balancing.
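
As a conceptual illustration of model-based routing with load balancing, the sketch below picks an instance for each request in round-robin order. It does not use or reflect the Higress API. The `ModelRouter` class, the instance addresses, and the assumption that the request body carries a `model` field are all hypothetical.

```python
import itertools


class ModelRouter:
    """Hypothetical router: maps a model name to its instances, round-robin."""

    def __init__(self, instances: dict[str, list[str]]):
        # One endless round-robin iterator per model that has instances.
        self._cycles = {
            model: itertools.cycle(addrs)
            for model, addrs in instances.items()
            if addrs
        }

    def route(self, request_body: dict) -> str:
        # Assumption: the requested model is named in the request body.
        model = request_body.get("model")
        if model not in self._cycles:
            raise LookupError(f"no running instance for model {model!r}")
        return next(self._cycles[model])
```

For example, a router built with `{"my-model": ["10.0.0.1:8000", "10.0.0.2:8000"]}` would alternate between the two addresses for successive requests that name `my-model`.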

SQL Database

The MaaS-Base server uses a SQL database as its datastore. By default it runs an embedded PostgreSQL instance, but you can configure it to use an external PostgreSQL or MySQL database instead.
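
The datastore choice can be pictured as a small decision on an optional connection URL. The sketch below is an assumption-laden illustration, not MaaS-Base's configuration code: the function `resolve_datastore`, its return labels, and the accepted URL schemes are hypothetical.

```python
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical mapping from URL scheme to database kind.
SUPPORTED_SCHEMES = {"postgresql": "postgresql", "postgres": "postgresql", "mysql": "mysql"}


def resolve_datastore(database_url: Optional[str]) -> str:
    """Return which datastore to use for a given (optional) connection URL."""
    if not database_url:
        # No external database configured: fall back to the embedded one.
        return "embedded-postgresql"
    # Drop any driver suffix such as "postgresql+psycopg2".
    scheme = urlsplit(database_url).scheme.split("+")[0]
    if scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported database scheme: {scheme!r}")
    return "external-" + SUPPORTED_SCHEMES[scheme]
```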

Inference Server

Inference servers are the backends that perform inference tasks. MaaS-Base supports vLLM, SGLang, Ascend MindIE, and VoxBox as built-in inference servers. You can also add custom inference backends.
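
One way to think about built-in and custom backends is as a registry keyed by name. The following sketch is illustrative only: `BackendRegistry` and its methods are hypothetical and do not describe how MaaS-Base registers backends. Only the four built-in names come from the text above.

```python
BUILT_IN_BACKENDS = ("vLLM", "SGLang", "Ascend MindIE", "VoxBox")


class BackendRegistry:
    """Hypothetical registry of inference backends, looked up case-insensitively."""

    def __init__(self) -> None:
        self._backends = {name.lower(): name for name in BUILT_IN_BACKENDS}

    def register(self, name: str) -> None:
        """Add a custom backend; reject names that are already taken."""
        key = name.lower()
        if key in self._backends:
            raise ValueError(f"backend already registered: {name!r}")
        self._backends[key] = name

    def resolve(self, name: str) -> str:
        """Return the canonical backend name, or raise if it is unknown."""
        try:
            return self._backends[name.lower()]
        except KeyError:
            raise LookupError(f"unknown backend: {name!r}") from None
```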

Ray

Ray is a distributed computing framework. MaaS-Base bootstraps a Ray cluster on demand to run distributed vLLM across multiple workers.