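The scheduler is responsible for two related tasks: deciding where to place new model instances during scale-up, and ranking existing model instances for removal during scale-down.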
In both directions, the scheduler works in two simple stages:
The scheduler first does a basic round of filtering on the worker list. This step is mainly about non-resource constraints: cluster, labels, backend compatibility, selected GPUs, and local-path availability.
The filter chain runs in this order:
READY state.
LOCAL_PATH models when GPUs are explicitly selected. It removes workers where the configured model path does not exist.
Only workers that pass this basic filtering step move on to resource-based filtering.
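The basic filtering step can be pictured as a list of filter functions applied one after another. The sketch below is illustrative only and is not the scheduler's actual code: the `Worker` fields, the two filters, and `run_filter_chain` are hypothetical names, and only two of the constraints (READY state and labels) are modelled.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Worker:
    # Hypothetical worker record; a real worker model carries more fields.
    name: str
    state: str = "READY"
    labels: Dict[str, str] = field(default_factory=dict)


WorkerFilter = Callable[[List[Worker]], List[Worker]]


def ready_filter(workers: List[Worker]) -> List[Worker]:
    """Keep only workers whose state is READY."""
    return [w for w in workers if w.state == "READY"]


def make_label_filter(required: Dict[str, str]) -> WorkerFilter:
    """Build a filter that keeps workers carrying every required label."""
    def apply(workers: List[Worker]) -> List[Worker]:
        return [
            w for w in workers
            if all(w.labels.get(k) == v for k, v in required.items())
        ]
    return apply


def run_filter_chain(workers: List[Worker], filters: List[WorkerFilter]) -> List[Worker]:
    """Apply each filter in order and stop early once no workers remain."""
    for f in filters:
        workers = f(workers)
        if not workers:
            break
    return workers


workers = [
    Worker("w1", "READY", {"zone": "a"}),
    Worker("w2", "NOT_READY", {"zone": "a"}),
    Worker("w3", "READY", {"zone": "b"}),
]
survivors = run_filter_chain(workers, [ready_filter, make_label_filter({"zone": "a"})])
```

Keeping each constraint in its own function makes the order of the chain explicit and lets a single filter be added or removed without touching the others.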
This step is also a filter. Instead of filtering by metadata or worker state, it filters by resources: can this worker or placement actually provide enough RAM/VRAM to run the model?
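As a rough illustration of resource-based filtering, the sketch below keeps only workers whose free RAM and VRAM cover a model's estimated requirement. The `WorkerCapacity` type, its fields, and `resource_filter` are assumptions made for this example. The sketch checks whole-worker totals only and does not model per-GPU or multi-worker placements.

```python
from dataclasses import dataclass
from typing import List

GIB = 1024 ** 3


@dataclass
class WorkerCapacity:
    # Hypothetical view of a worker's currently allocatable resources, in bytes.
    name: str
    free_ram: int
    free_vram: int


def resource_filter(
    workers: List[WorkerCapacity], need_ram: int, need_vram: int
) -> List[WorkerCapacity]:
    """Keep workers that can provide at least the required RAM and VRAM."""
    return [
        w for w in workers
        if w.free_ram >= need_ram and w.free_vram >= need_vram
    ]


capacities = [
    WorkerCapacity("small", free_ram=8 * GIB, free_vram=6 * GIB),
    WorkerCapacity("large", free_ram=64 * GIB, free_vram=24 * GIB),
]
runnable = resource_filter(capacities, need_ram=4 * GIB, need_vram=16 * GIB)
```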
Resource requirements are determined differently depending on the model type:
Backend capabilities are different, so the available fallback paths are also different:
Candidates are then evaluated in order, and the process stops as soon as one strategy returns runnable candidates. In general, the scheduler tries:
A few details matter here:
Once runnable candidates are found, the scheduler scores them with a scorer chain and picks the candidate with the highest total score.
Current scale-up scoring is:
The total candidate score is the sum of all enabled scorers.
The placement scorer is always enabled for scale-up.
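The "sum the enabled scorers, pick the highest total" rule can be sketched as follows. The `Candidate` fields, the two scorer functions, and the numeric values are hypothetical and chosen only to show how enabling an extra scorer can change the winner.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    # Hypothetical candidate with precomputed per-scorer inputs.
    worker: str
    placement: float
    locality: float


Scorer = Callable[[Candidate], float]


def placement_score(c: Candidate) -> float:
    return c.placement


def locality_score(c: Candidate) -> float:
    return c.locality


def total_score(c: Candidate, scorers: List[Scorer]) -> float:
    """Total score is the sum of every enabled scorer's result."""
    return sum(s(c) for s in scorers)


def pick_best(candidates: List[Candidate], scorers: List[Scorer]) -> Candidate:
    """Return the candidate with the highest total score."""
    return max(candidates, key=lambda c: total_score(c, scorers))


candidates = [
    Candidate("w1", placement=60.0, locality=0.0),
    Candidate("w2", placement=50.0, locality=20.0),
]
best_with_locality = pick_best(candidates, [placement_score, locality_score])
best_placement_only = pick_best(candidates, [placement_score])
```

With both scorers enabled `w2` wins (70 versus 60). With only the placement scorer, `w1` wins.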
The binpack strategy packs as many model instances as possible into the fewest "bins" (e.g., workers or GPUs) to maximize resource utilization. Each model instance is placed in the bin that has the least remaining space but can still hold it, which minimizes leftover capacity without exceeding any bin's limit.
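The rule "place the instance in the bin with the least remaining space" corresponds to a best-fit choice. A minimal sketch follows, assuming each bin is described only by its free capacity in arbitrary units. `binpack_choose` is a hypothetical helper, not the scorer used by the scheduler.

```python
from typing import Dict, Optional


def binpack_choose(free_by_bin: Dict[str, int], required: int) -> Optional[str]:
    """Best-fit: among bins that can hold the instance, pick the one
    that would have the least capacity left over afterwards."""
    fitting = {b: free for b, free in free_by_bin.items() if free >= required}
    if not fitting:
        return None  # no bin has enough room
    return min(fitting, key=lambda b: fitting[b] - required)


free_vram_gib = {"gpu0": 24, "gpu1": 10, "gpu2": 6}
chosen = binpack_choose(free_vram_gib, required=8)
```

Here `gpu2` is too small, and `gpu1` is preferred over `gpu0` because it leaves 2 units free instead of 16, keeping the larger bin available for bigger models.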
The spread strategy distributes model instances across different workers as evenly as possible, which improves fault tolerance and load balancing.
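One simple way to approximate even distribution is to always place the next instance on the worker that currently hosts the fewest instances. The sketch below assumes that rule. `spread_place` is a hypothetical helper and ignores resource limits.

```python
from typing import Dict, List


def spread_place(workers: List[str], instance_count: int) -> Dict[str, int]:
    """Place instances one at a time on the least-loaded worker.
    Ties go to the worker listed first."""
    load = {w: 0 for w in workers}
    for _ in range(instance_count):
        target = min(load, key=load.get)
        load[target] += 1
    return load


distribution = spread_place(["w1", "w2", "w3"], instance_count=4)
```

With four instances and three workers, no worker ends up with more than one instance above any other, so losing a single worker removes at most two of the four replicas.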
Additional behavior:
The model-file locality scorer biases placement toward workers that already have the required model files in the READY state.
What it does:
After scoring, the scheduler picks the candidate with the highest total score and assigns the model instance to that placement.
When the desired replica count is lower than the number of existing model instances, the controller ranks current instances and deletes the lowest-ranked ones first.
Scale-down uses a separate scorer chain over existing model instances:
The resulting instances are sorted by score in ascending order, and the lowest-scoring instances are removed first.
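The ascending sort and lowest-first removal can be sketched like this. The `Instance` fields, the scorer functions, and the weight of 100 for a running instance are assumptions made for illustration; they are not the actual scorer values.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Instance:
    # Hypothetical model-instance record.
    name: str
    state: str
    placement_keep: float  # higher means "prefer to keep"


InstanceScorer = Callable[[Instance], float]


def status_score(i: Instance) -> float:
    """Healthy replicas score higher, so unhealthy ones are removed first."""
    return 100.0 if i.state == "RUNNING" else 0.0


def placement_keep_score(i: Instance) -> float:
    return i.placement_keep


def select_for_removal(
    instances: List[Instance], scorers: List[InstanceScorer], desired: int
) -> List[Instance]:
    """Sort ascending by total score and return the surplus lowest-scoring instances."""
    excess = len(instances) - desired
    if excess <= 0:
        return []
    ranked = sorted(instances, key=lambda i: sum(s(i) for s in scorers))
    return ranked[:excess]


instances = [
    Instance("a", "RUNNING", 30.0),
    Instance("b", "ERROR", 80.0),
    Instance("c", "RUNNING", 10.0),
]
to_remove = select_for_removal(instances, [status_score, placement_keep_score], desired=1)
```

In this example the unhealthy instance `b` is removed first despite its high placement score, followed by `c`, leaving `a` as the single remaining replica.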
The status scorer prefers healthy replicas. This makes unhealthy or not-yet-ready replicas the first candidates for removal.
For GGUF models that report total_layers and offload_layers, this scorer prefers instances with more layers already offloaded:
0 from this scorer.
The same placement scorer is reused during scale-down, but it evaluates existing placements from a removal perspective instead of a placement perspective. Placement still reflects the model's binpack or spread policy, but the score is interpreted as a keep-preference, so lower placement scores are more likely to be removed first.