Observability

This document describes how to monitor metrics from the GPUStack server, workers, and LLM serving runtimes using Prometheus and Grafana.

Overview

GPUStack provides a comprehensive set of metrics for model serving and GPU resource management. By integrating Prometheus and Grafana, users can collect, store, and visualize these metrics in real time, enabling efficient monitoring and troubleshooting.

Built-in Observability (Default)

By default, GPUStack starts with an embedded Prometheus and Grafana. You can access them via:

Prometheus: http://your_gpustack_server_host_ip/prometheus
Grafana: http://your_gpustack_server_host_ip/grafana

Built-in Grafana is configured for anonymous Viewer access and has the login form disabled. Admin credentials remain admin / grafana by default.

External Observability (Optional)

If you want an external Prometheus/Grafana stack, we recommend using the provided Docker Compose files. First, run the following commands to clone the latest stable release:

LATEST_TAG=$(
    curl -s "https://api.github.com/repos/gpustack/gpustack/releases" \
    | grep '"tag_name"' \
    | sed -E 's/.*"tag_name": "([^"]+)".*/\1/' \
    | grep -Ev 'rc|beta|alpha|preview' \
    | head -1
)
echo "Latest stable release: $LATEST_TAG"
git clone -b "$LATEST_TAG" https://github.com/gpustack/gpustack.git
cd gpustack/docker-compose

Before starting, set GPUSTACK_GRAFANA_URL to a browser-reachable Grafana URL (not a container-only hostname like grafana).

Start external Prometheus/Grafana (this disables the built-in stack):

sudo docker compose -f docker-compose.external-observability.yaml up -d

If you already have an external Prometheus/Grafana stack, you can configure it manually instead:

Configure Prometheus to scrape GPUStack metrics
Add targets for the GPUStack metrics endpoint (default :10161) and worker discovery endpoint. Example:

scrape_configs:
 - job_name: gpustack-worker-discovery
   scrape_interval: 5s
   http_sd_configs:
     - url: "http://<gpustack_server_host>:10161/metrics/targets"
       refresh_interval: 1m
 - job_name: gpustack-server
   scrape_interval: 10s
   static_configs:
     - targets: ["<gpustack_server_host>:10161"]

Import GPUStack dashboards into Grafana
Use the dashboards provided in the docker-compose/grafana/grafana_dashboards/ directory as a starting point.

Point GPUStack to your Grafana
Set GPUSTACK_GRAFANA_URL to the Grafana URL that users' browsers can reach, so that dashboard redirects work.
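Because the worker discovery endpoint is consumed through http_sd_configs, it presumably returns the standard Prometheus HTTP service discovery format: a JSON list of target groups, each with a "targets" list and optional "labels". The following minimal Python sketch flattens such a document so you can check which workers Prometheus would scrape. The sample payload, its address, and the worker_name label are hypothetical and not taken from GPUStack.

```python
import json


def list_sd_targets(payload: str) -> list[tuple[str, dict[str, str]]]:
    """Flatten a Prometheus HTTP SD document into (target, labels) pairs."""
    groups = json.loads(payload)
    if not isinstance(groups, list):
        raise ValueError("HTTP SD payload must be a JSON list of target groups")
    pairs = []
    for group in groups:
        labels = group.get("labels", {})  # labels are optional per target group
        for target in group["targets"]:
            pairs.append((target, dict(labels)))
    return pairs


# Hypothetical payload shaped like an HTTP SD response (not real GPUStack output).
sample = '[{"targets": ["10.0.0.5:10151"], "labels": {"worker_name": "worker-1"}}]'

for target, labels in list_sd_targets(sample):
    print(target, labels)
```

To inspect a live server, pass the body returned by http://<gpustack_server_host>:10161/metrics/targets to list_sd_targets instead of the sample string.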

Accessing Metrics

GPUStack Metrics Endpoint:
Access metrics at http://<gpustack_server_host>:10161/metrics
GPUStack Worker Metrics Targets:
Access the worker scrape targets, which Prometheus consumes via http_sd_configs, at http://<gpustack_server_host>:10161/metrics/targets
Prometheus UI:
Access Prometheus at http://<host>:19090 by default, or the port configured by --builtin-prometheus-port / GPUSTACK_BUILTIN_PROMETHEUS_PORT.
Grafana UI:
Access Grafana at http://<host>:13000 by default, or the port configured by --builtin-grafana-port / GPUSTACK_BUILTIN_GRAFANA_PORT. Built-in Grafana is configured for anonymous Viewer access with the login form disabled. The admin credentials remain admin / grafana by default.
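For quick checks without Prometheus, the metrics endpoint can be read directly. The sketch below assumes the endpoint serves the usual Prometheus text exposition format. It is a deliberately small parser: it does not unescape label values and ignores timestamps. The label names in the sample (model_name, worker_name) are illustrative assumptions.

```python
import re
import urllib.request

# `name{labels} value` or `name value`; metric names may contain colons.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return (name, labels, value) tuples from text exposition format."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simple parser does not understand
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(url):
    """Download the raw metrics text, e.g. from http://<host>:10161/metrics."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8")


# Hypothetical scrape output used for illustration only.
sample_text = """\
# TYPE gpustack:num_requests_running gauge
gpustack:num_requests_running{model_name="demo-model",worker_name="worker-1"} 3
gpustack:num_requests_waiting{model_name="demo-model",worker_name="worker-1"} 0
"""

for name, labels, value in parse_samples(sample_text):
    print(name, labels, value)
```

For production use, a maintained parser such as the one in the prometheus_client package is a safer choice than hand-written regular expressions.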

Migration from Older Compose Setups

If you previously used Docker Compose to run Prometheus/Grafana alongside GPUStack:

Keep external observability (recommended for continuity):
Leave your existing Prometheus/Grafana containers running. Update Prometheus scrape targets to the new GPUStack metrics endpoint and set GPUSTACK_GRAFANA_URL to your existing Grafana.
Switch to built-in observability:
Stop the old Prometheus/Grafana containers, then use the latest docker-compose.server.yaml (GPUStack only). Built-in Grafana/Prometheus will take over. Historical metrics from the old Prometheus are not migrated; to keep them queryable, leave the old stack running read-only.

Customizing Metrics Mapping

GPUStack supports dynamic customization of the metrics mapping through its configuration API. This lets you change how runtime engine metrics are mapped to GPUStack metrics without restarting the service. The configuration is managed centrally on the server and can be read or modified via the HTTP API.

API Endpoints

Get Current Metrics Config
- GET http://<gpustack_server_host>:<gpustack_server_port>/v2/metrics/config
- Returns the current metrics mapping configuration in JSON format.
Update Metrics Config
- POST http://<gpustack_server_host>:<gpustack_server_port>/v2/metrics/config
- Accepts a JSON payload to update the metrics mapping configuration. Changes take effect immediately for all workers.
Get Default Metrics Config
- GET http://<gpustack_server_host>:<gpustack_server_port>/v2/metrics/default-config
- Returns the default metrics mapping configuration in JSON format, useful for reference or resetting.

Example Usage

Get current config:

curl http://<gpustack_server_host>:<gpustack_server_port>/v2/metrics/config

Update config:

curl -X POST http://<gpustack_server_host>:<gpustack_server_port>/v2/metrics/config \
     -H "Content-Type: application/json" \
     -d @custom_metrics_config.json

(where custom_metrics_config.json is your new config file)

Get default config:

curl http://<gpustack_server_host>:<gpustack_server_port>/v2/metrics/default-config

Note: The configuration should be provided in valid JSON format.
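The same calls can be scripted. The following Python sketch builds the GET and POST requests for /v2/metrics/config with the standard library. The helper names are hypothetical, any authentication your deployment requires is omitted, and the payload in the example is a placeholder rather than a real mapping schema. In practice, start from the document returned by the default-config endpoint.

```python
import json
import urllib.request


def metrics_config_request(base_url, payload=None):
    """Build a GET (payload is None) or POST request for /v2/metrics/config."""
    url = base_url.rstrip("/") + "/v2/metrics/config"
    if payload is None:
        return urllib.request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")  # json.dumps always emits valid JSON
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def send(request):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder payload; the real schema is whatever GET /v2/metrics/config returns.
update = metrics_config_request("http://localhost:8080/", {"placeholder": True})
print(update.get_method(), update.full_url)
```

Serializing with json.dumps instead of hand-editing a file avoids the most common failure mode here, which is posting a document that is not valid JSON.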

Metrics Exposed by GPUStack

The following metrics are exposed by GPUStack and can be scraped by Prometheus. Each metric includes hierarchical labels for cluster, worker, model, and instance identification.

LLM Serving Runtime Metrics

Metric Name	Type	Description
gpustack:num_requests_running	Gauge	Number of requests currently being processed.
gpustack:num_requests_waiting	Gauge	Number of requests waiting in the queue.
gpustack:num_requests_swapped	Gauge	Number of requests swapped out to CPU.
gpustack:prefix_cache_hit_rate	Gauge	Prefix cache hit rate.
gpustack:kv_cache_usage_ratio	Gauge	KV-cache usage ratio. 1.0 means fully used.
gpustack:prefix_cache_queries	Counter	Number of prefix cache queries (total tokens).
gpustack:prefix_cache_hits	Counter	Number of prefix cache hits (total tokens).
gpustack:prompt_tokens	Counter	Total number of prefill tokens processed.
gpustack:generation_tokens	Counter	Total number of generated tokens.
gpustack:request_prompt_tokens	Histogram	Number of prefill tokens processed per request.
gpustack:request_generation_tokens	Histogram	Number of generation tokens processed per request.
gpustack:time_to_first_token_seconds	Histogram	Time to generate the first token.
gpustack:inter_token_latency_seconds	Histogram	Time to generate the next token after the previous token has been produced.
gpustack:time_per_output_token_seconds	Histogram	Time per generated token.
gpustack:e2e_request_latency_seconds	Histogram	End-to-end request latency.
gpustack:request_success	Counter	Total number of successful requests.

These metrics are mapped from various runtime engines (vLLM, SGLang, MindIE) as defined in metrics_config.yaml.
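Counters such as gpustack:prefix_cache_queries and gpustack:prefix_cache_hits are cumulative, so a rate over a time window comes from the difference between two samples rather than from the raw values. As a minimal sketch, assuming you have read both counters at the start and end of a window, the hit rate for that window can be derived like this (the function name and the numbers are illustrative):

```python
def prefix_cache_hit_rate(queries_start, hits_start, queries_end, hits_end):
    """Hit rate over a window, from two samples of the cumulative counters."""
    delta_queries = queries_end - queries_start
    delta_hits = hits_end - hits_start
    if delta_queries < 0 or delta_hits < 0:
        # Counters only go down when the process restarts.
        raise ValueError("counter reset detected; discard this window")
    if delta_queries == 0:
        return None  # no prefix-cache lookups happened in the window
    return delta_hits / delta_queries


# Made-up samples: 2000 tokens queried in the window, 1500 of them cache hits.
rate = prefix_cache_hit_rate(1000, 400, 3000, 1900)
print(rate)
```

Returning None for an idle window keeps "no traffic" distinguishable from "zero hits", which matters when the result feeds an alert.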

Worker Metrics

Metric Name	Type	Description
gpustack:worker_status	Gauge	Worker status (with state label).
gpustack:worker_node_os	Info	Operating system information of the worker node.
gpustack:worker_node_kernel	Info	Kernel information of the worker node.
gpustack:worker_node_uptime_seconds	Gauge	Uptime in seconds of the worker node.
gpustack:worker_node_cpu_cores	Gauge	Total CPU cores of the worker node.
gpustack:worker_node_cpu_utilization_rate	Gauge	CPU utilization rate of the worker node.
gpustack:worker_node_memory_total_bytes	Gauge	Total memory in bytes of the worker node.
gpustack:worker_node_memory_used_bytes	Gauge	Memory used in bytes of the worker node.
gpustack:worker_node_memory_utilization_rate	Gauge	Memory utilization rate of the worker node.
gpustack:worker_node_gpu	Info	GPU information of the worker node.
gpustack:worker_node_gpu_cores	Gauge	Total GPU cores of the worker node.
gpustack:worker_node_gpu_utilization_rate	Gauge	GPU utilization rate of the worker node.
gpustack:worker_node_gpu_temperature_celsius	Gauge	GPU temperature in Celsius.
gpustack:worker_node_gram_total_bytes	Gauge	Total GPU RAM in bytes.
gpustack:worker_node_gram_allocated_bytes	Gauge	Allocated GPU RAM in bytes.
gpustack:worker_node_gram_used_bytes	Gauge	Used GPU RAM in bytes.
gpustack:worker_node_gram_utilization_rate	Gauge	GPU RAM utilization rate.
gpustack:worker_node_filesystem_total_bytes	Gauge	Total filesystem size in bytes.
gpustack:worker_node_filesystem_used_bytes	Gauge	Used filesystem size in bytes.
gpustack:worker_node_filesystem_utilization_rate	Gauge	Filesystem utilization rate.

Server Metrics

Metric Name	Type	Description
gpustack:cluster	Info	Cluster information (ID, name, provider).
gpustack:cluster_status	Gauge	Cluster status (with state label).
gpustack:model	Info	Model information (ID, name, runtime, source).
gpustack:model_desired_instances	Gauge	Desired number of model instances.
gpustack:model_running_instances	Gauge	Number of running model instances.
gpustack:model_instance_status	Gauge	Status of each model instance (with state label).
gpustack:model_instance_restart_count	Gauge	Model instance restart count.
gpustack:model_instance_latest_restart_time	Gauge	Model instance latest restart time as Unix timestamp seconds.
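One common use of the server metrics is comparing gpustack:model_desired_instances with gpustack:model_running_instances to spot models that are not fully deployed. A small sketch of that comparison follows, assuming the two gauges have already been collected into dictionaries keyed by model name. The model names and values are invented for illustration.

```python
def under_replicated(desired, running):
    """Return {model: missing_instances} for models below their desired count."""
    return {
        model: want - running.get(model, 0)
        for model, want in desired.items()
        if running.get(model, 0) < want
    }


# Invented gauge values keyed by model name.
desired = {"demo-a": 2, "demo-b": 1, "demo-c": 3}
running = {"demo-a": 1, "demo-b": 1}

print(under_replicated(desired, running))
```

A model missing from the running gauge is treated as having zero running instances, so it is reported with its full desired count.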

Event Bus Metrics

Metric Name	Type	Description
gpustack:bus_subscribers	Gauge	Active bus subscribers per topic.
gpustack:bus_queue_depth	Gauge	Per-subscriber queue depth at scrape time.
gpustack:bus_queue_capacity	Gauge	Per-subscriber queue maxsize (see env knob below).
gpustack:bus_queue_full	Gauge	1 if the queue is full at scrape time, 0 otherwise.
gpustack:bus_queue_saturation_ratio	Gauge	`qsize / maxsize` in `[0, 1]`. Sustained > 0.8 ⇒ slow consumer.
gpustack:bus_subscriber_latest_keys	Gauge	Ids pending coalesced UPDATED delivery (size of `latest_by_key`).
gpustack:bus_events_total	Counter	Cumulative event counts. Extra labels: `kind`, `event_type`.

Note: All metrics are labeled with relevant identifiers (cluster, worker, model, instance, user) for fine-grained monitoring and filtering.
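The slow-consumer rule for gpustack:bus_queue_saturation_ratio (qsize / maxsize sustained above 0.8) can be sketched as a check over consecutive scrapes. Here "sustained" is interpreted as the last min_samples scrapes all exceeding the threshold. That interpretation, the default of three samples, and the function name are assumptions, not GPUStack behavior.

```python
def is_slow_consumer(depths, capacity, threshold=0.8, min_samples=3):
    """True if the last `min_samples` scrapes all exceed `threshold` saturation."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    recent = depths[-min_samples:]
    if len(recent) < min_samples:
        return False  # not enough history to call the saturation sustained
    return all(depth / capacity > threshold for depth in recent)


# Queue depths from successive scrapes of one subscriber (made-up values).
print(is_slow_consumer([10, 90, 95, 85], capacity=100))
print(is_slow_consumer([90, 95, 50], capacity=100))
```

Requiring several consecutive samples avoids flagging a subscriber for a single short burst that it drains before the next scrape.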
