vLLM Profiling, Tracing, and Monitoring

See the Quick Start guide on how to install Graphsignal.

graphsignal-run recognizes vllm serve and configures vLLM’s Prometheus metrics and GPU profiling for you. OpenTelemetry tracing is opt-in: add the --enable-otel flag to capture vLLM’s built-in request traces. The profiler ingests vLLM’s built-in OTEL traces and /metrics endpoint and captures CUDA kernel activity via CUPTI.

What’s captured

GPU profiling: per-kernel CUDA timelines plus higher-level aggregations by activity type (attention, matrix multiply, communication, KV cache, quantization, normalization, activation, …). See CUDA Profiling and Monitoring for details on what’s captured GPU-side.
Tracing (opt-in with --enable-otel): vLLM’s built-in OpenTelemetry spans for LLM generation requests, ingested through a local OTLP/gRPC collector. When --enable-otel is set, Graphsignal captures these traces automatically. Model name, prompt/completion token counts, time-to-first-token, and end-to-end latency arrive as span attributes. Requires OpenTelemetry installed in vLLM’s environment (see below).
Engine metrics: vLLM’s /metrics endpoint is auto-discovered and scraped (Graphsignal keeps metrics enabled).
System metrics: CPU, host memory, and GPU metrics — collected by the profiler sidecar regardless of engine.

Run `vllm serve` with `graphsignal-run` (recommended)

Set your API key, then start vllm serve via graphsignal-run. Graphsignal sets up GPU profiling and metrics for you.

export GRAPHSIGNAL_API_KEY="..."

graphsignal-run vllm serve Qwen/Qwen1.5-7B-Chat --port 8000

Enable tracing with `--enable-otel`

OpenTelemetry tracing is off by default. Add --enable-otel (before the command) to capture vLLM’s request traces — Graphsignal then captures the engine’s traces via a local OTLP/gRPC collector:

graphsignal-run --enable-otel vllm serve Qwen/Qwen1.5-7B-Chat --port 8000

Tracing requires OpenTelemetry installed in vLLM’s environment (Graphsignal can’t provide it — it may be installed separately, e.g. via uv tool):

pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc

If you pass your own --otlp-traces-endpoint alongside --enable-otel, graphsignal-run honors it and does not start a local collector — your traces stay in your existing OTEL pipeline. CUDA/NVML/Prometheus profiling works regardless of --enable-otel.

CUDA graph tracing

vLLM replay paths often use CUDA graphs. By default Graphsignal traces each graph as one aggregated cuda.graph event (lower overhead). To see individual kernels and memory operations inside graphs, use --cuda-graph-trace node:

graphsignal-run --cuda-graph-trace node vllm serve Qwen/Qwen1.5-7B-Chat --port 8000

graphsignal.watch(cuda_graph_trace='node')

See CUDA Profiling and Monitoring for details on the two modes.

Run vLLM from Python with `graphsignal.watch()`

If you bootstrap vLLM from your own Python entry point, call graphsignal.watch() once before importing vllm / touching CUDA:

import graphsignal

graphsignal.watch()

import vllm
# ... run vLLM normally ...

Add Graphsignal to a vLLM Docker image

vLLM images may not include the CUPTI library. Install the matching Graphsignal CUPTI extra (e.g. graphsignal[cu12] for CUDA 12.x).

Example: a modified docker run that installs Graphsignal with CUPTI support and runs vLLM with it:

docker run --gpus all \
    -p 8000:8000 \
    --ipc=host \
    -e GRAPHSIGNAL_API_KEY=YOUR_API_KEY \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --entrypoint bash \
    vllm/vllm-openai:latest \
    -lc 'pip install --no-cache-dir graphsignal[cu12] \
            && exec graphsignal-run vllm serve \
                --model Qwen/Qwen2-VL-7B-Instruct \
                --trust-remote-code'