vLLM Profiling, Tracing, and Monitoring
See the Quick Start guide on how to install Graphsignal.
graphsignal-run recognizes vllm serve and configures vLLM’s Prometheus endpoint and CUPTI injection for you. OpenTelemetry tracing is opt-in: add the --enable-otel flag to capture vLLM’s built-in request traces. The profiler ingests vLLM’s built-in OTEL traces and /metrics endpoint and captures CUDA kernel activity externally via CUPTI.
What’s captured
Section titled “What’s captured”- GPU profiling: per-kernel CUDA timelines plus higher-level aggregations by activity type (attention, matrix multiply, communication, KV cache, quantization, normalization, activation, …). See CUDA Profiling and Monitoring for details on what’s captured GPU-side.
- Tracing (opt-in with
--enable-otel): vLLM’s built-in OpenTelemetry spans for LLM generation requests, ingested through a local OTLP/gRPC collector that the launcher spawns. When--enable-otelis set, the launcher points--otlp-traces-endpointat the local collector. Model name, prompt/completion token counts, time-to-first-token, and end-to-end latency arrive as span attributes. Requires OpenTelemetry installed in vLLM’s environment (see below). - Engine metrics: vLLM’s
/metricsendpoint is auto-discovered and scraped (the launcher strips--disable-log-statsif present so metrics stay enabled). - System metrics: CPU, host memory, and GPU metrics — collected by the profiler sidecar regardless of engine.
Run vllm serve with graphsignal-run (recommended)
Section titled “Run vllm serve with graphsignal-run (recommended)”Set your API key, then start vllm serve via graphsignal-run. The launcher wires up CUPTI profiling and metrics for you.
export GRAPHSIGNAL_API_KEY="..."
graphsignal-run vllm serve Qwen/Qwen1.5-7B-Chat --port 8000Enable tracing with --enable-otel
Section titled “Enable tracing with --enable-otel”OpenTelemetry tracing is off by default. Add --enable-otel (before the command) to capture vLLM’s request traces — the launcher then points --otlp-traces-endpoint at a local OTLP/gRPC collector it starts:
graphsignal-run --enable-otel vllm serve Qwen/Qwen1.5-7B-Chat --port 8000Tracing requires OpenTelemetry installed in vLLM’s environment (Graphsignal can’t provide it — it may be installed separately, e.g. via uv tool):
pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpcIf you pass your own --otlp-traces-endpoint alongside --enable-otel, the launcher honors it and does not start a local collector — your traces stay in your existing OTEL pipeline. CUDA/NVML/Prometheus profiling works regardless of --enable-otel.
CUDA graph tracing
Section titled “CUDA graph tracing”vLLM replay paths often use CUDA graphs. By default Graphsignal traces each graph as one aggregated cuda.graph event (lower overhead). To see individual kernels and memory operations inside graphs, use --cuda-graph-trace node:
graphsignal-run --cuda-graph-trace node vllm serve Qwen/Qwen1.5-7B-Chat --port 8000graphsignal.watch(cuda_graph_trace='node')See CUDA Profiling and Monitoring for details on the two modes.
Run vLLM from Python with graphsignal.watch()
Section titled “Run vLLM from Python with graphsignal.watch()”If you bootstrap vLLM from your own Python entry point, call graphsignal.watch() once before importing vllm / touching CUDA:
import graphsignal
graphsignal.watch()
import vllm# ... run vLLM normally ...Add Graphsignal to a vLLM Docker image
Section titled “Add Graphsignal to a vLLM Docker image”vLLM images may not include the CUPTI library. Install the matching Graphsignal CUPTI extra (e.g. graphsignal[cu12] for CUDA 12.x).
Example: a modified docker run that installs Graphsignal with CUPTI support and runs vLLM with it:
docker run --gpus all \ -p 8000:8000 \ --ipc=host \ -e GRAPHSIGNAL_API_KEY=YOUR_API_KEY \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --entrypoint bash \ vllm/vllm-openai:latest \ -lc 'pip install --no-cache-dir graphsignal[cu12] \ && exec graphsignal-run vllm serve \ --model Qwen/Qwen2-VL-7B-Instruct \ --trust-remote-code'