Skip to content

vLLM Profiling, Tracing, and Monitoring

See the Quick Start guide on how to install Graphsignal.

graphsignal-run recognizes vllm serve and configures vLLM’s Prometheus endpoint and CUPTI injection for you. OpenTelemetry tracing is opt-in: add the --enable-otel flag to capture vLLM’s built-in request traces. The profiler ingests vLLM’s built-in OTEL traces and /metrics endpoint and captures CUDA kernel activity externally via CUPTI.

  • GPU profiling: per-kernel CUDA timelines plus higher-level aggregations by activity type (attention, matrix multiply, communication, KV cache, quantization, normalization, activation, …). See CUDA Profiling and Monitoring for details on what’s captured GPU-side.
  • Tracing (opt-in with --enable-otel): vLLM’s built-in OpenTelemetry spans for LLM generation requests, ingested through a local OTLP/gRPC collector that the launcher spawns. When --enable-otel is set, the launcher points --otlp-traces-endpoint at the local collector. Model name, prompt/completion token counts, time-to-first-token, and end-to-end latency arrive as span attributes. Requires OpenTelemetry installed in vLLM’s environment (see below).
  • Engine metrics: vLLM’s /metrics endpoint is auto-discovered and scraped (the launcher strips --disable-log-stats if present so metrics stay enabled).
  • System metrics: CPU, host memory, and GPU metrics — collected by the profiler sidecar regardless of engine.
Section titled “Run vllm serve with graphsignal-run (recommended)”

Set your API key, then start vllm serve via graphsignal-run. The launcher wires up CUPTI profiling and metrics for you.

Terminal window
export GRAPHSIGNAL_API_KEY="..."
graphsignal-run vllm serve Qwen/Qwen1.5-7B-Chat --port 8000

OpenTelemetry tracing is off by default. Add --enable-otel (before the command) to capture vLLM’s request traces — the launcher then points --otlp-traces-endpoint at a local OTLP/gRPC collector it starts:

Terminal window
graphsignal-run --enable-otel vllm serve Qwen/Qwen1.5-7B-Chat --port 8000

Tracing requires OpenTelemetry installed in vLLM’s environment (Graphsignal can’t provide it — it may be installed separately, e.g. via uv tool):

Terminal window
pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc

If you pass your own --otlp-traces-endpoint alongside --enable-otel, the launcher honors it and does not start a local collector — your traces stay in your existing OTEL pipeline. CUDA/NVML/Prometheus profiling works regardless of --enable-otel.

vLLM replay paths often use CUDA graphs. By default Graphsignal traces each graph as one aggregated cuda.graph event (lower overhead). To see individual kernels and memory operations inside graphs, use --cuda-graph-trace node:

Terminal window
graphsignal-run --cuda-graph-trace node vllm serve Qwen/Qwen1.5-7B-Chat --port 8000
graphsignal.watch(cuda_graph_trace='node')

See CUDA Profiling and Monitoring for details on the two modes.

Run vLLM from Python with graphsignal.watch()

Section titled “Run vLLM from Python with graphsignal.watch()”

If you bootstrap vLLM from your own Python entry point, call graphsignal.watch() once before importing vllm / touching CUDA:

import graphsignal
graphsignal.watch()
import vllm
# ... run vLLM normally ...

vLLM images may not include the CUPTI library. Install the matching Graphsignal CUPTI extra (e.g. graphsignal[cu12] for CUDA 12.x).

Example: a modified docker run that installs Graphsignal with CUPTI support and runs vLLM with it:

Terminal window
docker run --gpus all \
-p 8000:8000 \
--ipc=host \
-e GRAPHSIGNAL_API_KEY=YOUR_API_KEY \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--entrypoint bash \
vllm/vllm-openai:latest \
-lc 'pip install --no-cache-dir graphsignal[cu12] \
&& exec graphsignal-run vllm serve \
--model Qwen/Qwen2-VL-7B-Instruct \
--trust-remote-code'