vLLM Profiling, Tracing, and Monitoring
See the Quick Start guide for instructions on installing and configuring Graphsignal.
For GPU profiling with vLLM on Linux, install the CUPTI extra matching your CUDA version: pip install graphsignal[cu12] (CUDA 12.x) or pip install graphsignal[cu13] (CUDA 13.x).
Graphsignal automatically instruments and profiles vLLM.
What’s captured
- Profiling: vLLM engine, scheduler, KV cache, attention, and output-processing hot paths.
- Tracing: spans such as `vllm.llm.generate` and `vllm.asyncllm.generate`, including tags like `vllm.model.name`/`vllm.request.id` and usage/latency counters (prompt/completion tokens, time-to-first-token, end-to-end latency).
- Metrics: vLLM Prometheus metrics (when available in your vLLM build).
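The usage/latency counters listed above are straightforward to reason about. A small sketch of the arithmetic behind time-to-first-token and end-to-end latency (this is not Graphsignal's implementation; the function name and timestamps are made up for illustration):

```python
def token_latency_counters(request_start, token_times):
    """Illustrative only: derive latency counters from raw timestamps.

    request_start: time the request was submitted (seconds)
    token_times: monotonically increasing completion-token arrival times
    """
    ttft = token_times[0] - request_start   # time-to-first-token
    e2e = token_times[-1] - request_start   # end-to-end latency
    return {
        "time_to_first_token": ttft,
        "end_to_end": e2e,
        "completion_tokens": len(token_times),
    }

# Synthetic example: request at t=10.0s, four tokens arriving shortly after
counters = token_latency_counters(10.0, [10.2, 10.25, 10.3, 10.35])
```

In a traced vLLM request, these values appear as counters on the generate span alongside prompt/completion token counts.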
Integration into a Python application that runs vLLM
Call graphsignal.configure(...) in your app and run vLLM normally.
```python
import graphsignal

graphsignal.configure(api_key='my-api-key')
# or pass the API key via the GRAPHSIGNAL_API_KEY environment variable
```
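With Graphsignal configured, vLLM usage needs no further changes. A minimal offline-inference sketch, assuming a GPU-enabled vLLM install (the model name and prompt are placeholders):

```python
import graphsignal
from vllm import LLM, SamplingParams

graphsignal.configure(api_key='my-api-key')

# Any vLLM workload below this point is instrumented automatically.
llm = LLM(model="Qwen/Qwen1.5-7B-Chat")  # placeholder model
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Explain KV caching in one sentence."], params)
print(outputs[0].outputs[0].text)
```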
Run vLLM server with Graphsignal runner
Set your API key, then start vllm serve via graphsignal-run:
```shell
export GRAPHSIGNAL_API_KEY="..."
graphsignal-run vllm serve Qwen/Qwen1.5-7B-Chat --port 8000
```
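Once the server is up, one way to exercise the traced endpoint is an OpenAI-compatible chat request; this sketch assumes the server started above is listening locally on port 8000:

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen1.5-7B-Chat",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```

The resulting generate span, tags, and token counters should then appear in your Graphsignal dashboard.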
Add Graphsignal to vLLM Docker image
vLLM images may not include the CUPTI library. Install the matching Graphsignal CUPTI extra (e.g. graphsignal[cu12] for CUDA 12.x).
Here is an example of a modified docker run command that installs Graphsignal with CUPTI support and runs vLLM with it:
```shell
docker run --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -e GRAPHSIGNAL_API_KEY=YOUR_API_KEY \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --entrypoint bash \
  vllm/vllm-openai:latest \
  -lc 'pip install --no-cache-dir graphsignal[cu12] \
    && exec graphsignal-run vllm serve \
    --model Qwen/Qwen2-VL-7B-Instruct \
    --trust-remote-code'
```