vLLM Profiling, Tracing, and Monitoring

See the Quick Start guide for instructions on installing and configuring Graphsignal.

For GPU profiling with vLLM on Linux, install the CUPTI extra that matches your CUDA version: pip install graphsignal[cu12] for CUDA 12.x, or pip install graphsignal[cu13] for CUDA 13.x.

Graphsignal automatically instruments and profiles vLLM.

What’s captured

  • Profiling: vLLM engine, scheduler, KV cache, attention, and output-processing hot paths.
  • Tracing: spans such as vllm.llm.generate and vllm.asyncllm.generate, with tags like vllm.model.name and vllm.request.id, plus usage and latency counters (prompt/completion tokens, time-to-first-token, end-to-end latency).
  • Metrics: vLLM Prometheus metrics (when available in your vLLM build).

Integrate into a Python application that runs vLLM

Call graphsignal.configure(...) in your app and run vLLM normally.

    import graphsignal

    graphsignal.configure(api_key='my-api-key')
    # or pass the API key via the GRAPHSIGNAL_API_KEY environment variable
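Once configured, no further instrumentation is needed. A minimal sketch of an offline vLLM app with Graphsignal enabled (the model name is an example; running it requires a GPU and a GRAPHSIGNAL_API_KEY):

```python
# Sketch: offline vLLM generation with Graphsignal configured.
# The model name is an example; any vLLM-supported model works.
import os

PROMPTS = [
    "What is the capital of France?",
    "Explain KV caching in one sentence.",
]

def main():
    import graphsignal
    from vllm import LLM, SamplingParams

    # Reads the API key from the environment
    graphsignal.configure(api_key=os.environ["GRAPHSIGNAL_API_KEY"])

    llm = LLM(model="Qwen/Qwen1.5-7B-Chat")
    params = SamplingParams(temperature=0.7, max_tokens=128)

    # generate() is traced automatically (vllm.llm.generate span)
    for output in llm.generate(PROMPTS, params):
        print(output.outputs[0].text)

if __name__ == "__main__":
    main()
```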

Run the vLLM server with the Graphsignal runner

Set your API key, then start vllm serve via graphsignal-run:

    export GRAPHSIGNAL_API_KEY="..."

    graphsignal-run vllm serve Qwen/Qwen1.5-7B-Chat --port 8000
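Requests sent to the server are then traced and profiled. A sketch of a client hitting the server's OpenAI-compatible chat endpoint, assuming it is listening on port 8000 as started above (build_payload and chat are hypothetical helper names):

```python
# Sketch: call the traced vLLM server's OpenAI-compatible API (stdlib only).
import json
from urllib import request

def build_payload(prompt: str,
                  model: str = "Qwen/Qwen1.5-7B-Chat",
                  max_tokens: int = 64) -> dict:
    # OpenAI-style chat-completion request body
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    req = request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat("Say hello in one word."))
```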

Add Graphsignal to vLLM Docker image

vLLM images may not include the CUPTI library. Install the matching Graphsignal CUPTI extra (e.g. graphsignal[cu12] for CUDA 12.x).

The example below modifies a docker run command to install Graphsignal with CUPTI support before starting vLLM:

    docker run --gpus all \
        -p 8000:8000 \
        --ipc=host \
        -e GRAPHSIGNAL_API_KEY=YOUR_API_KEY \
        -v ~/.cache/huggingface:/root/.cache/huggingface \
        --entrypoint bash \
        vllm/vllm-openai:latest \
        -lc 'pip install --no-cache-dir graphsignal[cu12] \
            && exec graphsignal-run vllm serve Qwen/Qwen2-VL-7B-Instruct \
                --trust-remote-code'
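Instead of installing the package on every container start, you can bake Graphsignal into a custom image. A minimal Dockerfile sketch, assuming the latest vllm/vllm-openai image ships CUDA 12.x (adjust the extra otherwise):

```dockerfile
FROM vllm/vllm-openai:latest

# Match the extra to the image's CUDA version (cu12 is an assumption here)
RUN pip install --no-cache-dir graphsignal[cu12]

# Wrap the server in the Graphsignal runner; the model is passed at run time
ENTRYPOINT ["graphsignal-run", "vllm", "serve"]
```

Build it once with docker build, then pass the model name and any vllm serve flags as the container command at docker run time.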