Production-grade profiling and monitoring for vLLM: always-on vLLM, PyTorch, and CUDA profiling, with tracing, metrics, and errors in one place.
Running vLLM in production means dealing with extreme throughput, dense GPU activity, and millisecond-scale behavior that generic observability tools were never built to capture. This post explains why traditional observability falls short for inference, what an all-in-one vLLM observability stack looks like, and how to integrate it.
Inference systems operate on a fundamentally different time scale than typical backend services. Token generation, attention kernels, memory transfers, CUDA launches, and synchronization points execute thousands of times per second. At this granularity, second-level metrics don’t just lose detail - they become incapable of representing what is actually happening inside the system.
A one-second metric bucket can contain tens of thousands of kernel executions and internal function calls. Short stalls, scheduling jitter, memory contention, or synchronization bubbles are averaged away before they ever reach a dashboard. Even when these issues materially reduce throughput or increase tail latency, they are often intermittent and workload-dependent, making them invisible to SLIs and alert thresholds designed for slower systems.
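To make the averaging problem concrete, here is a toy calculation (the step latencies and stall duration are illustrative, not measurements): twenty thousand kernel-scale steps in one bucket, with a single 5 ms stall among them.

```python
# Toy illustration of how one-second buckets hide millisecond stalls:
# 20,000 kernel-scale steps at ~50 µs each, one of which is a 5 ms stall.
latencies_us = [50] * 19_999 + [5_000]

avg_us = sum(latencies_us) / len(latencies_us)
print(round(avg_us, 2))   # ~50.25 µs: the stall barely moves the bucket average
print(max(latencies_us))  # 5000 µs: yet it dominates tail latency for that step
```

The stall is a 100x outlier for the request that hits it, but shifts the per-second average by well under one percent, which is exactly why threshold-based alerts on bucketed metrics miss it.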
Graphsignal provides always-on, production-grade observability for vLLM that covers the full stack in one place.
Profiling runs continuously at inference-time scale, covering vLLM, PyTorch, and CUDA activity.
This gives you a single profiling surface from high-level vLLM execution down to kernel and hardware behavior, without toggling tools or running separate profiling sessions.

Tracing captures inference requests as spans (e.g. vllm.llm.generate, vllm.asyncllm.generate) with model and request metadata and usage/latency counters: prompt/completion tokens, time-to-first-token, and end-to-end latency.
Metrics include vLLM Prometheus metrics when available in your build, alongside inference-specific and GPU metrics, so you can alert and trend on throughput, latency, and utilization.
Errors include exceptions, GPU XID errors, and similar failures, so you can detect and debug issues quickly.
Together, this gives you always-on, production-grade vLLM, PyTorch, and CUDA profiling plus tracing, metrics, and errors in one observability stack - the kind of visibility you need to optimize and debug vLLM at scale.
Getting this stack running is straightforward. See the vLLM integration guide for full details; here’s the essence.
Python app that runs vLLM: Call graphsignal.configure(...) (or set GRAPHSIGNAL_API_KEY) and run vLLM as usual. Auto-instrumentation handles the rest.
vLLM server: For GPU profiling on Linux, install the CUPTI extra for your CUDA version (pip install graphsignal[cu12] or graphsignal[cu13]). Then start the server with Graphsignal’s runner:
export GRAPHSIGNAL_API_KEY="..."
graphsignal-run vllm serve Qwen/Qwen1.5-7B-Chat --port 8000
Docker: vLLM images may not include CUPTI. Install the matching Graphsignal CUPTI extra in the container and use graphsignal-run as the entrypoint; the vLLM integration doc includes an example docker run that does this.
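A hypothetical shape of such a command, for orientation only; the image tag, CUDA extra, and flags are assumptions, and the integration doc's version is canonical:

```shell
# Sketch: install the matching CUPTI extra at startup and wrap the server
# with graphsignal-run. Image name, extra (cu12), and model are assumptions.
docker run --gpus all -p 8000:8000 \
  -e GRAPHSIGNAL_API_KEY="..." \
  --entrypoint /bin/sh \
  vllm/vllm-openai:latest \
  -c "pip install 'graphsignal[cu12]' && \
      graphsignal-run vllm serve Qwen/Qwen1.5-7B-Chat --port 8000"
```

Installing at container start keeps the base image unchanged; baking the extra into a derived image is the cleaner option for repeated deployments.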
Once integrated, you get vLLM profiling, tracing, metrics, and errors in one place - ready for production inference. See the Quick Start to enable observability across your inference stack, and @GraphsignalAI for updates.