TensorRT-LLM Profiling and Monitoring
See the Quick Start guide on how to install Graphsignal.
graphsignal-run recognizes trtllm-serve, trtllm serve, and trtllm-llmapi-launch invocations and configures CUPTI injection for you. The profiler captures CUDA kernel activity externally via CUPTI and scrapes TensorRT-LLM’s Prometheus endpoint on the serving port (--port, default 8000).
What’s captured
- GPU profiling: per-kernel CUDA timelines plus higher-level aggregations by activity type (attention, matrix multiply, communication, KV cache, quantization, normalization, activation, …). See CUDA Profiling and Monitoring for details on what’s captured GPU-side.
- Engine metrics: request-level Prometheus metrics (trtllm_* counters, gauges, histograms) from /prometheus/metrics on the HTTP server. Graphsignal does not mutate trtllm-serve argv; see Enabling engine metrics below if metrics are missing.
- System metrics: CPU, host memory, and GPU metrics, collected by the profiler sidecar regardless of engine.
Run trtllm-serve with graphsignal-run (recommended)
Set your API key, then start trtllm-serve via graphsignal-run. The launcher wires up CUPTI profiling for you.
export GRAPHSIGNAL_API_KEY="..."
graphsignal-run trtllm-serve Qwen/Qwen1.5-7B-Chat --port 8000
For the PyTorch backend (common on newer GPUs):
graphsignal-run trtllm-serve Qwen/Qwen1.5-7B-Chat \
  --port 8000 \
  --backend pytorch
CUDA graph tracing
The TensorRT-LLM PyTorch backend replays CUDA graphs during decode. By default, Graphsignal traces each graph as one aggregated cuda.graph event (lower overhead). To see individual graph node activities inside those graphs, use --cuda-graph-trace node:
graphsignal-run --cuda-graph-trace node trtllm-serve Qwen/Qwen1.5-7B-Chat \
  --port 8000 \
  --backend pytorch
Or, when using graphsignal.watch() from Python:
graphsignal.watch(cuda_graph_trace='node')
See CUDA Profiling and Monitoring for details on the two modes.
Enabling engine metrics
TensorRT-LLM exposes two different HTTP metrics endpoints:
| Endpoint | Format | Purpose |
|---|---|---|
| /prometheus/metrics | Prometheus text (trtllm_*) | Request-level metrics scraped by Graphsignal |
| /metrics | JSON | Per-iteration stats (IterationStats); not used by Graphsignal |
The /prometheus/metrics route is not enabled by default. It is registered only when return_perf_metrics is set to true in your server configuration. If engine metrics are missing in Graphsignal, enable them in a YAML config and pass it to trtllm-serve:
return_perf_metrics: true

graphsignal-run trtllm-serve Qwen/Qwen1.5-7B-Chat \
  --port 8000 \
  --backend pytorch \
  --config trtllm-config.yaml
After the server has finished loading and served at least one request, verify the endpoint locally:
curl http://localhost:8000/prometheus/metrics | head
You should see # HELP trtllm_... lines. For the full list of exported metrics and an end-to-end example, see NVIDIA’s Prometheus Metrics guide and the trtllm-serve metrics documentation.
Notes:
- Match the scrape host to your --host flag (Graphsignal defaults to localhost, same as trtllm-serve).
- --grpc mode does not expose an HTTP /prometheus/metrics endpoint; engine Prometheus metrics are not available in that mode.
- On the PyTorch backend, iteration stats on /metrics are separate and may require enable_iter_perf_stats in config; that JSON endpoint is not scraped by Graphsignal.
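The curl check above can also be scripted, for example as a readiness probe before a load test. The following is a minimal sketch using only the Python standard library. The helper names (fetch_metrics_text, trtllm_metric_names) are hypothetical and not part of Graphsignal or TensorRT-LLM, and the sample text is made-up input in Prometheus text format, not real server output.

```python
import urllib.request


def fetch_metrics_text(host="localhost", port=8000,
                       path="/prometheus/metrics", timeout=5.0):
    """Fetch the raw Prometheus text exposition from a running server.

    host and port should match the --host and --port flags passed to trtllm-serve.
    """
    url = f"http://{host}:{port}{path}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def trtllm_metric_names(text, prefix="trtllm_"):
    """Return the sorted, de-duplicated metric names declared in '# HELP' lines."""
    names = set()
    for line in text.splitlines():
        parts = line.split()
        # A HELP line has the shape: "# HELP <metric_name> <description...>"
        if (len(parts) >= 3 and parts[0] == "#" and parts[1] == "HELP"
                and parts[2].startswith(prefix)):
            names.add(parts[2])
    return sorted(names)


# Illustrative input only: the metric names are placeholders,
# not names that TensorRT-LLM is known to export.
SAMPLE = """\
# HELP trtllm_example_total Placeholder counter
# TYPE trtllm_example_total counter
trtllm_example_total 3.0
# HELP python_gc_objects_collected_total Objects collected during gc
"""

print(trtllm_metric_names(SAMPLE))
```

Against a live server, trtllm_metric_names(fetch_metrics_text()) returning an empty list (or the request failing with an HTTP error) suggests that return_perf_metrics is not enabled or that the server is running in --grpc mode.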
Run TensorRT-LLM from Python with graphsignal.watch()
If you bootstrap TensorRT-LLM from your own Python entry point, call graphsignal.watch() once before importing tensorrt_llm / touching CUDA:
import graphsignal
graphsignal.watch(metrics_port=8000, metrics_path='/prometheus/metrics', metrics_host='localhost')
import tensorrt_llm
# ... run TensorRT-LLM normally ...
Add Graphsignal to a TensorRT-LLM Docker image
TensorRT-LLM NGC release images may not include Graphsignal (or CUPTI). Install the matching Graphsignal CUPTI extra (e.g. graphsignal[cu12] for CUDA 12.x) at container startup and run the server through graphsignal-run.
docker run --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -e GRAPHSIGNAL_API_KEY=YOUR_API_KEY \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --entrypoint bash \
  nvcr.io/nvidia/tensorrt-llm/release:latest \
  -lc 'pip install --no-cache-dir graphsignal[cu12] \
    && exec graphsignal-run trtllm-serve \
      Qwen/Qwen1.5-7B-Chat \
      --port 8000 \
      --backend pytorch \
      --config trtllm-config.yaml'
Ensure trtllm-config.yaml is available inside the container (the command above does not mount it, so add a bind mount or bake it into the image) and that it sets return_perf_metrics: true (see Enabling engine metrics).
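Because a missing return_perf_metrics setting only shows up later as absent engine metrics, it can help to check the config before starting the container. The sketch below is one way to do that, assuming the key sits at the top level of the YAML file as in the example earlier on this page. The function name has_perf_metrics_enabled is hypothetical, and the check is a deliberately simple line scan rather than a real YAML parser.

```python
def has_perf_metrics_enabled(config_text):
    """Return True if a top-level 'return_perf_metrics: true' line is present.

    Simplistic on purpose: it ignores nested keys, anchors, and other YAML
    features, and only looks at unindented 'key: value' lines.
    """
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop comments and trailing spaces
        if not line or line[0].isspace():
            continue  # skip blank lines and indented (nested) keys
        key, sep, value = line.partition(":")
        if sep and key.strip() == "return_perf_metrics":
            return value.strip().lower() == "true"
    return False


print(has_perf_metrics_enabled("return_perf_metrics: true\n"))
print(has_perf_metrics_enabled("# return_perf_metrics: true\n"))
```

To check a real file, pass its contents, for example has_perf_metrics_enabled(open("trtllm-config.yaml").read()).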