TensorRT-LLM Profiling and Monitoring

See the Quick Start guide on how to install Graphsignal.

graphsignal-run recognizes trtllm-serve, trtllm serve, and trtllm-llmapi-launch invocations and configures CUPTI injection for you. The profiler captures CUDA kernel activity externally via CUPTI and scrapes TensorRT-LLM’s Prometheus endpoint on the serving port (--port, default 8000).

  • GPU profiling: per-kernel CUDA timelines plus higher-level aggregations by activity type (attention, matrix multiply, communication, KV cache, quantization, normalization, activation, …). See CUDA Profiling and Monitoring for details on what’s captured GPU-side.
  • Engine metrics: request-level Prometheus metrics (trtllm_* counters, gauges, histograms) from /prometheus/metrics on the HTTP server. Graphsignal does not modify the arguments you pass to trtllm-serve; see Enabling engine metrics below if metrics are missing.
  • System metrics: CPU, host memory, and GPU metrics — collected by the profiler sidecar regardless of engine.

Run trtllm-serve with graphsignal-run (recommended)

Set your API key, then start trtllm-serve via graphsignal-run. The launcher wires up CUPTI profiling for you.

export GRAPHSIGNAL_API_KEY="..."
graphsignal-run trtllm-serve Qwen/Qwen1.5-7B-Chat --port 8000

For the PyTorch backend (common on newer GPUs):

graphsignal-run trtllm-serve Qwen/Qwen1.5-7B-Chat \
--port 8000 \
--backend pytorch

The TensorRT-LLM PyTorch backend replays CUDA graphs during decode. By default Graphsignal traces each graph as one aggregated cuda.graph event (lower overhead). To see individual graph node activities inside those graphs, use --cuda-graph-trace node:

graphsignal-run --cuda-graph-trace node trtllm-serve Qwen/Qwen1.5-7B-Chat \
--port 8000 \
--backend pytorch

From Python, pass the same option to graphsignal.watch():

graphsignal.watch(cuda_graph_trace='node')

See CUDA Profiling and Monitoring for details on the two modes.

Enabling engine metrics

TensorRT-LLM exposes two different HTTP metrics endpoints:

Endpoint            | Format                     | Purpose
/prometheus/metrics | Prometheus text (trtllm_*) | Request-level metrics scraped by Graphsignal
/metrics            | JSON                       | Per-iteration stats (IterationStats); not used by Graphsignal

The /prometheus/metrics route is not enabled by default. It is registered only when return_perf_metrics is set to true in your server configuration. If engine metrics are missing in Graphsignal, enable them in a YAML config and pass it to trtllm-serve:

trtllm-config.yaml:
return_perf_metrics: true

graphsignal-run trtllm-serve Qwen/Qwen1.5-7B-Chat \
--port 8000 \
--backend pytorch \
--config trtllm-config.yaml
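
A quick preflight check can catch a missing setting before the server starts. The following is a minimal sketch and not part of Graphsignal or TensorRT-LLM: the function names perf_metrics_enabled and check_config_file are hypothetical, and the parser is deliberately simplistic, handling only unindented key: value lines such as the one in the config shown here.

```python
from pathlib import Path


def perf_metrics_enabled(config_text: str) -> bool:
    """Return True if a top-level 'return_perf_metrics: true' line is present.

    Hypothetical helper: it understands only unindented 'key: value' lines
    and strips trailing '#' comments. Use a real YAML parser for anything
    more complex than a flat config file.
    """
    for raw_line in config_text.splitlines():
        # Drop trailing comments and whitespace.
        line = raw_line.split("#", 1)[0].rstrip()
        # Skip blank lines, indented (nested) lines, and lines without a key.
        if not line or line[0].isspace() or ":" not in line:
            continue
        key, value = line.split(":", 1)
        if key.strip() == "return_perf_metrics":
            return value.strip().lower() == "true"
    return False


def check_config_file(path: str) -> bool:
    """Read a config file from disk and apply the check above."""
    return perf_metrics_enabled(Path(path).read_text(encoding="utf-8"))
```

For example, check_config_file("trtllm-config.yaml") returns False when the key is absent or set to false, which is the case where the /prometheus/metrics route would not be registered.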

After the server has finished loading and served at least one request, verify the endpoint locally:

curl http://localhost:8000/prometheus/metrics | head

You should see # HELP trtllm_... lines. For the full list of exported metrics and an end-to-end example, see NVIDIA’s Prometheus Metrics guide and the trtllm-serve metrics documentation.
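
The same check can be scripted. This is a sketch using only the Python standard library; the function names trtllm_metric_names and fetch_metrics are hypothetical, and the host and port defaults simply mirror the values used on this page. It lists the metric families announced by # HELP lines whose names start with trtllm_.

```python
import urllib.request


def trtllm_metric_names(prometheus_text: str) -> list[str]:
    """Return sorted, de-duplicated metric names that start with 'trtllm_'.

    Only '# HELP <name> <description>' lines are inspected.
    """
    names = set()
    for line in prometheus_text.splitlines():
        parts = line.split(maxsplit=3)
        if len(parts) >= 3 and parts[0] == "#" and parts[1] == "HELP":
            if parts[2].startswith("trtllm_"):
                names.add(parts[2])
    return sorted(names)


def fetch_metrics(host: str = "localhost", port: int = 8000,
                  timeout: float = 5.0) -> str:
    """Fetch the raw Prometheus text from the serving port."""
    url = f"http://{host}:{port}/prometheus/metrics"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling trtllm_metric_names(fetch_metrics()) against a running server returns an empty list if the route is up but no trtllm_* metrics have been published yet. A connection or HTTP error from fetch_metrics suggests the server is still loading or the route is not registered.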

Notes:

  • Match the scrape host to your --host flag (Graphsignal defaults to localhost, same as trtllm-serve).
  • --grpc mode does not expose an HTTP /prometheus/metrics endpoint; engine Prometheus metrics are not available in that mode.
  • On the PyTorch backend, iteration stats on /metrics are separate and may require enable_iter_perf_stats in config; that JSON endpoint is not scraped by Graphsignal.
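
To illustrate the first note, the scrape URL can be derived from the same flags that are passed to trtllm-serve. The helper below is a hypothetical sketch, not how Graphsignal resolves the endpoint internally. It assumes the defaults stated on this page (localhost and port 8000) and ignores every other argument.

```python
import argparse


def metrics_url_from_argv(argv: list[str]) -> str:
    """Build the /prometheus/metrics URL from --host and --port in argv.

    Hypothetical helper: unknown flags and positional arguments are ignored,
    and the defaults (localhost, 8000) are taken from this page.
    """
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    parser.add_argument("--host", default="localhost")
    parser.add_argument("--port", type=int, default=8000)
    known, _unknown = parser.parse_known_args(argv)
    return f"http://{known.host}:{known.port}/prometheus/metrics"
```

For instance, passing the argument list of a command that uses --port 9000 and no --host flag yields http://localhost:9000/prometheus/metrics.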

Run TensorRT-LLM from Python with graphsignal.watch()

If you bootstrap TensorRT-LLM from your own Python entry point, call graphsignal.watch() once before importing tensorrt_llm / touching CUDA:

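import graphsignal

# Call watch() once, before importing tensorrt_llm or touching CUDA.
# The metrics_* arguments tell Graphsignal where to scrape engine metrics.
graphsignal.watch(metrics_port=8000, metrics_path='/prometheus/metrics',
                  metrics_host='localhost')

import tensorrt_llm
# ... run TensorRT-LLM normally ...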

Add Graphsignal to a TensorRT-LLM Docker image

TensorRT-LLM NGC release images may not include Graphsignal (or CUPTI). Install the matching Graphsignal CUPTI extra (e.g. graphsignal[cu12] for CUDA 12.x) at container startup and run the server through graphsignal-run.

docker run --gpus all \
-p 8000:8000 \
--ipc=host \
-e GRAPHSIGNAL_API_KEY=YOUR_API_KEY \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--entrypoint bash \
nvcr.io/nvidia/tensorrt-llm/release:latest \
-lc 'pip install --no-cache-dir graphsignal[cu12] \
&& exec graphsignal-run trtllm-serve \
Qwen/Qwen1.5-7B-Chat \
--port 8000 \
--backend pytorch \
--config trtllm-config.yaml'

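Ensure trtllm-config.yaml sets return_perf_metrics: true (see Enabling engine metrics). The command above does not mount that file, so it must also be made available inside the container, for example with an additional -v bind mount.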