TensorRT-LLM Profiling and Monitoring

See the Quick Start guide for how to install Graphsignal.

graphsignal-run recognizes trtllm-serve, trtllm serve, and trtllm-llmapi-launch invocations and configures GPU profiling for you. The profiler captures CUDA kernel activity via CUPTI and scrapes TensorRT-LLM’s Prometheus endpoint on the serving port (--port, default 8000).

What’s captured

GPU profiling: per-kernel CUDA timelines plus higher-level aggregations by activity type (attention, matrix multiply, communication, KV cache, quantization, normalization, activation, …). See CUDA Profiling and Monitoring for details on what’s captured GPU-side.
Engine metrics: request-level Prometheus metrics (trtllm_* counters, gauges, histograms) from /prometheus/metrics on the HTTP server. Graphsignal does not modify your trtllm-serve command; see Enabling engine metrics below if metrics are missing.
System metrics: CPU, host memory, and GPU metrics — collected by the profiler sidecar regardless of engine.

Run `trtllm-serve` with `graphsignal-run` (recommended)

Set your API key, then start trtllm-serve via graphsignal-run. Graphsignal sets up GPU profiling for you.

export GRAPHSIGNAL_API_KEY="..."

graphsignal-run trtllm-serve Qwen/Qwen1.5-7B-Chat --port 8000

For the PyTorch backend (common on newer GPUs):

graphsignal-run trtllm-serve Qwen/Qwen1.5-7B-Chat \
  --port 8000 \
  --backend pytorch

CUDA graph tracing

The TensorRT-LLM PyTorch backend replays CUDA graphs during decode. By default Graphsignal traces each graph as one aggregated cuda.graph event (lower overhead). To see individual graph node activities inside those graphs, use --cuda-graph-trace node:

graphsignal-run --cuda-graph-trace node trtllm-serve Qwen/Qwen1.5-7B-Chat \
  --port 8000 \
  --backend pytorch

The equivalent setting when using the Python API:

graphsignal.watch(cuda_graph_trace='node')

See CUDA Profiling and Monitoring for details on the two modes.

Enabling engine metrics

TensorRT-LLM exposes two different HTTP metrics endpoints:

Endpoint	Format	Purpose
`/prometheus/metrics`	Prometheus text (`trtllm_*`)	Request-level metrics scraped by Graphsignal
`/metrics`	JSON	Per-iteration stats (`IterationStats`); not used by Graphsignal

The /prometheus/metrics route is not enabled by default. It is registered only when return_perf_metrics is set to true in your server configuration. If engine metrics are missing in Graphsignal, enable them in a YAML config file (trtllm-config.yaml in the examples here) and pass it to trtllm-serve with --config:

return_perf_metrics: true

graphsignal-run trtllm-serve Qwen/Qwen1.5-7B-Chat \
  --port 8000 \
  --backend pytorch \
  --config trtllm-config.yaml

After the server has finished loading and served at least one request, verify the endpoint locally:

curl -s http://localhost:8000/prometheus/metrics | head

You should see # HELP trtllm_... lines. For the full list of exported metrics and an end-to-end example, see NVIDIA’s Prometheus Metrics guide and the trtllm-serve metrics documentation.
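The same check can be scripted. The following is a minimal sketch, assuming the server listens on localhost:8000 as in the commands above. The helper names trtllm_metric_names and fetch_metrics are illustrative and are not part of Graphsignal or TensorRT-LLM. It fetches the Prometheus text and lists the trtllm_* metrics announced on # HELP lines.

```python
import urllib.request


def trtllm_metric_names(prometheus_text):
    """Return sorted, de-duplicated trtllm_* names found on '# HELP' lines."""
    names = set()
    for line in prometheus_text.splitlines():
        parts = line.split()
        # A HELP line has the form: "# HELP <metric_name> <description...>"
        if len(parts) >= 3 and parts[0] == "#" and parts[1] == "HELP":
            if parts[2].startswith("trtllm_"):
                names.add(parts[2])
    return sorted(names)


def fetch_metrics(host="localhost", port=8000, path="/prometheus/metrics"):
    """Fetch the raw metrics text; raises urllib.error.URLError on failure."""
    url = f"http://{host}:{port}{path}"
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


# With the server running and at least one request served:
#     print(trtllm_metric_names(fetch_metrics()))
```

An empty list from a reachable endpoint, or an HTTP 404 error, suggests that return_perf_metrics is not enabled in the server configuration.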

Notes:

Match the scrape host to your --host flag (Graphsignal defaults to localhost, same as trtllm-serve).
--grpc mode does not expose an HTTP /prometheus/metrics endpoint; engine Prometheus metrics are not available in that mode.
On the PyTorch backend, iteration stats on /metrics are separate and may require enable_iter_perf_stats in config; that JSON endpoint is not scraped by Graphsignal.
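If you launch the server from your own wrapper, one way to keep the scrape target in sync with the serving flags is to derive it from the same argument list. The sketch below is an assumption-laden illustration: scrape_target is a hypothetical helper, it reads only --host and --port, and it uses the defaults stated in this guide (localhost, 8000). The returned keys match the graphsignal.watch() parameters shown in the Python section of this guide.

```python
import argparse


def scrape_target(serve_args):
    """Derive metrics host/port/path settings from trtllm-serve style arguments.

    Only --host and --port are read; every other argument is ignored.
    """
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    parser.add_argument("--host", default="localhost")
    parser.add_argument("--port", type=int, default=8000)
    known, _ignored = parser.parse_known_args(serve_args)

    # A server bound to the wildcard address also listens on loopback,
    # so scrape it through localhost rather than 0.0.0.0.
    host = "localhost" if known.host == "0.0.0.0" else known.host
    return {
        "metrics_host": host,
        "metrics_port": known.port,
        "metrics_path": "/prometheus/metrics",
    }
```

The result can be passed as keyword arguments, for example graphsignal.watch(**scrape_target(sys.argv[1:])).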

Run TensorRT-LLM from Python with `graphsignal.watch()`

If you bootstrap TensorRT-LLM from your own Python entry point, call graphsignal.watch() once, before importing tensorrt_llm or touching CUDA:

import graphsignal

graphsignal.watch(metrics_port=8000, metrics_path='/prometheus/metrics',
                  metrics_host='localhost')

import tensorrt_llm
# ... run TensorRT-LLM normally ...

Add Graphsignal to a TensorRT-LLM Docker image

TensorRT-LLM NGC release images may not include Graphsignal (or CUPTI). Install the matching Graphsignal CUPTI extra (e.g. graphsignal[cu12] for CUDA 12.x) at container startup and run the server through graphsignal-run.

docker run --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -e GRAPHSIGNAL_API_KEY=YOUR_API_KEY \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v "$PWD/trtllm-config.yaml:/config/trtllm-config.yaml:ro" \
  --entrypoint bash \
  nvcr.io/nvidia/tensorrt-llm/release:latest \
  -lc 'pip install --no-cache-dir "graphsignal[cu12]" \
        && exec graphsignal-run trtllm-serve \
            Qwen/Qwen1.5-7B-Chat \
            --port 8000 \
            --backend pytorch \
            --config /config/trtllm-config.yaml'

Ensure trtllm-config.yaml is readable inside the container at the path passed to --config and sets return_perf_metrics: true (see Enabling engine metrics).
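A quick preflight check before starting the container can catch a missing setting early. The sketch below is illustrative only: perf_metrics_enabled is a hypothetical helper, and it performs a simple line-based match rather than real YAML parsing, so it assumes return_perf_metrics appears as a top-level key on its own line.

```python
import re

# Matches a top-level "return_perf_metrics: true" line, optionally followed
# by a trailing comment. Indented (nested) keys are deliberately not matched.
_PERF_METRICS_LINE = re.compile(
    r"^return_perf_metrics:\s*true\s*(#.*)?$", re.IGNORECASE
)


def perf_metrics_enabled(yaml_text):
    """Return True if the config text enables return_perf_metrics.

    This is a line-based check, not a YAML parser: anchors, flow mappings
    and other YAML features are not handled.
    """
    return any(_PERF_METRICS_LINE.match(line) for line in yaml_text.splitlines())
```

For example, perf_metrics_enabled(pathlib.Path("trtllm-config.yaml").read_text()) should return True for the one-line config shown under Enabling engine metrics.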