Profiler CLI

Graphsignal observes your workload from a sidecar process — the profiler. It runs out-of-process, never inside your workload. graphsignal-run launches a workload with the profiler attached.

Install the CLI as an isolated uv tool so it doesn’t pollute the workload environment:

UV_TOOL_BIN_DIR=/usr/local/bin uv tool install 'graphsignal[cu12]'   # CUDA 12.x
# or
UV_TOOL_BIN_DIR=/usr/local/bin uv tool install 'graphsignal[cu13]'   # CUDA 13.x

Setting UV_TOOL_BIN_DIR=/usr/local/bin places the graphsignal-run executable in a directory that is normally already on PATH for every shell, including non-interactive scripts and containers.
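To confirm that the executable is actually reachable on PATH (for example from a container entrypoint), a check along these lines may help. It is a minimal sketch using only the Python standard library; the helper name find_cli is illustrative and not part of Graphsignal.

```python
import shutil


def find_cli(name="graphsignal-run"):
    """Return the absolute path of `name` if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_cli()
    if path is None:
        # Hypothetical hint: the bin directory chosen at install time is not on PATH.
        print("graphsignal-run not found on PATH; check UV_TOOL_BIN_DIR")
    else:
        print(f"graphsignal-run found at {path}")
```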

Alternative: install into your workload environment

If you prefer a single environment, or you use the graphsignal.watch() Python API (which requires graphsignal importable by your application), install it directly into your workload’s environment instead:

pip install 'graphsignal[cu12]'   # CUDA 12.x
# or
pip install 'graphsignal[cu13]'   # CUDA 13.x

graphsignal-run

Wrap any launch command. graphsignal-run starts the profiler sidecar, enables GPU profiling, and launches your workload so process managers (init systems, container runtimes, etc.) see only the workload.

graphsignal-run <command> [args...]

Examples:

graphsignal-run vllm serve <model> --port 8001
graphsignal-run --enable-otel sglang serve --model-path <model>
graphsignal-run python -m sglang.launch_server --model-path <model>
graphsignal-run trtllm-serve <model> --port 8000
graphsignal-run --metrics-port 8000 trtllm-serve <model> --port 8000
graphsignal-run python myapp.py
graphsignal-run app.py

Options (must precede the command):

--enable-otel — Enable OpenTelemetry trace capture for supported engines (vLLM, SGLang). Captures the engine’s request traces via a local OTLP/gRPC collector. Requires OpenTelemetry installed in the engine’s environment. Off by default.
--metrics-port PORT — Port to scrape the workload’s Prometheus /metrics endpoint on. Overrides the port derived from the engine’s --port flag or its default (e.g. 8000 for vLLM/TensorRT-LLM, 30000 for SGLang). Use this when metrics are exposed on a different port than the HTTP server. Not forwarded to the workload.
--cuda-graph-trace graph|node — CUDA graph tracing granularity. graph (default): trace each CUDA graph as one aggregated cuda.graph event; lower CUPTI overhead. node: trace individual graph node activities (kernels, memory copies, etc.) as separate cuda.kernel / cuda.memcpy events; higher CUPTI overhead. Use node when you need node-level visibility inside captured CUDA graphs. Not forwarded to the workload.
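Because options must come before the command, a launcher script can assemble the argument list in that order. The following is a rough sketch, not part of Graphsignal: build_command is a hypothetical helper, and the model name and port values are placeholders.

```python
import shlex


def build_command(workload, enable_otel=False, metrics_port=None, cuda_graph_trace=None):
    """Build an argv list with graphsignal-run options placed before the workload command."""
    argv = ["graphsignal-run"]
    if enable_otel:
        argv.append("--enable-otel")
    if metrics_port is not None:
        argv += ["--metrics-port", str(metrics_port)]
    if cuda_graph_trace is not None:
        if cuda_graph_trace not in ("graph", "node"):
            raise ValueError("cuda_graph_trace must be 'graph' or 'node'")
        argv += ["--cuda-graph-trace", cuda_graph_trace]
    # Everything appended from here on is the workload command and its own arguments.
    argv += list(workload)
    return argv


# Placeholder workload: "my-model" and both port numbers are illustrative values.
cmd = build_command(
    ["vllm", "serve", "my-model", "--port", "8001"],
    metrics_port=9100,
    cuda_graph_trace="node",
)
print(shlex.join(cmd))
```

The returned list can be handed to subprocess.run as-is, which avoids shell quoting issues.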

Behavior:

Detects the engine from your command (vLLM, SGLang, TensorRT-LLM, or a generic fallback).
For OTEL-aware workloads (vLLM, SGLang), captures the engine’s request traces via a local OTLP/gRPC collector when --enable-otel is set.
Scrapes Prometheus metrics from http://127.0.0.1:<port>/metrics when a metrics port is resolved (from --metrics-port, the engine’s --port, or the engine default).
Collects CUDA kernel activity via CUPTI as soon as CUDA initializes.
Launches your workload with the profiler sidecar running alongside it.
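The metrics port precedence described above (--metrics-port first, then the engine's --port, then the engine default) can be illustrated as follows. This is only a sketch of the documented order, not Graphsignal's actual implementation; the function name, the engine keys, and the assumption that the generic fallback has no default port are all illustrative.

```python
# Engine defaults as stated in the options list; the dictionary keys are illustrative labels.
ENGINE_DEFAULT_PORTS = {"vllm": 8000, "tensorrt-llm": 8000, "sglang": 30000}


def resolve_metrics_url(engine, metrics_port=None, engine_port=None):
    """Return the scrape URL, or None when no metrics port can be resolved."""
    if metrics_port is not None:
        port = metrics_port  # explicit --metrics-port wins
    elif engine_port is not None:
        port = engine_port  # otherwise the engine's own --port
    else:
        port = ENGINE_DEFAULT_PORTS.get(engine)  # otherwise the engine default, if any
    if port is None:
        return None
    return f"http://127.0.0.1:{port}/metrics"


print(resolve_metrics_url("vllm", engine_port=8001))
print(resolve_metrics_url("sglang"))
print(resolve_metrics_url("generic"))
```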

The profiler reads its configuration from environment variables. Set these before invoking graphsignal-run (or before calling graphsignal.watch()).

Variable	Purpose
`GRAPHSIGNAL_API_KEY` (required)	Your account API key.
`GRAPHSIGNAL_API_BASE`	Override the API endpoint (defaults to `https://api.graphsignal.com`).
`GRAPHSIGNAL_TAG_<KEY>=<value>`	Arbitrary tag attached to all signals (e.g. `GRAPHSIGNAL_TAG_DEPLOYMENT=us-prod`).

To get an API key, sign up for a free account at graphsignal.com; the key is in Settings / API Keys.
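When graphsignal-run is launched from a Python script rather than a shell, the variables in the table above can be assembled into an environment mapping first. The sketch below is an illustration only: build_profiler_env is a hypothetical helper, and upper-casing tag keys is an assumption based on the GRAPHSIGNAL_TAG_DEPLOYMENT example.

```python
import os


def build_profiler_env(api_key, tags=None, api_base=None, base_env=None):
    """Return a copy of the environment with the Graphsignal variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["GRAPHSIGNAL_API_KEY"] = api_key
    if api_base is not None:
        env["GRAPHSIGNAL_API_BASE"] = api_base
    for key, value in (tags or {}).items():
        # Assumption: tag keys are written in upper case, as in GRAPHSIGNAL_TAG_DEPLOYMENT.
        env[f"GRAPHSIGNAL_TAG_{key.upper()}"] = str(value)
    return env


# "my-api-key" is a placeholder; use the key from Settings / API Keys.
env = build_profiler_env("my-api-key", tags={"deployment": "us-prod"}, base_env={})
print(sorted(env))
```

The resulting mapping can be passed as the env argument of subprocess.run together with a graphsignal-run command line, so the variables are in place before the profiler starts.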