Profiler CLI
Graphsignal observes your workload from a sidecar process — the profiler. It runs out-of-process, never inside your workload. graphsignal-run launches a workload with the profiler attached.
Install the CLI as an isolated uv tool so it doesn’t pollute the workload environment:

    UV_TOOL_BIN_DIR=/usr/local/bin uv tool install 'graphsignal[cu12]'  # CUDA 12.x
    # or
    UV_TOOL_BIN_DIR=/usr/local/bin uv tool install 'graphsignal[cu13]'  # CUDA 13.x

UV_TOOL_BIN_DIR=/usr/local/bin puts graphsignal-run in a directory that is already on PATH for every shell, including non-interactive scripts and containers.
Alternative: install into your workload environment
If you prefer a single environment, or you use the graphsignal.watch() Python API (which requires graphsignal to be importable by your application), install it directly into your workload’s environment instead:

    pip install 'graphsignal[cu12]'  # CUDA 12.x
    # or
    pip install 'graphsignal[cu13]'  # CUDA 13.x

graphsignal-run
Wrap any launch command. graphsignal-run starts the profiler sidecar, enables GPU profiling, and launches your workload so process managers (init systems, container runtimes, etc.) see only the workload.

    graphsignal-run <command> [args...]

Examples:

    graphsignal-run vllm serve <model> --port 8001
    graphsignal-run --enable-otel sglang serve --model-path <model>
    graphsignal-run python -m sglang.launch_server --model-path <model>
    graphsignal-run trtllm-serve <model> --port 8000
    graphsignal-run --metrics-port 8000 trtllm-serve <model> --port 8000
    graphsignal-run python myapp.py
    graphsignal-run app.py

Options (must precede the command):
- --enable-otel: Enable OpenTelemetry trace capture for supported engines (vLLM, SGLang). Captures the engine’s request traces via a local OTLP/gRPC collector. Requires OpenTelemetry installed in the engine’s environment. Off by default.
- --metrics-port PORT: Port on which to scrape the workload’s Prometheus /metrics endpoint. Overrides the port derived from the engine’s --port flag or its default (e.g. 8000 for vLLM/TensorRT-LLM, 30000 for SGLang). Use this when metrics are exposed on a different port than the HTTP server. Not forwarded to the workload.
- --cuda-graph-trace graph|node: CUDA graph tracing granularity. graph (default): trace each CUDA graph as one aggregated cuda.graph event; lower CUPTI overhead. node: trace individual graph node activities (kernels, memory copies, etc.) as separate cuda.kernel / cuda.memcpy events; higher CUPTI overhead. Use node when you need node-level visibility inside captured CUDA graphs. Not forwarded to the workload.
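To illustrate why the options must come before the command, here is a minimal Python sketch of how a wrapper could separate its own options from the workload command line. It is not graphsignal's actual parser: the function name split_wrapper_args and the two option sets are hypothetical, and only the option names themselves come from the list above.

```python
# Illustrative sketch only; this is NOT graphsignal's real argument parser.
# The option names come from the documentation; everything else is assumed.
FLAG_OPTIONS = {"--enable-otel"}                           # options without a value
VALUE_OPTIONS = {"--metrics-port", "--cuda-graph-trace"}   # options taking one value


def split_wrapper_args(argv):
    """Split argv into (wrapper_options, workload_command).

    Parsing stops at the first token that is not a known wrapper option,
    so anything after that point is left untouched for the workload.
    """
    options = {}
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg in FLAG_OPTIONS:
            options[arg] = True
            i += 1
        elif arg in VALUE_OPTIONS:
            if i + 1 >= len(argv):
                raise ValueError(f"{arg} requires a value")
            options[arg] = argv[i + 1]
            i += 2
        else:
            break  # first non-option token starts the workload command
    return options, argv[i:]


# The wrapper option is consumed; the engine's own --port stays in the command.
options, command = split_wrapper_args(
    ["--metrics-port", "8000", "trtllm-serve", "<model>", "--port", "8000"]
)
```

Under this scheme, an option placed after the command (for example vllm serve <model> --enable-otel) would be passed through to the workload rather than interpreted by the wrapper, which is one plausible reason for the ordering requirement.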
Behavior:
- Detects the engine from your command (vLLM, SGLang, TensorRT-LLM, or a generic fallback).
- For OTEL-aware workloads (vLLM, SGLang), captures the engine’s request traces via a local OTLP/gRPC collector when --enable-otel is set.
- Scrapes Prometheus metrics from http://127.0.0.1:<port>/metrics when a metrics port is resolved (from --metrics-port, the engine’s --port, or the engine default).
- Collects CUDA kernel activity via CUPTI as soon as CUDA initializes.
- Launches your workload with the profiler sidecar running alongside it.
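The metrics port precedence described above (--metrics-port first, then the engine's --port, then the engine default) can be sketched in Python as follows. This is an assumption-laden illustration, not the profiler's code: resolve_metrics_port, metrics_url, the ENGINE_DEFAULT_PORTS table, and the engine keys are hypothetical names, and only the default port numbers and the URL shape are taken from this page.

```python
# Illustrative sketch only; names and structure are assumptions, not graphsignal's code.
# Default ports as stated in the docs: 8000 for vLLM/TensorRT-LLM, 30000 for SGLang.
ENGINE_DEFAULT_PORTS = {"vllm": 8000, "trtllm": 8000, "sglang": 30000}


def resolve_metrics_port(engine, command, metrics_port=None):
    """Return the port to scrape, or None if no port can be resolved."""
    # 1. An explicit --metrics-port always wins.
    if metrics_port is not None:
        return int(metrics_port)
    # 2. Otherwise use the engine's own --port flag, if present in the command.
    for i, token in enumerate(command):
        if token == "--port" and i + 1 < len(command):
            return int(command[i + 1])
        if token.startswith("--port="):
            return int(token.split("=", 1)[1])
    # 3. Otherwise fall back to the engine default (None for a generic workload).
    return ENGINE_DEFAULT_PORTS.get(engine)


def metrics_url(port):
    """Build the local scrape URL for a resolved port."""
    return f"http://127.0.0.1:{port}/metrics"


port = resolve_metrics_port("vllm", ["vllm", "serve", "<model>", "--port", "8001"])
url = metrics_url(port) if port is not None else None
```

In this sketch a generic workload with no --port and no --metrics-port resolves to None, matching the statement that scraping only happens when a metrics port is resolved.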
The profiler reads its configuration from environment variables. Set these before invoking graphsignal-run (or before calling graphsignal.watch()).
| Variable | Purpose |
|---|---|
| GRAPHSIGNAL_API_KEY (required) | Your account API key. |
| GRAPHSIGNAL_API_BASE | Override the API endpoint (defaults to https://api.graphsignal.com). |
| GRAPHSIGNAL_TAG_<KEY>=<value> | Arbitrary tag attached to all signals (e.g. GRAPHSIGNAL_TAG_DEPLOYMENT=us-prod). |
To get an API key, sign up for a free account at graphsignal.com; the key is in Settings / API Keys.
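If you launch workloads from a Python script rather than a shell, the variables in the table can be assembled programmatically. The sketch below is a hedged example: build_profiler_env is a hypothetical helper, and upper-casing tag keys is an assumption based on the GRAPHSIGNAL_TAG_DEPLOYMENT example rather than a documented requirement.

```python
import os


def build_profiler_env(api_key, tags=None, api_base=None, base_env=None):
    """Return an environment dict with the documented GRAPHSIGNAL_* variables set.

    Hypothetical helper for illustration; not part of the graphsignal package.
    """
    # Start from the current environment unless a base is supplied explicitly.
    env = dict(os.environ if base_env is None else base_env)
    env["GRAPHSIGNAL_API_KEY"] = api_key
    if api_base is not None:
        env["GRAPHSIGNAL_API_BASE"] = api_base
    for key, value in (tags or {}).items():
        # Assumption: tag keys are upper-cased, as in GRAPHSIGNAL_TAG_DEPLOYMENT.
        env[f"GRAPHSIGNAL_TAG_{key.upper()}"] = str(value)
    return env


env = build_profiler_env("<your-api-key>", tags={"deployment": "us-prod"})

# The resulting dict can then be passed to the launcher, for example:
#   import subprocess
#   subprocess.run(["graphsignal-run", "python", "myapp.py"], env=env, check=True)
```

Building the environment as a dict keeps the API key and tags scoped to the launched process instead of exporting them into the parent shell.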