SGLang Profiling, Tracing, and Monitoring

See the Quick Start guide for instructions on installing Graphsignal.

graphsignal-run recognizes sglang ... and python -m sglang.launch_server ... invocations and enables SGLang’s metrics plus GPU profiling for you. OpenTelemetry tracing is opt-in: add the --enable-otel flag to capture SGLang’s built-in request traces. The profiler scrapes SGLang’s /metrics endpoint, captures CUDA kernel activity via CUPTI, and, when tracing is enabled, ingests SGLang’s built-in OTEL traces.

What’s captured

GPU profiling: per-kernel CUDA timelines plus higher-level aggregations by activity type (attention, matrix multiply, communication, KV cache, quantization, normalization, activation, …). See CUDA Profiling and Monitoring for details on what’s captured GPU-side.
Tracing (opt-in with --enable-otel): SGLang’s built-in OpenTelemetry spans, which Graphsignal ingests automatically through a local OTLP/gRPC collector when the flag is set. Requires OpenTelemetry to be installed in SGLang’s environment (see below).
Engine metrics: SGLang’s /metrics endpoint is auto-discovered and scraped (Graphsignal keeps metrics enabled).
System metrics: CPU, host memory, and GPU metrics — collected by the profiler sidecar regardless of engine.

Run `sglang serve` with `graphsignal-run` (recommended)

Set your API key, then start sglang serve via graphsignal-run. Graphsignal sets up GPU profiling and metrics for you.

export GRAPHSIGNAL_API_KEY="..."

graphsignal-run sglang serve \
  --model-path Qwen/Qwen1.5-7B-Chat \
  --port 8000
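
Once the server is up, you may want to confirm that the /metrics endpoint Graphsignal scrapes is reachable. The following is a minimal sketch, not part of Graphsignal or SGLang: the helper names parse_prometheus_text and fetch_metrics are hypothetical, the URL assumes the --port 8000 used above, and the parser assumes samples without trailing timestamps.

```python
import urllib.request


def parse_prometheus_text(text):
    """Parse Prometheus text exposition into {sample_name_with_labels: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Assumption: the value is the last space-separated field (no timestamp).
        name, _, value = line.rpartition(" ")
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return samples


def fetch_metrics(url="http://localhost:8000/metrics", timeout=5.0):
    """Fetch and parse the metrics page; the default URL is an assumption."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_prometheus_text(resp.read().decode("utf-8"))
```

Calling fetch_metrics() against the running server returns a dictionary of the samples currently exposed; an empty dictionary or a connection error suggests the endpoint is not being served on that port.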

Enable tracing with `--enable-otel`

OpenTelemetry tracing is off by default. Add --enable-otel before the sglang command, as an option to graphsignal-run, to capture SGLang’s request traces. Graphsignal then collects the engine’s traces via a local OTLP/gRPC collector:

graphsignal-run --enable-otel sglang serve \
  --model-path Qwen/Qwen1.5-7B-Chat \
  --port 8000

Tracing requires OpenTelemetry to be installed in SGLang’s own environment. Graphsignal can’t provide it, because SGLang may be installed in a separate environment (e.g. via uv tool):

pip install "sglang[tracing]"
# or: pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc
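
To check whether the OpenTelemetry packages are visible to SGLang, a small script like the one below can be run with the same Python interpreter that runs SGLang. This is a sketch under assumptions: the helper missing_modules is hypothetical, and the module names are the import paths assumed to correspond to the pip packages listed above.

```python
import importlib.util

# Import paths assumed to be provided by opentelemetry-sdk and
# opentelemetry-exporter-otlp-proto-grpc respectively.
REQUIRED_MODULES = (
    "opentelemetry.sdk",
    "opentelemetry.exporter.otlp.proto.grpc",
)


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except ImportError:
            # Raised when a parent package (e.g. "opentelemetry") is absent.
            found = False
        if not found:
            missing.append(name)
    return missing
```

An empty list from missing_modules() suggests the tracing dependencies are importable; any names returned indicate packages to install in that environment.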

If you pass your own --otlp-traces-endpoint alongside --enable-otel, graphsignal-run honors it and does not start a local collector — your traces stay in your existing OTEL pipeline. CUDA/NVML/Prometheus profiling works regardless of --enable-otel.
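
Since graphsignal-run does not start a local collector when you supply your own endpoint, it can help to confirm beforehand that your collector is accepting connections. The sketch below only checks TCP reachability and does not verify that the listener speaks OTLP. The function name is hypothetical, and the defaults assume a collector on localhost at 4317, the conventional OTLP/gRPC port; substitute your own host and port.

```python
import socket


def otlp_endpoint_reachable(host="localhost", port=4317, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A False result usually means the collector is not running or is bound to a different address than the one passed to --otlp-traces-endpoint.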

CUDA graph tracing

SGLang decode paths often use CUDA graphs. By default Graphsignal traces each graph as one aggregated cuda.graph event (lower overhead). To see individual graph node activities, use --cuda-graph-trace node:

graphsignal-run --cuda-graph-trace node sglang serve \
  --model-path Qwen/Qwen1.5-7B-Chat \
  --port 8000

The equivalent when calling graphsignal.watch() from Python:

graphsignal.watch(cuda_graph_trace='node')

See CUDA Profiling and Monitoring for details on the two modes.

Run SGLang from Python with `graphsignal.watch()`

If you bootstrap SGLang from your own Python entry point, call graphsignal.watch() once, before importing sglang or touching CUDA:

import graphsignal

graphsignal.watch()

import sglang
# ... run SGLang normally ...

Add Graphsignal to an SGLang Docker image

If your image does not include Graphsignal (or CUPTI), install Graphsignal at container startup and run SGLang through graphsignal-run.

docker run --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -e GRAPHSIGNAL_API_KEY=YOUR_API_KEY \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --entrypoint bash \
  your-sglang-image:latest \
  -lc 'pip install --no-cache-dir "graphsignal[cu12]" \
        && exec graphsignal-run sglang serve \
            --model-path Qwen/Qwen2.5-1.5B-Instruct \
            --port 8000'