SGLang Profiling, Tracing, and Monitoring
See the Quick Start guide on how to install Graphsignal.
graphsignal-run recognizes sglang ... and python -m sglang.launch_server ... invocations and configures SGLang’s metrics flag (--enable-metrics) plus CUPTI injection for you. OpenTelemetry tracing is opt-in: add the --enable-otel flag to capture SGLang’s built-in request traces. There is no Python-level SGLang instrumentation — the profiler ingests SGLang’s built-in OTEL traces and /metrics endpoint and captures CUDA kernel activity externally via CUPTI.
What’s captured
- GPU profiling: per-kernel CUDA timelines plus higher-level aggregations by activity type (attention, matrix multiply, communication, KV cache, quantization, normalization, activation, …). See CUDA Profiling and Monitoring for details on what’s captured GPU-side.
- Tracing (opt-in with --enable-otel): SGLang’s built-in OpenTelemetry spans, ingested through a local OTLP/gRPC collector that the launcher spawns. When --enable-otel is set, the launcher adds --enable-trace and points --otlp-traces-endpoint at the local collector. Requires OpenTelemetry installed in SGLang’s environment (see below).
- Engine metrics: SGLang’s /metrics endpoint is auto-discovered and scraped (the launcher adds --enable-metrics so metrics stay on).
- System metrics: CPU, host memory, and GPU metrics, collected by the profiler sidecar regardless of engine.
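The flag handling described on this page can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not Graphsignal's actual launcher code: the function names `has_flag` and `build_sglang_argv` are hypothetical, and the collector address `localhost:4317` is an assumption (4317 is the conventional OTLP/gRPC port; the real launcher may choose a different one).

```python
from typing import List

# Assumed address of the launcher's local OTLP/gRPC collector (hypothetical).
LOCAL_COLLECTOR = "localhost:4317"


def has_flag(argv: List[str], flag: str) -> bool:
    """True if `flag` appears as `--flag` or `--flag=value` in argv."""
    return any(arg == flag or arg.startswith(flag + "=") for arg in argv)


def build_sglang_argv(argv: List[str], enable_otel: bool = False) -> List[str]:
    """Return a copy of argv with the flags this page says the launcher adds."""
    out = list(argv)  # never mutate the caller's list
    # Metrics stay on so the /metrics endpoint can be scraped.
    if not has_flag(out, "--enable-metrics"):
        out.append("--enable-metrics")
    if enable_otel:
        if not has_flag(out, "--enable-trace"):
            out.append("--enable-trace")
        # A user-supplied endpoint is honored; only fill in the local
        # collector when none was given.
        if not has_flag(out, "--otlp-traces-endpoint"):
            out += ["--otlp-traces-endpoint", LOCAL_COLLECTOR]
    return out


if __name__ == "__main__":
    cmd = ["sglang", "serve", "--model-path", "Qwen/Qwen1.5-7B-Chat", "--port", "8000"]
    print(" ".join(build_sglang_argv(cmd, enable_otel=True)))
```

The sketch keeps user-supplied flags untouched and only appends what is missing, which matches the page's statement that an explicit --otlp-traces-endpoint is respected.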
Run sglang serve with graphsignal-run (recommended)
Set your API key, then start sglang serve via graphsignal-run. The launcher wires up CUPTI profiling and metrics for you.
export GRAPHSIGNAL_API_KEY="..."
graphsignal-run sglang serve \
  --model-path Qwen/Qwen1.5-7B-Chat \
  --port 8000
Enable tracing with --enable-otel
OpenTelemetry tracing is off by default. Add --enable-otel before the sglang command to capture SGLang’s request traces. The launcher then injects --enable-trace and --otlp-traces-endpoint and starts a local OTLP/gRPC collector:
graphsignal-run --enable-otel sglang serve \
  --model-path Qwen/Qwen1.5-7B-Chat \
  --port 8000
Tracing requires OpenTelemetry installed in SGLang’s environment (Graphsignal can’t provide it; it may be installed separately, e.g. via uv tool):
pip install "sglang[tracing]"
# or: pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc
If you pass your own --otlp-traces-endpoint alongside --enable-otel, the launcher honors it and does not start a local collector, so your traces stay in your existing OTEL pipeline. CUDA/NVML/Prometheus profiling works regardless of --enable-otel.
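Because tracing depends on OpenTelemetry being importable from SGLang's own environment, it can help to check this before launching with --enable-otel. The following is a minimal sketch, to be run with the same Python interpreter that runs SGLang. It assumes the import names `opentelemetry.sdk` and `opentelemetry.exporter.otlp.proto.grpc`, which correspond to the two pip packages named above; the helper names `module_available` and `check_otel` are hypothetical.

```python
import importlib.util

# Import names assumed to correspond to opentelemetry-sdk and
# opentelemetry-exporter-otlp-proto-grpc.
REQUIRED_MODULES = (
    "opentelemetry.sdk",
    "opentelemetry.exporter.otlp.proto.grpc",
)


def module_available(name: str) -> bool:
    """True if `name` can be located without fully importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec raises when a parent package is missing.
        return False


def check_otel() -> dict:
    """Map each required module name to whether it is available."""
    return {name: module_available(name) for name in REQUIRED_MODULES}


if __name__ == "__main__":
    for name, ok in check_otel().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

If either module is reported as missing, install the tracing dependencies into that environment before enabling tracing.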
Run SGLang from Python with graphsignal.watch()
If you bootstrap SGLang from your own Python entry point, call graphsignal.watch() once before importing sglang or touching CUDA:
import graphsignal
graphsignal.watch()
import sglang
# ... run SGLang normally ...
Add Graphsignal to an SGLang Docker image
If your image does not include Graphsignal (or CUPTI), install Graphsignal at container startup and run SGLang through graphsignal-run.
docker run --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -e GRAPHSIGNAL_API_KEY=YOUR_API_KEY \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --entrypoint bash \
  your-sglang-image:latest \
  -lc 'pip install --no-cache-dir graphsignal[cu12] \
    && exec graphsignal-run sglang serve \
      --model-path Qwen/Qwen2.5-1.5B-Instruct \
      --port 8000'
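Once the server is up, one way to confirm that engine metrics are exposed is to fetch the /metrics endpoint and list the metric names it reports. The sketch below assumes the server is reachable at http://localhost:8000 (the port used in the examples on this page) and that the endpoint returns the Prometheus text format; the helper names `fetch_metrics_text` and `metric_names` are hypothetical.

```python
import urllib.request
from typing import List


def fetch_metrics_text(base_url: str = "http://localhost:8000",
                       timeout: float = 5.0) -> str:
    """Download the raw /metrics payload (base_url is an assumption)."""
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def metric_names(text: str) -> List[str]:
    """Return the sorted, de-duplicated metric names in a Prometheus text payload."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The name is everything before the first "{" or whitespace.
        head = line.split("{", 1)[0].split()
        if head:
            names.add(head[0])
    return sorted(names)


if __name__ == "__main__":
    for name in metric_names(fetch_metrics_text()):
        print(name)
```

An empty result or a connection error suggests the server is not running on that port or that metrics are not enabled.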