Skip to content

SGLang Profiling, Tracing, and Monitoring

See the Quick Start guide on how to install Graphsignal.

graphsignal-run recognizes sglang ... and python -m sglang.launch_server ... invocations and configures SGLang’s metrics flag (--enable-metrics) plus CUPTI injection for you. OpenTelemetry tracing is opt-in: add the --enable-otel flag to capture SGLang’s built-in request traces. There is no Python-level SGLang instrumentation — the profiler ingests SGLang’s built-in OTEL traces and /metrics endpoint and captures CUDA kernel activity externally via CUPTI.

  • GPU profiling: per-kernel CUDA timelines plus higher-level aggregations by activity type (attention, matrix multiply, communication, KV cache, quantization, normalization, activation, …). See CUDA Profiling and Monitoring for details on what’s captured GPU-side.
  • Tracing (opt-in with --enable-otel): SGLang’s built-in OpenTelemetry spans, ingested through a local OTLP/gRPC collector that the launcher spawns. When --enable-otel is set, the launcher adds --enable-trace and points --otlp-traces-endpoint at the local collector. Requires OpenTelemetry installed in SGLang’s environment (see below).
  • Engine metrics: SGLang’s /metrics endpoint is auto-discovered and scraped (the launcher adds --enable-metrics so metrics stay on).
  • System metrics: CPU, host memory, and GPU metrics — collected by the profiler sidecar regardless of engine.
Section titled “Run sglang serve with graphsignal-run (recommended)”

Set your API key, then start sglang serve via graphsignal-run. The launcher wires up CUPTI profiling and metrics for you.

Terminal window
export GRAPHSIGNAL_API_KEY="..."
graphsignal-run sglang serve \
--model-path Qwen/Qwen1.5-7B-Chat \
--port 8000

OpenTelemetry tracing is off by default. Add --enable-otel (before the command) to capture SGLang’s request traces — the launcher then injects --enable-trace / --otlp-traces-endpoint and starts a local OTLP/gRPC collector:

Terminal window
graphsignal-run --enable-otel sglang serve \
--model-path Qwen/Qwen1.5-7B-Chat \
--port 8000

Tracing requires OpenTelemetry installed in SGLang’s environment (Graphsignal can’t provide it — it may be installed separately, e.g. via uv tool):

Terminal window
pip install "sglang[tracing]"
# or: pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc

If you pass your own --otlp-traces-endpoint alongside --enable-otel, the launcher honors it and does not start a local collector — your traces stay in your existing OTEL pipeline. CUDA/NVML/Prometheus profiling works regardless of --enable-otel.

Run SGLang from Python with graphsignal.watch()

Section titled “Run SGLang from Python with graphsignal.watch()”

If you bootstrap SGLang from your own Python entry point, call graphsignal.watch() once before importing sglang / touching CUDA:

import graphsignal
graphsignal.watch()
import sglang
# ... run SGLang normally ...

If your image does not include Graphsignal (or CUPTI), install Graphsignal at container startup and run SGLang through graphsignal-run.

Terminal window
docker run --gpus all \
-p 8000:8000 \
--ipc=host \
-e GRAPHSIGNAL_API_KEY=YOUR_API_KEY \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--entrypoint bash \
your-sglang-image:latest \
-lc 'pip install --no-cache-dir graphsignal[cu12] \
&& exec graphsignal-run sglang serve \
--model-path Qwen/Qwen2.5-1.5B-Instruct \
--port 8000'