An autonomous agent that deploys inference services, collects telemetry, and continuously redeploys with better configurations — indefinitely.
autodebug is an autonomous loop that deploys an inference service, benchmarks it, fetches profiling telemetry, analyzes the results, and redeploys with an updated configuration. Then repeats.
The agent — Claude Code — runs this cycle without human involvement between iterations.
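The cycle can be sketched as follows — a minimal illustration with placeholder callables standing in for the deploy, benchmark, fetch, and analysis steps, not the actual implementation:

```python
from datetime import datetime, timezone

def run_iteration(deploy, benchmark, fetch_telemetry, analyze):
    """One pass of the loop: deploy -> benchmark -> fetch -> analyze."""
    # Each iteration writes a fresh dstack config under sessions/.
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    config_path = f"sessions/dstack-{stamp}.yml"
    deploy(config_path)                   # dstack apply -f <config> -y -d
    window = benchmark()                  # returns the benchmark (start, end)
    telemetry = fetch_telemetry(*window)  # graphsignal-debug fetch for that window
    return analyze(telemetry)             # agent proposes the next configuration
```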
Each session log cites specific signals from the telemetry — profiled function names, span counter values, startup attributes — so the reasoning is traceable. Each deployed config is saved in sessions/ as a version trail.
The pattern comes from autoresearch: autonomous agent loops that iterate continuously toward a goal rather than answering once and stopping. autoresearch applies this to knowledge work. autodebug applies it to infrastructure configuration.
The value compounds across iterations. The first pass finds the obvious issues — features not enabled, defaults that don't fit the workload. Later passes find the non-obvious ones: bottlenecks that only appear under specific load, parameter interactions, optimizations that depend on earlier changes. A human doing this manually tends to stop after one or two rounds. The agent doesn't.
Graphsignal instruments the inference server and collects profiling data, traces, metrics, and errors at inference-time granularity — per-function profiles for SGLang, vLLM, PyTorch, and CUDA. After each benchmark, the agent fetches telemetry for the exact benchmark window and reads the raw signals directly: Scheduler.event_loop_overlap call counts, sglang.inter_token_latency_seconds, sglang.startup.* attributes. Integration is a wrapper around the server startup command:
graphsignal-run python3 -m sglang.launch_server --model-path $MODEL_ID --port 8000
graphsignal-debug is a CLI that fetches debug context from Graphsignal's API for a time range:
graphsignal-debug fetch --start 2026-03-24T20:40:00Z --end 2026-03-24T20:42:00Z
The output — profiles, spans, metrics, errors — is what the agent reads each iteration. A skill file tells Claude Code how to call the CLI and how to map the returned signals to SGLang and vLLM internals.
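The per-iteration fetch can be scripted around this CLI. A small sketch — the helper name is illustrative, the flags are the ones shown above:

```python
from datetime import datetime, timedelta, timezone

def debug_fetch_cmd(start: datetime, end: datetime) -> list[str]:
    """Build the graphsignal-debug invocation for an exact benchmark window."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return ["graphsignal-debug", "fetch",
            "--start", start.strftime(fmt),
            "--end", end.strftime(fmt)]

start = datetime(2026, 3, 24, 20, 40, tzinfo=timezone.utc)
print(" ".join(debug_fetch_cmd(start, start + timedelta(minutes=2))))
# → graphsignal-debug fetch --start 2026-03-24T20:40:00Z --end 2026-03-24T20:42:00Z
```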
dstack provisions the GPU instance and deploys the service. Each iteration writes a new sessions/dstack-<ISO>.yml and applies it:
dstack apply -f sessions/dstack-<ISO>.yml -y -d
The -d flag submits without blocking. dstack handles instance provisioning, container pull, model cache volume, and gateway routing.
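Each sessions/ file is a dstack service spec. A sketch of what one might look like — field values here are illustrative assumptions, not the repository's actual file:

```yaml
# Illustrative sessions/dstack-<ISO>.yml — values are assumptions.
type: service
name: sglang-autodebug
image: lmsysorg/sglang:latest
env:
  - MODEL_ID
commands:
  - graphsignal-run python3 -m sglang.launch_server --model-path $MODEL_ID --port 8000
port: 8000
resources:
  gpu: 80GB
```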
Iteration 1 configuration: lmsysorg/sglang:latest, default flags, mem_fraction_static=0.835
Results:
Telemetry findings:
sglang.startup.* attributes: enable_mixed_chunk=False, num_continuous_decode_steps=1.
Scheduler.event_loop_overlap profile: ~26,000 passes/s. With num_continuous_decode_steps=1, the scheduler re-entered the event loop after every decode step — ~26,000 times per second of decode. Each pass added overhead on top of GPU time.
sglang.inter_token_latency_seconds: 1.8–2.7ms/token.
sglang.e2e_request_latency_seconds: 2.056s for a fresh request, 0.534s with prefix cache hit.
KV cache: 705,734 tokens allocated (64.6 GB) for a model with 0.91 GB of weights.
Next config: --quantization fp8 --enable-mixed-chunk --num-continuous-decode-steps 5 --mem-fraction-static 0.5
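The steps-per-pass arithmetic behind the scheduler finding can be sketched as a toy ratio model — an illustration, not SGLang's scheduler; the per-second drop actually measured after the change is larger than this ratio alone, since load and batching also change:

```python
def scheduler_passes(decode_steps: int, num_continuous_decode_steps: int) -> int:
    """Event-loop entries needed to run a given number of decode steps."""
    # With num_continuous_decode_steps=k, the scheduler runs k decode
    # steps per event-loop pass instead of re-entering after each step.
    return -(-decode_steps // num_continuous_decode_steps)  # ceiling division

print(scheduler_passes(26_000, 1))  # → 26000
print(scheduler_passes(26_000, 5))  # → 5200
```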


Iteration 2 configuration: adds --quantization fp8 --enable-mixed-chunk --num-continuous-decode-steps 5 --mem-fraction-static 0.5
Results:
Telemetry findings:
Scheduler.event_loop_overlap call count: ~93/s, down from ~26,000/s — 286× reduction. num_continuous_decode_steps=5 runs 5 decode steps per scheduler pass instead of 1.
CUDA graph capture: 1.9s, down from 21.6s. FP8 forward passes are faster, so each batch size is captured faster.
Despite FP8 being active (sglang.startup.quantization=fp8), bs=1 decode throughput did not change. With scheduler overhead eliminated, the remaining constraint is KV cache memory bandwidth — the KV cache is still BF16 (kv_cache_dtype=bfloat16).
KV cache reduced to 418,247 tokens; available GPU memory increased from 11.9 GB to 38.3 GB.
Next config: --kv-cache-dtype fp8_e5m2 to halve KV bandwidth, --speculative-algorithm NGRAM --speculative-num-steps 3 for token drafting without a separate draft model.
Most observability tooling is built assuming a human reads the output. autodebug flips that: the telemetry is a structured feedback signal consumed by an agent, which uses it to generate the next configuration change.
The loop does not require the agent to know the right answer up front. It requires the telemetry to be specific enough that the agent can identify what changed, what the effect was, and what to try next. That is a different bar than dashboards built for human monitoring — and it is one that inference-time profiling data meets well.
The code is at github.com/graphsignal/autodebug.