An autonomous agent that deploys inference services, collects telemetry, and continuously redeploys with better configurations - indefinitely.
autodebug is an autonomous loop that deploys an inference service, benchmarks it, fetches profiling telemetry, analyzes the results, and redeploys with an updated configuration. Then repeats.
While the current version focuses on configuration tuning, the same loop can be extended to drive modifications to inference engine code and custom CUDA kernels. The boundary between tuning a config and changing the code is just a matter of what tools the agent has access to.
The agent - Claude Code - runs this cycle without human involvement between iterations: deploy, benchmark, fetch telemetry, analyze, redeploy.
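A minimal sketch of one iteration, assuming the CLI invocations described below; `run_benchmark` and `write_next_config` are hypothetical placeholders for the benchmark harness and the agent's reasoning step, not the actual autodebug code:

```python
import datetime
import subprocess

def session_path(ts: datetime.datetime) -> str:
    """Each iteration writes a new versioned config under sessions/."""
    return f"sessions/dstack-{ts:%Y-%m-%dT%H-%M-%S}.yml"

def iterate(config_path: str, run_benchmark, write_next_config) -> str:
    """One pass: deploy -> benchmark -> fetch telemetry -> analyze -> next config."""
    # Deploy the current config without blocking (-d); dstack provisions the GPU.
    subprocess.run(["dstack", "apply", "-f", config_path, "-y", "-d"], check=True)

    # Record the exact benchmark window so the telemetry lines up with the run.
    start = datetime.datetime.now(datetime.timezone.utc)
    run_benchmark()
    end = datetime.datetime.now(datetime.timezone.utc)

    # Pull profiles, spans, metrics, and errors for that window.
    telemetry = subprocess.run(
        ["graphsignal-debug", "fetch",
         "--start", start.isoformat(), "--end", end.isoformat()],
        capture_output=True, text=True, check=True,
    ).stdout

    # The agent reads the raw signals and emits the next versioned config.
    return write_next_config(telemetry, session_path(end))

def run_forever(config_path, run_benchmark, write_next_config):
    while True:  # no stopping condition by design
        config_path = iterate(config_path, run_benchmark, write_next_config)
```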
Each session log cites specific signals from the telemetry - profiled function names, span counter values, startup attributes - so the reasoning is traceable. Each deployed config is saved in sessions/ as a version trail.
The pattern comes from autoresearch: autonomous agent loops that iterate continuously toward a goal rather than answering once and stopping. autoresearch applies this to knowledge work. autodebug applies it to infrastructure configuration.
The value compounds across iterations. The first pass finds the obvious issues - features not enabled, defaults that don't fit the workload. Later passes find the non-obvious ones: bottlenecks that only appear under specific load, parameter interactions, optimizations that depend on earlier changes. A human doing this manually tends to stop after one or two rounds. The agent doesn't.
Graphsignal instruments the inference server and collects profiling data, traces, metrics, and errors at inference-time granularity - per-function profiles for SGLang, vLLM, PyTorch, and CUDA. After each benchmark, the agent fetches telemetry for the exact benchmark window and reads the raw signals directly: Scheduler.event_loop_overlap call counts, sglang.inter_token_latency_seconds, sglang.startup.* attributes. Integration is a wrapper around the server startup command:
graphsignal-run python3 -m sglang.launch_server --model-path $MODEL_ID --port 8000
graphsignal-debug is a CLI that fetches debug context from Graphsignal's API for a time range:
graphsignal-debug fetch --start 2026-03-24T20:40:00Z --end 2026-03-24T20:42:00Z
The output - profiles, spans, metrics, errors - is what the agent reads each iteration. A skill file tells Claude Code how to call the CLI and how to map the returned signals to SGLang and vLLM internals.
dstack provisions the GPU instance and deploys the service. Each iteration writes a new sessions/dstack-<ISO>.yml and applies it:
dstack apply -f sessions/dstack-<ISO>.yml -y -d
The -d flag submits without blocking. dstack handles instance provisioning, container pull, model cache volume, and gateway routing.
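An illustrative session file, following dstack's service spec; the name and resource values here are assumptions for the sketch, not the actual contents of sessions/:

```yaml
type: service
name: sglang-autodebug            # hypothetical service name
image: lmsysorg/sglang:latest
commands:
  # graphsignal-run wraps the server so profiling is attached from startup
  - graphsignal-run python3 -m sglang.launch_server
      --model-path Qwen/Qwen1.5-0.5B-Chat --port 8000
port: 8000
resources:
  gpu: H100
```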
The agent runs indefinitely. The two iterations below illustrate how the loop works - the findings and config changes it produces at each step - not the full run. The setup is Qwen1.5-0.5B-Chat on SGLang, deployed on an H100 80GB via dstack.
Configuration: lmsysorg/sglang:latest, Qwen1.5-0.5B-Chat, default SGLang flags
Results (server-side, 31 prompt / 150 completion tokens):
Telemetry findings:
sglang.startup.* attributes confirmed: enable_mixed_chunk=False, num_continuous_decode_steps=1, attention_backend=fa3.
Scheduler.run_batch profile: 405 calls / 435ms during the active benchmark window → 1.07ms/step average scheduler overhead. ModelRunner.forward averaged 0.66ms/call. Scheduler overhead was 62% of total per-token latency.
sglang.latency.time_to_first_token span counter: 164,006,022 ns. sglang.latency.time_in_model_prefill: 161,486,041 ns - prefill time was almost entirely model forward; queue overhead ~2.5ms.
sglang.inter_token_latency_seconds: 1.50ms/token (sequential), rising to 2.08ms/token under 5 concurrent requests.
sglang.cache_hit_rate: 0.0 - cold start, no prefix reuse yet.
GPU compute kernels: attention_prefill averaged 7.5μs/call (94,552 calls), gemm_cublas averaged 1.6μs/call (40,646 calls). At 664 tok/s the H100 was running at ~0.07% of its BF16 peak - the model is far too small to saturate the GPU. The bottleneck is CPU-side scheduling overhead, not GPU compute.
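The overhead and utilization figures above follow from simple arithmetic; the H100 BF16 dense peak (~990 TFLOPS) and the ~2 FLOPs-per-parameter-per-token estimate are assumptions for the sketch, not values from the telemetry:

```python
# Scheduler overhead per decode step: 435 ms over 405 Scheduler.run_batch calls.
sched_ms = 435 / 405                                  # ~1.07 ms/step
forward_ms = 0.66                                     # ModelRunner.forward average
overhead_share = sched_ms / (sched_ms + forward_ms)   # ~0.62, i.e. ~62% of per-token latency

# GPU utilization: 0.5B params * ~2 FLOPs/param/token at 664 tok/s,
# against an assumed ~990 TFLOPS BF16 dense peak for the H100.
achieved_flops = 664 * 2 * 0.5e9                      # ~0.66 TFLOPS achieved
utilization = achieved_flops / 990e12                 # ~0.0007, i.e. ~0.07% of peak
```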
Next config: --num-continuous-decode-steps 8 to run 8 decode steps per scheduler round-trip (amortizing 1ms overhead over 8 tokens), --enable-mixed-chunk to overlap chunked prefill with ongoing decode, --schedule-policy lpm for longest-prefix-match scheduling to maximize radix cache reuse.
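Combined with the launch command shown earlier, the next iteration's startup would look roughly like:

```shell
graphsignal-run python3 -m sglang.launch_server \
  --model-path $MODEL_ID --port 8000 \
  --num-continuous-decode-steps 8 \
  --enable-mixed-chunk \
  --schedule-policy lpm
```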


Configuration: adds --num-continuous-decode-steps 8 --enable-mixed-chunk --schedule-policy lpm
Results (server-side, sequential requests):
Telemetry findings:
sglang.gen_throughput: 729.5 tok/s - up from 650–675 tok/s at baseline (+9% decode throughput).
sglang.inter_token_latency_seconds: 1.42ms/token - down from 1.50ms at baseline (−5%).
Radix cache now active. Server logs: Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 30 on the second and later requests for the same prompt - 30/31 tokens served from the radix cache with schedule-policy lpm. Subsequent TTFT effectively collapsed to a single-token prefill (~5ms server-side).
Span data confirmed: warmup span TTFT=116ms (20 tokens, 0 cached); next span TTFT=23ms (20 tokens, 14 cached). Prefix reuse compounds across requests.
The per-token cost breakdown did not change structurally. Model forward remained ~0.66ms; scheduler overhead per effective token dropped due to 8 continuous steps, but the absolute scheduler cost per pass stayed similar. GPU utilization remained near 0.07% - the model is too small to stress the H100.
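As a sanity check on the deltas above, computed against the iteration-1 baseline range:

```python
# Decode throughput gain: 729.5 tok/s vs the 650-675 tok/s baseline range.
tps_gain_low = 729.5 / 675 - 1    # ~+8% vs the best baseline sample
tps_gain_high = 729.5 / 650 - 1   # ~+12% vs the worst

# Inter-token latency: 1.42 ms vs 1.50 ms at baseline.
itl_drop = 1 - 1.42 / 1.50        # ~0.053, i.e. ~5% lower

# Radix cache hit on repeated prompts: 30 of 31 prompt tokens cached.
cached_fraction = 30 / 31         # ~0.97
```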
Next config: --speculative-algorithm ngram --speculative-num-draft-tokens 3 --speculative-num-steps 3 to generate multiple tokens per forward pass without a draft model, --num-continuous-decode-steps 16 to further reduce per-token scheduler cost.
Most observability tooling is built assuming a human reads the output. Graphsignal flips that: the telemetry is a structured feedback context consumed by an agent, which uses it to generate the next configuration change.
The loop does not require the agent to know the right answer up front. It requires the telemetry to be specific enough that the agent can identify what changed, what the effect was, and what to try next. That is a different bar than dashboards built for human monitoring - and it is one that inference-time profiling data meets well.
The code is at github.com/graphsignal/autodebug.