autodebug: Telemetry-Driven Inference Optimization Loop
By Dmitri Melikyan

An autonomous agent that deploys inference services, collects telemetry, and continuously redeploys with better configurations - indefinitely.

autodebug is an autonomous loop that deploys an inference service, benchmarks it, fetches profiling telemetry, analyzes the results, and redeploys with an updated configuration. Then repeats.

While the current version focuses on configuration tuning, the same loop can be extended to drive modifications to inference engine code and custom CUDA kernels. The boundary between tuning a config and changing the code is just a matter of what tools the agent has access to.

How it works

The agent - Claude Code - runs this cycle without human involvement between iterations:

  1. Benchmark the running endpoint (sequential and parallel requests, varying prompt length)
  2. Check service logs for errors
  3. Fetch telemetry from Graphsignal for the benchmark window
  4. Identify bottlenecks in TTFT (time to first token), TBOT (time between output tokens), TTLT (time to last token), and throughput
  5. Write a session log with findings and planned changes
  6. Write a new dstack configuration with the optimizations applied
  7. Deploy and wait for the service to come up
  8. Repeat
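The cycle above can be sketched as a small driver. This is an illustrative sketch, not the actual agent code: the telemetry dict shape, the function names, and the flag-selection thresholds are assumptions, and in autodebug the analysis and planning steps are performed by Claude Code reading the fetched debug context.

```python
import datetime

def benchmark_window(minutes=2):
    # Steps 1-3: record the exact window so telemetry can be fetched
    # for the benchmark interval and nothing else.
    start = datetime.datetime.now(datetime.timezone.utc)
    end = start + datetime.timedelta(minutes=minutes)
    return start.isoformat(), end.isoformat()

def plan_changes(telemetry):
    # Steps 4-6: map observed signals to candidate SGLang flags.
    # The rules below are illustrative, not the agent's actual reasoning.
    flags = []
    if telemetry.get("scheduler_ms_per_step", 0) > telemetry.get("forward_ms_per_call", 0):
        # Scheduler round-trip costs more than the forward pass itself:
        # amortize it over several decode steps.
        flags.append("--num-continuous-decode-steps 8")
    if telemetry.get("cache_hit_rate", 1.0) == 0.0:
        # No prefix reuse observed: schedule for longest-prefix-match.
        flags.append("--schedule-policy lpm")
    return flags
```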

Each session log cites specific signals from the telemetry - profiled function names, span counter values, startup attributes - so the reasoning is traceable. Each deployed config is saved in sessions/ as a version trail.

Inspired by autoresearch

The pattern comes from autoresearch: autonomous agent loops that iterate continuously toward a goal rather than answering once and stopping. autoresearch applies this to knowledge work. autodebug applies it to infrastructure configuration.

The value compounds across iterations. The first pass finds the obvious issues - features not enabled, defaults that don't fit the workload. Later passes find the non-obvious ones: bottlenecks that only appear under specific load, parameter interactions, optimizations that depend on earlier changes. A human doing this manually tends to stop after one or two rounds. The agent doesn't.

Debug toolset

Graphsignal instruments the inference server and collects profiling data, traces, metrics, and errors at inference-time granularity - per-function profiles for SGLang, vLLM, PyTorch, and CUDA. After each benchmark, the agent fetches telemetry for the exact benchmark window and reads the raw signals directly: Scheduler.event_loop_overlap call counts, sglang.inter_token_latency_seconds, sglang.startup.* attributes. Integration is a wrapper around the server startup command:

graphsignal-run python3 -m sglang.launch_server --model-path $MODEL_ID --port 8000

graphsignal-debug is a CLI that fetches debug context from Graphsignal's API for a time range:

graphsignal-debug fetch --start 2026-03-24T20:40:00Z --end 2026-03-24T20:42:00Z

The output - profiles, spans, metrics, errors - is what the agent reads each iteration. A skill file tells Claude Code how to call the CLI and how to map the returned signals to SGLang and vLLM internals.

dstack provisions the GPU instance and deploys the service. Each iteration writes a new sessions/dstack-<ISO>.yml and applies it:

dstack apply -f sessions/dstack-<ISO>.yml -y -d

The -d flag submits without blocking. dstack handles instance provisioning, container pull, model cache volume, and gateway routing.
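The write-then-apply step can be sketched as follows. The helper names and the use of subprocess are assumptions for illustration; only the dstack apply invocation itself comes from the text.

```python
import datetime
import subprocess

def session_config_path(now=None):
    # One timestamped config file per iteration - the version trail
    # kept in sessions/.
    now = now or datetime.datetime.now(datetime.timezone.utc)
    iso = now.strftime("%Y-%m-%dT%H-%M-%SZ")
    return f"sessions/dstack-{iso}.yml"

def deploy(path):
    # Auto-confirm (-y) and submit without blocking (-d), as in the text.
    subprocess.run(["dstack", "apply", "-f", path, "-y", "-d"], check=True)
```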

The loop in practice

The agent runs indefinitely. The two iterations below illustrate how the loop works - the findings and config changes it produces at each step - not the full run. The setup is Qwen1.5-0.5B-Chat on SGLang, deployed on an H100 80GB via dstack.

Iteration 1 - baseline

Configuration: lmsysorg/sglang:latest, Qwen1.5-0.5B-Chat, default SGLang flags

Results (server-side, 31 prompt / 150 completion tokens):

  • TTFT: 164ms
  • TBOT: 1.51ms/token
  • TTLT: 390ms
  • Decode throughput: 664 tok/s
  • Errors: 0
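The latency metrics above reduce to simple arithmetic over per-token emission timestamps. A minimal sketch, assuming timestamps are measured in seconds from request start (the function name is illustrative):

```python
def latency_metrics(token_times):
    # token_times: seconds from request send to each emitted token.
    ttft = token_times[0]                       # time to first token
    ttlt = token_times[-1]                      # time to last token
    decode_tokens = len(token_times) - 1
    tbot = (ttlt - ttft) / decode_tokens        # time between output tokens
    throughput = decode_tokens / (ttlt - ttft)  # decode tok/s
    return {"ttft": ttft, "tbot": tbot, "ttlt": ttlt, "throughput": throughput}
```

With 150 completion tokens, a 164ms TTFT and 1.51ms TBOT imply roughly the 390ms TTLT reported above (0.164 + 149 × 0.00151 ≈ 0.389s).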

Telemetry findings:

sglang.startup.* attributes confirmed: enable_mixed_chunk=False, num_continuous_decode_steps=1, attention_backend=fa3.

Scheduler.run_batch profile: 405 calls / 435ms during the active benchmark window → 1.07ms/step average scheduler overhead. ModelRunner.forward averaged 0.66ms/call. Scheduler overhead was 62% of total per-token latency.

sglang.latency.time_to_first_token span counter: 164,006,022 ns. sglang.latency.time_in_model_prefill: 161,486,041 ns - prefill time was almost entirely model forward; queue overhead ~2.5ms.

sglang.inter_token_latency_seconds: 1.50ms/token (sequential), rising to 2.08ms/token under 5 concurrent requests.

sglang.cache_hit_rate: 0.0 - cold start, no prefix reuse yet.

GPU compute kernels: attention_prefill averaged 7.5μs/call (94,552 calls), gemm_cublas averaged 1.6μs/call (40,646 calls). At 664 tok/s the H100 was running at ~0.07% of its BF16 peak - the model is far too small to saturate the GPU. The bottleneck is CPU-side scheduling overhead, not GPU compute.
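The per-step arithmetic in these findings can be checked directly from the cited counters; a quick reproduction, assuming the figures are the simple ratios the text suggests:

```python
# Scheduler.run_batch: 405 calls over 435ms in the benchmark window.
run_batch_ms_per_step = 435 / 405          # ≈ 1.07ms average per step
forward_ms_per_call = 0.66                 # ModelRunner.forward average

# Prefill: TTFT vs time actually spent in the model.
ttft_ns = 164_006_022
prefill_model_ns = 161_486_041
queue_overhead_ms = (ttft_ns - prefill_model_ns) / 1e6  # ≈ 2.5ms
```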

Next config: --num-continuous-decode-steps 8 to run 8 decode steps per scheduler round-trip (amortizing 1ms overhead over 8 tokens), --enable-mixed-chunk to overlap chunked prefill with ongoing decode, --schedule-policy lpm for longest-prefix-match scheduling to maximize radix cache reuse.

[Screenshot: Graphsignal debug context fetched by the agent after iteration 1]


Iteration 2 - continuous decode steps + mixed chunk + LPM scheduling

[Screenshot: dstack apply deploying iteration 2]

Configuration: adds --num-continuous-decode-steps 8 --enable-mixed-chunk --schedule-policy lpm

Results (server-side, sequential requests):

  • TTFT: ~5ms on requests with 30/31 tokens cached; 23ms on partial cache hit
  • TBOT: 1.42ms/token
  • Decode throughput: 729 tok/s
  • Errors: 0

Telemetry findings:

sglang.gen_throughput: 729.5 tok/s - up from 650–675 tok/s at baseline (+9% decode throughput).

sglang.inter_token_latency_seconds: 1.42ms/token - down from 1.50ms at baseline (−5%).
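The percentage deltas quoted here follow from the raw counters; a quick check, taking the iteration-1 point values as the baseline:

```python
baseline_tput, new_tput = 664.0, 729.5   # tok/s, iteration 1 vs iteration 2
baseline_tbot, new_tbot = 1.50, 1.42     # ms/token

tput_gain = (new_tput - baseline_tput) / baseline_tput  # ≈ +9.9%
tbot_drop = (baseline_tbot - new_tbot) / baseline_tbot  # ≈ -5.3%
```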

Radix cache now active. Server logs: Prefill batch, #new-seq: 1, #new-token: 1, #cached-token: 30 on the second and later requests for the same prompt - 30/31 tokens served from the radix cache with schedule-policy lpm. Subsequent TTFT effectively collapsed to a single-token prefill (~5ms server-side).

Span data confirmed: warmup span TTFT=116ms (20 tokens, 0 cached); next span TTFT=23ms (20 tokens, 14 cached). Prefix reuse compounds across requests.

The per-token cost breakdown did not change structurally. Model forward remained ~0.66ms; scheduler overhead per effective token dropped due to 8 continuous steps, but the absolute scheduler cost per pass stayed similar. GPU utilization remained near 0.07% - the model is too small to stress the H100.

Next config: --speculative-algorithm ngram --speculative-num-draft-tokens 3 --speculative-num-steps 3 to generate multiple tokens per forward pass without a draft model, --num-continuous-decode-steps 16 to further reduce per-token scheduler cost.
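The expected payoff of ngram speculation can be estimated with the standard speculative-decoding expectation. This is a back-of-envelope sketch under an assumed per-token acceptance rate, not a prediction derived from the telemetry:

```python
def expected_tokens_per_pass(accept_rate, draft_tokens):
    # With k draft tokens and an (assumed independent) per-token acceptance
    # probability a, the expected tokens committed per verification pass is
    # 1 + a + a^2 + ... + a^k, where the leading 1 is the verified token.
    return sum(accept_rate ** i for i in range(draft_tokens + 1))

# At an assumed 50% acceptance with 3 draft tokens:
# 1 + 0.5 + 0.25 + 0.125 = 1.875 tokens per forward pass, so per-token
# model time would drop from ~0.66ms toward 0.66 / 1.875 ≈ 0.35ms.
```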


Observability as a feedback loop

Most observability tooling is built assuming a human reads the output. Graphsignal flips that: the telemetry is a structured feedback context consumed by an agent, which uses it to generate the next configuration change.

The loop does not require the agent to know the right answer up front. It requires the telemetry to be specific enough that the agent can identify what changed, what the effect was, and what to try next. That is a different bar than dashboards built for human monitoring - and it is one that inference-time profiling data meets well.

The code is at github.com/graphsignal/autodebug.