autodebug: Telemetry-Driven Inference Optimization Loop
By Dmitri Melikyan

An autonomous agent that deploys inference services, collects telemetry, and continuously redeploys with better configurations — indefinitely.

autodebug is an autonomous loop that deploys an inference service, benchmarks it, fetches profiling telemetry, analyzes the results, and redeploys with an updated configuration. Then repeats.

How it works

The agent — Claude Code — runs this cycle without human involvement between iterations:

  1. Benchmark the running endpoint (sequential and parallel requests, varying prompt length)
  2. Check service logs for errors
  3. Fetch telemetry from Graphsignal for the benchmark window
  4. Identify bottlenecks in TTFT (time to first token), TBOT (time between output tokens), TTLT (time to last token), and throughput
  5. Write a session log with findings and planned changes
  6. Write a new dstack configuration with the optimizations applied
  7. Deploy and wait for the service to come up
  8. Repeat
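
Step 4's latency metrics can all be derived from the per-token arrival timestamps collected in step 1. A minimal sketch — illustrative, not the project's actual benchmark code:

```python
def stream_metrics(request_start, token_times):
    """Compute TTFT, mean TBOT, TTLT, and decode throughput from a
    request start time and per-token arrival timestamps (monotonic seconds)."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_start            # time to first token
    ttlt = token_times[-1] - request_start           # time to last token
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tbot = sum(gaps) / len(gaps) if gaps else 0.0    # mean time between output tokens
    decode_time = token_times[-1] - token_times[0]
    tok_per_s = (len(token_times) - 1) / decode_time if decode_time > 0 else 0.0
    return {"ttft": ttft, "tbot": tbot, "ttlt": ttlt, "decode_tok_per_s": tok_per_s}
```

Running the same computation over sequential and parallel request batches gives the numbers reported in the iteration results below.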

Each session log cites specific signals from the telemetry — profiled function names, span counter values, startup attributes — so the reasoning is traceable. Each deployed config is saved in sessions/ as a version trail.

Inspired by autoresearch

The pattern comes from autoresearch: autonomous agent loops that iterate continuously toward a goal rather than answering once and stopping. autoresearch applies this to knowledge work. autodebug applies it to infrastructure configuration.

The value compounds across iterations. The first pass finds the obvious issues — features not enabled, defaults that don't fit the workload. Later passes find the non-obvious ones: bottlenecks that only appear under specific load, parameter interactions, optimizations that depend on earlier changes. A human doing this manually tends to stop after one or two rounds. The agent doesn't.

Stack

Graphsignal instruments the inference server and collects profiling data, traces, metrics, and errors at inference-time granularity — per-function profiles for SGLang, vLLM, PyTorch, and CUDA. After each benchmark, the agent fetches telemetry for the exact benchmark window and reads the raw signals directly: Scheduler.event_loop_overlap call counts, sglang.inter_token_latency_seconds, sglang.startup.* attributes. Integration is a wrapper around the server startup command:

graphsignal-run python3 -m sglang.launch_server --model-path $MODEL_ID --port 8000

graphsignal-debug is a CLI that fetches debug context from Graphsignal's API for a time range:

graphsignal-debug fetch --start 2026-03-24T20:40:00Z --end 2026-03-24T20:42:00Z

The output — profiles, spans, metrics, errors — is what the agent reads each iteration. A skill file tells Claude Code how to call the CLI and how to map the returned signals to SGLang and vLLM internals.
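
What mapping signals to internals can look like in practice: a sketch of telemetry-to-flag reasoning. The dict shape here is hypothetical — the real graphsignal-debug output schema differs — but the decision rules mirror the iteration-1 findings described below.

```python
def suggest_flags(telemetry):
    """Map raw telemetry signals to candidate SGLang flags.
    The input shape is a hypothetical simplification of fetched debug context."""
    suggestions = []
    startup = telemetry.get("startup", {})
    metrics = telemetry.get("metrics", {})
    # One scheduler pass per decode step -> batch decode steps together.
    if startup.get("num_continuous_decode_steps", 1) == 1 and \
       metrics.get("scheduler_passes_per_s", 0) > 10_000:
        suggestions.append("--num-continuous-decode-steps 5")
    if not startup.get("enable_mixed_chunk", False):
        suggestions.append("--enable-mixed-chunk")
    # KV cache allocation far out of proportion to model size -> shrink it.
    if metrics.get("kv_cache_gb", 0) > 10 * metrics.get("weights_gb", float("inf")):
        suggestions.append("--mem-fraction-static 0.5")
    return suggestions
```

The agent's actual reasoning is free-form, but each rule of this kind is only as good as the specificity of the underlying signal — which is why the skill file maps signal names to server internals.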

dstack provisions the GPU instance and deploys the service. Each iteration writes a new sessions/dstack-<ISO>.yml and applies it:

dstack apply -f sessions/dstack-<ISO>.yml -y -d

The -y flag auto-confirms the plan and -d detaches, so the command returns without blocking. dstack handles instance provisioning, container pull, model cache volume, and gateway routing.
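
A configuration along these lines ties the pieces together. This is an illustrative sketch following dstack's service schema — the service name and env entries are assumptions, and the real files in sessions/ are the source of truth:

```yaml
# Illustrative sessions/dstack-<ISO>.yml sketch (not a verified file).
type: service
name: sglang-qwen
image: lmsysorg/sglang:latest
env:
  - MODEL_ID=Qwen/Qwen1.5-0.5B-Chat
  - GRAPHSIGNAL_API_KEY
commands:
  - graphsignal-run python3 -m sglang.launch_server --model-path $MODEL_ID --port 8000
port: 8000
resources:
  gpu: H100
```

Each iteration writes a new copy with the flags under test appended to the launch command, so the sessions/ directory doubles as the experiment history.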

Two iterations on Qwen1.5-0.5B-Chat / H100

Iteration 1 — baseline

Configuration: lmsysorg/sglang:latest, default flags, mem_fraction_static=0.835

Results:

  • TTFT: 1.49s (fresh), 0.65s (prefix cache hit)
  • TTLT: 1.61s for ~500 output tokens
  • Decode throughput: 500–555 tok/s at bs=1, 2,802 tok/s at bs=7
  • Errors: 0

Telemetry findings:

sglang.startup.* attributes: enable_mixed_chunk=False, num_continuous_decode_steps=1.

Scheduler.event_loop_overlap profile: ~26,000 passes/s. With num_continuous_decode_steps=1, the scheduler re-entered the event loop after every decode step — ~26,000 times per second of decode. Each pass added overhead on top of GPU time.

sglang.inter_token_latency_seconds: 1.8–2.7ms/token.

sglang.e2e_request_latency_seconds: 2.056s for a fresh request, 0.534s with prefix cache hit.

KV cache: 705,734 tokens allocated (64.6 GB) for a model with 0.91 GB of weights.
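
The 64.6 GB figure checks out against back-of-envelope KV cache math, assuming Qwen1.5-0.5B-Chat's published shape (24 layers, 16 KV heads of head dim 64) and a BF16 cache:

```python
# Back-of-envelope check of the KV cache allocation.
# Architecture assumptions: 24 layers, 16 KV heads, head dim 64, BF16 values.
layers, kv_heads, head_dim = 24, 16, 64
bytes_per_value = 2                          # BF16
# Per token: one K and one V vector per layer per KV head.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
tokens = 705_734
total_gib = tokens * kv_bytes_per_token / 2**30
print(f"{kv_bytes_per_token} B/token -> {total_gib:.1f} GiB")  # ~96 KiB/token, ~64.6 GiB
```

Roughly 96 KiB of cache per token for a model whose weights fit in under a gigabyte — which is why the agent's next config cuts mem_fraction_static.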

Next config: --quantization fp8 --enable-mixed-chunk --num-continuous-decode-steps 5 --mem-fraction-static 0.5

Graphsignal debug context fetched by the agent after iteration 1


Iteration 2 — FP8 + mixed chunk + reduced KV allocation

dstack apply deploying iteration 2

Configuration: adds --quantization fp8 --enable-mixed-chunk --num-continuous-decode-steps 5 --mem-fraction-static 0.5

Results:

  • TTFT: 1.16s (fresh), 0.64s (cached) — 22% improvement on fresh requests
  • TTLT: 1.66s for ~500 output tokens
  • Decode throughput: 488–539 tok/s (bs=1)
  • Parallel TTFT (8 concurrent): 1.53s, down from 1.62s at baseline
  • Errors: 0

Telemetry findings:

Scheduler.event_loop_overlap call count: ~93/s, down from ~26,000/s — 286× reduction. num_continuous_decode_steps=5 runs 5 decode steps per scheduler pass instead of 1.

CUDA graph capture: 1.9s, down from 21.6s at baseline — FP8 forward passes are faster, so each batch size is captured faster.

Despite FP8 being active (sglang.startup.quantization=fp8), bs=1 decode throughput did not change. With scheduler overhead eliminated, the remaining constraint is KV cache memory bandwidth — the KV cache is still BF16 (kv_cache_dtype=bfloat16).

KV cache reduced to 418,247 tokens; available GPU memory increased from 11.9 GB to 38.3 GB.

Next config: --kv-cache-dtype fp8_e5m2 to halve KV bandwidth, --speculative-algorithm NGRAM --speculative-num-steps 3 for token drafting without a separate draft model.


Observability as a feedback loop

Most observability tooling is built assuming a human reads the output. autodebug flips that: the telemetry is a structured feedback signal consumed by an agent, which uses it to generate the next configuration change.

The loop does not require the agent to know the right answer up front. It requires the telemetry to be specific enough that the agent can identify what changed, what the effect was, and what to try next. That is a different bar than dashboards built for human monitoring — and it is one that inference-time profiling data meets well.
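
Stripped of the tooling, the loop itself is a short function. A sketch with each tool abstracted as a callable — the names are illustrative, not the repo's API:

```python
def autodebug_iteration(benchmark, fetch_telemetry, analyze, log, deploy, wait):
    """One pass of the loop. The callables stand in for the real tools:
    the benchmark harness, graphsignal-debug, the agent's analysis step,
    the session log writer, and dstack apply."""
    window = benchmark()                         # benchmark; note the time window
    telemetry = fetch_telemetry(window)          # fetch debug context for that window
    findings, next_config = analyze(telemetry)   # identify bottlenecks, plan changes
    log(findings)                                # session log with traceable reasoning
    deploy(next_config)                          # write and apply the new config
    wait()                                       # wait for the service to come up
    return next_config                           # repeat against the new baseline
```

Everything interesting happens inside analyze — but the loop only works because fetch_telemetry returns signals specific enough to reason over.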

The code is at github.com/graphsignal/autodebug.