AI Debugging and Optimization for Production Inference
By Dmitri Melikyan

A practical workflow to debug production inference issues and optimize performance using Claude Code and Graphsignal debug context.

Production inference systems fail in ways that are hard to debug with generic logs and dashboards. The fastest path to root cause is to combine inference observability with an AI coding agent that can investigate the right time window, summarize profiles, errors, and traces, and suggest concrete fixes.

Overview

This is the new way to debug, optimize, and troubleshoot production inference:

  • Use Graphsignal to continuously collect inference-time telemetry (profiles, traces, errors, and metrics).
  • Use your AI coding agent to fetch debug context for the exact incident window.
  • Ask focused production questions in natural language, then act on findings quickly.
  • Repeat as an optimization loop: detect regression, identify bottleneck, ship fix, and validate.

Claude Code

Install

Install the Graphsignal debug skill for Claude Code:

git clone https://github.com/graphsignal/graphsignal-debug ~/.claude/skills/graphsignal-debug

Install and authenticate the CLI:

pip install graphsignal-debug
graphsignal-debug login

This enables Claude Code to run:

graphsignal-debug fetch --start <ISO_UTC> --end <ISO_UTC> --tags "env:prod"
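
The `--start` and `--end` flags take ISO 8601 UTC timestamps. As a minimal sketch (the `fetch_command` helper is an assumption, not part of the Graphsignal CLI), here is one way to build a correctly formatted command for a relative window, e.g. "two hours starting 26 hours ago":

```python
from datetime import datetime, timedelta, timezone

def fetch_command(hours_back: int, duration_hours: int, tags: str = "env:prod") -> str:
    """Build a graphsignal-debug fetch command for a UTC time window.

    hours_back: how many hours ago the window starts.
    duration_hours: length of the window in hours.
    """
    start = datetime.now(timezone.utc) - timedelta(hours=hours_back)
    end = start + timedelta(hours=duration_hours)
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # ISO 8601 with explicit UTC designator
    return (
        f"graphsignal-debug fetch "
        f"--start {start.strftime(fmt)} --end {end.strftime(fmt)} "
        f'--tags "{tags}"'
    )

print(fetch_command(hours_back=26, duration_hours=2))
```

In practice the agent computes these timestamps itself from a prompt like "yesterday 14:00-16:00 UTC"; the sketch just shows the expected timestamp shape.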

Debugging production issues

When there is an incident, ask a direct question tied to a real time window.

Example prompt:

What was the cause of the spike yesterday? Fetch Graphsignal debug context for 14:00-16:00 UTC in production and summarize root cause with evidence.

A typical investigation flow:

  1. Fetch debug context for the incident window.
  2. Isolate the biggest latency/profile deltas against baseline.
  3. Correlate with traces and errors by service/model tags.
  4. Identify the most likely root cause and impacted requests.
  5. Recommend the smallest safe mitigation and a follow-up fix.

Claude Code root-cause analysis output

This approach is faster than manually pivoting across multiple dashboards because the agent can interpret profiles, errors, and traces together in one pass.

Optimizing the inference stack

The same workflow applies to continuous optimization, not just incident response.

Use prompts like:

  • "Fetch Graphsignal context for the last 24 hours and identify top throughput bottlenecks."
  • "Compare p95 latency and TTFT (time to first token) before and after yesterday's deploy."
  • "Find model/tag combinations with worst tokens-per-second efficiency."
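
For the before/after comparison, the underlying calculation is a percentile over two latency samples. A minimal sketch using the nearest-rank p95 definition, with hypothetical latency samples (the numbers are illustrative, not real telemetry):

```python
import math

# Hypothetical request latencies (seconds) sampled before and after a deploy.
before = [0.8, 0.9, 1.1, 1.0, 0.95, 1.2, 0.85, 1.05, 0.9, 2.0]
after = [1.1, 1.3, 1.5, 1.4, 1.25, 1.6, 1.15, 1.45, 1.3, 3.1]

def p95(samples):
    """Nearest-rank 95th percentile: the value at position ceil(0.95 * n)."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

print(f"p95 before: {p95(before):.2f}s, after: {p95(after):.2f}s")
```

The agent does this comparison directly from the fetched context; the point is that a p95 regression is a concrete, checkable number rather than a dashboard impression.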

From those findings, optimize the stack in priority order:

  • Prompt/completion efficiency: reduce unnecessary prompt size and completion length.
  • Model choice: route low-complexity traffic to smaller/faster models.
  • Runtime tuning: tune batching, concurrency, and scheduler parameters.
  • GPU utilization: eliminate idle bubbles and memory-pressure hotspots.
  • Regression control: validate every change against production telemetry.
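
The "model choice" item above can be as simple as a routing rule in front of the inference endpoint. A minimal sketch, assuming hypothetical model names and a crude token-count proxy for prompt complexity (all identifiers here are placeholders, not real model IDs):

```python
# Placeholder model identifiers, not real model IDs.
SMALL_MODEL = "small-fast-model"
LARGE_MODEL = "large-accurate-model"

def route(prompt: str, max_small_tokens: int = 200) -> str:
    """Route low-complexity traffic to the smaller, faster model.

    Uses whitespace-split word count as a crude proxy for token count;
    a real router would use the model's tokenizer and richer signals.
    """
    approx_tokens = len(prompt.split())
    return SMALL_MODEL if approx_tokens <= max_small_tokens else LARGE_MODEL

print(route("summarize this short note"))  # routes to the small model
```

Validating a rule like this against production telemetry (the "regression control" item) closes the loop: the routing threshold becomes a tunable parameter backed by latency and cost data.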

Production inference optimization is not a one-time benchmark exercise. It is an always-on loop powered by observability plus AI-assisted debugging.

See AI Debugging and Quick Start for setup details.