CUDA Profiling and Monitoring

See the Quick Start guide for instructions on installing Graphsignal.

Graphsignal profiles CUDA workloads through NVIDIA's CUPTI activity API and monitors NVIDIA GPUs through NVML, with no code instrumentation required. Attach the profiler to any CUDA program (PyTorch, vLLM, SGLang, TensorRT-LLM, raw CUDA, custom Triton kernels) and GPU activity and metrics are collected automatically.

What’s captured

GPU kernel profiles (CUPTI activity records, low overhead): per-kernel timelines for the top kernels by GPU time, covering cumulative time, call count, peak concurrency, and theoretical occupancy. A higher-level view aggregates all kernels, including the long tail, by activity type: attention, matrix multiply, communication, KV cache, quantization, normalization, activation, sampling, embedding, memory movement.
Memory transfers: host↔device, device↔device, and peer-GPU copies (memcpy), plus device-memory fills (memset), each reported as cumulative time, call count, and bytes per direction.
Synchronization: cumulative blocking time spent in cudaDeviceSynchronize, cudaStreamSynchronize, and cudaEventSynchronize.
GPU metrics: utilization, memory (used/free/total/reserved), temperature, power, clock speeds, PCIe and NVLink throughput/utilization/errors, ECC errors, and XID error events.

How it works

The Graphsignal profiler runs as a sidecar process, never inside the process that runs CUDA. CUDA kernel activity is collected from the workload via CUPTI and processed externally in the sidecar. The sidecar also collects GPU metrics via NVML.
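
For readers unfamiliar with NVML, the sketch below shows the kind of query a separate monitoring process can make without touching the CUDA workload. It uses the pynvml bindings (distributed as nvidia-ml-py), which is an assumption made here for illustration; the source does not say which bindings the sidecar uses, and the read_gpu_metrics function and its dictionary keys are hypothetical.

```python
def read_gpu_metrics():
    """Return one dict per visible GPU, or [] if NVML is unavailable."""
    try:
        import pynvml  # assumed bindings; installed via nvidia-ml-py
    except ImportError:
        return []
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return []  # no NVIDIA driver or NVML library on this host

    metrics = []
    try:
        for index in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(index)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            metrics.append({
                "index": index,
                "gpu_util_percent": util.gpu,
                "memory_used_bytes": mem.used,
                "memory_total_bytes": mem.total,
                "temperature_c": pynvml.nvmlDeviceGetTemperature(
                    handle, pynvml.NVML_TEMPERATURE_GPU),
                # NVML reports power in milliwatts.
                "power_watts": pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0,
            })
    except pynvml.NVMLError:
        # Some queries are unsupported on some GPUs; keep what was collected.
        pass
    finally:
        pynvml.nvmlShutdown()
    return metrics


gpu_metrics = read_gpu_metrics()
```

Because NVML talks to the driver rather than to a CUDA context, a query like this can run in any process on the host, which fits the sidecar design described above.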

Integration

Set GRAPHSIGNAL_API_KEY and wrap your launch command with graphsignal-run:

export GRAPHSIGNAL_API_KEY=<my-api-key>
graphsignal-run <my-app>

CUPTI activity collection and NVML metrics start automatically once a CUDA context exists in the workload. See the Quick Start for details and the Profiler CLI reference for options.
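
If the workload is started from a Python launcher script rather than a shell, the same two steps can be sketched with the standard library. The build_launch and launch helpers are hypothetical names introduced for this example; only the GRAPHSIGNAL_API_KEY variable and the graphsignal-run command come from the instructions above.

```python
import os
import subprocess


def build_launch(app_args, api_key):
    """Return (command, env) for running app_args under graphsignal-run."""
    command = ["graphsignal-run", *app_args]
    # Copy the current environment and add the API key for the child process.
    env = dict(os.environ, GRAPHSIGNAL_API_KEY=api_key)
    return command, env


def launch(app_args, api_key):
    """Start the wrapped workload; requires graphsignal-run on PATH."""
    command, env = build_launch(app_args, api_key)
    return subprocess.run(command, env=env, check=True)


# Placeholder values, as in the shell example.
command, env = build_launch(["python", "my_cuda_app.py"], "<my-api-key>")
```

Calling launch(["python", "my_cuda_app.py"], "<my-api-key>") would then be equivalent to the two shell lines above, with the key scoped to the child process instead of the whole shell session.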

CUDA graph tracing

Workloads that replay CUDA graphs (common in inference engines) can be traced at two granularities via --cuda-graph-trace or graphsignal.watch(cuda_graph_trace=...):

graph (default): each graph launch is reported as one aggregated cuda.graph event. Lower CUPTI overhead.
node: individual graph node activities (kernels, memory copies, etc.) appear as separate cuda.kernel / cuda.memcpy events. Higher CUPTI overhead; use when you need node-level visibility inside captured graphs.

graphsignal-run --cuda-graph-trace node python my_cuda_app.py

graphsignal.watch(cuda_graph_trace='node')
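
The difference between the two granularities can be pictured with a small Python sketch that collapses node-level events into one event per graph launch. The tuple layout, the launch_id field, and the choice to span from the earliest node start to the latest node end are assumptions made for illustration; the actual event schema and aggregation rules are not described here.

```python
from collections import defaultdict

# Hypothetical node-level events: (launch_id, event_type, start_ns, end_ns)
node_events = [
    (1, "cuda.kernel", 0, 40),
    (1, "cuda.memcpy", 40, 50),
    (1, "cuda.kernel", 45, 70),
    (2, "cuda.kernel", 100, 142),
    (2, "cuda.memcpy", 142, 151),
]


def to_graph_events(events):
    """Collapse node events into one aggregated event per graph launch."""
    grouped = defaultdict(list)
    for launch_id, _event_type, start_ns, end_ns in events:
        grouped[launch_id].append((start_ns, end_ns))
    graph_events = []
    for launch_id in sorted(grouped):
        spans = grouped[launch_id]
        graph_events.append({
            "type": "cuda.graph",
            "launch_id": launch_id,
            "start_ns": min(start for start, _ in spans),
            "end_ns": max(end for _, end in spans),
            "node_count": len(spans),
        })
    return graph_events


graph_events = to_graph_events(node_events)
```

In this picture, node mode corresponds to keeping every entry of node_events, while graph mode keeps only the two aggregated entries, which is why it carries less tracing overhead and less detail.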