CUDA Profiling and Monitoring
See the Quick Start guide on how to install Graphsignal.
Graphsignal profiles CUDA workloads via NVIDIA’s CUPTI activity API and observes NVIDIA GPUs via NVML — no code instrumentation required. Attach the profiler to any CUDA program (PyTorch, vLLM, SGLang, TensorRT-LLM, raw CUDA, custom Triton kernels) and the GPU side comes through automatically.
What’s captured
- GPU kernel profiles (CUPTI activity records, low overhead): per-kernel timelines with cumulative time, call count, peak concurrency, and theoretical occupancy for the top kernels by GPU time, plus a higher-level view that aggregates all kernels (including the long tail) by activity type: attention, matrix multiply, communication, KV cache, quantization, normalization, activation, sampling, embedding, memory movement.
- Memory transfers: host↔device, device↔device, and peer-GPU copies (memcpy) and device-memory fills (memset), reported as cumulative time, call count, and bytes per direction.
- Synchronization: cudaDeviceSynchronize / cudaStreamSynchronize / cudaEventSynchronize cumulative blocking time.
- GPU metrics: utilization, memory (used/free/total/reserved), temperature, power, clock speeds, PCIe and NVLink throughput/utilization/errors, ECC errors, and XID error events.
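To make the per-kernel figures in the list above concrete, here is a minimal sketch of how cumulative time, call count, and peak concurrency could be derived from kernel intervals. It is an illustration only, not Graphsignal's implementation: `KernelRecord`, `peak_concurrency`, and `summarize` are hypothetical names, the record is a simplified stand-in for a CUPTI kernel activity record, and peak concurrency is read here as the largest number of overlapping launches of the same kernel, which is one plausible interpretation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelRecord:
    """Hypothetical, simplified stand-in for a CUPTI kernel activity record."""
    name: str
    start_ns: int
    end_ns: int


def peak_concurrency(records):
    """Largest number of intervals that overlap at any instant (sweep line)."""
    events = []
    for r in records:
        events.append((r.start_ns, 1))   # kernel starts
        events.append((r.end_ns, -1))    # kernel ends
    # At equal timestamps, process ends (-1) before starts (+1) so that
    # back-to-back kernels are not counted as overlapping.
    events.sort()
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


def summarize(records):
    """Group records by kernel name and compute the per-kernel figures."""
    by_name = defaultdict(list)
    for r in records:
        by_name[r.name].append(r)
    return {
        name: {
            "cumulative_ns": sum(r.end_ns - r.start_ns for r in recs),
            "calls": len(recs),
            "peak_concurrency": peak_concurrency(recs),
        }
        for name, recs in by_name.items()
    }


# Made-up sample data: two overlapping "gemm" launches, then a third one.
records = [
    KernelRecord("gemm", 0, 100),
    KernelRecord("gemm", 50, 150),
    KernelRecord("gemm", 150, 200),
    KernelRecord("softmax", 20, 40),
]
summary = summarize(records)
for name, stats in summary.items():
    print(name, stats)
```

Ranking the result by `cumulative_ns` would give a "top kernels by GPU time" view; grouping by an activity label instead of the kernel name would give the aggregated view.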
How it works
The Graphsignal profiler runs as a sidecar process, never co-located with CUDA. It collects CUDA kernel activity via CUPTI from the workload and processes it externally in the sidecar. GPU metrics are collected via NVML by the sidecar.
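For readers unfamiliar with NVML, the sketch below shows roughly what reading a few of these metrics looks like from a separate process. It is not Graphsignal's code. It assumes the `pynvml` bindings (the `nvidia-ml-py` package) and an NVIDIA driver are present; `sample_gpu` and `main` are hypothetical helper names. The NVML module is passed in as a parameter so the function can also be exercised with a stub on a machine without a GPU.

```python
def sample_gpu(nvml, index=0):
    """Read a handful of metrics for one GPU through an NVML-like module."""
    handle = nvml.nvmlDeviceGetHandleByIndex(index)
    util = nvml.nvmlDeviceGetUtilizationRates(handle)   # percentages
    mem = nvml.nvmlDeviceGetMemoryInfo(handle)          # bytes
    return {
        "gpu_util_pct": util.gpu,
        "mem_used_bytes": mem.used,
        "mem_free_bytes": mem.free,
        "mem_total_bytes": mem.total,
        "temperature_c": nvml.nvmlDeviceGetTemperature(
            handle, nvml.NVML_TEMPERATURE_GPU
        ),
        # NVML reports power draw in milliwatts.
        "power_w": nvml.nvmlDeviceGetPowerUsage(handle) / 1000.0,
    }


def main():
    """Run on a GPU host; assumes pynvml (nvidia-ml-py) is installed."""
    import pynvml

    pynvml.nvmlInit()
    try:
        print(sample_gpu(pynvml))
    finally:
        pynvml.nvmlShutdown()
```

Calling `main()` on a GPU host prints one sample for device 0. Because NVML talks to the driver rather than to the application, a monitoring process can poll it without touching the workload.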
Integration
Set GRAPHSIGNAL_API_KEY and wrap your launch command with graphsignal-run:
    export GRAPHSIGNAL_API_KEY=<my-api-key>
    graphsignal-run <my-app>

CUPTI activity collection and NVML metrics start automatically once a CUDA context exists in the workload. See the Quick Start for details and the Profiler CLI reference for options.
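If the workload is started from a Python launcher script rather than a shell, the same wrapping can be sketched as below. Only `graphsignal-run` and `GRAPHSIGNAL_API_KEY` come from this page; `build_launch_command` and `launch` are hypothetical helpers, and `my_cuda_app.py` is a placeholder application.

```python
import os
import shlex
import subprocess


def build_launch_command(app_command):
    """Prefix an application command (list of arguments) with graphsignal-run."""
    if not app_command:
        raise ValueError("app_command must not be empty")
    return ["graphsignal-run", *app_command]


def launch(app_command, env=None):
    """Start the wrapped command, failing early if the API key is missing."""
    env = dict(os.environ if env is None else env)
    if not env.get("GRAPHSIGNAL_API_KEY"):
        raise RuntimeError("GRAPHSIGNAL_API_KEY is not set")
    completed = subprocess.run(build_launch_command(app_command), env=env)
    return completed.returncode


# Show the command that would be executed, without running it.
command = build_launch_command(["python", "my_cuda_app.py"])
print(shlex.join(command))
```

Checking for the key before spawning the process gives a clear error in the launcher rather than a failure inside the wrapped run.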
CUDA graph tracing
Workloads that replay CUDA graphs (common in inference engines) can be traced at two granularities via --cuda-graph-trace or graphsignal.watch(cuda_graph_trace=...):
- graph (default): each graph launch is reported as one aggregated cuda.graph event. Lower CUPTI overhead.
- node: individual graph node activities (kernels, memory copies, etc.) appear as separate cuda.kernel / cuda.memcpy events. Higher CUPTI overhead; use when you need node-level visibility inside captured graphs.
From the command line:

    graphsignal-run --cuda-graph-trace node python my_cuda_app.py

From Python:

    graphsignal.watch(cuda_graph_trace='node')
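The difference between the two granularities can be pictured with a small sketch that collapses node-level events into one event per graph launch. The event type names `cuda.graph`, `cuda.kernel`, and `cuda.memcpy` are taken from this page; the record fields (`graph_launch`, `start_ns`, `end_ns`, `nodes`) and the `roll_up_to_graph_events` helper are assumptions made for illustration and do not describe Graphsignal's actual data model.

```python
# Hypothetical node-level events, as they might look in "node" mode:
# two replays of a captured graph, each with one kernel and one copy.
node_events = [
    {"type": "cuda.kernel", "graph_launch": 1, "start_ns": 0, "end_ns": 400},
    {"type": "cuda.memcpy", "graph_launch": 1, "start_ns": 400, "end_ns": 500},
    {"type": "cuda.kernel", "graph_launch": 2, "start_ns": 1000, "end_ns": 1380},
    {"type": "cuda.memcpy", "graph_launch": 2, "start_ns": 1380, "end_ns": 1500},
]


def roll_up_to_graph_events(events):
    """Collapse node events into one cuda.graph event per graph launch."""
    launches = {}
    for e in events:
        g = launches.setdefault(
            e["graph_launch"],
            {
                "type": "cuda.graph",
                "graph_launch": e["graph_launch"],
                "start_ns": e["start_ns"],
                "end_ns": e["end_ns"],
                "nodes": 0,
            },
        )
        # The graph event spans from the earliest node start to the latest end.
        g["start_ns"] = min(g["start_ns"], e["start_ns"])
        g["end_ns"] = max(g["end_ns"], e["end_ns"])
        g["nodes"] += 1
    return [launches[k] for k in sorted(launches)]


graph_events = roll_up_to_graph_events(node_events)
for event in graph_events:
    print(event)
```

The graph-level view keeps one record per launch, which is why it is cheaper to collect; the node-level view keeps every record, which is what allows a slow kernel inside a captured graph to be identified.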