
CUDA Profiling and Monitoring

See the Quick Start guide for instructions on installing Graphsignal.

Graphsignal profiles CUDA workloads via NVIDIA’s CUPTI activity API and monitors NVIDIA GPUs via NVML, with no code instrumentation required. Attach the profiler to any CUDA program (PyTorch, vLLM, SGLang, TensorRT-LLM, raw CUDA, custom Triton kernels) and GPU activity is captured automatically.

  • GPU kernel profiles (CUPTI activity records, low overhead): per-kernel timelines for the top kernels by GPU time, covering cumulative time, call count, peak concurrency, and theoretical occupancy. A higher-level view aggregates all kernels, including the long tail, by activity type: attention, matrix multiply, communication, KV cache, quantization, normalization, activation, sampling, embedding, memory movement.
  • Memory transfers: host↔device, device↔device, and peer-GPU copies (memcpy) and device-memory fills (memset) — cumulative time, call count, and bytes per direction.
  • Synchronization: cumulative blocking time spent in cudaDeviceSynchronize, cudaStreamSynchronize, and cudaEventSynchronize.
  • GPU metrics: utilization, memory (used/free/total/reserved), temperature, power, clock speeds, PCIe and NVLink throughput/utilization/errors, ECC errors, and XID error events.
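
To make the per-kernel statistics listed above concrete, here is a minimal Python sketch of how cumulative time, call count, and peak concurrency could be derived from a list of kernel intervals. It is not Graphsignal's implementation: the `KernelRecord` type, its fields, and the sample data are hypothetical simplifications of what a CUPTI kernel activity record carries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelRecord:
    """Hypothetical, simplified stand-in for one kernel activity record."""
    name: str
    start_ns: int  # timestamp at which the kernel started executing
    end_ns: int    # timestamp at which the kernel finished


def summarize(records):
    """Return {kernel name: (cumulative_ns, call_count)}, largest time first."""
    totals = {}
    for r in records:
        time_ns, calls = totals.get(r.name, (0, 0))
        totals[r.name] = (time_ns + (r.end_ns - r.start_ns), calls + 1)
    return dict(sorted(totals.items(), key=lambda kv: kv[1][0], reverse=True))


def peak_concurrency(records):
    """Largest number of kernels executing at the same instant (sweep line)."""
    events = []
    for r in records:
        events.append((r.start_ns, 1))   # a kernel starts
        events.append((r.end_ns, -1))    # a kernel ends
    # Tuples sort by time first; at equal timestamps -1 sorts before +1,
    # so a kernel ending at t does not overlap one starting at t.
    events.sort()
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


# Made-up sample data for illustration only.
records = [
    KernelRecord("gemm", 0, 100),
    KernelRecord("softmax", 50, 80),
    KernelRecord("gemm", 120, 200),
]
print(summarize(records))
print(peak_concurrency(records))
```

Sorting by cumulative time is what makes a "top kernels by GPU time" view possible; the sweep over start and end events counts overlap without comparing every pair of kernels.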

The Graphsignal profiler runs as a sidecar process, separate from the workload process that uses CUDA. The CUDA driver loads the CUPTI injection library (libgscuptiprof.so) into the workload process at the first CUDA call. The library records activity into shared memory, and the sidecar drains and processes it externally. The sidecar queries NVML directly.
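
The record-then-drain pattern described above can be illustrated with a small ring buffer. The following Python sketch is a generic illustration, not Graphsignal's actual shared-memory layout, which this page does not document. The header format, the record format, and the `RingBuffer` class are all assumptions, and a plain `bytearray` stands in for a shared-memory mapping.

```python
import struct

# Hypothetical layout: a header with two counters, then fixed-size slots.
HEADER = struct.Struct("<QQ")   # records written, records read
RECORD = struct.Struct("<IQQ")  # kernel id, start ns, end ns


class RingBuffer:
    """Single-producer, single-consumer ring buffer over a flat byte buffer.

    Illustration only: a real cross-process buffer would need atomic
    counter updates and memory barriers, which this sketch omits.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = bytearray(HEADER.size + capacity * RECORD.size)

    def _slot_offset(self, index):
        return HEADER.size + (index % self.capacity) * RECORD.size

    def push(self, kernel_id, start_ns, end_ns):
        """Producer side: append one record; return False if the buffer is full."""
        written, read = HEADER.unpack_from(self.buf, 0)
        if written - read == self.capacity:
            return False  # full: the record is dropped rather than blocking
        RECORD.pack_into(self.buf, self._slot_offset(written),
                         kernel_id, start_ns, end_ns)
        HEADER.pack_into(self.buf, 0, written + 1, read)
        return True

    def drain(self):
        """Consumer side: return all unread records and mark them as read."""
        written, read = HEADER.unpack_from(self.buf, 0)
        out = []
        while read < written:
            out.append(RECORD.unpack_from(self.buf, self._slot_offset(read)))
            read += 1
        HEADER.pack_into(self.buf, 0, written, read)
        return out


ring = RingBuffer(capacity=2)
ring.push(1, 0, 100)
ring.push(2, 50, 80)
dropped = not ring.push(3, 120, 200)  # buffer is full, so this push fails
first_batch = ring.drain()
ring.push(3, 120, 200)                # space is available again after draining
second_batch = ring.drain()
```

In this sketch the producer never blocks: when the consumer falls behind, new records are dropped. That is one possible design choice for keeping the profiled process fast, not a statement about how Graphsignal handles a full buffer.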

Set GRAPHSIGNAL_API_KEY and wrap your launch command with graphsignal-run:

export GRAPHSIGNAL_API_KEY=<my-api-key>
graphsignal-run <my-app>

CUPTI activity collection and NVML metrics start automatically once a CUDA context exists in the workload. See the Quick Start for details and the Profiler CLI reference for options.