PyTorch Profiling
See the Quick Start guide on how to install Graphsignal.
Graphsignal profiles PyTorch workloads at the CUDA kernel level via CUPTI activity records — see CUDA Profiling and Monitoring for the full list of what’s captured (kernel timelines, activity groups, memcpy/memset/sync, and NVML metrics). There is no PyTorch-specific instrumentation: PyTorch is a CUDA application like any other, and the profiler treats it as such.
What you’ll see
PyTorch workloads predominantly emit kernels from cuBLAS, cuDNN, Triton, Flash Attention, and NCCL libraries. The per-kernel view surfaces these by raw GPU kernel name (e.g. sm80_xmma_gemm_f16f16_*, flash_attn_fwd_kernel, ncclAllReduceRingLLKernel_*). A higher-level view aggregates them by activity type (attention, matrix multiply, communication, KV cache, normalization, activation, and so on), which is usually the right level for diagnosing where PyTorch inference time is going before drilling into individual kernels.
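To make the two views concrete, here is a minimal sketch of how raw kernel names could be rolled up into activity types. The patterns, the function names (classify_kernel, aggregate_by_activity), and the sample durations are all hypothetical and chosen for illustration; they are not Graphsignal's actual grouping rules.

```python
import re
from collections import defaultdict

# Hypothetical name patterns, checked in order. These are illustrative
# assumptions, not the rules the profiler itself applies.
ACTIVITY_PATTERNS = [
    ("attention", re.compile(r"flash_attn|fmha", re.IGNORECASE)),
    ("communication", re.compile(r"^nccl", re.IGNORECASE)),
    ("matrix multiply", re.compile(r"gemm", re.IGNORECASE)),
]


def classify_kernel(name):
    """Map a raw GPU kernel name to a coarse activity type."""
    for activity, pattern in ACTIVITY_PATTERNS:
        if pattern.search(name):
            return activity
    return "other"


def aggregate_by_activity(records):
    """Sum durations per activity type.

    records: iterable of (kernel_name, duration_in_microseconds) pairs.
    """
    totals = defaultdict(float)
    for name, duration_us in records:
        totals[classify_kernel(name)] += duration_us
    return dict(totals)


# Made-up sample records; the name suffixes and durations are placeholders.
sample = [
    ("flash_attn_fwd_kernel", 10.0),
    ("sm80_xmma_gemm_f16f16_example", 5.0),
    ("ncclAllReduceRingLLKernel_example", 2.5),
]
totals = aggregate_by_activity(sample)
```

Starting from the aggregated totals and only then inspecting the individual kernels inside the largest group mirrors the workflow described above.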
Integration
Set GRAPHSIGNAL_API_KEY and wrap your launch command with graphsignal-run:
export GRAPHSIGNAL_API_KEY=<my-api-key>
graphsignal-run python my_app.py
For applications that bootstrap themselves, call graphsignal.watch() once during startup, before any CUDA work happens, then import and use PyTorch normally:
import graphsignal
graphsignal.watch()
import torch
# ... your PyTorch code ...
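The startup ordering can be sketched as a small entry point. This is an illustration under stated assumptions: require_api_key and main are hypothetical names, and the sketch assumes graphsignal.watch() picks up the key from the same GRAPHSIGNAL_API_KEY environment variable used with graphsignal-run, which this page does not state explicitly.

```python
import os


def require_api_key(env=None):
    """Return the API key from the environment, or fail early with a clear error.

    Hypothetical helper; assumes the key is supplied via GRAPHSIGNAL_API_KEY.
    """
    env = os.environ if env is None else env
    key = env.get("GRAPHSIGNAL_API_KEY", "").strip()
    if not key:
        raise RuntimeError("GRAPHSIGNAL_API_KEY is not set")
    return key


def main():
    require_api_key()

    # Start the profiler first, before anything touches CUDA.
    import graphsignal
    graphsignal.watch()

    # Only now import PyTorch and issue GPU work.
    import torch
    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(1024, 1024, device=device)
    y = x @ x  # on a GPU this launches a matrix-multiply kernel
    if device == "cuda":
        torch.cuda.synchronize()  # wait for the kernel to finish
    return y.shape
```

Call main() from your application's startup path. Deferring the torch import until after graphsignal.watch() keeps the ordering explicit even if other modules are imported at the top of the file.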