PyTorch Profiling

See the Quick Start guide for how to install Graphsignal.

Graphsignal profiles PyTorch workloads at the CUDA kernel level via CUPTI activity records — see CUDA Profiling and Monitoring for the full list of what’s captured (kernel timelines, activity groups, memcpy/memset/sync, and NVML metrics). There is no PyTorch-specific instrumentation: PyTorch is a CUDA application like any other, and the profiler treats it as such.

PyTorch workloads predominantly emit kernels from the cuBLAS, cuDNN, Triton, Flash Attention, and NCCL libraries. The per-kernel view surfaces these by raw GPU kernel name (e.g. sm80_xmma_gemm_f16f16_*, flash_attn_fwd_kernel, ncclAllReduceRingLLKernel_*). A higher-level view aggregates them by activity type (attention, matrix multiply, communication, KV cache, normalization, activation, and so on), which is usually the right level for diagnosing where PyTorch inference time is going before drilling into individual kernels.
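To make the relationship between the two views concrete, here is a minimal sketch of how raw kernel names could be rolled up into activity groups. It is not Graphsignal's actual grouping logic: the substring patterns, the `classify_kernel` and `time_by_activity` helpers, the kernel-name suffixes, and the durations are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical substring -> activity mapping, for illustration only.
# The profiler's real grouping rules are not described in this page.
ACTIVITY_PATTERNS = [
    ("flash_attn", "attention"),
    ("gemm", "matrix multiply"),
    ("nccl", "communication"),
]


def classify_kernel(kernel_name: str) -> str:
    """Return the first activity whose pattern occurs in the kernel name."""
    lowered = kernel_name.lower()
    for needle, activity in ACTIVITY_PATTERNS:
        if needle in lowered:
            return activity
    return "other"


def time_by_activity(records):
    """Sum durations (microseconds) per activity for (name, duration) pairs."""
    totals = Counter()
    for name, duration_us in records:
        totals[classify_kernel(name)] += duration_us
    return dict(totals)


# Made-up records; the "_example" suffixes and durations are placeholders.
sample = [
    ("sm80_xmma_gemm_f16f16_example", 120.0),
    ("flash_attn_fwd_kernel", 300.0),
    ("ncclAllReduceRingLLKernel_example", 80.0),
    ("sm80_xmma_gemm_f16f16_example", 60.0),
    ("elementwise_kernel", 10.0),
]

for activity, total_us in sorted(time_by_activity(sample).items()):
    print(f"{activity}: {total_us} us")
```

Reading the grouped totals first and only then looking at the individual kernels inside the largest group mirrors the workflow described above.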

Set GRAPHSIGNAL_API_KEY and wrap your launch command with graphsignal-run:

export GRAPHSIGNAL_API_KEY=<my-api-key>
graphsignal-run python my_app.py

For applications that bootstrap themselves, call graphsignal.watch() once during startup — before any CUDA work happens — then import and use PyTorch normally:

import graphsignal
graphsignal.watch()
import torch
# ... your PyTorch code ...
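If the same entry point also runs in environments where profiling is not configured, the startup call can be guarded. The following is a sketch under stated assumptions: `start_profiling` is a hypothetical helper, and it assumes that `graphsignal.watch()` picks up `GRAPHSIGNAL_API_KEY` from the environment, which this page does not state explicitly.

```python
import importlib.util
import os


def start_profiling() -> bool:
    """Call graphsignal.watch() only when profiling looks configured.

    Hypothetical helper. Returns True if watch() was called, else False.
    """
    # Assumption: watch() reads the API key from the environment.
    if not os.environ.get("GRAPHSIGNAL_API_KEY"):
        return False
    # Skip quietly when the graphsignal package is not installed.
    if importlib.util.find_spec("graphsignal") is None:
        return False

    import graphsignal

    graphsignal.watch()
    return True


if __name__ == "__main__":
    enabled = start_profiling()
    print("profiling enabled:", enabled)
    # Import torch and start CUDA work only after this point, so the
    # watch() call still precedes any CUDA activity.
```

The ordering requirement is unchanged: whatever wrapper is used, it has to run before PyTorch is imported and before any CUDA work starts.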