PyTorch Profiling

See the Quick Start guide for instructions on installing Graphsignal.

Graphsignal profiles PyTorch workloads at the CUDA kernel level using CUPTI activity records. See CUDA Profiling and Monitoring for the full list of what is captured (kernel timelines, activity groups, memcpy/memset/sync operations, and NVML metrics). There is no PyTorch-specific instrumentation: PyTorch is a CUDA application like any other, and the profiler treats it as such.

What you’ll see

PyTorch workloads predominantly emit kernels from the cuBLAS, cuDNN, Triton, Flash Attention, and NCCL libraries. The per-kernel view surfaces these by raw GPU kernel name (e.g. sm80_xmma_gemm_f16f16_*, flash_attn_fwd_kernel, ncclAllReduceRingLLKernel_*). A higher-level view aggregates them by activity type: attention, matrix multiply, communication, KV cache, normalization, activation, and so on. The aggregated view is usually the right place to start when diagnosing where PyTorch inference time is going, before drilling into individual kernels.
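To make the relationship between the two views concrete, here is a minimal sketch of how raw kernel names could be rolled up into activity types. It is purely illustrative: the patterns, the function names (classify_kernel, aggregate_by_activity), and the (name, duration) record shape are assumptions made for this example, not Graphsignal's actual classification rules or data format.

```python
import re
from collections import defaultdict

# Hypothetical name patterns, checked in order. These are assumptions for
# illustration only and do not reflect Graphsignal's real grouping logic.
ACTIVITY_PATTERNS = [
    ("communication", re.compile(r"nccl", re.IGNORECASE)),
    ("attention", re.compile(r"flash_attn|attention", re.IGNORECASE)),
    ("matrix multiply", re.compile(r"gemm", re.IGNORECASE)),
    ("normalization", re.compile(r"layer_?norm|rms_?norm|batch_?norm", re.IGNORECASE)),
]


def classify_kernel(name):
    """Return the first activity type whose pattern matches the kernel name."""
    for activity, pattern in ACTIVITY_PATTERNS:
        if pattern.search(name):
            return activity
    return "other"


def aggregate_by_activity(records):
    """Sum durations per activity type.

    `records` is an iterable of (kernel_name, duration) pairs; the unit of
    duration is whatever the caller uses (a hypothetical record shape).
    """
    totals = defaultdict(float)
    for name, duration in records:
        totals[classify_kernel(name)] += duration
    return dict(totals)
```

With rules like these, a kernel named flash_attn_fwd_kernel would be counted under attention and any kernel whose name contains "gemm" under matrix multiply, which is the kind of roll-up the higher-level view provides.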

Integration

Set GRAPHSIGNAL_API_KEY and wrap your launch command with graphsignal-run:

export GRAPHSIGNAL_API_KEY=<my-api-key>
graphsignal-run python my_app.py

For applications that bootstrap themselves, call graphsignal.watch() once during startup, before any CUDA work happens, and then import and use PyTorch normally:

import graphsignal

# Start the profiler once, before any CUDA work happens.
graphsignal.watch()

import torch
# ... your PyTorch code ...
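If it is hard to tell whether graphsignal.watch() really runs before any CUDA work, a small guard can help during development. The sketch below is an assumption-laden example, not part of Graphsignal: the helper name cuda_already_initialized is hypothetical, and it only detects CUDA state initialized through PyTorch (via torch.cuda.is_initialized()), not CUDA work started by other libraries.

```python
import sys


def cuda_already_initialized(modules=None):
    """Return True if torch is imported and PyTorch has initialized CUDA.

    `modules` defaults to sys.modules; it is a parameter only so the check
    can be exercised without importing torch (a hypothetical convenience).
    """
    modules = sys.modules if modules is None else modules
    torch = modules.get("torch")
    if torch is None:
        # torch has not been imported, so PyTorch cannot have touched CUDA.
        return False
    return bool(torch.cuda.is_initialized())


# Possible use at startup, just before calling graphsignal.watch():
#
#     if cuda_already_initialized():
#         print("warning: CUDA was initialized before graphsignal.watch()")
#     graphsignal.watch()
```

The check is deliberately conservative: it never imports torch itself, so it does not change the import order it is meant to verify.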