Benchmarking, Profiling and Monitoring PyTorch Lightning With Graphsignal
By Dmitri Melikyan | 3 min read

Learn how to benchmark, profile and monitor PyTorch Lightning training using Graphsignal.

Why Benchmark and Analyze Training Speed?

There are many questions that will remain unanswered if there is no visibility into ML workload performance:

  • How much speed gain has your last change introduced?
  • Is your training compute, memory or overhead intensive?
  • Is model training running on GPU?
  • Are all devices utilized?
  • How much free memory is left on the device?
  • How is memory growing during the run, is there a memory leak?

Left unanswered, these questions can hide issues that make training many times slower, which directly translates to higher compute costs and longer modeling time.

To answer these questions and provide visibility when implementing optimizations, a machine learning profiler is instrumental.

Using a Machine Learning Profiler

Similar to traditional code profilers, a machine learning profiler provides execution statistics; however, the focus is on steps, operations, and compute kernels rather than plain method calls. Additionally, ML profilers report GPU utilization information relevant in a machine learning context.

Graphsignal Profiler enables automatic ML profiling in any environment, including notebooks, scripts, training pipelines, periodic batch jobs and model serving, without installing additional software. It also allows teams to share and collaborate online.

Adding Graphsignal Profiler to PyTorch Lightning Training

Here is a minimal example of PyTorch Lightning training. All you need to do is install and import the graphsignal module, configure it, and add the profiler callback. The configure method requires an api_key (or an environment variable), which can be obtained by signing up for a free account; the account is necessary for accessing the dashboards. Refer to the Quick Start Guide for more information.
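If the module is not yet installed, it can typically be added from PyPI (assuming the package is published under the name graphsignal):

```shell
pip install graphsignal
```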

import logging
import os
import torch
from pytorch_lightning import LightningModule, Trainer
from torch import nn
from torch.nn import functional as F
from torch.utils.data import DataLoader, random_split
from torchmetrics import Accuracy
from torchvision import transforms
from torchvision.datasets import MNIST
# Graphsignal: import
import graphsignal
from graphsignal.profilers.pytorch_lightning import GraphsignalCallback

# Graphsignal: configure
graphsignal.configure(api_key='my_api_key', workload_name='PyTorch Lightning MNIST')

PATH_DATASETS = os.environ.get("PATH_DATASETS", ".")
AVAIL_GPUS = min(1, torch.cuda.device_count())
BATCH_SIZE = 256 if AVAIL_GPUS else 64

class MNISTModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.l1 = torch.nn.Linear(28 * 28, 10)
        self.batch_size = BATCH_SIZE

    def forward(self, x):
        return torch.relu(self.l1(x.view(x.size(0), -1)))

    def training_step(self, batch, batch_nb):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        return loss

    def train_dataloader(self):
        train_ds = MNIST(PATH_DATASETS, train=True, download=True, transform=transforms.ToTensor())
        train_loader = DataLoader(train_ds, batch_size=self.batch_size)
        return train_loader

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

mnist_model = MNISTModel()

# Graphsignal: add profiler callback
trainer = Trainer(
    gpus=AVAIL_GPUS,
    max_epochs=3,
    callbacks=[GraphsignalCallback()])

trainer.fit(mnist_model)

Benchmarking Runs

Let's implement a few optimizations described in this very helpful article by PyTorch Lightning creators: Speed Up Model Training.

Our baseline training run is on CPU. We will first move the training to GPU by setting gpus=2 together with ddp, the distributed data parallel strategy. As a second step, after noticing that we still have plenty of free GPU memory, we will increase the batch size.
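As a rough sketch of why these two changes interact (the numbers below are assumed for illustration, not taken from the runs): under DDP each process loads its own batches, so the batch the optimizer effectively sees per step is the per-device batch size times the number of GPUs.

```python
# Sketch with assumed values: under DDP, every process (one per GPU) runs
# its own dataloader, so the effective batch size per optimizer step is
# the per-device batch size multiplied by the number of processes.
per_device_batch_size = 256  # BATCH_SIZE from the example above
num_gpus = 2                 # gpus=2 with the ddp strategy

effective_batch_size = per_device_batch_size * num_gpus
print(effective_batch_size)  # 512
```

With free GPU memory remaining, raising the per-device batch size increases throughput further, until memory or data loading becomes the bottleneck.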

All runs are visualized in real time in the cloud dashboard. The run dashboard provides basic statistics about each run and can be used to compare multiple runs and benchmark how changes influence speed and compute. The Sample rate column shows our improvements, and the Effective parameters column shows the parameters that had an impact on speed.

You can see that our optimizations were successful. The speed improvement tells us how much faster training will finish, which directly translates to lower compute costs and, indirectly, to the ability to iterate faster and train better models.

Profile Timeline

Much more detail about a run, or a phase of a run, is available in the profile.

Analyzing Profiles

A profile covers various aspects of ML execution. These include:

  • Step statistics.
  • Time distribution summary.
  • Run parameters and metrics, user-provided or recorded.
  • Detailed step trace information.
  • Operation statistics for one training step.
  • Kernel statistics for one training step.
  • Current state of host and devices.

Here are some of the profile dashboard segments for our example:

Step statistics and time breakdown


Step trace


ML operation and kernel statistics


Compute utilization


Monitoring a run

The monitor dashboard shows how runs perform over time. It is helpful for detecting memory leaks and other issues that manifest themselves over a period of time.
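For instance, one common cause of steadily growing memory in PyTorch training loops is accumulating loss tensors that still reference the autograd graph. A minimal sketch of the pattern (not tied to Graphsignal's API):

```python
import torch

# Sketch of a classic memory-growth pattern that a monitoring dashboard can
# surface: storing loss tensors keeps each step's autograd graph alive, so
# memory grows with every step of the run.
leaky_history, safe_history = [], []
for _ in range(3):
    x = torch.randn(8, requires_grad=True)
    loss = (x ** 2).sum()
    leaky_history.append(loss)        # retains the computation graph
    safe_history.append(loss.item())  # plain float, graph can be freed

print(all(t.grad_fn is not None for t in leaky_history))  # True
print(all(isinstance(v, float) for v in safe_history))    # True
```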


Getting Started

Using Graphsignal is easy; see the Quick Start Guide for instructions or request a quick demo.