Benchmarking and Profiling Hugging Face Training With Graphsignal
By Dmitri Melikyan | 3 min read

Learn how to monitor, benchmark and profile Hugging Face training using Graphsignal.

Why Benchmark and Analyze Training Speed?

Many questions remain unanswered without visibility into ML workload performance:

  • How much speed gain has your last change introduced?
  • Is your training compute-, memory- or overhead-intensive?
  • Is model training running on GPU?
  • Are all devices utilized?
  • How much free memory is left on the device?
  • How is memory growing during the run, is there a memory leak?

These are just a few of the issues that can make training many times slower, which directly translates into compute costs and modeling time.

To answer these questions and provide visibility when implementing optimizations, a machine learning profiler is instrumental.

Using a Machine Learning Profiler

Similar to traditional code profilers, a machine learning profiler provides execution statistics; however, it focuses on steps, operations and compute kernels rather than plain method calls. Additionally, ML profilers report GPU utilization information relevant in the machine learning context.
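At its core, the step statistics such a profiler reports come from timing each training step and deriving aggregates. As a rough illustration (a minimal pure-Python sketch, not Graphsignal's actual implementation):

```python
import time

def profile_steps(step_fn, num_steps):
    """Record per-step wall-clock durations and derive the kind of
    step statistics an ML profiler reports (step time and rate)."""
    durations = []
    for _ in range(num_steps):
        start = time.perf_counter()
        step_fn()
        durations.append(time.perf_counter() - start)
    avg = sum(durations) / len(durations)
    return {"avg_step_time_sec": avg, "steps_per_sec": 1.0 / avg}

# A stand-in "training step": some CPU-bound work.
stats = profile_steps(lambda: sum(i * i for i in range(10000)), num_steps=5)
print(stats)
```

A real ML profiler additionally attributes time to operations and GPU kernels within each step, which plain wall-clock timing cannot do.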

Graphsignal Profiler enables automatic ML profiling in any environment, including notebooks, scripts, training pipelines, periodic batch jobs and model serving, without installing additional software. It also allows teams to share and collaborate online.

Adding Graphsignal Profiler to Hugging Face Training

Here is a minimal example of Hugging Face training. Basically, you just need to install and import the graphsignal module, configure it, and add the profiler callback. The configure method requires an api_key, which can be obtained by signing up for a free account; the account is necessary for accessing the dashboards. Refer to the Quick Start Guide for more information.

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification
from transformers import TrainingArguments
from transformers import Trainer
from transformers import AutoTokenizer

# Import graphsignal and profiler callback for PyTorch.
import graphsignal
from graphsignal.profilers.huggingface import GraphsignalPTCallback

# Configure the profiler.
graphsignal.configure(api_key='api_key_here', workload_name='IMDB Bert Training')

raw_datasets = load_dataset("imdb")

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(100))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(100))
full_train_dataset = tokenized_datasets["train"]
full_eval_dataset = tokenized_datasets["test"]

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=2)

training_args = TrainingArguments("test_trainer")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    # Add profiler callback.
    callbacks=[GraphsignalPTCallback()])

trainer.train()

Benchmarking Runs

Let's implement a few optimizations from this very helpful article by Hugging Face: Performance and Scalability: How To Fit a Bigger Model and Train It Faster.
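The kinds of GPU memory optimizations that article covers can be expressed through TrainingArguments. Here is a sketch; the batch size and accumulation steps are illustrative values, not necessarily those used in our runs:

```python
from transformers import TrainingArguments

# Trade compute for GPU memory: smaller per-device batches combined with
# gradient accumulation, mixed precision, and gradient checkpointing.
training_args = TrainingArguments(
    "test_trainer",
    per_device_train_batch_size=4,   # smaller batches use less GPU memory
    gradient_accumulation_steps=4,   # keep the effective batch size at 16
    fp16=True,                       # mixed precision on supported GPUs
    gradient_checkpointing=True,     # recompute activations in the backward pass
)
```

Each of these settings reduces peak GPU memory at some cost in step time, which is exactly the trade-off the benchmarking dashboard helps quantify.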

After running the above code a few times while implementing optimizations to reduce GPU memory usage, we can see and compare our runs in the cloud dashboard. The run timeline provides basic statistics about each run, including step time and rate, GPU utilization and memory. This dashboard can be used to compare multiple runs and benchmark how changes influence speed and compute. The GPU memory column shows our progress, and the Parameter difference column shows what has changed since the last run.

You can see that our optimizations were successful. However, we've lost some training speed due to more data being copied to and between GPUs.

Profile Timeline

Many more details about a run, or a phase of a run, are available in the profile.

Analyzing Profiles

A profile covers various aspects of ML execution. These include:

  • Time distribution summary.
  • Run parameters, user-provided or recorded.
  • Step summary.
  • Current state of host and devices.
  • Operation statistics for one training step.
  • Kernel statistics for one training step.
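To make the operation-statistics view concrete, here is a small pure-Python sketch of how per-call timings could be aggregated into per-operation totals. The operation names and millisecond values are made up for illustration:

```python
from collections import defaultdict

# Hypothetical per-call timings (operation name, milliseconds)
# recorded during one training step.
op_calls = [
    ("aten::addmm", 1.2), ("aten::addmm", 1.1),
    ("aten::gelu", 0.3), ("aten::softmax", 0.4),
]

totals = defaultdict(float)
counts = defaultdict(int)
for name, ms in op_calls:
    totals[name] += ms
    counts[name] += 1

# Rank operations by total time, as an operation-statistics view would.
for name in sorted(totals, key=totals.get, reverse=True):
    print(f"{name}: {totals[name]:.1f} ms over {counts[name]} calls")
```

Kernel statistics follow the same pattern one level lower, aggregating over GPU compute kernels instead of framework operations.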

Here are some of the profile dashboard segments for our example:

Performance summary and step statistics


Compute utilization


ML operation statistics


Kernel statistics


Tracking Metrics

The metrics dashboard shows how long-running workloads perform over time.


Getting Started

Using Graphsignal is easy; see the Quick Start Guide for instructions or request a quick demo.