Learn how to monitor, benchmark and profile Hugging Face training using Graphsignal.
There are many questions that will remain unanswered if there is no visibility into ML workload performance:
These are just a few reasons that can make training many times slower, which directly translates to compute costs and modeling time.
To answer these questions and provide visibility when implementing optimizations, a machine learning profiler is instrumental.
Similar to traditional code profilers, a machine learning profiler provides execution statistics; however, the focus is on steps, operations and compute kernels rather than plain method calls. Additionally, ML profilers provide GPU utilization information relevant in a machine learning context.
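To make the idea of step-level statistics concrete, here is a minimal sketch of timing training steps by hand. It is not how Graphsignal or any other profiler is implemented; `StepTimer` and `run_dummy_training` are hypothetical names introduced only for illustration, and a real profiler would also collect operation, kernel and GPU data that this sketch ignores.

```python
import time
from statistics import mean


class StepTimer:
    """Collects wall-clock durations of training steps (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock      # injectable clock, handy for testing
        self._start = None
        self.durations = []      # one entry per completed step, in seconds

    def __enter__(self):
        self._start = self._clock()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.durations.append(self._clock() - self._start)
        return False             # never swallow exceptions raised by the step

    def summary(self):
        """Return the step count, mean step time (s) and step rate (steps/s)."""
        if not self.durations:
            return {"steps": 0, "mean_step_time": 0.0, "step_rate": 0.0}
        avg = mean(self.durations)
        rate = 1.0 / avg if avg > 0 else 0.0
        return {"steps": len(self.durations), "mean_step_time": avg, "step_rate": rate}


def run_dummy_training(num_steps, timer):
    """Run placeholder 'steps'; the arithmetic stands in for a real training step."""
    for _ in range(num_steps):
        with timer:
            sum(i * i for i in range(10_000))
```

Calling `run_dummy_training(100, timer)` with `timer = StepTimer()` and then `timer.summary()` gives the kind of step time and step rate figures that a profiler reports, only without any breakdown into operations or kernels.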
Graphsignal Profiler enables automatic ML profiling in any environment, including notebooks, scripts, training pipelines, periodic batch jobs and model serving, without installing additional software. It also allows teams to share and collaborate online.
Here is a minimal example of Hugging Face training. You just need to install and import the graphsignal module, configure it and add the profiler callback. The configure method requires api_key, which can be obtained by signing up for a free account. The account is necessary for accessing the dashboards. Refer to the Quick Start Guide for more information.
    from datasets import load_dataset
    from transformers import AutoModelForSequenceClassification
    from transformers import TrainingArguments
    from transformers import Trainer
    from transformers import AutoTokenizer

    # Import graphsignal and profiler callback for PyTorch.
    import graphsignal
    from graphsignal.profilers.huggingface import GraphsignalPTCallback

    # Configure the profiler.
    graphsignal.configure(api_key='api_key_here', workload_name='IMDB Bert Training')

    raw_datasets = load_dataset("imdb")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True)

    tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

    small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(100))
    small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(100))
    full_train_dataset = tokenized_datasets["train"]
    full_eval_dataset = tokenized_datasets["test"]

    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-cased", num_labels=2)

    training_args = TrainingArguments("test_trainer")

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=small_train_dataset,
        eval_dataset=small_eval_dataset)

    # Add profiler callback.
    trainer.add_callback(GraphsignalPTCallback())

    trainer.train()
Let's implement a few optimizations from this very helpful article by Hugging Face: Performance and Scalability: How To Fit a Bigger Model and Train It Faster.
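The text does not list which optimizations were applied, so the following is only a sketch under the assumption that they were the memory-oriented ones described in that article: a smaller per-device batch with gradient accumulation, gradient checkpointing, and optionally fp16 mixed precision. The helper names (`optimized_training_kwargs`, `effective_batch_size`, `build_training_args`) and the default values are hypothetical; the keyword names are standard `TrainingArguments` parameters.

```python
def optimized_training_kwargs(per_device_batch_size=4, accumulation_steps=4, use_fp16=False):
    """Build TrainingArguments keyword arguments for lower GPU memory use.

    The defaults are illustrative placeholders, not values from the article.
    """
    return {
        "output_dir": "test_trainer",
        # Smaller batches per device reduce activation memory.
        "per_device_train_batch_size": per_device_batch_size,
        # Accumulate gradients over several steps to keep the effective batch size.
        "gradient_accumulation_steps": accumulation_steps,
        # Recompute activations in the backward pass instead of storing them.
        "gradient_checkpointing": True,
        # Mixed precision; generally needs a supported GPU, so it is off by default here.
        "fp16": use_fp16,
    }


def effective_batch_size(kwargs, num_devices=1):
    """Number of samples contributing to each optimizer update."""
    return (kwargs["per_device_train_batch_size"]
            * kwargs["gradient_accumulation_steps"]
            * num_devices)


def build_training_args(kwargs):
    """Create TrainingArguments; imported lazily so the helpers above work without transformers."""
    from transformers import TrainingArguments
    return TrainingArguments(**kwargs)
```

In the training example above, `training_args = build_training_args(optimized_training_kwargs())` would take the place of `TrainingArguments("test_trainer")`. Changing one setting per run makes it easier to attribute differences in GPU memory and step time to a specific parameter.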
After running the above code a few times, while implementing optimizations to reduce GPU memory, we can see and compare our runs in the cloud dashboard. The run timeline provides basic statistics about each run. These include step time and rate, GPU utilization and memory. This dashboard can be used to compare multiple runs and benchmark how changes influence speed and compute. The GPU memory column shows our progress, and the Parameter difference column shows what has changed since the last run.
You can see that our optimizations were successful. However, we've lost some training speed due to more data copying to and between GPUs.
Many more details about a run or a phase of the run are available in the profile.
A profile covers various aspects of ML execution. These include:
Here are some of the profile dashboard segments for our example:
The metrics dashboard shows how long runs perform over time.