Speed Up Machine Learning Using Graphsignal Profiler
By Dmitri Melikyan | 3 min read

Optimize ML training and inference to fully utilize available resources, reduce training time, and lower inference latency.

The Invisible Cost of Training and Inference

When training an ML model, data scientists focus on model performance metrics, such as accuracy, and often ignore how computationally efficient training or inference is. For training jobs that run periodically or on a regular schedule, the wasted time and cost are multiplied with every run. The same applies to inference, where low latency and high throughput may be critical, for example, when a significant number of predictions or large prediction batches are served. The cost of high latency, whether in hardware/cloud spend or in degraded user experience, is often overlooked.

Some basic techniques are usually applied to speed up ML workloads:

  • Single-host, multi-GPU synchronous training (see the sketch after this list).
  • Multi-worker distributed synchronous training.
  • Multi-Instance GPU (MIG) to increase inference throughput.
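
For example, single-host multi-GPU synchronous training is available in TensorFlow via `tf.distribute.MirroredStrategy`. A minimal sketch; the model, data, and batch sizes are illustrative placeholders, not from the original post:

```python
import tensorflow as tf

# Replicate the model on all visible GPUs; gradients are all-reduced
# across replicas after each step (synchronous training).
strategy = tf.distribute.MirroredStrategy()
print(f"Training on {strategy.num_replicas_in_sync} replica(s)")

with strategy.scope():
    # Variables must be created inside the scope to be mirrored.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(32,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Illustrative random data; scale the global batch size with the number
# of replicas so each GPU keeps the same per-device batch size.
x = tf.random.normal((1024, 32))
y = tf.random.normal((1024, 1))
model.fit(x, y, batch_size=64 * strategy.num_replicas_in_sync, epochs=2)
```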

These will typically also increase resource use and hardware/cloud costs if implemented without optimizing and tuning the workloads. In turn, optimization work relies on understanding where time is actually spent and what causes underutilization and waiting, or, more generally, whether a workload is compute-, memory-, or overhead-bound. This article by Horace He presents an essential mental model for GPU performance optimization in a machine learning context. A machine learning profiler is instrumental in identifying these operating modes and providing execution statistics.
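
To make the compute- vs. memory-bound distinction concrete, here is a back-of-envelope calculation (my illustration, not from the article above): arithmetic intensity, the ratio of FLOPs to bytes moved, shows why elementwise operations are almost always memory-bound while large matrix multiplications are compute-bound.

```python
# Arithmetic intensity = FLOPs / bytes moved to and from memory.
# Far below the hardware's FLOPs-per-byte ratio -> memory-bound;
# far above it -> compute-bound.

N = 4096
FP32 = 4  # bytes per float32 element

# Elementwise add c = a + b over N*N elements:
# 1 FLOP per element; 2 reads + 1 write per element.
add_flops = N * N
add_bytes = 3 * N * N * FP32
print(f"elementwise add: {add_flops / add_bytes:.3f} FLOPs/byte")  # ~0.083

# Matmul (N x N) @ (N x N): 2*N^3 FLOPs, roughly 3*N^2 elements moved.
mm_flops = 2 * N**3
mm_bytes = 3 * N * N * FP32
print(f"matmul: {mm_flops / mm_bytes:.0f} FLOPs/byte")  # ~683
```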

Adding Visibility Using a Machine Learning Profiler

Just like traditional profilers, a machine learning profiler provides execution statistics; however, the focus is on ML operations and compute kernels instead of plain method calls. Additionally, ML profilers provide GPU utilization information relevant in a machine learning context.
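
For instance, PyTorch's built-in profiler reports exactly these kinds of per-operation and per-kernel statistics. A minimal sketch, assuming a CUDA-capable GPU and an illustrative torchvision model:

```python
import torch
import torchvision.models as models
from torch.profiler import profile, ProfilerActivity

model = models.resnet18().cuda().eval()
inputs = torch.randn(8, 3, 224, 224, device="cuda")

# Record CPU and CUDA activity for one inference pass.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        model(inputs)

# Aggregated per-operation statistics, sorted by total GPU time.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```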

TensorFlow and PyTorch provide built-in ML profilers, which use the NVIDIA® CUDA® Profiling Tools Interface (CUPTI) under the hood for GPU profiling. One way to use these profilers is via a locally installed TensorBoard, or by simply logging profiles. Graphsignal Profiler, in turn, builds on the built-in profilers as well as other tools to enable automatic profiling in any environment, including notebooks, training pipelines, periodic batch jobs, model serving and so on, without installing additional software. It also allows teams to share profiles and collaborate online.
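
As an illustration of the built-in route, TensorFlow's profiler can be driven from a Keras training loop through the TensorBoard callback (the log directory and batch range below are arbitrary choices, not from the original post):

```python
import tensorflow as tf

# Profile batches 10-15 of training; the trace is written under the
# log directory and can be inspected in TensorBoard's Profile tab.
tb_callback = tf.keras.callbacks.TensorBoard(
    log_dir="./logs",
    profile_batch=(10, 15),
)

# Passed to model.fit(...) alongside the usual arguments:
# model.fit(train_ds, epochs=5, callbacks=[tb_callback])
```

Graphsignal, by contrast, is set up once in code and then collects and uploads profiles automatically; at the time of writing this involves something like `graphsignal.configure(api_key=...)`, but the exact parameters may differ between versions, so see the Quick Start Guide.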

Let's consider a simple training job with Keras. The training completes successfully and the model is saved.
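
A minimal sketch of such a job; the dataset location, model architecture, and hyperparameters are illustrative stand-ins rather than the exact code behind the profiles shown here:

```python
import tensorflow as tf

# Illustrative image-classification datasets loaded from disk.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train", image_size=(180, 180), batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "data/val", image_size=(180, 180), batch_size=32)

# A small convolutional model; 10 output classes are assumed.
model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10),
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

model.fit(train_ds, validation_data=val_ds, epochs=5)
model.save("model.keras")
```

Now let's look at a profile of one of the training batches: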

Profile before

We can see in the Performance Summary that most of the time is spent on the CPU and not on the GPU, as we would like. Indeed, the slowest operations are related to dataset loading and copying. The Keras documentation recommends adding caching and prefetching to GPU memory for both the training and validation datasets.
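
With tf.data this is a small change; a sketch, applied to the hypothetical datasets from the listing above:

```python
# Cache preprocessed elements after the first epoch, and overlap data
# preparation with training by prefetching batches ahead of the GPU.
train_ds = train_ds.cache().prefetch(tf.data.AUTOTUNE)
val_ds = val_ds.cache().prefetch(tf.data.AUTOTUNE)
```

After implementing these changes, the profile looks different: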

Profile after

Now we can see that the single-batch time has dropped by about 2x and the dataset operations are no longer bottlenecks. GPU utilization has also increased (see Device operations in the Performance Summary).

In-Depth Visibility

An ML profile includes information about ML execution from several perspectives, in order to cover as many optimization use cases as possible:

  • Performance summary: percentages of time spent in different states, e.g. on CPU vs. GPU.
  • Run environment: ML stack and version information, e.g. ML framework versions, devices and capabilities.
  • Resource usage: CPU, GPU, memory usage and more.
  • Operations: list of ML operation statistics, with time spent on CPU or GPU for each operation.
  • Kernels: list of compute kernel statistics, with execution times for each kernel.

Continuous Visibility

Because modeling, as well as model serving, is a continuous process, it is important to be able to ensure at any time that workloads are fast and efficient, e.g. after switching to a new model or new hardware. That is one of the reasons we've developed Graphsignal.

Trying it out is easy; see the Quick Start Guide for instructions.