Speed Up Machine Learning Using Graphsignal Profiler
By Dmitri Melikyan | 2 min read

Optimize ML inference to fully utilize available resources and reduce inference latency.

The Invisible Cost of Inference

Low latency and high throughput of model inference are critical when a significant number of predictions or large prediction batches are run, and also for models used in real-time tasks such as translation. The cost of high latency, whether in hardware/cloud spend or in degraded user experience, is often not obvious.

One way to optimize inference is to profile it, understand the bottlenecks, and then optimize the code or try different models or training parameters. This is an iterative process that requires the right tools, and a machine learning profiler is essential to it.

Adding Visibility Using Machine Learning Profiler

Just like traditional profilers, a machine learning profiler provides execution statistics; however, the focus is on ML operations and compute kernels instead of plain method calls. Additionally, ML profilers report GPU utilization information relevant in a machine learning context.
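For contrast, here is what a traditional, method-level profiler reports. This minimal sketch uses Python's built-in cProfile on a stand-in `predict` function (a placeholder, not a real model): it shows time per function call, but knows nothing about ML operations, compute kernels, or GPUs.

```python
import cProfile
import io
import pstats

def predict(batch):
    # Stand-in for model inference: a plain numeric computation.
    return [sum(x * x for x in row) for row in batch]

batch = [[float(i + j) for i in range(100)] for j in range(100)]

# Record and print per-function timing statistics.
profiler = cProfile.Profile()
profiler.enable()
predict(batch)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

The output lists call counts and cumulative times per function, which is useful but stops at the Python-call boundary.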

TensorFlow and PyTorch provide built-in profilers, which rely on the NVIDIA® CUDA® Profiling Tools Interface (CUPTI) under the hood for GPU profiling. One way to use these profilers is via a locally installed TensorBoard or by simply logging profiles. The Graphsignal agent builds on the built-in profilers as well as other tools to enable automatic profiling in any environment without installing additional software. It also allows teams to share profiles and collaborate online.
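As an illustration of invoking a built-in profiler directly, this minimal PyTorch sketch (the model and input here are placeholders) records CPU activity for one inference and prints an operator-level summary:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder model and input; substitute your own.
model = torch.nn.Linear(128, 10)
inputs = torch.randn(32, 128)

# Record operator-level statistics on CPU; on a GPU machine, add
# ProfilerActivity.CUDA (CUPTI is used under the hood for GPU events).
with profile(activities=[ProfilerActivity.CPU]) as prof:
    with torch.no_grad():
        model(inputs)

# Print the top operators sorted by total CPU time.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

The same `profile` context manager can also export a Chrome trace or TensorBoard logs for interactive inspection.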

In-Depth Visibility

An ML profile includes information about ML execution from different perspectives in order to cover as many optimization use cases as possible.

  • Step statistics: per-step statistics for single or batch predictions, such as sample rate, inference rate, inference time and FLOP rate.
  • Time breakdown: percentages of times spent in different states, e.g. between CPU and GPU.
  • Parameters: user-provided or automatically recorded parameters.
  • Metrics: user-provided or automatically recorded metrics.
  • Step trace: detailed trace of a single inference in Chrome trace viewer.
  • Operations: list of ML operation statistics with times spent on CPU or GPU for each operation.
  • Kernels: list of compute kernel statistics with execution times for each kernel.
  • Compute utilization: CPU, GPU, memory usage and more.
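To make the step statistics above concrete, here is a small, library-free sketch (the function name and timing loop are illustrative, not Graphsignal's API) that derives inference rate, sample rate and mean inference time from recorded step timings:

```python
import time

def step_statistics(durations_sec, batch_size):
    """Derive basic step statistics from per-inference wall times."""
    total = sum(durations_sec)
    count = len(durations_sec)
    return {
        "inference_rate": count / total,            # inferences per second
        "sample_rate": count * batch_size / total,  # samples per second
        "mean_inference_time_ms": total / count * 1e3,
    }

# Simulate timing a few inference steps.
durations = []
for _ in range(5):
    start = time.perf_counter()
    sum(i * i for i in range(10000))  # stand-in for model(batch)
    durations.append(time.perf_counter() - start)

print(step_statistics(durations, batch_size=32))
```

A profiler collects these numbers automatically per step, so regressions show up without manual instrumentation.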

Continuous Visibility

Being able to continuously verify that inference is fast and efficient, e.g. for a new model or new hardware, is important. That is one of the reasons we've developed Graphsignal.

Trying it out is easy; see the Quick Start Guide for instructions.