Accuracy-Aware Inference Optimization Tracking
By Dmitri Melikyan | 4 min read

Learn how to measure and profile inference to improve latency and throughput, while maintaining accuracy or other metrics.

Big Models and Inference Speed

Recent models, foundation models in particular, keep growing and now often have billions of parameters. The inference latency and throughput of such models become an issue for many use cases. Large-scale stream processing and model serving deployments usually require millisecond-level latency.
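
To make the two quantities concrete, here is a minimal sketch of how latency and throughput could be measured for any callable. The `benchmark` helper and the `fake_predict` stand-in are hypothetical names used only for illustration; they are not part of any library mentioned in this article.

```python
import statistics
import time


def benchmark(predict, payload, warmup=10, iterations=100):
    """Time repeated calls to predict(payload); return latency stats and throughput."""
    # Warm-up calls are not timed, so one-time costs (lazy init, caches) do not skew results.
    for _ in range(warmup):
        predict(payload)

    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        predict(payload)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    latencies_ms.sort()
    total_s = sum(latencies_ms) / 1000.0
    return {
        "p50_ms": statistics.median(latencies_ms),
        # Approximate 95th percentile: index into the sorted latencies.
        "p95_ms": latencies_ms[int(0.95 * (len(latencies_ms) - 1))],
        "throughput_per_s": iterations / total_s if total_s > 0 else float("inf"),
    }


def fake_predict(text):
    # Hypothetical stand-in for a real model call.
    return sum(ord(c) for c in text) % 77


stats = benchmark(fake_predict, "What can I do if my card still hasn't arrived?")
print(stats)
```

In a real benchmark, `fake_predict` would be replaced by the model or pipeline call, and tail latency (p95 or higher) is usually the number to compare against a latency requirement.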

Model Compilers and Runtimes

To speed up inference, compilation-level optimizations for the target hardware have proven effective. They are offered by open source compilers and runtimes, and are also integrated into the offerings of deployment providers. One popular example is ONNX Runtime. It provides automatic graph-level optimizations such as node fusion, as well as quantization. It also supports various execution providers, such as TensorRT.
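
To illustrate what quantization does to model weights, here is a small sketch of symmetric int8 quantization in plain Python. It only shows the general idea of trading precision for smaller, faster arithmetic; it is not how ONNX Runtime implements quantization, and the function names are made up for this example.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127] plus a scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the scale."""
    return [v * scale for v in quantized]


weights = [0.42, -1.27, 0.003, 0.9]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# The rounding error per weight is bounded by half of the scale.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized, scale, max_error)
```

The rounding error introduced here is the reason accuracy has to be re-checked after quantizing a model.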

Network-Level Optimizations

Combined with network-level optimizations, the speedup can be much greater. Model weight pruning is one option. Although it does not produce the best results with every model architecture, architecture-specific methods are already available. The paper "Rethinking Network Pruning -- under the Pre-train and Fine-tune Paradigm" presents one such method, demonstrated on BERT transformers. Another paper, "Prune Once for All: Sparse Pre-Trained Language Models", claims that a 40x compression ratio can be achieved via weight pruning, knowledge distillation and quantization. According to some benchmarks, such compression may lead to a 20x speedup.
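
As a rough illustration of weight pruning, the sketch below zeroes out the smallest-magnitude weights of a flat list. This is generic unstructured magnitude pruning, a common baseline, and not the method of either paper above; `magnitude_prune` is a hypothetical helper.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute value."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by ascending magnitude; the first n_prune are zeroed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02, 0.4, -0.1]
pruned = magnitude_prune(weights, sparsity=0.5)
nonzero = sum(1 for w in pruned if w != 0.0)
print(pruned)   # [0.8, 0.0, 0.3, 0.0, -0.6, 0.0, 0.4, 0.0]
print(nonzero)  # 4
```

Zeroed weights only translate into a speedup when the runtime or hardware can exploit the sparsity, which is one more reason to measure rather than assume.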

Optimized Model Implementations

ML engineers who train, fine-tune and deploy models do not necessarily need to implement optimizations themselves. Often they only need to evaluate existing optimized model implementations (e.g. from DeepSpeed or FasterTransformer) against their datasets, deployment scenarios and hardware configurations. It has also been shown that, for language models, weight pruning can be applied at pre-training, resulting in sparse models that can then be fine-tuned.

The Optimize-Verify-Evaluate Loop

In any case, many inference optimizations, especially at the network level, affect model performance, e.g. accuracy. It is therefore essential to verify that optimizations were actually applied and to ensure that evaluation metrics stay at an acceptable level while iterating on optimizations or models. Other parameters, such as batch size, are useful to benchmark against as well. Here is an example of batch size search.
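
A batch size search can be sketched in a few lines: time the same model call at several batch sizes and compare items processed per second. The helper names and the `fake_predict_batch` model below are assumptions made for this example; a real search would call the actual model and use more iterations.

```python
import time


def search_batch_size(predict_batch, make_batch, batch_sizes, iterations=20):
    """Return (best_batch_size, results); results maps batch size to items per second."""
    results = {}
    for batch_size in batch_sizes:
        batch = make_batch(batch_size)
        predict_batch(batch)  # warm-up call, not timed
        start = time.perf_counter()
        for _ in range(iterations):
            predict_batch(batch)
        elapsed = time.perf_counter() - start
        items = batch_size * iterations
        results[batch_size] = items / elapsed if elapsed > 0 else float("inf")
    best = max(results, key=results.get)
    return best, results


def fake_predict_batch(batch):
    # Hypothetical model: a fixed per-call overhead plus a little per-item work.
    time.sleep(0.001)
    return [len(text) for text in batch]


def make_batch(size):
    return ["example payload"] * size


best, results = search_batch_size(fake_predict_batch, make_batch, [1, 2, 4, 8], iterations=5)
print(best, results)
```

Larger batches usually raise throughput at the cost of per-request latency, so the best batch size depends on which of the two the deployment cares about.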

This implies that for every optimization trial, an evaluation on some test/production dataset should be performed to ensure model performance metrics are not degrading.
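
The check itself can be as simple as an accuracy budget: pick the fastest trial whose accuracy stays within a tolerance of the baseline. The sketch below uses made-up numbers purely for illustration; they are not measurements, and the `Trial` type and `select_best_trial` helper are hypothetical.

```python
from typing import NamedTuple


class Trial(NamedTuple):
    name: str
    latency_ms: float
    accuracy: float


def select_best_trial(baseline, candidates, max_accuracy_drop=0.01):
    """Return the fastest trial whose accuracy is within max_accuracy_drop of the baseline."""
    acceptable = [
        t for t in candidates
        if baseline.accuracy - t.accuracy <= max_accuracy_drop
    ]
    # The baseline is always acceptable, so there is always a result.
    return min([baseline] + acceptable, key=lambda t: t.latency_ms)


# Made-up numbers for illustration only.
baseline = Trial("vanilla", latency_ms=20.0, accuracy=0.92)
candidates = [
    Trial("onnx", latency_ms=12.0, accuracy=0.92),
    Trial("onnx-quantized", latency_ms=7.0, accuracy=0.90),
]

best = select_best_trial(baseline, candidates, max_accuracy_drop=0.01)
print(best.name)  # onnx
```

With a looser budget, for example `max_accuracy_drop=0.05`, the quantized trial would be selected instead, which is why the tolerance should be set from the use case rather than after seeing the results.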

Additionally, an important part of the optimization process is the ability to analyze workload speed and compute usage in order to remove bottlenecks and improve utilization.

Using Graphsignal Inference Profiler

We've created Graphsignal to enable the Optimize-Verify-Evaluate loop for any model and deployment scenario. The profiler automates all measurements and profiling, and provides dashboards for benchmarking and analyzing trial runs. Latency and throughput can be compared against model metrics, compute configuration and utilization, and any user-defined parameters.

Runs dashboard

Now it is possible to compare runs' latency and throughput, and track how accuracy is affected by the optimizations.

It is also possible to select other dimensions to benchmark the speed against. These include any logged parameters and metrics, batch size, hardware and GPU utilization. Alternatively, all dimensions can be viewed in one table.

Runs table

To analyze a particular run for bottlenecks, we can look at the inference trace.

Inference trace

And here is the code of the benchmark for reference:

from pathlib import Path
from transformers import AutoTokenizer, pipeline, PretrainedConfig
from optimum.onnxruntime import ORTModelForSequenceClassification

import argparse
parser = argparse.ArgumentParser(description='Benchmark args')
parser.add_argument('--onnx', action='store_true', help='run the model with ONNX Runtime')
parser.add_argument('--gpu', action='store_true', help='run on GPU instead of CPU')
parser.add_argument('--quantize', action='store_true', help='quantize the model (only used with --onnx)')
args = parser.parse_args()


model_id = "optimum/distilbert-base-uncased-finetuned-banking77"
dataset_id = "banking77"
onnx_path = Path("/tmp/onnx")
task = "text-classification"

payload = "What can I do if my card still hasn't arrived after 2 weeks?"

def compute_accuracy(pipe):
    """Evaluate the pipeline on the dataset's test split and return its accuracy."""
    print('Evaluating...')

    from evaluate import evaluator
    from datasets import load_dataset

    # Named task_evaluator to avoid shadowing the built-in eval()
    task_evaluator = evaluator(task)
    eval_dataset = load_dataset(dataset_id, split="test")

    results = task_evaluator.compute(
        model_or_pipeline=pipe,
        data=eval_dataset,
        metric="accuracy",
        input_column="text",
        label_column="label",
        # Map the pipeline's string labels back to the dataset's integer ids
        label_mapping=pipe.model.config.label2id,
        strategy="simple",
    )
    return results['accuracy']


# Graphsignal: configure, expects GRAPHSIGNAL_API_KEY environment variable
import graphsignal
graphsignal.configure(workload_name='DistilBERT Inference')


if not args.onnx:
    from graphsignal.profilers.pytorch import profile_inference
    graphsignal.add_tag('Vanilla')
    if args.gpu:
        graphsignal.add_tag('GPU')

    vanilla_clx = pipeline(task, model=model_id, device=0 if args.gpu else -1)

    for _ in range(10):
        _ = vanilla_clx(payload)
    for _ in range(100):
        # Graphsignal: profile
        with profile_inference():
            _ = vanilla_clx(payload)

    # Graphsignal: log accuracy, upload and end run
    graphsignal.log_metric('accuracy', compute_accuracy(vanilla_clx))
    graphsignal.end_run()

if args.onnx:
    # Export the model to ONNX. save_pretrained also writes config.json,
    # which is read back below when wrapping the session.
    model = ORTModelForSequenceClassification.from_pretrained(model_id, from_transformers=True)
    model.save_pretrained(onnx_path)

    if args.quantize:
        from optimum.onnxruntime import ORTQuantizer
        from optimum.onnxruntime.configuration import AutoQuantizationConfig

        quantizer = ORTQuantizer.from_pretrained(model_id, feature=model.pipeline_task)
        # Dynamic (non-static) quantization config for AVX512-VNNI CPUs
        qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=True)

        # The quantized model overwrites the exported model.onnx in place
        quantizer.export(
            onnx_model_path=onnx_path / "model.onnx",
            onnx_quantized_model_output_path=onnx_path / "model.onnx",
            quantization_config=qconfig,
        )
    
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.save_pretrained(onnx_path)

    import onnxruntime
    from graphsignal.profilers.onnxruntime import initialize_profiler, profile_inference
    graphsignal.add_tag('ONNX')
    if args.gpu:
        graphsignal.add_tag('GPU')
    if args.quantize:
        graphsignal.add_tag('Quantized')

    sess_options = onnxruntime.SessionOptions()

    # Graphsignal: initialize profiler for ONNX Runtime session
    initialize_profiler(sess_options)

    session = onnxruntime.InferenceSession(
        str(onnx_path / 'model.onnx'),
        sess_options,
        providers=[
            'CUDAExecutionProvider' if args.gpu else 'CPUExecutionProvider'
        ])
    model_from_session = ORTModelForSequenceClassification(
        model=session, 
        config=PretrainedConfig.from_json_file(onnx_path / 'config.json'))

    optimum_clx = pipeline(task, model=model_from_session, tokenizer=tokenizer, device=0 if args.gpu else -1)

    for _ in range(10):
        _ = optimum_clx(payload)
    for _ in range(100):
        # Graphsignal: profile
        with profile_inference(session):
            _ = optimum_clx(payload)

    # Graphsignal: log accuracy, upload and end run
    graphsignal.log_metric('accuracy', compute_accuracy(optimum_clx))
    graphsignal.end_run()

To try it out yourself, see the Quick Start Guide for instructions.