Learn how to measure inference to improve latency and throughput, while maintaining accuracy or other metrics.
Recent models, foundation models in particular, are increasingly large, often with billions of parameters. Inference latency and throughput of such models become an issue for many use cases. For large-scale stream processing or model serving deployments, millisecond-level latency is usually a requirement.
To speed up inference, compilation-level optimizations for the target hardware have proven effective. Such optimizations are offered by open source compilers and runtimes, and are also integrated by deployment providers. One popular example is ONNX Runtime: it provides automatic graph-level optimizations such as node fusion, as well as quantization, and it supports various execution providers such as TensorRT.
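For illustration, here is a minimal sketch of configuring graph-level optimizations and execution providers directly in ONNX Runtime; it is not part of the benchmark below, and the model path is hypothetical.

import onnxruntime as ort

# Enable all graph-level optimizations, including node fusion.
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL

# Prefer TensorRT, falling back to CUDA and then CPU if a provider is unavailable.
session = ort.InferenceSession(
    "model.onnx",  # hypothetical path to an exported ONNX model
    sess_options=opts,
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
)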
Combined with network-level optimizations, the speedup can be much greater. Model weight pruning is one option. Although it does not produce the best results with all model architectures, architecture-specific methods are already available. The paper Rethinking Network Pruning - under the Pre-train and Fine-tune Paradigm presents one such method demonstrated on BERT transformers. Another paper, Prune Once for All: Sparse Pre-Trained Language Models, claims that a 40x compression ratio can be achieved via weight pruning, knowledge distillation, and quantization; it also shows that, for language models, weight pruning can be applied at pre-training time, resulting in sparse models that can then be fine-tuned. According to some benchmarks, such compression may lead to a 20x speedup.
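As a simple illustration (not the methods from these papers), unstructured magnitude pruning can be applied to a transformer's linear layers with PyTorch's built-in pruning utilities; the sparsity level below is arbitrary, and such naive pruning typically costs accuracy and only pays off with sparsity-aware runtimes.

import torch
from torch.nn.utils import prune
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "optimum/distilbert-base-uncased-finetuned-banking77")

# Zero out the 50% smallest-magnitude weights in every linear layer,
# then make the pruning permanent by removing the reparameterization.
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")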
ML engineers who train, fine-tune, and deploy models do not necessarily need to implement such optimizations themselves; in many cases they just need to evaluate existing optimized model implementations (e.g. from DeepSpeed, FasterTransformer, etc.) against their own datasets, deployment scenarios, and hardware configurations.
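For instance, a Hugging Face model can be wrapped with DeepSpeed's inference engine and then evaluated like any other candidate implementation. This is only a sketch: the exact arguments, supported architectures, and kernel injection behavior depend on the DeepSpeed version, and a GPU is assumed for the fp16 kernels.

import torch
import deepspeed
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "optimum/distilbert-base-uncased-finetuned-banking77")

# Ask DeepSpeed to replace supported modules with its optimized inference kernels.
ds_engine = deepspeed.init_inference(
    model,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)
model = ds_engine.module  # use as a regular PyTorch module for evaluation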
In any case, many inference optimizations, especially at the network level, affect model performance metrics such as accuracy. It is therefore essential to verify that the optimizations were actually applied and that evaluation metrics stay at an acceptable level while iterating on optimizations or models. Other parameters, such as batch size, are useful to benchmark against as well; an example of a batch size search is sketched below.
This implies that for every optimization trial, an evaluation on a test or production dataset should be performed to ensure that model performance metrics do not degrade.
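Here is the batch size search sketch mentioned above. It is a minimal, hypothetical example that reuses only the Graphsignal calls from the benchmark below (a GRAPHSIGNAL_API_KEY environment variable is expected); the deployment and endpoint names are arbitrary.

import graphsignal
from transformers import pipeline

# Graphsignal: configure, expects GRAPHSIGNAL_API_KEY environment variable
graphsignal.configure(deployment='distilbert-batch-size-search')

clx = pipeline("text-classification",
               model="optimum/distilbert-base-uncased-finetuned-banking77")

payload = "What can I do if my card still hasn't arrived after 2 weeks?"

for batch_size in (1, 8, 32, 64):
    batch = [payload] * batch_size
    for _ in range(100):
        # Graphsignal: one trace per batched inference, named per batch size
        with graphsignal.start_trace(endpoint=f'predict-batch-{batch_size}'):
            _ = clx(batch)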
Additionally, an important aspect of the optimization process is being able to analyze inference speed and compute usage in order to remove bottlenecks and improve utilization.
We've created Graphsignal to enable the Optimize-Verify-Evaluate loop for any model and deployment scenario. It automates measurements and provides dashboards for benchmarking and analyzing trial runs. Latency and throughput can be compared against model metrics, compute configuration and utilization.
It is now possible to compare latency and throughput across runs and track how the optimizations affect accuracy.
To analyze a particular run for bottlenecks, we can look at its inference trace in the dashboard.
And here is the benchmark code for reference:
from pathlib import Path
from transformers import AutoTokenizer, pipeline, PretrainedConfig
from optimum.onnxruntime import ORTModelForSequenceClassification
import argparse
parser = argparse.ArgumentParser(description='Benchmark args')
parser.add_argument('--onnx', action='store_true')
parser.add_argument('--gpu', action='store_true')
parser.add_argument('--quantize', dest='quantize', action='store_true')
args = parser.parse_args()
model_id="optimum/distilbert-base-uncased-finetuned-banking77"
dataset_id="banking77"
onnx_path = Path("/tmp/onnx")
task = "text-classification"
payload = "What can I do if my card still hasn't arrived after 2 weeks?"
def compute_accuracy(pipe):
    print('Evaluating...')

    from evaluate import evaluator
    from datasets import load_dataset

    task_evaluator = evaluator("text-classification")
    eval_dataset = load_dataset("banking77", split="test")

    results = task_evaluator.compute(
        model_or_pipeline=pipe,
        data=eval_dataset,
        metric="accuracy",
        input_column="text",
        label_column="label",
        label_mapping=pipe.model.config.label2id,
        strategy="simple",
    )
    return results['accuracy']
# Graphsignal: configure, expects GRAPHSIGNAL_API_KEY environment variable
import graphsignal
graphsignal.configure(deployment='distilbert-local')
if not args.onnx:
    vanilla_clx = pipeline(task, model=model_id, device=0 if args.gpu else -1)

    accuracy = compute_accuracy(vanilla_clx)
    print('Accuracy (vanilla)', accuracy)

    for _ in range(100):
        # Graphsignal: measure inference
        with graphsignal.start_trace(endpoint='predict-vanilla'):
            _ = vanilla_clx(payload)
if args.onnx:
    if args.quantize:
        from optimum.onnxruntime import ORTQuantizer
        from optimum.onnxruntime.configuration import AutoQuantizationConfig

        model = ORTModelForSequenceClassification.from_pretrained(model_id, from_transformers=True)
        # Save config.json so it can be loaded for the ONNX pipeline below.
        model.config.save_pretrained(onnx_path)

        quantizer = ORTQuantizer.from_pretrained(model_id, feature=model.pipeline_task)
        qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=True)
        # The quantized model is written to the same model.onnx path,
        # so the loading code below works for both branches.
        quantizer.export(
            onnx_model_path=onnx_path / "model.onnx",
            onnx_quantized_model_output_path=onnx_path / "model.onnx",
            quantization_config=qconfig,
        )
    else:
        model = ORTModelForSequenceClassification.from_pretrained(model_id, from_transformers=True)
        model.save_pretrained(onnx_path)

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.save_pretrained(onnx_path)

    import onnxruntime
    session = onnxruntime.InferenceSession(
        str(onnx_path / 'model.onnx'),
        providers=[
            'CUDAExecutionProvider' if args.gpu else 'CPUExecutionProvider'
        ])
    model_from_session = ORTModelForSequenceClassification(
        model=session,
        config=PretrainedConfig.from_json_file(onnx_path / 'config.json'))

    optimum_clx = pipeline(task, model=model_from_session, tokenizer=tokenizer, device=0 if args.gpu else -1)

    accuracy = compute_accuracy(optimum_clx)
    print('Accuracy (ONNX)', accuracy)

    for _ in range(100):
        # Graphsignal: measure inference
        with graphsignal.start_trace(endpoint='predict-onnx'):
            _ = optimum_clx(payload)
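Assuming the script is saved as, say, benchmark.py and the GRAPHSIGNAL_API_KEY environment variable is set, the vanilla baseline runs with python benchmark.py, the ONNX Runtime version with python benchmark.py --onnx, and the quantized version with python benchmark.py --onnx --quantize; adding --gpu switches inference to the GPU.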
To try it out yourself, see the Quick Start Guide for instructions.