Learn how to make your LLM-powered applications faster.
For applications that rely on large language models, low end-to-end latency is crucial for usability and effectiveness. Here are some of the advantages:
Better User Experience: Low latency ensures smoother, more natural interactions, enhancing user satisfaction in real-time applications like chatbots and assistants.
Improved Reasoning: Lower latency increases interaction throughput, making iterative, multi-step techniques such as chain-of-thought prompting practical.
Cost Reduction: Many latency optimizations, such as reducing prompt and completion size, also cut token usage, lowering overall costs for users and providers.
The starting point of optimization work is understanding how LLMs process prompts and generate tokens.
To understand the main factors influencing LLM inference latency, let's analyze the time complexity of transformers, including vendor-specific optimizations, and benchmark some recent models.
The time complexity behind modern transformers with attention (KV) caching can be represented as follows:

$$O(n^2 \cdot d)$$

for prompt encoding, i.e., the initial prompt pass, and:

$$O((n + m) \cdot d)$$

per generated token, where $n$ is the prompt size in tokens, $m$ is the number of tokens generated so far (up to the completion size), and $d$ captures model size (hidden dimension and number of layers).
Based on the given time complexity, the main latency factors are prompt size, completion size, and model size. Additionally, vendor-specific features like prompt caching and predicted outputs introduce other optimization possibilities. We'll look at each one of them.
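To make these factors concrete, here is a back-of-envelope latency model in Python; the prefill and decode throughput numbers are illustrative assumptions, not measurements of any particular model.

```python
# Back-of-envelope latency model: total latency is roughly the time to encode
# the prompt (time to first token) plus the time to decode the completion
# token by token. The throughput constants below are illustrative only.

def estimate_latency_seconds(
    prompt_tokens: int,
    completion_tokens: int,
    prefill_tokens_per_second: float = 2000.0,  # assumed prompt-encoding throughput
    decode_tokens_per_second: float = 80.0,     # assumed generation throughput
) -> float:
    time_to_first_token = prompt_tokens / prefill_tokens_per_second
    decode_time = completion_tokens / decode_tokens_per_second
    return time_to_first_token + decode_time

# Example: a 2,000-token prompt with a 400-token completion
print(f"{estimate_latency_seconds(2000, 400):.1f}s")  # ~6.0s with these assumptions
```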
Each token in the prompt adds extra steps in computing attention across all layers. A smaller prompt reduces this upfront “prompt processing” compute, thus lowering latency.
To illustrate this relationship, here is an end-to-end latency benchmark using OpenAI's GPT-4o model, gradually increasing the number of prompt tokens while restricting completions to just one token.
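A minimal sketch of such a benchmark with the official openai Python client is shown below; padding the prompt with a repeated word is a rough way to control prompt size, so results will not exactly match the chart.

```python
import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Measure end-to-end latency for increasing prompt sizes while keeping
# the completion fixed at a single token, isolating prompt-encoding cost.
for prompt_tokens in (100, 1000, 5000, 10000):
    prompt = "hello " * prompt_tokens  # rough proxy: ~1 token per repetition
    start = time.perf_counter()
    client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
    )
    elapsed = time.perf_counter() - start
    print(f"{prompt_tokens:>6} prompt tokens: {elapsed:.2f}s")
```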
Prompt compression is a general approach to reduce prompt size. It is the process of condensing a detailed prompt into a more concise and efficient form while retaining the key elements necessary for achieving the desired output. When it comes to latency, investing time in compressing large prompts can yield significant benefits.
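As one simple, hedged example, the sketch below trims older conversation history to a fixed token budget using the tiktoken library; the budget value is an assumption, and heavier-weight compression (e.g., summarizing older turns) can preserve more meaning.

```python
import tiktoken

# o200k_base is the tokenizer encoding used by GPT-4o-class models.
encoding = tiktoken.get_encoding("o200k_base")

def trim_history(messages: list[dict], max_tokens: int = 2000) -> list[dict]:
    """Keep only the most recent messages that fit into the token budget."""
    kept, used = [], 0
    for message in reversed(messages):  # newest first
        tokens = len(encoding.encode(message["content"]))
        if used + tokens > max_tokens:
            break
        kept.append(message)
        used += tokens
    return list(reversed(kept))  # restore chronological order
```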
Latency grows with each newly generated token because the model must do a forward pass for each step. Generating fewer tokens cuts the total number of decode steps, reducing overall response time.
This trend is evident in the chart: latency increases with the number of completion tokens, showing the direct correlation between the number of decode steps and response time for the GPT-4o model; the slope corresponds to about 80 tokens per second.
Reducing completion size is often the most impactful optimization, for both small and large models. Common techniques include instructing the model to respond concisely, requesting compact output formats, capping output length with max_tokens, and using stop sequences (see the sketch below).
This also applies to reasoning tasks: when reasoning effort is high, the model generates more reasoning tokens and takes longer to respond.
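A minimal sketch combining these techniques with the OpenAI Python client follows; the word limit, max_tokens value, and stop sequence are illustrative choices.

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        # Ask for a short answer up front so the model plans a concise completion.
        {"role": "system", "content": "Answer in at most 50 words. No preamble."},
        {"role": "user", "content": "Summarize the trade-offs of prompt caching."},
    ],
    max_tokens=100,  # hard cap on completion length (illustrative value)
    stop=["\n\n"],   # stop early at the first blank line (illustrative)
)
print(response.choices[0].message.content)
```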
Prompt caching can reduce latency by cutting prompt encoding time for the cached portion of the prompt, making it partially or fully constant. This feature is supported by both OpenAI and Anthropic.
Prompt caching reuses computations for prompts with identical prefixes, saving time and reducing costs. OpenAI's prompt caching works for prompts that are 1,024 tokens or longer and is cleared after 5-10 minutes of inactivity, with a maximum lifespan of one hour.
To utilize prompt caching, make sure prompt prefixes are identical across requests, i.e., keep static instructions and examples at the beginning and put custom data at the end of the prompt.
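Here is a hedged sketch of a cache-friendly prompt layout; the instruction text and helper function are illustrative.

```python
from openai import OpenAI

client = OpenAI()

# Static content first: identical across requests, so it can be served from
# the prompt cache once the prefix exceeds the caching threshold.
STATIC_INSTRUCTIONS = "You are a support assistant. <long policy text and few-shot examples>"

def answer(user_question: str, user_context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": STATIC_INSTRUCTIONS},
            # Variable, per-request data goes last so it doesn't break the prefix match.
            {"role": "user", "content": f"Context:\n{user_context}\n\nQuestion: {user_question}"},
        ],
    )
    return response.choices[0].message.content
```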
Another OpenAI feature that helps reduce latency by decreasing the number of completion tokens is Predicted Outputs.
The model employs a technique known as speculative decoding, which allows it to bypass the generation of tokens that match the provided prediction. By skipping over these predictable segments, the model reduces the computational load and shortens the overall response time.
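Below is a minimal sketch of Predicted Outputs using the prediction parameter of the Chat Completions API; the code-editing task and file name are illustrative.

```python
from openai import OpenAI

client = OpenAI()

# Most of the file will be unchanged, so we pass the current content as the
# prediction; tokens that match it can be skipped instead of generated.
existing_code = open("config.py").read()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Rename the variable timeout to timeout_seconds in the "
                       "following file and return the full updated file:\n\n" + existing_code,
        },
    ],
    prediction={"type": "content", "content": existing_code},
)
print(response.choices[0].message.content)
```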
Smaller models have fewer parameters and thus lower per-token compute times (both attention and feed-forward). This directly translates to faster processing for both prompt handling and token generation.
While we didn't see much difference in our speed benchmarks between GPT-4o and GPT-4o Mini, other model families, e.g., the Llama 3 family, show a 2-10x speed difference between small and large models.
ArtificialAnalysis.ai provides useful benchmarks for comparing model throughput.
Inference-optimized stacks, such as AWS Inf2 instances and Groq chips, boost token generation throughput, and thus reduce average latency, by roughly 2-10x, generating 100-300 tokens per second for LLMs such as Llama 70B, while the GPT-4o model achieves around 80-90 tokens per second in our benchmarks.
LLM output streaming enhances the perception of responsiveness by delivering responses incrementally, allowing users to see results as they are generated. While the total processing time may remain the same, users feel engaged and can start consuming the information earlier, improving the interactive experience.
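A minimal streaming sketch with the OpenAI Python client is shown below; printing chunks as they arrive is just one way to surface partial results to users.

```python
from openai import OpenAI

client = OpenAI()

# Stream tokens as they are generated instead of waiting for the full response.
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain prompt caching in three sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```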
Once you understand the key factors affecting LLM latency, the next step is building a practical workflow for optimizing, validating, and monitoring performance in both test and production. A critical enabler of this iterative workflow is the ability to measure performance and pinpoint latency bottlenecks.
You can’t optimize what you don’t measure. The foundation of latency optimization is continuous monitoring and benchmarking of your LLM generations in production.
You can use Graphsignal, which is purpose-built for observability and optimization of LLM-powered applications. It provides detailed telemetry from real-world usage and helps pinpoint latency bottlenecks.
With Graphsignal, you can trace LLM API calls and measure their latency, token usage, costs, and errors in production.
This data helps you correlate latency with prompts, users, models, or changes in code deployments.
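As a rough sketch, a typical setup looks like the following; treat the configuration details as assumptions and refer to Graphsignal's Quick Start guide for the exact steps. The deployment name is just an example.

```python
import graphsignal
from openai import OpenAI

# Configure once at application startup; assumes the Graphsignal Python agent
# auto-instruments OpenAI calls and records latency and token usage.
graphsignal.configure(api_key="my-graphsignal-api-key", deployment="chatbot-prod")

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)
```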
See the Quick Start guide on how to add Graphsignal to your application. Follow us at @GraphsignalAI for updates.