Toward a new framework to accelerate large language model inference

High-quality output at low latency is a critical requirement when using large language models (LLMs), especially in real-world scenarios, such as chatbots interacting with customers, or the AI code assistants used by millions of users daily.

How Mexico’s ‘CJNG’ Drug Cartel Embraced AI, Drones, and Social Media

February 25, 2026

College Campuses Are in Upheaval Over Faculty Ties to Epstein

February 25, 2026

Currently, LLMs use a framework called autoregressive decoding, in which text is generated one token at a time, and the previous text is used to generate the next sequence. However, this is clearly inefficient, because for longer sequences, the time to generate responses increases linearly.

To address this issue, researchers are widely exploring the use of speculative decoding that follows a “guess and verify” framework. In this approach, a specifically trained smaller LLM guesses multiple text tokens in advance, which is simultaneously verified by the original LLM, substantially reducing the response generation time.

But these approaches require additional model training and extensive computational resources. While researchers have considered training-free speculative models in parallel, the speedup gain in these approaches remains limited due to a reduced quality of their speculative guesses.

To address these gaps in the field, Professor Nguyen Le Minh and his doctoral students, Nguyen-Khang Le and Dinh-Truong Do, from the Japan Advanced Institute of Science and Technology (JAIST) recently developed a new speculative decoding framework called SPECTRA and demonstrated accelerated text generation speed without any need for additional training.

“The framework consists of two main components: a core module (SPECTRA-CORE), which integrates seamlessly into LLMs in a plug-and-play manner, and an optional retrieval module (SPECTRA-RETRIEVAL) that further enhances performance,” explains Prof. Nguyen. The team’s findings were presented at the 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025) by Dinh-Truong Truong and are published in the conference proceedings.

SPECTRA-CORE, the core module, generates high-quality guesses by using the text distribution pattern predicted by the LLM, improving speculative decoding. In this smart system, dictionaries containing different sizes of word sequences (N-grams) can be searched bidirectionally (forward and backward) to predict word combinations, guessing phrases of varying lengths quickly and more accurately. Additionally, SPECTRA keeps optimizing the N-gram dictionaries by constantly updating them with new word combinations, ensuring robust text coverage.

To speed things up further, the retrieval module, SPECTRA-RETRIEVAL, is integrated into SPECTRA-CORE. Existing approaches that use external sources to retrieve information and generate guesses in speculative decoding often struggle to integrate with other decoding frameworks as the search time exceeds speedup outcomes.

In contrast, SPECTRA-RETRIEVAL filters a large dataset of texts and keeps only the parts that are easy for the target LLM to predict based on perplexity scores. This, in turn, ensures that only high-quality, relevant data is used for training or fine-tuning the model, enabling seamless integration with SPECTRA-CORE.

In their study, the team tested SPECTRA on six tasks, including multi-turn conversations, code generation, and mathematical reasoning, across three LLM families—Llama 2, Llama 3, and CodeLlama. SPECTRA achieved 4x speedup gains and was able to outperform state-of-the-art non-training speculative decoding methods, notably REST, ANPD, and Lookahead.

While the overall model architecture and dataset characteristics determined the speedup gains of speculative decoding methods, SPECTRA showed reliability across a range of models and datasets, consistently accelerating speedup ratios.

“By integrating our plug-and-play SPECTRA-CORE module—which leverages multi-level N-gram storage and bidirectional search—with the refined SPECTRA-RETRIEVAL module that selects high-quality external cues via perplexity-based filtering, we were able to achieve substantial speedups (up to 4.08×) across diverse tasks and model architectures while preserving the original model’s output quality,” states Prof. Nguyen.

By reducing the response generation time without needing to retrain LLMs, SPECTRA offers a practical solution for commercial and research systems that use LLMs and could very well lead to improved accessibility and sustainability of high-performance AIs in the long-term.

More information:
Nguyen-Khang Le et al, SPECTRA: Faster Large Language Model Inference with Optimized Internal and External Speculation, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2025). DOI: 10.18653/v1/2025.acl-long.685

Provided by
Japan Advanced Institute of Science and Technology

Citation:
Toward a new framework to accelerate large language model inference (2025, August 7)
retrieved 7 August 2025
from https://techxplore.com/news/2025-08-framework-large-language-inference.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no
part may be reproduced without the written permission. The content is provided for information purposes only.

Source link