
Study shows new algorithms accelerate AI models

By Simon Osuji
July 16, 2025
Artificial Intelligence


Credit: Unsplash/CC0 Public Domain

Just as people from different countries speak different languages, AI models also create various internal “languages”—a unique set of tokens understood only by each model. Until recently, there was no way for models developed by different companies to communicate directly, collaborate or combine their strengths to improve performance.


This week, at the International Conference on Machine Learning (ICML) in Vancouver, Canada, scientists from the Weizmann Institute of Science and Intel Labs are presenting a new set of algorithms that overcome this barrier, enabling users to benefit from the combined computational power of AI models working together. The new algorithms, already available to millions of AI developers around the world, speed up the performance of large language models (LLMs)—today’s leading models of generative AI—by 1.5 times, on average.

The research is published on the arXiv preprint server.

LLMs, such as ChatGPT and Gemini, are powerful tools, but they come with significant drawbacks: They are slow and consume large amounts of computing power. In 2022, major tech companies realized that AI models, like people, could benefit from collaboration and division of labor. This led to the development of a method called speculative decoding, in which a small, fast model, possessing relatively limited knowledge, makes a first guess while answering a user’s query, and a larger, more powerful but slower model reviews and corrects the answer if needed.
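The division of labor described above can be sketched in a few lines. The toy "models" below are hypothetical stand-ins (simple character-level functions, not the authors' implementation): the fast draft model speculates a run of tokens cheaply, and the slow target model verifies them, keeping every token it agrees with for free and correcting the first one it doesn't.

```python
def draft_model(context):
    """Fast but imprecise: cycles through the letters of 'bananas'."""
    return "bananas"[len(context) % 7]

def target_model(context):
    """Slow but authoritative: spells 'banana', then stops with '.'."""
    return "banana"[len(context)] if len(context) < 6 else "."

def speculative_decode(context, k=4, max_len=7):
    while len(context) < max_len:
        # 1. The draft model speculates k tokens cheaply.
        draft = []
        for _ in range(k):
            draft.append(draft_model(context + "".join(draft)))
        # 2. The target model verifies each speculated token in turn.
        accepted = []
        for tok in draft:
            correct = target_model(context + "".join(accepted))
            if tok == correct:
                accepted.append(tok)      # draft was right: token is "free"
            else:
                accepted.append(correct)  # fix it and end this round
                break
        context += "".join(accepted)
        if context.endswith("."):
            break
    return context

print(speculative_decode(""))  # → 'banana.'
```

Because the target model accepts or rejects a whole speculated run at once, the final text is exactly what the target model alone would have produced, only reached in fewer slow passes.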

Speculative decoding was quickly adopted by tech giants because it maintains 100% accuracy—unlike most acceleration techniques, which reduce output quality. But it had one big limitation: Both models had to “speak” the exact same digital language, which meant that models developed by different companies could not be combined.

“Tech giants adopted speculative decoding, benefiting from faster performance and saving billions of dollars a year in cost of processing power, but they were the only ones to have access to small, faster models that speak the same language as larger models,” explains Nadav Timor, a Ph.D. student in Prof. David Harel’s research team in Weizmann’s Computer Science and Applied Mathematics Department, who led the new development.

“In contrast, a startup seeking to benefit from speculative decoding had to train its own small model that matched the language of the big one, and that takes a great deal of expertise and costly computational resources.”

The new algorithms developed by Weizmann and Intel researchers allow developers to pair any small model with any large model, causing them to work as a team. To overcome the language barrier, the researchers came up with two solutions.

First, they designed an algorithm that allows an LLM to translate its output from its internal token language into a shared format that all models can understand. Second, they created another algorithm that steers the collaborating models to rely mainly on tokens that carry the same meaning across models, much as words like “banana” or “internet” are nearly identical across human languages.
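A toy illustration of the two ideas (hypothetical miniature vocabularies, not the paper's actual algorithms): idea 1 translates a model's private token IDs back into plain text, a format any model can consume; idea 2 identifies the tokens whose surface text exists in both vocabularies, the "banana"-like common ground.

```python
# Two models with different internal token "languages" (toy vocabularies).
vocab_a = {0: "ban", 1: "ana", 2: "internet", 3: "##s"}
vocab_b = {0: "internet", 1: "banana", 2: "ban", 3: "net"}

def to_shared_format(token_ids, vocab):
    """Idea 1: detokenize IDs into plain text that every model understands."""
    return "".join(vocab[i] for i in token_ids)

def shared_tokens(va, vb):
    """Idea 2: tokens with identical surface text in both vocabularies."""
    return set(va.values()) & set(vb.values())

text = to_shared_format([0, 1], vocab_a)  # "ban" + "ana" → "banana"
common = shared_tokens(vocab_a, vocab_b)  # {"ban", "internet"}
print(text, sorted(common))
```

Real tokenizers complicate both steps (byte-level encodings, merge rules, ambiguous detokenization), which is where the lossless guarantees of the published algorithms come in.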

“At first, we worried that too much information would be ‘lost in translation’ and that different models wouldn’t be able to collaborate effectively,” says Timor. “But we were wrong. Our algorithms speed up the performance of LLMs by up to 2.8 times, leading to massive savings in spending on processing power.”

The significance of this research has been recognized by ICML organizers, who selected the study for public presentation—a distinction granted to only about 1% of the 15,000 submissions received this year. “We have solved a core inefficiency in generative AI,” says Oren Pereg, a senior researcher at Intel Labs and co-author of the study. “This isn’t just a theoretical improvement; these are practical tools that are already helping developers build faster and smarter applications.”

In the past several months, the team released their algorithms on the open-source AI platform Hugging Face Transformers, making them freely available to developers around the world. The algorithms have since become part of standard tools for running efficient AI processes.

“This new development is especially important for edge devices, from phones and drones to autonomous cars, which must rely on limited computing power when not connected to the internet,” Timor adds. “Imagine, for example, a self-driving car that is guided by an AI model. In this case, a faster model can make the difference between a safe decision and a dangerous error.”

Also participating in the study were Dr. Jonathan Mamou, Daniel Korat, Moshe Berchansky and Moshe Wasserblat from Intel Labs and Gaurav Jain from d-Matrix. Prof. David Harel is the incumbent of the William Sussman Professorial Chair of Mathematics.

More information:
Nadav Timor et al, Accelerating LLM Inference with Lossless Speculative Decoding Algorithms for Heterogeneous Vocabularies, arXiv (2025). DOI: 10.48550/arxiv.2502.05202

Journal information:
arXiv

Provided by
Weizmann Institute of Science

Citation:
Faster, smarter, more open: Study shows new algorithms accelerate AI models (2025, July 16)
retrieved 16 July 2025
from https://techxplore.com/news/2025-07-faster-smarter-algorithms-ai.html
