• Business
  • Markets
  • Politics
  • Crypto
  • Finance
  • Intelligence
    • Policy Intelligence
    • Security Intelligence
    • Economic Intelligence
    • Fashion Intelligence
  • Energy
  • Technology
  • Taxes
  • Creator Economy
  • Wealth Management
  • LBNN Blueprints
  • Business
  • Markets
  • Politics
  • Crypto
  • Finance
  • Intelligence
    • Policy Intelligence
    • Security Intelligence
    • Economic Intelligence
    • Fashion Intelligence
  • Energy
  • Technology
  • Taxes
  • Creator Economy
  • Wealth Management
  • LBNN Blueprints

Leaner large language models could enable efficient local use on phones and laptops

Simon Osuji by Simon Osuji
November 18, 2024
in Artificial Intelligence
0
Leaner large language models could enable efficient local use on phones and laptops
0
SHARES
3
VIEWS
Share on FacebookShare on Twitter


smartphone
Credit: Limon Das from Pexels

Large language models (LLMs) are increasingly automating tasks like translation, text classification and customer service. But tapping into an LLM’s power typically requires users to send their requests to a centralized server—a process that’s expensive, energy-intensive and often slow.

Related posts

Onnit’s Instant Melatonin Spray Keeps Bedtime Uncomplicated

Onnit’s Instant Melatonin Spray Keeps Bedtime Uncomplicated

January 31, 2026
How to Film ICE | WIRED

How to Film ICE | WIRED

January 31, 2026

Now, researchers have introduced a technique for compressing an LLM’s reams of data, which could increase privacy, save energy and lower costs. Their findings are published on the arXiv preprint server.

The new algorithm, developed by engineers at Princeton and Stanford Engineering, works by trimming redundancies and reducing the precision of an LLM’s layers of information. This type of leaner LLM could be stored and accessed locally on a device like a phone or laptop and could provide performance nearly as accurate and nuanced as an uncompressed version.

“Any time you can reduce the computational complexity, storage and bandwidth requirements of using AI models, you can enable AI on devices and systems that otherwise couldn’t handle such compute- and memory-intensive tasks,” said study co-author Andrea Goldsmith, dean of Princeton’s School of Engineering and Applied Science and Arthur LeGrand Doty Professor of Electrical and Computer Engineering.

“When you use ChatGPT, whatever request you give it goes to the back-end servers of OpenAI, which process all of that data, and that is very expensive,” said co-author Rajarshi Saha, a Stanford Engineering Ph.D. student. “So, you want to be able to do this LLM inference using consumer GPUs [graphics processing units], and the way to do that is by compressing these LLMs.” Saha’s graduate work is co-advised by Goldsmith and co-author Mert Pilanci, an assistant professor at Stanford Engineering.

The researchers will present their new algorithm CALDERA, which stands for Calibration Aware Low precision DEcomposition with low Rank Adaptation, at the Conference on Neural Information Processing Systems (NeurIPS) in December. Saha and colleagues began this compression research not with LLMs themselves, but with the large collections of information that are used to train LLMs and other complex AI models, such as those used for image classification. This technique, a forerunner to the new LLM compression approach, was published in 2023.

Training data sets and AI models are both composed of matrices, or grids of numbers that are used to store data. In the case of LLMs, these are called weight matrices, which are numerical representations of word patterns learned from large swaths of text.

“We proposed a generic algorithm for compressing large data sets or large matrices,” said Saha. “And then we realized that nowadays, it’s not just the data sets that are large, but the models being deployed are also getting large. So, we could also use our algorithm to compress these models.”

While the team’s algorithm is not the first to compress LLMs, its novelty lies in an innovative combination of two properties, one called “low-precision,” the other “low-rank.” As digital computers store and process information as bits (zeros and ones), “low-precision” representation reduces the number of bits, speeding up storage and processing while improving energy efficiency. On the other hand, “low-rank” refers to reducing redundancies in the LLM weight matrices.

“Using both of these properties together, we are able to get much more compression than either of these techniques can achieve individually,” said Saha.

The team tested their technique using Llama 2 and Llama 3, open-source large language models released by Meta AI, and found that their method, which used low-rank and low-precision components in tandem with each other, can be used to improve other methods which use just low-precision. The improvement can be up to 5%, which is significant for metrics that measure uncertainty in predicting word sequences.

They evaluated the performance of the compressed language models using several sets of benchmark tasks for LLMs. The tasks included determining the logical order of two statements, or answering questions involving physical reasoning, such as how to separate an egg white from a yolk or how to make a cup of tea.

“I think it’s encouraging and a bit surprising that we were able to get such good performance in this compression scheme,” said Goldsmith, who moved to Princeton from Stanford Engineering in 2020. “By taking advantage of the weight matrix rather than just using a generic compression algorithm for the bits that are representing the weight matrix, we were able to do much better.”

Using an LLM compressed in this way could be suitable for situations that don’t require the highest possible precision. Moreover, the ability to fine-tune compressed LLMs on edge devices like a smartphone or laptop enhances privacy by allowing organizations and individuals to adapt models to their specific needs without sharing sensitive data with third-party providers. This reduces the risk of data breaches or unauthorized access to confidential information during the training process. To enable this, the LLMs must initially be compressed enough to fit on consumer-grade GPUs.

Saha also cautioned that running LLMs on a smartphone or laptop could hog the device’s memory for a period of time. “You won’t be happy if you are running an LLM and your phone drains out of charge in an hour,” said Saha.

Low-precision computation can help reduce power consumption, he added. “But I wouldn’t say that there’s one single technique that solves all the problems. What we propose in this paper is one technique that is used in combination with techniques proposed in prior works. And I think this combination will enable us to use LLMs on mobile devices more efficiently and get more accurate results.”

More information:
Rajarshi Saha et al, Compressing Large Language Models using Low Rank and Low Precision Decomposition, arXiv (2024). DOI: 10.48550/arxiv.2405.18886

Journal information:
arXiv

Provided by
Princeton University

Citation:
Leaner large language models could enable efficient local use on phones and laptops (2024, November 18)
retrieved 18 November 2024
from https://techxplore.com/news/2024-11-leaner-large-language-enable-efficient.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no
part may be reproduced without the written permission. The content is provided for information purposes only.





Source link

Previous Post

FDA endorses speedy approval path for Regenxbio Duchenne gene therapy

Next Post

Allstate Takes New Approach to Return-to-Office: Coworking

Next Post
Allstate Takes New Approach to Return-to-Office: Coworking

Allstate Takes New Approach to Return-to-Office: Coworking

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

RECOMMENDED NEWS

The Bahamas to host IV meeting on sustainable development

The Bahamas to host IV meeting on sustainable development

3 years ago
Max Q: Will Starlink IPO?

Max Q: Will Starlink IPO?

2 years ago
Rand steady ahead of PPI and rate decision

Rand steady ahead of PPI and rate decision

2 years ago
Why are AFCON seats empty?

Why are AFCON seats empty?

2 years ago

POPULAR NEWS

  • Ghana to build three oil refineries, five petrochemical plants in energy sector overhaul

    Ghana to build three oil refineries, five petrochemical plants in energy sector overhaul

    0 shares
    Share 0 Tweet 0
  • The world’s top 10 most valuable car brands in 2025

    0 shares
    Share 0 Tweet 0
  • Top 10 African countries with the highest GDP per capita in 2025

    0 shares
    Share 0 Tweet 0
  • Global ranking of Top 5 smartphone brands in Q3, 2024

    0 shares
    Share 0 Tweet 0
  • When Will SHIB Reach $1? Here’s What ChatGPT Says

    0 shares
    Share 0 Tweet 0

Get strategic intelligence you won’t find anywhere else. Subscribe to the Limitless Beliefs Newsletter for monthly insights on overlooked business opportunities across Africa.

Subscription Form

© 2026 LBNN – All rights reserved.

Privacy Policy | About Us | Contact

Tiktok Youtube Telegram Instagram Linkedin X-twitter
No Result
View All Result
  • Home
  • Business
  • Politics
  • Markets
  • Crypto
  • Economics
    • Manufacturing
    • Real Estate
    • Infrastructure
  • Finance
  • Energy
  • Creator Economy
  • Wealth Management
  • Taxes
  • Telecoms
  • Military & Defense
  • Careers
  • Technology
  • Artificial Intelligence
  • Investigative journalism
  • Art & Culture
  • LBNN Blueprints
  • Quizzes
    • Enneagram quiz
  • Fashion Intelligence

© 2023 LBNN - All rights reserved.