LBNN
Visual Reasoning in AI: Boosting Problem-Solving with Images

By Simon Osuji
February 12, 2025
in Artificial Intelligence


When humans try to solve problems, they often visualize the tasks in their heads. New research suggests that enabling artificial intelligence to do the same could boost performance on spatial reasoning challenges.

While large language models excel at many text-based tasks, they often struggle with those that require more complex reasoning. One of the most promising approaches for boosting their performance on these kinds of problems is a technique known as “chain-of-thought” (CoT) prompting, in which users ask the model to “think” through a problem step by step.
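As a sketch, the difference between direct prompting and CoT prompting often comes down to appending an instruction to the prompt. The helper functions below are illustrative only, with no particular model or API assumed; they just show the two prompt shapes:

```python
# Minimal sketch of chain-of-thought (CoT) prompting versus direct prompting.
# No model call is made here; the point is only the prompt structure.

def build_direct_prompt(question: str) -> str:
    """Direct prompting: ask for the answer with no intermediate reasoning."""
    return f"{question}\nAnswer:"

def build_cot_prompt(question: str) -> str:
    """CoT prompting: ask the model to write out its reasoning first."""
    return f"{question}\nLet's think step by step, then give the final answer."

question = "A train leaves at 3:40 pm and the trip takes 95 minutes. When does it arrive?"
print(build_direct_prompt(question))
print(build_cot_prompt(question))
```

With a capable model, the second prompt typically elicits intermediate reasoning text before the answer, which is where the accuracy gains on math and logic tasks come from.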


This can lead to significant improvements on various reasoning tasks, especially in mathematics, coding, and logic. But the language-focused technique has proved less effective for problems requiring spatial or visual reasoning. To try to close that gap, researchers at the University of Cambridge and Microsoft Research have developed a new approach that lets AI “think” in both text and images.

The technique enables multimodal large language models, which can process both image and text data, to generate visual representations of their intermediate reasoning steps. In non-peer-reviewed research posted to arXiv, the researchers report that when they tested the approach on spatial reasoning challenges involving 2D mazes, they saw significant improvements over the typical CoT technique on the most challenging scenarios.

“Spatial relations and layouts and also some geometric features are very hard to describe with pure text,” says co-lead author Chengzu Li, a Ph.D. student at Cambridge. “That’s why we think that reasoning with pure text would limit the performance of the model in spatial tasks. And that’s the main motivation for introducing visual ‘thoughts,’” he says.

How AI Visual Reasoning Works

This is not the first attempt to allow AI to reason visually. But Li says previous approaches have either involved extracting information from images and converting it to text before reasoning with it, or have relied on external software tools or specialized vision models to enable visual reasoning.

The new approach enables a single multimodal model to generate both visual and text reasoning steps itself. This work only recently became feasible, says Li, thanks to the development of more powerful multimodal AI. Older models could interpret images and text, but could only generate text outputs. For these experiments, the researchers used a model called Anole that can respond in either modality.

This model is an open-source extension of Meta’s Chameleon multimodal model: the researchers behind Anole retrained it to generate sequences of text interleaved with images. For instance, it can generate a step-by-step recipe with an image for each step. Li and colleagues took this pre-trained model and fine-tuned it on text and image data from three maze-like games with different levels of complexity. They called their fine-tuned version Multimodal Visualization of Thought (MVoT).
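The paper’s actual token format is not reproduced here, but an MVoT-style interleaved trace can be pictured as alternating verbal and visual steps. The structure below is purely illustrative; the placeholder strings stand in for generated image tokens:

```python
# Hypothetical sketch of an interleaved reasoning trace in the style of MVoT:
# each verbal reasoning step is followed by a visual "thought" (here just a
# placeholder string standing in for a generated image).

trace = [
    {"modality": "text",  "content": "Task: follow the action sequence from the start cell."},
    {"modality": "text",  "content": "Step 1: move right into the corridor."},
    {"modality": "image", "content": "<image: player one cell to the right>"},
    {"modality": "text",  "content": "Step 2: move down toward the goal."},
    {"modality": "image", "content": "<image: player adjacent to the goal>"},
    {"modality": "text",  "content": "Answer: the player reaches the destination."},
]

# The model alternates modalities: every image follows the verbal step it visualizes.
for prev, step in zip(trace, trace[1:]):
    if step["modality"] == "image":
        assert prev["modality"] == "text"
```

A model like Anole emits such a sequence autoregressively, so the image at each step can condition the text that follows it, and vice versa.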

[Figure: Flowchart comparing direct prompting, chain of thought, and multimodal visualization of thought. Traditional chain of thought relies solely on verbal reasoning to generate an output. The researchers tested the new technique (bottom), which generates both visual and verbal thoughts, against one that reasons only in text (middle) and one that skips reasoning and jumps straight to the answer (top). Credit: Chengzu Li, Wenshan Wu et al.]

The goal for the model was to work out what would happen if it took a predetermined series of actions in each maze. During training, the model was shown examples that included an image of the starting position in the maze and a textual description of the task; a series of reasoning steps featuring text descriptions of actions and images of where the player is on the map; and finally an answer as to what the outcome of those actions would be, such as reaching the desired destination or falling down a hole. During testing, the model was given only the starting image and a sequence of actions to perform. It then generated image and text reasoning steps, followed by a prediction of what would happen.
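As a toy illustration of the evaluation task (not the researchers’ code), here is a minimal simulator that computes the ground-truth outcome of an action sequence on a small grid; the grid layout and cell legend are invented for this sketch:

```python
# Toy version of the maze evaluation task. Cell legend (our own convention):
# "." floor, "#" wall, "O" hole, "G" goal. Positions are (row, col).

MAZE = [
    "...G",
    ".#.O",
    "....",
]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def simulate(start, actions):
    """Apply actions in order; return 'goal', 'hole', or 'neither'."""
    r, c = start
    for a in actions:
        dr, dc = MOVES[a]
        nr, nc = r + dr, c + dc
        # Moves off the grid or into a wall leave the player in place.
        if not (0 <= nr < len(MAZE) and 0 <= nc < len(MAZE[0])):
            continue
        if MAZE[nr][nc] == "#":
            continue
        r, c = nr, nc
        if MAZE[r][c] == "O":
            return "hole"
        if MAZE[r][c] == "G":
            return "goal"
    return "neither"

print(simulate((0, 0), ["right", "right", "right"]))  # → goal
```

MVoT has to predict this outcome without executing the actions symbolically: it must track the player’s position by generating an image after each verbal step.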

The researchers compared MVoT to four other models, three of which they fine-tuned themselves. The first two versions of the model were trained only on text data regarding the maze: One model jumped straight from a prompt to generating a final answer, the other used textual CoT reasoning. Another model was trained on examples of both image and text reasoning, but then did its own reasoning purely in text. Finally, they compared MVoT’s performance on the maze tasks to that of the GPT-4o model from OpenAI, which is the company’s most advanced multimodal model.

They found that on all three games, the MVoT model significantly outperformed all models apart from the one using traditional text CoT. That model actually did slightly better on the two simpler mazes, successfully predicting the outcome 98 percent of the time on both, compared to MVoT’s scores of 93 percent and 95 percent. But the traditional text CoT model did much worse on the most complicated game, scoring just 61 percent compared to MVoT’s 86 percent. They tested both models on progressively larger mazes, and while MVoT’s performance remained stable, the other model’s plummeted as maze size increased.
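Collecting the reported figures in one place (taking both simpler-maze CoT scores as 98 percent, as stated above) makes the trade-off explicit:

```python
# Reported accuracies (percent correct), games ordered simplest to hardest.
scores = {
    "text_cot": [98, 98, 61],
    "mvot":     [93, 95, 86],
}

# Per-game gap of MVoT over text-only CoT: slightly negative on the simpler
# mazes, strongly positive on the hardest one.
gaps = [m - t for m, t in zip(scores["mvot"], scores["text_cot"])]
print(gaps)  # → [-5, -3, 25]
```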

The researchers say this outcome is likely because CoT relies on accurate textual descriptions of the environment, which get harder the more complex the mazes become. In contrast, the inclusion of images in the reasoning process appears to make MVoT much better at dealing with more challenging environments.

Applications for AI Visual Reasoning

While the tests the researchers used are simple, Li says extending this approach into more complex domains could have broad applications. One of the most compelling is robotics, where the approach could help machines reason more effectively about the visual input they get from their environment. It could also help AI tutors better illustrate and explain ideas, particularly in areas like geometry. More broadly, he says, the approach could boost model interpretability by giving humans a clear picture of what the model is thinking about in spatial tasks.

One potential gap, admits Li, is that the model has no mechanism for deciding when to reason visually or when to reason via text. At present, the model simply alternates between the two, which works well for these maze navigation challenges that have discrete steps but may be less appropriate for more complex spatial reasoning tasks.

“We haven’t really touched on when is the appropriate time to do a visual reasoning process or not,” Li says. “But I think it’s definitely one of the very interesting directions to further explore.” One possibility, he adds, would be to generate reasoning sequences with both visual and text descriptions at each step, and then get humans to provide feedback on which is more expressive. This feedback could then be used to train the model to pick the best option at each reasoning step.


© 2023 LBNN - All rights reserved.