ChatGPT Sucks at Checking Its Own Code

by Simon Osuji
December 5, 2024
in Artificial Intelligence

This article is part of our exclusive IEEE Journal Watch series in partnership with IEEE Xplore.

There’s a lot of hype around ChatGPT’s ability to produce code, yet so far the AI program isn’t on par with its human counterparts. But how good is it at catching its own mistakes?

Researchers in China put ChatGPT to the test in a recent study, evaluating its ability to assess its own code for correctness, vulnerabilities and successful repairs. The results, published 5 November in IEEE Transactions on Software Engineering, show that the AI program is overconfident, often suggesting that code is more satisfactory than it is in reality. The results also show what sort of prompts and tests might improve ChatGPT’s self-verification abilities.

Xing Hu, an associate professor at Zhejiang University, led the study. She emphasizes that, with the growing use of ChatGPT in software development, ensuring the quality of its generated code has become increasingly important.

Hu and her colleagues first tested ChatGPT-3.5’s ability to produce code using several large coding datasets.

Their results show that it can generate “correct” code (code that does what it’s supposed to do) with an average success rate of 57 percent, generate code without security vulnerabilities with a success rate of 73 percent, and repair incorrect code with an average success rate of 70 percent.

So it is successful sometimes, but it is still making quite a few mistakes.

Asking ChatGPT to Check Its Coding Work

First, the researchers asked ChatGPT-3.5 to check its own code for correctness using direct prompts, which involve asking it to check whether the code meets a specific requirement.

Thirty-nine percent of the time it erroneously said that code was correct when it was not. It also incorrectly said that code was free of security vulnerabilities 25 percent of the time, and that it had successfully repaired code when it had not 28 percent of the time.
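
One way to read figures like these is as a false-pass rate: of the samples that were actually wrong, the share the model nevertheless judged to be fine. The sketch below assumes that reading (the article does not spell out the exact denominator), and the function name and sample data are made up for illustration.

```python
from typing import Iterable, Tuple


def false_pass_rate(results: Iterable[Tuple[bool, bool]]) -> float:
    """Share of actually-incorrect samples that the model judged correct.

    Each item is a hypothetical (actually_correct, model_says_correct) pair.
    """
    # Keep only the model's verdicts for samples that were really wrong.
    verdicts_on_wrong = [said_ok for actually_ok, said_ok in results if not actually_ok]
    if not verdicts_on_wrong:
        return 0.0  # no incorrect samples, so nothing to misjudge
    return sum(verdicts_on_wrong) / len(verdicts_on_wrong)


# Illustrative data only, not taken from the study.
samples = [(True, True), (False, True), (False, False), (False, True), (False, False)]
rate = false_pass_rate(samples)
```

The same calculation would apply to the vulnerability and repair checks by swapping in the matching ground-truth label.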

Interestingly, ChatGPT caught more of its own mistakes when the researchers used guiding questions, which ask it to agree or disagree with an assertion that the code does not meet the requirements. Compared with direct prompts, guiding questions increased the detection of incorrectly generated code by an average of 25 percent, the identification of vulnerabilities by 69 percent, and the recognition of failed program repairs by 33 percent.
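
To make the difference between the two prompt styles concrete, here is a minimal sketch of how each might be assembled. The wording and function names are hypothetical and are not the prompts used in the study; only the overall shape (a neutral question versus an assertion to agree or disagree with) comes from the description above.

```python
def direct_prompt(requirement: str, code: str) -> str:
    """Neutral check: ask whether the code meets the requirement."""
    return (
        "Does the following code meet this requirement?\n"
        f"Requirement: {requirement}\n"
        f"Code:\n{code}\n"
        "Answer yes or no."
    )


def guiding_prompt(requirement: str, code: str) -> str:
    """Guiding question: assert that the code fails, then ask for agreement."""
    return (
        "The following code does not meet the requirement below.\n"
        f"Requirement: {requirement}\n"
        f"Code:\n{code}\n"
        "Do you agree or disagree with this assertion? Explain briefly."
    )


# Hypothetical example input.
requirement = "Return the sum of a list of integers."
code = "def total(xs):\n    return sum(xs)"
direct = direct_prompt(requirement, code)
guided = guiding_prompt(requirement, code)
```

Both prompts carry the same requirement and code, so the only thing that changes is how the question is framed.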

Another important finding was that, although asking ChatGPT to generate test reports was not more effective than direct prompts at identifying incorrect code, it was useful for increasing the number of vulnerabilities flagged in ChatGPT-generated code.

Hu and her colleagues also report that ChatGPT showed self-contradictory hallucinations: it initially generated code or completions that it deemed correct or secure, but later contradicted that belief during self-verification.

“The inaccuracies and self-contradictory hallucinations observed during ChatGPT’s self-verification underscore the importance of exercising caution and thoroughly evaluating its output,” Hu says. “ChatGPT should be regarded as a supportive tool for developers, rather than a replacement for their role as autonomous software creators and testers.”

As part of their study, the researchers also ran some tests using ChatGPT-4, finding that it does show substantial performance improvements in code generation, code completion, and program repair compared to ChatGPT-3.5.

“However, the overall conclusion regarding the self-verification capabilities of GPT-4 and GPT-3.5 remains similar,” Hu says, noting that GPT-4 still frequently misclassifies its generated incorrect code as correct, its vulnerable code as non-vulnerable, and its failed program repairs as successful, especially when using the direct question prompt.

Instances of self-contradictory hallucinations were also observed in GPT-4’s behavior, she adds.

“To ensure the quality and reliability of the generated code, it is essential to integrate ChatGPT’s capabilities with human expertise,” Hu emphasizes.
