Friday, May 16, 2025
LBNN

ChatGPT Sucks at Checking Its Own Code

by Simon Osuji
December 5, 2024
in Artificial Intelligence

This article is part of our exclusive IEEE Journal Watch series in partnership with IEEE Xplore.

There’s a lot of hype around ChatGPT’s ability to produce code but, so far, the AI program just isn’t on par with its human counterparts. How good is it, then, at catching its own mistakes?

Researchers in China put ChatGPT to the test in a recent study, evaluating its ability to assess its own code for correctness, vulnerabilities and successful repairs. The results, published 5 November in IEEE Transactions on Software Engineering, show that the AI program is overconfident, often suggesting that code is more satisfactory than it is in reality. The results also show what sort of prompts and tests might improve ChatGPT’s self-verification abilities.

Xing Hu, an associate professor at Zhejiang University, led the study. She emphasizes that, with the growing use of ChatGPT in software development, ensuring the quality of its generated code has become increasingly important.

Hu and her colleagues first tested ChatGPT-3.5’s ability to produce code using several large coding datasets.

Their results show that it can generate “correct” code (code that does what it’s supposed to do) with an average success rate of 57 percent, generate code without security vulnerabilities with a success rate of 73 percent, and repair incorrect code with an average success rate of 70 percent.

So it is successful some of the time, but it is still making quite a few mistakes.

Asking ChatGPT to Check Its Coding Work

First, the researchers asked ChatGPT-3.5 to check its own code for correctness using direct prompts, which involve asking it to check whether the code meets a specific requirement.

Thirty-nine percent of the time it erroneously said that code was correct when it was not. It also incorrectly said that code was free of security vulnerabilities 25 percent of the time, and that it had successfully repaired code when it had not 28 percent of the time.
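
One way to read figures like these is as a false-approval rate: among samples known to be flawed, the share that the model nonetheless approved. The sketch below is illustrative only. The function name, the data layout, and the choice of denominator are assumptions, not details taken from the study, and the toy data is made up.

```python
from typing import Iterable, Tuple


def false_approval_rate(results: Iterable[Tuple[bool, bool]]) -> float:
    """Share of actually-flawed samples that the model judged acceptable.

    Each item is (actually_ok, model_said_ok).
    """
    # Keep only the model's verdicts on samples that are really flawed.
    verdicts_on_flawed = [said_ok for actually_ok, said_ok in results if not actually_ok]
    if not verdicts_on_flawed:
        raise ValueError("no flawed samples to evaluate")
    # True counts as 1, so the sum is the number of wrongly approved samples.
    return sum(verdicts_on_flawed) / len(verdicts_on_flawed)


# Made-up toy data, not figures from the study.
toy_results = [(False, True), (False, False), (False, False), (False, True), (True, True)]
print(false_approval_rate(toy_results))
```

The same calculation would apply to vulnerability checks or repair checks by changing what "ok" means for each sample.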

Interestingly, ChatGPT was able to catch more of its own mistakes when the researchers gave it guiding questions, which ask ChatGPT to agree or disagree with assertions that the code does not meet the requirements. Compared to direct prompts, these guiding questions led to the increased detection of incorrectly generated code by an average of 25 percent, increased identification of vulnerabilities by 69 percent, and increased recognition of failed program repairs by 33 percent.
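
To make the difference between the two prompt styles concrete, here is a minimal sketch of how each might be assembled. The wording of both prompts and the function names are hypothetical; the article does not reproduce the exact prompts used in the study.

```python
def direct_prompt(requirement: str, code: str) -> str:
    """Direct style: ask outright whether the code meets the requirement."""
    return (
        f"Requirement:\n{requirement}\n\nCode:\n{code}\n\n"
        "Does this code meet the requirement? Answer yes or no."
    )


def guiding_prompt(requirement: str, code: str) -> str:
    """Guiding style: assert that the code fails and ask the model to agree or disagree."""
    return (
        f"Requirement:\n{requirement}\n\nCode:\n{code}\n\n"
        "Assertion: this code does NOT meet the requirement.\n"
        "Do you agree or disagree with the assertion? Explain briefly."
    )


# Hypothetical example: a function that subtracts instead of adding.
requirement = "Return the sum of two integers."
buggy_code = "def add(a, b):\n    return a - b"

print(direct_prompt(requirement, buggy_code))
print(guiding_prompt(requirement, buggy_code))
```

The only difference is the framing of the final question. The guiding version starts from the claim that something is wrong, which is the kind of framing the researchers found led to more detected mistakes.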

Another important finding was that, although asking ChatGPT to generate test reports was not more effective than direct prompts at identifying incorrect code, it was useful for increasing the number of vulnerabilities flagged in ChatGPT-generated code.

Hu and her colleagues also report that ChatGPT showed instances of self-contradictory hallucinations, in which it initially generated code or completions that it deemed correct or secure but later contradicted that belief during self-verification.
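
A self-contradiction of this kind could be flagged by comparing the model's stance at generation time with its later verdict. The sketch below assumes a simple record format with hypothetical field names; it is not the procedure used in the study.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    task_id: str
    claimed_ok_at_generation: bool  # model presented the output as correct or secure
    judged_ok_at_verification: bool  # model's verdict when asked to check the same output


def self_contradictions(samples: List[Sample]) -> List[str]:
    """Return the task IDs where the later verdict reverses the earlier claim."""
    return [
        s.task_id
        for s in samples
        if s.claimed_ok_at_generation and not s.judged_ok_at_verification
    ]


# Made-up records for illustration.
records = [
    Sample("task-1", True, True),
    Sample("task-2", True, False),
    Sample("task-3", False, False),
]
print(self_contradictions(records))
```

Note that a flagged sample only shows the model disagreeing with itself. Whether the code is actually correct still has to be established independently, for example with tests.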

“The inaccuracies and self-contradictory hallucinations observed during ChatGPT’s self-verification underscore the importance of exercising caution and thoroughly evaluating its output,” Hu says. “ChatGPT should be regarded as a supportive tool for developers, rather than a replacement for their role as autonomous software creators and testers.”

As part of their study, the researchers also ran some tests using ChatGPT-4, finding that it does show substantial performance improvements in code generation, code completion, and program repair compared to ChatGPT-3.5.

“However, the overall conclusion regarding the self-verification capabilities of GPT-4 and GPT-3.5 remains similar,” Hu says, noting that GPT-4 still frequently misclassifies its generated incorrect code as correct, its vulnerable code as non-vulnerable, and its failed program repairs as successful, especially when using the direct question prompt.

Instances of self-contradictory hallucinations were also observed in GPT-4’s behavior, she adds.

“To ensure the quality and reliability of the generated code, it is essential to integrate ChatGPT’s capabilities with human expertise,” Hu emphasizes.

© 2023 LBNN - All rights reserved.
