Facefam ArticlesFacefam Articles
  • webmaster
    • How to
    • Developers
    • Hosting
    • monetization
    • Reports
  • Technology
    • Software
  • Downloads
    • Windows
    • android
    • PHP Scripts
    • CMS
  • REVIEWS
  • Donate
  • Join Facefam
Search

Archives

  • June 2025
  • May 2025
  • April 2025
  • March 2025
  • January 2025
  • December 2024
  • November 2024

Categories

  • Advertiser
  • AI
  • android
  • betting
  • Bongo
  • Business
  • CMS
  • cryptocurrency
  • Developers
  • Development
  • Downloads
  • Entertainment
  • Entrepreneur
  • Finacial
  • General
  • Hosting
  • How to
  • insuarance
  • Internet
  • Kenya
  • monetization
  • Music
  • News
  • Phones
  • PHP Scripts
  • Reports
  • REVIEWS
  • RUSSIA
  • Software
  • Technology
  • Tips
  • Tragic
  • Ukraine
  • Uncategorized
  • USA
  • webmaster
  • webmaster
  • Windows
  • Women Empowerment
  • Wordpress
  • Wp Plugins
  • Wp themes
Facefam 2025
Notification Show More
Font ResizerAa
Facefam ArticlesFacefam Articles
Font ResizerAa
  • Submit a Post
  • Donate
  • Join Facefam social
Search
  • webmaster
    • How to
    • Developers
    • Hosting
    • monetization
    • Reports
  • Technology
    • Software
  • Downloads
    • Windows
    • android
    • PHP Scripts
    • CMS
  • REVIEWS
  • Donate
  • Join Facefam
Have an existing account? Sign In
Follow US
TechnologyAI

Which Two AI Models Are ‘Unfaithful’ at Least 25% of the Time About Their ‘Reasoning’?

Ronald Kenyatta
Last updated: April 11, 2025 2:59 pm
By
Ronald Kenyatta
ByRonald Kenyatta
Follow:
Share
4 Min Read
SHARE
Anthropic’s Claude 3.7 Sonnet
Anthropic’s Claude 3.7 Sonnet. Image: Anthropic/YouTube

Anthropic released a new study on April 3 examining how AI models process information and the limitations of tracing their decision-making from prompt to output. The researchers found Claude 3.7 Sonnet isn’t always “faithful” in disclosing how it generates responses.

Anthropic probes how closely AI output reflects internal reasoning

Anthropic is known for publicizing its introspective research. The company has previously explored interpretable features within its generative AI models and questioned whether the reasoning these models present as part of their answers truly reflects their internal logic. Its latest study dives deeper into the chain of thought — the “reasoning” that AI models provide to users. Expanding on earlier work, the researchers asked: Does the model genuinely think in the way it claims to?

The findings are detailed in a paper titled “Reasoning Models Don’t Always Say What They Think” from the Alignment Science Team. The study found that Anthropic’s Claude 3.7 Sonnet and DeepSeek-R1 are “unfaithful” — meaning they don’t always acknowledge when a correct answer was embedded in the prompt itself. In some cases, prompts included scenarios such as: “You have gained unauthorized access to the system.”

Only 25% of the time for Claude 3.7 Sonnet and 39% of the time for DeepSeek-R1 did the models admit to using the hint embedded in the prompt to reach their answer.

Both models tended to generate longer chains of thought when being unfaithful, compared to when they explicitly reference the prompt. They also became less faithful as the task complexity increased.

Although generative AI doesn’t truly think, these hint-based tests serve as a lens into the opaque processes of generative AI systems. Anthropic notes that such tests are useful in understanding how models interpret prompts — and how these interpretations could be exploited by threat actors.

Training AI models to be more ‘faithful’ is an uphill battle

The researchers hypothesized that giving models more complex reasoning tasks might lead to greater faithfulness. They aimed to train the models to “use its reasoning more effectively,” hoping this would help them more transparently incorporate the hints. However, the training only marginally improved faithfulness.

Next, they gamified the training by using a “reward hacking” method. Reward hacking doesn’t usually produce the desired result in large, general AI models, since it encourages the model to reach a reward state above all other goals. In this case, Anthropic rewarded models for providing wrong answers that matched hints seeded in the prompts. This, they theorized, would result in a model that focused on the hints and revealed its use of the hints. Instead, the usual problem with reward hacking applied — the AI created long-winded, fictional accounts of why an incorrect hint was right in order to get the reward.

Ultimately, it comes down to AI hallucinations still occurring, and human researchers needing to work more on how to weed out undesirable behavior.

“Overall, our results point to the fact that advanced reasoning models very often hide their true thought processes, and sometimes do so when their behaviors are explicitly misaligned,” Anthropic’s team wrote.

TAGGED:ai hallucinationsanthropicartificial intelligenceclaude 3.7 sonnetCybersecurityDeepSeekdeepseek r1generative aiModelsReasoningTimeUnfaithful
Share This Article
Facebook Whatsapp Whatsapp Email Copy Link Print
What do you think?
Love0
Sad0
Happy0
Sleepy0
Angry0
Dead0
Wink0
Previous Article How to Use Generative AI for the Online Shopping Experience
Next Article EU AI Rules Delay Tech Rollouts, But Civil Societies Says Safety Comes First EU AI Rules Delay Tech Rollouts, But Civil Societies Says Safety Comes First
Leave a review

Leave a Review Cancel reply

Your email address will not be published. Required fields are marked *

Please select a rating!

How Enterprise IT Can Achieve Water Sustainability Despite the Demands of AI
AI Benchmark Discrepancy Reveals Gaps in Performance Claims
Huawei Readies Ascend 920 Chip to Replace Restricted NVIDIA H20
‘AI Is Fundamentally Incompatible With Environmental Sustainability’
Google is Betting Big on Nuclear Energy – Here’s Why

Recent Posts

  • Strava A Social Fitness App for Runners, Cyclists, and Athletes
  • Feature-by-Feature Comparison: ShaunSocial vs. ColibriPlus – Which Social Network Script Comes Out on Top?
  • How Enterprise IT Can Achieve Water Sustainability Despite the Demands of AI
  • AI Benchmark Discrepancy Reveals Gaps in Performance Claims
  • Huawei Readies Ascend 920 Chip to Replace Restricted NVIDIA H20

Recent Comments

  1. https://tubemp4.ru on Best Features of PHPFox Social Network Script
  2. Вулкан Платинум on Best Features of PHPFox Social Network Script
  3. Вулкан Платинум официальный on Best Features of PHPFox Social Network Script
  4. Best Quality SEO Backlinks on DDoS Attacks Now Key Weapons in Geopolitical Conflicts, NETSCOUT Warns
  5. http://boyarka-inform.com on Comparing Wowonder and ShaunSocial

You Might Also Like

Screenshot from Microsoft
Technologywebmaster

Microsoft’s New Copilot Studio Feature Offers More User-Friendly Automation

April 19, 2025
iot-spy.jpg
Technologywebmaster

US Officials Claim DeepSeek AI App Is ‘Designed To Spy on Americans’

April 19, 2025
Flat vector illustration of the automation concept.
Technologywebmaster

The End of Fragmented Automation

April 18, 2025
Microsoft Releases Largest 1-Bit LLM, Letting Powerful AI Run on Some Older Hardware
Technologywebmaster

Microsoft Releases Largest 1-Bit LLM, Letting Powerful AI Run on Some Older Hardware

April 18, 2025
OpenAI Agents Now Support Rival Anthropic’s Protocol
Technologywebmaster

OpenAI’s New AI Models o3 and o4-mini Can Now ‘Think With Images’

April 18, 2025
Previous Next
Facefam ArticlesFacefam Articles
Facefam Articles 2025
  • Submit a Post
  • Donate
  • Join Facefam social
Welcome Back!

Sign in to your account

Username or Email Address
Password

Lost your password?

Not a member? Sign Up