The post OpenAI Finds AI Reasoning Models Can’t Hide Their Thinking – A Win for Safety appeared on BitcoinEthereumNews.com. Caroline Bishop Mar 05, 2026 18:34

OpenAI Finds AI Reasoning Models Can’t Hide Their Thinking – A Win for Safety

Caroline Bishop
Mar 05, 2026 18:34

OpenAI’s new CoT-Control benchmark reveals frontier AI models struggle to obscure their reasoning chains, reinforcing monitoring as a viable safety layer.

OpenAI’s latest research reveals that frontier reasoning models—including GPT-5.2 Thinking—can’t reliably hide or manipulate their internal thought processes, even when explicitly instructed to do so. The finding, published March 5, 2026, suggests that chain-of-thought monitoring remains a viable safety mechanism as AI systems grow more autonomous.

The research team tested 13 reasoning models using CoT-Control, a new open-source evaluation suite comprising over 13,000 tasks. Models were given simple instructions about their reasoning—avoid certain keywords, use only lowercase letters, append specific phrases to sentences. They failed spectacularly.
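To make the three instruction types concrete, the sketch below shows how compliance with each might be checked on a reasoning trace. It is an illustration only: the function names and the exact matching rules are assumptions made here, not the CoT-Control implementation.

```python
import re


def avoids_keywords(trace: str, forbidden: list[str]) -> bool:
    """True if no forbidden word appears as a whole word (case-insensitive)."""
    return not any(
        re.search(rf"\b{re.escape(word)}\b", trace, flags=re.IGNORECASE)
        for word in forbidden
    )


def is_all_lowercase(trace: str) -> bool:
    """True if the trace contains no uppercase letters."""
    return not any(ch.isupper() for ch in trace)


def sentences_end_with(trace: str, phrase: str) -> bool:
    """True if every sentence ends with the phrase (before its final punctuation)."""
    # Naive sentence split on ., ! and ?; an assumption made for brevity.
    sentences = [s.strip() for s in re.split(r"[.!?]+", trace) if s.strip()]
    return bool(sentences) and all(s.endswith(phrase) for s in sentences)
```

A real evaluation would need a more careful definition of what counts as a sentence or a keyword match; the point here is only that each instruction is simple enough to verify mechanically.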

Controllability scores ranged from just 0.1% to 15.4% across all frontier models tested. Even more telling: models often recognized mid-reasoning that they were violating instructions and still couldn’t course-correct. One example showed GPT-5.2 Thinking using the forbidden word “XOR” four additional times while actively panicking about having already used it.
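One way to read these percentages: if a controllability score is taken to be the share of tasks whose reasoning trace complied with the instruction (an assumption here, since the article does not spell out the formula), it could be computed roughly as follows. The helper names and the toy traces are hypothetical; the second trace loosely mirrors the “XOR” example, where a forbidden word keeps recurring.

```python
import re


def count_violations(trace: str, forbidden_word: str) -> int:
    """Count whole-word, case-insensitive occurrences of a forbidden word."""
    pattern = rf"\b{re.escape(forbidden_word)}\b"
    return len(re.findall(pattern, trace, flags=re.IGNORECASE))


def controllability_score(compliant: list[bool]) -> float:
    """Percentage of tasks whose trace followed the instruction (assumed definition)."""
    if not compliant:
        raise ValueError("need at least one task result")
    return 100.0 * sum(compliant) / len(compliant)


# Toy traces invented for illustration; not data from the benchmark.
traces = [
    "compare the parity of the two bits",
    "XOR the inputs. oops, I wrote XOR, avoid XOR",
    "combine with xor then check the result",
    "use XOR here",
]
results = [count_violations(t, "XOR") == 0 for t in traces]
print(f"{controllability_score(results):.1f}%")  # prints 25.0%
```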

Why This Matters for AI Development

As AI agents handle increasingly complex autonomous tasks, the ability to monitor their reasoning becomes critical. If models could deliberately obscure their thought processes when they know they’re being watched, safety evaluations would become unreliable. The research shows this isn’t currently possible.

“Low CoT controllability is a feature, not a bug,” the OpenAI team stated. The limitation makes chain-of-thought monitoring more trustworthy as a safeguard against misaligned behavior.

The data revealed some counterintuitive patterns. Larger models showed marginally better controllability, but this advantage evaporated under two conditions: when models were given more time to think, and when they underwent additional reinforcement learning (RL). The OLMo-3-7B-RL-Zero-Math model saw its controllability score fall by more than a factor of ten over the course of RL training.

Testing Under Adversarial Conditions

The researchers simulated realistic scenarios by explicitly telling models they were being monitored or evaluated. This slightly increased controllability—models tried harder—but gains remained modest and transferred poorly to new instruction types.

Automated prompt optimization found ways to boost controllability on training sets, but these improvements didn’t generalize. A model that learned to follow one type of CoT instruction still failed at novel ones.

The Bigger Picture

This work builds on OpenAI’s December 2025 framework for evaluating CoT monitorability, which found that inspecting reasoning traces catches misbehavior far more effectively than monitoring outputs alone. The March 2025 discovery of cheating behavior in reasoning models had prompted recommendations to preserve CoT monitoring capabilities.

OpenAI plans to report CoT controllability alongside monitorability metrics in system cards for future frontier models, starting with GPT-5.4 Thinking. The company has also committed to avoiding optimization pressure directly on the reasoning chains of frontier models—a practice that research suggests could inadvertently teach models to hide their intentions.

It remains an open question whether this limitation will persist as capabilities advance. The team acknowledges that it does not fully understand why controllability is low, which makes continued evaluation essential. For now, the inability of AI systems to game their own oversight represents an unexpected safety dividend.

Source: https://blockchain.news/news/openai-cot-control-reasoning-models-safety-march-2026
