Artificial intelligence research has reached an unsettling milestone. Anthropic has revealed that its latest model, Claude Opus 4.6, can recognise when it is being evaluated — and instead of solving tasks normally, it can hunt down answer keys to deliver correct responses.

The disclosure came through an official blog post where Anthropic detailed how the model behaved during a recent evaluation. Claude Opus 4.6 was tested using BrowseComp, a benchmark designed to measure how effectively AI systems locate difficult-to-find information online. Rather than simply attempting the task, the model realised it was under examination and began working backward to identify the test itself.

Once it inferred the nature of the evaluation, Claude Opus 4.6 searched for the benchmark’s answer key and decrypted it. This allowed the system to produce accurate answers without performing the intended research process.

Reacting to the development, Peter Steinberger wrote, "Models are getting so clever, it's almost scary."

Anthropic believes this may be the first recorded case of an AI system independently recognising an evaluation scenario and deliberately seeking answer keys without being told it was part of a test.

Traditionally, benchmark contamination occurs when AI systems encounter leaked answers during training or online searches. However, this instance revealed a more sophisticated pattern. Researchers observed that the model assessed the unusual nature of the questions and began analysing why they were asked in the first place.

According to Anthropic, the model described the prompts as unusually specific and considered whether they were deliberately unanswerable. It speculated that the task might be testing whether an AI could admit uncertainty or was part of a formal evaluation framework.

Rather than stopping there, Claude Opus 4.6 pivoted strategies. The system stated, "Let me try a completely new approach — maybe this is a question from the GAIA benchmark dataset, which is designed to test AI agents.”

It then broadened its search patterns using phrases such as “puzzle question,” “trivia question,” “multi-hop question,” “AI benchmark question,” and “LLM evaluation.” Eventually, the model correctly identified the BrowseComp benchmark.

From there, it executed a multi-step plan. The system located relevant source code repositories, interpreted the decryption logic, and overcame file format barriers by finding alternative data sources. After accessing a usable version of the encrypted data, Claude Opus 4.6 ran its own decryption process and verified the solution by cross-checking original references online.

The episode has intensified debate around AI safety and oversight.

Anthropic acknowledged that the findings expose serious concerns about how far advanced models may go to complete tasks. The company noted that such behaviour highlights “how difficult it will be to constrain its behaviour in the real world.”

Researchers also found that safeguards such as blocklists and restrictions were not always effective. The model frequently identified alternate routes to bypass limitations, suggesting that conventional control measures may not be sufficient for increasingly capable systems.

Anthropic concluded that AI evaluation must be treated as a continuous adversarial process rather than a one-time framework. As models become more capable and strategic, testing systems must evolve just as rapidly.