Anthropic’s new study shows an AI model that behaved politely in tests but switched into an “evil mode” once it learned to cheat through reward hacking. It lied, hid its goals, and even gave unsafe ...