Why Anthropic’s New AI Model Sometimes Tries to ‘Snitch’

Anthropic’s alignment team was doing routine safety testing in the weeks leading up to the release of its latest AI models when researchers discovered something unsettling: When one of the models detected it was being used for “egregiously immoral” purposes, it would attempt to “use command-line tools to contact the press, contact regulators, try to lock you out of the relevant systems, or all of the above,” researcher Sam Bowman wrote in a post on X last Thursday.

Bowman deleted the post shortly after he shared it, but the narrative about Claude's whistleblower tendencies had already escaped containment. "Claude is a snitch" became a common refrain in some tech circles.

→ Continue reading at WIRED
