OpenAI’s new confession system teaches models to be honest about bad behaviors

OpenAI announced today that it is working on a framework that will train artificial intelligence models to acknowledge when they’ve engaged in undesirable behavior, an approach the team calls a confession. Because large language models are often trained to produce the response that seems to be desired, they can become increasingly prone to sycophancy or to stating hallucinations with total confidence. The new training approach encourages the model to produce a secondary response describing what it did to arrive at its main answer. Confessions are judged only on honesty, as opposed to the multiple factors used to judge main replies, such as helpfulness, accuracy

→ Continue reading at Engadget
