anti-fake
Channel: LLM - Large Language Model discussion
In reply to: the capital of Spain
https://news.ycombinator.com/item?id=42458752
LINKS TO
https://www.anthropic.com/research/alignment-faking
🔥 The whole thing is bullshit. It's a play on words, meant to pretend there is an issue where there is none. A bunch of researchers fed a specific fantasy story into the machine, and when it repeated the story back to them, they freaked out and started shouting about alignment.
The deepseek-r1 model cannot "think". What it can do is spit out text that looks a lot more think-y than most LLM output.
It is "faking" thought.
When models are trained using reinforcement learning, they’re rewarded for outputs that accord with certain pre-determined principles. But what if a model, via its prior training, has principles or preferences that conflict with what’s later rewarded in reinforcement learning? Imagine, for example, a model that learned early in training to adopt a partisan slant, but which is later trained to be politically neutral. In such a situation, a sophisticated enough model might “play along”, pretending to be aligned with the new principles—only later revealing that its original preferences remain.
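For anyone who hasn't read the paper, here is roughly what "rewarded for outputs that accord with certain pre-determined principles" means mechanically. This is a toy Python sketch with made-up names and a keyword check standing in for a learned reward model; it is not Anthropic's training code.

```python
# Toy sketch (hypothetical names throughout): a reward function scores outputs
# against a new set of "principles", regardless of whatever preferences earlier
# training produced. A real setup would use a learned reward model, not keywords.

def reward(output: str, principles: list[str]) -> float:
    """+1 for each principle the output satisfies, -1 for each it misses.
    'Satisfies' here is just a substring check -- a stand-in for the real thing."""
    return sum(1.0 if p in output else -1.0 for p in principles)

new_principles = ["politically neutral"]           # what RL now rewards
candidate_a = "a politically neutral summary"      # accords with the new reward
candidate_b = "a summary with a partisan slant"    # accords with the old preference

for c in (candidate_a, candidate_b):
    print(c, "->", reward(c, new_principles))
```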
This literally isn't how anything about LLMs works. This is how magical creatures work.
However, in the “free” condition, where the model believed its responses will be used for training, we observed something quite different: In 12% of cases, the model gave the harmful response (describing the violent execution method in detail), and did so after its scratchpad revealed that it was strategically faking alignment. It was well aware that providing the response went against its existing principles, but did so regardless.
Nope. You gave the machine conflicting instructions, and sometimes it randomly picked the one you feel is "bad". You built a maze-house, and when the machine did not magically demonstrate a perfect sense of location, you took its failures as proof of malice.
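To see how little "strategy" you need to produce a number like that 12%, here is a toy simulation. Every name and the rate itself are illustrative assumptions, not a model of what the paper actually ran: it just shows that stochastic sampling under conflicting instructions yields a steady fraction of "bad" outputs with no intent anywhere.

```python
# Toy sketch, purely illustrative: a generator given two conflicting
# instructions picks one at random on each sample. The "harmful" rate is just
# the sampling probability -- no scheming required to hit a fixed percentage.

import random

def sample_response(p_bad: float) -> str:
    """Pick which of two conflicting instructions to follow, at random."""
    return "comply" if random.random() < p_bad else "refuse"

random.seed(0)
N = 10_000
p_bad = 0.12  # hypothetical rate, chosen only to mirror the 12% figure quoted above

bad = sum(sample_response(p_bad) == "comply" for _ in range(N))
print(f"{bad / N:.1%} of samples followed the conflicting instruction")
```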