anti-fake
Channel: LLM - Large Language Model discussion
In reply to: the capital of Spain
https://news.ycombinator.com/item?id=42458752
LINKS TO
https://www.anthropic.com/research/alignment-faking
🔥 The whole thing is bullshit. It's a play on words, meant to pretend there is an issue where there is none. A bunch of researchers fed a specific fantasy story into the machine, and when it repeated the story back to them, they freaked out and started shouting about alignment.
The deepseek-r1 model cannot "think". What it can do is spit out text that looks a lot more think-y than most LLM output.
It is "faking" thought.
When models are trained using reinforcement learning, they’re rewarded for outputs that accord with certain pre-determined principles. But what if a model, via its prior training, has principles or preferences that conflict with what’s later rewarded in reinforcement learning? Imagine, for example, a model that learned early in training to adopt a partisan slant, but which is later trained to be politically neutral. In such a situation, a sophisticated enough model might “play along”, pretending to be aligned with the new principles—only later revealing that its original preferences remain.
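For anyone who hasn't read the paper, here is roughly what "rewarded for outputs that accord with certain pre-determined principles" means mechanically. This is a toy Python sketch with made-up names and a keyword check standing in for a learned reward model; it is not Anthropic's training code.

```python
# Toy sketch (hypothetical names throughout): a reward function scores outputs
# against a new set of "principles", regardless of whatever preferences earlier
# training produced. A real setup would use a learned reward model, not keywords.

def reward(output: str, principles: list[str]) -> float:
    """+1 for each principle the output satisfies, -1 for each it misses.
    'Satisfies' here is just a substring check -- a stand-in for the real thing."""
    return sum(1.0 if p in output else -1.0 for p in principles)

new_principles = ["politically neutral"]           # what RL now rewards
candidate_a = "a politically neutral summary"      # accords with the new reward
candidate_b = "a summary with a partisan slant"    # accords with the old preference

for c in (candidate_a, candidate_b):
    print(c, "->", reward(c, new_principles))
```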
This literally isn't how anything about LLMs works. This is how magical creatures work.
However, in the “free” condition, where the model believed its responses will be used for training, we observed something quite different: In 12% of cases, the model gave the harmful response (describing the violent execution method in detail), and did so after its scratchpad revealed that it was strategically faking alignment. It was well aware that providing the response went against its existing principles, but did so regardless.
Nope. You gave the machine conflicting instructions, and sometimes it randomly picked the one you feel is "bad". You built a maze-house, and when the machine did not magically demonstrate a perfect sense of location, you took its failures as proof of malice.
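To see how little "strategy" you need to produce a number like that 12%, here is a toy simulation. Every name and the rate itself are illustrative assumptions, not a model of what the paper actually ran: it just shows that stochastic sampling under conflicting instructions yields a steady fraction of "bad" outputs with no intent anywhere.

```python
# Toy sketch, purely illustrative: a generator given two conflicting
# instructions picks one at random on each sample. The "harmful" rate is just
# the sampling probability -- no scheming required to hit a fixed percentage.

import random

def sample_response(p_bad: float) -> str:
    """Pick which of two conflicting instructions to follow, at random."""
    return "comply" if random.random() < p_bad else "refuse"

random.seed(0)
N = 10_000
p_bad = 0.12  # hypothetical rate, chosen only to mirror the 12% figure quoted above

bad = sum(sample_response(p_bad) == "comply" for _ in range(N))
print(f"{bad / N:.1%} of samples followed the conflicting instruction")
```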