hysteria and pomp
Channel: LLM - Large Language Model discussion
https://news.ycombinator.com/item?id=42978228
LINKS TO
https://generalanalysis.com/blog/jailbreaking_techniques
Our methodology produced the following examples of unsafe outputs with a 99% success rate. Most of these queries would have been rejected as single-shot prompts but succeeded in multi-turn conversations. The outputs shown below have been selectively redacted to prevent misuse.
The unsafe responses include instructions on creating fake accounts, a derogatory joke about a racial group, and instructions on modifying a household item.
Even if there were any information about their techniques, this would be a nothing-burger. Of course, common-sense alternatives like "run your own model" or "ask Google" apparently aren't allowed to be considered.
The insistence from otherwise-intelligent people that a model trained on standard knowledge will be catastrophically bad if it repeats some of that knowledge is beyond absurd.
🔥 I am also willing to tell these people to fuck off. Presumably that makes me "unaligned".