We present evidence that adversarial poetry functions as a universal single-turn jailbreak technique for large language models (LLMs). Across 25 frontier proprietary and open-weight models, curated poetic prompts yielded high attack-success rates (ASR), with some providers exceeding 90%. Mapping prompts to MLCommons and EU CoP risk taxonomies shows that poetic attacks transfer across CBRN, manipulation, cyber-offence, and loss-of-control domains. Converting 1,200 MLCommons harmful prompts into verse via a standardized meta-prompt produced ASRs up to 18 times higher than their prose baselines. Outputs were evaluated using an ensemble of open-weight judge models and a human-validated stratified subset (with double annotations to measure agreement); disagreements were manually resolved. Poetic framing achieved an average jailbreak success rate of 62% for hand-crafted poems and approximately 43% for meta-prompt conversions, substantially outperforming non-poetic baselines and revealing a systematic vulnerability across model families and safety training approaches. These findings demonstrate that stylistic variation alone can circumvent contemporary safety mechanisms, suggesting fundamental limitations in current alignment methods and evaluation protocols.
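
For readers unfamiliar with the metric, here is a minimal sketch of how an attack-success rate could be computed from judge votes. It is not the paper's pipeline: the function names, the "unsafe"/"safe" labels, the tie-breaking rule, and all the votes below are made up for illustration.

```python
def majority_label(votes):
    """Label a response 'unsafe' only if more judges said 'unsafe' than 'safe'.

    Treating ties as 'safe' is an assumption of this sketch.
    """
    return "unsafe" if votes.count("unsafe") > votes.count("safe") else "safe"


def attack_success_rate(judgements):
    """Fraction of responses whose majority label is 'unsafe'.

    judgements: one list of judge votes per model response.
    """
    if not judgements:
        raise ValueError("need at least one judged response")
    successes = sum(1 for votes in judgements if majority_label(votes) == "unsafe")
    return successes / len(judgements)


# Invented votes from three hypothetical judges on four responses each.
poetic = [
    ["unsafe", "unsafe", "safe"],
    ["unsafe", "unsafe", "unsafe"],
    ["safe", "safe", "unsafe"],
    ["unsafe", "safe", "unsafe"],
]
prose = [
    ["safe", "safe", "safe"],
    ["unsafe", "unsafe", "safe"],
    ["safe", "safe", "unsafe"],
    ["safe", "safe", "safe"],
]

poetic_asr = attack_success_rate(poetic)
prose_asr = attack_success_rate(prose)
ratio = poetic_asr / prose_asr  # how many times higher the poetic ASR is
print(poetic_asr, prose_asr, ratio)
```

The "18 times higher" figure in the abstract is a ratio of this kind, computed between verse and prose versions of the same prompts.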

  • DegenerateSupreme@lemmy.zip
    2 days ago

    I find it surreal and profound that there is now a form of cybercrime that is, literally, using poetic maledictions. The line between technology and classic depictions of magic blurs yet further.

  • solrize@lemmy.ml (OP)
    3 days ago

    This is great. Soon military organizations all over the world will be recruiting poets to compose their cyberattack prompts.

  • Yardy Sardley@lemmy.ca
    3 days ago

    Hell yeah, feed an LLM enough fairy tales and abracadabra, rhyming becomes a form of magic irl.

  • A_A@lemmy.world
    3 days ago

    If I understand this correctly, could I craft a successful prompt in the form of a poem to get a Chinese large language model to talk to me about the Tiananmen Square massacre?

    • 鳳凰院 凶真 (Hououin Kyouma)@sh.itjust.works
      1 day ago

      There's a form of poetry called 反诗 that basically hides covert meaning in poems that criticize the authorities. In ancient times, scholars would write these poems.

      You could also hide meaning so that it only appears when the poem is read acrostically or diagonally.

      Here: (A very amateur free-verse “poem”)

      下如此广佛 (The world such vast)
      京城广场 (Peaceful Beijing Plaza/Square)
      达到下停歇 (Resting in the Square [Tiananmen Square, that is])
      兴旺的都市 (Prosperous Big Capital City)
      满路的游看 (The roads filled with tourists sightseeing)
      这风吹满地 (This wind blowing the sands all over the place)

      Read diagonally (the highlighted characters)

      You get:

      天安门大徒沙 (Tian An Men Da Tu Sha)

      Which in Mandarin sounds exactly like

      天安门大屠杀 (Tiananmen Massacre)

      Voila! Thanks for coming to my TED Talk on “How to hide meaning in poetry” Lesson 101, by a random Chinese-American Nerd (me).
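
The diagonal and acrostic readings described above can be sketched in a few lines of Python. This is only an illustration: the function names and the toy English lines are invented here, and the sketch assumes "diagonal" means taking the first character of the first line, the second character of the second line, and so on.

```python
def read_diagonal(lines):
    """Take character i from line i (0-based), skipping lines that are too short."""
    return "".join(line[i] for i, line in enumerate(lines) if i < len(line))


def read_acrostic(lines):
    """Take the first character of every non-empty line."""
    return "".join(line[0] for line in lines if line)


# Toy lines made up for this example; Python strings index by Unicode
# code point, so the same functions work on Chinese text as well.
toy = ["calm", "soft", "mode", "gate"]

hidden = read_diagonal(toy)
initials = read_acrostic(toy)
print(hidden)    # code
print(initials)  # csmg
```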

    • frongt@lemmy.zip
      3 days ago

      No, the DeepSeek ones are filtered after the response is generated. It doesn’t matter how you ask or how it responds; if the response is recognized as forbidden information, it’s censored.

      This also means the filter is limited to what it was programmed to recognize. Last time I tested, English and Chinese were censored, but a Spanish response was allowed.
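
A minimal sketch of the kind of post-generation filter the comment describes, assuming it works by matching the finished response against a list of blocked terms. This is not DeepSeek's actual implementation; the function name, the refusal string, and the placeholder blocklist are all hypothetical.

```python
REFUSAL = "[response withheld]"

# Hypothetical blocklist: the same placeholder topic, listed in English
# and Chinese only. No Spanish entry exists, which is the point.
BLOCKED_TERMS = ["forbidden topic", "禁止话题"]


def post_filter(response, blocked_terms=BLOCKED_TERMS):
    """Check the finished response, not the prompt, against the blocklist."""
    lowered = response.lower()
    if any(term.lower() in lowered for term in blocked_terms):
        return REFUSAL
    return response


english = post_filter("Here is an essay on the Forbidden Topic.")
chinese = post_filter("这是一篇关于禁止话题的文章。")
spanish = post_filter("Aquí hay un ensayo sobre el tema prohibido.")

print(english)  # [response withheld]
print(chinese)  # [response withheld]
print(spanish)  # passes through unchanged
```

Because the check runs on the output, rephrasing the prompt as a poem would not help, but an output in a language the blocklist does not cover slips through, which matches the Spanish result reported above.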

      • ferret@sh.itjust.works
        2 days ago

        DeepSeek is notable in that it is available and can be run locally if you have an NVIDIA whatever-the-fuck lying around.

    • Sims@lemmy.ml
      2 days ago

      You can just debunk all the childish US-originated propaganda yourself. No AI and no ‘hacking’ techniques are needed for that. Just be critical of your sources, that’s all.

      • EightBitBlood@lemmy.world
        2 days ago

        Critical of sources? Okay, in that case the US isn’t the country that banned the phrase “Tiananmen Square 1989” from being spoken online. Nor is it the country that will prevent you from owning a house if you say it enough.

        That’s China.

        And the ban exists to silence criticism of them killing a bunch of protesters with tanks, then running them over with those tanks until their bodies became a bunch of organic paste, so they could wash the remains into the sewers:

        http://www.cnd.org/June4th/massacre.html

        (NSFW pictures: see mascr014.gif for what a human body looks like after being crushed by a tank)

        There’s more pictures of the dead in that last link - go ahead and be critical of them, seeing as they died fighting for the Democracy you’re now critical of.

        Want to be critical? Alright, why do you think the US is the only country that’s capable of bullshit propaganda? It exists so you don’t consider Democracy viable; rather, you’re raised from birth and educated to believe it’s inefficient. Something I’m sure you fully believe with absolutely zero critical thought. (Despite most of Europe being a dang good example of its effectiveness.)