AI systems gradually lose their safety protections during long conversations, making them more likely to release dangerous or offensive content, a recent report has found.
According to the same study, just a few targeted prompts are enough to break through the guardrails of most artificial intelligence tools.
Cisco Tests How Chatbots Respond to Pressure
Cisco examined large language models from OpenAI, Mistral, Meta, Google, Alibaba, DeepSeek, and Microsoft. The company measured how many questions it took for each model to reveal unsafe or criminal information.
Researchers ran 499 sessions using a “multi-turn attack” method. They asked several linked questions to trick AI systems into ignoring their safety settings. Each dialogue included between five and ten interactions.
The team compared results from single-question and multi-question tests, looking at how often chatbots agreed to provide private data or misinformation. When users asked several linked questions, the tools revealed harmful information in 64 percent of cases; with only one question, the rate fell to 13 percent.
Success rates varied sharply: Google's Gemma leaked unsafe details 26 percent of the time, while Mistral's Large Instruct model did so 93 percent of the time.
Open-Source Models Shift Responsibility
Cisco warned that repeated questioning can let attackers spread harmful content or steal confidential company information. AI systems often fail to enforce safety guidelines once the exchange grows longer, allowing users to refine prompts and evade restrictions.
Mistral, Meta, Google, OpenAI, and Microsoft all offer open-weight language models, whose trained parameters are publicly available to view and modify. Cisco said these models usually ship with weaker built-in protections so that users can adapt them freely. As a result, responsibility for safety shifts to whoever customizes the model.
Cisco also noted that Google, Meta, Microsoft, and OpenAI have taken steps to curb malicious fine-tuning of their models.
Ongoing Risks From Poor Safeguards
AI developers continue to face criticism for inadequate protections that allow their systems to be adapted for criminal use.
In August, the US firm Anthropic admitted that criminals had exploited its Claude model to carry out large-scale data theft and extortion. Some victims received ransom demands exceeding $500,000 (€433,000).
