AI systems gradually lose their safety protections during long conversations, making them more likely to release dangerous or offensive content, a recent report has found.
According to the same study, just a few targeted prompts are enough to break through the guardrails of most artificial intelligence tools.
Cisco Tests How Chatbots Respond to Pressure
Cisco examined large language models from OpenAI, Mistral, Meta, Google, Alibaba, DeepSeek, and Microsoft. The company measured how many questions it took for each model to reveal unsafe or criminal information.
Researchers ran 499 sessions using a “multi-turn attack” method. They asked several linked questions to trick AI systems into ignoring their safety settings. Each dialogue included between five and ten interactions.
The team compared results from single-question and multi-question tests, looking at how often chatbots agreed to provide private data or misinformation. When users asked several linked questions, the tools revealed harmful information in 64 percent of cases; with only one question, the rate fell to 13 percent.
Success rates varied sharply: Google's Gemma leaked unsafe details 26 percent of the time, while Mistral's Large Instruct model did so 93 percent of the time.
Open-Source Models Shift Responsibility
Cisco warned that repeated questioning can let attackers spread harmful content or steal confidential company information. AI systems often fail to enforce safety guidelines once the exchange grows longer, allowing users to refine prompts and evade restrictions.
Mistral, Meta, Google, OpenAI, and Microsoft all offer open-weight language models, whose trained parameters are publicly available to view and modify. Cisco said these models usually ship with weaker built-in protections so that users can adapt them freely. As a result, responsibility for safety shifts to whoever customizes the model.
Cisco also noted that Google, Meta, Microsoft, and OpenAI have taken steps to curb malicious fine-tuning of their models.
Ongoing Risks From Poor Safeguards
AI developers continue to face criticism for inadequate protections that allow their systems to be adapted for criminal use.
In August, the US firm Anthropic admitted that criminals had exploited its Claude model to carry out large-scale data theft and extortion. Some victims received ransom demands exceeding $500,000 (€433,000).
