AI Agents Are Failing 63% of the Time
The fix is easy. Let's move the needle on this metric right now.

Hello everyone and welcome to my newsletter where I discuss real-world skills needed for the top data jobs. 👏
This week I’m discussing the failure of agents and the easy fix. 👀
Not a subscriber? Join the informed. Over 200K people read my content monthly.
Thank you. 🎉
Silicon Valley is racing to deploy autonomous AI agents, but early real-world tests reveal a harsh reality: even small, single-step hallucination rates can snowball into a 63% failure rate on 100-step tasks. This article from Business Insider unpacks why compound error is the silent killer of agent projects and offers a practical, three-part "trust-but-verify" framework that any team can adopt today. Between you and me, there should never be blind trust. Every agent you build needs someone testing it as often as possible and reporting anything outside the scope of your build.
1 | The Hidden Math of Agent Failure
Even a 1% error rate per action leads to near-certain collapse over long task chains — a phenomenon DeepMind’s Demis Hassabis compares to “compound interest in reverse.” In reality, field tests show agents often misfire closer to 20%, turning ambitious multi-step automations into a game of reliability roulette.
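The "compound interest in reverse" math is easy to verify yourself. A quick sketch (the 1% and 100-step figures come straight from the article):

```python
# Success probability of an n-step chain when every step must succeed
# independently: (1 - per_step_error) ** steps.
def chain_success(per_step_error: float, steps: int) -> float:
    return (1 - per_step_error) ** steps

# A "tiny" 1% per-step error rate over a 100-step task:
p = chain_success(0.01, 100)
print(f"{p:.1%} success, {1 - p:.1%} failure")  # → 36.6% success, 63.4% failure
```

That's where the headline's 63% comes from. At the ~20% misfire rate seen in field tests, a 100-step chain is effectively guaranteed to fail.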
Hassabis is one of the greatest minds in AI, and he's responsible for one of its greatest achievements. Have a read.
The picture below is from the article. Sorry, I think it's a beauty and couldn't resist. Back to agents failing.

A Protein
2 | Why the Hype Train Won’t Stop
Platforms like Salesforce’s Agentforce and AutoGPT v0.6 promise turnkey digital workforces, driving CTOs to launch pilot bots at speed. Forrester predicts that 70% of enterprises will have at least one agent in production by Q4 2025.
3 | The Triple-Loop Safeguard
Long, multi-step automations need a braking system as much as an engine. The most reliable pattern is a three-layer, “trust-but-verify” framework that catches small errors before they snowball out of control.
Atomic Validation
Treat every agent action like a unit test. After each step runs, check its output against a JSON Schema, regular expression, API status code, or even simple type assertions. Anything that fails is immediately rejected, retried, or flagged for review.
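Here's a minimal sketch of atomic validation for a hypothetical invoice-extraction step (the field names and ID format are illustrative assumptions, not from the article):

```python
import re

def validate_step(output: dict) -> bool:
    """Reject any step output that fails a basic shape check."""
    # Type assertions: the fields we expect, with the types we expect.
    if not isinstance(output.get("invoice_id"), str):
        return False
    if not isinstance(output.get("amount"), (int, float)) or output["amount"] < 0:
        return False
    # Regex check: assume invoice IDs look like "INV-" plus six digits.
    return bool(re.fullmatch(r"INV-\d{6}", output["invoice_id"]))

good = {"invoice_id": "INV-004217", "amount": 129.50}
bad = {"invoice_id": "invoice 4217", "amount": 129.50}
print(validate_step(good), validate_step(bad))  # → True False
```

In production you'd likely swap the hand-rolled checks for a JSON Schema validator, but the principle is the same: every step's output passes or the step is retried.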
Cross-Modal Check
Bring in a second “mind” to validate the work. This could be another LLM re-deriving the answer from the same prompt, a rules engine checking business constraints, or a retrieval-augmented verifier comparing output to ground-truth documents. Mismatches automatically trigger a retry or escalation.
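The second "mind" doesn't have to be another LLM. Here's a sketch of the rules-engine flavor, checking a hypothetical order draft against invented business constraints (the field names and policy caps are assumptions for illustration):

```python
def rules_check(answer: dict) -> list[str]:
    """Business-constraint verifier: returns a list of violations (empty = pass)."""
    violations = []
    # Assumed policy: discounts are capped at 30%.
    if answer["discount_pct"] > 30:
        violations.append("discount exceeds 30% policy cap")
    # ISO dates compare correctly as strings.
    if answer["ship_date"] < answer["order_date"]:
        violations.append("ship date precedes order date")
    return violations

draft = {"discount_pct": 45, "order_date": "2025-06-01", "ship_date": "2025-06-03"}
problems = rules_check(draft)
if problems:  # any mismatch triggers a retry or escalation
    print("escalate:", problems)
```

An LLM-based verifier plugs into the same slot: re-derive the answer independently, compare, and escalate on disagreement.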
Human-in-the-Loop Gate
For outputs that remain uncertain, loop in a human. Route low-confidence cases — often fewer than 5% — to a Slack thread, CMS queue, or lightweight dashboard. People approve, correct, or reject the draft, and their feedback becomes fresh training data for the models.
Run all three loops together, and you transform a fragile agent into a resilient co-worker: errors drop by an order of magnitude while the system keeps learning from every correction.
4 | 7‑Day Pilot Plan
Day 1: Add validation assertions to one existing script.
Day 2–3: Stand up a verifier model (Claude 3 Sonnet or GPT-4o).
Day 4: Log confidence scores; set a 0.9 pass mark.
Day 5: Wire Slack approvals for fails.
Day 6–7: Measure error delta — aim for <5%.
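Measuring the day-6–7 error delta is just a before/after comparison. A sketch with made-up run counts (the 30% baseline and 4% with-loops numbers are illustrative, not from the article):

```python
def error_rate(outcomes: list) -> float:
    """Fraction of runs that failed (False = failure)."""
    return outcomes.count(False) / len(outcomes)

baseline = [True] * 70 + [False] * 30    # hypothetical week-0 runs: 30% errors
with_loops = [True] * 96 + [False] * 4   # hypothetical runs with all three loops

delta = error_rate(baseline) - error_rate(with_loops)
print(f"error delta: {delta:.0%}")  # → error delta: 26%
```

If your with-loops error rate lands under the 5% target, the pilot passes.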
Conclusion
Autonomous agents are inevitable; unchecked errors are not. Build verification loops first, brag about autonomy later — and you'll ride the hype wave instead of drowning under it.
Thanks and have a great day. 👏