Anthropic published an unusually candid postmortem on April 23: three overlapping bugs
that degraded Claude Code quality across March and April 2026, each documented with root
cause, detection failure, and remediation. The API and inference layers were not affected; this
was entirely a client-side and prompt-orchestration story, which makes it more useful to
read than most cloud postmortems. The most honest sentence in the piece is the admission
that the degradations looked "broad, inconsistent" across user segments, which is exactly
why they were so hard to repro from the inside.
Bug 1 (March 4 – April 7): reasoning-effort default. Someone changed the default
reasoning_effort from high to medium to reduce latency. Intelligence dropped
noticeably on Sonnet 4.6 and Opus 4.6. The fix was to revert — and to set Opus 4.7 to
xhigh. The lesson Anthropic is explicit about: latency/quality tradeoffs are
production-shape changes and should be piloted, not defaulted.
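If you want that lesson in code shape: a minimal sketch of routing a latency-motivated
default change to a small deterministic pilot cohort instead of flipping it for everyone.
The function and cohort mechanics here are illustrative, not Anthropic's; only the effort
values come from the post.

```python
# Hypothetical sketch: pilot a latency/quality tradeoff, don't default it.
import hashlib

DEFAULT_REASONING_EFFORT = "high"      # the known-good default the post says was changed
CANDIDATE_REASONING_EFFORT = "medium"  # the latency-motivated candidate
PILOT_FRACTION = 0.05                  # expose a small, measurable cohort first

def reasoning_effort_for(session_id: str) -> str:
    """Route a deterministic slice of sessions to the candidate setting;
    everyone else stays on the known-good default until soak metrics
    and evals clear the change."""
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 100
    if bucket < PILOT_FRACTION * 100:
        return CANDIDATE_REASONING_EFFORT
    return DEFAULT_REASONING_EFFORT
```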
Bug 2 (March 26 – April 10): the caching bug. This one is the architectural
meat. Claude Code uses a clear_thinking_20251015 API header with keep:1 to clear
old reasoning when a session resumes after sitting idle for an hour, because stale
reasoning inflates prompts and drains usage limits. The bug: instead of clearing once on
resume, it cleared on every subsequent turn. Symptoms were "forgetfulness, repetition,
poor tool choices, and faster usage-limit drain."
It took over a week to isolate because it only fired in a narrow corner case,
stale-session resume, and because internal experiments and display changes masked the
behavior during testing.
"When provided the code repositories necessary to gather complete context, Opus 4.7 found
the bug, while Opus 4.6 didn't."
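The bug class itself is an old one: a one-shot operation that lost its latch. A minimal
reconstruction of the pattern, where the header name and keep:1 value come from the post
and everything else (Session, build_headers) is hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Session:
    resumed_stale: bool = False           # set once when resuming after >1h idle
    stale_thinking_cleared: bool = False  # the one-shot latch the buggy build lacked

def build_headers(session: Session) -> dict[str, str]:
    """Attach the clear directive exactly once per stale resume. The bug
    keyed the header off the stale-resume condition alone, which stays
    true for the rest of the session, so every later turn re-sent the
    clear and wiped fresh reasoning along with the stale."""
    headers: dict[str, str] = {}
    if session.resumed_stale and not session.stale_thinking_cleared:
        headers["clear_thinking_20251015"] = "keep:1"
        session.stale_thinking_cleared = True
    return headers
```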
Bug 3 (April 16 – April 20): "keep text ≤25 words." A constraint added to the
system prompt to reduce inter-tool-call chatter passed internal testing but showed a 3%
drop in the broader evaluation suite. Reverting the prompt fixed it. Read that
carefully: a single system-prompt line item moved benchmark scores by percentage
points, a reminder that "prompt engineering is not engineering" stopped being true
at scale some time ago.
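The obvious gate for this failure mode is to score every prompt edit against the full
suite before it ships, which is exactly what the remediation list below commits to. A
sketch, where the score harness and threshold are stand-ins, not Anthropic's tooling:

```python
MAX_REGRESSION = 0.01  # budget: block any prompt edit costing more than 1 point in 100

def gate_prompt_change(baseline_prompt: str, candidate_prompt: str, score) -> bool:
    """Run the evaluation suite against both prompts and refuse the edit
    if the aggregate score drops by more than the budget. A 3% drop like
    the <=25-words constraint's would fail this gate."""
    return (score(baseline_prompt) - score(candidate_prompt)) <= MAX_REGRESSION
```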
The remediation list is where the post earns its keep: per-model evaluations for every
system-prompt change, continuous ablation to isolate each prompt line's contribution,
soak periods and staged rollouts for any change that trades latency for intelligence,
and an internal policy that Anthropic engineers run the same build shipped to users
instead of pre-release variants. For anyone operating an agentic product at scale, this
is the missing chapter of your internal runbook — ablation + soak + dogfood, written in
the voice of a team that just learned why those three things need to be operational,
not aspirational.
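Continuous ablation is the most mechanizable item on that list. A leave-one-line-out
sketch, again with a hypothetical score() harness rather than anything Anthropic
describes:

```python
def ablate_prompt(prompt: str, score) -> list[tuple[str, float]]:
    """Score the prompt with each line removed. A line whose removal
    raises the score (like the <=25-words constraint) is a deletion
    candidate; a line whose removal tanks the score is load-bearing."""
    lines = prompt.splitlines()
    baseline = score(prompt)
    deltas = []
    for i, line in enumerate(lines):
        ablated = "\n".join(lines[:i] + lines[i + 1:])
        deltas.append((line, score(ablated) - baseline))
    # biggest score gain from removal first, i.e. most harmful lines on top
    return sorted(deltas, key=lambda pair: pair[1], reverse=True)
```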
Read the full postmortem on anthropic.com →