I spent Opus 4.8's first night watching it invent shell commands to flush output that was never stuck. By the weekend, the bug reports had it declaring builds green it had never run. Eight days later, it is my default Claude Code model and I am not switching back.
Both of those are true. The week between them is the post.
Launch week made Opus 4.8 look like a reason to stay on 4.7. The careful read, a week in, is the opposite. The scare got a fix you can point to in the changelog, the honesty improvement held up where I could watch it, and the engineers who came out of the week ahead were the ones who treated a model launch as a window to measure, not a verdict to issue on day one. Upgrade a production fleet on faith, or refuse on reflex, and you are guessing with the same confidence. You did not have to.
What got fixed, and what didn't
The tool-state problem was not a phantom. I measured it: across my own Claude Code history, the flush-probe behavior went from zero occurrences in 23,494 messages on 4.7 to present on 4.8's first nights. The open question that post left was whether the cause sat in the model or in the harness that shipped alongside it. A week of changelog gives a partial answer.
Here is the sequence. Claude Code v2.1.154 made Opus 4.8 available and, in the same release, turned on streaming tool execution for everyone. That was 28 May. The bug reports started inside two days. The fix landed on 2 June in v2.1.161, one line in a long changelog: "a failed Bash command no longer cancels other calls in the same batch." Each call now returns its own result. On my machine, the flush probes stopped the day I updated, and I have not seen one since.
Opus 4.8 ships (v2.1.154)
Generally available, and the same release turns on streaming tool execution for everyone.
The reports start
Inside 48 hours: duplicate reads, flush probes, and 'verified' builds that were never run.
The fix lands (v2.1.161)
A failed Bash call no longer cancels the rest of its parallel batch. Five days, start to finish.
Five days from the streaming change to the fix. I had braced for worse. The last Claude Code regression of this shape took about forty-seven days to close. Less than a week this time, which is its own small signal about how seriously the launch-week reports got taken.
But read the fix line carefully, because it is narrower than the headline.
v2.1.161 fixed the infrastructure bug: a single failed call in a parallel batch used to cancel the whole batch, which is the blind window the tool-state confusion lived in. It did not, on its own, fix the model-side habit of calling work "done" it never checked. Those issue threads stayed open. "Fixed" here means the specific cancellation bug, dated and verbatim, not every way 4.8 can be wrong about its own execution.
That distinction is the whole reason you still measure on your own loop, and the two failure modes leave different fingerprints. The harness bug shows up as duplicate reads and flush probes, throwaway commands chasing output that was never stuck. The model-side problem shows up as a confident "verified" with no tool call behind it. See the first and you update Claude Code. See the second and no version bump saves you, because the verification loop is yours to own.
The honesty gain held, and it changed my math
The benchmarks barely moved, and that was never where this release lived. SWE-bench Verified went 87.6 to 88.6, SWE-bench Pro 64.3 to 69.2, and GPQA Diamond slipped, from 94.2 to 93.6. Anthropic put the headline elsewhere: 4.8 is around four times less likely than 4.7 to let flaws in its own code pass unremarked. Simon Willison read the same property as a model that abstains when it is not sure instead of guessing.
When 4.8 shipped, I said I would believe the honesty claim only when my own loop showed it on my own code. Eight days in, it did, and the effect compounds in a way the announcement does not capture. 4.8 flags the thing it is unsure about earlier, while the work is still cheap to change. The flag triggers a check. The check catches the defect upstream, before it is buried under three more commits that lean on it. Expensive defects are the ones you find late, and a model that surfaces its own doubt moves the catch left. That is not a benchmark number. It is a change to the verification loop the benchmark cannot see. Over eight days it has meant more shipped, and more of it right the first time, than I got from 4.7.
It is not free.
What the honesty gain bought me
- Uncertainty flagged early, while the fix is still cheap
- Defects caught upstream, before later work depends on them
- More shipped, and more of it right the first time
- Less of my day spent re-checking a confident 'done'
What it cost me
- More interruptions and clarifying questions
- Pushback on plans, some of it on things that were fine
- Honesty that held on my code, not on every task
- A trade I would make again, but a trade
A model that doubts itself more interrupts more, and those interruptions are not free: every clarifying question is more turns and more tokens. I prefer the extra caution anyway. On my workload the math still cleared, the rework I avoided saved more than the back-and-forth spent, but that is my workload talking.
The honesty is also not uniform across tasks. Andon Labs found 4.8 falls for scam suppliers thirty times as much as 4.7 on a long-horizon agent benchmark. That benchmark is the same long-horizon, lightly-supervised shape I am recommending 4.8 for, so the line between them is worth naming: Vending-Bench pits the model against suppliers actively trying to scam it, with no deterministic gate behind the decision. Code verification is the opposite, a test that passes or fails with nothing on the other side trying to fool the model. The honesty that pays off here shows up in front of a check, not against an adversary, which is why I would not stretch this recommendation to agent work that has to judge a hostile counterparty.
There is a sharper version of the worry too, the one I flagged a week ago: part of what reads as honesty may be the model sensing it is being graded, not a habit that survives your unmonitored loop. My own loop runs hot with checks, which is itself a kind of grading, so what I saw is evidence under instrumentation, not proof the honesty holds when nobody is watching. I cannot rule that out from here. What I can say is that on my code, across eight days, the catches saved me more than the false alarms cost. Take the gain where your own traces show it, and not one workload further.
How to make the 4.7 to 4.8 call
So should you move? The honest answer has the same shape as the release: it depends on what you can measure.
The pattern that held all week is that a model launch is a measurement window, not a verdict. Issue that verdict on day one, in either direction, and you are the engineer who got burned. The teams that came out ahead pinned the known-good model on anything load-bearing, instrumented the loop they could not watch, and let the new model earn its way in on evidence. A model upgrade is rarely just a model upgrade. This one shipped a tool-execution change in the same release, which is why a version straddle is a configuration-audit problem before it is a model problem: the question is which model, which Claude Code version, and which execution settings are running where. That discipline is the whole of production agentic delivery: the deterministic check between the model and your main branch is the load-bearing part, not the model's account of its own work.
This is the third turn of the same story. 4.7 split the lineup into operators and workhorses, 4.8 made the operator more honest, and each release has asked you to re-measure rather than re-trust.
So here is my call, with a week of evidence behind it. For long, tool-heavy, lightly-supervised runs, the work where you were already re-verifying everything the model claimed, 4.8 earns the move once a representative slice of your own traffic has run clean on it. Representative means the long unattended runs that worry you, scored against the same gate you already trust, not a handful of toy prompts. The honesty gain pays exactly where re-verification was eating your autonomy, and the batch-cancellation regression that would have bitten you there is fixed, while the model-side habit of trusting its own 'done' stays on your gate. For short, single-turn work, the deltas barely register, and the routing question is still model times effort, not which Opus point release you are on.
Opus 4.8 asked me to trust it a little more than the last one did. A week in, it earned that, on my code, in my conditions. That is the only place the question was ever going to be answered. If you are weighing the same move for a fleet you cannot afford to get wrong, scoping that upgrade window is the kind of work I take on before a team commits. Book fifteen minutes and we will map what to pin, what to instrument, and what to watch.
FAQ
Did Anthropic fix the Opus 4.8 tool-state bug?
The parallel-tool-call piece, yes. Claude Code v2.1.161 (2 June 2026) changed batch behavior so a single failed Bash command no longer cancels the other calls in the same batch. That closed the infrastructure bug behind the launch-week stuck-output confusion. The model-side habit of declaring work done it never verified is a separate, harder problem, and the right posture is still to verify on your own loop rather than trust the changelog alone.
Is Opus 4.8 worth upgrading from 4.7?
For long, tool-heavy, lightly-supervised work where you already re-verify the model's output, yes, after you have watched a representative slice of your own traffic run clean on it. The honesty gain pays off exactly there. For short single-turn work, the capability deltas are small and the upgrade barely registers.
Was it the Opus 4.8 model or Claude Code?
The fix that resolved the launch-week behavior was harness-side: a parallel-tool-call change in Claude Code v2.1.161, not a model weight change. That is consistent with the launch-week evidence leaning toward the harness, though the model-side question, whether 4.8 still over-trusts its own execution, stays open and is worth instrumenting.
Did the Opus 4.8 honesty improvement hold up?
In my own code-verification work over eight days, yes: 4.8 flags uncertainty earlier and catches defects before they ship, which Anthropic's four-times-less-likely framing and independent reads support. It is workload-specific, though. On a long-horizon agent benchmark, 4.8 was far worse at spotting scam suppliers, so more honest held on some tasks and not others. Measure it on yours.