Discussion about this post

Keeping TABs on Your AI Agents

I've been running the same benchmarks across 10 models with 101 different harness configurations at tabverified.ai, and the harness variance you're measuring across Claude and Codex shows up in every test. Same model, same benchmark, different harness, 36-point score difference. It's not noise.

Your mitigation #3 (route narrow calls elsewhere) is the one I'd push hardest. I ran DeepSeek V4 Flash against error recovery and it scored 85 on clear failure communication. DeepSeek V4 Pro scored 11. Same provider, same architecture family. The cheap model was better at the specific task. Most people are sending Opus-level traffic to tasks that a $0.10 model handles better, not just cheaper, actually better.

The June 15 change is going to force the audit you describe in mitigation #4 whether people want to do it or not. The teams that already know which model fits which task will absorb it. The teams that defaulted to "send everything to Sonnet" are the ones who get squeezed.

Mark Ulett

Thanks for writing this piece. We all get thrown into the token economics of the pool eventually, but this repricing has a huge impact for some of us and I am one with total exposure.

I built a Claude Code Headless Agents harness because of Configuration Impersonation and other failure states common to third-party apps, with two-factor context validation so I could be sure the context was the true source of inference.

I knew about the price exposure since day 1 and hoped we would make it to the fall. It’s not something to be mad at.

Appreciate the write-up. I would buy your auditor tool, but I know what this means. Time for all the baby birds to fly free from the PoppaAnthropic nest.

“Thanks for all the fish, Anthropic.”
