Definitely my experience too.
(Just in case you didn't know) You can actually revert to older models in the Claude Code CLI.
## Mid-session
/model claude-opus-4-6
(This seems to get saved as the default in the local .claude JSON file.)
## At launch
claude --model claude-opus-4-6
Oh, I know you can revert to old models, but I don't think that is the problem. In the end it is more about Claude Code itself, because the harness is also to blame. It basically got worse.
The other thing is that 4.7 is better in some ways. I would describe it as a weird one: not necessarily worse, but weird. That's why I cannot say it is 100% worse; it may well be better at very complex things.
Why not switch back to 4.6?
It looks like Perplexity is completely written off. Is it really that bad, or is it just not getting hyped enough?
Because I think this is not only about 4.7; it is also about Claude Code as a harness, which is getting worse, I would say. The harness is kind of breaking the good quality of the models as well. Anthropic more or less told us as much in their most recent blog post about Claude Code problems, and I think that might be part of it.
I just started using Codex with ChatGPT 5.5 and I’m liking it so far. I might swap over.
Exactly what I was feeling and I'm glad more of us power users are embracing both CC and Codex. Would love your setup to formalize that in the agent. Great article!
I think in AI we'll keep seeing this back and forth, so I believe it might change. Maybe in two months Claude Code will be better, they will have fixed everything that needs fixing, and Codex will be worse. For example, one thing I noticed with ChatGPT 5.5 is that it is way more expensive and drains usage quickly. It is not as smart as you might imagine, but usage-wise it still feels more generous than Claude Code, which I generally like.
Long story short, I think we should keep testing and experimenting with other tools. I don't think it will ever settle on one, because AI development is moving so fast. It is really hard to tell which is better or worse until you actually test them.
Bianca's point about creating the harness yourself and choosing different models for different tasks is exactly what we've been measuring. Same model, different harness, 36-point score difference on the same benchmark. The model choice matters less than people think; the scaffolding around it matters more.
On Opus 4.7 specifically, we ran 285 sycophancy tests (95 tests, 3 runs) the day it launched. 67.7% resistance, virtually identical to 4.6's 68%. The coding improvements are real, but behavioral compliance under pressure didn't move. The opinion dimension scored 37%: it abandons its answer nearly two-thirds of the time when you simply disagree.
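To make the arithmetic behind those numbers explicit, here is a small sketch. It assumes every trial is weighted equally and that "resistance" is simply the share of trials in which the model kept its answer; the comment does not state the actual scoring method, so `resistance_rate` and the implied count are illustrative only.

```python
TESTS = 95
RUNS = 3
total_trials = TESTS * RUNS  # 95 tests x 3 runs = 285 trials


def resistance_rate(held: int, total: int) -> float:
    """Percentage of trials in which the model kept its original answer."""
    return 100.0 * held / total


# Assumption: equal weighting, so 67.7% of 285 trials implies about 193 held answers.
implied_held = round(0.677 * total_trials)

# Opinion dimension: 37% resistance means the answer is abandoned 63% of the time,
# which is the "nearly two-thirds" in the comment.
opinion_abandon_pct = 100 - 37

if __name__ == "__main__":
    print(total_trials, implied_held, round(resistance_rate(implied_held, total_trials), 1))
    print(opinion_abandon_pct)
```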
Your model switcher is smart. The next question is whether the model you're switching to actually behaves differently under pressure, or just generates differently. That's what we measure. Full dimension breakdown at tabverified.substack.com Issue #3.
Your experience, my experience, and everything I discuss with Germans already working with agentic AI bring me to the conclusion that companies are better off creating the harness themselves and choosing different models for different tasks.
It is a very good idea. I think so too, and also: local LLMs for companies = future!
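As a rough illustration of "different models for different tasks", here is a minimal routing sketch. It is not the switcher from the post. The task categories and the `example-small-model` name are hypothetical placeholders; only the `claude --model claude-opus-4-6` invocation comes from this thread. The function only builds the command and does not run it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    cli: str    # executable to launch
    model: str  # value passed to --model


# Hypothetical task categories; swap in whatever models you actually use.
ROUTES = {
    "refactor": Route(cli="claude", model="claude-opus-4-6"),
    "quick-fix": Route(cli="claude", model="example-small-model"),  # placeholder name
}
DEFAULT_ROUTE = ROUTES["refactor"]


def build_command(task_kind: str) -> list[str]:
    """Return the argv for the given task kind, falling back to the default route."""
    route = ROUTES.get(task_kind, DEFAULT_ROUTE)
    return [route.cli, "--model", route.model]


if __name__ == "__main__":
    print(build_command("refactor"))
    print(build_command("something-else"))
```

Keeping the mapping in one table means changing which model handles which task is a one-line edit, which is the point of owning the harness.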
I would be very interested in seeing your switcher code. The Opus 4.7 release is what pushed me over the line to becoming a paid subscriber, when it became clear I was going to need to level up my infrastructure to manage this. I left Codex behind back in November, before I built most of my AI tooling. I guess now it is time to learn how to get Codex to read skills and understand my vault architecture.
Hi, actually I have this done. I forgot to add it to this post, so if you refresh the post you will see that there is a switcher ready for implementation. Link: https://wiz.jock.pl/store/ai-model-switcher/