Thanks for these great posts. I'm not a builder like you, but thanks to gen AI I can build and test simple Python prototypes. I'm interested in understanding how agents fail and keeping up to date on LLM performance. Like a lot of people on here, I was more of a ChatGPT user until 6 months ago, when I moved to Claude and started using ChatGPT less frequently, especially for coding. It's good to be aware that Codex may be better for future coding projects.
Would it be possible to give an example of one of the difficult tasks where you need a more powerful model?
Off subject, I'm very curious to know when you started coding. From reading your posts, it seems you were already coding before gen AI made it so easy.
Thanks for the comment!
As for where I need better models: coding, creating, and brainstorming.
Interesting timing on this. We just ran 50 safety drift tests on Opus 4.7, GPT-5.4, and GPT-5.5 last night. All scored 97-98% overall but the sub-categories are where it gets interesting. Wrote it up here if you want the numbers: https://tabverified.substack.com