Thanks for the read!
on this note:
>On terminal bench, Claude Code ranks 39th. There are 38 harness-model pairs that outscore it. If you filter to just Opus, Claude Code is dead last among harnesses. Cursor’s harness gets Opus from a 77% score to 93%. Claude Code gets that same Opus model... 77%. The harness adds nothing.
Cursor is not on Terminal Bench at all (https://www.tbench.ai/leaderboard/terminal-bench/2.0), what did you mean?
Nice read!
Thx!
Thanks for putting this together. I’ve already got some pieces from other people’s reviews.
Took a couple of ideas to improve my modular agents system (and also learned a few things I didn’t think were possible before!)
Sure thing! I bet there’s more than that! I just described the things that I was interested in :D
I think enough to keep someone busy for months lol 🤓😂
Good stuff, I was waiting for your angle. The skeptical memory concept is interesting because I already have “Read before assuming - always check file contents before asserting or editing” in my own CLAUDE.md after Claude asserting file contents it hadn’t read. Looks like I was rediscovering a design principle.
Your interactive explorer is the best entry point I’ve seen so far - thanks!
Thanks! I needed some time to really explore the whole leak. Many people jumped on this train the same day it was leaked. I was like - how was it possible to really explore this? I have to admit that after 4h on this, I still think there’s probably more.
Anyways - it’s fun and Claude Code should be OSS!
Four hours in and still finding things is the honest version of this story. I hope the community response will make the OSS case for Anthropic. Though the DMCA campaign suggests they see it differently, at least while the IPO clock is running.