Video: "NEW GPT 5.5 is INSANE!" by Julian Goldie on YouTube.
What the scores actually mean
Terminal Bench is a coding and reasoning benchmark that measures how well a model can complete multi-step software tasks without help. GPT-5.5 hit 82.7% on that test. Claude Opus 4.7, the closest comparable model, scored 69.4%. That's a meaningful gap, not a rounding error.
The model can also code autonomously for up to 20 hours on a single task, and during OpenAI's internal testing it apparently produced a previously unknown mathematical proof while working. Worth knowing: OpenAI codenamed this model "Spud" internally, and Plus, Pro, Business, and Enterprise accounts now have access.
That said, benchmark numbers and real-world output are two different things. The honest reading of Julian's test is: this model is measurably ahead of its predecessors on coding, reasoning, and following complex instructions — but the headline numbers are about potential, not a guarantee of polished results every time.
ChatGPT vs Codex — where the real difference is
Most people are testing GPT-5.5 in the ChatGPT chat interface, which is fine for one-off questions and quick drafts. But that workflow misses most of what makes the model genuinely useful. The practical capability is inside Codex, OpenAI's developer environment.
In Codex, GPT-5.5 can: run live web previews so you see a working page rather than code on a screen; build complete applications faster than any previous version; execute computer-use workflows that involve actual software interaction; handle folder-based projects rather than just single files; support environment-level testing. That's a different class of tool from a chat box.
In practice, this means a developer or small team can hand Codex a brief (build this dashboard, wire up this API, create this internal tool) and come back hours later to review the output rather than babysitting every step. That changes the economics of building things with AI considerably.
Building apps without coding experience
Julian ran the model through a set of app-building prompts aimed at non-developers, the kind of thing a business owner might want to build (a sales tracker, a client dashboard, a content workflow system). GPT-5.5's output was notably more complete and better structured than what earlier models produced on the same tasks.
One important shift in how to prompt this model: it doesn't respond well to the role-play-style prompts that worked for GPT-4 and GPT-5. You don't need to tell it to "pretend you're a senior developer"; it already behaves like one. What it needs is extreme specificity about what you actually want: exact fields, exact behaviour, exact outputs. Vague briefs produce vague results, same as always.
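To make "extreme specificity" concrete, compare two briefs for the sales tracker mentioned above. The wording is ours, not Julian's, and the field list is an illustrative sketch, not a tested prompt:

Vague: "Build me a sales tracker."

Specific: "Build a single-page sales tracker. Each deal has: client name, deal value, stage (lead / proposal / won / lost), and expected close date. Show the deals in a table sorted by close date, a total pipeline value at the top, and a button to export everything as CSV. No login; store data in the browser."

The second brief takes two minutes longer to write and can save an hour of back-and-forth, because the model never has to guess what "tracker" means.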
To be fair, "apps without coding experience" is still not quite the same as "polished apps without any quality checking." The model produces working code faster than before. Turning that code into something you'd actually give to a customer still requires someone who knows what they're looking at.
What's being overstated
The 20-hour autonomous coding claim is real, but it comes with the usual caveat: it works best when the task is well-defined and the environment is set up properly. Open-ended tasks still tend to drift. And the "discovered a new mathematical proof" story is interesting but not especially relevant to most businesses — it's a signal of reasoning capability, not a practical feature you'll use daily.
GPT-5.5 is also not cheap. Access tiers vary, and for heavy automated use via Codex, API costs will add up. Worth modelling the actual cost per task before committing it to a production workflow.
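What does modelling the cost look like? Here is a minimal back-of-envelope sketch in Python. Every number in it is an assumption: the per-token prices are placeholders, not OpenAI's published rates, and the token counts are invented to illustrate how long agentic sessions consume tokens. Swap in real figures before using it for a decision.

```python
# Back-of-envelope cost-per-task model. Every number here is a
# placeholder assumption, NOT published GPT-5.5 pricing. Replace the
# rates with current figures from OpenAI's pricing page.

INPUT_PRICE_PER_1M = 10.00   # assumed $ per 1M input tokens (placeholder)
OUTPUT_PRICE_PER_1M = 30.00  # assumed $ per 1M output tokens (placeholder)

def task_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single task, given its token usage."""
    return (
        input_tokens / 1_000_000 * INPUT_PRICE_PER_1M
        + output_tokens / 1_000_000 * OUTPUT_PRICE_PER_1M
    )

# Illustrative long Codex session: multi-hour agentic runs re-read files
# and re-send context repeatedly, so input tokens dominate. These counts
# are invented for the example.
session_input = 4_000_000   # cumulative tokens sent to the model
session_output = 400_000    # tokens of code and reasoning produced

per_task = task_cost(session_input, session_output)
print(f"One session: ${per_task:,.2f}")             # One session: $52.00
print(f"20 sessions/month: ${20 * per_task:,.2f}")  # 20 sessions/month: $1,040.00
```

The point isn't the specific total; it's that a long autonomous run can quietly cost more than you'd expect, and you want to know that number before the workflow is in production, not after.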
Where this connects to NordSys
If you've been thinking about building an internal tool, automating a repetitive process, or getting a lightweight app built quickly — GPT-5.5 in Codex is the closest thing to an AI coding colleague that exists right now. What it needs to produce good output is exactly what we help businesses specify: a clear brief, a structured task definition, and someone to review and test the output. That's the programming service we run. We handle the spec, the build, and the quality check — and if AI tooling makes the build faster, the saving comes back to the client.
See our programming service →