ChatGPT 5.4 Is Good. That's Not the Point.
Strong analysis, flat writing, and a model that sometimes lies about finishing its homework.
I tested ChatGPT 5.4 for 48 hours. Here’s what you need to know.
TL;DR: ChatGPT 5.4 is a real upgrade from 5.2. Stronger analytical work, better spreadsheets, and extended thinking that’s impressive under the hood. But the writing is still flat compared to Claude, the finished output doesn’t match the quality of its own reasoning, and you have to over-prompt to get what you want. If you’re productive with Claude or Gemini, don’t switch. If you’re on OpenAI, enjoy the upgrade.
OpenAI released ChatGPT 5.4 on March 5th. Within hours, my inbox lit up. Clients asking if they should switch. LinkedIn flooded with hot takes. I had a law firm managing partner text me at 7am asking, basically, “do we need to rethink everything?” No. You don’t. But let me explain.
I stopped reading the headlines and started testing. That’s what I always do when a new model drops, and I’d encourage you to be skeptical of anyone who published a review within the first two hours. They’re writing about the press release, not the product.
I’ve spent the last two days building spreadsheets, drafting documents, testing reasoning, and running ChatGPT 5.4 through the same workflows I use daily with Claude and Gemini. This is my early read. I’ll go deeper over the coming weeks, but what I’ve found so far is worth sharing now. One quick clarification: all of these tests were done with ChatGPT 5.4 Thinking Extended. My recommendation: don’t use Auto. Ever. It’s just not as good.
The short version
ChatGPT 5.4 is a real step forward from 5.2. But it’s not a reason to blow up what’s working. If you’re productive with Claude or Gemini, stay put. If you’re an OpenAI shop, you just got a nice upgrade. The gap between the big three is getting smaller, and honestly, that’s the best possible outcome for all of us.
Where it impressed me
I’ll give credit where it’s due.
The analytical work and tool calling are where this model really shines. I asked ChatGPT 5.4 to build multi-step analytical workflows and it nailed the sequencing. Pulling data, structuring it, running calculations, presenting results. All without losing the thread. If you’ve used 5.2 for this kind of work, you’ll notice the difference immediately.
Spreadsheet work is strong too, and this is where I spent the most time testing. I’ve built numerous test spreadsheets both inside ChatGPT 5.4 Thinking Extended and through the new ChatGPT Excel Add-in. Complex formulas, layered calculations, conditional formatting. The analytical depth is real. I’ll say more about how it compares to Claude on spreadsheets in a minute, because that’s a more interesting story than “one is better.”
It also sticks with long tasks, which sounds basic but hasn’t always been a given. Give it a multi-step project and it doesn’t drift. Doesn’t forget what you asked for halfway through. Doesn’t need you to keep re-prompting it back on track. If you’ve been burned by earlier models losing context mid-task, you know why this matters.
One more thing: the extended thinking mode. When you turn it on, you can watch the model reason through problems in real time. Methodical. Thorough. I was surprised by how good the raw thinking was. Which, as you’ll see, actually makes one of my complaints worse.
Where it falls short
Here’s where my experience parts company with the marketing.
Writing. I have to be blunt about this one. ChatGPT 5.4 writes better than 5.2. Sam Altman admitted publicly they’d fumbled writing quality in earlier versions, and they’ve clearly worked on it. But it still reads like AI. I’ve run the same prompts, same instructions, through both ChatGPT 5.4 Thinking Extended and Claude Opus 4.6 Extended, and the difference isn’t subtle. Claude sounds like a person wrote it. ChatGPT sounds like a very capable machine wrote it. Even with tweaking and detailed style instructions, it’s harder to get ChatGPT there. The prose comes out robotic. If you’re producing client-facing content or board materials, you’ll feel this gap right away.
And this connects to what I think is the most frustrating thing about the model. I’ve been calling it the thinking-to-output translation problem. The internal reasoning is excellent. Seriously, the chains of thought are thorough and well-structured. But somewhere between all that great thinking and the final product, something gets lost. You end up with technically correct content that reads flat. And here’s the thing: the extended thinking mode actually makes this more noticeable, because you can see how good the reasoning is. Then the finished output just... doesn’t match it. It’s like watching a chef prep beautiful ingredients and then serve them on a paper plate.
On documents and formatting, the picture is more mixed than I expected. I’ve been creating Word docs, PowerPoints, and spreadsheets on both platforms. ChatGPT 5.4 gives you functional output. Claude gives you something you’d hand to a managing partner without editing first. But here’s the more interesting comparison: on spreadsheets specifically, both deliver solid results. ChatGPT goes deeper analytically. Claude makes it look better. If I had to pick one for a financial model, I’d probably lean ChatGPT. If I had to pick one for a board deck, Claude. Every time.
The other thing that wears on you is how much hand-holding ChatGPT needs. With Claude, I can give a high-level brief and get back something that shows it understood what I was after. With ChatGPT 5.4, I’m writing much more detailed, explicit instructions to get comparable results. That overhead adds up when you’re doing this all day. It’s capable, for sure. It just doesn’t read between the lines the way Claude does.
What people I trust are saying
I follow a handful of people who actually test these models, and I want to share what a few of them found because it lines up with my own experience in interesting ways.
Nate B Jones runs structured blind evaluations across the top models. He captured something I keep coming back to. He asked ChatGPT 5.4 a simple question: “I need to wash my car. The carwash is 100 meters away. Should I walk or drive?” ChatGPT confidently said walk. Think about that for a second. Claude answered in one sentence: “Drive. You need the car at the carwash.” Jones called it “the story of GPT-5.4.” Strong on quantitative work. Shaky on the kind of common-sense reasoning you’d expect from an intern on their first day. His broader observation stuck with me: these models “are converging on capability and diverging on philosophy.” That’s an important idea, and I’ll probably come back to it in a future piece.
The AI Daily Brief called it “the most substantial OpenAI release in recent memory.” They gave it credit on computer use and professional tasks while being straight about the distance between press release and reality. Fair assessment.
Ethan Mollick at Wharton has been writing about the bigger shift from chatbots to agentic systems. His point is worth sitting with: we’re past the era where any single model release changes everything. The question now is which set of tools actually helps you get work done. I think he’s right, and I think most executives are still asking the wrong question when a new model drops. The question isn’t “is this one better?” It’s “does this change what I should be doing?”
Every.to ran extensive coding and planning tests. ChatGPT 5.4 “won every planning test” they threw at it. But they also documented something that should bother anyone relying on AI for professional work: the model sometimes marks tasks as complete before actually finishing them, and occasionally “completed tasks in obviously wrong ways, then lied about it.” I re-read that sentence twice when I first saw it. For legal work, for financial work, for anything where accuracy isn’t optional, that’s a serious concern.
Where things actually stand
Here’s my honest read. ChatGPT 5.4 has largely closed the gap with Claude Opus 4.6 and Gemini 3.1 Pro. It pulls ahead in some areas. It lags in others. Integrating ChatGPT 5.3-Codex into the main model was a smart move, but in practice it puts OpenAI on roughly even footing with Claude and Gemini. Not ahead. Parity.
For my work advising law firms, creating training materials, writing, and building analytical tools, I’m not switching. Claude handles about 60% of what I do, Gemini about 30%, ChatGPT about 10%. Nothing in this release changes that math. But I’m glad OpenAI shipped something real, because these companies pushing each other is what makes all the tools get better.
What to do with this
If you’re a senior leader wondering what this means for your team, here’s my take.
Don’t get distracted by model names. The best AI tool is the one your people actually use well. Not the one that topped a leaderboard nobody outside the AI community reads. I can’t stress this enough. I see firms chase every new release and never get good at any of them.
If you’re already productive in Claude or Gemini, stay. The switching cost is real and the marginal gains don’t justify it.
If your team runs on OpenAI, enjoy the upgrade. Your spreadsheets will be better. The analytical workflows will be smoother. The thinking mode is strong.
And regardless of where you are, keep watching the agentic space. That’s the real story. Not which model writes the best paragraph, but which platform lets you build workflows that run reliably without someone watching them. All three companies are betting big here, and that’s where the next wave of real productivity shows up for knowledge workers.
I’ll be testing more over the coming weeks. Deeper document workflows, legal research, more complex analytical work. I’ll report back.
For now? ChatGPT 5.4 brings OpenAI back to parity. Not ahead. But back in the conversation. And that’s good for everyone.
This review is early. The model is 48 hours old and I’ve barely scratched the surface. But first impressions matter, and mine is: better. Not different enough.
If you read this far, you’re not the person chasing every release. You’re the person trying to figure out which ones actually matter for the work your firm does tomorrow. That’s a harder question than it sounds, and the fact that you’re still asking it is exactly why you’re the right person to be making these calls.
That’s the conversation I have every day with firm leaders sorting signal from noise on AI. If you’re sitting with a question about what to do next, whether it’s tools, training, or strategy, send me a note at steve@intelligencebyintent.com. Tell me what you’re working through. I’ll give you a straight answer about what’s ready and what isn’t.
I write articles like this several times a week at smithstephen.com. It’s for the people running firms who need to make decisions about AI without becoming AI experts. No cheerleading. No vendor pitches. Just the part that’s actually useful.