ChatGPT 5.5 Is Smarter Than Ever. It's Also More Confidently Wrong.
57% accuracy. 86% hallucination rate when it doesn't know. Here's what that means on a Monday morning.
TL;DR
GPT-5.5 and GPT-5.5 Pro went live today. The gains for lawyers are in multi-step work, document-heavy tasks, and reasoning that has to hold together across many moving parts. Pro is the variant to test on hard legal work. GPT-5.5 tops the public benchmarks this week but has a documented hallucination problem that is worse than Claude Opus 4.7 or Gemini 3.1 Pro on at least one key benchmark. Test it on real matters. Verify everything. Keep your AI policy current.
Why this release is different
The thing to understand about GPT-5.5 is that the upgrade isn’t really about better answers. It’s about the model not wandering off.
OpenAI’s framing is that GPT-5.5 “carries more of the work itself.” In plain English, you can hand it something with three or four moving parts and it’s more likely to sequence the steps, pick the right sub-task, and check its own output before it hands you a result. That’s the difference between a model that needs a PhD to prompt it and one a senior associate can actually use without making you nervous.
Pro is a genuine step up on the hardest reasoning and research tasks. If you’re paying for it, use it on work that actually needs it. And keep this in mind. GPT-5.4 shipped seven weeks ago. If your firm wrote a three-year AI plan in February, it’s already out of date.
Here’s where I’d actually put this thing to work.
Litigation chronologies stop being a nightmare
I don’t know a litigator who enjoys building a chronology from scratch. Important work. Also awful.
The facts sit across emails, pleadings, texts, medical records, contract files, deposition transcripts, and the client’s memory. Dates don’t line up. The key fact is buried in an attachment. Someone says “late February” when they mean March 3.
That mess is where GPT-5.5’s gains show up.
Don’t ask it to “tell me what happened.” Too vague. Give it a real job. Build the first chronology, cite the source for every entry, flag conflicts between sources, separate hard facts from softer inferences, identify what still needs proof. The lawyer still verifies every entry. But starting with a sourced timeline is a completely different exercise than starting with a blank page. On day 10 of a case that matters, that helps. Before a deposition, even more.
Contract review becomes something closer to business advice
A lot of AI contract demos still feel like parlor tricks. Find the indemnity. Summarize termination. Compare two versions. Fine. Not enough.
Clients don’t need a lawyer to say “there is an indemnity clause.” They need the lawyer to say “this clause makes you responsible for losses you don’t control, and if this deal goes sideways, this is where the money leaks.”
That’s the real work, and GPT-5.5 is better at getting there.
The prompt I’d test isn’t “redline this.” It’s closer to: “Here’s the contract, here’s my client’s preferred position, here’s the deal context. What would make my client regret signing this 12 months from now?”
That’s how deal partners actually think. Assignment, change of control, data rights, payment timing, limitation of liability, audit rights, termination, exclusivity. Legal provisions, sure, but also business choices. The model can help a senior associate surface those choices faster. The partner still decides what to fight over.
Make the model attack you instead of help you
Here’s where I’d use GPT-5.5 Pro. And I wouldn’t use it as a case finder, because that’s still how lawyers end up on the wrong end of a sanctions order.
Use it to stress-test the argument instead.
Most lawyers prompt research the wrong way. “Find cases that support my position.” Understandable. Also how you miss the problem. The better move is to hand GPT-5.5 Pro the issue, the jurisdiction, the facts, and the position you want to take, then tell it to rip the argument apart. Where does this break? What facts would make a judge uncomfortable? What is the other side’s cleanest counter? Which precedent would you hate to see in their brief?
That’s not a research shortcut. It’s a preparation shortcut. Good partners do this instinctively. Junior lawyers take years to build the reflex.
Caveat, because it has to be said: GPT-5.5 does not eliminate the citation problem. Every case it surfaces still needs human verification. Every pin cite. Every quote. We’ll get to that.
Client alerts can become client service
Every firm sends alerts. Most clients ignore them.
Not because the alerts are badly written. Most are fine. The problem is they’re broad by design. They explain what changed. They don’t tell a specific client what to do about it.
Take a regulatory change, an enforcement action, or new agency guidance. Feed the model the update plus a short description of your affected client types. Ask it to separate the generic news from the practical implication. Who should care, why now, what they should review this week, what question the relationship partner should ask on the next call.
A privacy update hits a healthcare company, a SaaS vendor, a retailer, and a PE portfolio company differently. A labor rule hits a 30-person startup and a 12,000-person employer operating in nine states very differently. The old version of this work is “send the alert.” The better version is “turn one alert into 12 client-specific partner notes.” Twenty minutes now, versus three hours.
Clients remember the second kind.
The one most managing partners will sleep on
If AI only helps your lawyers draft faster, fine, that’s useful. If it helps the managing partner see where profit is actually leaking, that’s a different conversation.
Law firms sit on years of useful data. Time entries. Write-downs and write-offs. Staffing patterns. Matter budgets. Realization rates. Collection delays. AFAs that worked, and the ones that quietly bled money. Most firms don’t really learn from it. They feel the pattern. They don’t study it.
GPT-5.5 is meaningfully better at working across spreadsheets and documents at the same time. Hand it your closed matters from the last six months and ask where margin leaked. Look for write-off patterns by partner, matter type, office, or phase of the engagement. Compare fixed-fee matters that worked against fixed-fee matters that didn’t, and ask what the actual difference was.
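If you want to sanity-check whatever the model tells you, the underlying math is simple enough to run yourself. Here is a minimal sketch of the realization-rate cut, assuming a closed-matter export with hypothetical columns (matter type, billed, collected); the names and numbers are illustrative, not from any real firm:

```python
from collections import defaultdict

# Hypothetical closed-matter export: (matter_type, billed, collected).
# Column names and dollar figures are illustrative only.
matters = [
    ("fixed-fee IP",      120_000, 118_000),
    ("fixed-fee IP",       90_000,  61_000),
    ("hourly litigation", 250_000, 244_000),
    ("hourly litigation", 180_000, 171_000),
    ("fixed-fee IP",      110_000,  80_000),
]

billed = defaultdict(float)
collected = defaultdict(float)
for matter_type, b, c in matters:
    billed[matter_type] += b
    collected[matter_type] += c

# Realization rate per matter type: collected / billed.
# A low rate is where margin is leaking.
for matter_type in billed:
    rate = collected[matter_type] / billed[matter_type]
    print(f"{matter_type}: {rate:.0%} realization")
```

The point of the model isn’t to do this arithmetic; it’s to read the export alongside engagement letters and time narratives and suggest *why* one category leaks. But a ten-line cross-check like this keeps its answers honest.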
This use case doesn’t get applause at a CLE. It changes how a firm prices work and how partners decide which matters to take. That’s strategy, not software.
What actually worries me
Last summer, lawyers at a well-regarded firm ended up in serious trouble in Johnson v. Dunn after hallucinated AI citations made their way into federal filings. The firm avoided the worst of it largely because it could show real internal AI governance: warnings, escalation, and a written AI policy already in place. That was with an earlier model.
And this is not just a mid-market or one-off problem: this week Sullivan & Cromwell apologized to a federal bankruptcy judge after a filing reportedly contained AI-generated inaccuracies, including fabricated citations and misstatements of law, and acknowledged that its internal AI policies and secondary review processes were not followed.
GPT-5.5 is more confident when it’s wrong than its competitors are, and the gap isn’t small.
On AA-Omniscience, a benchmark specifically designed to measure what a model does when it is outside its knowledge, GPT-5.5 posts the highest accuracy of any model tested at 57 percent. That is the good news. The bad news is that its hallucination rate on the same benchmark is 86 percent. Claude Opus 4.7 is at 36 percent. Gemini 3.1 Pro is at 50 percent.
What that means in practice is simple: GPT-5.5 may know more and reason better, yet still be more willing to make something up when it should say “I don’t know.” For legal research, that is not a footnote. That is a specific, serious problem.
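To make those percentages concrete, here is a back-of-the-envelope reading. The assumption (mine, not the benchmark’s exact scoring definition) is that the 86 percent hallucination rate applies to the questions GPT-5.5 gets outside its knowledge:

```python
# Rough arithmetic on the AA-Omniscience figures quoted above.
# Assumption: the hallucination rate applies to the questions the
# model cannot answer correctly; the benchmark's exact definition
# may differ in detail.
total = 100                    # hypothetical question set
accuracy = 0.57                # GPT-5.5 accuracy on the benchmark
hallucination_rate = 0.86      # made-up answers when it doesn't know

unknown = total * (1 - accuracy)            # questions outside its knowledge: 43
fabricated = unknown * hallucination_rate   # confident wrong answers: ~37
abstained = unknown - fabricated            # honest "I don't know": ~6

print(f"Of {total} questions: {accuracy * total:.0f} right, "
      f"{fabricated:.0f} fabricated, {abstained:.0f} abstentions")
```

On that reading, out of 100 questions the model answers 57 correctly, fabricates roughly 37 answers, and admits uncertainty only about 6 times. Every one of those 37 looks exactly as confident as the 57.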
If your lawyers are running cite checks through GPT-5.5 without hand-verifying every citation, you are closer to a sanctions order than you were yesterday. My personal view, and I realize not everyone agrees: I wouldn’t use any of these general-purpose models for pure case-finding right now. Use a real legal research platform for that. Use GPT-5.5 Pro to pressure-test what you find.
What to do Monday
Three moves.
Pick three associate workflows you run every week. Test GPT-5.5 against your current process. Time both. Score the output. Get real numbers before the next marketing cycle hits.
Run the same tests on Claude Opus 4.7. OpenAI tops the leaderboard this week. Anthropic is ahead on grounding. Your practice mix decides which one wins for you.
Run the firm management test. Feed GPT-5.5 your last six months of closed matters and ask where margin leaked. See if it finds something you missed. If it does, that one test paid for a year of Pro subscriptions for the partnership.
That’s the test plan. Not a strategy deck.
The model race isn’t slowing down. Your real edge isn’t picking the winner every eight weeks. It’s running disciplined tests on real matters and keeping your AI policy sharp enough that no one in your firm ends up on the wrong side of a sanctions order.
If you’re building that test plan right now and want a second set of eyes on it, reach me at steve@intelligencebyintent.com.