Why My ChatGPT Usage Dropped from 80% to 10%
Same files. Same prompt. Very different results. One model found the evidence. Others buried it.
Image created by Nano Banana Pro
TL;DR: After months of daily testing across ChatGPT, Claude, and Gemini, my usage has shifted dramatically. Claude Opus 4.5 now handles 60% of my work, Gemini 3 Pro Preview takes 30%, and ChatGPT 5.2 gets the remaining 10%. I ran all five models through a real defense attorney workflow to show you why.
I’ve been having a lot of conversations lately about how my model usage is changing. As most of you know, I use all the major models daily for testing, understanding their capabilities, and having honest conversations with clients about what actually works.
Six months ago, before the latest releases hit, ChatGPT 5.1 Thinking dominated my workflow. We’re talking 70% to 80% of everything I did. The remaining 20% to 30% split evenly between Gemini and Claude. I defaulted to ChatGPT 5.1 Thinking “extended,” and once “heavy” arrived in September, I switched to that for deeper answers.
Then came the last few months of 2025. Gemini 3 Pro Preview. Opus 4.5. ChatGPT 5.2.
My usage has changed dramatically.
Here’s what my breakdown looks like today: 60% Claude Opus 4.5, 30% Gemini 3 Pro Preview, and 10% ChatGPT 5.2. I only turn to ChatGPT 5.2 when I need Pro or Heavy Thinking options, which I still use a few times a day. My personal take? The reasoning from those two options still surpasses what Gemini (even Deep Think) and Opus 4.5 generate. But those are the minority of my interactions. For most of what I do right now, Claude has earned the top spot.
Why I’m Sharing This
Yesterday I gave a presentation to a 150+ person law firm, running live demos on legal cases. Same input files, same questions, different models. When I do presentations like this, I don’t recommend specific models to a general audience. But I do highlight the differences.
The results were worth sharing.
The Test: A Real Defense Workflow
I created sample files for a workers’ compensation case. This firm does defense work, so I framed it this way:
The Problem: Defense attorneys receive thousands of pages of medical records and must quickly identify pre-existing conditions, inconsistencies, and malingering indicators. Missing these details means overpaying claims.
I uploaded sample files and gave each tool this prompt:
Analyze these medical records from a DEFENSE perspective for a workers’ comp claim.
Claimant: [Name, age, occupation]
Alleged injury: [Description]
Date of injury: [Date]
Mechanism claimed: [How they say it happened]
Create a defense-focused report identifying:
PRE-EXISTING CONDITIONS - Prior injuries, degenerative findings, earlier treatment to same body parts
CAUSATION PROBLEMS - Delayed reporting, mechanism inconsistent with diagnosis, non-industrial activities that could explain injury
GAPS AND INCONSISTENCIES - Treatment gaps, changing narratives, subjective complaints exceeding objective findings
MALINGERING INDICATORS - Waddell signs, failed validity testing, symptom magnification notes, surveillance-relevant findings
APPORTIONMENT OPPORTUNITIES - Quantifiable pre-existing conditions, age-related degeneration
Output as a table: Issue | Date | Source | Strategic Significance
Conclude with top 5 defense themes.
I ran the test fresh this morning with five model configurations:
ChatGPT 5.2 Thinking Extended
Claude Opus 4.5
Gemini 3 Pro Preview (Personal subscription)
Gemini 3 Pro Preview (Workspace subscription)
Gemini 3 Pro Preview in AI Studio
Why three different Gemini versions? I’ve been noticing output differences across them. In general, I often prefer what AI Studio produces.
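For readers who want to reproduce this kind of head-to-head on their own files, here's a minimal sketch of how the same prompt could be sent to each provider's API from a single script. This is not the exact setup from the demo (I uploaded the files through each tool's web interface), and the model ID strings and file name below are placeholders rather than confirmed API identifiers, so treat it as a starting point, not a recipe.

```python
# Minimal sketch: same prompt, same input, three providers.
# Model IDs and the records file name are placeholders (assumptions);
# check each provider's docs for the identifiers matching the versions above.
# Assumes the sample records are already extracted to plain text and that
# OPENAI_API_KEY, ANTHROPIC_API_KEY, and GOOGLE_API_KEY are set.
import os

import anthropic
import google.generativeai as genai
from openai import OpenAI

DEFENSE_PROMPT = """Analyze these medical records from a DEFENSE perspective for a workers' comp claim.
[... full prompt from above, including the five categories and the table format ...]

MEDICAL RECORDS:
{records}
"""


def run_chatgpt(prompt: str) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    resp = client.chat.completions.create(
        model="gpt-5.2-thinking",  # placeholder model ID
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def run_claude(prompt: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
    msg = client.messages.create(
        model="claude-opus-4-5",  # placeholder model ID
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text


def run_gemini(prompt: str) -> str:
    genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
    model = genai.GenerativeModel("gemini-3-pro-preview")  # placeholder model ID
    return model.generate_content(prompt).text


if __name__ == "__main__":
    with open("sample_medical_records.txt") as f:  # placeholder file name
        prompt = DEFENSE_PROMPT.format(records=f.read())
    for name, runner in [("ChatGPT", run_chatgpt),
                         ("Claude Opus", run_claude),
                         ("Gemini", run_gemini)]:
        print(f"\n===== {name} =====\n{runner(prompt)}")
```

The code itself isn't the point. Holding the prompt and the input constant is what makes the differences in the output meaningful.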
I was going to attach the direct output here, but Substack doesn’t support proper table formatting, and I didn’t want to paste endless images. So I created a single Google Doc so you can see all of the output exactly as it came out of the tools.
A link to all of the results can be found here:
My Rankings and Analysis
Now that you’ve seen the raw output, here’s how I’d rank them and why.
#1: Claude Opus 4.5
This report is the gold standard for a defense attorney. It organizes evidence into the five requested legal theories (pre-existing, causation, gaps, malingering, apportionment) rather than just presenting a chronology.
What stood out:
The forensic precision. Claude captured granular details others glossed over, including the exact discrepancy in the Straight Leg Raise test (80° distracted vs. 35° formal) and the specific “pen pickup” distraction test. It correctly identified the November 3, 2023 chiropractic note about “early retirement due to back,” establishing a pre-existing financial motive. That’s your smoking gun. It also caught the social media photos and the “forgot appointments” excuse.
#2: Gemini 3 Pro in AI Studio
This report is the best strategic overview of the bunch. It uses high-impact labeling (“High Value,” “Smoking Gun”) to frame the legal narrative effectively.
What stood out:
Better causation catch than Claude. It flagged the October 2023 “moving furniture” incident where the claimant admitted flaring up his back helping a friend. Claude missed that specific recent event. The rhetoric was strong too, framing the “early retirement” note as a “Financial Exit Strategy.”
Why not #1? It lacks the forensic measurement details (specific degrees of ROM) that Claude provides. Those numbers matter when you’re deposing a doctor.
#3: ChatGPT 5.2 Thinking Extended
High substance, but the analysis wasn't structured as well as the first two.
ChatGPT successfully identified both key smoking guns: the early retirement note and the “helping friend move furniture” incident. But the analysis output reads like a raw CSV dump rather than the structured sections I got from Claude and AI Studio. It's harder to read and harder to use in court without significant reformatting.
#4 and #5: Gemini Pro Preview (Personal and Workspace)
They missed the motive.
Both versions failed to mention the “early retirement” note. In a defense case, missing the claimant’s pre-injury statement about wanting to retire on disability is a critical failure. They did note the claimant stopped prior treatment due to “cost,” which is a good rebuttal to any “I was healed” argument. But the bigger miss hurts.
One thing worth noting: the output from my personal account was slightly better than from my Workspace subscription. As a business subscriber, I find that a little concerning.
What Do You Think?
Agree with my rankings? See something I missed? I thought this would give you a useful window into how these models actually perform on real professional workflows.
The differences matter. And they’re worth understanding before you pick a default tool for your team.
Why I write these articles:
I write these pieces because senior leaders don’t need another AI tool ranking. They need someone who can look at how work actually moves through their organization and say: here’s where AI belongs, here’s where your team and current tools should still lead, and here’s how to keep all of it safe and compliant.
In this article, we looked at why model selection matters more than most leaders assume. The same prompt and files produced dramatically different outputs, with real consequences for what evidence gets surfaced and what gets buried. The market is noisy, but the path forward is usually simpler than the hype suggests: test models on your actual workflows before you commit.
If you want help sorting this out:
Reply to this or email me at steve@intelligencebyintent.com. Tell me what professional workflows you’re trying to improve and which tools you’ve already tested. I’ll tell you what I’d run next, which model configuration fits your use case, and whether it makes sense for us to go further than that first conversation.
Not ready to talk yet?
Subscribe to my daily newsletter at smithstephen.com. I publish short, practical takes on AI for business leaders who need signal, not noise.


