I Ran Identical Research Through Five AI Deep Research Tools. Only Two Produced Work You Could Show Leadership.
ChatGPT and Gemini delivered research you could present Tuesday morning. Claude took two hours and contradicted its own margin math. Perplexity claimed 175 pages and produced seven.
TL;DR: I ran the same detailed market research prompt through ChatGPT, Gemini, Claude, Perplexity, and Grok. ChatGPT delivered the most reliable output, Gemini brought the sharpest strategic framing, and Claude buried useful work under inconsistent data. Perplexity and Grok weren’t close. If you’re buying one of these tools for your team, this matters.
Why This Test Matters Now
Google shipped the first Deep Research tool in December 2024, and that’s when agentic AI stopped being a demo and became something you could actually use for work. ChatGPT followed in early February 2025, then Perplexity and Grok, and Claude finally joined in April.
I last wrote about these tools when Claude launched theirs. But eight months is a lifetime in this space, and I wanted to see how they stack up today with something real. Not a toy problem. Not a “summarize this article” test. A full market research report with structure, sources, and stakes.
The Setup: One Prompt, Five Tools, Zero Mercy
I used the same extremely detailed prompt across all five platforms. Same request, same industry, same expectations. I picked Pet Wellness and Insurance because it’s complex enough to stress-test these tools, but not so niche that they’d have no training data.
The prompt asked for a comprehensive market research report including an executive summary, case study, long-term performance analysis, competitive landscape, growth drivers, visual elements, and a full appendix with sources. You can see the exact prompt and all five outputs in this shared Dropbox folder.
Then I waited to see what came back.
What I Found: The Rankings
First place goes to ChatGPT. It produced the most reliable output and the most complete delivery against spec. It stays aligned with consensus numbers from NAPHIA, ties every claim to sources, and delivers each requested section with actual substance. The executive summary puts 2024 US premiums at around $4.7B and roughly 6.4M insured pets, then frames global growth without chasing outlier estimates. The case study on Banfield Optimum Wellness Plans is concrete: you get enrollment levels, pricing bands, and real operating impacts. The long view separates pre-2018 growth, the COVID acceleration, and post-pandemic normalization, and it handles inflation and portfolio pruning with balance. Visuals are clean and reusable, including a US gross written premium time series and a clear market-share pie chart.
ChatGPT reviewed 23 sources, ran 11 searches, and finished in 21 minutes.
Gemini lands in second, and it’s close. This output is insight-dense and strategically sharp. It goes beyond table stakes and frames a clear story: a growth engine limited by profitability pressure. It reads share at the group level rather than brand-only, which is the right way to look at a consolidating category. It calls out real execution frictions, like the lack of standardized veterinary coding and why vet-direct pay is a moat rather than a UI tweak. Where it slips is in occasional overreach on specific point estimates and a reliance on a few mixed-quality sources. But if you want the strategy brain, this is it. I’d pair Gemini’s framing with ChatGPT’s numbers as your truth set.
Gemini pulled from 80 sources total.
Claude takes third. It’s broad and energetic, trying to cover segmentation, channel mix, and M&A comps. But it stumbles on internal consistency and fact hygiene. Up front, it pegs 2024 US premiums at $5.2B, which runs hot versus the ~$4.7B consensus. And it contradicts itself on margin math, citing “70 to 80 percent gross margins” in one place, then showing 20 to 37 percent after loss ratios elsewhere. There’s useful narrative and plenty of exhibits, but you’d need a QA pass before sharing this so no one wastes time chasing corrections.
Claude used 697 sources and took an astonishing 1 hour and 52 minutes to complete.
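If you want to see why those two margin figures can’t both be true: gross margin on insurance premiums is roughly what’s left after claims, so it moves inversely with the loss ratio. Here’s a minimal sketch of that sanity check in Python; the loss-ratio range is an illustrative assumption, not a number pulled from any of the five reports.

```python
# Minimal sanity check: gross margin on premiums is roughly 1 - loss ratio.
# The loss-ratio range below is an illustrative assumption, not a figure
# from any of the five reports.

def gross_margin(loss_ratio: float) -> float:
    """Share of premium left over after paying claims."""
    return 1.0 - loss_ratio

for lr in (0.63, 0.80):  # claims consuming 63-80% of premiums
    print(f"loss ratio {lr:.0%} -> gross margin {gross_margin(lr):.0%}")

# Prints 37% and 20%. A separate claim of 70-80% gross margins would
# require loss ratios of 20-30%, so the two statements can't coexist.
```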
Grok lands in fourth. This reads like a placeholder draft. Many statements are hedged, and several figures are back-calculated from CAGR assumptions instead of identified primary data. It checks the boxes for sections, but they’re lightly sketched. The “case study” is more company mini-history than a problem-solution-metrics arc, and it mixes US and international points without crisp scoping. Not unusable, but not sufficient if you’re presenting this to anyone who’ll ask follow-up questions.
Grok thought for 1 minute and 7 seconds, and reviewed 101 web pages.
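If “back-calculated from CAGR assumptions” sounds abstract, here’s what it means in practice: pick one anchor value and an assumed growth rate, then derive every other year from the formula instead of citing reported data. A minimal sketch, using the ~$4.7B 2024 consensus figure as the anchor and a 15% CAGR as a purely illustrative assumption:

```python
# Back-calculating historical figures from a CAGR assumption:
#   value_t_years_back = anchor / (1 + cagr) ** t
# The 15% CAGR below is an illustrative assumption, not a rate from
# Grok's report or any primary source.

def backfill_from_cagr(anchor: float, cagr: float, years_back: int) -> list[float]:
    """Derive earlier years' values from one anchor year and a growth rate."""
    return [anchor / (1 + cagr) ** t for t in range(years_back + 1)]

anchor_2024 = 4.7  # US premiums in $B, the consensus figure cited above
for year, value in zip(range(2024, 2019, -1), backfill_from_cagr(anchor_2024, 0.15, 4)):
    print(year, f"${value:.2f}B")

# Every pre-2024 number here is a function of the assumed rate, which is
# why such figures read as plausible but aren't actually sourced.
```

That’s not inherently wrong as a modeling shortcut, but it isn’t the identified primary data the prompt asked for.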
Perplexity finishes last. By far the weakest of the bunch. The report claimed “175+ pages” but produced only seven. Several headline numbers look unvetted or inconsistent with stronger sources. It does include four CSVs, which is a nice idea in theory, but the PDF doesn’t tie those files to the charts or reconcile them against a trusted baseline. Use it for sparking ideas, not for numbers you’ll put in front of a client or your board.
Perplexity ran 145 web searches and used 32 sources.
What This Means for Your Team
If you’re buying one of these tools or deciding which one your team should default to, here’s what I’d do.
Use ChatGPT when accuracy and completeness matter most. It’s the workhorse. If someone’s going to cite your output in a deck or use it to make a six-figure decision, start here.
Use Gemini when you need strategic framing and industry structure thinking. It’s the strategist in the room. Pair it with ChatGPT’s numbers and you’ve got something powerful.
Use Claude if you’re willing to do a QA pass and you want volume. It generates a lot, and some of it is genuinely useful. But don’t skip the fact-check.
Skip Grok and Perplexity for this kind of work. They’re not ready yet. Maybe they’ll improve, but right now they’re not in the conversation for serious research output.
One More Thing
These tools are all moving fast. What’s true today might shift in three months. But the pattern here matters more than the snapshot. When you’re evaluating AI research tools, you want three things: accuracy against known benchmarks, internal consistency, and clear source attribution. Two of the five tools delivered on all three. Three didn’t.
That’s the signal.
Business leaders are drowning in AI hype but starving for answers about what actually works for their companies. We translate AI complexity into clear, business-specific strategies with proven ROI, so you know exactly what to implement, how to train your team, and what results to expect.
Contact: steve@intelligencebyintent.com


