Google Just Flipped the AI Leaderboard in 90 Days
The AI price war just got a front-runner, and it's not who you'd expect.
Image created by GPT Image 1.5
Google’s Gemini 3.1 Pro Just Changed the Leaderboard. Here’s What It Means for You.
TL;DR: Google released Gemini 3.1 Pro on February 19, and it’s a real jump forward. It more than doubled its predecessor’s reasoning score, took the #1 spot on the most respected independent AI intelligence ranking, and leads in many benchmarks. It still trails in coding. And honestly, Claude still writes circles around it. At $2 per million input tokens, though, it’s less than half the cost of Opus 4.6. That changes the math for a lot of teams.
I’ve been telling people for months that Google was getting close. That the gap between Gemini and the other frontier models was shrinking fast.
Turns out I was underselling it.
Last Thursday, Google dropped Gemini 3.1 Pro, and the numbers aren’t close on most of the things that matter. They lead. Not by a hair. By a lot.
But I don’t want to just throw benchmark scores at you. That’s what everyone else is doing this week, and most of it is useless if you don’t know what the benchmarks actually test. So I’m going to explain the ones that matter most in plain English, show you where Gemini wins and where it doesn’t, and then talk about something most people are missing: a free tool from Google that lets you access a much more powerful version of this model than what you get in the regular app.
What Is Gemini 3.1 Pro, in 30 Seconds?
It’s Google’s newest and most capable AI model. It launched February 19, just three months after Gemini 3 Pro came out in November. In those three months, both Anthropic (Claude Opus 4.6) and OpenAI (GPT-5.2, GPT-5.3-Codex) had pulled ahead of Google. This is Google’s answer.
The interesting part is where the intelligence came from. Google took the reasoning engine from Gemini Deep Think, their most powerful model that was previously locked behind a premium subscription, and distilled it into 3.1 Pro. Same price as before. Same million-token context window. Considerably smarter.
The Three Numbers That Tell the Story
I could bury you in benchmarks. Google published scores across sixteen different tests, and they lead on thirteen of them. But most of those numbers won’t change how you think about this. Three of them will.
It Can Reason Through Problems It’s Never Seen Before
The benchmark is called ARC-AGI-2. Here’s what it does: it gives the model a series of visual logic puzzles that are specifically designed so the AI can’t rely on anything it memorized during training. Think of it like an IQ test. The model either figures out novel problems on the fly, or it doesn’t. You can’t study your way to a high score.
Gemini 3.1 Pro scored 77.1%. Three months ago, Gemini 3 Pro scored 31.1% on this same test. That’s not an incremental improvement. It went from getting roughly one in three puzzles right to getting three out of four. Claude Opus 4.6 is at 68.8%. GPT-5.2 is at 52.9%.
I want to be clear about why this particular test matters more than most. A lot of AI benchmarks can be gamed. You train on similar data, you score higher. ARC-AGI-2 was built specifically to prevent that. So when a model more than doubles its score in three months, something real changed in how it reasons.
It Can Pass a PhD-Level Science Exam
GPQA Diamond is a collection of questions written by PhD experts in physics, biology, and chemistry. These aren’t trivia questions. They’re the kind of problems you’d see on a doctoral qualifying exam, written by professors who were actively trying to make them hard.
Gemini 3.1 Pro scored 94.3%. Best in the field.
If you work in a science-adjacent industry, or if your team ever needs to reason through technical or scientific material, that number matters. Getting above 90% on questions that were designed to challenge PhD candidates is a meaningful capability.
Graduate-Level Performance Across Every Subject at Once
The Artificial Analysis Intelligence Index aggregates performance across some of the hardest public benchmarks in science, math, coding, and reasoning. A score of 57 means the model correctly solves roughly 57 percent of these advanced test problems across domains. On many of these exams, non-specialists would score near zero. The significance is breadth: today’s leading models perform at graduate level across multiple fields simultaneously on standardized evaluations, something no single human could realistically replicate.
Gemini 3.1 Pro scored 57. Claude Opus 4.6 scored 53. GPT-5.2 scored 51.
Where It Falls Short: Coding and Writing
I think it’s important to be straight about where this model doesn’t win, because you’ll make better decisions with the full picture.
On coding, Gemini 3.1 Pro is very good but not the best. The standard test for real-world bug fixing (SWE-Bench Verified) is basically a three-way tie between Opus 4.6 (80.8%), Gemini 3.1 Pro (80.6%), and GPT-5.3-Codex (80.0%). But on more specialized coding tasks, particularly terminal-based work like navigating file systems and managing builds, GPT-5.3-Codex pulls clearly ahead at 77.3% versus Gemini’s 68.5%. If your team lives in code all day, Claude and GPT still have edges that matter.
And then there’s writing. I want to be direct about this because I’ve tested it extensively.
Claude is still a far better writer.
I don’t mean a little better. I mean noticeably, consistently better at producing text that sounds like a human wrote it. I’ve run the same prompts through Gemini 3.1 Pro and Claude with identical system instructions, identical guidelines, identical context. The text that comes back from Claude has a natural rhythm to it. It varies its sentence length without being told to. It knows when to use a short sentence for emphasis and when to let a thought breathe across a longer one. The output from Gemini is competent, accurate, well-organized. But it reads like AI wrote it. Claude’s output reads like a person wrote it and then maybe had AI clean it up.
For my work, that distinction is everything. I write for senior executives who have finely tuned BS detectors. They can spot AI-generated content in the first paragraph. So when I need a writing partner to help me with a concept or a deep thought, I reach for Claude. When I need raw reasoning power, research synthesis, or multi-step analysis, Gemini 3.1 Pro and Gemini DeepThink are at the top of my list.
The SVG Test: Seeing Reasoning in Action
One of the fun, informal ways people have been testing these models is by asking them to generate SVGs, which are images made entirely from code. This forces the model to reason about spatial relationships, visual composition, and code all at the same time. No official benchmark tracks this. But it tells you a lot about how a model thinks.
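To make “images made entirely from code” concrete, here’s a minimal sketch that assembles an SVG string with two spinning bicycle wheels, the same kind of animated detail described below for the AI Studio output. The `wheel_svg` helper is my own illustration, not anything a model produced.

```python
# Minimal sketch of an "image made entirely from code": each wheel is
# an SVG circle plus a spoke, wrapped in an animateTransform element
# that rotates it indefinitely.
def wheel_svg(cx: int, cy: int, r: int) -> str:
    """Return an SVG fragment for one spinning wheel (illustrative helper)."""
    return (
        f'<g transform="translate({cx},{cy})">'
        f'<circle r="{r}" fill="none" stroke="black" stroke-width="4"/>'
        f'<line x1="0" y1="{-r}" x2="0" y2="{r}" stroke="black"/>'
        '<animateTransform attributeName="transform" type="rotate" '
        'from="0" to="360" dur="2s" additive="sum" repeatCount="indefinite"/>'
        '</g>'
    )

svg = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="300" height="200">'
    + wheel_svg(90, 150, 40)
    + wheel_svg(210, 150, 40)
    + '</svg>'
)
print(svg[:60])
```

Save the full string to a `.svg` file and open it in a browser, and the wheels turn. Now imagine generating a St. Bernard, a bicycle, rain, and a cowboy hat this way, with nothing but coordinates and code.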
I’ve been using the same ridiculous prompt across all the frontier models: “Create an SVG of a St. Bernard on a bicycle in a rainstorm wearing a cowboy hat.” The absurdity is the point. Juggling all those elements forces the model to think creatively and spatially.
Here’s how they did, ranked worst to best:
ChatGPT 5.2 Extended Thinking and Grok 4.2 Beta both struggled. The outputs were recognizable as attempts, but the compositions fell apart.
Here’s ChatGPT 5.2 Extended Thinking
Here’s the brand-new Grok 4.2 Beta
Claude Opus 4.6 Extended was OK. Elements were identifiable, composition was reasonable. Not embarrassing, not impressive.
Gemini 3.1 Pro (via the Gemini app) was very good. Better composition, clearer visual elements, more creative interpretation.
Gemini 3.1 Pro (via AI Studio) was outstanding. And that difference is worth talking about. You can’t see it in this still image, but the wheels are animated (turning) in this version.
AI Studio: The Free Tool Most People Don’t Know About
This might be the most practical thing I tell you in this whole article.
When you use Gemini through the regular Gemini app, you’re getting a version of the model that’s been wrapped in safety layers, system prompts, and various guardrails. That’s fine for everyday use. But Google also offers AI Studio (aistudio.google.com), and it gives you much more direct access to the raw model.
You don’t need to be a developer to use it. It’s free. And the difference in output quality, especially for creative and technical tasks, can be dramatic. That’s why the AI Studio SVG was so much better than the Gemini app SVG. Same model. Fewer layers between you and its capabilities.
Think about it this way. Using Gemini through the app is like talking to someone at a business dinner. Polished, careful, measured. AI Studio is like sitting with that same person at their kitchen table on a Saturday morning. Same brain. Fewer filters. More willing to take a risk.
If you’ve tested Gemini before and thought “pretty good, not great,” go try the exact same prompt in AI Studio. I think you’ll see what I mean.
So What Do You Do With This?
Here’s where I land on this, and I’ll give you the practical version.
What is it? A major intelligence upgrade from Google that leads most AI benchmarks at half the cost of the competition.
Why now? Three frontier models launched in sixteen days this month. The pace isn’t slowing down. If you’ve been waiting for things to stabilize before investing in AI workflows, that wait doesn’t have an end date.
What does it change? If you’re paying for AI at scale, the cost difference alone justifies a serious look. For research, analysis, and multi-step automated tasks, Gemini 3.1 Pro is now the strongest option. For writing and specialized coding, Claude and GPT still have real advantages.
What to do this week:
Go to aistudio.google.com and try Gemini 3.1 Pro on a real task from your work. It’s free and you don’t need to be technical.
If your team uses AI through APIs, compare your current costs against Gemini’s $2/$12 per million tokens. The savings might fund your next project.
For any research or analysis task, run a side-by-side test. The reasoning improvements are real.
Don’t move your writing or coding workflows. Not yet. Claude writes better. GPT-5.3-Codex codes better in specialized scenarios.
Watch for the general availability release. The model is still in preview, and Google has hinted at further improvements.
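If you want to run the cost comparison from the checklist yourself, the math is simple. Gemini’s $2 input / $12 output per million tokens comes from this article; the monthly volumes and the competitor’s prices below are made-up placeholders, so swap in your own numbers.

```python
# Back-of-envelope API cost comparison. Gemini's $2 / $12 per million
# input/output tokens is from the article; the volumes and the rival's
# prices are hypothetical placeholders.
def monthly_cost(in_tokens_m: float, out_tokens_m: float,
                 in_price: float, out_price: float) -> float:
    """Dollar cost for a month, with token counts in millions."""
    return in_tokens_m * in_price + out_tokens_m * out_price

IN_M, OUT_M = 500, 100  # hypothetical monthly volume (millions of tokens)

gemini = monthly_cost(IN_M, OUT_M, 2.0, 12.0)
rival = monthly_cost(IN_M, OUT_M, 5.0, 25.0)  # hypothetical rival pricing

print(f"Gemini: ${gemini:,.0f}/mo vs rival: ${rival:,.0f}/mo "
      f"(savings: ${rival - gemini:,.0f}/mo)")
```

At those placeholder volumes the gap is thousands of dollars a month, which is the point: at scale, per-token pricing stops being a rounding error and starts funding projects.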
The Bottom Line
Three months ago, Google was behind. Now they’re leading on most independent measures. Three months from now, someone else will probably push ahead again.
But this release is different in one important way. It’s not just that Gemini got smarter. It got smarter at a price point that changes who can afford to use frontier AI in production. Highest intelligence score at less than half the cost. That’s not just a benchmark story. That’s a business story.
The era of one AI model to rule them all is over. The right question now isn’t “which model is best?” It’s “which model is best for this specific job?”
And that, honestly, is a much more interesting question.
Why I write these articles:
In this article, we looked at what happens when the highest-performing AI model is also the cheapest one, and why the idea of picking a single “best” model is the wrong frame for any team making real workflow decisions. The market is noisy, but the path forward is usually simpler than the hype suggests.
If you want help sorting this out:
Reply to this or email me at steve@intelligencebyintent.com. Tell me what your team is using today, what it’s costing you, and where the output isn’t meeting the bar. I’ll tell you what I’d test first, which part of the Gemini/Claude/GPT stack fits your actual work, and whether it makes sense for us to go further than that first conversation.
Not ready to talk yet?
Subscribe to my daily newsletter at smithstephen.com. I publish short, practical takes on AI for business leaders who need signal, not noise.