The Most Expensive AI Model Isn't Your Best One for Knowledge Work Anymore
Claude Sonnet 4.6 just beat every other model on the benchmark that measures real office output, and it's 40% cheaper than the flagship.
Image created by GPT Image 1.5
Claude Sonnet 4.6 Just Dropped, and It’s Built for the Way You Actually Work
TL;DR: Anthropic released Claude Sonnet 4.6 today. It’s not just another model update. The benchmarks that jumped the most are the ones that measure real office work: building spreadsheets, writing financial analyses, navigating complex computer tasks. On the test that most closely mirrors what knowledge workers do all day, Sonnet 4.6 beat every model available, including Anthropic’s own flagship Opus, which costs 67% more per token. It comes with a million-token context window in beta (roughly 3,000 pages, API-only for now), new Excel connector support for financial data services, and it’s now the default model for all Claude users. If you work with documents, data, or decks for a living, this one’s for you.
The Cheaper Model Just Passed the Expensive One
Here’s a thing that keeps happening in AI, and it still catches people off guard every time.
You’re paying for the top-tier model. You’ve built your workflows around it. Then the company releases an update to the mid-tier model, and suddenly it does what the expensive one used to do. Sometimes better.
That just happened again.
Anthropic released Claude Sonnet 4.6 today. Sonnet is their mid-tier model. It’s faster than their flagship Opus, and it costs meaningfully less: $3 per million input tokens versus $5 for Opus, and $15 per million output tokens versus $25. That’s about 40% cheaper across the board.
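If you want to see what that discount looks like on an actual bill, here's a quick back-of-the-envelope calculation using those list prices. The workload numbers below (2 million input tokens and 500,000 output tokens) are just an illustration I made up, not anything Anthropic published, so swap in your own usage:

```python
# Back-of-the-envelope cost comparison using the list prices quoted above.
# The workload (2M input tokens, 500K output tokens) is an illustrative
# guess, not a published figure -- substitute your own usage numbers.

prices_per_million = {
    "opus":   {"input": 5.0, "output": 25.0},
    "sonnet": {"input": 3.0, "output": 15.0},
}

input_millions = 2.0    # millions of input tokens for the month
output_millions = 0.5   # millions of output tokens for the month

for model, p in prices_per_million.items():
    cost = input_millions * p["input"] + output_millions * p["output"]
    print(f"{model:<7} ${cost:.2f}")

# opus    $22.50
# sonnet  $13.50  (40% less, matching the per-token discount)
```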
And on several tests that measure actual knowledge work, not abstract puzzles, Sonnet 4.6 didn’t just close the gap with Opus. It beat it.
Anthropic’s own blog post put it plainly: “Performance that would have previously required reaching for an Opus-class model, including on real-world, economically valuable office tasks, is now available with Sonnet 4.6.”
Here’s what I mean. Developers who got early access preferred Sonnet 4.6 to its predecessor by a wide margin. Many of them preferred it to Opus 4.5, which was the smartest model Anthropic had just three months ago. That’s the mid-tier model beating last season’s best.
The Benchmarks That Actually Matter for Your Work
I know what you’re thinking. “Benchmarks. Great. More numbers I don’t understand that may or may not mean anything.”
Fair. Most AI benchmarks test things like math competition problems or graduate-level science trivia. Interesting for researchers. Not that useful if you’re trying to figure out whether this thing can help your team get through a stack of work faster.
But a few benchmarks have come along recently that test something different. They test whether AI can do real work. The kind of work you’d hand to a capable new hire. And these are the ones where Sonnet 4.6 made the biggest jumps.
Let me walk you through three that tell the story.
It Produces Better Work Product Than Any Other Model
GDPval-AA is a benchmark that was built to answer a simple question: can AI do the work that drives the economy? It was developed using OpenAI’s GDPval dataset and run independently by Artificial Analysis. It covers 220 tasks across 44 occupations and 9 industries, everything from finance to healthcare to legal work. The tasks were designed by professionals averaging 14 years of experience.
And these aren’t multiple-choice questions. The model has to produce actual deliverables: documents, spreadsheets, presentations, data analyses. Real work product. Then the outputs get compared head-to-head in blind matchups: expert judges pick the winner without knowing which model produced which output. The score is an Elo rating, the same system used to rank chess players. Higher is better, and the gap between two scores tells you how often one model would beat the other.
Sonnet 4.6 scored 1,633. The previous Sonnet scored 1,276. Opus 4.6, Anthropic’s flagship, scored 1,606. Gemini 3 Pro scored 1,201. GPT-5.2 scored 1,462.
The mid-tier model is #1. On the benchmark that most closely measures real office work.
Here’s what those numbers actually mean for you. The tasks in this test take an experienced professional about seven hours to complete. The AI finishes them in minutes. When expert judges compared the AI’s output to the human’s output side by side, not knowing which was which, the original GDPval study from OpenAI found that frontier models were winning or tying with the human expert close to half the time. That was with last generation’s models. Sonnet 4.6 scores well above them, and it’s producing better deliverables than any other model available, including Anthropic’s own flagship Opus.
Think about it this way: the Elo gap between Sonnet 4.5 and Sonnet 4.6 is massive at 357 points. For context, Artificial Analysis (which runs the benchmark) said a 150-point gap implies a roughly 70% win rate. The gap between the old Sonnet and the new one is more than double that.
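If you want to sanity-check those win-rate claims yourself, the standard Elo expected-score formula gets you there. I'm assuming the benchmark's Elo behaves like the textbook chess version, which is consistent with the 150-point-to-70% figure Artificial Analysis cites; the matchups below are my own illustrative applications of the formula to the scores quoted above, not numbers the benchmark publishes:

```python
# Standard (textbook) Elo expected-score formula: the probability that the
# higher-rated side wins a head-to-head matchup. Applying it to the
# GDPval-AA scores above is my own illustration, not an official figure.
def elo_win_probability(rating_gap: float) -> float:
    return 1 / (1 + 10 ** (-rating_gap / 400))

for label, gap in [
    ("150-point gap (Artificial Analysis's reference point)", 150),
    ("Sonnet 4.6 vs. the previous Sonnet (1,633 - 1,276)", 357),
    ("Sonnet 4.6 vs. Opus 4.6 (1,633 - 1,606)", 27),
]:
    print(f"{label}: {elo_win_probability(gap):.0%}")

# 150-point gap (Artificial Analysis's reference point): 70%
# Sonnet 4.6 vs. the previous Sonnet (1,633 - 1,276): 89%
# Sonnet 4.6 vs. Opus 4.6 (1,633 - 1,606): 54%
```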
It Does Junior Analyst Work Better Than Any General-Purpose AI
Finance Agent v1.1 was built by vals.ai in collaboration with Stanford researchers, a major global bank, and experts from hedge funds and private equity firms. It tests whether an AI can do the work expected of an entry-level financial analyst: pulling data from SEC filings, calculating growth rates, identifying revenue trends, running projections.
The model gets the same tools a human analyst would use: access to the SEC’s EDGAR database, Google search, and document parsing tools. Nobody hands it the right answer on a platter. It has to find the right filings, navigate to the right sections, pull the right numbers, do the math, and produce an answer. On its own.
Sonnet 4.6 scored 63.3%. That’s the highest of any model tested, including Opus 4.6 at 60.1% and GPT-5.2 at 58.5%.
So what does 63.3% actually mean? It means the model got roughly 340 out of 537 expert-written financial research questions right, working completely autonomously. Nine months ago, when the benchmark first launched, no general-purpose model could break 50%. The previous ceiling was about 47%. Sonnet 4.6 blew past it. And here’s the context that matters: entry-level analysts at investment banks spend up to 40% of their week just gathering data, not analyzing it. That’s the work this test measures. The model isn’t replacing anyone’s judgment on whether to make a deal. But the hours of pulling numbers, cross-referencing filings, and building the foundation for analysis? That’s where this hits.
It Uses a Computer as Well as You Do
OSWorld gives an AI its own computer and says: complete these tasks. Real tasks. Navigate a spreadsheet. Fill out a multi-step web form. Open applications, find information, work across multiple tabs. The same kind of computer work that fills your day.
Sonnet 4.5 scored 61.4%. Sonnet 4.6 scored 72.5%.
Here’s the number that stopped me. The human baseline on this test is 72.36%. Sonnet 4.6 scored 72.5%. It has essentially reached human-level performance on everyday computer tasks.
And the trajectory is what makes this wild. In April 2024, the best AI model scored 12.24% on this benchmark. Less than two years later, we’re at human parity. That’s not a gradual improvement. That’s a before-and-after story. When Anthropic first launched computer use in October 2024, they called it “cumbersome and error-prone.” Sixteen months later, the model is matching what a person can do on a computer. If you use Cowork, Claude’s desktop productivity tool, this is the engine underneath it. And it just got a lot better at its job.
A Million Tokens Changes Everything for Document-Heavy Work
Sonnet 4.6 comes with a million-token context window, currently available in beta through the API. A quick translation: that’s roughly 750,000 words, or about 3,000 pages of text, in a single request. It’s not available to everyday claude.ai users yet, but it’s coming, and when it does, the implications are significant.
If you’re an attorney, think about what that means for your caseload. Today, when you’re working a complex matter, you’re juggling dozens of documents: contracts, depositions, regulatory filings, correspondence, expert reports. You search through them one at a time, maybe run keyword searches that miss context, or rely on associates to manually pull together the threads.
With a million-token window, you could load the entire case file and ask Claude to find contradictions between a deposition and a contract clause. Drop in three expert reports and ask where the experts agree and disagree. Feed in a complete set of regulatory requirements alongside a client’s compliance documentation and get a gap analysis in minutes instead of days.
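Here's roughly what that first request looks like through the API, using Anthropic's Python SDK. Treat it as a sketch: the folder path is made up, and the model ID and beta flag are my placeholders based on how Anthropic has shipped earlier long-context betas, so confirm the exact identifiers in the current docs before running it.

```python
# Rough sketch of a long-context request with Anthropic's Python SDK.
# ASSUMPTIONS: "claude-sonnet-4-6" and the "context-1m-..." beta flag are
# placeholders -- check Anthropic's documentation for the real identifiers.
import pathlib
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Concatenate an entire folder of case documents into one prompt.
case_file = "\n\n---\n\n".join(
    p.read_text() for p in sorted(pathlib.Path("case_documents").glob("*.txt"))
)

response = client.beta.messages.create(
    model="claude-sonnet-4-6",          # placeholder model ID
    betas=["context-1m-2026-02-17"],    # placeholder flag for the 1M-token window
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": case_file
        + "\n\nList any contradictions between the deposition transcripts "
        "and the contract clauses above, citing the specific documents.",
    }],
)
print(response.content[0].text)
```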
And it’s not just that the window is bigger. Anthropic specifically improved long-context reasoning, meaning the model doesn’t just hold more information. It thinks across it better. For discovery work on cases with massive document sets, that’s a real shift. Instead of sampling and hoping you catch the important stuff, you can process entire volumes at once.
This matters for anyone who works with long documents, not just attorneys. Consultants synthesizing research. Analysts working across multiple data sources. Anyone who’s ever had to read 200 pages before they could start doing their actual job.
Cowork, Excel, and PowerPoint All Got Better Today
A few product updates landed alongside the model that are worth knowing about.
Cowork is Anthropic’s desktop productivity tool, and Sonnet 4.6 is now the engine running it. Cowork isn’t just a chatbot with a desktop wrapper. It runs in an isolated environment on your computer with access to your local files. It can read documents, write files, browse the web, and handle multi-step tasks without you watching every click. With the computer use score jumping to 72.5%, the tasks it can handle reliably just expanded significantly.
Claude in Excel now supports MCP connectors, which means if your firm already subscribes to services like FactSet, Moody’s, S&P Global, or PitchBook, you can pull that data directly into your spreadsheet through Claude. You’ll need to enable connectors in your Claude account settings first, and those connections then carry over into Excel automatically. It’s not a magic switch you flip inside the add-in, and you need existing subscriptions to those data providers. But for firms that already pay for those services, the gap between “I need this data” and “I have this data” just shrank to the length of a sentence.
Claude in PowerPoint launched alongside Opus 4.6 earlier this month and lets the model create and edit presentations directly. Combined with Sonnet 4.6’s improved instruction following, you can move from raw data to a polished deck without switching between four different tools.
What to Do This Week
Try it right now. Sonnet 4.6 is already the default in claude.ai and Cowork. If you have a task that normally takes an hour, throw it at Claude and see what happens.
If your firm uses FactSet, Moody’s, or S&P Global, try the Excel connectors. Enable connectors in your Claude settings, open the Excel add-in, and see if you can pull data directly into a model you’re building. It could save hours of copy-paste.
Rethink your model choice. If you’ve been defaulting to Opus for complex work, test whether Sonnet 4.6 gets you the same quality at 40% less per token. The savings add up fast on high-volume work.
Share the GDPval and Finance Agent numbers with your leadership team. This isn’t a tech upgrade story. It’s a productivity story with real cost implications.
The Bottom Line
I’m just starting to test Sonnet 4.6 now, but I’m incredibly impressed with what I’m seeing. The jump in knowledge work benchmarks isn’t small. It’s the kind of step change that makes you rethink which model you reach for first thing in the morning.
And I can imagine this being a daily driver for most knowledge work activities.
Why I write these articles:
In this article, we looked at what happens when a mid-tier AI model starts outperforming the flagship on the work your team actually does: producing documents, pulling financial data, navigating multi-step computer tasks. The pricing tiers that made sense three months ago may not make sense today, and most firms haven’t revisited the question.
If you want help sorting this out:
Reply to this or email me at steve@intelligencebyintent.com. Tell me which models your team is using, what they’re using them for, and where the bottlenecks are. I’ll tell you whether Sonnet 4.6 changes your cost-quality equation, which parts of the Claude/Gemini/GPT stack actually fit your workflows, and whether it makes sense for us to go further than that first conversation.
Not ready to talk yet?
Subscribe to my daily newsletter at smithstephen.com. I publish short, practical takes on AI for business leaders who need signal, not noise.


