What 20,000 Legal Documents Taught Us About AI Synthesis
The search worked. Now here's how we made the synthesis work too.
The AI Pipeline That Actually Works for Massive Legal Document Analysis (Part 2)
TL;DR: After hitting the limits of what even state-of-the-art AI can do with 20,000+ documents, we built a hierarchical summarization pipeline that breaks the problem into manageable pieces. Three passes. Progressive compression. The result: comprehensive analysis that would have taken a legal team months, delivered in days. Here’s exactly how it works.
Where We Left Off
In Part 1, I described the challenge: a law firm inheriting a seven-year case with 20,000+ documents, hundreds of motions, and nearly a hundred depositions. We built a secure, HIPAA-compliant system on Google Cloud Platform that could ingest and search across everything.
But search wasn’t enough. We needed synthesis. Comprehensive analysis. A complete picture of what happened, when, who said what, and what evidence supported each claim.
And that’s where we hit the wall.
Why “Just Ask the AI” Doesn’t Work at Scale
Let me be specific about the problem, because understanding it is the key to understanding the solution.
No single AI call can process seven years of legal documents and produce a coherent 50-page analysis. It doesn’t matter which model you use. Context windows fill up. Outputs get truncated. The model loses track of details buried in earlier materials.
I’ve seen people suggest using NotebookLM for this kind of work. And look, NotebookLM Enterprise has a massive context window. It’s impressive technology. But “massive” is still finite. When you’re dealing with seven years of depositions, declarations, financial records, messages, and court filings, even a million-token context window isn’t enough to hold everything and reason about it coherently.
Here’s what actually happens when you try. You ask for a comprehensive analysis. The model retrieves some relevant documents. It produces something that looks reasonable. But when you check the output against what you know is in the source materials, you find gaps. Important incidents are missing. Timelines are compressed. Details that matter for the legal strategy get averaged away into generalities.
We needed a different approach.
The Solution: Hierarchical Summarization
Think of it like a pyramid. At the base, you have raw documents. Each layer above compresses the one below it, preserving what matters while reducing volume. By the time you reach the top, you have a coherent analysis built on every piece of evidence in the collection.
We built a three-pass pipeline.
Pass 1: Grounded Retrieval
For each month of the seven-year case, we ran six targeted queries against the document repository using Vertex AI Search. Each query focused on a specific evidence category: allegations from one party, documented activities from the other party, third-party witness observations, relevant records, financial documents, and court filings.
The key word here is “grounded.” The AI retrieves relevant document chunks and cites its sources. This isn’t the model guessing or generating plausible-sounding information. It’s pulling actual text from actual documents and telling us exactly where it came from.
Seven years times twelve months times six queries equals 504 raw evidence files. That’s the base of our pyramid.
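The fan-out above can be sketched in a few lines. This is an illustrative plan of the Pass 1 workload, not the production code: the category names, directory layout, and start year are hypothetical stand-ins, and the actual grounded query call is omitted.

```python
# Sketch of the Pass 1 fan-out: one grounded query per (month, category)
# pair. Category names, paths, and the start year are illustrative.
from pathlib import Path

CATEGORIES = [
    "allegations_party_a",
    "activities_party_b",
    "witness_observations",
    "relevant_records",
    "financial_documents",
    "court_filings",
]

def month_keys(start_year: int, years: int) -> list[str]:
    """Return a 'YYYY-MM' key for every month of the case."""
    return [f"{start_year + y}-{m:02d}"
            for y in range(years) for m in range(1, 13)]

def plan_queries(start_year: int = 2017, years: int = 7) -> list[Path]:
    """One output file per (month, category) pair: the base of the pyramid."""
    return [
        Path("evidence") / month / f"{cat}.md"
        for month in month_keys(start_year, years)
        for cat in CATEGORIES
    ]

tasks = plan_queries()
print(len(tasks))  # 7 years x 12 months x 6 categories = 504
```

Each path in the plan becomes one grounded retrieval call, and its result is written to disk before moving on.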
Pass 2: Monthly Synthesis
A second AI pass reads each month’s six raw query results and writes a narrative monthly report. This compresses scattered document fragments into coherent summaries while preserving key dates, names, and citations.
Now we have 84 monthly reports instead of 504 raw files. Still too much to process in a single call, but we’ve reduced the volume significantly while keeping the detail.
Pass 3: Quarterly to Yearly to Master
This is where the hierarchical approach really pays off.
Instead of asking the AI to read all 84 monthly reports at once (which would exceed practical limits), we compress in stages. Three monthly reports become one quarterly summary. That gives us 28 quarterly files. Four quarterly summaries become one yearly summary. That gives us seven yearly files. Seven yearly summaries combine into the final master documents.
Each step stays comfortably within the AI’s processing limits while preserving the details that matter.
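The compression schedule is simple to express in code. In this sketch, `summarize` is a placeholder for the actual Gemini synthesis call; here it just records how many inputs each call would read, which is enough to verify the layer counts.

```python
# Sketch of the Pass 3 schedule: monthly -> quarterly -> yearly -> master.
# summarize() is a stand-in for the real model call.
def chunked(items: list, size: int) -> list[list]:
    """Split items into consecutive groups of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def compress(reports: list, group_size: int, summarize) -> list:
    """Summarize fixed-size groups of reports into the next layer up."""
    return [summarize(group) for group in chunked(reports, group_size)]

monthly = [f"month-{i:03d}" for i in range(84)]      # 7 years of monthly reports
summarize = lambda group: f"summary({len(group)} inputs)"

quarterly = compress(monthly, 3, summarize)   # 84 -> 28
yearly    = compress(quarterly, 4, summarize) # 28 -> 7
master    = summarize(yearly)                 # 7  -> 1

print(len(quarterly), len(yearly))  # 28 7
```

No single `summarize` call ever reads more than a handful of documents, which is the whole point: every call stays well inside the model's limits.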
Why This Actually Works
Token economics matter. AI models have hard limits on how much they can process in a single call. By compressing at each layer, we keep every individual call small and fast while still synthesizing massive amounts of source material. No single call is doing something impossible. Every call is doing something well within its capabilities.
Quality preservation requires explicit instructions. Here’s something we learned the hard way: if something must appear in the final output, you have to name it explicitly in every prompt where summarization occurs.
Let me give you an example. In this case, there were specific incidents that the legal team knew were critical. The kind of thing that, if it got summarized away, would undermine the entire analysis. So in our synthesis prompts, we literally listed them: “Do NOT omit the [specific incident], the [other incident], the [third incident]...”
Without this, important details get averaged away during compression. The model isn’t trying to hide anything. It’s just making judgment calls about what to include when space is limited. If you don’t tell it what matters, you’re leaving that judgment to chance.
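One way to make that judgment deterministic is to pin the must-include items in the prompt and then check the output against the same list, so a dropped detail fails loudly instead of silently. The incident names below are placeholders, and the prompt wording is a simplified sketch of the approach.

```python
# Sketch of "name it explicitly": pin critical items in the prompt,
# then verify the draft mentions each one. Incident names are placeholders.
MUST_INCLUDE = ["Incident A", "Incident B", "Incident C"]

def build_synthesis_prompt(source_text: str) -> str:
    """Embed the do-not-omit list directly in the summarization prompt."""
    pinned = "\n".join(f"- {item}" for item in MUST_INCLUDE)
    return (
        "Summarize the evidence below into a narrative report.\n"
        f"Do NOT omit any of the following:\n{pinned}\n\n"
        f"SOURCE MATERIAL:\n{source_text}"
    )

def missing_details(summary: str) -> list[str]:
    """Return any pinned incident the summary failed to mention."""
    return [item for item in MUST_INCLUDE if item not in summary]

draft = "The report covers Incident A and Incident C in detail."
print(missing_details(draft))  # ['Incident B'] -> regenerate before accepting
```

The post-check is cheap insurance: if a regenerated summary still drops a pinned item, a human reviews that stage instead of discovering the gap in the master document.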
Resumability saves time and money. Every intermediate output gets saved to disk. If the pipeline fails halfway through, it picks up where it left off. If the legal team reviews the output and wants to regenerate just the master documents with revised emphasis, we don’t have to reprocess the raw queries. We start from the stage we need.
This might sound like a technical detail, but on a project this size, it’s the difference between iterating quickly and starting from scratch every time.
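The resumability pattern itself is a skip-if-exists checkpoint around every stage. A minimal sketch, where `generate` stands in for whatever AI call produces that stage's output:

```python
# Sketch of stage checkpointing: return the cached output if it exists,
# otherwise generate it and persist it for future runs.
from pathlib import Path

def run_stage(output_path: Path, generate) -> str:
    """Run one pipeline stage with disk-backed resumability."""
    if output_path.exists():
        return output_path.read_text()          # resume: no API call, no cost
    result = generate()
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(result)              # checkpoint for future runs
    return result
```

Because every monthly, quarterly, and yearly file goes through a wrapper like this, rerunning the pipeline after a crash or a prompt revision only pays for the stages that actually need to be redone.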
Dual output tracks from the same data. We produced both a chronological timeline and a thematic summary from the same underlying materials. Same raw queries. Same monthly reports. Different prompts at the yearly and master levels.
The timeline organized everything by date: what happened, when, what evidence supports it. The thematic summary organized by topic: patterns of behavior, documented incidents in specific categories, third-party observations. The legal team got two complementary views of the case without running the pipeline twice.
The Technical Stack
Let me be specific about what tools we used where.
Document storage and retrieval ran on Google Cloud Vertex AI Search. This gave us hybrid retrieval combining vector search (semantic similarity) and keyword matching. For a legal case, you need both. Sometimes you’re looking for documents about a topic. Sometimes you’re looking for documents that mention a specific name or date.
The grounded retrieval pass used Gemini with the Vertex AI Search tool. This is what gives us the citations. The model doesn’t just return answers. It returns the specific document chunks that support those answers.
The synthesis passes used Gemini 3 Pro. For this stage, we needed the highest quality reasoning and writing available. The synthesis prompts are complex. They require the model to read multiple documents, identify what matters, preserve specific details, and write coherent narrative. Gemini 3 Pro handled this well.
One technical note: we had to use streaming for the large output requests. Without streaming, API timeouts became a problem. The synthesis steps produce substantial amounts of text, and waiting for the complete response to arrive before receiving any data proved unreliable.
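The streaming fix boils down to consuming the response chunk by chunk and appending to disk as data arrives, rather than blocking on one huge payload. In this sketch the chunk iterator is simulated; in the real pipeline it would come from the SDK's streaming generation call.

```python
# Sketch of streaming consumption: write each chunk as it arrives so
# partial output survives even if the connection drops mid-response.
from pathlib import Path
from typing import Iterable

def stream_to_file(chunks: Iterable[str], output_path: Path) -> int:
    """Append chunks to disk as they arrive; return characters written."""
    written = 0
    with output_path.open("w", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(chunk)
            f.flush()            # durable progress after every chunk
            written += len(chunk)
    return written
```

Beyond avoiding timeouts, this makes long synthesis runs observable: you can tail the output file and watch a yearly summary being written.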
What We Delivered
After running the full pipeline, the legal team received a comprehensive analysis of the seven-year case. Every significant incident documented. Every piece of evidence catalogued with citations to source documents. Timelines showing patterns over time. Thematic summaries organizing evidence by category.
Work that would have taken a legal team months of reading and note-taking. Delivered in days.
And because every intermediate output was saved, the team could drill down into any period they wanted. Need more detail on what happened in Q3 of Year 4? We have the quarterly summary. Need even more detail? We have the three monthly reports that fed into it. Need the raw evidence? We have the six query results for each month.
The pyramid works in both directions. You can climb up to the big picture or climb down to the source.
Key Lessons for Your Projects
Don’t fight the limits. Design around them. I spent too long early in this project wishing for infinite context windows. The breakthrough came when we accepted real constraints and built something that works within them.
Intermediate outputs are features, not waste. The quarterly and yearly summaries aren’t just stepping stones to the master document. They’re useful deliverables on their own. The legal team uses them regularly when they need to focus on a specific period.
Explicit instructions beat implicit hopes. Every critical detail needs to be named explicitly in every prompt where summarization occurs. If you just hope the model will know what matters, you’ll be disappointed.
Grounding prevents hallucination. By forcing the retrieval pass to cite actual documents, we anchor the entire analysis in real evidence. Everything traces back to something in the record. That’s not just good practice. For legal work, it’s essential.
What This Means for Your Firm
If you’re facing a document-intensive case and wondering whether AI can help, the answer is yes. But not by pointing a model at 20,000 documents and asking for an answer.
The real opportunity is in building pipelines that match AI capabilities to the actual problem. Breaking complex analysis into stages that each stay within practical limits. Using grounding to ensure everything traces to source documents. Preserving intermediate outputs for flexibility and verification.
This isn’t the kind of AI implementation you read about in most marketing materials. It requires understanding what the tools can and can’t do, then designing systems that maximize the “can” while working around the “can’t.”
But when you get it right, the results are remarkable. Work that took months now takes days. Analysis that required teams of associates reviewing boxes of documents now happens automatically, with full citations to every source.
The tools are powerful. Using them well takes work. But for the right cases, the investment pays off many times over.
This is Part 2 of a two-part series on building AI-powered legal research systems. Part 1 covered the initial setup on Google Cloud Platform and the challenges we encountered. If you’re starting a similar project, I’d recommend reading both parts before diving in.
Why I write these articles:
I write these pieces because senior leaders don’t need another AI tool ranking. They need someone who can look at how work actually moves through their organization and say: here’s where AI belongs, here’s where your team and current tools should still lead, and here’s how to keep all of it safe and compliant.
In this article, we looked at why “just ask the AI” fails when you’re facing thousands of documents, and how a hierarchical pipeline solves the problem by designing around real constraints instead of fighting them. The market is noisy, but the path forward is usually simpler than the hype suggests.
If you want help sorting this out:
Reply to this or email me at steve@intelligencebyintent.com. Tell me what’s slowing your team down and where work is getting stuck. I’ll tell you what I’d test first, which part of the Google Cloud and Gemini stack fits your situation, and whether it makes sense for us to go further than that first conversation.
Not ready to talk yet?
Subscribe to my daily newsletter at smithstephen.com. I publish short, practical takes on AI for business leaders who need signal, not noise.


