What 20,000 Legal Documents Taught Us About AI Search
The search worked. The synthesis didn't. What we're doing differently.
Image created by Gemini Nano Banana Pro
When 20,000 Documents Meet AI: Building a Legal Knowledge System on Google Cloud (Part 1)
TL;DR: I helped a law firm that had just inherited a massive, seven-year-old case with 20,000+ documents and hundreds of motions. We built an AI-powered system on Google Cloud Platform to make sense of it all. The setup worked. Getting the deep analysis we actually needed? That's where things got interesting. This is Part 1 of what we learned.
The Call No One Wants
Here’s a scenario that might sound familiar. A law firm calls you in to help. They’ve just taken over a case from another firm. The case has been running non-stop for the past seven years. There are 20,000+ documents. Hundreds of motions. Several hundred potential witnesses, fewer than 100 of whom have actually been deposed. And the previous firm’s institutional knowledge? Gone.
They need to get up to speed. Fast. And they want to know: can AI help?
I said yes. What followed was one of the most challenging and ultimately rewarding AI implementation projects I’ve worked on. But I’m getting ahead of myself.
What We Were Actually Trying to Do
Let me be specific about the ask, because “help us understand this case” sounds simple until you break it down.
We needed a true chronology of everything that had happened over seven years. Not just dates and document names. We needed to understand what the plaintiff claimed at each stage and what evidence they offered. What the defendant claimed and what proof they had. There were several key third parties whose roles, testimony, and data needed tracking. We needed to see which witnesses had testified, what they said, and which potential witnesses hadn’t been called yet.
Think about that for a second. Seven years of legal proceedings. Hundreds of depositions. Court rulings. Exhibits. Emails. Text messages. Financial documents. Video recordings. And we needed to synthesize all of it into something a legal team could actually use.
You can’t just drop 20,000 documents into ChatGPT and ask for a summary. That’s not how any of this works.
Why Google Cloud Platform Made Sense
Before I get into the technical approach, let me address the obvious question: why GCP?
Three reasons. First, security. This case involves sensitive information, including medical data that falls under HIPAA. We needed enterprise-grade security with a proper Business Associate Agreement in place. GCP offers both as part of its enterprise offering.
Second, the document types we were dealing with. We had PDFs (both text-based and scanned images), Word documents, emails, text messages, video files, audio recordings, CSVs, and a handful of proprietary legal formats tied to Windows applications. (That last one was a headache, but we dealt with it.) Vertex AI Search can ingest most of these natively. For the ones it couldn’t, we needed conversion pipelines.
Third, integration. When you’re building something this complex, having your storage, AI capabilities, indexing, and search all in one environment matters. Could we have used a third-party vector database? Sure. But the time required to set up the security, build the conversion tools, and connect everything would have been significant. GCP gave us the infrastructure to move fast.
If your firm is already a Google shop, this decision gets even easier.
The Document Ingestion Problem
Here’s what we learned in the first two weeks: getting documents into a searchable format is its own project.
We started by cataloging what we had. PDFs made up the bulk, maybe 60%. But a meaningful percentage were scanned images rather than searchable text. Word documents were straightforward. Emails and messages came in various formats. Video and audio files needed transcription. And those proprietary legal formats? We had to convert them to something usable.
We used Gemini to build Python programs that converted each document type into formats Vertex could ingest. This sounds simple in a sentence, but it took days of iteration. Some documents needed OCR. Others needed language translation. Video files required transcription and timestamping. Audio files had their own workflow.
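To make the shape of that pipeline concrete, here's a minimal sketch of the dispatch-by-file-type structure we converged on. The converter functions are placeholders for the real work (OCR, transcription, legacy-format conversion), and the names and extensions are illustrative rather than our production code.

```python
from pathlib import Path

def ocr_scanned_pdf(path: Path) -> Path:
    # Real step: detect image-only pages, run OCR, write out searchable text.
    return path  # placeholder

def transcribe_media(path: Path) -> Path:
    # Real step: transcribe audio/video and attach timestamps.
    return path  # placeholder

def convert_legacy_format(path: Path) -> Path:
    # Real step: export proprietary legal formats to PDF or plain text.
    return path  # placeholder

# Map extensions to the conversion each type needs before indexing.
CONVERTERS = {
    ".pdf": ocr_scanned_pdf,        # only scanned PDFs actually need OCR
    ".mp4": transcribe_media,
    ".mp3": transcribe_media,
    ".wav": transcribe_media,
    ".lgl": convert_legacy_format,  # hypothetical proprietary extension
}

def prepare_for_indexing(source_dir: str) -> list[Path]:
    """Walk the collection, convert what needs converting, pass the rest through."""
    ready = []
    for path in Path(source_dir).rglob("*"):
        if path.is_file():
            converter = CONVERTERS.get(path.suffix.lower())
            ready.append(converter(path) if converter else path)
    return ready
```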
By the time we finished, we had a repeatable pipeline that could take any document in the collection, convert it if necessary, and prepare it for indexing. We loaded everything into a GCP storage bucket with subdirectories, then built a Vertex AI Search application with a datastore that indexed the full collection. We also built a separate datastore for each subdirectory, in case we wanted to go deep on just the depositions, for example.
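The storage layer itself is the easy part. A sketch of the upload step with the google-cloud-storage client is below; the bucket name and prefix mapping are invented for illustration, but the per-subdirectory layout is what later let us scope a Vertex AI Search datastore to a single document category.

```python
from pathlib import Path
from google.cloud import storage  # pip install google-cloud-storage

BUCKET_NAME = "example-case-documents"  # illustrative bucket name

# Illustrative mapping of local folder names to bucket prefixes (one per datastore).
PREFIXES = {"depositions": "depositions", "motions": "motions", "exhibits": "exhibits"}

def upload_collection(local_root: str) -> None:
    """Upload converted documents into per-category prefixes in one bucket."""
    client = storage.Client()
    bucket = client.bucket(BUCKET_NAME)
    root = Path(local_root)
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        category = path.relative_to(root).parts[0]
        prefix = PREFIXES.get(category, "misc")
        bucket.blob(f"{prefix}/{path.name}").upload_from_filename(str(path))
```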
At this point, we had something powerful: 20,000+ documents, searchable by AI, with proper security and HIPAA compliance in place.
Where It Worked (And Where It Didn’t)
Here’s where I want to be honest about what we discovered.
The basic search capability worked well. Need to find all references to a specific witness? Done. Looking for documents from a particular month mentioning a specific claim? Got it. Want to see every exhibit introduced in a certain motion? No problem.
But we didn’t just need search. We needed synthesis.
My first instinct was to ground Gemini to our datastore and ask it to build a comprehensive history of the case. Month by month. Plaintiff claims and evidence. Defendant claims and evidence. Third-party roles and testimony. The works.
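For readers who haven't wired this up, grounding a Gemini call in a Vertex AI Search datastore looks roughly like the sketch below, using the Vertex AI Python SDK. The project ID, datastore ID, and model name are placeholders, and the exact import path can vary across SDK versions, so treat this as the shape of the approach rather than a drop-in snippet.

```python
import vertexai
from vertexai.generative_models import GenerativeModel, Tool, grounding

vertexai.init(project="your-project-id", location="us-central1")  # placeholders

# Full resource name of the Vertex AI Search datastore that indexes the case files.
DATASTORE = (
    "projects/your-project-id/locations/global/"
    "collections/default_collection/dataStores/case-documents"  # placeholder ID
)

search_tool = Tool.from_retrieval(
    grounding.Retrieval(grounding.VertexAISearch(datastore=DATASTORE))
)

model = GenerativeModel("gemini-1.5-pro")  # model name illustrative
response = model.generate_content(
    "Build a month-by-month chronology of the plaintiff's claims and the evidence offered.",
    tools=[search_tool],
)
print(response.text)
```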
It worked. Sort of. For a simple query on a narrow time window, you get decent results. But the comprehensive analysis we actually needed? The 20,000-foot overview kept coming out exactly that: high-level to the point of being unhelpful. We just couldn't get the depth, no matter how we structured the prompts.
We experimented for days. Different queries. Different subsets of the data. Minimizing what we asked for to see if narrower requests got better results. We learned a tremendous amount about the limits of what even a state-of-the-art model can do when pointed at a massive dataset.
One other thing we wrestled with: could we use NotebookLM Enterprise for parts of this? (Not the Workspace version. Enterprise, given our HIPAA and BAA requirements.) Could Google Drive handle big chunks of the analysis? We got useful data from these tools, but not at the level we needed.
The Honest Assessment
Let me be clear about where we landed after this first phase.
What worked: We built a secure, HIPAA-compliant system that could ingest, convert, and index 20,000+ documents of varying types. We could search across the entire collection using natural language. We could ground Gemini’s responses in actual case documents rather than hallucinated content.
What didn’t work: The complex, multi-dimensional synthesis we really needed. A query like “Give me the complete history of Plaintiff’s claims about X, including all supporting evidence, contradicting testimony, and related rulings” would return something. But it was too thin. Too general. We needed a different approach.
What Gemini and GCP Can Actually Do
Here’s the thing. What Gemini and GCP can do together is honestly incredible. The ability to ingest documents at scale, maintain security and compliance, and enable AI-powered search across everything is real and it works.
But there’s a significant learning curve. Knowing that these capabilities exist is different from knowing how to use them together effectively. The default approaches that seem obvious don’t always deliver the results you need for complex use cases.
Which brings me to Part 2 of this series. We did figure it out. The approach we developed has yielded remarkable results on this case. But it required rethinking how we used the tools, not just which tools we used.
What to Do If You’re Facing Something Similar
If you’re looking at a complex document analysis project and considering GCP:
Start with a document audit. Know exactly what file types you're dealing with before you commit to a platform (a minimal audit script follows this list).
Budget time for conversion pipelines. If you have scanned PDFs, video, audio, or proprietary formats, plan for the work required to make them searchable.
Get your security and compliance in place early. If HIPAA or other regulations apply, don’t treat this as an afterthought.
Test your actual use case, not just search. Make sure the platform can deliver the synthesis and analysis you need, not just the ability to find individual documents.
Expect iteration. The obvious approach probably won’t be the final approach.
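On the first point, even a trivial audit script pays for itself before any platform decision. Here's a minimal sketch (extensions only; in practice you'd also flag scanned-versus-text PDFs and unusually large files):

```python
from collections import Counter
from pathlib import Path

def audit_collection(root: str) -> Counter:
    """Count files by extension so you know how much conversion work is ahead."""
    counts = Counter(
        p.suffix.lower() or "<no extension>"
        for p in Path(root).rglob("*")
        if p.is_file()
    )
    for ext, n in counts.most_common():
        print(f"{ext:>16}: {n:>6}")
    return counts
```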
In Part 2, I’ll walk through the specific methodology we developed to get the deep, multi-dimensional analysis this case required. The short version: it required breaking the problem down differently than I initially expected.
The tools are powerful. Using them well takes work.
This is Part 1 of a two-part series on building AI-powered legal research systems. Part 2 will cover the specific approach that finally delivered the comprehensive analysis we needed.
Why I write these articles:
I write these pieces because senior leaders don’t need another AI tool ranking. They need someone who can look at how work actually moves through their organization and say: here’s where AI belongs, here’s where your team and current tools should still lead, and here’s how to keep all of it safe and compliant.
In this article, we looked at the gap between what AI search can find and what complex litigation actually requires. The demo works. The 20,000-document case doesn’t, at least not with default approaches. The market is noisy, but the path forward is usually simpler than the hype suggests.
If you want help sorting this out:
Reply to this or email me at steve@intelligencebyintent.com. Tell me what document analysis challenges your team is facing and where synthesis is falling short. I’ll tell you what I’d test first, which parts of the GCP and Gemini stack fit your situation, and whether it makes sense for us to go further than that first conversation.
Not ready to talk yet?
Subscribe to my daily newsletter at smithstephen.com. I publish short, practical takes on AI for business leaders who need signal, not noise.