Vectorless RAG Hits 98.7%. Here's What the Infographic Edited Out.

The infographic making the rounds this week says chunked vector RAG gets ~50% on financial QA and "vectorless" RAG gets 98.7%. Both numbers are real. They come from PageIndex's reported results on a FinanceBench subset — the benchmark where most production RAG stacks go to embarrass themselves. What the infographic leaves out is what "vectorless" actually costs to run, and where the 98.7% quietly falls apart.

What "vectorless" actually means

It is vectorless in the narrow sense that nothing gets embedded. It's not index-free. It's tree-walking with an LLM as the router.

You still index the document. You just index it the way the document was written: a tree of sections, not a bag of chunks. Each node holds a summary of the subtree below it. A query doesn't hit an embedding similarity search; it hits the LLM, which reads the summaries at the current node and picks the child node to descend into. The walk ends at a leaf, which is a real section of the source document — not a 512-token chunk that happens to span a boundary.

The payoff is intact context. A 10-K has "Risk Factors" and "Management's Discussion" as sibling sections that chunking will happily shred across boundaries. Tree-walking grabs the whole section.
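The walk described above can be sketched in a few lines. This is a minimal illustration, not PageIndex's implementation: `Node`, `walk`, and `keyword_router` are hypothetical names, the tiny filing tree is made up, and the keyword-overlap router is only a stand-in for the LLM call that would read the summaries in a real system.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    title: str
    summary: str                      # what the router reads at each hop
    text: str = ""                    # full section text; filled in at leaves
    children: List["Node"] = field(default_factory=list)


# A router sees the query and the candidate children, and returns an index.
Router = Callable[[str, List[Node]], int]


def walk(root: Node, query: str, choose_child: Router) -> Node:
    """Descend one level per routing decision until a leaf is reached."""
    node = root
    while node.children:
        node = node.children[choose_child(query, node.children)]
    return node


def keyword_router(query: str, children: List[Node]) -> int:
    """Stand-in for the LLM call: pick the summary sharing the most words."""
    q = set(query.lower().split())
    overlap = [len(q & set(c.summary.lower().split())) for c in children]
    return overlap.index(max(overlap))


# Toy tree, invented for illustration only.
filing = Node("10-K", "annual report", children=[
    Node("Risk Factors", "risks from competition and regulation"),
    Node("Financial Statements", "balance sheet income statement and cash flow", children=[
        Node("Balance Sheet", "assets liabilities and equity"),
        Node("Cash Flow Statement", "operating investing and financing cash flow"),
    ]),
])

leaf = walk(filing, "operating cash flow for fiscal 2023", keyword_router)
print(leaf.title)
```

Swapping `keyword_router` for a function that prompts a model with the child summaries keeps `walk` unchanged, which is the point: the loop is trivial, and all the cost lives in the router.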

Where the 98.7% comes from

FinanceBench is question answering over SEC filings. The filings are structured: 10-Ks and 10-Qs have consistent TOC trees, numbered sections, and tables that stay in one place. Every document in the corpus ships with the tree already drawn for you.

That's the corpus shape where tree-walking pays off. The model never has to guess at structure. It reads the TOC, picks the section, and the answer is in that section. Similarity search, meanwhile, has to reassemble "operating cash flow for fiscal 2023" out of chunks whose vectors all point roughly at the same neighborhood.

The benchmark is not a lie. It's a specific corpus where the technique fits the shape of the documents. Generalize past that shape and the numbers start moving.

The three costs the infographic skips

1. Latency is N sequential LLM calls, not one

Vector RAG is one embedding lookup (embed the query, run a nearest-neighbor search) plus one generation call. Two round trips, and only one of them is an LLM generation.

Tree-walking with a router is one generation call per level of the tree, in series, plus the final answer call. A 10-K with a four-level TOC is five LLM calls minimum, each blocking on the last. You can parallelize inside a level (evaluate all siblings at once) but you can't parallelize the levels themselves — level N+1 depends on which node level N picked.

For a chat UX where users expect sub-second responses, that's the difference between "feels live" and "the spinner is doing something."
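The call-count arithmetic above is easy to put into a back-of-the-envelope model. The function names are hypothetical and the per-call timings in the example are placeholder assumptions, not measurements; plug in numbers from your own stack.

```python
def tree_walk_calls(depth: int) -> int:
    """One routing call per tree level, plus the final answer call."""
    return depth + 1


def tree_walk_latency(depth: int, route_s: float, generate_s: float) -> float:
    """Routing hops run strictly in series, then the answer is generated."""
    return depth * route_s + generate_s


def vector_rag_latency(lookup_s: float, generate_s: float) -> float:
    """One embedding lookup (embed + nearest-neighbor search), one generation."""
    return lookup_s + generate_s


# Assumed timings for illustration only: 0.7 s per routing call,
# 1.0 s for the answer call, 0.1 s for the embedding lookup.
calls = tree_walk_calls(depth=4)
tree_s = tree_walk_latency(depth=4, route_s=0.7, generate_s=1.0)
vector_s = vector_rag_latency(lookup_s=0.1, generate_s=1.0)
print(calls, round(tree_s, 2), round(vector_s, 2))
```

Parallelizing sibling evaluation inside a level shrinks `route_s`, but `depth` stays as a multiplier no matter what.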

2. Token cost scales with tree depth and width, not query length

Each routing hop sends the summaries of every candidate child node into the LLM. For a wide tree (a 10-K has 15–30 top-level sections), that's several thousand tokens per hop, times the number of hops. Call it 8–15K tokens of router input per query before you've generated a single answer token.

Vector RAG's cost is dominated by the retrieved chunks fed into the final generation — typically 2–8K tokens. Tree-walking can easily land at 3–5× the token cost per query on the exact benchmarks where it wins on accuracy.

At current Sonnet pricing, that's roughly the difference between $0.01 and $0.05 per query. At scale, that line item stops being a rounding error.
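To sanity-check numbers like these against your own documents, a rough estimator is enough. Everything here is an assumption for illustration: the function names are hypothetical, the tree is treated as uniformly wide at every level, the 150-token summary size is a guess, and the price per million input tokens is a placeholder you should replace with your provider's current rate.

```python
def router_input_tokens(hops: int, children_per_hop: int, tokens_per_summary: int) -> int:
    """Every hop sends the summary of every candidate child to the router."""
    return hops * children_per_hop * tokens_per_summary


def input_cost_usd(input_tokens: int, usd_per_million_input: float) -> float:
    """Input-side cost only; output tokens are ignored in this sketch."""
    return input_tokens / 1_000_000 * usd_per_million_input


# Assumed shape: 4 hops, 20 children per hop, 150-token summaries.
router_tokens = router_input_tokens(hops=4, children_per_hop=20, tokens_per_summary=150)

# Assumed placeholder price, not a quoted rate.
PRICE_PER_MILLION_INPUT = 3.0
router_cost = input_cost_usd(router_tokens, PRICE_PER_MILLION_INPUT)
print(router_tokens, router_cost)
```

Real trees narrow as you descend, so a per-level list of widths would be more accurate. The uniform version is still enough to show that the router input, not the question, is what you are paying for.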

3. Multi-document corpora break the abstraction

A single 10-K has one tree. A corpus of 10,000 10-Ks has 10,000 trees. The router now has to either pick a document first (which is… retrieval, the thing you said you didn't need) or walk a super-tree whose top levels are effectively a directory listing.

Once you're back to "which document should I look in," you're back to vector search, keyword search, or some hybrid. PageIndex-style results generalize cleanly to single-document QA and to small, well-curated corpora. Open-domain search across millions of documents is not the problem this technique is solving.
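The "pick a document first" step looks like this in miniature. It is a sketch under stated assumptions: `pick_documents` is a hypothetical helper, the document IDs and one-line descriptions are invented, and plain keyword overlap stands in for whatever BM25 or embedding search you would actually run before handing the winner to a tree walk.

```python
from typing import Dict, List


def pick_documents(query: str, descriptions: Dict[str, str], k: int = 1) -> List[str]:
    """Stage 1: ordinary retrieval over one-line document descriptions.

    Keyword overlap is a stand-in for BM25 or embedding similarity.
    """
    q = set(query.lower().split())

    def score(doc_id: str) -> int:
        return len(q & set(descriptions[doc_id].lower().split()))

    return sorted(descriptions, key=score, reverse=True)[:k]


# Invented corpus directory, for illustration only.
corpus = {
    "acme_10k_2023": "acme corp annual report fiscal 2023",
    "globex_10k_2023": "globex inc annual report fiscal 2023",
    "acme_10q_q1_2024": "acme corp quarterly report q1 2024",
}

candidates = pick_documents("acme operating cash flow fiscal 2023", corpus, k=1)
print(candidates)
# Stage 2 (not shown): walk the section tree of each candidate document.
```

With three documents this is trivial. With 10,000 it is a retrieval system in its own right, which is the abstraction leak this section is about.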

When vectorless wins

Corpus shape	Winner	Why
One long document, hierarchical TOC (10-K, legal brief, textbook)	Vectorless	Tree maps to the document; latency tolerable for analyst workflows.
Small curated set of structured docs (<100 filings or contracts)	Vectorless	Directory-level routing stays cheap; accuracy matters more than latency.
Mixed corpus: some structured, some not	Hybrid	Route to vectorless for structured docs, vector for the rest.
Customer support KB, wiki, product docs	Vector	Shallow hierarchy; chunking and reranking already work.
Multi-million-doc open-domain search	Vector	Tree-walking doesn't scale; pre-filter with embeddings.
Chat-latency UX (<1s response)	Vector	Tree-walking's serial hops blow the latency budget.
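The table reads naturally as a dispatch rule. Here is one hedged way to encode it: `pick_retriever` and its parameters are hypothetical, the one-second and 100-document thresholds come from the rows above, the one-million cutoff is my reading of "multi-million," and the final fallback covers a case the table does not address.

```python
def pick_retriever(structured_fraction: float, doc_count: int, latency_budget_s: float) -> str:
    """Map corpus shape to a retrieval strategy, following the table's rows.

    structured_fraction: share of documents with a usable hierarchical TOC (0.0 to 1.0).
    """
    if latency_budget_s < 1.0:
        return "vector"        # serial hops blow a chat-latency budget
    if doc_count >= 1_000_000:
        return "vector"        # assumed cutoff for "multi-million-doc"
    if structured_fraction == 0.0:
        return "vector"        # flat KB, wiki, product docs
    if structured_fraction < 1.0:
        return "hybrid"        # mixed corpus: route per document type
    if doc_count < 100:
        return "vectorless"    # one long doc or a small curated set
    # Not covered by the table. Assumption: pre-filter documents, then walk trees.
    return "hybrid"


print(pick_retriever(structured_fraction=1.0, doc_count=1, latency_budget_s=30.0))
print(pick_retriever(structured_fraction=0.5, doc_count=5_000, latency_budget_s=10.0))
print(pick_retriever(structured_fraction=1.0, doc_count=50, latency_budget_s=0.5))
```

The order of the checks matters: latency and scale veto tree-walking before document structure gets a say.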

When not to switch

Your current RAG pipeline hits >85% on your eval set. You're past the point where tree-walking's accuracy gain justifies the rewrite, the latency hit, and the token bill.
You don't have an eval set. Swap the retrieval layer and you lose your only way to measure whether anything got better. Build the eval first.
Your documents are mostly flat. Blog posts, support articles, and wiki pages don't have the TOC shape tree-walking exploits. You'll pay the cost and get vector-RAG accuracy.
You're latency-constrained. If the user is waiting on a chat cursor, the extra 2–4 seconds from serial LLM hops will lose you more than the accuracy gain earns.
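"Build the eval first" can start very small. The sketch below is an assumed minimal harness, not a recommendation of a metric: `accuracy` and `exact_match` are hypothetical names, the questions and answers are toy placeholders, and the two "pipelines" are dictionary lookups standing in for your current retriever and a candidate replacement.

```python
from typing import Callable, List, Tuple


def exact_match(predicted: str, gold: str) -> bool:
    """Crude correctness check; real evals usually need something more forgiving."""
    return predicted.strip().lower() == gold.strip().lower()


def accuracy(pipeline: Callable[[str], str], eval_set: List[Tuple[str, str]]) -> float:
    """Fraction of eval questions the pipeline answers correctly."""
    if not eval_set:
        raise ValueError("eval_set is empty; there is nothing to measure")
    hits = sum(1 for question, gold in eval_set if exact_match(pipeline(question), gold))
    return hits / len(eval_set)


# Toy eval set and toy pipelines, invented for illustration only.
eval_set = [("q1", "a1"), ("q2", "a2"), ("q3", "a3"), ("q4", "a4")]
current = {"q1": "a1", "q2": "a2", "q3": "a3", "q4": "wrong"}.get
candidate = {"q1": "a1", "q2": "a2", "q3": "a3", "q4": "a4"}.get

current_acc = accuracy(current, eval_set)
candidate_acc = accuracy(candidate, eval_set)
print(current_acc, candidate_acc)
```

Run the same `eval_set` through both pipelines before touching production. If the current number already clears your bar, the first rule in this list applies.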

The honest caveat

PageIndex and the vectorless framing are a real advance on a real problem — structured-document QA where chunking was always the wrong abstraction. On SEC filings, contracts, textbooks, and long reports with a TOC, treat this technique as the new baseline.

But "vectorless" is a positioning word. The technique underneath is LLM-as-router over a pre-built tree index, and it inherits every constraint that comes with putting an LLM in the hot path of retrieval: serial latency, token-burn per hop, and the need for structure to navigate in the first place. The 98.7% number is what this looks like on the corpus shape where those constraints are cheapest to pay.

Chunking was never the enemy. The enemy was treating every document as a bag of fragments when some of them came with the outline already drawn.
