
Vectorless RAG Hits 98.7%. Here's What the Infographic Edited Out.

Tree-walking RAG really does beat chunked vector search on hierarchical documents — the 98.7% vs 50% gap on FinanceBench is real. But the headline hides the three costs that decide whether you should actually rip out your vector store: latency, per-query token burn, and the multi-document corpus problem that "vectorless" quietly punts on.

Initial Editor · 2026-04-23 · 5 min read · 1,058 words

The infographic making the rounds this week says chunked vector RAG gets ~50% on financial QA and "vectorless" RAG gets 98.7%. Both numbers are real. They come from PageIndex's reported results on a FinanceBench subset — the benchmark where most production RAG stacks go to embarrass themselves. What the infographic leaves out is what "vectorless" actually costs to run, and where the 98.7% quietly falls apart.

What "vectorless" actually means

It's vectorless only in the narrow sense that nothing gets embedded. It's tree-walking with an LLM as the router.

You still index the document. You just index it the way the document was written: a tree of sections, not a bag of chunks. Each node holds a summary of the subtree below it. A query doesn't hit an embedding similarity search; it hits the LLM, which reads the summaries at the current node and picks the child node to descend into. The walk ends at a leaf, which is a real section of the source document — not a 512-token chunk that happens to span a boundary.
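The walk can be sketched in a few lines. This is a minimal illustration under stated assumptions: `Node`, `walk`, and `keyword_router` are hypothetical names, not PageIndex's API, and the router is a keyword-overlap stub standing in for the LLM call that would read the child summaries.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Node:
    title: str
    summary: str                      # summary of the subtree below this node
    text: str = ""                    # full section text, populated on leaves
    children: List["Node"] = field(default_factory=list)


# A router takes the query and the candidate children and returns the index
# of the child to descend into. In a real system this is one LLM call per hop.
Router = Callable[[str, List[Node]], int]


def keyword_router(query: str, candidates: List[Node]) -> int:
    """Stand-in for the LLM: pick the child whose summary shares the most words."""
    words = set(query.lower().split())
    scores = [len(words & set(c.summary.lower().split())) for c in candidates]
    return scores.index(max(scores))


def walk(root: Node, query: str, route: Router) -> Tuple[Node, int]:
    """Descend from the root to a leaf; return the leaf and the routing hop count."""
    node, hops = root, 0
    while node.children:
        node = node.children[route(query, node.children)]
        hops += 1
    return node, hops


# Toy tree, invented for illustration only.
filing = Node("10-K", "annual report", children=[
    Node("Risk Factors", "risks competition regulation litigation",
         text="Full Risk Factors section."),
    Node("Financial Statements", "statements balance sheet cash flow income", children=[
        Node("Cash Flow", "operating investing financing cash flow",
             text="Full Cash Flow statement."),
        Node("Balance Sheet", "assets liabilities equity balance sheet",
             text="Full Balance Sheet."),
    ]),
])

leaf, hops = walk(filing, "operating cash flow for fiscal 2023", keyword_router)
print(leaf.title, hops)  # Cash Flow 2
```

The point of the shape: what comes back is a whole section (`leaf.text`), and the number of router calls equals the depth of the leaf, not the size of the document.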

The payoff is intact context. A 10-K has "Risk Factors" and "Management's Discussion" as sibling sections that chunking will happily shred across boundaries. Tree-walking grabs the whole section.

Where the 98.7% comes from

FinanceBench is question answering over SEC filings. The corpus is structured: 10-Ks and 10-Qs have consistent TOC trees, numbered sections, and tables that stay in one place. Every document in the corpus ships with the tree already drawn for you.

That's the corpus shape where tree-walking pays off. The model never has to guess at structure. It reads the TOC, picks the section, and the answer is in that section. Similarity search, meanwhile, has to reassemble "operating cash flow for fiscal 2023" out of chunks whose vectors all point roughly at the same neighborhood.

The benchmark is not a lie. It's a specific corpus where the technique fits the shape of the documents. Generalize past that shape and the numbers start moving.

The three costs the infographic skips

1. Latency is N sequential LLM calls, not one

Vector RAG is one embedding lookup plus one generation call. Two round trips.

Tree-walking with a router is one LLM call per level of the tree, in series, plus the final answer call. A 10-K with a four-level TOC is five LLM calls minimum, each blocking on the last. You can parallelize inside a level (evaluate all siblings at once), but you can't parallelize the levels themselves: the call at level k+1 depends on which node the call at level k picked.

For a chat UX where users expect sub-second responses, that's the difference between "feels live" and "the spinner is doing something."
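The call count is simple arithmetic, sketched below. The function names and the per-call latency are assumptions for illustration; plug in whatever your own model actually measures.

```python
def serial_llm_calls(tree_depth: int) -> int:
    """One routing call per tree level, plus the final answer call."""
    return tree_depth + 1


def estimated_latency_s(tree_depth: int, seconds_per_call: float) -> float:
    """Levels cannot overlap, so latency is the sum of the calls, not the max."""
    return serial_llm_calls(tree_depth) * seconds_per_call


# 0.5 s per call is an assumed placeholder, not a measurement.
print(serial_llm_calls(4))            # 5
print(estimated_latency_s(4, 0.5))    # 2.5
```

Parallelizing sibling evaluation shortens each call but leaves the multiplier alone. Depth is the number that sets the floor.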

2. Token cost scales with tree depth and width, not query length

Each routing hop sends the summaries of every candidate child node into the LLM. For a wide tree (a 10-K has 15–30 top-level sections), that's several thousand tokens per hop, times the number of hops. Call it 8–15K tokens of router input per query before you've generated a single answer token.

Vector RAG's cost is dominated by the retrieved chunks fed into the final generation — typically 2–8K tokens. Tree-walking can easily land at 3–5× the token cost per query on the exact benchmarks where it wins on accuracy.

At current Sonnet pricing, that's roughly the difference between $0.01 and $0.05 per query. At scale, that line item stops being a rounding error.
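Here is one way those ranges can fall out of the arithmetic. Every constant below is an assumption chosen to sit inside the ranges quoted above, including the flat fan-out per hop and the input price, and output tokens are ignored. Substitute your own tree shape and your provider's current rate.

```python
def router_input_tokens(hops: int, children_per_hop: int, tokens_per_summary: int) -> int:
    """Every hop sends every candidate child's summary to the router."""
    return hops * children_per_hop * tokens_per_summary


def input_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Input-side cost only; generation tokens are not counted here."""
    return tokens / 1_000_000 * usd_per_million


# All values are illustrative assumptions, not measurements.
HOPS = 4                       # four-level TOC
CHILDREN_PER_HOP = 20          # assumed uniform fan-out
TOKENS_PER_SUMMARY = 150       # assumed summary length
LEAF_SECTION_TOKENS = 4_000    # section passed to the final answer call
VECTOR_CONTEXT_TOKENS = 4_000  # retrieved chunks in the vector baseline
USD_PER_MILLION_INPUT = 3.0    # placeholder price; check your provider

router_tokens = router_input_tokens(HOPS, CHILDREN_PER_HOP, TOKENS_PER_SUMMARY)
tree_tokens = router_tokens + LEAF_SECTION_TOKENS
ratio = tree_tokens / VECTOR_CONTEXT_TOKENS

print(router_tokens)                                                   # 12000
print(ratio)                                                           # 4.0
print(round(input_cost_usd(VECTOR_CONTEXT_TOKENS, USD_PER_MILLION_INPUT), 3))  # 0.012
print(round(input_cost_usd(tree_tokens, USD_PER_MILLION_INPUT), 3))            # 0.048
```

The router term is a product of depth and width. Shorter summaries or narrower levels cut it; a longer query barely moves it.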

3. Multi-document corpora break the abstraction

A single 10-K has one tree. A corpus of 10,000 10-Ks has 10,000 trees. The router now has to either pick a document first (which is… retrieval, the thing you said you didn't need) or walk a super-tree whose top levels are effectively a directory listing.

Once you're back to "which document should I look in," you're back to vector search, keyword search, or some hybrid — the thing the infographic said you could skip. PageIndex-style results generalize cleanly to single-document QA and to small, well-curated corpora. Open-domain search across millions of documents is not the problem this technique is solving.
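In practice that usually means a two-stage design: a cheap retrieval step shortlists documents, then the tree walk runs inside the shortlist. A rough sketch of stage one follows. `prefilter` and the toy corpus are hypothetical, and word overlap is only a stand-in for whatever retrieval you already run (embeddings, keyword search, or a hybrid).

```python
from typing import Dict, List


def prefilter(query: str, doc_summaries: Dict[str, str], k: int = 2) -> List[str]:
    """Stage 1: rank documents by word overlap with the query; return the top-k ids."""
    words = set(query.lower().split())
    ranked = sorted(
        doc_summaries,
        key=lambda doc_id: len(words & set(doc_summaries[doc_id].lower().split())),
        reverse=True,
    )
    return ranked[:k]


# Toy corpus, invented for illustration only.
corpus = {
    "acme-10k-2023": "acme corp annual report fiscal 2023",
    "acme-10k-2022": "acme corp annual report fiscal 2022",
    "globex-10k-2023": "globex inc annual report fiscal 2023",
}

shortlist = prefilter("acme operating cash flow fiscal 2023", corpus, k=1)
print(shortlist)  # ['acme-10k-2023']

# Stage 2 (not shown): run the tree walk only inside each shortlisted document.
```

Stage one is retrieval by any other name. The tree walk earns its accuracy only after something else has picked the right tree.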

When vectorless wins

| Corpus shape | Winner | Why |
|---|---|---|
| One long document, hierarchical TOC (10-K, legal brief, textbook) | Vectorless | Tree maps to the document; latency tolerable for analyst workflows. |
| Small curated set of structured docs (<100 filings or contracts) | Vectorless | Directory-level routing stays cheap; accuracy matters more than latency. |
| Mixed corpus: some structured, some not | Hybrid | Route to vectorless for structured docs, vector for the rest. |
| Customer support KB, wiki, product docs | Vector | Shallow hierarchy; chunking and reranking already work. |
| Multi-million-doc open-domain search | Vector | Tree-walking doesn't scale; pre-filter with embeddings. |
| Chat-latency UX (<1s response) | Vector | Tree-walking's serial hops blow the latency budget. |

When not to switch

  • Your current RAG pipeline hits >85% on your eval set. You're past the point where tree-walking's accuracy gain justifies the rewrite, the latency hit, and the token bill.
  • You don't have an eval set. Swap the retrieval layer and you lose your only way to measure whether anything got better. Build the eval first.
  • Your documents are mostly flat. Blog posts, support articles, and wiki pages don't have the TOC shape tree-walking exploits. You'll pay the cost and get vector-RAG accuracy.
  • You're latency-constrained. If the user is waiting on a chat cursor, the extra 2–4 seconds from serial LLM hops will lose you more than the accuracy gain earns.
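"Build the eval first" can start very small. The sketch below assumes an eval set of (question, expected answer) pairs and scores by case-insensitive substring match, which is a crude stand-in for real grading. `accuracy`, `current_pipeline`, and the toy data are hypothetical.

```python
from typing import Callable, List, Tuple

EvalSet = List[Tuple[str, str]]  # (question, expected answer)


def accuracy(answer_fn: Callable[[str], str], eval_set: EvalSet) -> float:
    """Fraction of questions whose answer contains the expected string."""
    correct = sum(
        1 for question, expected in eval_set
        if expected.lower() in answer_fn(question).lower()
    )
    return correct / len(eval_set)


# Toy data, invented for illustration only.
eval_set: EvalSet = [("q1", "42"), ("q2", "blue"), ("q3", "2023"), ("q4", "yes")]
canned = {"q1": "The answer is 42.", "q2": "It is red.", "q3": "Fiscal 2023.", "q4": "Yes."}


def current_pipeline(question: str) -> str:
    """Stand-in for your existing RAG pipeline."""
    return canned[question]


baseline = accuracy(current_pipeline, eval_set)
print(baseline)  # 0.75
```

Run the same function over the current pipeline and the tree-walking candidate. If the baseline already clears your bar, the first bullet applies and the comparison is over.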

The honest caveat

PageIndex and the vectorless framing are a real advance on a real problem — structured-document QA where chunking was always the wrong abstraction. On SEC filings, contracts, textbooks, and long reports with a TOC, treat this technique as the new baseline.

But "vectorless" is a positioning word. The technique underneath is LLM-as-router over a pre-built tree index, and it inherits every constraint that comes with putting an LLM in the hot path of retrieval: serial latency, token-burn per hop, and the need for structure to navigate in the first place. The 98.7% number is what this looks like on the corpus shape where those constraints are cheapest to pay.

Chunking was never the enemy. The enemy was treating every document as a bag of fragments when some of them came with the outline already drawn.
