State of RAG — Q2 2026
TL;DR: RAG is the default for production LLM systems on proprietary data, but the design space has split. Long-context for slow, infrequent, high-value queries; retrieval for cost, latency, freshness. Graph RAG crossed the production threshold this quarter. Evaluation consolidated around Ragas 1.0 + CI gates. Chunking is the unsolved problem.
By Taranpreet Singh · Published 4 June 2026 · 12 minute read
Headline
Retrieval-augmented generation entered Q2 2026 as the default architecture for nearly every production LLM application that touches proprietary data. It also entered Q2 2026 with the most credible challenge it has ever faced: long-context models that genuinely make “just stuff the documents into context” a viable alternative for many workloads.
The takeaway from a quarter of Notifire coverage: RAG is not going away, but the design space has split. Long-context for slow, infrequent, high-value queries. Retrieval for cost-sensitive, latency-sensitive, freshness-sensitive workloads. The teams shipping the best RAG systems in 2026 know which side of that split each of their queries sits on, and design accordingly.
What changed in Q2
Three substantive shifts. First, graph-enhanced RAG crossed the production threshold. Neo4j shipped a first-class LLM tool-calling integration; TigerGraph followed with hybrid vector + graph indexes; LangChain and LlamaIndex both added graph retrievers to their default builders. We covered four production case studies of graph RAG replacing pure vector retrieval, each citing 20–40% accuracy gains on multi-hop questions.
Second, the long-context regime matured. Claude Sonnet 4.5 at 1M context with caching, Gemini 2.5 Pro at 2M, and GPT-5 holding at 400k all became reliable enough for production use. The cost economics shifted with them — caching means a 500-page document loaded once is essentially free on subsequent queries within the cache TTL. For workloads where one document is reused many times, long-context now beats retrieval on total cost.
Third, evaluation consolidated. Ragas hit 1.0 in April. OpenAI Evals open-sourced its full harness. Anthropic shipped Inspect with first-class RAG metrics. The “every team rolls their own scorer” era is over; CI gates with standardised metrics are now the default for production RAG.
Graph-enhanced RAG, in production
The classic vector-RAG failure mode is the multi-hop question: “Which of our suppliers source materials from companies sanctioned in 2024?” Vector search finds documents that mention suppliers and documents that mention sanctions, but cannot reason across the relationships between them. Graph RAG fixes this by augmenting the vector store with explicit relationship metadata — entities, edges, and traversal queries — that the LLM can consult during retrieval.
The architecture is two-step. First, an ingestion pipeline extracts entities and relationships from source documents into a graph database (Neo4j, TigerGraph, KuzuDB, FalkorDB). Second, at query time, the LLM is given access to both vector search (for semantic similarity) and graph traversal (for structural reasoning) as separate tools, and decides which to invoke based on the question shape.
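A minimal sketch of the graph-traversal half of this design, assuming a toy in-memory graph in place of a real graph database. The entity names, edge labels, and helper functions are hypothetical; they only illustrate how a two-hop traversal answers the supplier and sanctions question that similarity search alone cannot.

```python
from collections import defaultdict

# Hypothetical (subject, relation, object) triples an extraction pipeline might emit.
TRIPLES = [
    ("AcmeParts", "supplies", "OurCompany"),
    ("BoltWorks", "supplies", "OurCompany"),
    ("AcmeParts", "sources_from", "NorthMetals"),
    ("BoltWorks", "sources_from", "CleanSteel"),
    ("NorthMetals", "sanctioned_in", "2024"),
]


def build_graph(triples):
    """Index triples as graph[subject][relation] -> set of objects."""
    graph = defaultdict(lambda: defaultdict(set))
    for subject, relation, obj in triples:
        graph[subject][relation].add(obj)
    return graph


def suppliers_with_sanctioned_sources(graph, company, year):
    """Two-hop traversal: supplier -> sources_from -> sanctioned_in."""
    flagged = []
    for supplier, relations in graph.items():
        if company not in relations.get("supplies", set()):
            continue
        for source in relations.get("sources_from", set()):
            # .get avoids inserting new keys while we iterate over the graph.
            if year in graph.get(source, {}).get("sanctioned_in", set()):
                flagged.append(supplier)
                break
    return sorted(flagged)


GRAPH = build_graph(TRIPLES)
print(suppliers_with_sanctioned_sources(GRAPH, "OurCompany", "2024"))
```

In a real deployment the traversal would be a query against the graph database, exposed to the LLM as a tool alongside vector search.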
The trade-off is operational complexity. A graph database adds a second system to operate, and the entity-extraction quality bounds the whole approach — bad NER produces a bad graph and a worse-than-vector-only RAG. Teams shipping graph RAG in 2026 typically use an LLM-driven extraction pipeline with a human review gate on a sample of every batch.
When long-context beats retrieval
The popular framing, “long context will replace RAG”, was overstated. The accurate framing: long context replaces RAG for a specific subset of workloads, namely those where the same document or document set is queried repeatedly within a short window, where freshness doesn’t matter, where latency tolerance is several seconds, and where the per-query cost is acceptable.
Concrete example: a legal team analysing a single contract for 30 questions over an hour. Loading the contract once into a cached context costs the input-token price once; each of the 30 questions then reuses the cache at the cache-read rate (typically around a tenth of the input price). For a one-off document, that usually works out cheaper than embedding the contract, standing up a vector index, querying it 30 times, and paying retrieval-then-generate per query, although the margin depends on the provider’s cache pricing and on how many tokens retrieval would send per question.
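The caching arithmetic can be sketched as follows. The document size, the price per million input tokens, and the 0.1 cache-read ratio are illustrative assumptions, not any provider's published rates; the sketch also ignores output tokens, cache-write surcharges, and cache expiry.

```python
def uncached_cost(doc_tokens, questions, input_price_per_mtok):
    """Every question pays full input price for the whole document."""
    return questions * doc_tokens * input_price_per_mtok / 1_000_000


def cached_cost(doc_tokens, questions, input_price_per_mtok, cache_read_ratio=0.1):
    """One full-price load, then one cache read of the document per question."""
    load = doc_tokens * input_price_per_mtok / 1_000_000
    reads = questions * doc_tokens * input_price_per_mtok * cache_read_ratio / 1_000_000
    return load + reads


# Hypothetical inputs: a 200,000-token contract, 30 questions,
# and a price of 3.00 per million input tokens.
DOC_TOKENS, QUESTIONS, PRICE = 200_000, 30, 3.00

print(f"uncached: {uncached_cost(DOC_TOKENS, QUESTIONS, PRICE):.2f}")
print(f"cached:   {cached_cost(DOC_TOKENS, QUESTIONS, PRICE):.2f}")
```

Under these assumptions the cached bill is 2.40 against 18.00 uncached. Comparing either figure with retrieval would also need the tokens retrieved per question and the embedding and index costs, which this sketch does not model.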
Counter-example: a customer-support assistant answering questions against a 50,000-document knowledge base. The base is too large to fit in context (even at 2M tokens), changes daily, and queries arrive from many users with different question patterns. RAG remains the only viable architecture.
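The criteria above can be turned into a simple routing rule. This is a sketch only: the `Workload` fields and every threshold (context limit, minimum reuse, minimum latency budget) are hypothetical values chosen for illustration and would need tuning per system.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    corpus_tokens: int        # total size of the material a query may need
    queries_per_window: int   # expected queries against the same material while cached
    changes_daily: bool       # True if freshness matters
    latency_budget_s: float   # acceptable end-to-end latency in seconds


def choose_regime(w, context_limit=1_000_000, min_reuse=5, min_latency_s=3.0):
    """Return "long-context" only when every criterion holds, else "retrieval"."""
    fits = w.corpus_tokens <= context_limit
    reused = w.queries_per_window >= min_reuse
    stable = not w.changes_daily
    patient = w.latency_budget_s >= min_latency_s
    return "long-context" if (fits and reused and stable and patient) else "retrieval"


# Hypothetical figures for the two examples in the text.
contract_review = Workload(150_000, 30, False, 10.0)
support_kb = Workload(40_000_000, 1, True, 1.0)

print(choose_regime(contract_review))
print(choose_regime(support_kb))
```

The rule defaults to retrieval whenever any single criterion fails, which matches the framing that long context is the exception for a specific subset of workloads.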
Evaluation: from artisanal to industrial
RAG eval at the start of 2026 was a mess of ad-hoc scoring scripts, hand-curated test sets, and unstandardised metrics. By end of Q2 it had consolidated around three pieces: Ragas 1.0 for retrieval and faithfulness metrics, OpenAI Evals or Anthropic Inspect for full-pipeline scoring, and a CI gate that fails the build if scores regress.
The standard three-layer eval is: (1) retrieval precision/recall: did we find the right passages? (2) faithfulness: does the answer match the retrieved text, or did the model hallucinate? (3) answer quality: is the final response useful? Each layer has both deterministic (overlap, exact-match) and LLM-as-judge metrics. The CI gate runs daily against a frozen test set and pages an on-call engineer if any layer regresses below threshold.
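A minimal sketch of the first layer and the gate, written in plain Python rather than against the Ragas, OpenAI Evals, or Inspect APIs. The test cases, chunk ids, metric names, and thresholds are all hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Set-based retrieval precision and recall for one test case."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def failing_layers(scores, thresholds):
    """Names of metrics whose score fell below its minimum."""
    return sorted(
        name for name, minimum in thresholds.items()
        if scores.get(name, 0.0) < minimum
    )


# Hypothetical frozen test set: retrieved chunk ids vs. labelled relevant ids.
CASES = [
    {"retrieved": ["c1", "c2", "c3"], "relevant": ["c1", "c4"]},
    {"retrieved": ["c7", "c8"], "relevant": ["c7"]},
]

pairs = [precision_recall(c["retrieved"], c["relevant"]) for c in CASES]
scores = {
    "retrieval_precision": sum(p for p, _ in pairs) / len(pairs),
    "retrieval_recall": sum(r for _, r in pairs) / len(pairs),
}
thresholds = {"retrieval_precision": 0.4, "retrieval_recall": 0.8}

failures = failing_layers(scores, thresholds)
exit_code = 1 if failures else 0  # a CI job would pass this to sys.exit
print(failures, exit_code)
```

Faithfulness and answer-quality scores would be added to the same `scores` dictionary, so one gate covers all three layers.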
What didn’t consolidate: test-set curation. Every team’s eval set looks different, and the consensus is that this is correct: your eval should reflect your users’ actual questions, not a generic benchmark. The standard advice is to mine real production queries (with user consent) into a test set, label answers in pairs, and treat the test set as a living document.
Which vector database, actually
The vector-database market settled in Q2. Four real choices: pgvector (Postgres extension, fine up to ~100M vectors, killer feature is joined queries with relational data), Pinecone (managed serverless, dominant at the high end), Qdrant (self-hosted Rust, popular with cost-conscious teams), and Weaviate (self-hosted Go, popular where on-prem deployment is required). Milvus is still around but losing share to Qdrant.
The choice rarely matters as much as people fear. Embedding quality, chunk size, and retrieval-quality engineering matter far more than which vector store sits underneath. Most teams who switched between vector databases reported single-digit-percent change in eval scores.
If you’re already on Postgres and your scale is moderate, pgvector with HNSW indexing is the right answer until you have a specific performance reason to change. Notifire’s comparison page on pgvector vs Pinecone breaks down the crossover point in detail.
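As a rough illustration of the pgvector setup, the sketch below holds the SQL as Python strings. The table names, the `tenant_id` column, the 1536-dimension embedding, and the `%s` placeholder style (as used by psycopg) are assumptions; the `vector` type, the `hnsw` index method with `vector_cosine_ops`, and the `<=>` cosine-distance operator come from pgvector itself.

```python
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS documents (
    id bigserial PRIMARY KEY,
    tenant_id bigint NOT NULL,
    title text NOT NULL
);
CREATE TABLE IF NOT EXISTS chunks (
    id bigserial PRIMARY KEY,
    document_id bigint NOT NULL REFERENCES documents (id),
    body text NOT NULL,
    embedding vector(1536) NOT NULL
);
CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops);
"""

# Nearest-neighbour search joined with relational data in one query.
SEARCH_SQL = """
SELECT c.id, d.title, c.body
FROM chunks AS c
JOIN documents AS d ON d.id = c.document_id
WHERE d.tenant_id = %s
ORDER BY c.embedding <=> %s::vector
LIMIT %s;
"""


def to_vector_literal(embedding):
    """Format floats in pgvector's text input form, e.g. [0.1,0.2]."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def search_params(tenant_id, query_embedding, k=5):
    """Parameters for SEARCH_SQL, in placeholder order."""
    return (tenant_id, to_vector_literal(query_embedding), k)


print(search_params(7, [0.1, 0.2, 1]))
```

The join is the point of the example: the tenant filter and the similarity ranking run in the same statement. With an approximate index, a selective filter can return fewer than `k` rows, so filtered recall is worth checking in the eval set.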
Chunking: the unsolved problem
The boring secret of RAG quality is chunking. The standard recipe — split text every 1000 characters with 200 characters of overlap — works tolerably on prose but breaks on structured documents (code, tables, contracts) where the meaningful unit is a section, not a character window.
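The standard recipe is short enough to write out in full. This sketch is one straightforward reading of "1000 characters with 200 overlap"; the function name and the choice to stop once the text end is reached are illustrative.

```python
def chunk_fixed(text, size=1000, overlap=200):
    """Split text into windows of `size` characters, each sharing `overlap`
    characters with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(2500))
pieces = chunk_fixed(sample)
print([len(p) for p in pieces])
```

Nothing in the function looks at headings, table rows, or clause boundaries, which is why a window can start or end in the middle of a structured unit.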
Three approaches gained traction in Q2. Semantic chunking (split where embedding similarity between adjacent sentences drops) is conceptually clean but slow and not always better. Hierarchical retrieval (retrieve coarse chunks, then re-retrieve within them) handles long documents well but doubles the retrieval cost. Late chunking (Jina’s approach: run the entire document through a long-context embedding model, then pool the token-level embeddings within each chunk’s boundaries) is the most promising recent direction, because it preserves document-level context inside each chunk’s vector.
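Two of these ideas reduce to a few lines once the embeddings exist. The sketch below uses toy vectors in place of real model output; the function names, the 0.5 similarity threshold, and the use of mean pooling are assumptions for illustration, not a reproduction of any vendor's implementation. For late chunking, the token vectors are assumed to come from a single pass of a long-context embedding model over the whole document.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_boundaries(sentence_vectors, threshold=0.5):
    """Sentence indices where a new chunk starts, because similarity to the
    previous sentence drops below the threshold."""
    return [
        i for i in range(1, len(sentence_vectors))
        if cosine(sentence_vectors[i - 1], sentence_vectors[i]) < threshold
    ]


def late_chunk(token_vectors, spans):
    """Mean-pool token-level vectors over each (start, end) span, end exclusive."""
    pooled = []
    for start, end in spans:
        window = token_vectors[start:end]
        if not window:
            raise ValueError(f"empty span: {(start, end)}")
        dims = len(window[0])
        pooled.append([sum(vec[d] for vec in window) / len(window) for d in range(dims)])
    return pooled


# Toy data standing in for real embeddings.
sentences = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
tokens = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 0.0]]

print(semantic_boundaries(sentences))
print(late_chunk(tokens, [(0, 2), (2, 4)]))
```

The difference between the two is where the chunk boundary enters: semantic chunking decides boundaries before embedding each chunk separately, while late chunking embeds first and applies boundaries to vectors that have already seen the whole document.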
We expect the next year to be a chunking-strategy bake-off. Notifire will cover whichever approach wins.
What engineering teams should do
Pick the regime per workload, not per company. Some queries are long-context wins; others are RAG wins. The teams shipping the best systems route per query.
Adopt graph RAG only when you have multi-hop questions. If your queries are flat lookups (“find me the document about X”), graph RAG adds operational burden without helping. If your queries chain entities (“which X are affected by Y because of Z”), graph RAG is worth the complexity.
Industrialise your eval today. Ragas 1.0 + CI gate is a half-day setup that pays back the first time a model upgrade breaks production silently.
Use pgvector if you’re already on Postgres. The crossover point to a dedicated vector DB is higher than most teams assume.
Invest in chunking before reranking. Most teams reach for a reranker (Cohere Rerank, BGE-Reranker) to fix retrieval quality. Reranking helps, but better chunking helps more.
Frequently asked questions
What changed in RAG between Q1 and Q2 2026?
Three things. Graph-enhanced RAG moved from research to production at several large enterprises after Neo4j and TigerGraph shipped first-class LLM integrations. Long-context models (Claude Sonnet 4.5 at 1M tokens, Gemini 2.5 Pro at 2M) made “just stuff the documents into context” viable for many workloads that previously required retrieval. And the eval ecosystem (Ragas 1.0, OpenAI Evals, Inspect) consolidated, ending the era of every team rolling their own scorer.
Is RAG still relevant when context windows are 1M+ tokens?
Yes, for cost, latency, and freshness reasons. Stuffing a 500-page document into context at every query is 10–100× more expensive than retrieving the right 3 pages. Latency on full-context queries is several seconds; retrieval-then-generate stays under one second. And long-context still doesn’t help with documents that change daily: those have to be re-indexed, which is exactly what RAG already does.
When should you use graph RAG instead of vector RAG?
When the answer requires multi-hop reasoning across explicit relationships. Pure vector search finds semantically similar text but cannot follow “customer → account → contract” chains. Graph RAG combines vector retrieval with graph traversal, dramatically improving accuracy on enterprise knowledge bases where entities reference each other.
What’s the production-tested RAG eval stack in 2026?
Ragas 1.0 for retrieval and faithfulness metrics, OpenAI Evals or Anthropic’s Inspect for full-pipeline scoring with LLM-as-judge, and a CI gate that fails the build if pass rate regresses against a frozen test set of 50–200 reference Q&As.
Which vector database wins in 2026?
Workload-dependent. pgvector inside Postgres for most teams under 100M vectors — same backup, same transactions, no second system to operate. Pinecone for serverless scale beyond that. Qdrant and Weaviate for self-hosted control. Most teams need exactly one; the choice rarely matters as much as the embedding model quality and chunk size.
What’s the biggest unsolved problem in production RAG?
Chunking. The standard “split on 1000 characters with 200 overlap” heuristic loses meaning at section boundaries; semantic chunkers help but are slow. Late chunking (Jina’s approach), hierarchical retrieval, and document-structure-aware splitting are the techniques production teams are converging on.