Retrieval-augmented generation | Agentic Delivery Glossary

Retrieval-augmented generation, or RAG, is a pattern that fetches relevant external source material into a model's context at query time, so the answer is grounded in that material rather than only in what the model memorized during training.

How it works

RAG runs in three stages around the model. First it retrieves: the user's question is turned into a query, usually a vector embedding, and matched against an index of document chunks to pull back the passages most likely to be relevant. Then it augments: those passages are placed into the model's context alongside the question, so the model has the source material in front of it rather than only its trained weights. Finally it generates: the model writes an answer drawing on the retrieved passages, which makes the answer groundable in specific sources it can cite. The quality of the answer is bounded by the quality of the retrieval, so a system that pulls the wrong passage produces a confident answer built on the wrong source. The index is a living component rather than a fixed asset, because updating it adds or removes what the model can reach without retraining the model itself.
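The three stages can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not a reference implementation: `embed` is a toy bag-of-words stand-in for a real embedding model, `generate` is a placeholder where a model call would go, and the index contents and function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, index: dict[str, str], k: int = 2) -> list[tuple[str, str]]:
    """Stage 1: rank indexed chunks by similarity to the question, keep the top k."""
    query = embed(question)
    ranked = sorted(index.items(), key=lambda item: cosine(query, embed(item[1])), reverse=True)
    return ranked[:k]


def augment(question: str, passages: list[tuple[str, str]]) -> str:
    """Stage 2: place the retrieved passages in the prompt next to the question."""
    sources = "\n".join(f"[{chunk_id}] {text}" for chunk_id, text in passages)
    return (
        "Answer using only the sources below and cite their ids.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


def generate(prompt: str) -> str:
    """Stage 3: placeholder for the model call (assumption: swap in a real client)."""
    return f"(model output for a {len(prompt)}-character prompt)"


# Hypothetical index of document chunks, keyed by chunk id.
index = {
    "refunds-1": "Refunds are issued within 14 days of a return request.",
    "shipping-1": "Standard shipping takes 3 to 5 business days.",
    "warranty-1": "Hardware is covered by a 2 year warranty.",
}

question = "How many days do refunds take?"
passages = retrieve(question, index, k=1)
prompt = augment(question, passages)
answer = generate(prompt)
```

Because the chunk ids travel with the passages into the prompt, the model has something concrete to cite, and a wrong passage at the retrieval step would flow straight through the other two stages.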

Why it matters

A model's built-in knowledge is frozen at training time and cannot see private documents, internal systems, or anything published since, and the alternative of retraining to teach it new facts is slow and expensive. RAG sidesteps that by moving the knowledge out of the weights and into a store the system controls, so refreshing what the model can answer is a matter of updating an index rather than training a model. It also changes the trust story, because an answer built from retrieved passages can cite them, which makes the output auditable in a way a model speaking only from memory is not. The honest limit is that RAG moves the hard problem rather than removing it, since the answer is only as good as what retrieval surfaced, and a system that retrieves plausibly related but wrong material fails more convincingly than one that admits it does not know. Grounding is also not the same as correctness, because the model can still misread or overreach on a correct passage, so retrieval reduces hallucination without guaranteeing a right answer.

In practice

A support assistant is asked about a policy that changed last week, long after the model behind it was trained. Instead of answering from memory, the system embeds the question, retrieves the current policy document from its index, places that document in the model's context, and asks the model to answer from it. The reply is grounded in the retrieved policy and can point back to it, so a reviewer can check the source rather than trust the model's recall. When the policy changes again, the fix is to update the indexed document, not to retrain or re-prompt the model, because the knowledge lives in the store the retrieval reads from.

Practical considerations

Retrieval quality is the lever that matters most, so the work shifts from prompting to the unglamorous parts: how documents are chunked, how chunks are embedded, and whether dense vector search is paired with keyword search to catch exact terms a vector misses. Chunking is its own trap, since splitting a document at an arbitrary boundary can sever the context a passage needs to be understood, leaving a technically relevant chunk that retrieves poorly. More retrieved context is not better past a point, because stuffing the window with marginally relevant passages dilutes the signal and can push the model to attend to the wrong material. The index has to be maintained like any other data store, with a path to add, update, and remove documents, and on a private corpus retrieval has to enforce the asking user's permissions and tenant boundary, or RAG quietly turns into a path that serves stale or unauthorized material with confidence. RAG is the retrieval layer of context engineering and a common source of an agent's longer-term memory, so it rarely stands alone, and its failure modes are easier to catch with evaluation over real queries than by inspecting the pipeline. Retrieved content is also untrusted input, so a passage pulled from an open corpus is a prompt-injection surface and deserves the same caution as any other external text.
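Two of these considerations lend themselves to a short sketch: overlapping chunks, so a passage split at a boundary keeps some of its neighbouring context, and a permission filter applied before ranking, so a chunk the user may not see can never reach the prompt. The `Chunk` fields, group names, and the word-overlap scoring are all assumptions made for illustration; a production system would use its own access model and a real retriever.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    tenant: str
    allowed_groups: frozenset[str]
    text: str


def chunk_words(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into word windows where each window repeats the last `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the final window already reaches the end of the text
    return chunks


def retrieve(question: str, chunks: list[Chunk], tenant: str,
             user_groups: set[str], k: int = 3) -> list[Chunk]:
    """Filter by tenant and group membership first, then rank and cap at k."""
    visible = [
        c for c in chunks
        if c.tenant == tenant and c.allowed_groups & user_groups
    ]
    query = set(question.lower().split())
    ranked = sorted(
        visible,
        key=lambda c: len(query & set(c.text.lower().split())),
        reverse=True,
    )
    return ranked[:k]  # a small k keeps marginal passages out of the context


windows = chunk_words("w0 w1 w2 w3 w4 w5 w6 w7 w8 w9", size=4, overlap=1)

corpus = [
    Chunk("acme-salaries", "acme", frozenset({"hr"}), "salary bands for engineering staff"),
    Chunk("acme-handbook", "acme", frozenset({"staff", "hr"}), "engineering staff onboarding handbook"),
    Chunk("globex-salaries", "globex", frozenset({"staff"}), "salary bands for engineering staff"),
]

results = retrieve("salary bands for engineering staff", corpus,
                   tenant="acme", user_groups={"staff"}, k=2)
```

Filtering before ranking is the design choice worth noting: the best-matching chunk for this question belongs to another tenant or a restricted group, and it is excluded before scoring rather than trimmed from the results afterwards.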

Related standards and prior art

Lewis et al.: Retrieval-Augmented Generation (NeurIPS 2020) · 2020-05-22 · the paper that named and defined RAG: dense retrieval over an index combined with a generative model that reasons over the retrieved passages
LlamaIndex: introduction to RAG · continuously updated · framework documentation describing the RAG pipeline stages (loading, indexing, storing, querying, evaluation)
Towards AI: the complete guide to RAG (2026) · 2026-02-16 · a current independent overview of RAG as a grounding pattern for enterprise systems, covering freshness and citability without retraining

Defined by Ready Solutions AI
