How retrieval-augmented generation decides which sources to use.
RAG (retrieval-augmented generation) is the reason AI search behaves differently from keyword search, and understanding its pipeline explains why specific content decisions increase citation probability. This is not just a conceptual model; it is the operational mechanism that Perplexity, ChatGPT with browsing, and Gemini with Search run whenever they retrieve web content to answer a query.
RAG combines a retrieval system with a generation model. The retrieval system finds relevant source content; the generation model writes the answer using that content as context. The retrieval stage does not depend on exact keyword matching. It uses vector embeddings: numerical representations of text chunks that capture their semantic meaning. When a query arrives, it is embedded as a vector, and the system finds chunks whose vectors are mathematically close to the query vector.
This is why exact keyword density is largely irrelevant for AI citation. A page that never uses the phrase "machine learning inference cost" but clearly explains the concept of paying per compute cycle can still be retrieved for that query. A page stuffed with the phrase but lacking a coherent explanation may not be.
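
To make "mathematically close" concrete, here is a minimal sketch of cosine similarity, one common closeness measure for embeddings. The three-dimensional vectors are invented for illustration; real embedding models emit hundreds or thousands of dimensions, and a given system may use a different distance measure.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values, not output from any real model).
query = [0.9, 0.1, 0.0]             # "machine learning inference cost"
explains_concept = [0.8, 0.2, 0.1]  # page that explains pay-per-compute pricing
keyword_stuffed = [0.2, 0.1, 0.9]   # page that repeats the phrase incoherently

score_concept = cosine_similarity(query, explains_concept)
score_stuffed = cosine_similarity(query, keyword_stuffed)
print(score_concept > score_stuffed)
```

In this toy setup the explanatory page scores far higher than the keyword-stuffed one, because its vector points in nearly the same direction as the query's.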
| Stage | What Happens | What You Can Affect |
|---|---|---|
| Crawl + Index | Crawler fetches content, extracts text, splits into ~512-token chunks, embeds each chunk as a vector stored in a vector database | Page crawlability, clean HTML extraction, chunk-level content quality |
| Query Processing | User query is embedded as a vector; similarity search returns the most semantically close chunks from the index | How semantically specific your content is to real queries people ask |
| Reranking | A second model scores retrieved chunks for relevance, quality, and authority before passing them to the LLM | Domain authority, entity signals, structural quality |
| Generation | The LLM uses the top-ranked chunks as context to write its answer; it cites sources where chunks directly contributed | Whether individual sentences are specific and quotable enough to be pulled directly |
Web crawlers fetch your page and pass it through a text extraction pipeline. The extracted text is split into chunks — typically around 512 tokens, roughly 350-400 words — and each chunk is converted into a vector by an embedding model (such as OpenAI's text-embedding-ada-002 or a comparable model). Those vectors are stored in a vector database alongside metadata about the source URL and chunk position.
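
The chunking step described above can be sketched as follows. This is an illustration under stated assumptions, not how any particular crawler works: the `Chunk` type and `split_into_chunks` function are hypothetical names, and 380 whitespace-separated words stands in for roughly 512 tokens. A real pipeline would count tokens with the embedding model's own tokenizer.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    url: str       # source page the chunk came from
    position: int  # index of the chunk within that page
    text: str

def split_into_chunks(url: str, text: str, max_words: int = 380) -> list[Chunk]:
    """Split extracted page text into fixed-size chunks with source metadata.

    Assumption: ~380 words approximates ~512 tokens.
    """
    words = text.split()
    chunks: list[Chunk] = []
    for start in range(0, len(words), max_words):
        piece = " ".join(words[start:start + max_words])
        chunks.append(Chunk(url=url, position=len(chunks), text=piece))
    return chunks

# A 1000-word page yields three chunks: 380, 380, and 240 words.
page_text = "word " * 1000
chunks = split_into_chunks("https://example.com/guide", page_text)
print([len(c.text.split()) for c in chunks])
```

Note that a fixed-size splitter cuts wherever the word count runs out, which is why section boundaries in your content do not necessarily line up with chunk boundaries.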
Two failure points exist here. First, JavaScript-rendered content may not be extracted if the crawler does not execute JS — meaning entire sections of pages built with client-side rendering may never make it into the index. Second, dense technical jargon or malformed HTML can reduce embedding quality because the embedding model struggles to assign clear semantic meaning to incoherent text.
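
One way to check for the first failure point is to extract text from the raw HTML without executing any JavaScript and see whether your key sentences survive. The sketch below uses Python's standard-library `html.parser`; the two HTML strings are made-up examples, and real extraction pipelines are more sophisticated than this.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def extract_text(raw_html: str) -> str:
    """Return the text a non-JS crawler could plausibly see in raw_html."""
    parser = TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return " ".join(parser.parts)

# Hypothetical pages: same sentence, delivered two different ways.
server_rendered = (
    "<html><body><h2>Timeouts</h2>"
    "<p>The default timeout is 30 seconds.</p></body></html>"
)
client_rendered = (
    "<html><body><div id='root'></div>"
    "<script>render('The default timeout is 30 seconds.')</script></body></html>"
)

print(repr(extract_text(server_rendered)))
print(repr(extract_text(client_rendered)))
```

If the sentence you want cited is missing from the extracted text of your own page's raw HTML, a crawler that does not run JavaScript is unlikely to index it either.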
When a user submits a query, the RAG system embeds that query and runs a similarity search against the vector database. The result is a ranked list of chunks — not pages — whose embeddings are closest to the query embedding. A single page may contribute zero, one, or many chunks to the retrieval results for a given query.
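
The chunk-level ranking described above can be sketched with a brute-force similarity search. The index entries, URLs, and two-dimensional vectors are invented for illustration; production systems use approximate nearest-neighbor indexes rather than scoring every chunk.

```python
import math
from collections import Counter

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

# Hypothetical index: (source URL, chunk position, toy embedding).
index = [
    ("https://example.com/a", 0, [0.9, 0.1]),
    ("https://example.com/a", 1, [0.8, 0.3]),
    ("https://example.com/a", 2, [0.1, 0.9]),
    ("https://example.com/b", 0, [0.7, 0.6]),
    ("https://example.com/c", 0, [0.0, 1.0]),
]

def top_k(query_vec: list[float], entries, k: int):
    """Score every chunk against the query and keep the k closest."""
    scored = [
        (cosine_similarity(query_vec, vec), url, pos) for url, pos, vec in entries
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]

results = top_k([1.0, 0.0], index, k=3)
chunks_per_page = Counter(url for _, url, _ in results)
print(chunks_per_page)
```

Here page `a` contributes two chunks, page `b` contributes one, and page `c` contributes none, even though all three pages are in the index.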
This is the stage most content operators misunderstand. The unit of retrieval is the chunk, not the article. An article structured as one long argument that requires full reading to make sense will have chunks that are individually incoherent. An article structured so that each section is self-contained will have chunks that independently answer specific questions — and each of those chunks competes independently in similarity search.
Retrieved chunks are passed to a reranker, typically a cross-encoder model that scores each chunk directly against the query. Reranking is also where authority signals appear to enter the picture. A chunk from a domain with strong inbound links, established crawl history, and recognized entity mentions tends to score higher than an equally specific chunk from an unknown domain. This is the mechanism by which traditional SEO signals still matter for AI search: not in retrieval, but in reranking.
Rerankers also score for quality signals — coherence, specificity, and whether the chunk contains a direct answer rather than a reference to an answer elsewhere. "See our full guide for details" is a low-quality chunk. "The default timeout is 30 seconds, configurable via the request_timeout parameter" is a high-quality chunk.
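
As a rough illustration of how relevance and authority might combine, the sketch below blends two scores with a fixed weight. Everything here is an assumption: commercial systems do not publish their reranking formulas, and the `Candidate` fields, the linear blend, and the 0.3 weight are hypothetical choices made only to show the ordering effect.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    url: str
    text: str
    relevance: float  # hypothetical cross-encoder score in [0, 1]
    authority: float  # hypothetical domain-level score in [0, 1]

def rerank(candidates: list[Candidate], authority_weight: float = 0.3) -> list[Candidate]:
    """Order candidates by a weighted blend of relevance and authority."""
    def final_score(c: Candidate) -> float:
        return (1 - authority_weight) * c.relevance + authority_weight * c.authority
    return sorted(candidates, key=final_score, reverse=True)

candidates = [
    Candidate("https://unknown.example/post",
              "The default timeout is 30 seconds.", relevance=0.90, authority=0.10),
    Candidate("https://established.example/docs",
              "The default timeout is 30 seconds.", relevance=0.90, authority=0.80),
    Candidate("https://established.example/hub",
              "See our full guide for details.", relevance=0.20, authority=0.80),
]

ranked = rerank(candidates)
print([c.url for c in ranked])
```

Under these assumed weights, authority breaks the tie between two equally specific chunks, but it does not rescue the low-relevance "see our full guide" chunk from last place.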
The LLM receives the top-ranked chunks as context and generates an answer. It does not summarize your entire article — it uses the specific chunks it received. When it cites a source, it is citing the source of a chunk it directly used. The implication: a specific, quotable sentence in a chunk increases citation probability. A vague paragraph that paraphrases an idea without a concrete claim does not.
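
A minimal sketch of how top-ranked chunks might be assembled into the context an LLM sees is shown below. The prompt wording and the `build_prompt` helper are hypothetical; each system uses its own prompt format. The point is that only the passed-in chunks reach the model, each tagged with its source.

```python
def build_prompt(query: str, chunks: list[tuple[str, str]]) -> str:
    """Assemble a generation prompt from (url, text) chunks with numbered sources."""
    lines = [
        "Answer the question using only the sources below.",
        "Cite sources by number, for example [1].",
        "",
    ]
    for number, (url, text) in enumerate(chunks, start=1):
        lines.append(f"[{number}] {url}")
        lines.append(text)
        lines.append("")
    lines.append(f"Question: {query}")
    return "\n".join(lines)

top_chunks = [
    ("https://example.com/timeouts",
     "The default timeout is 30 seconds, configurable via the request_timeout parameter."),
    ("https://example.com/retries",
     "Failed requests are retried up to three times."),
]

prompt = build_prompt("What is the default timeout?", top_chunks)
print(prompt)
```

Any part of the article that was not in a retrieved chunk simply never appears in this context, which is why the model cannot cite it.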
Content must be crawlable, extractable, and semantically coherent at the chunk level. Practical requirements:
Each ~512-token chunk should be independently valuable. The test: take any 400-word section of your article and ask whether it answers a specific question without requiring the reader to have read anything else. If it requires context from earlier sections, it will underperform in retrieval.
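
Part of this test can be roughly automated by flagging phrases that point at context outside the chunk. The sketch below is a crude heuristic, not a measure of retrieval quality: the phrase list is a hypothetical starting point, and a chunk can pass it while still depending on earlier sections.

```python
import re

# Hypothetical phrase list; extend it for your own content.
DEPENDENT_PHRASES = [
    r"\bas (mentioned|discussed|noted|described) (above|earlier|previously)\b",
    r"\bsee (above|below)\b",
    r"\bsee our full guide\b",
    r"\bin the previous section\b",
]
PATTERN = re.compile("|".join(DEPENDENT_PHRASES), re.IGNORECASE)

def dependency_flags(chunk_text: str) -> list[str]:
    """Return the phrases in a chunk that refer to content outside it."""
    return [match.group(0) for match in PATTERN.finditer(chunk_text)]

dependent = "As discussed above, the limit applies here too."
standalone = (
    "The default timeout is 30 seconds, configurable via the "
    "request_timeout parameter."
)

print(dependency_flags(dependent))
print(dependency_flags(standalone))
```

A non-empty result is a prompt to reread that section by itself and restate whatever the back-reference was leaning on.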
The Answer-First pattern directly addresses this: every H2 section answers its own implied question in the first 2-3 sentences, then elaborates. The opening sentences of each section are the highest-value part of that chunk because they state the topic explicitly, so the chunk's embedding reflects the question being answered even if a chunk boundary cuts off the elaboration.
Authority signals affect reranking scores. The signals with the most observable effect:
The LLM cites chunks it directly used. Specific, assertive sentences get pulled into answers; hedged or vague sentences do not. Compare:
auth() helper from next-auth/server for server-side session validation in Route Handlers."
The second sentence contains a version number, a named framework, a specific function name, and an exact use case. It is semantically dense, verifiable, and directly answers a query.
| System | Real-Time Retrieval | Explicit Citations | Observable Retrieval Behavior | Notes |
|---|---|---|---|---|
| Perplexity | Yes — retrieves at query time | Yes — numbered citations with source URLs | Retrieves 5-10 chunks per query; tends to prefer recency for news queries, authority for technical queries | Most transparent RAG implementation for testing purposes |
| ChatGPT (web browsing) | Yes — Bing-powered web retrieval | Yes — inline citations | Appears to retrieve fewer sources per query than Perplexity; strong preference for established domains | Browsing behavior varies by query type; not always triggered |
| Gemini (with Search) | Yes — Google Search integration | Yes — cited sources panel | Leverages Google's existing index; structured data and established rankings have more influence | Benefits from Google's full link graph in reranking |
| Claude (with search) | Yes — when search tool is enabled | Yes — cites retrieved sources | Conservative retrieval; tends to cite fewer sources but with higher specificity match | Tool-use architecture differs from native RAG |
Use this exact prompt structure in Perplexity (set to "All" mode, not focused):
What does [your domain name] say about [specific topic you have published content on]?
Example: What does failarchive.dev say about Vercel deployment cold start causes?
Interpret results as follows:
Run this test for 3-5 topics you have published on. The pattern across tests tells you whether you have a crawlability problem (nothing gets retrieved), an authority problem (you're mentioned but not cited), or a specificity problem (you're not mentioned at all on topics you should own).
The specific writing pattern that maximizes chunk retrieval across all four pipeline stages is Answer-First paragraphs:
An article with 8 H2 sections written this way has 8 independently retrievable chunks, each of which can place in similarity search for a different query. An article with 8 sections written as a flowing narrative has 8 chunks that are contextually dependent on each other and retrieve poorly in isolation.
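
To audit an article against this pattern, one option is to split it on H2 headings and inspect each section's opening sentence and length. The sketch below assumes the article is Markdown with `## ` headings; the sample article and the `split_h2_sections` helper are hypothetical, and the sentence splitter is deliberately naive.

```python
import re

def split_h2_sections(markdown: str) -> list[dict]:
    """Split Markdown on '## ' headings; report opening sentence and word count."""
    sections: list[dict] = []
    current = None
    for line in markdown.splitlines():
        if line.startswith("## "):
            current = {"heading": line[3:].strip(), "body": []}
            sections.append(current)
        elif current is not None:
            current["body"].append(line)

    report = []
    for section in sections:
        body = " ".join(l.strip() for l in section["body"] if l.strip())
        # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
        opening = re.split(r"(?<=[.!?])\s+", body, maxsplit=1)[0] if body else ""
        report.append(
            {"heading": section["heading"], "words": len(body.split()), "opening": opening}
        )
    return report

article = """\
## What is the default timeout?
The default timeout is 30 seconds. It is configurable via the request_timeout parameter.

## Why do cold starts happen?
As we saw earlier, it depends.
"""

for row in split_h2_sections(article):
    print(row["heading"], "|", row["words"], "words |", row["opening"])
```

Reading the `opening` column on its own is a quick check: the first section's opening answers its heading, while the second leans on earlier context and would retrieve poorly in isolation.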
This is not a stylistic preference. It is a structural adaptation to how the retrieval unit works.
These are common assumptions that do not hold for AI search citation:
Note: chunk size varies by system
Different RAG implementations use different chunk sizes and embedding models. The 512-token figure is representative, not universal — some systems use 256-token chunks, others use larger windows, and some use overlapping chunks to preserve context at boundaries. The underlying pattern holds regardless: self-contained, specific, Answer-First paragraphs retrieve better than narrative prose that requires context from surrounding sections. Design for chunk independence, not for a specific token count.
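
The overlapping-chunk variant mentioned above can be sketched as a sliding window. The `sliding_chunks` name and the size and overlap values are illustrative assumptions; actual systems pick their own window and overlap, usually measured in tokens rather than words.

```python
def sliding_chunks(words: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split words into windows of `size`, each sharing `overlap` words with the last."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + size])
        if start + size >= len(words):
            break  # the final window already reaches the end of the text
    return chunks

words = [str(i) for i in range(10)]
windows = sliding_chunks(words, size=4, overlap=1)
print(windows)
```

Each window repeats the last word of the previous one, so a sentence that straddles a boundary is more likely to appear whole in at least one chunk.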
To confirm your understanding of the RAG pipeline before moving to the citation signals lesson, verify you can answer the following without referring back:
Implementation Checkpoint