Running SmolLM2 360M on a Samsung Galaxy Watch 4 Classic (380MB free RAM). Found that llama.cpp was keeping two copies of the model in memory at once (the APK mmap page cache plus its own tensor allocations), peaking at 524MB for a 270MB model.
Fix: added a host_ptr field to llama_model_params. CPU tensors now point directly at the mmap region, so they are never duplicated. Only Vulkan tensors get copied.
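To illustrate the idea outside llama.cpp, here is a minimal Python sketch of "alias the mapping instead of copying it". It is not the code from the fork: the tensor table, the `gpu_` prefix standing in for Vulkan tensors, and the 8KB stand-in file are all hypothetical. It only shows that a slice of a `memoryview` over an mmap shares the mapped pages, while `bytes(...)` creates a second copy.

```python
import mmap
import os
import tempfile

# Hypothetical layout: tensor name -> (byte offset, byte length) in the file.
TENSOR_TABLE = {"cpu_weights": (0, 4096), "gpu_weights": (4096, 4096)}


def load_tensors(mapping, copy_all):
    """Return (tensors, bytes_copied) for one loading strategy."""
    base = memoryview(mapping)
    tensors, bytes_copied = {}, 0
    for name, (offset, length) in TENSOR_TABLE.items():
        region = base[offset:offset + length]  # slicing a memoryview does not copy
        if copy_all or name.startswith("gpu_"):
            tensors[name] = bytes(region)  # private copy, duplicates the data
            bytes_copied += length
        else:
            tensors[name] = region  # alias the mapped pages directly
    return tensors, bytes_copied


def compare_strategies():
    """Count bytes duplicated when copying everything vs aliasing CPU tensors."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "model.bin")
        with open(path, "wb") as f:  # stand-in file, not a real GGUF model
            f.write(b"\x01" * 4096 + b"\x02" * 4096)
        with open(path, "rb") as f:
            mapping = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            for label, copy_all in (("copy_everything", True), ("alias_cpu", False)):
                tensors, copied = load_tensors(mapping, copy_all)
                results[label] = copied
                # Views must be released before the mapping can be closed.
                for tensor in tensors.values():
                    if isinstance(tensor, memoryview):
                        tensor.release()
        finally:
            mapping.close()
    return results


if __name__ == "__main__":
    print(compare_strategies())
```

The aliasing path has one constraint worth noting: the mapping has to outlive every tensor that points into it, which is why the sketch releases the views before closing the mmap.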
Result on real hardware:
Peak RAM: 524MB → 142MB (about 73% reduction)
First boot: 19s → 11s
Second boot: ~2.5s (mmap + KV cache)
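For anyone who wants to check peak-RAM numbers like these on their own device, one option on Linux and Android is to read `VmRSS` (current resident set) and `VmHWM` (peak resident set) from `/proc/<pid>/status`. The helper below is a sketch under that assumption, not necessarily how the figures above were collected, and the function names are made up for illustration. Resident file-backed mmap pages count toward RSS alongside anonymous allocations, so a double-loaded model shows up in these fields.

```python
def parse_status_kb(status_text, field):
    """Return the value in kB of a field such as 'VmRSS', or None if absent."""
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key == field:
            return int(rest.split()[0])  # lines look like "VmRSS:   145408 kB"
    return None


def read_memory_mb(pid="self"):
    """Read current (VmRSS) and peak (VmHWM) resident memory in MB from procfs."""
    with open(f"/proc/{pid}/status") as f:
        text = f.read()
    result = {}
    for name in ("VmRSS", "VmHWM"):
        kb = parse_status_kb(text, name)
        if kb is not None:
            result[name] = kb / 1024
    return result


if __name__ == "__main__":
    print(read_memory_mb())
```

On a device, the same fields can be read by hand with `adb shell cat /proc/<pid>/status`.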
Code: https://github.com/Perinban/llama.cpp/tree/axon-dev
Write-up with VmRSS proof: https://www.linkedin.com/posts/perinban-parameshwaran_machin...