Show HN: Running LLM on smartwatch – found llama.cpp loading model twice in RAM

1 point

3 months ago

I'm running SmolLM2 360M on a Samsung Galaxy Watch 4 Classic (380MB free RAM). I found that llama.cpp was holding the model in RAM twice at the same time: once as the APK's mmap page cache and once in its own tensor allocations. Memory peaked at 524MB for a 270MB model.

Fix: added host_ptr to llama_model_params. CPU tensors point directly at the mmap region. Only Vulkan tensors get copied.
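
To illustrate the idea (not the actual patch, which lives in the C/C++ code of the linked branch), here is a minimal Python sketch of the difference between copying tensor bytes out of a mapped file and pointing a tensor at the mapped region directly. The file contents and the names write_fake_model, load_copy and load_zero_copy are made up for this example and are not part of llama.cpp.

```python
import mmap
import os
import tempfile


def write_fake_model(path, n_bytes):
    """Write a stand-in 'model file' (hypothetical data, not a real GGUF)."""
    with open(path, "wb") as f:
        f.write(bytes(i % 251 for i in range(n_bytes)))


def load_copy(mapped, offset, size):
    """Copying path: the tensor bytes are duplicated into a new heap buffer,
    so the data exists both in the page cache and in the process heap."""
    return bytearray(mapped[offset:offset + size])


def load_zero_copy(mapped, offset, size):
    """Zero-copy path: the tensor is just a view into the mapped pages."""
    return memoryview(mapped)[offset:offset + size]


def demo():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_fake_model(path, 4096)
        with open(path, "rb") as f:
            mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

        copied = load_copy(mapped, 1024, 512)
        view = load_zero_copy(mapped, 1024, 512)

        same_bytes = bytes(view) == bytes(copied)  # both paths see the same data
        shares_mapping = view.obj is mapped        # the view is backed by the mmap

        view.release()  # drop the view before closing the mapping
        mapped.close()
        return same_bytes, shares_mapping
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo())
```

Both paths read identical bytes, but only the copying path allocates a second buffer the size of the tensor. That second buffer is the kind of duplication the host_ptr change is meant to avoid for CPU tensors.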

Results on real hardware:

Peak RAM: 524MB → 142MB (74% reduction)
First boot: 19s → 11s
Second boot: ~2.5s (mmap + KV cache)

Code: https://github.com/Perinban/llama.cpp/tree/axon-dev

Write-up with VmRSS proof: https://www.linkedin.com/posts/perinban-parameshwaran_machin...
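
For anyone who wants to reproduce this kind of measurement on Linux or Android, here is a small sketch of reading VmRSS (current resident set) and VmHWM (peak resident set) from /proc/<pid>/status. This is an assumption about how such numbers can be collected, not necessarily how the write-up did it; parse_status_kb and read_status_kb are hypothetical helper names.

```python
def parse_status_kb(status_text, field):
    """Return the value in kB of a field such as 'VmRSS' or 'VmHWM'
    from the text of /proc/<pid>/status, or None if it is absent."""
    for line in status_text.splitlines():
        if line.startswith(field + ":"):
            # Lines look like "VmRSS:     12345 kB"
            return int(line.split()[1])
    return None


def read_status_kb(field, pid="self"):
    """Read one field from /proc/<pid>/status (Linux/Android only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_status_kb(f.read(), field)


if __name__ == "__main__":
    print("VmRSS kB:", read_status_kb("VmRSS"))
    print("VmHWM kB:", read_status_kb("VmHWM"))
```

Sampling VmRSS during model load shows the transient peak; VmHWM gives the high-water mark after the fact.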

No comments
