Found caveman a while back (https://github.com/JuliusBrussee/caveman) and got kind of obsessed with it. The outputs are dense, but you can't show them to a real user. So I've been trying to hide the caveman part in the middle: a local SLM compresses the input, the cloud model reasons in caveman style, and a local SLM expands the result back into readable text. The user never sees the compressed parts.
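A rough sketch of the three-stage shape, in Python for brevity. Everything here is a stand-in: `run_pipeline`, `Trace`, and the `toy_*` functions are hypothetical names, and the toy steps only mimic what the local SLM and the cloud model would do so the flow can run without any model attached.

```python
from dataclasses import dataclass
from typing import Callable

# A text-to-text step. In the real setup, compress/expand would wrap the
# local SLM and reason would wrap the cloud API call (assumption).
Step = Callable[[str], str]


@dataclass
class Trace:
    compressed_input: str  # what gets sent to the cloud model
    dense_output: str      # caveman-style reply, never shown to the user
    final_output: str      # expanded text the user actually sees


def run_pipeline(user_input: str, compress: Step, reason: Step, expand: Step) -> Trace:
    """Run compress -> reason -> expand and keep every intermediate result."""
    compressed = compress(user_input)
    dense = reason(compressed)
    final = expand(dense)
    return Trace(compressed, dense, final)


# Toy stand-ins so the sketch runs on its own (all hypothetical).
FILLER = {"the", "a", "an", "is", "please", "could", "you"}


def toy_compress(text: str) -> str:
    # Drop filler words; a real compressor would be the local model.
    return " ".join(w for w in text.split() if w.lower() not in FILLER)


def toy_reason(prompt: str) -> str:
    # Pretend cloud call: echo the prompt behind a terse marker.
    return "ANSWER: " + prompt


def toy_expand(dense: str) -> str:
    # Strip the marker and restore sentence casing and punctuation.
    return dense.replace("ANSWER: ", "", 1).capitalize() + "."


trace = run_pipeline("Please restart the server", toy_compress, toy_reason, toy_expand)
print(trace.compressed_input)
print(trace.final_output)
```

Keeping the intermediate strings in `Trace` makes it easy to log what the cloud model actually saw and said, which is where expansion mistakes would show up.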
Running Phi-3 via Candle for compression. Fast enough. Cloud calls look shorter, but I haven't done real token counting yet.
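For the missing token counting, something like the following could be a starting point. It is only a sketch: `token_savings` and `crude_tokens` are hypothetical names, and the default tokenizer is a regex split that approximates nothing in particular. Real numbers would need the cloud model's own tokenizer passed in as `tokenize`.

```python
import re
from typing import Callable, Dict, List

Tokenizer = Callable[[str], List[str]]


def crude_tokens(text: str) -> List[str]:
    # Proxy only: splits into words and punctuation marks.
    # This is NOT how any real model tokenizes text.
    return re.findall(r"\w+|[^\w\s]", text)


def token_savings(original: str, compressed: str,
                  tokenize: Tokenizer = crude_tokens) -> Dict[str, float]:
    """Compare token counts before and after compression."""
    before = len(tokenize(original))
    after = len(tokenize(compressed))
    return {
        "before": before,
        "after": after,
        "saved": before - after,
        # Fraction of the original size that remains (1.0 = no savings).
        "ratio": after / before if before else 1.0,
    }


stats = token_savings(
    "Could you please summarize the attached report for me?",
    "summarize attached report",
)
print(stats)
```

The same function would apply to the output side too, comparing the caveman-style reply against a normal-length reply to the same prompt, since output tokens count toward the bill as well.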
The expansion step is the problem. Re-hydrating caveman output into readable text is harder than compressing the input, and the local model makes more mistakes there. Not sure if that's a prompting issue or just a ceiling for a model this size.
Also not sure this makes sense at low API volumes. The added complexity might not be worth it.
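One way to put a number on the volume question is a break-even estimate: how many calls per month it takes before the token savings cover the fixed cost of running the extra local steps. A minimal sketch, where `break_even_calls` is a hypothetical helper and every figure in the example call is a made-up placeholder, not a real price or measurement:

```python
import math


def break_even_calls(tokens_saved_per_call: float,
                     price_per_million_tokens: float,
                     monthly_overhead: float) -> float:
    """Calls per month at which token savings equal the fixed overhead."""
    saving_per_call = tokens_saved_per_call * price_per_million_tokens / 1_000_000
    if saving_per_call <= 0:
        # No savings per call means the overhead is never recovered.
        return math.inf
    return monthly_overhead / saving_per_call


# Placeholder inputs only: 500 tokens saved per call, 10.00 per million
# tokens, 50.00 per month of extra hosting and maintenance cost.
calls = break_even_calls(
    tokens_saved_per_call=500,
    price_per_million_tokens=10.0,
    monthly_overhead=50.0,
)
print(round(calls))
```

With those placeholder inputs the break-even lands around ten thousand calls a month. Below that, the pipeline costs more than it saves, before even counting the quality lost in expansion.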