I created a tool that lets you use LLMs to automate tasks across mobile (Android) and desktop. Currently, it takes screenshots and relies on the LLMs' ability to extract on-screen UI elements. It is still a work in progress, and I'm trying to make it work with local models via Ollama (the code is in place, with some issues). As of now, Gemini and GPT-4o work best for finding UI elements and planning the task.
Some examples that work as of now:
1. Use gmail and ask <friend>@example.com for lunch next saturday
2. Start a 3+2 chess game on lichess
Working demos: https://github.com/BandarLabs/clickclickclick
This reduces the cost of one automation task from approx. $0.60 via Claude to:
$0.06 - OpenAI GPT-4o mini as planner + Gemini 1.5 Flash on the free tier (15 calls/min)
The Llama vision models will eventually bring it down to $0.
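To give a rough idea of how a planner model and a UI-element finder can be combined under a 15 calls/min limit, here is a minimal sketch. It is not the actual code from the repo: `Step`, `RateLimiter`, `run_task` and the `planner` / `finder` / `act` / `screenshot` callables are all hypothetical names, and the real tool may be structured differently.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One planned UI action, e.g. 'tap the Compose button' (hypothetical type)."""
    description: str


class RateLimiter:
    """Allow at most `max_calls` calls per `period` seconds (sliding window)."""

    def __init__(self, max_calls: int = 15, period: float = 60.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.calls: List[float] = []

    def wait(self) -> None:
        now = self.clock()
        # Forget calls that have already left the window.
        self.calls = [t for t in self.calls if now - t < self.period]
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest call in the window expires.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self.calls = [t for t in self.calls if now - t < self.period]
        self.calls.append(now)


def run_task(task: str,
             planner: Callable[[str], List[Step]],
             finder: Callable[[bytes, str], Tuple[int, int]],
             act: Callable[[int, int], None],
             screenshot: Callable[[], bytes],
             limiter: RateLimiter) -> int:
    """Plan once, then locate and tap one UI element per step.

    `planner` stands in for the planning LLM, `finder` for the
    rate-limited vision model that returns screen coordinates.
    """
    steps = planner(task)
    for step in steps:
        limiter.wait()  # stay under the finder's calls/min quota
        x, y = finder(screenshot(), step.description)
        act(x, y)
    return len(steps)
```

In a real run, `planner` and `finder` would wrap the model API calls, `screenshot` would capture the device screen, and `act` would send the tap or click; here they are left as plain callables so the loop can be tried with stubs.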