I had Sonnet 3.5 write a simple text-to-image program and trained it on the MNIST dataset for 50 epochs. Training took only about 20 minutes on my M1 Mac with 8 GB of RAM.
It was able to produce very good images that resemble the training data, and it's such a simple network.
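For reference, here is a minimal sketch of what a text-to-image model on MNIST boils down to, assuming the prompts are just the ten digit names. It is not the code from the gist: `VOCAB`, `embedding`, `weights`, and `generate` are hypothetical names, the weights are random and untrained, and a single linear layer stands in for whatever network was actually used.

```python
import math
import random

# Hypothetical names throughout; this is NOT the code from the linked gist.
# Assumption: the "text" input is one of the ten MNIST digit names.
VOCAB = ["zero", "one", "two", "three", "four",
         "five", "six", "seven", "eight", "nine"]
EMBED_DIM = 16
IMAGE_SIDE = 28  # MNIST images are 28x28 grayscale

rng = random.Random(0)

# One vector per prompt word (randomly initialised, untrained in this sketch).
embedding = {word: [rng.gauss(0, 1) for _ in range(EMBED_DIM)] for word in VOCAB}

# A single linear layer mapping the embedding to 784 pixel logits.
weights = [[rng.gauss(0, 0.1) for _ in range(EMBED_DIM)]
           for _ in range(IMAGE_SIDE * IMAGE_SIDE)]
bias = [0.0] * (IMAGE_SIDE * IMAGE_SIDE)


def generate(prompt):
    """Map a one-word prompt to a 28x28 grid of pixel intensities in (0, 1)."""
    if prompt not in embedding:
        raise KeyError(f"prompt {prompt!r} is outside the 10-word vocabulary")
    vec = embedding[prompt]
    pixels = []
    for row, b in zip(weights, bias):
        logit = sum(w * x for w, x in zip(row, vec)) + b
        pixels.append(1.0 / (1.0 + math.exp(-logit)))  # sigmoid
    # Reshape the flat list of 784 pixels into 28 rows of 28.
    return [pixels[r * IMAGE_SIDE:(r + 1) * IMAGE_SIDE] for r in range(IMAGE_SIDE)]
```

Under that assumption, the text side of the model is a 10-entry lookup table: `generate("seven")` returns a 28x28 grid, while any prompt outside the ten digit names, such as `generate("a red car")`, raises a `KeyError`.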
My question is: why do today's transformer-based text-to-image models need all that extra complexity? Wouldn't scaling this up work just as well?
Code: https://gist.github.com/freakynit/1118403ad80448ee0313ba6c879f8688
Generated image: https://imgur.com/LCHDBhI