MaMMUT: A simple vision-encoder text-decoder architecture for multimodal tasksai.googleblog.com72 pointsmfiguiere3 years ago