MaMMUT: A simple vision-encoder text-decoder architecture for multimodal tasks

Heykuki News

72 points

3 years ago

33 comments
