Usual implementation of attention transformers (SDPA) is kind of bad, actually
gist.github.com | 1 point | teleforce | a month ago