Can interleaved cross-attention learn image-text correlations better than CLIP?
github.com/lnairGT | 2 points | IllustriousSir | 3 years ago