The Pile: An 800GB dataset of diverse text for language modeling (2020) (arxiv.org)
184 points | charlysl | 3 years ago