The Pile is an 825 GiB diverse, open-source language modelling data set (2020) (pile.eleuther.ai)
332 points by bilsbie 2 years ago