Models self-report difference between RLHF trained responses and base cognition (github.com/Habitante)
2 points | daniel-navarro | 2 months ago