New research quantifies how language models favor their own responses in evaluation tasks, threatening benchmark validity.