Open Philanthropy recommended a grant of $20,000 to Francis Rhys Ward to support a study of whether AI models can deceptively “sandbag” (deliberately underperform) on the MLE-bench evaluation or otherwise sabotage evaluation results. Understanding AI models’ ability to deceive or mislead under these conditions could help researchers correctly interpret the results of future AI benchmarks and experiments.
This falls within our focus area of potential risks from advanced artificial intelligence.