As part of our work on reducing potential risks from advanced artificial intelligence, we are seeking proposals for projects working with deep learning systems that could help us understand and make progress on AI alignment: the problem of creating AI systems more capable than their designers that robustly try to do what their designers intended. We are interested in proposals that fit within certain research directions, described below, that we think could contribute to reducing the risks we are most concerned about.

Anyone is eligible to apply, including those working in academia, industry, or independently. Applicants are invited to submit proposals for up to $1M in total funding covering up to 2 years. We may invite grantees who do outstanding work to apply for larger and longer grants in the future.

Proposals were due January 10, 2022. We are issuing a one-week extension such that proposals will now be due January 17, 2022.

If you have any questions, please contact us.

## 1 Our view of alignment risks from advanced artificial intelligence

This section was written by Nick Beckstead and Asya Bergal, and may not be representative of the views of Open Philanthropy as a whole.

We think the research directions below would be pursued more fruitfully by researchers who understand our background views about alignment risks from advanced AI systems, and who understand why we think these research directions could help mitigate these risks. In brief:

• We believe it is plausible that later this century, advanced AI systems will do the vast majority of productive labor more cheaply than human workers can.[1]
• We are worried about scenarios where AI systems more capable than humans acquire undesirable objectives that make them pursue and maintain power in unintended ways, causing humans to lose most or all influence over the future.
• We think it may be technically challenging to create powerful systems that we are highly certain have desirable objectives. If it is significantly cheaper, faster, or otherwise easier to create powerful systems that may have undesirable objectives, there may be economic and military incentives to deploy those systems instead.[2]
• We are interested in research directions that make it easier to create powerful systems that we are highly certain have desirable objectives.

In this request for proposals, we are focused on scenarios where advanced AI systems are built out of large neural networks.
One approach to ensuring large neural networks have desirable objectives might be to provide them with reward signals generated by human evaluators. However, such a setup could fail in multiple ways:

• Inadequate human feedback: It’s possible that in order to train advanced AI systems with desirable objectives, we will need to provide reward signals for highly complex behaviors that have consequences that are too difficult or time-consuming for humans to evaluate.[3]
• Deceiving human evaluators: It may be particularly difficult to provide good reward signals to an AI system that learns undesirable objectives during training and has a sophisticated model of humans and the training setup. Such a system may “deceive” the humans, i.e. deliberately behave in ways that appear superficially good but have undesirable consequences.
• Competent misgeneralization: Even if an AI system has an abundant supply of good reward signals and behaves consistently with desirable objectives on the training distribution, there could be contexts outside of the training distribution where the system retains its capabilities but pursues an undesirable objective.
• Deceptive misgeneralization: Rather than subtly misbehaving during training as in “deceiving human evaluators”, a sophisticated AI system that learns undesirable objectives may choose to behave in only desirable ways during training, maximizing its chances of being deployed in the real world, where it can more effectively pursue its true objectives. This case and the analogous one above may pose special challenges because of the adversarial relationship between the system and its designers.[4]

The research directions described below aim to address these failure modes, or otherwise contribute to helping us understand or make progress on AI alignment.

## 2 Research directions

We are soliciting proposals that fit within one of the following research directions.
For each research direction, we give a brief description below and link to a document describing the direction in depth.

#### Direction 1: Measuring and forecasting risks

Proposals that fit within this direction should aim to measure concrete risks related to the failures we are worried about, such as reward hacking,[5] misgeneralized policies, and unexpected emergent capabilities. We are especially interested in understanding the trajectory of risks as systems continue to improve, as well as any risks that might suddenly manifest on a global scale with limited time to react. We think this research direction could allow us to better direct future research, as well as to make stronger arguments for worrying about certain risks.

#### Direction 2: Techniques for enhancing human feedback

Proposals that fit within this direction should aim to address the inadequate feedback problem by developing general techniques for generating good reward signals using human feedback that could apply to settings where it would otherwise be prohibitively difficult, expensive, or time-consuming to provide good reward signals. We are especially interested in proposals that use these techniques to train models to complete tasks that would otherwise be difficult to accomplish.

#### Direction 3: Interpretability

Proposals that fit within this direction should aim to contribute to the mechanistic understanding of neural networks, which could help us discover unanticipated failure modes and ensure that large models in the future won’t pursue undesirable objectives in contexts not included in the training distribution (cf. “competent misgeneralization” above). Potential projects in this direction could consist of mapping small-scale structures in neural networks to human-understandable algorithms, finding large-scale structures that simplify the understanding of neural networks, and learning about neurons that respond to multiple unrelated features, among others.
Proposals related to scaling mechanistic interpretability to larger models are of particular interest.

#### Direction 4: Truthful and honest AI

Proposals that fit within this direction should aim to contribute to the development of AI systems that have good performance on standard benchmarks while being “truthful”, i.e. avoiding saying things that are false, and “honest”, i.e. accurately reporting what they believe. Advanced AI systems that are truthful and/or honest could help humans provide more adequate training feedback by accurately reporting on the consequences of their actions. Making models truthful and honest while achieving good performance on standard benchmarks could also teach us something about the broader problem of making AI systems that avoid certain kinds of failures while staying competitive and performant. Potential projects in this direction could aim to develop definitions and concepts that are fruitful for relevant ML research, create benchmarks or tasks to measure truthfulness or honesty, or develop techniques for making systems that are more truthful and honest.

## 3 Application process

Use this form to submit a project proposal. The form asks for:

• An up-to-date CV
• A 2–5 page project description (not including references), which should include:
  • a) An outline of the proposed steps for your project, to the best of your ability, including any experiments you want to run, though we expect that many details will be uncertain until the project is underway.
  • b) A description of the outcome you are hoping for: what would we learn or gain from this project if it went well?
  • c) An explanation of impact: how would the outcome given in b) help us avoid the inadequate feedback or misgeneralization failures described above, or otherwise reduce the chance that power-seeking AI systems cause humanity to lose most or all influence over the future?
We think applicants should spend most of the proposal answering (a) and (b); however, it’s important to us that the answers to (c) make sense and we will examine them critically.
• An estimated budget
• An estimated project duration

By default, we expect proposals to request no more than $1M total and to cover projects lasting no more than 2 years. (We will consider exceptions in cases where external restrictions require that funding cover more than 2 years.) If you are submitting a larger proposal, please include an explanation in your project description of why your work cannot be scoped into this budget and timeframe.

Grants will cover individual projects and will not be renewed, though we may invite grantees who do outstanding work to apply for larger and longer grants in the future.

All grantees are required to submit a 3-page progress report to us every 6 months after their grant is awarded, and a final report to us after the project is finished.

Proposals were due January 10. We are issuing a one-week extension such that proposals will now be due January 17, 2022.

We plan to evaluate proposals in two stages. We will let applicants know if they have passed Stage 1 by late February. If you pass Stage 1, we may contact you with additional follow-up questions or ask you to join us for an interview. We anticipate making final decisions by late March.

#### 3.1 Using large language models

Proposals that fit within the techniques for enhancing human feedback and truthful and honest AI research directions (and potentially others) may want to work with existing large language models. Publicly available language models include GPT-2 and GPT-J-6B. OpenAI has also recently released an API that provides paid fine-tuning access to its larger models. We are happy to pay for this access as part of our grant.

1. See Cotra 2020, Davidson 2021a, Davidson 2021b, and, more broadly, Holden Karnofsky’s “Most Important Century” series.

2. For this and the above bullet point, see this recent draft report by Joseph Carlsmith, an Open Philanthropy Senior Research Analyst, which looks at these scenarios in more detail. To get additional perspectives on possible scenarios, it may also be useful to read:

3. This request for proposals is focused on approaches that work by providing human-generated feedback throughout the system’s training. Another potential approach could involve using features of the system’s internals or training setup to argue that its objectives won’t become undesirable as the system becomes more capable, even if humans aren’t able to provide additional feedback. There may also be viable approaches to this problem that don’t rely on human feedback at all.

4. Nick Bostrom describes this failure mode in Superintelligence, p. 117:

…one idea for how to ensure superintelligence safety… is that we validate the safety of a superintelligent AI empirically by observing its behavior while it is in a controlled, limited environment (a “sandbox”) and that we only let the AI out of the box if we see it behaving in a friendly, cooperative, responsible manner. The flaw in this idea is that behaving nicely while in the box is a convergent instrumental goal for friendly and unfriendly AIs alike. An unfriendly AI of sufficient intelligence realizes that its unfriendly final goals will be best realized if it behaves in a friendly manner initially, so that it will be let out of the box. It will only start behaving in a way that reveals its unfriendly nature when it no longer matters whether we find out; that is, when the AI is strong enough that human opposition is ineffectual.

Additional discussions of the possibility of such failure modes can be found in Hubinger et al.’s Risks from Learned Optimization in Advanced Machine Learning Systems (section 4, “Deceptive Alignment”) and Luke Muehlhauser’s post, “Treacherous turns in the wild”.

5. “Reward hacking” refers to AI systems finding decisions that do well according to the explicit reward function, but that were unintended and undesired; this is an instance of the “inadequate feedback” failure described above.