The Open Phil AI Fellowship is a fellowship for full-time PhD students focused on artificial intelligence or machine learning.

With this program, we seek to fully support a small group of the most promising PhD students in AI and ML who are interested in research that makes it less likely that advanced AI systems pose a global catastrophic risk. Fellows receive a $40,000 stipend,$10,000 in research support, and payment of tuition and fees, each year, starting in the year of their selection until the end of the 5th year of their PhD.

Applications are now closed.

Decisions will be sent out before March 31, 2022. If you have questions or concerns, please email us.

## 1. Purpose

As a philanthropic funder seeking to do as much good as we can, we see the development of artificial intelligence as a particularly important cause area. We believe that progress in this area could be “transformative”, i.e., it could lead to changes in human civilization as large as the agricultural or industrial revolutions, and that researchers today can do meaningful work to increase the probability of positive outcomes. While we think it’s most likely that these changes would lead to significant improvements in human well-being, we also see significant risks. We’re particularly worried about the potential for advanced AI systems to be a source of global catastrophic risks, scenarios that are globally destabilizing enough to permanently worsen humanity’s future, or cause humanity’s extinction.

Given these views, increasing the probability of positive outcomes from transformative progress in AI is a major priority of Open Philanthropy, and we believe that supporting talented young researchers is among the best ways to do this.

To this end, the Open Phil AI Fellowship supports a small group of the most promising PhD students in artificial intelligence and machine learning. Fellows will be selected based on their academic excellence, technical knowledge, and interest in pursuing research directions that reduce the probability that advanced AI systems pose a global catastrophic risk.

This page includes a non-exhaustive list of examples of topics that we’ve encountered as we’ve explored this area, but we expect that AI Fellows will develop many promising research directions beyond those we happen to have listed. We are interested in supporting students where there is a compelling case for why their future research work will reduce global catastrophic risk, regardless of whether their research interests fall into a category we happen to have listed.

## 2. Program Details

Who can apply: The Open Phil AI Fellowship is open to full-time AI and machine learning students in any year of their PhD. Anyone who expects to be enrolled in a PhD program is welcome to apply (including undergraduate seniors applying to AI or ML PhD programs). The program is open to applicants in any country. Students with pre-existing funding sources are welcome to apply, as are students transferring to an AI/ML PhD from another field. If you aren’t sure whether you’re eligible, please feel free to ask: [email protected]. Our best guess is that we will fund 3-8 applicants each year.

Support: Support includes a $40,000 per year stipend, payment of tuition and fees, and an additional$10,000 in annual support for travel, equipment, and other research expenses. Fellows will be funded through the end of the 5th year of their PhD, with the possibility of renewal for subsequent years.

How the fellowship works: Fellows have a broad mandate to think through which kinds of AI and ML research are likely to be most valuable, to share ideas and form a community with like-minded students and professors, and ultimately to act in the way that they think is most likely to improve outcomes from progress in AI.

Flexibility: We do not impose any required internships, restrictions on where fellows may work as an intern, or IP restrictions around fellows’ research. Our goal is to support AI Fellows at the research environment or institution they think will be best for them. For example, fellows will be free to transfer their funding to a new research group or university, or to defer use of funds while they pursue internships. If a fellow wants to pursue their research independently or in a non-academic setting, we will do our best to offer support for their continued work.

## 3. How to apply

The application consists of:

• A two-page personal research statement (excluding references). See the section below for more information about the personal research statement.
• An up-to-date curriculum vitae.
• Up to three letters of recommendation.

Applications are now closed.

We encourage applications from 5th-year students, who will be supported on a year-by-year basis; students who will be starting their PhD in the upcoming year; and students with pre-existing funding sources who find the mission and community of the Open Phil AI Fellowship appealing. We are committed to fostering a culture of inclusion, and encourage individuals with diverse backgrounds and experiences to apply; we especially encourage applications from women and minorities.

Applicants to this fellowship will also automatically be considered for Open Philanthropy’s program for early-career funding for individuals interested in improving the long-term future. This program aims to provide support, including for graduate study, to early-career individuals who want to pursue careers that help improve the long-term future.

As a note, we will not be able to give personalized feedback on applications.

#### 3.1 About the personal research statement

Students often ask us about the personal research statement: what to assume about the readers, and what should be contained in the statement. We’ll describe here what we’re looking for.

The statement will be read by the program staff (Nick Beckstead and Asya Bergal) and by a select number of external advisors who are machine learning researchers. Nick and Asya are comfortable with the technical broad strokes of the AI and ML field and many subfields, but should not be considered experts in any specific discipline.

The statement should contain the following:

We are ultimately interested in supporting students whose work makes it less likely that advanced AI systems pose a global catastrophic risk. However, we think seeing students’ motivations and reasoning can be more indicative than their immediate research plans. Accordingly, the personal research statement should either a) indicate to us that your research plans are on topics we think are directly valuable, or b) demonstrate to us that you are reasoning carefully about how your research is relevant to the risks we care about.

Descriptions of research interests are not a binding commitment of any sort, and recipients may change topics within AI/ML throughout their PhD studies as they see fit. Specific detail is helpful to the extent that it helps us understand what you care about: any given word or phrase describing a topic (such as “interpretability” or “generalization”) could mean different things to different researchers, and so specific detail about problem formulation, or past or intended future experiments, helps us better understand what you mean.

## 4. Past and current fellows

You can see pages for the fellowship classes of:

## 5. Sample research topics

• We expect and encourage applications on topics we haven’t listed. Our basic aim is to enable excellent researchers to think through for themselves which kinds of work are most likely to be valuable, both independently and as part of a community. We expect that AI Fellows will develop many promising research directions beyond those we happen to have listed here; we’re providing these topics as examples of research directions that seem promising to us in the hopes that this will be helpful to applicants.
• This topic list should not be mistaken for a literature review. Within each topic, we give a few examples of papers on that topic. These lists are probably not the best or most representative citations one would choose for a literature review; they were chosen because the authors of this page had read them or because they had been suggested to us by a researcher, and it seems to us that our citations in most topics over-represent recent papers, deep learning papers, and papers by authors we have talked to. We strongly expect that AI Fellows will find relevant threads of work that are very different from the papers we list here.

The topics we list below are centered around “AI alignment,” the problem of creating AI systems that robustly try to do what their designers intended, even when AI systems become much more capable than their designers across a broad range of tasks. Several research groups are working on this problem including the Center for Human-Compatible AIJacob Steinhardt’s lab at UC Berkeley, teams at DeepMind and OpenAI [12], the AI safety and research company Anthropic, the Machine Intelligence Research Institute, the AI Safety Research Group at the Future of Humanity Institute, and the Alignment Research Center.

We believe that AI alignment is important because we expect that progress in AI will eventually produce systems that are much more capable than any human or group of humans across a broad range of tasks, and that there is a reasonable chance that this will happen in the next few decades. In this case, AI systems acting on behalf of their designers would probably come to be responsible for the vast majority of human influence over the world, likely causing dramatic and hard-to-predict changes in society. It would then be very important for AI systems to robustly try to do what their designers intended; the risks highlighted by Nick Bostrom in the book Superintelligence are one example of the potential consequences of failing to align sufficiently transformative AI systems with human values, although given the huge variety of possible futures and the difficulty of making specific predictions about them, we think it’s worthwhile to consider a wide range of possible scenarios and outcomes beyond those Bostrom describes. In order to be satisfying, a method for making AI systems that “robustly try to do what their designers intended” should apply to these kinds of highly capable and influential AI systems.

What kinds of research seem most likely to be important for AI alignment? We look at this question by imagining that these powerful AI systems are created using the same general methodologies that are most often used today, thinking about how these methods might not result in aligned AI systems, and trying to figure out what kinds of research progress would be most likely to mitigate these problems.

To give you an idea of the kinds of research that currently seem promising to us, we’ve gathered a non-exhaustive list of example research topics from our technical advisors (machine learning researchers at OpenAI, Google Brain, and Stanford) and from other AI and machine learning researchers interested in AI alignment. We found that many of these topics fit into three broad categories:

1. Reward learning: Most AI systems today are trained to optimize a well-defined objective (e.g. reward or loss function); this works well in some research settings where the intended goal is very simple (e.g. Atari games, Go, and some robotics tasks), but for many real-world tasks that humans care about, the intended goal or behavior is too complex to be specified directly. For very capable AI systems, pursuit of an incorrectly specified goal would not only lead an AI system to do something other than what we intended, but could lead the system to take harmful actions – e.g. the oversimplified goal of “maximize the amount of money in this bank account” could lead a system to commit crimes. If we could instead learn complex objectives, we could apply techniques like reinforcement learning to a much broader range of tasks without incurring these risks. Can we design training procedures and objectives that will cause AI systems to learn what we want them to do?
2. Reliability: Most training procedures optimize a model or policy to perform well on a particular training distribution. However, once an AI system is deployed, it is likely to encounter situations that are outside the training distribution or that are adversarially generated in order to manipulate the system’s behavior, and it may perform arbitrarily poorly on these inputs. As AI systems become more influential, reliability failures could be very harmful, especially if failures result in an AI system learning an objective incorrectly. Can we design training procedures and objectives that will cause AI systems to perform as desired on inputs that are outside their training distributions or that are generated adversarially?
3. Interpretability: Learned models are often extremely large and complex; if a learning method can produce models whose internal workings can be inspected and interpreted, or if we can develop tools to visualize or analyze the dynamics of a learned model, we might be able to better understand how models work, which changes to inputs would result in changed outputs, how the model’s decision depends on its training and data, and why we should or should not trust the model to perform well. Interpretability could help us to understand how AI systems work and how they may fail, misbehave, or otherwise not meet our expectations. The ability to interpret a system’s decision-making process may also help significantly with validation or supervision; for example, if a learned reward function is interpretable, we may be able to tell whether or not it will motivate desirable behavior, and a human supervisor may be able to better supervise an interpretable agent by inspecting its decision-making process.

It’s important to note that not all of the research topics that seem promising to us fit obviously into one of these categories, and we expect researchers to find more categories and research topics in the future.

Some additional problems and research directions we’re interested in can be found in:

## 6. More on reward learning, reliability, and interpretability

#### 6.1 Reward learning

Can we design training procedures and objectives that will cause AI systems to learn what we want them to do? In our conversations with researchers, a few plausible desiderata for reward learning methods have come up:

• Convergence in the limit of data and computation: As our AI capabilities improve, it becomes more likely that AI systems will find solutions closer to global optima; for example, much more capable game-playing agents would be more likely to find bugs that allow them to set their scores directly. This means that we should be very careful about assumptions of convergence that depend on AI systems’ limitations, which puts pressure on the reward learning process to guarantee convergence on desirable behavior in the limit of data and computation.
• Corrigibility [12]: At every stage of learning, it would be desirable for human operators to be able to correct or override an AI system’s decisions. This might be achieved via life-long reward learning (where human operators’ actions are used as data about the AI system’s objective).
• Scalable supervision: If objectives are very complex and human-generated data is an AI system’s main source of information about objectives, it will be important for AI systems to use this data efficiently; human-generated data is expensive, and human supervision (generated in response to an AI system’s actions) is especially expensive.
• Capacity to exceed human performance: An AI system that always takes the action a human would select can be thought of as a theoretical baseline for AI alignment; while it will never take actions that its operators would disapprove of, it will not outperform them at any task. A successful reward learning procedure should allow an AI system to strongly exceed its operators’ capabilities, while still doing “what they want”.
• Learning from side information: Some existing data sources, e.g. videos or descriptions of human behavior, human utterances, and human-generated text, seem to contain large amounts of information about likely objectives. It would be desirable for reward learning methods to be able to make use of this data.

Below we list some example topics that seem promising for reward learning work. Not all work on these topics will be relevant to AI alignment, but each seems to hold some potential for learning more about how we can design very capable AI systems that learn to do what their operators want them to do:

• Imitation learning [12345]: One basic approach to reward learning is to learn to imitate human actions. Alignment-relevant problems include include surpassing human performance, deciding which details of a human’s actions are important to imitate and which should be ignored, and dealing with the differences between the actions available to a human and the actions available to a highly capable AI system.
• Inverse reinforcement learning [123]: Another possible approach is to model a human demonstrator as a semi-rational agent, infer the demonstrator’s goal from their actions, and then act to help them achieve that goal; this should allow an AI system to exceed the demonstrator’s capabilities. One alignment-relevant problem is that IRL is heavily dependent on a model relating the demonstrator’s actions to their values, so a mis-specified model could lead to arbitrarily bad behavior.
• Learning from human feedback [123]: Another option is to elicit human feedback about each action an AI system takes and to train it to choose actions that it predicts would receive positive feedback. Alignment-relevant problems include significantly surpassing human performance (since humans may have a limited ability to judge actions that are significantly better than those they’d choose themselves), the unreliability of human feedback, the difficulty of understanding an AI system’s behavior well enough to judge it, and the high expense of human feedback data.
• Other ways of learning underspecified tasks [12]: It seems plausible that some new method for learning from humans or human-generated data could be more suitable for AI alignment than current methods are.
• Scalable supervision [123]: Many reward learning approaches depend on human-generated data. This data can be expensive, and human supervision (e.g. feedback on a system’s performance) is extremely expensive. In order to learn complex objectives, an AI system will need to use human-generated data very efficiently, and maximize the information about its objective that it gains from other data sources.

#### Reliability

Can we design training procedures and objectives that will cause AI systems to perform as desired on inputs that are outside their training distributions or that are generated adversarially? Again, a few plausible desiderata have come up in our conversations with researchers:

• Robustness to distributional shift: Can we train systems to perform as desired on inputs that are sampled from distributions that are meaningfully distinct from the system’s training distribution?
• Robustness to adversarial inputs: Can we train systems to perform as desired on inputs that are specifically designed to cause incorrect decisions?
• Calibration: Can we train systems to accurately assess their own uncertainty, including in situations where their test data is drawn from a meaningfully different sort of data set than their training data? If so, we may be able to design very capable AI systems to act conservatively in situations where they are likely to make mistakes.
• Robustness to model mis-specification [1]: Some kinds of learning problems could be caused by incorrect model classes. For example, in some cases we may want to learn about something that cannot be observed directly, but that instead is an internal or derived feature of a model of some observable data (e.g. the preferences of a human, which can hopefully be inferred from their actions). If we mis-specify a system’s model, it may not learn these latent variables correctly, or it might learn them correctly on a training distribution but not during deployment. To what extent can we make AI systems robust against or at aware of model mis-specification?
• Robustness to training errors or data-set poisoning [1]: Training data will sometimes contain errors, or may be contaminated by an adversary seeking to manipulate the trained system’s behavior. Can we train AI systems in a way that is robust to some amount of compromised training data?
• Safety during learning and exploration: While an AI system is learning, and especially when it’s learning an objective, it could make poor decisions based on what it has learned so far. It would be desirable to learn the most important facts about the environment and objective first and to act “conservatively” in the face of uncertainty.
• Extreme reliability for critical errors: Certain kinds of errors may cause such significant harm that we would be willing to pay very high costs to make those errors extremely unlikely. What kinds of training procedures can we use for these cases, and what kinds of guarantees could we achieve?

Below we list some example topics that seem promising for reliability. As above, not all work on these topics will be relevant to AI alignment, but each seems to hold some potential for learning more about how we can design very capable AI systems that behave reliably enough to be trusted with high-impact tasks.

• Machine learning security [123]: ML security studies reliability through the lens of threat models, formally defined assumptions about the capabilities and goals of an attacker. Most ML testing can be thought of as giving evidence about a system’s robustness against an “attacker” that can only present samples from a certain fixed distribution and that does not act with any particular goal in mind; adversarial examples use a threat model with a more powerful attacker. Through the systematic study of different threat models and methods for defense, ML security has the potential to give us a systematic understanding of the kinds of guarantees we can have about ML systems’ reliability, the costs and trade-offs implied by choosing one method of defense or another, and the ultimate prospects of making machine learning systems reliable in many kinds of situations. For example, we might study data poisoning attacks not only to learn how to defend against them, but also to make systems more robust against human error during training.
• Robustness to distributional shift [12345]: As noted above, machine learning systems trained on a particular distribution of inputs may perform arbitrarily poorly on inputs from outside that distribution. Such failures in very influential AI systems could cause significant harm, and more capable AI systems might be more likely to encounter large distributional shifts (since they could be deployed in a wider range of circumstances, and since their use might cause significant changes in the world). There are too many topics in machine learning relevant to distributional shift to review comprehensively here, but some examples include change or anomaly detection, unsupervised risk estimation, and KWIK learning.
• Understanding and defending against adversarial examples [1234567]: Adversarial examples are inputs that are optimized to cause models to perform incorrectly. Better understanding what makes models vulnerable to adversarial examples could help us understand reliability more broadly, and thwarting adversarial examples (via e.g. adversarial training or ensembling) could lead to models that are both resistant to attack and more reliable overall. Adversarial examples also indicate a mismatch between an AI system’s learned concepts and the desired human concepts, which could lead to problems in a variety of ways.
• Verification [1234]: One approach to reliability is to produce formal specifications of desirable properties and prove that an ML system meets those specifications. When verification is possible, it can provide more comprehensive guarantees than testing would; in addition, the process of developing formal specifications for intuitively appealing properties can clarify our understanding of those properties, and attempts to prove that a systems meets a specification may reveal bugs or design mistakes in the system itself. If properties like robustness against adversarial examples can be formalized, verification may be able to give us a higher level of confidence about a system’s reliability than would be possible through testing, and the ability to verify a wide range of properties in cutting-edge machine learning systems seems likely to be useful for building systems that can be trusted with high-impact tasks.

#### Reliability

Can we design training procedures and objectives that will cause AI systems to perform as desired on inputs that are outside their training distributions or that are generated adversarially? Again, a few plausible desiderata have come up in our conversations with researchers:

• Robustness to distributional shift: Can we train systems to perform as desired on inputs that are sampled from distributions that are meaningfully distinct from the system’s training distribution?
• Robustness to adversarial inputs: Can we train systems to perform as desired on inputs that are specifically designed to cause incorrect decisions?
• Calibration: Can we train systems to accurately assess their own uncertainty, including in situations where their test data is drawn from a meaningfully different sort of data set than their training data? If so, we may be able to design very capable AI systems to act conservatively in situations where they are likely to make mistakes.
• Robustness to model mis-specification [1]: Some kinds of learning problems could be caused by incorrect model classes. For example, in some cases we may want to learn about something that cannot be observed directly, but that instead is an internal or derived feature of a model of some observable data (e.g. the preferences of a human, which can hopefully be inferred from their actions). If we mis-specify a system’s model, it may not learn these latent variables correctly, or it might learn them correctly on a training distribution but not during deployment. To what extent can we make AI systems robust against or at aware of model mis-specification?
• Robustness to training errors or data-set poisoning [1]: Training data will sometimes contain errors, or may be contaminated by an adversary seeking to manipulate the trained system’s behavior. Can we train AI systems in a way that is robust to some amount of compromised training data?
• Safety during learning and exploration: While an AI system is learning, and especially when it’s learning an objective, it could make poor decisions based on what it has learned so far. It would be desirable to learn the most important facts about the environment and objective first and to act “conservatively” in the face of uncertainty.
• Extreme reliability for critical errors: Certain kinds of errors may cause such significant harm that we would be willing to pay very high costs to make those errors extremely unlikely. What kinds of training procedures can we use for these cases, and what kinds of guarantees could we achieve?

Below we list some example topics that seem promising for reliability. As above, not all work on these topics will be relevant to AI alignment, but each seems to hold some potential for learning more about how we can design very capable AI systems that behave reliably enough to be trusted with high-impact tasks.

• Machine learning security [123]: ML security studies reliability through the lens of threat models, formally defined assumptions about the capabilities and goals of an attacker. Most ML testing can be thought of as giving evidence about a system’s robustness against an “attacker” that can only present samples from a certain fixed distribution and that does not act with any particular goal in mind; adversarial examples use a threat model with a more powerful attacker. Through the systematic study of different threat models and methods for defense, ML security has the potential to give us a systematic understanding of the kinds of guarantees we can have about ML systems’ reliability, the costs and trade-offs implied by choosing one method of defense or another, and the ultimate prospects of making machine learning systems reliable in many kinds of situations. For example, we might study data poisoning attacks not only to learn how to defend against them, but also to make systems more robust against human error during training.
• Robustness to distributional shift [12345]: As noted above, machine learning systems trained on a particular distribution of inputs may perform arbitrarily poorly on inputs from outside that distribution. Such failures in very influential AI systems could cause significant harm, and more capable AI systems might be more likely to encounter large distributional shifts (since they could be deployed in a wider range of circumstances, and since their use might cause significant changes in the world). There are too many topics in machine learning relevant to distributional shift to review comprehensively here, but some examples include change or anomaly detection, unsupervised risk estimation, and KWIK learning.
• Understanding and defending against adversarial examples [1234567]: Adversarial examples are inputs that are optimized to cause models to perform incorrectly. Better understanding what makes models vulnerable to adversarial examples could help us understand reliability more broadly, and thwarting adversarial examples (via e.g. adversarial training or ensembling) could lead to models that are both resistant to attack and more reliable overall. Adversarial examples also indicate a mismatch between an AI system’s learned concepts and the desired human concepts, which could lead to problems in a variety of ways.
• Verification [1234]: One approach to reliability is to produce formal specifications of desirable properties and prove that an ML system meets those specifications. When verification is possible, it can provide more comprehensive guarantees than testing would; in addition, the process of developing formal specifications for intuitively appealing properties can clarify our understanding of those properties, and attempts to prove that a systems meets a specification may reveal bugs or design mistakes in the system itself. If properties like robustness against adversarial examples can be formalized, verification may be able to give us a higher level of confidence about a system’s reliability than would be possible through testing, and the ability to verify a wide range of properties in cutting-edge machine learning systems seems likely to be useful for building systems that can be trusted with high-impact tasks.

#### Interpretability

Interpretability of learned models [1234567]: Learned models are often extremely large and complex; if a learning method can produce models whose internal workings can be inspected and interpreted, or if we can develop tools to visualize or analyze the internal workings of a learned model, we might be able to better understand how models work, which changes to inputs would result in changed outputs, how the model’s decision depends on its training and data, and why we should or should not trust the model to perform well. An orthogonal type of interpretability is interpretability of the training process itself, where visualizations and other tools can help us understand how the training method works and when it will be reliable (see e.g. this article on momentum). Both types of interpretability seem to offer some access to the reasons that an AI system makes one decision or another, potentially revealing problems that would be difficult to uncover through testing.

In terms of its application to alignment, interpretability could help us to understand how AI systems work and how they may fail, misbehave, or otherwise not meet our expectations; for example, our understanding is that work on interpretability played a key role in the discovery of adversarial examples. The ability to interpret a system’s decision-making process may also help significantly with validation or supervision; for example, if a learned reward function is interpretable, we may be able to tell whether or not it will motivate desirable behavior, and a human supervisor may be able to better supervise an interpretable agent by inspecting its decision-making process. Finding new methods to train interpretable AI systems or to interpret trained models, especially methods that will scale to very complex and capable models, seems like a promising topic for AI alignment research.