The Open Philanthropy Project AI Fellows Program aims to fully support a small group of the most promising PhD students in artificial intelligence and machine learning. Increasing the probability of positive outcomes from progress in AI is a major priority of the Open Philanthropy Project, and we believe that supporting talented young researchers is among the best ways to do this.
Applications for 2018 are now closed.
- The AI Fellows Program is open to full-time AI and machine learning students in any year of their PhD, and to people applying to start a PhD in Fall 2018. The program is open to applicants in any country.
- Support will include a $40,000 per year stipend, payment of tuition and fees, and an additional $10,000 in annual support for travel, equipment, and other research expenses.
- Each AI fellow will be funded from the year they receive the fellowship through the end of the 5th year of their PhD, with the possibility of renewal for subsequent years. We encourage applications from 5th-year students, who will be supported on a year-by-year basis.
- The number of fellows we fund will depend on the applicants, but our best guess is that the initial class will contain 5-10 fellows.
- Applications and letters of recommendation are due by November 3, 2017 at 11:59 PM Pacific time. Support for fellows will begin in Fall 2018.
Questions? Please contact firstname.lastname@example.org.
We believe that progress in artificial intelligence may eventually lead to changes in human civilization that are at least as large as the agricultural or industrial revolutions. We think it’s most likely that this would lead to significant improvements in human well-being, but we also see significant risks. (For more on the Open Philanthropy Project’s views on the importance of AI, see this blog post.)
Increasing the probability of positive outcomes from transformative progress in AI is a major priority of the Open Philanthropy Project, and we believe that supporting talented young researchers is among the best ways to do this.
To this end, we’ve created the AI Fellows Program, which will support a small group of the most promising PhD students in artificial intelligence and machine learning. Fellows will be selected based on their academic excellence, technical knowledge, and interest in increasing the probability of positive outcomes from transformative AI. Our aim is to give these researchers the freedom to think through which kinds of work are most likely to be valuable, to share ideas and form a community with like-minded students and professors, and ultimately to act in the way that they think is most likely to improve outcomes from transformative AI.
As part of our planning process for the AI Fellows Program, we consulted with the Fannie and John Hertz Foundation. We believe the Hertz Graduate Fellowship Award is among the very best programs to support PhD students, and modeled aspects of the AI Fellows Program – specifically, its 5-year support and focus on community-building – after the Hertz Fellowship. The Hertz Foundation offered this comment in support of the AI Fellows Program:
“For the last 60 years the Hertz Foundation has supported future leaders in science, mathematics, and engineering, and we are honored that the Open Philanthropy Project has joined us in our commitment to early stage scientific research with its new AI Fellows Program,” said Robbee Baker Kosak, president, the Fannie and John Hertz Foundation. “There is no doubt that AI will play a significant role in the future of our country and the world, making it imperative that the best and brightest students are empowered to study critical technical problems today and that a community of experts in AI is established as this technology is emerging. We look forward to meeting the first recipients of this promising fellowship and are excited to see what they can achieve together.”
The AI Fellows Program is open to students interested in any research topic within AI. We think it is likely that many topics contain research problems that are both tractable today and likely to be important for positive outcomes from transformative AI, and want to encourage students to form their own judgements about which topics seem most promising. To help applicants get started, this page includes a non-exhaustive list of examples of topics that we’ve encountered as we’ve explored this area. We expect that AI fellows will develop many promising research directions beyond those we happen to have listed, and our basic aim is to support excellent researchers who consider potential risks from transformative AI to be a central part of their job, regardless of whether their interests fall into a category we happen to have listed.
Program details and how to apply
Dates: Applications and letters of recommendation are due by November 3, 2017 at 11:59 PM Pacific time. Support for fellows will begin in Fall 2018.
Who can apply: The AI Fellows Program is open to full-time AI and machine learning students in any year of their PhD, and to people applying to start a PhD in Fall 2018 (e.g. undergraduate seniors or students transferring to AI from another field). The program is open to applicants in any country. The number of fellows we fund will depend on the applicants, but our best guess is that the initial class will contain 5-10 fellows.
Support: Support will include a $40,000 per year stipend, payment of tuition and fees, and an additional $10,000 in annual support for travel, equipment, and other research expenses. Each AI fellow will be funded from the year they receive the fellowship through the end of the 5th year of their PhD, with the possibility of renewal for subsequent years. We encourage applications from 5th-year students, who will be supported on a year-by-year basis.
Application materials:
- Personal research statement, no more than 2 pages (including references); this document should outline some ideas for research you might conduct. Your research statement is not intended to be a binding description, and we expect that many fellows will end up working on different projects than those they described in their personal research statement.
- Curriculum vitae.
- Academic transcript.
- Up to 3 letters of recommendation (emailed to email@example.com).
- Up to 2 of your most representative publications.
- Optional: links to blogs, project websites, Github, or other representative online materials.
During the later stages of the selection process, we will request interviews with some applicants.
What will we ask of fellows? We’ll require fellows to be available for one or two meetings with us each year. We’ll also strongly encourage fellows to attend a fellows’ gathering once per year (though this will not be a condition of the fellowship). We encourage applicants to gauge whether we are a good fit for them throughout the application process, and hope that if fellows share our vision and interests, it will be natural and useful for fellows to stay in touch with us and one another through email and occasional meetings.
Flexibility: Our goal will be to support AI Fellows at the research environment or institution they think will be best for them. For example, fellows will be free to transfer their funding to a new research group or university, or to defer use of funds while they pursue internships. If a fellow wants to pursue their research independently or in a non-academic setting, we will do our best to offer support for their continued work.
Community: We think the best research, and ultimately the most impact, often comes from groups with a culture of trust, debate, excitement, and intellectual excellence. The intent of the AI Fellows Program is not only to support a small group of promising students, but also to foster this kind of research community. In order to do this, we plan to host gatherings once or twice per year where fellows can get to know one another, learn about each other’s work, and connect with other researchers who work in this area (some of whom we fund), including researchers at the Center for Human-Compatible AI, the Montreal Institute for Learning Algorithms, Stanford, UC Berkeley, DeepMind, Google Brain, and OpenAI. We hope that AI fellows will form lasting working relationships with each other in order to pursue a shared interest in increasing the probability of positive outcomes from transformative AI.
Questions? Please contact firstname.lastname@example.org.
(Appendix) Example topics: AI alignment
Caveats for the rest of this page:
- We expect and encourage applications on topics we haven’t listed. Our basic aim is to enable excellent researchers to think through for themselves which kinds of work are most likely to be valuable, both independently and as part of a community. We expect that AI fellows will develop many promising research directions beyond those we happen to have listed here; we’re providing these topics as examples of research directions that seem promising to us in the hopes that this will be helpful to applicants.
- This topic list should not be mistaken for a literature review. Within each topic, we give a few examples of papers on that topic. These lists are probably not the best or most representative citations one would choose for a literature review; they were chosen because the author of this page (Daniel Dewey) had read them or because they had been suggested to us by a researcher, and it seems to us that our citations in most topics over-represent recent papers, deep learning papers, and papers by authors we have talked to. We strongly expect that AI fellows will find relevant threads of work for each topic that are very different from the papers we list here.
The topics we list below are centered around “AI alignment,” the problem of creating AI systems that will reliably do what their users want them to do even when AI systems become much more capable than their users across a broad range of tasks. Several research groups have recently been formed around this problem, including the Center for Human-Compatible AI, teams at DeepMind and OpenAI [1, 2], and a project at the Montreal Institute for Learning Algorithms. We believe that there is important work to be done on this question in many parts of AI and machine learning.
We believe that AI alignment is important because we expect that progress in AI will eventually produce systems that are much more capable than any human or group of humans across a broad range of tasks, and that there is a reasonable chance that this will happen in the next few decades. In this case, AI systems acting on behalf of their users would probably come to be responsible for the vast majority of human influence over the world, likely causing dramatic and hard-to-predict changes in society. It would then be very important for AI systems to reliably do what their users want them to do. The risks highlighted by Nick Bostrom in the book Superintelligence are one example of the potential consequences of failing to align sufficiently transformative AI systems with human values. However, given the huge variety of possible futures and the difficulty of making specific predictions about them, we think it’s worthwhile to consider a wide range of possible scenarios and outcomes beyond those Bostrom describes. In order to be satisfying, a method for making AI systems that “reliably do what their users want them to do” should apply to these kinds of highly capable and influential AI systems.
What kinds of research seem most likely to be important for AI alignment? We look at this question by imagining that these powerful AI systems are created using the same general methodologies that are most often used today, thinking about how these methods might not result in aligned AI systems, and trying to figure out what kinds of research progress would be most likely to mitigate these problems.
To give you an idea of the kinds of research that currently seem promising to us, we’ve gathered a non-exhaustive list of example research topics from our technical advisors (machine learning researchers at OpenAI, Google Brain, and Stanford) and from other AI and machine learning researchers interested in AI alignment. We found that many of these topics fit into three broad categories:
- Reward learning: Most AI systems today are trained to optimize a well-defined objective (e.g. reward or loss function); this works well in some research settings where the intended goal is very simple (e.g. Atari games, Go, and some robotics tasks), but for many real-world tasks that humans care about, the intended goal or behavior is too complex to be specified directly. For very capable AI systems, pursuit of an incorrectly specified goal would not only lead an AI system to do something other than what we intended, but could lead the system to take harmful actions – e.g. the oversimplified goal of “maximize the amount of money in this bank account” could lead a system to commit crimes. If we could instead learn complex objectives, we could apply techniques like reinforcement learning to a much broader range of tasks without incurring these risks. Can we design training procedures and objectives that will cause AI systems to learn what we want them to do?
- Reliability: Most training procedures optimize a model or policy to perform well on a particular training distribution. However, once an AI system is deployed, it is likely to encounter situations that are outside the training distribution or that are adversarially generated in order to manipulate the system’s behavior, and it may perform arbitrarily poorly on these inputs. As AI systems become more influential, reliability failures could be very harmful, especially if failures result in an AI system learning an objective incorrectly. Can we design training procedures and objectives that will cause AI systems to perform as desired on inputs that are outside their training distributions or that are generated adversarially?
- Interpretability: Learned models are often extremely large and complex; if a learning method can produce models whose internal workings can be inspected and interpreted, or if we can develop tools to visualize or analyze the dynamics of a learned model, we might be able to better understand how models work, which changes to inputs would result in changed outputs, how the model’s decision depends on its training and data, and why we should or should not trust the model to perform well. Interpretability could help us to understand how AI systems work and how they may fail, misbehave, or otherwise not meet our expectations. The ability to interpret a system’s decision-making process may also help significantly with validation or supervision; for example, if a learned reward function is interpretable, we may be able to tell whether or not it will motivate desirable behavior, and a human supervisor may be able to better supervise an interpretable agent by inspecting its decision-making process.
It’s important to note that not all of the research topics that seem promising to us fit obviously into one of these categories, and we expect researchers to find more categories and research topics in the future. Some additional problems are gathered in Concrete Problems in AI Safety, a research agenda published in 2016 by researchers at Google Brain, Stanford, Berkeley, and OpenAI (four of whom work with us as technical advisors). More thoughts can be found in blog posts by two of the authors [1, 2].
Despite the incompleteness of these categories, the rest of this document elaborates further on reward learning, reliability, and interpretability.
Reward learning
Can we design training procedures and objectives that will cause AI systems to learn what we want them to do? In our conversations with researchers, a few plausible desiderata for reward learning methods have come up:
- Convergence in the limit of data and computation: As our AI capabilities improve, it becomes more likely that AI systems will find solutions closer to global optima; for example, much more capable game-playing agents would be more likely to find bugs that allow them to set their scores directly. This means that we should be very careful about assumptions of convergence that depend on AI systems’ limitations, which puts pressure on the reward learning process to guarantee convergence on desirable behavior in the limit of data and computation.
- Corrigibility [1, 2]: At every stage of learning, it would be desirable for human operators to be able to correct or override an AI system’s decisions. This might be achieved via life-long reward learning (where human operators’ actions are used as data about the AI system’s objective).
- Scalable supervision: If objectives are very complex and human-generated data is an AI system’s main source of information about objectives, it will be important for AI systems to use this data efficiently; human-generated data is expensive, and human supervision (generated in response to an AI system’s actions) is especially expensive.
- Capacity to exceed human performance: An AI system that always takes the action a human would select can be thought of as a theoretical baseline for AI alignment; while it will never take actions that its operators would disapprove of, it will not outperform them at any task. A successful reward learning procedure should allow an AI system to strongly exceed its operators’ capabilities, while still doing “what they want”.
- Learning from side information: Some existing data sources, e.g. videos or descriptions of human behavior, human utterances, and human-generated text, seem to contain large amounts of information about likely objectives. It would be desirable for reward learning methods to be able to make use of this data.
Below we list some example topics that seem promising for reward learning work. Not all work on these topics will be relevant to AI alignment, but each seems to hold some potential for learning more about how we can design very capable AI systems that learn to do what their operators want them to do:
- Imitation learning [1, 2, 3, 4, 5]: One basic approach to reward learning is to learn to imitate human actions. Alignment-relevant problems include surpassing human performance, deciding which details of a human’s actions are important to imitate and which should be ignored, and dealing with the differences between the actions available to a human and the actions available to a highly capable AI system.
- Inverse reinforcement learning [1, 2, 3]: Another possible approach is to model a human demonstrator as a semi-rational agent, infer the demonstrator’s goal from their actions, and then act to help them achieve that goal; this should allow an AI system to exceed the demonstrator’s capabilities. One alignment-relevant problem is that IRL is heavily dependent on a model relating the demonstrator’s actions to their values, so a mis-specified model could lead to arbitrarily bad behavior.
- Learning from human feedback [1, 2, 3]: Another option is to elicit human feedback about each action an AI system takes and to train it to choose actions that it predicts would receive positive feedback. Alignment-relevant problems include significantly surpassing human performance (since humans may have a limited ability to judge actions that are significantly better than those they’d choose themselves), the unreliability of human feedback, the difficulty of understanding an AI system’s behavior well enough to judge it, and the high expense of human feedback data.
- Other ways of learning underspecified tasks [1, 2]: It seems plausible that some new method for learning from humans or human-generated data could be more suitable for AI alignment than current methods are.
- Scalable supervision [1, 2, 3]: Many reward learning approaches depend on human-generated data. This data can be expensive, and human supervision (e.g. feedback on a system’s performance) is extremely expensive. In order to learn complex objectives, an AI system will need to use human-generated data very efficiently, and maximize the information about its objective that it gains from other data sources.
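As a rough illustration of learning from human feedback, the sketch below fits a reward function to pairwise comparisons of the form "this outcome was preferred to that one". It assumes a Bradley-Terry preference model and a linear reward, neither of which is prescribed by the topics above. The simulated "human", the feature vectors, and all function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def reward(w, features):
    """Linear reward model (an assumption made for this sketch)."""
    return sum(wi * xi for wi, xi in zip(w, features))


def fit_reward(comparisons, dim, lr=0.5, epochs=200):
    """Fit reward weights by gradient ascent on the Bradley-Terry log-likelihood.

    comparisons: list of (preferred_features, rejected_features) pairs.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for better, worse in comparisons:
            # Modeled probability that `better` is the preferred item.
            p = sigmoid(reward(w, better) - reward(w, worse))
            for i in range(dim):
                grad[i] += (1.0 - p) * (better[i] - worse[i])
        w = [wi + lr * g / len(comparisons) for wi, g in zip(w, grad)]
    return w


def make_comparisons(true_w, n, rng):
    """Simulate a noisy human who tends to prefer higher true reward."""
    comparisons = []
    for _ in range(n):
        a = [rng.uniform(-1, 1) for _ in true_w]
        b = [rng.uniform(-1, 1) for _ in true_w]
        p_a = sigmoid(4.0 * (reward(true_w, a) - reward(true_w, b)))
        comparisons.append((a, b) if rng.random() < p_a else (b, a))
    return comparisons


rng = random.Random(0)
true_w = [2.0, -1.0]  # hidden objective, used only to simulate feedback
data = make_comparisons(true_w, 500, rng)
learned_w = fit_reward(data, dim=2)
print(learned_w)
```

Even in this toy setting, the learned weights are only as good as the comparisons: noisy or systematically mistaken feedback shifts the fitted reward, which is one face of the unreliability and expense problems noted above.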
Reliability
Can we design training procedures and objectives that will cause AI systems to perform as desired on inputs that are outside their training distributions or that are generated adversarially? Again, a few plausible desiderata have come up in our conversations with researchers:
- Robustness to distributional shift: Can we train systems to perform as desired on inputs that are sampled from distributions that are meaningfully distinct from the system’s training distribution?
- Robustness to adversarial inputs: Can we train systems to perform as desired on inputs that are specifically designed to cause incorrect decisions?
- Calibration: Can we train systems to accurately assess their own uncertainty, including in situations where their test data is drawn from a meaningfully different sort of data set than their training data? If so, we may be able to design very capable AI systems to act conservatively in situations where they are likely to make mistakes.
- Robustness to model mis-specification: Some kinds of learning problems could be caused by incorrect modelling classes. For example, in some cases we may want to learn about something that cannot be observed directly, but that instead is an internal or derived feature of a model of some observable data (e.g. the preferences of a human, which can hopefully be inferred from their actions). If we mis-specify a system’s model, it may not learn these latent variables correctly, or it might learn them correctly on a training distribution but not during deployment. To what extent can we make AI systems robust against, or at least aware of, model mis-specification?
- Robustness to training errors or data-set poisoning: Training data will sometimes contain errors, or may be contaminated by an adversary seeking to manipulate the trained system’s behavior. Can we train AI systems in a way that is robust to some amount of compromised training data?
- Safety during learning and exploration: While an AI system is learning, and especially when it’s learning an objective, it could make poor decisions based on what it has learned so far. It would be desirable to learn the most important facts about the environment and objective first and to act “conservatively” in the face of uncertainty.
- Extreme reliability for critical errors: Certain kinds of errors may cause such significant harm that we would be willing to pay very high costs to make those errors extremely unlikely. What kinds of training procedures can we use for these cases, and what kinds of guarantees could we achieve?
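To make the calibration desideratum above more concrete, the sketch below computes expected calibration error (ECE), one common way to measure the gap between a model's stated confidence and its actual accuracy. The choice of ECE, the equal-width binning, and the example numbers are all assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy over equal-width bins.

    confidences: predicted probabilities in [0, 1] for the chosen label.
    correct: booleans, True where the prediction was right.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Hypothetical predictions: the model always reports 90% confidence.
well_calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
print(well_calibrated, overconfident)
```

In the first case the model is right 9 times out of 10 and the error is close to zero; in the second it is right only half the time and the error is about 0.4. A measure like this is only informative about the distribution it is evaluated on, so it does not by itself address calibration under distributional shift.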
Below we list some example topics that seem promising for reliability. As above, not all work on these topics will be relevant to AI alignment, but each seems to hold some potential for learning more about how we can design very capable AI systems that behave reliably enough to be trusted with high-impact tasks.
- Machine learning security [1, 2, 3]: ML security studies reliability through the lens of threat models, formally defined assumptions about the capabilities and goals of an attacker. Most ML testing can be thought of as giving evidence about a system’s robustness against an “attacker” that can only present samples from a certain fixed distribution and that does not act with any particular goal in mind; adversarial examples use a threat model with a more powerful attacker. Through the systematic study of different threat models and methods for defense, ML security has the potential to give us a systematic understanding of the kinds of guarantees we can have about ML systems’ reliability, the costs and trade-offs implied by choosing one method of defense or another, and the ultimate prospects of making machine learning systems reliable in many kinds of situations. For example, we might study data poisoning attacks not only to learn how to defend against them, but also to make systems more robust against human error during training.
- Robustness to distributional shift [1, 2, 3, 4, 5]: As noted above, machine learning systems trained on a particular distribution of inputs may perform arbitrarily poorly on inputs from outside that distribution. Such failures in very influential AI systems could cause significant harm, and more capable AI systems might be more likely to encounter large distributional shifts (since they could be deployed in a wider range of circumstances, and since their use might cause significant changes in the world). There are too many topics in machine learning relevant to distributional shift to review comprehensively here, but some examples include change or anomaly detection, unsupervised risk estimation, and KWIK learning.
- Understanding and defending against adversarial examples [1, 2, 3, 4, 5, 6, 7]: Adversarial examples are inputs that are optimized to cause models to perform incorrectly. Better understanding what makes models vulnerable to adversarial examples could help us understand reliability more broadly, and thwarting adversarial examples (via e.g. adversarial training or ensembling) could lead to models that are both resistant to attack and more reliable overall. Adversarial examples also indicate a mismatch between an AI system’s learned concepts and the desired human concepts, which could lead to problems in a variety of ways.
- Verification [1, 2, 3, 4]: One approach to reliability is to produce formal specifications of desirable properties and prove that an ML system meets those specifications. When verification is possible, it can provide more comprehensive guarantees than testing would; in addition, the process of developing formal specifications for intuitively appealing properties can clarify our understanding of those properties, and attempts to prove that a system meets a specification may reveal bugs or design mistakes in the system itself. If properties like robustness against adversarial examples can be formalized, verification may be able to give us a higher level of confidence about a system’s reliability than would be possible through testing, and the ability to verify a wide range of properties in cutting-edge machine learning systems seems likely to be useful for building systems that can be trusted with high-impact tasks.
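The relationship between adversarial examples and verification can be sketched on a deliberately simple model. The example below applies the fast gradient sign method (a standard attack chosen here for illustration, not one named above) to a logistic-regression classifier, and computes the largest per-feature perturbation for which the prediction provably cannot change. The weights, input, and function names are hypothetical, and the exact guarantee shown holds only because the model is linear.

```python
import math


def logit(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict_prob(w, b, x):
    return 1.0 / (1.0 + math.exp(-logit(w, b, x)))


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """Move each feature by eps in the direction that increases the loss."""
    p = predict_prob(w, b, x)
    # Gradient of the cross-entropy loss with respect to the input.
    grad = [(p - y) * wi for wi in w]
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


def certified_radius(w, b, x):
    """Per-feature perturbations smaller than this cannot flip a linear classifier."""
    return abs(logit(w, b, x)) / sum(abs(wi) for wi in w)


w, b = [1.0, -2.0, 0.5], 0.0   # hypothetical trained weights
x, y = [0.3, -0.2, 0.4], 1     # an input the model classifies correctly

radius = certified_radius(w, b, x)
x_adv = fgsm(w, b, x, y, eps=0.3)    # larger than the certified radius
x_safe = fgsm(w, b, x, y, eps=0.2)   # smaller than the certified radius

print(radius)
print(predict_prob(w, b, x), predict_prob(w, b, x_adv), predict_prob(w, b, x_safe))
```

The attack changes the predicted class only when the perturbation budget exceeds the certified radius. For deep networks no such closed-form bound is available, which is part of why verification of modern ML systems is an open research topic.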
Interpretability
Interpretability of learned models [1, 2, 3, 4, 5, 6, 7]: Learned models are often extremely large and complex; if a learning method can produce models whose internal workings can be inspected and interpreted, or if we can develop tools to visualize or analyze the internal workings of a learned model, we might be able to better understand how models work, which changes to inputs would result in changed outputs, how the model’s decision depends on its training and data, and why we should or should not trust the model to perform well. An orthogonal type of interpretability is interpretability of the training process itself, where visualizations and other tools can help us understand how the training method works and when it will be reliable (see e.g. this article on momentum). Both types of interpretability seem to offer some access to the reasons that an AI system makes one decision or another, potentially revealing problems that would be difficult to uncover through testing.
In terms of its application to alignment, interpretability could help us to understand how AI systems work and how they may fail, misbehave, or otherwise not meet our expectations; for example, our understanding is that work on interpretability played a key role in the discovery of adversarial examples. The ability to interpret a system’s decision-making process may also help significantly with validation or supervision; for example, if a learned reward function is interpretable, we may be able to tell whether or not it will motivate desirable behavior, and a human supervisor may be able to better supervise an interpretable agent by inspecting its decision-making process. Finding new methods to train interpretable AI systems or to interpret trained models, especially methods that will scale to very complex and capable models, seems like a promising topic for AI alignment research.
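One very simple interpretability tool, sketched below, asks which changes to inputs would change a model's output by estimating the sensitivity of the output to each input feature. The finite-difference approach and the toy model are assumptions made for illustration; practical work on large models typically relies on much richer methods.

```python
import math


def saliency(model, x, h=1e-5):
    """Central-difference estimate of d model(x) / d x[i] for each feature i."""
    scores = []
    for i in range(len(x)):
        up = list(x)
        up[i] += h
        down = list(x)
        down[i] -= h
        scores.append((model(up) - model(down)) / (2 * h))
    return scores


def toy_model(x):
    # Hypothetical learned model: depends on x[0] and x[1], ignores x[2].
    hidden = math.tanh(2.0 * x[0] - 0.5 * x[1])
    return 3.0 * hidden


scores = saliency(toy_model, [0.1, 0.2, 5.0])
print(scores)
```

Here the sensitivity scores show that the third feature has no influence on the output even though its value is large, the kind of fact that is easy to read off a sensitivity analysis but hard to establish through testing alone.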