The Open Phil AI Fellowship is a fellowship for full-time PhD students focused on artificial intelligence or machine learning.
With this program, we seek to fully support a small group of the most promising PhD students in AI and ML who are interested in research that makes it less likely that advanced AI systems pose a global catastrophic risk. Fellows receive a $40,000 stipend, $10,000 in research support, and payment of tuition and fees, each year, starting in the year of their selection until the end of the 5th year of their PhD.
Applications are now closed.
Decisions will be sent out before March 31, 2022. If you have questions or concerns, please email [email protected].
Read on for more information about the Open Phil AI Fellowship.
As a philanthropic funder seeking to do as much good as we can, we see the development of artificial intelligence as a particularly important cause area. We believe that progress in this area could be “transformative”, i.e., it could lead to changes in human civilization as large as the agricultural or industrial revolutions, and that researchers today can do meaningful work to increase the probability of positive outcomes. While we think it’s most likely that these changes would lead to significant improvements in human well-being, we also see significant risks. We’re particularly worried about the potential for advanced AI systems to be a source of global catastrophic risks, scenarios that are globally destabilizing enough to permanently worsen humanity’s future, or cause humanity’s extinction.
Given these views, increasing the probability of positive outcomes from transformative progress in AI is a major priority of Open Philanthropy, and we believe that supporting talented young researchers is among the best ways to do this.
To this end, the Open Phil AI Fellowship supports a small group of the most promising PhD students in artificial intelligence and machine learning. Fellows will be selected based on their academic excellence, technical knowledge, and interest in pursuing research directions that reduce the probability that advanced AI systems pose a global catastrophic risk.
This page includes a non-exhaustive list of examples of topics that we’ve encountered as we’ve explored this area, but we expect that AI Fellows will develop many promising research directions beyond those we happen to have listed. We are interested in supporting students where there is a compelling case for why their future research work will reduce global catastrophic risk, regardless of whether their research interests fall into a category we happen to have listed.
Who can apply: The Open Phil AI Fellowship is open to full-time AI and machine learning students in any year of their PhD. Anyone who expects to be enrolled in a PhD program is welcome to apply (including undergraduate seniors applying to AI or ML PhD programs). The program is open to applicants in any country. Students with pre-existing funding sources are welcome to apply, as are students transferring to an AI/ML PhD from another field. If you aren’t sure whether you’re eligible, please feel free to ask: [email protected]. Our best guess is that we will fund 3-8 applicants each year.
Support: Support includes a $40,000 per year stipend, payment of tuition and fees, and an additional $10,000 in annual support for travel, equipment, and other research expenses. Fellows will be funded through the end of the 5th year of their PhD, with the possibility of renewal for subsequent years.
How the fellowship works: Fellows have a broad mandate to think through which kinds of AI and ML research are likely to be most valuable, to share ideas and form a community with like-minded students and professors, and ultimately to act in the way that they think is most likely to improve outcomes from progress in AI.
Flexibility: We do not impose any required internships, restrictions on where fellows may work as an intern, or IP restrictions around fellows’ research. Our goal is to support AI Fellows at the research environment or institution they think will be best for them. For example, fellows will be free to transfer their funding to a new research group or university, or to defer use of funds while they pursue internships. If a fellow wants to pursue their research independently or in a non-academic setting, we will do our best to offer support for their continued work.
How to apply
The application consists of:
- A two-page personal research statement (excluding references). See the section below for more information about the personal research statement.
- An up-to-date curriculum vitae.
- Up to three letters of recommendation.
Applications are now closed.
We encourage applications from 5th-year students, who will be supported on a year-by-year basis; students who will be starting their PhD in the upcoming year; and students with pre-existing funding sources who find the mission and community of the Open Phil AI Fellowship appealing. We are committed to fostering a culture of inclusion, and encourage individuals with diverse backgrounds and experiences to apply; we especially encourage applications from women and minorities.
Applicants to this fellowship will also automatically be considered for Open Philanthropy’s program for early-career funding for individuals interested in improving the long-term future. This program aims to provide support, including for graduate study, to early-career individuals who want to pursue careers that help improve the long-term future.
As a note, we will not be able to give personalized feedback on applications.
Questions? Please contact [email protected].
About the personal research statement
Students often ask us about the personal research statement: what to assume about the readers, and what should be contained in the statement. We’ll describe here what we’re looking for.
The statement will be read by the program staff (Nick Beckstead and Asya Bergal) and by a select number of external advisors who are machine learning researchers. Nick and Asya are comfortable with the technical broad strokes of the AI and ML field and many subfields, but should not be considered experts in any specific discipline.
The statement should contain the following:
- A description of the research topics that you are interested in or care about
- Why you are interested in or care about this work
- How this work relates to Open Philanthropy’s interest in potential global catastrophic risks from advanced AI systems
We are ultimately interested in supporting students whose work makes it less likely that advanced AI systems pose a global catastrophic risk. However, we think seeing students’ motivations and reasoning can be more indicative than their immediate research plans. Accordingly, the personal research statement should either a) indicate to us that your research plans are on topics we think are directly valuable, or b) demonstrate to us that you are reasoning carefully about how your research is relevant to the risks we care about.
Descriptions of research interests are not a binding commitment of any sort, and recipients may change topics within AI/ML throughout their PhD studies as they see fit. Specific detail is helpful to the extent that it helps us understand what you care about: any given word or phrase describing a topic (such as “interpretability” or “generalization”) could mean different things to different researchers, and so specific detail about problem formulation, or past or intended future experiments, helps us better understand what you mean.
Current Open Phil AI Fellows
Collin is an incoming PhD student in Computer Science at UC Berkeley. He is broadly interested in doing foundational work to make AI systems more trustworthy, aligned with human values, and helpful for human decision making. He is especially excited about using language to control and interpret machine learning models more effectively. Collin received his B.A. in Computer Science from Columbia University. For more information, see his website.
Jared Quincy Davis
Jared Quincy Davis is a PhD Student in Computer Science at Stanford University. His research asks what progress is necessary for the most compelling advances of machine learning (e.g. those that powered AlphaGo) to be applied more broadly and extensively in real-world, non-stationary, multi-particle domains with nth order dynamics. Jared is motivated by the potential of such AI advances to accelerate the rate of progress in, and adoption of, technology. Thus far, his work has focused on addressing the specific computational complexity, memory, and optimization challenges that arise when learning high resolution representations of the dynamics within complex systems. Jared’s fundamental research has thus far been applied to great effect in problems spanning structural biology, industrial systems control, robotic planning and navigation, natural language processing, and beyond. To learn more about his research, visit his scholar page.
Meena Jagadeesan is a first-year PhD student at UC Berkeley, advised by Moritz Hardt, Michael I. Jordan, and Jacob Steinhardt. She aims to develop theoretical foundations for machine learning systems that account for economic and societal effects, especially in strategic or dynamic environments. Her work currently focuses on reasoning about the incentives created by decision-making systems and on ensuring fairness in multi-stage systems. Meena completed her Bachelor’s and Master’s degrees at Harvard University in 2020, where she studied computer science, mathematics, and statistics. For more information, visit her website.
Jesse is a PhD student in Computer Science at Stanford University, advised by Noah Goodman and affiliated with the Stanford NLP Group and Stanford AI Lab. He is interested in using language and communication to improve the interpretability and generalization of machine learning models, especially in multimodal or embodied settings. Previously, Jesse received an MPhil in Advanced Computer Science from the University of Cambridge as a Churchill scholar, and a BS in Computer Science from Boston College. For more information, see his website.
Xuan (Sh-YEN) is a PhD student at MIT co-advised by Vikash Mansinghka and Joshua Tenenbaum. Their research sits at the intersection of AI, philosophy, and cognitive science, asking questions like: How can we specify and perform inference over rich yet structured generative models of human motivation and bounded reasoning, in order to accurately infer human goals and values? To answer these questions, Xuan’s work includes the development of probabilistic programming infrastructure, so as to enable fast and flexible Bayesian inference over complex models of agents and their environments. Prior to MIT, Xuan worked with Desmond Ong at the National University of Singapore on deep generative models, and Brian Scassellati at Yale on human-robot interaction. They graduated from Yale with a B.S. in Electrical Engineering & Computer Science.
Alex is a PhD student in Computer Science at Stanford University, where he is advised by Noah Goodman and a member of the Stanford NLP Group. Alex’s research focuses on unsupervised learning: how can we understand and guide systems that learn general-purpose representations of the world? Alex received his B.S. in Computer Science from Stanford, and has spent time at Google Research on the Brain and Language teams. For more information, visit his website.
Clare is pursuing a PhD in Computer Science at the University of Oxford as a Rhodes Scholar, advised by Yarin Gal and Marta Kwiatkowska. She is interested in developing theoretical tools to better understand the generalization properties of modern ML methods, and in creating principled algorithms based on these insights. She received a BSc in mathematics and computer science from McGill University in 2018. For more information, see her website.
Cody is a computer science Ph.D. student at Stanford University, advised by Professors Matei Zaharia and Peter Bailis. His research focuses on democratizing machine learning by reducing the cost of producing state-of-the-art models and creating novel abstractions that simplify machine learning development and deployment. His work spans from performance benchmarking of hardware and software systems (i.e., DAWNBench and MLPerf) to computationally efficient methods for active learning and core-set selection. He completed his B.S. and M.Eng. in electrical engineering and computer science at MIT. For more information, visit his website.
Dami is a PhD student in computer science at the University of Toronto, supervised by David Duvenaud and Chris Maddison. Dami is interested in ways to make neural network training faster, more reliable, and more interpretable via inductive bias. Previously, she spent a year at Google as an AI Resident, working with George Dahl on studying optimizers, and speeding up neural network training. She obtained her Bachelor’s degree in Engineering Science from the University of Toronto. You can find more about Dami’s research in her scholar page.
Dan Hendrycks is a second-year PhD student at UC Berkeley, advised by Jacob Steinhardt and Dawn Song. His research aims to disentangle and concretize the components necessary for safe AI. This leads him to work on quantifying and improving the performance of models in unforeseen out-of-distribution scenarios, and he works towards constructing tasks to measure a model’s alignment with human values. Dan received his BS from the University of Chicago. You can find out more about his research at his website.
Ethan is a PhD student in Computer Science at New York University working with Kyunghyun Cho and Douwe Kiela of Facebook AI Research. His research focuses on developing learning algorithms that have the long-term potential to answer questions that people cannot. Supervised learning cannot answer such questions, even in principle, so he is investigating other learning paradigms for generalizing beyond the available supervision. Previously, Ethan worked with Aaron Courville and Hugo Larochelle at the Montreal Institute for Learning Algorithms, and he has also spent time at Facebook AI Research and Google. Ethan earned a Bachelor’s from Rice University as the Engineering department’s Outstanding Senior. For more information, visit his website.
Frances is a PhD student in Computer Science at UC Berkeley advised by Jacob Steinhardt and Moritz Hardt. Her research aims to improve the reliability of machine learning systems and ensure that they have positive, equitable social impacts. She is interested in developing algorithms that can handle dynamic environments and adaptive behavior in the real world, and in building empirical and theoretical understanding of modern ML methods. Frances received her B.A. in Biology from Harvard University and her M.Phil. in Machine Learning from the University of Cambridge. For more information, visit her website.
Leqi is a PhD student in machine learning at Carnegie Mellon University, where she is advised by Zachary Lipton. Her research aims to develop learning systems that can infer human preferences from their behaviors, and better facilitate humans to achieve their goals and well-being. In particular, she is interested in bringing theory from social sciences into algorithmic design. You can learn more about her research on her website.
Peter is a PhD student at Stanford University advised by Dan Jurafsky. His research focuses on creating robust decision-making systems grounded in causal inference mechanisms -- particularly in natural language domains. He also spends time investigating reproducible and thorough evaluation methodologies to ensure that such systems perform as expected when deployed. Peter’s other work reaches into policy and technical issues related to the use of machine learning in governance and law, as well as applications of machine learning for positive social impact. Previously he earned his B.Eng. and M.Sc. from McGill University with a thesis on reproducibility and reusability in deep reinforcement learning advised by Joelle Pineau and David Meger. For more information, see his website.
Stanislav is a PhD student at Stanford University, advised by Surya Ganguli. His research focuses on developing a scientific understanding of deep learning and on applications of machine learning and artificial intelligence in the physical sciences, in domains spanning from X-ray astrophysics to quantum computing. Stanislav spent a year as a Google AI Resident, where he worked on deep learning theories and their applications in collaboration with colleagues from Google Brain and DeepMind. He received his Bachelor’s and Master’s degrees in Physics at Trinity College, University of Cambridge, and a Master’s degree at Stanford University. For more information, visit his website.
Aidan is a doctoral student of Yarin Gal and Yee Whye Teh at the University of Oxford. He leads the research group FOR.ai, focusing on providing resources, mentorship, and facilitating collaboration between academia and industry. On a technical front, Aidan’s research pursues new methods of scaling individual neural networks towards trillions of parameters, and hundreds of tasks. On an ethical front, his work takes a humanist stance on machine learning applications and their risks. Aidan is a Student Researcher at Google Brain, working with Jakob Uszkoreit; Previously at Brain, he worked with Geoffrey Hinton and Łukasz Kaiser. He obtained his B.Sc from The University of Toronto with supervision from Roger Grosse.
Andrew Ilyas is a first-year PhD student at MIT working on machine learning. His interests are in building robust and reliable learning systems, and in understanding the underlying principles of modern ML methods. Andrew completed his B.Sc and MEng. in Computer Science as well as B.Sc. in Mathematics at MIT in 2018. For more information, see his website.
Julius is a PhD student in Computer Science at MIT. He is interested in provable methods to enable algorithms and machine learning systems exhibit robust and reliable behavior. Specifically, he is interested in constraints relating to privacy/security, bias/fairness, and robustness to distribution shift for agents and systems deployed in the real world. Julius received masters degrees in computer science and technology policy from MIT, where he looked at bias and interpretability of machine learning models. For more information, visit his website.
Lydia T. Liu
Lydia T. Liu is a PhD student in Computer Science at the University of California, Berkeley, advised by Moritz Hardt and Michael I. Jordan. Her research aims to establish the theoretical foundations for machine learning algorithms to have reliable and robust performance, as well as positive long-term societal impact. She is interested in developing learning algorithms with multifaceted guarantees and understanding their distributional effects in dynamic or interactive settings. Lydia graduated with a Bachelor of Science in Engineering degree from Princeton University. She is the recipient of an ICML Best Paper Award (2018) and a Microsoft Ada Lovelace Fellowship. For more information, visit her website.
Max Simchowitz is a PhD student in Electrical Engineering and Computer Science at UC Berkeley, co-advised by Benjamin Recht and Michael Jordan. He works on machine learning problems with temporal structure: either because the learning agent is allowed to make adaptive decisions about how to collect data, or because the agent’s the environment dynamically reacts to measurements taken. He received his A.B. in mathematics from Princeton University in 2015, and is a co-recipient of the ICML 2018 best paper award. You can find out more about his research on his website.
Pratyusha “Ria” Kalluri is a second year PhD student in Computer Science at Stanford, advised by Stefano Ermon and Dan Jurafsky. She is working towards discovering and inducing conceptual reasoning inside machine learning models. This leads her to work on interpretability, novel learning objectives, and learning disentangled representations. She believes this work can help shape a more radical and equitable AI future. Ria received her Bachelors degree in Computer Science at MIT in 2016 and was a Visiting Researcher at Complutense University of Madrid before beginning her PhD. For more information, visit her website.
Sidd is an incoming PhD student in Computer Science at Stanford University. He is interested in grounded language understanding, with a goal of building agents that can collaborate with humans and act safely in different environments. He is finishing up a one-year residency at Facebook AI Research in New York. He received his Sc.B. from Brown University, where he did research in human-robot interaction and natural language processing advised by Professors Stefanie Tellex and Eugene Charniak. You can find more information on his website.
Smitha is a 2nd year PhD student in computer science at UC Berkeley, where she is advised by Moritz Hardt and Anca Dragan. Her research aims to create machine learning systems that are more value-aligned. She focuses, in particular, on difficulties that arise from complexities of human behavior. For example, learning what a user prefers the system to do, despite “irrationalities” in the user’s behavior, or learning the right decisions to make, despite strategic adaptation from humans. For links to publications and other information, you can visit her website.
Aditi is a second year PhD student in Computer Science at Stanford University, advised by Percy Liang. She is interested in making Machine Learning systems provably reliable and fair, especially in the presence of adversaries. Aditi received her Bachelors degree in Computer Science and Engineering from IIT Madras in 2016. For links to publications and more information, please visit her website.
Chris is a DPhil student at the University of Oxford, supervised by Yee Whye Teh and Arnaud Doucet, and a Research Scientist at DeepMind. Chris is interested in the tools used for inference and optimization in scalable and expressive models. He aims to expand the range of such models by expanding the toolbox needed to work with them. Chris received his MSc. from the University of Toronto, working with Geoffrey Hinton. He received a NIPS Best Paper Award in 2014 and was one of the founding members of the AlphaGo project. For more information, visit his website.
Felix is a PhD student in Computer Science at ETH Zurich, working with Andreas Krause and Angela Schoellig (University of Toronto). He is interested in enabling robots to safely and autonomously learn in uncertain real-world environments, which requires new reinforcement learning algorithms that respect the physical limitations and constraints of dynamic systems and provide theoretical safety guarantees. He received his Masters degree in Mechanical Engineering from ETH Zurich in 2015. You can find out more about his research on his website.
Jon is a PhD student at the Massachusetts Institute of Technology in the Department of Brain and Cognitive Sciences, where he works with Roger Levy and Joshua Tenenbaum to build computational models of how people acquire and understand language. His research bridges between artificial intelligence, cognitive science, and linguistics in order to specify better concrete objectives for building language understanding systems. Before joining MIT, Jon studied at Stanford University and worked with Christopher Manning in the Stanford Natural Language Processing Group. He also spent time at Google Brain and OpenAI, where his advisors included Ilya Sutskever and Oriol Vinyals. You can find out more about Jon, including his blog and research articles, at his website.
Michael is an incoming PhD student at UC Berkeley. He is interested in reproducing humans’ flexible problem-solving abilities in machines, in particular through compositional representations. In June 2018, he will receive his Bachelors degree in computer science from MIT, where he worked with Professors Joshua Tenenbaum and Regina Barzilay. More information can be found on his website.
Noam is a PhD student in computer science at Carnegie Mellon University advised by Tuomas Sandholm. His research applies computational game theory to produce AI systems capable of strategic reasoning in imperfect-information multi-agent interactions. He has applied this research to creating Libratus, the first AI to defeat top humans in no-limit poker. Noam received a NIPS Best Paper award in 2017 and an Allen Newell Award for Research Excellence. Prior to starting a PhD, Noam worked at the Federal Reserve researching the effects of algorithm trading on financial markets. Before that, he developed algorithmic trading strategies. His papers and videos are available on his website.
Ruth is a PhD student in Engineering Science at the University of Oxford, where she is advised by Andrea Vedaldi. She is interested in understanding, explaining, and improving the internal representations of deep neural networks. Ruth received her Bachelors degree in Computer Science from Harvard University in 2015; she also earned a Masters degree in Neuroscience from Oxford in 2016 as a Rhodes Scholar. For more information about her research, visit her website.
Example research topics
Caveats for the rest of this page:
- We expect and encourage applications on topics we haven’t listed. Our basic aim is to enable excellent researchers to think through for themselves which kinds of work are most likely to be valuable, both independently and as part of a community. We expect that AI Fellows will develop many promising research directions beyond those we happen to have listed here; we’re providing these topics as examples of research directions that seem promising to us in the hopes that this will be helpful to applicants.
- This topic list should not be mistaken for a literature review. Within each topic, we give a few examples of papers on that topic. These lists are probably not the best or most representative citations one would choose for a literature review; they were chosen because the authors of this page had read them or because they had been suggested to us by a researcher, and it seems to us that our citations in most topics over-represent recent papers, deep learning papers, and papers by authors we have talked to. We strongly expect that AI Fellows will find relevant threads of work that are very different from the papers we list here.
The topics we list below are centered around “AI alignment,” the problem of creating AI systems that robustly try to do what their designers intended, even when AI systems become much more capable than their designers across a broad range of tasks. Several research groups are working on this problem including the Center for Human-Compatible AI, Jacob Steinhardt’s lab at UC Berkeley, teams at DeepMind and OpenAI [1, 2], the AI safety and research company Anthropic, the Machine Intelligence Research Institute, the AI Safety Research Group at the Future of Humanity Institute, and the Alignment Research Center.
We believe that AI alignment is important because we expect that progress in AI will eventually produce systems that are much more capable than any human or group of humans across a broad range of tasks, and that there is a reasonable chance that this will happen in the next few decades. In this case, AI systems acting on behalf of their designers would probably come to be responsible for the vast majority of human influence over the world, likely causing dramatic and hard-to-predict changes in society. It would then be very important for AI systems to robustly try to do what their designers intended; the risks highlighted by Nick Bostrom in the book Superintelligence are one example of the potential consequences of failing to align sufficiently transformative AI systems with human values, although given the huge variety of possible futures and the difficulty of making specific predictions about them, we think it’s worthwhile to consider a wide range of possible scenarios and outcomes beyond those Bostrom describes. In order to be satisfying, a method for making AI systems that “robustly try to do what their designers intended” should apply to these kinds of highly capable and influential AI systems.
What kinds of research seem most likely to be important for AI alignment? We look at this question by imagining that these powerful AI systems are created using the same general methodologies that are most often used today, thinking about how these methods might not result in aligned AI systems, and trying to figure out what kinds of research progress would be most likely to mitigate these problems.
To give you an idea of the kinds of research that currently seem promising to us, we’ve gathered a non-exhaustive list of example research topics from our technical advisors (machine learning researchers at OpenAI, Google Brain, and Stanford) and from other AI and machine learning researchers interested in AI alignment. We found that many of these topics fit into three broad categories:
- Reward learning: Most AI systems today are trained to optimize a well-defined objective (e.g. reward or loss function); this works well in some research settings where the intended goal is very simple (e.g. Atari games, Go, and some robotics tasks), but for many real-world tasks that humans care about, the intended goal or behavior is too complex to be specified directly. For very capable AI systems, pursuit of an incorrectly specified goal would not only lead an AI system to do something other than what we intended, but could lead the system to take harmful actions – e.g. the oversimplified goal of “maximize the amount of money in this bank account” could lead a system to commit crimes. If we could instead learn complex objectives, we could apply techniques like reinforcement learning to a much broader range of tasks without incurring these risks. Can we design training procedures and objectives that will cause AI systems to learn what we want them to do?
- Reliability: Most training procedures optimize a model or policy to perform well on a particular training distribution. However, once an AI system is deployed, it is likely to encounter situations that are outside the training distribution or that are adversarially generated in order to manipulate the system’s behavior, and it may perform arbitrarily poorly on these inputs. As AI systems become more influential, reliability failures could be very harmful, especially if failures result in an AI system learning an objective incorrectly. Can we design training procedures and objectives that will cause AI systems to perform as desired on inputs that are outside their training distributions or that are generated adversarially?
- Interpretability: Learned models are often extremely large and complex; if a learning method can produce models whose internal workings can be inspected and interpreted, or if we can develop tools to visualize or analyze the dynamics of a learned model, we might be able to better understand how models work, which changes to inputs would result in changed outputs, how the model’s decision depends on its training and data, and why we should or should not trust the model to perform well. Interpretability could help us to understand how AI systems work and how they may fail, misbehave, or otherwise not meet our expectations. The ability to interpret a system’s decision-making process may also help significantly with validation or supervision; for example, if a learned reward function is interpretable, we may be able to tell whether or not it will motivate desirable behavior, and a human supervisor may be able to better supervise an interpretable agent by inspecting its decision-making process.
It’s important to note that not all of the research topics that seem promising to us fit obviously into one of these categories, and we expect researchers to find more categories and research topics in the future.
Some additional problems and research directions we’re interested in can be found in:
- Concrete Problems in AI Safety, a research agenda published in 2016 by researchers at Google Brain, Stanford, Berkeley, and OpenAI (four of whom work with us as technical advisors). More thoughts can be found in blog posts by two of the authors [1, 2].
- This 2019 AI Alignment Research Overview written by one of the authors of the agenda above.
- Risks from Learned Optimization in Advanced Machine Learning Systems, a 2019 paper describing a novel kind of risk that may emerge in future AI systems.
- Certain posts on the AI Alignment Forum, including An overview of 11 proposals for building safe advanced AI, Imitative Generalisation, and Experimentally evaluating whether honesty generalizes.
Can we design training procedures and objectives that will cause AI systems to learn what we want them to do? In our conversations with researchers, a few plausible desiderata for reward learning methods have come up:
- Convergence in the limit of data and computation: As our AI capabilities improve, it becomes more likely that AI systems will find solutions closer to global optima; for example, much more capable game-playing agents would be more likely to find bugs that allow them to set their scores directly. This means that we should be very careful about assumptions of convergence that depend on AI systems’ limitations, which puts pressure on the reward learning process to guarantee convergence on desirable behavior in the limit of data and computation.
- Corrigibility [1, 2]: At every stage of learning, it would be desirable for human operators to be able to correct or override an AI system’s decisions. This might be achieved via life-long reward learning (where human operators’ actions are used as data about the AI system’s objective).
- Scalable supervision: If objectives are very complex and human-generated data is an AI system’s main source of information about objectives, it will be important for AI systems to use this data efficiently; human-generated data is expensive, and human supervision (generated in response to an AI system’s actions) is especially expensive.
- Capacity to exceed human performance: An AI system that always takes the action a human would select can be thought of as a theoretical baseline for AI alignment; while it will never take actions that its operators would disapprove of, it will not outperform them at any task. A successful reward learning procedure should allow an AI system to strongly exceed its operators’ capabilities, while still doing “what they want”.
- Learning from side information: Some existing data sources, e.g. videos or descriptions of human behavior, human utterances, and human-generated text, seem to contain large amounts of information about likely objectives. It would be desirable for reward learning methods to be able to make use of this data.
Below we list some example topics that seem promising for reward learning work. Not all work on these topics will be relevant to AI alignment, but each seems to hold some potential for learning more about how we can design very capable AI systems that learn to do what their operators want them to do:
- Imitation learning [1, 2, 3, 4, 5]: One basic approach to reward learning is to learn to imitate human actions. Alignment-relevant problems include include surpassing human performance, deciding which details of a human’s actions are important to imitate and which should be ignored, and dealing with the differences between the actions available to a human and the actions available to a highly capable AI system.
- Inverse reinforcement learning [1, 2, 3]: Another possible approach is to model a human demonstrator as a semi-rational agent, infer the demonstrator’s goal from their actions, and then act to help them achieve that goal; this should allow an AI system to exceed the demonstrator’s capabilities. One alignment-relevant problem is that IRL is heavily dependent on a model relating the demonstrator’s actions to their values, so a mis-specified model could lead to arbitrarily bad behavior.
- Learning from human feedback [1, 2, 3]: Another option is to elicit human feedback about each action an AI system takes and to train it to choose actions that it predicts would receive positive feedback. Alignment-relevant problems include significantly surpassing human performance (since humans may have a limited ability to judge actions that are significantly better than those they’d choose themselves), the unreliability of human feedback, the difficulty of understanding an AI system’s behavior well enough to judge it, and the high expense of human feedback data.
- Other ways of learning underspecified tasks [1, 2]: It seems plausible that some new method for learning from humans or human-generated data could be more suitable for AI alignment than current methods are.
- Scalable supervision [1, 2, 3]: Many reward learning approaches depend on human-generated data. This data can be expensive, and human supervision (e.g. feedback on a system’s performance) is extremely expensive. In order to learn complex objectives, an AI system will need to use human-generated data very efficiently, and maximize the information about its objective that it gains from other data sources.
Can we design training procedures and objectives that will cause AI systems to perform as desired on inputs that are outside their training distributions or that are generated adversarially? Again, a few plausible desiderata have come up in our conversations with researchers:
- Robustness to distributional shift: Can we train systems to perform as desired on inputs that are sampled from distributions that are meaningfully distinct from the system’s training distribution?
- Robustness to adversarial inputs: Can we train systems to perform as desired on inputs that are specifically designed to cause incorrect decisions?
- Calibration: Can we train systems to accurately assess their own uncertainty, including in situations where their test data is drawn from a meaningfully different sort of data set than their training data? If so, we may be able to design very capable AI systems to act conservatively in situations where they are likely to make mistakes.
- Robustness to model mis-specification : Some kinds of learning problems could be caused by incorrect model classes. For example, in some cases we may want to learn about something that cannot be observed directly, but that instead is an internal or derived feature of a model of some observable data (e.g. the preferences of a human, which can hopefully be inferred from their actions). If we mis-specify a system’s model, it may not learn these latent variables correctly, or it might learn them correctly on a training distribution but not during deployment. To what extent can we make AI systems robust against or at aware of model mis-specification?
- Robustness to training errors or data-set poisoning : Training data will sometimes contain errors, or may be contaminated by an adversary seeking to manipulate the trained system’s behavior. Can we train AI systems in a way that is robust to some amount of compromised training data?
- Safety during learning and exploration: While an AI system is learning, and especially when it’s learning an objective, it could make poor decisions based on what it has learned so far. It would be desirable to learn the most important facts about the environment and objective first and to act “conservatively” in the face of uncertainty.
- Extreme reliability for critical errors: Certain kinds of errors may cause such significant harm that we would be willing to pay very high costs to make those errors extremely unlikely. What kinds of training procedures can we use for these cases, and what kinds of guarantees could we achieve?
Below we list some example topics that seem promising for reliability. As above, not all work on these topics will be relevant to AI alignment, but each seems to hold some potential for learning more about how we can design very capable AI systems that behave reliably enough to be trusted with high-impact tasks.
- Machine learning security [1, 2, 3]: ML security studies reliability through the lens of threat models, formally defined assumptions about the capabilities and goals of an attacker. Most ML testing can be thought of as giving evidence about a system’s robustness against an “attacker” that can only present samples from a certain fixed distribution and that does not act with any particular goal in mind; adversarial examples use a threat model with a more powerful attacker. Through the systematic study of different threat models and methods for defense, ML security has the potential to give us a systematic understanding of the kinds of guarantees we can have about ML systems’ reliability, the costs and trade-offs implied by choosing one method of defense or another, and the ultimate prospects of making machine learning systems reliable in many kinds of situations. For example, we might study data poisoning attacks not only to learn how to defend against them, but also to make systems more robust against human error during training.
- Robustness to distributional shift [1, 2, 3, 4, 5]: As noted above, machine learning systems trained on a particular distribution of inputs may perform arbitrarily poorly on inputs from outside that distribution. Such failures in very influential AI systems could cause significant harm, and more capable AI systems might be more likely to encounter large distributional shifts (since they could be deployed in a wider range of circumstances, and since their use might cause significant changes in the world). There are too many topics in machine learning relevant to distributional shift to review comprehensively here, but some examples include change or anomaly detection, unsupervised risk estimation, and KWIK learning.
- Understanding and defending against adversarial examples [1, 2, 3, 4, 5, 6, 7]: Adversarial examples are inputs that are optimized to cause models to perform incorrectly. Better understanding what makes models vulnerable to adversarial examples could help us understand reliability more broadly, and thwarting adversarial examples (via e.g. adversarial training or ensembling) could lead to models that are both resistant to attack and more reliable overall. Adversarial examples also indicate a mismatch between an AI system’s learned concepts and the desired human concepts, which could lead to problems in a variety of ways.
- Verification [1, 2, 3, 4]: One approach to reliability is to produce formal specifications of desirable properties and prove that an ML system meets those specifications. When verification is possible, it can provide more comprehensive guarantees than testing would; in addition, the process of developing formal specifications for intuitively appealing properties can clarify our understanding of those properties, and attempts to prove that a systems meets a specification may reveal bugs or design mistakes in the system itself. If properties like robustness against adversarial examples can be formalized, verification may be able to give us a higher level of confidence about a system’s reliability than would be possible through testing, and the ability to verify a wide range of properties in cutting-edge machine learning systems seems likely to be useful for building systems that can be trusted with high-impact tasks.
Interpretability of learned models [1, 2, 3, 4, 5, 6, 7]: Learned models are often extremely large and complex; if a learning method can produce models whose internal workings can be inspected and interpreted, or if we can develop tools to visualize or analyze the internal workings of a learned model, we might be able to better understand how models work, which changes to inputs would result in changed outputs, how the model’s decision depends on its training and data, and why we should or should not trust the model to perform well. An orthogonal type of interpretability is interpretability of the training process itself, where visualizations and other tools can help us understand how the training method works and when it will be reliable (see e.g. this article on momentum). Both types of interpretability seem to offer some access to the reasons that an AI system makes one decision or another, potentially revealing problems that would be difficult to uncover through testing.
In terms of its application to alignment, interpretability could help us to understand how AI systems work and how they may fail, misbehave, or otherwise not meet our expectations; for example, our understanding is that work on interpretability played a key role in the discovery of adversarial examples. The ability to interpret a system’s decision-making process may also help significantly with validation or supervision; for example, if a learned reward function is interpretable, we may be able to tell whether or not it will motivate desirable behavior, and a human supervisor may be able to better supervise an interpretable agent by inspecting its decision-making process. Finding new methods to train interpretable AI systems or to interpret trained models, especially methods that will scale to very complex and capable models, seems like a promising topic for AI alignment research.