The Open Phil AI Fellowship is a fellowship for full-time PhD students focused on artificial intelligence or machine learning.
With this program, we seek to fully support a small group of the most promising PhD students in AI and ML who are interested in making the long-term, large-scale impacts of AI a central focus of their research. Fellows receive a $40,000 stipend, $10,000 in research support, and payment of tuition and fees, each year, starting in the year of their selection until the end of the 5th year of their PhD.
Applications for 2019 are currently closed. We expect applications for 2020 to reopen in the fall of 2020. To submit a delayed letter of recommendation, email email@example.com.
Decision timeline: Decisions will be sent out in the spring. Last year, we reviewed applications in November and December, and interviewed candidates in early 2019 (Jan-Mar). Some candidates were invited to a second interview if we felt we needed more information to decide. We informed candidates in April or so, and announced publicly in May. Some candidates needed a decision earlier; in a few cases we were able to accommodate this. Once we have a more specific timeline this year we’ll keep applicants updated.
Read on for more information about the Open Phil AI Fellowship.
As a philanthropic funder seeking to do as much good as we can, we see the development of artificial intelligence as a particularly important cause area. We believe that progress in this area could lead to changes in human civilization as large as the agricultural or industrial revolutions, and that researchers today can do meaningful work to increase the probability of positive outcomes. While we think it’s most likely that these changes would lead to significant improvements in human well-being, we also see significant risks.
Given these views, increasing the probability of positive outcomes from transformative progress in AI is a major priority of the Open Philanthropy Project, and we believe that supporting talented young researchers is among the best ways to do this.
To this end, we’ve created the Open Phil AI Fellowship, which will support a small group of the most promising PhD students in artificial intelligence and machine learning. Fellows will be selected based on their academic excellence, technical knowledge, and interest in increasing the probability of positive outcomes from transformative AI. The AI Fellows have a broad mandate to think through which kinds of AI and ML research are likely to be most valuable, to share ideas and form a community with like-minded students and professors, and ultimately to act in the way that they think is most likely to improve outcomes from progress in AI.
The Open Phil AI Fellowship is open to students interested in any research topic within AI. We think it is likely that many topics contain research problems that are both tractable today and likely to be important for positive outcomes from transformative AI, and want to encourage students to form their own judgements about which topics seem most promising. This page includes a non-exhaustive list of examples of topics that we’ve encountered as we’ve explored this area. We expect that AI Fellows will develop many promising research directions beyond those we happen to have listed, and our basic aim is to support excellent researchers who consider potential risks from transformative AI to be a central part of their job, regardless of whether their interests fall into a category we happen to have listed.
Who can apply: The Open Phil AI Fellowship is open to full-time AI and machine learning students in any year of their PhD. Anyone who expects to be enrolled in a PhD program in Fall 2020 is welcome to apply (including undergraduate seniors applying to AI or ML PhD programs). The program is open to applicants in any country. Students with pre-existing funding sources are welcome to apply, as are students transferring to an AI/ML PhD from another field. If you aren’t sure whether you’re eligible, please feel free to ask: firstname.lastname@example.org. Our best guess is that we will fund 5-10 applicants each year.
Support: Support includes a $40,000 per year stipend, payment of tuition and fees, and an additional $10,000 in annual support for travel, equipment, and other research expenses. Fellows will be funded through the end of the 5th year of their PhD, with the possibility of renewal for subsequent years.
How the fellowship works: Fellows have a broad mandate to think through which kinds of AI and ML research are likely to be most valuable, to share ideas and form a community with like-minded students and professors, and ultimately to act in the way that they think is most likely to improve outcomes from progress in AI. Fellows participate in one or two virtual meetings each year, and attend a fellows’ gathering once per year.
Flexibility: We do not impose any required internships, restrictions on where fellows may work as an intern, or IP restrictions around fellows’ research. Our goal is to support AI Fellows at the research environment or institution they think will be best for them. For example, fellows will be free to transfer their funding to a new research group or university, or to defer use of funds while they pursue internships. If a fellow wants to pursue their research independently or in a non-academic setting, we will do our best to offer support for their continued work.
Community: We think the best research, and ultimately the most impact, often comes from groups with a culture of trust, debate, excitement, and intellectual excellence. The intent of the Open Phil AI Fellowship is not only to support a small group of promising students, but also to foster this kind of research community. In order to do this, we plan to host gatherings once or twice per year where fellows can get to know one another, learn about each other’s work, and connect with other researchers who work in this area (some of whom we fund), including researchers at the Center for Human-Compatible AI, the Montreal Institute for Learning Algorithms, Stanford, UC Berkeley, DeepMind, Google Brain, and OpenAI. We hope that AI Fellows will form lasting working relationships with each other in order to pursue a shared interest in increasing the probability of positive outcomes from transformative AI. We expect all fellows to understand and value racial, gender, and other forms of equity as operating principles, and to be committed to continued learning on issues related to race, equity, diversity, and inclusion.
How to apply
Applications for 2019 are currently closed. We expect applications for 2020 to reopen in the fall of 2020.
In past years, the application has consisted of:
- A two-page personal research statement.
- An up-to-date curriculum vitae.
- Up to three letters of recommendation.
We encourage applications from 5th-year students, who will be supported on a year-by-year basis; students who will be starting their PhD in the upcoming year; and students with pre-existing funding sources who find the mission and community of the Open Phil AI Fellowship appealing. We are committed to fostering a culture of inclusion, and encourage individuals with diverse backgrounds and experiences to apply; we especially encourage applications from women and minorities.
Questions? Please contact email@example.com.
Current Open Phil AI Fellows
Aidan is a doctoral student of Yarin Gal and Yee Whye Teh at the University of Oxford. He leads the research group FOR.ai, which focuses on providing resources and mentorship and on facilitating collaboration between academia and industry. On the technical front, Aidan’s research pursues new methods of scaling individual neural networks towards trillions of parameters and hundreds of tasks. On the ethical front, his work takes a humanist stance on machine learning applications and their risks. Aidan is a Student Researcher at Google Brain, working with Jakob Uszkoreit; previously at Brain, he worked with Geoffrey Hinton and Łukasz Kaiser. He obtained his B.Sc. from the University of Toronto under the supervision of Roger Grosse.
Andrew Ilyas is a first-year PhD student at MIT working on machine learning. His interests are in building robust and reliable learning systems, and in understanding the underlying principles of modern ML methods. Andrew completed his B.Sc. and M.Eng. in Computer Science, as well as a B.Sc. in Mathematics, at MIT in 2018. For more information, see his website.
Julius is a PhD student in Computer Science at MIT. He is interested in provable methods to enable algorithms and machine learning systems to exhibit robust and reliable behavior. Specifically, he is interested in constraints relating to privacy/security, bias/fairness, and robustness to distribution shift for agents and systems deployed in the real world. Julius received master’s degrees in computer science and technology policy from MIT, where he looked at bias and interpretability of machine learning models. For more information, visit his website.
Lydia T. Liu
Lydia T. Liu is a PhD student in Computer Science at the University of California, Berkeley, advised by Moritz Hardt and Michael I. Jordan. Her research aims to establish the theoretical foundations for machine learning algorithms to have reliable and robust performance, as well as positive long-term societal impact. She is interested in developing learning algorithms with multifaceted guarantees and understanding their distributional effects in dynamic or interactive settings. Lydia graduated with a Bachelor of Science in Engineering degree from Princeton University. She is the recipient of an ICML Best Paper Award (2018) and a Microsoft Ada Lovelace Fellowship. For more information, visit her website.
Max Simchowitz is a PhD student in Electrical Engineering and Computer Science at UC Berkeley, co-advised by Benjamin Recht and Michael Jordan. He works on machine learning problems with temporal structure: either because the learning agent is allowed to make adaptive decisions about how to collect data, or because the agent’s environment dynamically reacts to measurements taken. He received his A.B. in mathematics from Princeton University in 2015, and is a co-recipient of the ICML 2018 best paper award. You can find out more about his research on his website.
Pratyusha “Ria” Kalluri is a second-year PhD student in Computer Science at Stanford, advised by Stefano Ermon and Dan Jurafsky. She is working towards discovering and inducing conceptual reasoning inside machine learning models. This leads her to work on interpretability, novel learning objectives, and learning disentangled representations. She believes this work can help shape a more radical and equitable AI future. Ria received her Bachelor’s degree in Computer Science at MIT in 2016 and was a Visiting Researcher at Complutense University of Madrid before beginning her PhD. For more information, visit her website.
Sidd is an incoming PhD student in Computer Science at Stanford University. He is interested in grounded language understanding, with a goal of building agents that can collaborate with humans and act safely in different environments. He is finishing up a one-year residency at Facebook AI Research in New York. He received his Sc.B. from Brown University, where he did research in human-robot interaction and natural language processing advised by Professors Stefanie Tellex and Eugene Charniak. You can find more information on his website.
Smitha is a second-year PhD student in computer science at UC Berkeley, where she is advised by Moritz Hardt and Anca Dragan. Her research aims to create machine learning systems that are more value-aligned. She focuses, in particular, on difficulties that arise from the complexities of human behavior: for example, learning what a user prefers the system to do despite “irrationalities” in the user’s behavior, or learning the right decisions to make despite strategic adaptation from humans. For links to publications and other information, you can visit her website.
Aditi is a second-year PhD student in Computer Science at Stanford University, advised by Percy Liang. She is interested in making machine learning systems provably reliable and fair, especially in the presence of adversaries. Aditi received her Bachelor’s degree in Computer Science and Engineering from IIT Madras in 2016. For links to publications and more information, please visit her website.
Chris is a DPhil student at the University of Oxford, supervised by Yee Whye Teh and Arnaud Doucet, and a Research Scientist at DeepMind. Chris is interested in the tools used for inference and optimization in scalable and expressive models. He aims to expand the range of such models by expanding the toolbox needed to work with them. Chris received his MSc. from the University of Toronto, working with Geoffrey Hinton. He received a NIPS Best Paper Award in 2014 and was one of the founding members of the AlphaGo project. For more information, visit his website.
Felix is a PhD student in Computer Science at ETH Zurich, working with Andreas Krause and Angela Schoellig (University of Toronto). He is interested in enabling robots to safely and autonomously learn in uncertain real-world environments, which requires new reinforcement learning algorithms that respect the physical limitations and constraints of dynamic systems and provide theoretical safety guarantees. He received his Master’s degree in Mechanical Engineering from ETH Zurich in 2015. You can find out more about his research on his website.
Jon is a PhD student at the Massachusetts Institute of Technology in the Department of Brain and Cognitive Sciences, where he works with Roger Levy and Joshua Tenenbaum to build computational models of how people acquire and understand language. His research bridges between artificial intelligence, cognitive science, and linguistics in order to specify better concrete objectives for building language understanding systems. Before joining MIT, Jon studied at Stanford University and worked with Christopher Manning in the Stanford Natural Language Processing Group. He also spent time at Google Brain and OpenAI, where his advisors included Ilya Sutskever and Oriol Vinyals. You can find out more about Jon, including his blog and research articles, at his website.
Michael is an incoming PhD student at UC Berkeley. He is interested in reproducing humans’ flexible problem-solving abilities in machines, in particular through compositional representations. In June 2018, he will receive his Bachelor’s degree in computer science from MIT, where he worked with Professors Joshua Tenenbaum and Regina Barzilay. More information can be found on his website.
Noam is a PhD student in computer science at Carnegie Mellon University advised by Tuomas Sandholm. His research applies computational game theory to produce AI systems capable of strategic reasoning in imperfect-information multi-agent interactions. He has applied this research to creating Libratus, the first AI to defeat top humans in no-limit poker. Noam received a NIPS Best Paper award in 2017 and an Allen Newell Award for Research Excellence. Prior to starting a PhD, Noam worked at the Federal Reserve researching the effects of algorithm trading on financial markets. Before that, he developed algorithmic trading strategies. His papers and videos are available on his website.
Ruth is a PhD student in Engineering Science at the University of Oxford, where she is advised by Andrea Vedaldi. She is interested in understanding, explaining, and improving the internal representations of deep neural networks. Ruth received her Bachelor’s degree in Computer Science from Harvard University in 2015; she also earned a Master’s degree in Neuroscience from Oxford in 2016 as a Rhodes Scholar. For more information about her research, visit her website.
Example research topics
Caveats for the rest of this page:
- We expect and encourage applications on topics we haven’t listed. Our basic aim is to enable excellent researchers to think through for themselves which kinds of work are most likely to be valuable, both independently and as part of a community. We expect that AI Fellows will develop many promising research directions beyond those we happen to have listed here; we’re providing these topics as examples of research directions that seem promising to us in the hopes that this will be helpful to applicants.
- This topic list should not be mistaken for a literature review. Within each topic, we give a few examples of papers on that topic. These lists are probably not the best or most representative citations one would choose for a literature review; they were chosen because the author of this page (Daniel Dewey) had read them or because they had been suggested to us by a researcher, and it seems to us that our citations in most topics over-represent recent papers, deep learning papers, and papers by authors we have talked to. We strongly expect that AI Fellows will find relevant threads of work that are very different from the papers we list here.
The topics we list below are centered around “AI alignment,” the problem of creating AI systems that will reliably do what their users want them to do even when AI systems become much more capable than their users across a broad range of tasks. Several research groups have recently been formed around this problem, including the Center for Human-Compatible AI, teams at DeepMind and OpenAI [1, 2], and a project at the Montreal Institute for Learning Algorithms. We believe that there is important work to be done on this question in many parts of AI and machine learning.
We believe that AI alignment is important because we expect that progress in AI will eventually produce systems that are much more capable than any human or group of humans across a broad range of tasks, and that there is a reasonable chance that this will happen in the next few decades. In this case, AI systems acting on behalf of their users would probably come to be responsible for the vast majority of human influence over the world, likely causing dramatic and hard-to-predict changes in society. It would then be very important for AI systems to reliably do what their users want them to do; the risks highlighted by Nick Bostrom in the book Superintelligence are one example of the potential consequences of failing to align sufficiently transformative AI systems with human values, although given the huge variety of possible futures and the difficulty of making specific predictions about them, we think it’s worthwhile to consider a wide range of possible scenarios and outcomes beyond those Bostrom describes. In order to be satisfying, a method for making AI systems that “reliably do what their users want them to do” should apply to these kinds of highly capable and influential AI systems.
What kinds of research seem most likely to be important for AI alignment? We look at this question by imagining that these powerful AI systems are created using the same general methodologies that are most often used today, thinking about how these methods might not result in aligned AI systems, and trying to figure out what kinds of research progress would be most likely to mitigate these problems.
To give you an idea of the kinds of research that currently seem promising to us, we’ve gathered a non-exhaustive list of example research topics from our technical advisors (machine learning researchers at OpenAI, Google Brain, and Stanford) and from other AI and machine learning researchers interested in AI alignment. We found that many of these topics fit into three broad categories:
- Reward learning: Most AI systems today are trained to optimize a well-defined objective (e.g. reward or loss function); this works well in some research settings where the intended goal is very simple (e.g. Atari games, Go, and some robotics tasks), but for many real-world tasks that humans care about, the intended goal or behavior is too complex to be specified directly. For very capable AI systems, pursuit of an incorrectly specified goal would not only lead an AI system to do something other than what we intended, but could lead the system to take harmful actions – e.g. the oversimplified goal of “maximize the amount of money in this bank account” could lead a system to commit crimes. If we could instead learn complex objectives, we could apply techniques like reinforcement learning to a much broader range of tasks without incurring these risks. Can we design training procedures and objectives that will cause AI systems to learn what we want them to do?
- Reliability: Most training procedures optimize a model or policy to perform well on a particular training distribution. However, once an AI system is deployed, it is likely to encounter situations that are outside the training distribution or that are adversarially generated in order to manipulate the system’s behavior, and it may perform arbitrarily poorly on these inputs. As AI systems become more influential, reliability failures could be very harmful, especially if failures result in an AI system learning an objective incorrectly. Can we design training procedures and objectives that will cause AI systems to perform as desired on inputs that are outside their training distributions or that are generated adversarially?
- Interpretability: Learned models are often extremely large and complex; if a learning method can produce models whose internal workings can be inspected and interpreted, or if we can develop tools to visualize or analyze the dynamics of a learned model, we might be able to better understand how models work, which changes to inputs would result in changed outputs, how the model’s decision depends on its training and data, and why we should or should not trust the model to perform well. Interpretability could help us to understand how AI systems work and how they may fail, misbehave, or otherwise not meet our expectations. The ability to interpret a system’s decision-making process may also help significantly with validation or supervision; for example, if a learned reward function is interpretable, we may be able to tell whether or not it will motivate desirable behavior, and a human supervisor may be able to better supervise an interpretable agent by inspecting its decision-making process.
It’s important to note that not all of the research topics that seem promising to us fit obviously into one of these categories, and we expect researchers to find more categories and research topics in the future. Some additional problems are gathered in Concrete Problems in AI Safety, a research agenda published in 2016 by researchers at Google Brain, Stanford, Berkeley, and OpenAI (four of whom work with us as technical advisors). More thoughts can be found in blog posts by two of the authors [1, 2].
Despite the incompleteness of these categories, the rest of this document elaborates further on reward learning, reliability, and interpretability.
Can we design training procedures and objectives that will cause AI systems to learn what we want them to do? In our conversations with researchers, a few plausible desiderata for reward learning methods have come up:
- Convergence in the limit of data and computation: As our AI capabilities improve, it becomes more likely that AI systems will find solutions closer to global optima; for example, much more capable game-playing agents would be more likely to find bugs that allow them to set their scores directly. This means that we should be very careful about assumptions of convergence that depend on AI systems’ limitations, which puts pressure on the reward learning process to guarantee convergence on desirable behavior in the limit of data and computation.
- Corrigibility [1, 2]: At every stage of learning, it would be desirable for human operators to be able to correct or override an AI system’s decisions. This might be achieved via life-long reward learning (where human operators’ actions are used as data about the AI system’s objective).
- Scalable supervision: If objectives are very complex and human-generated data is an AI system’s main source of information about objectives, it will be important for AI systems to use this data efficiently; human-generated data is expensive, and human supervision (generated in response to an AI system’s actions) is especially expensive.
- Capacity to exceed human performance: An AI system that always takes the action a human would select can be thought of as a theoretical baseline for AI alignment; while it will never take actions that its operators would disapprove of, it will not outperform them at any task. A successful reward learning procedure should allow an AI system to strongly exceed its operators’ capabilities, while still doing “what they want”.
- Learning from side information: Some existing data sources, e.g. videos or descriptions of human behavior, human utterances, and human-generated text, seem to contain large amounts of information about likely objectives. It would be desirable for reward learning methods to be able to make use of this data.
Below we list some example topics that seem promising for reward learning work. Not all work on these topics will be relevant to AI alignment, but each seems to hold some potential for learning more about how we can design very capable AI systems that learn to do what their operators want them to do:
- Imitation learning [1, 2, 3, 4, 5]: One basic approach to reward learning is to learn to imitate human actions. Alignment-relevant problems include surpassing human performance, deciding which details of a human’s actions are important to imitate and which should be ignored, and dealing with the differences between the actions available to a human and the actions available to a highly capable AI system.
- Inverse reinforcement learning [1, 2, 3]: Another possible approach is to model a human demonstrator as a semi-rational agent, infer the demonstrator’s goal from their actions, and then act to help them achieve that goal; this should allow an AI system to exceed the demonstrator’s capabilities. One alignment-relevant problem is that IRL is heavily dependent on a model relating the demonstrator’s actions to their values, so a mis-specified model could lead to arbitrarily bad behavior.
- Learning from human feedback [1, 2, 3]: Another option is to elicit human feedback about each action an AI system takes and to train it to choose actions that it predicts would receive positive feedback. Alignment-relevant problems include significantly surpassing human performance (since humans may have a limited ability to judge actions that are significantly better than those they’d choose themselves), the unreliability of human feedback, the difficulty of understanding an AI system’s behavior well enough to judge it, and the high expense of human feedback data.
- Other ways of learning underspecified tasks [1, 2]: It seems plausible that some new method for learning from humans or human-generated data could be more suitable for AI alignment than current methods are.
- Scalable supervision [1, 2, 3]: Many reward learning approaches depend on human-generated data. This data can be expensive, and human supervision (e.g. feedback on a system’s performance) is extremely expensive. In order to learn complex objectives, an AI system will need to use human-generated data very efficiently, and maximize the information about its objective that it gains from other data sources.
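As a rough illustration of the "learning from human feedback" topic above, the sketch below fits a reward function to pairwise human comparisons. It assumes a Bradley-Terry style preference model and a linear reward over hand-written features; this is one common formulation, not a method this page prescribes. The behaviour names, the feature layout `[task_progress, harm]`, and the function names are all hypothetical.

```python
import math


def reward(w, x):
    """Linear reward r(x) = w . x (an assumed, deliberately simple form)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def fit_reward_from_preferences(features, comparisons, steps=2000, lr=0.1):
    """Fit reward weights so that human-preferred items score higher.

    features: dict mapping item name -> list of floats
    comparisons: list of (preferred, rejected) item-name pairs
    Assumed model: P(a preferred over b) = sigmoid(r(a) - r(b)).
    """
    dim = len(next(iter(features.values())))
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for preferred, rejected in comparisons:
            diff = [a - b for a, b in zip(features[preferred], features[rejected])]
            p = 1.0 / (1.0 + math.exp(-reward(w, diff)))
            # Gradient of log sigmoid(w . diff) with respect to w is (1 - p) * diff.
            for i in range(dim):
                grad[i] += (1.0 - p) * diff[i]
        # Gradient ascent on the average log-likelihood of the comparisons.
        w = [wi + lr * gi / len(comparisons) for wi, gi in zip(w, grad)]
    return w


# Hypothetical behaviours described by [task_progress, harm].
features = {
    "careful": [1.0, 0.0],
    "reckless": [1.0, 1.0],
    "idle": [0.0, 0.0],
}
# Hypothetical human judgements: (preferred, rejected).
comparisons = [("careful", "reckless"), ("careful", "idle"), ("idle", "reckless")]

w = fit_reward_from_preferences(features, comparisons)
ranking = sorted(features, key=lambda name: reward(w, features[name]), reverse=True)
print(ranking)
```

Three comparisons are enough here only because the features are hand-picked and the reward is linear. With a richer reward model, far more feedback would be needed, which is the cost concern raised under scalable supervision.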
Can we design training procedures and objectives that will cause AI systems to perform as desired on inputs that are outside their training distributions or that are generated adversarially? Again, a few plausible desiderata have come up in our conversations with researchers:
- Robustness to distributional shift: Can we train systems to perform as desired on inputs that are sampled from distributions that are meaningfully distinct from the system’s training distribution?
- Robustness to adversarial inputs: Can we train systems to perform as desired on inputs that are specifically designed to cause incorrect decisions?
- Calibration: Can we train systems to accurately assess their own uncertainty, including in situations where their test data is drawn from a meaningfully different sort of data set than their training data? If so, we may be able to design very capable AI systems to act conservatively in situations where they are likely to make mistakes.
- Robustness to model mis-specification: Some kinds of learning problems could be caused by incorrect model classes. For example, in some cases we may want to learn about something that cannot be observed directly, but that instead is an internal or derived feature of a model of some observable data (e.g. the preferences of a human, which can hopefully be inferred from their actions). If we mis-specify a system’s model, it may not learn these latent variables correctly, or it might learn them correctly on a training distribution but not during deployment. To what extent can we make AI systems robust against, or at least aware of, model mis-specification?
- Robustness to training errors or data-set poisoning: Training data will sometimes contain errors, or may be contaminated by an adversary seeking to manipulate the trained system’s behavior. Can we train AI systems in a way that is robust to some amount of compromised training data?
- Safety during learning and exploration: While an AI system is learning, and especially when it’s learning an objective, it could make poor decisions based on what it has learned so far. It would be desirable to learn the most important facts about the environment and objective first and to act “conservatively” in the face of uncertainty.
- Extreme reliability for critical errors: Certain kinds of errors may cause such significant harm that we would be willing to pay very high costs to make those errors extremely unlikely. What kinds of training procedures can we use for these cases, and what kinds of guarantees could we achieve?
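To make the calibration desideratum above more concrete, here is a minimal sketch of one common way to measure it: a binned expected calibration error, which compares a model's stated confidence with its observed accuracy. The metric, the function name, and the toy numbers are illustrative assumptions, not something this page specifies.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy.

    confidences: predicted probabilities in [0, 1] for the chosen answers
    correct: booleans saying whether each chosen answer was right
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece


# A system that says "90% sure" and is right 9 times out of 10.
well_calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
# A system that says "90% sure" but is right only half the time.
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
print(well_calibrated, overconfident)
```

A low score on held-out data from the training distribution says little about shifted data, so a check like this would need to be repeated on inputs drawn from meaningfully different distributions.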
Below we list some example topics that seem promising for reliability. As above, not all work on these topics will be relevant to AI alignment, but each seems to hold some potential for learning more about how we can design very capable AI systems that behave reliably enough to be trusted with high-impact tasks.
- Machine learning security [1, 2, 3]: ML security studies reliability through the lens of threat models, formally defined assumptions about the capabilities and goals of an attacker. Most ML testing can be thought of as giving evidence about a system’s robustness against an “attacker” that can only present samples from a certain fixed distribution and that does not act with any particular goal in mind; adversarial examples use a threat model with a more powerful attacker. Through the systematic study of different threat models and methods for defense, ML security has the potential to give us a systematic understanding of the kinds of guarantees we can have about ML systems’ reliability, the costs and trade-offs implied by choosing one method of defense or another, and the ultimate prospects of making machine learning systems reliable in many kinds of situations. For example, we might study data poisoning attacks not only to learn how to defend against them, but also to make systems more robust against human error during training.
- Robustness to distributional shift [1, 2, 3, 4, 5]: As noted above, machine learning systems trained on a particular distribution of inputs may perform arbitrarily poorly on inputs from outside that distribution. Such failures in very influential AI systems could cause significant harm, and more capable AI systems might be more likely to encounter large distributional shifts (since they could be deployed in a wider range of circumstances, and since their use might cause significant changes in the world). There are too many topics in machine learning relevant to distributional shift to review comprehensively here, but some examples include change or anomaly detection, unsupervised risk estimation, and KWIK learning.
- Understanding and defending against adversarial examples [1, 2, 3, 4, 5, 6, 7]: Adversarial examples are inputs that are optimized to cause models to perform incorrectly. Better understanding what makes models vulnerable to adversarial examples could help us understand reliability more broadly, and thwarting adversarial examples (via e.g. adversarial training or ensembling) could lead to models that are both resistant to attack and more reliable overall. Adversarial examples also indicate a mismatch between an AI system’s learned concepts and the desired human concepts, which could lead to problems in a variety of ways.
- Verification [1, 2, 3, 4]: One approach to reliability is to produce formal specifications of desirable properties and prove that an ML system meets those specifications. When verification is possible, it can provide more comprehensive guarantees than testing would; in addition, the process of developing formal specifications for intuitively appealing properties can clarify our understanding of those properties, and attempts to prove that a system meets a specification may reveal bugs or design mistakes in the system itself. If properties like robustness against adversarial examples can be formalized, verification may be able to give us a higher level of confidence about a system’s reliability than would be possible through testing, and the ability to verify a wide range of properties in cutting-edge machine learning systems seems likely to be useful for building systems that can be trusted with high-impact tasks.
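As a small illustration of the adversarial examples topic above, the sketch below perturbs an input to a toy logistic classifier using the fast gradient sign method, one standard attack from the literature that this page does not itself describe. The weights, the input, and the perturbation budget `epsilon` are made-up values chosen only to show a prediction flipping under a bounded change.

```python
import math


def predict_prob(w, b, x):
    """Probability of class 1 under a logistic model (assumed toy classifier)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def fgsm_perturb(w, b, x, label, epsilon):
    """Shift each coordinate of x by epsilon in the loss-increasing direction.

    For logistic loss, the gradient with respect to the input is (p - label) * w.
    """
    p = predict_prob(w, b, x)
    grad = [(p - label) * wi for wi in w]
    return [xi + epsilon * sign(gi) for xi, gi in zip(x, grad)]


# Hypothetical model and input; the true label is 1.
w, b = [2.0, -1.0, 0.5], 0.0
x = [0.3, 0.1, 0.2]
epsilon = 0.25

x_adv = fgsm_perturb(w, b, x, label=1, epsilon=epsilon)
print(predict_prob(w, b, x), predict_prob(w, b, x_adv))
```

On this toy model the clean input is classified as class 1 and the perturbed input as class 0, even though no coordinate moved by more than `epsilon`. Bounding the perturbation in this way is an example of the explicit threat models discussed under machine learning security.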
Interpretability of learned models [1, 2, 3, 4, 5, 6, 7]: Learned models are often extremely large and complex; if a learning method can produce models whose internal workings can be inspected and interpreted, or if we can develop tools to visualize or analyze the internal workings of a learned model, we might be able to better understand how models work, which changes to inputs would result in changed outputs, how the model’s decision depends on its training and data, and why we should or should not trust the model to perform well. An orthogonal type of interpretability is interpretability of the training process itself, where visualizations and other tools can help us understand how the training method works and when it will be reliable (see e.g. this article on momentum). Both types of interpretability seem to offer some access to the reasons that an AI system makes one decision or another, potentially revealing problems that would be difficult to uncover through testing.
In terms of its application to alignment, interpretability could help us to understand how AI systems work and how they may fail, misbehave, or otherwise not meet our expectations; for example, our understanding is that work on interpretability played a key role in the discovery of adversarial examples. The ability to interpret a system’s decision-making process may also help significantly with validation or supervision; for example, if a learned reward function is interpretable, we may be able to tell whether or not it will motivate desirable behavior, and a human supervisor may be able to better supervise an interpretable agent by inspecting its decision-making process. Finding new methods to train interpretable AI systems or to interpret trained models, especially methods that will scale to very complex and capable models, seems like a promising topic for AI alignment research.