Published: October 2015; Updated: April 2016, May 2016, and July 2016
To inform the Open Philanthropy Project’s investigation of potential risks from advanced artificial intelligence, I (Luke Muehlhauser) conducted a short study of what we know so far about likely timelines for the development of advanced artificial intelligence (AI) capabilities.
What are we trying to forecast?
From a risk-benefit perspective, we’re mostly interested in what machines can do, not how they do it. Moreover, for long-term forecasting it’s probably easier to forecast technological capabilities than technological solutions, since economic and other incentives are stronger for the former than for the latter. For example, it’s probably easier to predict that within 20 years we’ll be able to do X amount of computation for $Y than it is to predict which particular computing architecture we’ll be using to do X amount of computation for $Y in 20 years.
One key AI capabilities milestone we’d like to predict is something like Müller & Bostrom (2014)’s notion of “high-level machine intelligence” (HLMI), which they define as an AI system “that can carry out most human professions at least as well as a typical human.”1 To operationalize this a bit further, we might say that an HLMI is a machine that can,
- for 95% of human occupations currently enumerated by the U.S. Bureau of Labor Statistics,
- achieve at least the performance of the median-skilled human who is being paid to do that job
- with 6 months of training or less
- at no more than 10,000 times the cost of the median-skilled human doing that job.
One could change these and other parameters to arrive at definitions of HLMI that have different implications. If you were primarily interested in AI’s effects on the job market, you might instead want to forecast when we’ll have machines that can replace median-skilled workers in 70% of occupations with 12 months of training at human cost. Or if you were primarily interested in loss of control scenarios, you might instead choose a particular subset of human occupations (AI research, computer security, negotiation, intelligence analysis, etc.) and try to forecast when a machine will be able to achieve the performance of the 90th-percentile-skilled worker (among those doing each of those jobs) with one month of training, at any cost.
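To make the role of these parameters concrete, here is a minimal sketch of the operationalization as a data structure. The class names, field names, and the `meets` helper are hypothetical illustrations rather than part of any published definition, and the "loss of control" variant assumes the machine must succeed at every occupation in the chosen subset, which the description above does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HLMIThreshold:
    """One possible HLMI definition; all field names are illustrative."""
    occupation_share: float     # fraction of occupations that must be covered
    skill_percentile: float     # human skill percentile to match (50 = median)
    max_training_months: float  # training time allowed per occupation
    max_cost_multiple: float    # ceiling on machine cost / human cost


@dataclass(frozen=True)
class OccupationResult:
    """Hypothetical measurement of a machine on a single occupation."""
    achieved_percentile: float
    training_months: float
    cost_multiple: float


# The three variants discussed in the text.
BASELINE = HLMIThreshold(0.95, 50, 6, 10_000)
JOB_MARKET = HLMIThreshold(0.70, 50, 12, 1)
# Assumption: share of 1.0 over a hand-picked subset; "any cost" is infinity.
LOSS_OF_CONTROL = HLMIThreshold(1.0, 90, 1, float("inf"))


def meets(threshold, results):
    """Return True if enough occupations pass every per-occupation test."""
    if not results:
        return False
    passed = sum(
        r.achieved_percentile >= threshold.skill_percentile
        and r.training_months <= threshold.max_training_months
        and r.cost_multiple <= threshold.max_cost_multiple
        for r in results
    )
    return passed / len(results) >= threshold.occupation_share
```

The same set of measurements can satisfy one definition and fail another. For example, a machine that matches median workers at 100 times human cost would pass `BASELINE` but fail `JOB_MARKET`, which is one reason forecasts made under different definitions are hard to compare.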
Unfortunately, extant HLMI forecasts rarely define so precisely what level of technological capability they’re trying to forecast, so if we take the outputs of expert surveys and other forecast summaries as some evidence for true HLMI timelines, it’s not clear how we should integrate those forecasts, in part because they might have been trying to forecast radically different things.
Also, we should remember that, no matter what threshold we set for occupations at which the machine has achieved roughly human-level performance, it will probably demonstrate vastly superhuman performance at many or most other occupations by the time it achieves human-level performance at the last one specified by our operationalization.2 Such an operationalization might suggest that, subjectively, AI capabilities will “woosh” past most HLMI milestones defined in this way, unless the capabilities most relevant to AI self-improvement are the last capabilities to reach human-level, though a “fast takeoff” remains possible even in that scenario.3
I did not conduct any literature searches to produce this report. I have been following the small field of HLMI forecasting closely since 2011, and I felt comfortable that I already knew where to find most of the best recent HLMI forecasting work. In the past few years, much of it has been published by Katja Grace and Paul Christiano at AI Impacts.4 As such, this report leans heavily on their work.
Extant expert surveys on HLMI timelines are collected here.
Grace’s summary of Michie (1973), by far the earliest expert survey, is: “Almost all participants predicted human level computing systems would not emerge for over twenty years. They were roughly divided between 20, 50, and more.” Participants were British and American computer scientists working in or near AI.
For most of the surveys, either the participants mostly weren’t AI scientists, or the participants were primarily HLMI researchers/enthusiasts, or the participants weren’t selected for any kind of “objective” criteria (e.g. “attendance at the AI@50 conference” or “people Robin Hanson happened to ask about AI progress”).
For the purposes of this question, assume that human scientific activity continues without major negative disruption. By what year would you see a (10% / 50% / 90%) probability for HLMI to exist?
| |Median responses|Mean responses|Standard deviation|
|---|---|---|---|
|10% chance of HLMI|2024|2034|33 years|
|50% chance of HLMI|2050|2072|110 years|
|90% chance of HLMI|2070|2168|342 years|
The medians strongly predict HLMI in the next 50 years, but there is wide disagreement among experts.5
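One way to see why the means in the table above sit so far after the medians is that a few very late answers pull the mean and the standard deviation upward while leaving the median untouched. The sketch below uses made-up responses, not the survey's data, purely to illustrate that effect; the `summarize` helper is hypothetical.

```python
import statistics


def summarize(years):
    """Return (median, mean, sample standard deviation) of forecast years."""
    return (
        statistics.median(years),
        statistics.mean(years),
        statistics.stdev(years),
    )


# Hypothetical responses (calendar years), chosen only to show right skew.
# These are NOT the survey's actual answers.
responses = [2030, 2040, 2045, 2050, 2050, 2060, 2075, 2100, 2300]

median, mean, spread = summarize(responses)
print(f"median={median}, mean={mean:.0f}, stdev={spread:.0f}")
```

With a single far-future answer in the list, the mean lands decades after the median. This is consistent with reading the large standard deviations in the table as a sign of a long right tail, though the table alone does not tell us the shape of the actual distribution.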
I think it would be valuable to conduct roughly this same survey every 3 years, or maybe every 5 years, but try hard to boost the response rate, and maybe expand the sample to the top 300 most-cited living AI scientists.
I also think it would be valuable to conduct a Delphi-style timelines elicitation using roughly these same questions, where the participants are selected both from groups that have been thinking about these issues for a long time (e.g. FHI and MIRI), and also from among AI scientists who gave very diverse answers to the survey but who haven’t devoted their careers to studying these issues.
(The figure’s caption reads: “Predictions from the MIRI dataset (red = maxIY ≈ ‘AI more likely than not after …’, and green = minPY ≈ ‘AI less likely than not before …’) and surveys. This figure excludes one prediction of 3012 made in 2012, and the Hanson survey, which doesn’t ask directly about prediction dates.”)
After noting the clustering around 2040-2050, Grace lists some common complaints about AI forecasts:
Predictions about AI are frequently distrusted… [for example due to complaints] that people are biased to predict AI twenty years in the future, or just before their own deaths; that AI researchers have always been very optimistic and continually proven wrong… or that failed predictions of the past look like current predictions.7
Grace doesn’t find those complaints very compelling. While I don’t necessarily find them compelling in their stated form, there is a weaker version of the first concern that I think is worth noting. Maybe when people have no idea when a technology will be developed, they generally forecast it 10–50 years away, because 10 years feels like the soonest reasonable forecast given that there’s no clear path to the technology, while predicting that anything will take more than 50 years (without very specific obstacles in mind) feels generally overconfident. Eyeballing Grace’s chart above, it looks like the median forecast has been pushed out by about one year per calendar year, staying within the 10–50 year range.
One could further investigate this concern by collecting and analyzing long-term forecasts over time for other technological capabilities, for example ICBMs, space flight, self-driving cars, and quantum computers, if enough such forecasts exist.
Grace does have some concerns about AI forecasts:
She also worries that:
- Different people and surveys are predicting different notions of HLMI, as I mentioned above.
- The expert performance literature suggests that experts should be poor at forecasting HLMI. See e.g. table 1 of Armstrong & Sotala (2012) and, I would add, the key findings of Mullins (2012), the most exhaustive retrospective analysis of historical technology forecasts I have seen.
- HLMI predictions range over about a century. As Grace writes, “this strongly suggests that many individual predictions are inaccurate, though not that the aggregate distribution is uninformative.”
Nevertheless, these expert judgments seem to provide some information. A priori it could have been the case that experts widely agreed that HLMI is at least a century away, but that isn’t the case.
Mullins (2012) suggests that quantitative trend analyses typically yield more accurate technology forecasts than expert judgments do, or indeed than any other forecasting methodology does. What do quantitative trends suggest about HLMI timing?
(This section substantially revised in April 2016; original page here.)
The most common strategy for estimating HLMI timelines via trend extrapolation is to estimate how much computation the human brain does, then extrapolate computing trends to find out by which year we’ll have roughly the computing power of the human brain available for some reasonable cost.8 This approach suffers from serious weaknesses, and we are inclined to place very little weight on it. But first, here are some examples of this approach, taken from a 2015 post by AI Impacts:9
- Using one methodology, associated with Hans Moravec, implies that human-brain-equivalent computing power is already available for ~$3/hour. In 2009, Prof. Moravec personally made a longer-term prediction, estimating that it would be 20-30 years from 2009 until a human-brain-equivalent computer could be built for $1000.
- Using a different set of assumptions drawn from the estimates of participants in a workshop about the whole brain emulation approach to HLMI, AI Impacts estimates that “an AI might compete with a human earning $100/hour in 12 years, 28 years or 40 years [depending on different assumptions]: between 2027 and 2055.”
- AI Impacts also lists a couple of other estimates, one putting the key date between 2042 and 2087 and one putting it around 2019.
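The basic arithmetic behind this kind of extrapolation can be sketched in a few lines. This is only an illustration of the general approach, not the method used by AI Impacts or Moravec: the function name and every input value below are placeholder assumptions, and it assumes price-performance keeps doubling at a fixed rate.

```python
import math


def years_until_affordable(brain_ops_per_sec, ops_per_sec_per_dollar_now,
                           budget_dollars, doubling_time_years):
    """Years until the budget buys brain-equivalent compute.

    Assumes price-performance doubles every `doubling_time_years`,
    which is itself a strong and uncertain assumption.
    """
    affordable_now = ops_per_sec_per_dollar_now * budget_dollars
    if affordable_now >= brain_ops_per_sec:
        return 0.0
    doublings_needed = math.log2(brain_ops_per_sec / affordable_now)
    return doublings_needed * doubling_time_years


# Placeholder inputs, not measured values: brain estimates spanning
# several orders of magnitude, a $1,000 budget, and 1.5-year doublings.
for brain_estimate in (1e14, 1e16, 1e18):
    years = years_until_affordable(
        brain_ops_per_sec=brain_estimate,
        ops_per_sec_per_dollar_now=1e9,
        budget_dollars=1_000,
        doubling_time_years=1.5,
    )
    print(f"brain estimate {brain_estimate:.0e} ops/s: ~{years:.0f} years")
```

Because the result depends on the logarithm of the brain estimate, each additional factor of 1,000 in that estimate adds roughly ten doublings to the answer. Uncertainty of many orders of magnitude in the brain estimate therefore translates into decades of uncertainty in the forecast.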
Some weaknesses of this approach include:
- We don’t know how much computation the human brain does. The estimates I’ve seen differ by many orders of magnitude.
- The human brain doesn’t work much like today’s computers do.10
- In many domains, we can achieve the same level of performance with varying proportions of hardware/software advantage. Great algorithms compensate for lacking hardware, while mountains of computation can compensate for unsophisticated algorithms.11
- Most AI scientists I’ve spoken to seem to think that software, not hardware, will be the key bottleneck to HLMI. To illustrate: “We’ve [probably] had the computing power of a honeybee’s brain for quite a while now, but that doesn’t mean we know how to build tiny robots that fend for themselves outside the lab, find their own sources of energy, and communicate with others to build their homes in the wild.” Every “difficulty of software” fudge factor I’ve seen added to a hardware trend analysis seems to have been pulled from one’s gut. Note that the distinction between hardware and software is not a particularly clean one. Recent progress in deep learning is often attributed to the interaction of particular classes of machine learning algorithms with the computing architectures of graphical processing units originally built for other purposes — so what fraction of the credit should go to “hardware” vs. “software”? Similarly, application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs), which are sometimes used in machine learning, somewhat blur the line between “hardware” and “software.”
For these reasons, we find such trend extrapolations to be of very limited usefulness. Perhaps the most important weakness is that these extrapolations rest on a very narrow conception of computing power (e.g. “operations per second”). Many advances (e.g. better algorithms, better FPGAs) could be very important in practice without much affecting this narrow measure, and we see no clear way to perform a relevant trend extrapolation for advances along these lines.
Nevertheless, it’s a bit interesting that such simple trend extrapolations agree so closely with expert opinion. Unfortunately, these lines of evidence are not fully independent, since it’s my impression that Moravec’s and Kurzweil’s hardware extrapolations have been widely known for some time, and may have strongly influenced many of the expert judgments made during the last ~15 years.
One method used to place a bound on how soon HLMI might arrive is to enumerate ways in which current AI systems fall short of HLMI, and then try to estimate the shortest possible path one can imagine between today’s capabilities and HLMI.
Thus one might say “Well, it seems to me that HLMI requires hierarchical planning, and long-term memory, and the integration of logical reasoning and statistical machine learning, and X, and Y, and we’re not anywhere close to solving any of those things, so we’re at least, I would guess, 20 years away.”
Or, if you think video game performance is a good benchmark for progress toward HLMI, then you might think “Right now the most general AI system can’t even beat the Atari 2600 library of games. It’ll probably be at least 15 years before it beats human-level performance for every game in the Playstation 3 library. Call me when it has beaten Grand Theft Auto V with nothing but pixel input.”
I haven’t seen anyone use much more than their gut for such lower bounds on time-to-HLMI, but it’s definitely part of the reasoning I’m using when I confidently predict we won’t have HLMI in 10 years.
In May 2013, I wrote that HLMI forecasting is further complicated by the fact that, over such long time scales, major disruptions may greatly impact HLMI timelines.
One potential disruption I named was “A tipping point in development incentives.” In retrospect, one might argue that this disruption was happening precisely as I was writing about it. As UC Berkeley AI scientist Stuart Russell has pointed out,12 in the past few years several major applications of AI have crossed a threshold of performance “from laboratory research to economically valuable technologies,” at which point “a virtuous cycle takes hold whereby even small improvements in performance are worth large sums of money, prompting greater investments in research,” with the result that “industry [has probably invested] more in the last 5 years than governments have invested since the beginning of the field.”
Other potential disruptions I describe briefly in my earlier post, with links to relevant literature, include:
- An end to Moore’s law.
- Depletion of low-hanging fruit.
- Societal collapse.
- Disinclination to proceed with AI development (e.g. due to widely held safety concerns).
- Breakthroughs in cognitive neuroscience which reveal the brain’s algorithms for general intelligence.
- Human enhancement which accelerates progress in many fields, including AI.
- Quantum computing.
What should we learn from past AI forecasts?
(This section added May 2016.)
An additional input into forecasting AI timelines is the question, “How have people predicted AI — especially HLMI (or something like it) — in the past, and should we adjust our own views today to correct for patterns we can observe in earlier predictions?” We’ve encountered the view that AI has been prone to repeated over-hype in the past, and that we should therefore expect that today’s projections are likely to be over-optimistic.
To investigate the nature of past AI predictions and cycles of optimism and pessimism in the history of the field, I read or skim-read several histories of AI and tracked down the original sources for many published AI predictions so I could read them in context. I also considered how I might have responded to hype or pessimism/criticism about AI at various times in its history, if I had been around at the time and had been trying to make my own predictions about the future of AI.
Some of my findings from this exercise are:
- The peak of AI hype seems to have been from 1956-1973. Still, the hype implied by some of the best-known AI predictions from this period is commonly exaggerated. [More]
- After ~1973, few experts seemed to discuss HLMI (or something similar) as a medium-term possibility, in part because many experts learned from the failure of the field’s earlier excessive optimism. [More]
- The second major period of AI hype, in the early 1980s, seems to have been more about the possibility of commercially useful, narrow-purpose “expert systems,” not about HLMI (or something similar). [More]
- The collection of individual AI forecasts graphed above is not very diverse: about 70% of them can be captured by three categories: (1) the earliest AI scientists, (2) a tight-knit group of futurists that emerged in the 1990s, and (3) people interviewed by Alexander Kruel in 2011-2012. [More]
- It’s unclear to me whether I would have been persuaded by contemporary critiques of early AI optimism, or whether I would have thought to ask the right kinds of skeptical questions at the time. The most substantive critique during the early years was by Hubert Dreyfus, and my guess is that I would have found it persuasive at the time, but I can’t be confident of that. [More]
I can’t easily summarize all the evidence I encountered that left me with these impressions, but I have tried to collect many of the important quotes and other data on another page titled What should we learn from past AI forecasts?
Why do people disagree?
Why do experts disagree so much about HLMI timelines? Unfortunately, virtually no HLMI forecaster provides enough detail from their forecasting reasoning for someone else to say “Okay, see, I disagree with the numbers used in the model of section 5 of your analysis, and that seems to be the main reason we disagree.” Expert forecasts often seem to be drawing in part on explicit factors such as survey results or hardware trend extrapolations, but there always seems to be an important element of gut intuition as well, which the forecaster has never articulated, and may not be able to articulate.
In conversation I have occasionally been able to pin down apparent reasons for disagreement between my AI timelines and someone else’s AI timelines, but in cases when I seem to change their mind about some piece of their model — e.g. about the significance of past wrong AI forecasts, or about how one should extrapolate rates of AI progress given that the field received very little investment for most of its history compared to today — this rarely seems to shift their overall estimate, which suggests the pieces of the forecasting model we discussed may have been epiphenomenal to their timeline estimates all along. Perhaps this is also true of my own AI forecasts.
So what do we know about HLMI timelines? Very little.
Expert surveys and quantitative hardware trend extrapolations suggest HLMI is likely to be developed sometime during the 21st century, but only weakly. In general, it seems that for experts and laypeople alike, only very wide confidence intervals for “years to HLMI” are appropriate. For example, my own 70% confidence interval for “years to HLMI” is something like 10–120 years, though that estimate is unstable and uncertain.13
- 1. Alternative operationalizations can be found in What is AGI? Various definitions of machine general intelligence are surveyed or cited in What is intelligence?
- 2. Katja Grace illustrated this visually with a spider chart.
- 3. See Yudkowsky (2013) for an overview of AI takeoff speed arguments. Yudkowsky expects a fast takeoff scenario. For an example slow takeoff argument, see Grace’s The slow traversal of ‘human-level.’
- 4. Their work, in turn, builds on earlier HLMI forecasting work by the Future of Humanity Institute (FHI), the Machine Intelligence Research Institute (MIRI), Ray Kurzweil, Hans Moravec, and others.
- 5. What about selection bias among respondents? To test for this, they put extra pressure on a random sample of 17 non-respondents (for the TOP100 poll). The two non-respondents who replied under pressure gave HLMI dates earlier than the mean and median. This isn’t much, but it’s some evidence against there being substantial selection bias toward earlier dates.
- 6. The individual forecasts are from a cleaned-up version of a MIRI-collected dataset, plus some detailed analyses Grace collected. One caveat is that Grace hasn’t added all her collected timeline analyses to this chart yet, according to my personal communication with her.
- 7. In the July 2016 update of this page, I added an ellipsis between “AI researchers have always been very optimistic and continually proven wrong” and “failed predictions of the past look like current predictions.” This ellipsis now replaces the original phrase “that experts and novices make the same predictions,” because Grace has since found that claim to be in error.
- 8. Few trend analyses of software performance over time exist; one exception is Grace (2013). A few analyses of software+hardware performance (e.g. on speech recognition) exist, but I’m not aware of a thorough review. By far the most-discussed and most-measured trends are in computing hardware, e.g. see Muehlhauser & Rieber (2014), Nagy et al. (2013), Moore’s Law, Koomey’s Law.
- 9. Grace has since added another analysis based on TEPS.
- 10. See e.g. Chatham, “10 Important Differences Between Brains and Computers”, which I only partially agree with.
- 11. For more on this, see the section “The relationship between hardware and software” from Grace’s How AI timelines are estimated.
- 12. Quotes are taken from this Russell talk and from the FLI open letter written by Russell and others.
- 13. By “unstable” I mean that if I forgot what estimate I had given here, and somebody asked me again for my 70% confidence interval for “years to HLMI,” I expect I’d be fairly likely to check my knowledge and my intuitions and reply “15–90 years,” and I expect I’d be just as likely to reply “10–150 years.” By “uncertain” I mean that I have uncertainty about my probability statements. To illustrate: my probability that a fair coin will land heads-up is 50%, and I’m highly confident of that probability estimate. Meanwhile, my probability of a major European war in the next 40 years might also be 50%, but I’m not very confident of that probability estimate, because — in contrast to the case with the fair coin — I don’t have a clear sense for how to make such a probability estimate. Various formal models for such higher-order uncertainty have been proposed — e.g. Jaynes (2003, ch. 18), Jøsang (2013), GiveWell’s “Modeling Extreme Model Uncertainty” — but I haven’t studied them closely enough to endorse any of them in particular.