Open Philanthropy is interested in when AI systems will be able to perform various tasks that humans can perform (“AI timelines”). To inform our thinking, I investigated what evidence the human brain provides about the computational power sufficient to match its capabilities. I consulted with more than 30 experts, and considered four methods of generating estimates, focusing on floating point operations per second (FLOP/s) as a metric of computational power.
The full report on what I learned is here. This blog post is a medium-depth summary of some context, the approach I took, the methods I examined, and the conclusions I reached. The report’s executive summary is a shorter overview.
In brief, I think it more likely than not that 10^15 FLOP/s is enough to perform tasks as well as the human brain (given the right software, which may be very hard to create). And I think it unlikely (<10%) that more than 10^21 FLOP/s is required.1 But I’m not a neuroscientist, and the science here is very far from settled.2 I offer a few more specific probabilities, keyed to one specific type of brain model, in the report’s appendix.
For context: the Fugaku supercomputer (~$1 billion) performs ~4×10^17 FLOP/s, and a V100 GPU (~$10,000) performs up to ~10^14 FLOP/s.3 But even if my best-guesses are right, this doesn’t mean we’ll see AI systems as capable as the human brain anytime soon. In particular: actually creating/training such systems (as opposed to building computers that could in principle run them) is a substantial further challenge.
Acknowledgments: This project emerged out of Open Philanthropy’s engagement with some arguments suggested by one of our technical advisors, Dario Amodei, in the vein of the mechanistic and functional methods discussed below. However, my discussion should not be treated as representative of Dr. Amodei’s views; the project eventually broadened considerably; and my conclusions are my own. See the end of the executive summary for further acknowledgments, along with a list of experts consulted for the report. My thanks to Nick Beckstead, Ajeya Cotra, Tom Davidson, Owain Evans, Katja Grace, Holden Karnofsky, Michael Levine, Luke Muehlhauser, Zachary Robinson, David Roodman, Carl Shulman, and Jacob Trefethen for comments on this blog post in particular; and to Eli Nathan for extensive help with the webpage.
Some classic analyses of AI timelines (notably, by Hans Moravec and Ray Kurzweil) emphasize forecasts about when available computer hardware will be “equivalent,” in some sense (see below for discussion), to the human brain.4
A basic objection to predicting AI timelines on this basis alone is that you need more than hardware to do what the brain does.5 In particular, you need software to run on your hardware, and creating the right software might be very hard (Moravec and Kurzweil both recognize this, and appeal to further arguments).6
In the context of machine learning, we can offer a more specific version of this objection: the hardware required to run an AI system isn’t enough; you also need the hardware required to train it (along with other resources, like data).7 And training a system requires running it a lot. DeepMind’s AlphaGo Zero, for example, trained on ~5 million games of Go.8
Note, though, that depending on what sorts of task-performance will result from what sorts of training, a framework for thinking about AI timelines that incorporated training requirements would start, at least, to incorporate and quantify the difficulty of creating the right software more broadly.9 This is because training turns computation and data (along with other resources) into software you wouldn’t otherwise know how to code directly.
What’s more, the hardware required to train a system is related to the hardware required to run it.10 This relationship is central to Open Philanthropy’s interest in the topic of this report, and to an investigation my colleague Ajeya Cotra has been conducting, which draws on my analysis. That investigation focuses on what brain-related FLOP/s estimates, along with other estimates and assumptions, might tell us about when it will be feasible to train different types of AI systems. I don’t discuss this question here, but it’s an important part of the context. And in that context, brain-related hardware estimates play a different role than they do in forecasts like Moravec’s and Kurzweil’s.
I focus on floating point operations per second (FLOP/s) as a metric of computational power. These are arithmetic operations (addition, subtraction, multiplication, division) performed on a pair of numbers represented in a computer in a format akin to scientific notation. Performing tasks with computers requires resources other than FLOP/s (for example, memory and memory bandwidth), so this focus is narrow (see section 1.4 for more discussion). But FLOP/s are a key input to the investigation of training costs described above; and they’re one important resource more generally.
My aim in the report is to see what evidence the brain provides about what sorts of FLOP/s budgets would be enough to perform any cognitive task that the human brain can perform.11 Section 1.6 gives more details about the tasks I have in mind.
The project here is related to, but distinct from, directly estimating the minimum FLOP/s sufficient to perform any task the brain can perform. Here’s an analogy. Suppose you want to build a bridge across the local river, and you’re wondering if you have enough bricks. You know of only one such bridge (the “old bridge”), so it’s natural to look there for evidence. If the old bridge is made of bricks, you could count them. If it’s made of something else, like steel, you could try to figure out how many bricks you need to do what a given amount of steel does. If successful, you’ll end up confident that e.g. 100,000 bricks is enough to build such a bridge, and hence that the minimum is less than this. But how much less is still unclear. You studied an example bridge, but you didn’t derive theoretical limits on the efficiency of bridge-building.12
The project is also distinct from estimating the FLOP/s “equivalent” to the human brain. As I discuss in the report’s appendix, I think the notion of “the FLOP/s equivalent to the brain” requires clarification: there are a variety of importantly different concepts in the vicinity.
To get a flavor of this, consider the bridge analogy again, but assume that the old bridge is made of steel. What number of bricks would be “equivalent” to the old bridge? The question seems ill-posed. It’s not that bridges can’t be built from bricks. But we need to say more about what we want to know.
I group the salient possible concepts of the “FLOP/s equivalent to the human brain” into four categories, each of which, I argue, has its own problems (see section 7.5 for a summary chart). In the hopes of avoiding some of these problems, I have kept the report’s framework broad. The brain-based FLOP/s budgets I’m interested in don’t need to be uniquely “equivalent” to the brain. Nor need they accommodate any further constraints on the similarity between the brain’s internal dynamics and those of the AI systems under consideration (see section 7.2); or on the training/engineering processes that could create such systems (see section 7.3). The budgets just need to be big enough, in principle, to perform the tasks in question.
I considered four methods of using the brain to generate FLOP/s budgets. They were:
- Estimate the FLOP/s required to model the brain’s low-level mechanisms at a level of detail adequate to replicate task-performance (the “mechanistic method”).13
- Identify a portion of the brain whose function we can approximate with computers, and then scale up to FLOP/s estimates for the whole brain (the “functional method”).
- Use the brain’s energy budget, together with physical limits set by Landauer’s principle, to upper-bound required FLOP/s (the “limit method”).
- Use the communication bandwidth in the brain as evidence about its computational capacity (the “communication method”). I discuss this method only briefly.
All these methods must grapple in different ways with the severe limits on our understanding of how the brain processes information – a consistent theme in my conversations with experts. Section 1.5.1 details some of the limits I have in mind. In many cases, central barriers include:
- we lack the tools to gather the data we need (for example, we can’t reliably measure the input-output transformation a neuron implements during live behavior),14 and/or
- we don’t know enough about the tasks that cells or groups of cells are performing to tell how different lower-level mechanisms contribute.15
These and other barriers counsel pessimism about the robustness of FLOP/s estimates based on our current neuroscientific understanding (see section 1.2 for further caveats). But the aim here is not to settle the question: it’s to make reasonable best-guesses, using the inconclusive evidence currently available.
I’ll say a few words about each method in turn, and the numbers that result.
The mechanistic method
The mechanistic method attempts to estimate the computation required to model the brain’s biological mechanisms at a level of detail adequate to replicate task-performance. This method receives the most attention in the report, and it’s the one I put most weight on.
Simulating the brain in extreme detail would require enormous amounts of computational power.16 The central question for the mechanistic method, then, is which details need to be included, and which can be left out or summarized.
The approach I pursue focuses on signaling between cells. Here, the idea is that for a process occurring in a cell to matter to task-performance, it needs to affect the type of signals that cell sends to other cells. Hence, a model of that cell that replicates its signaling behavior (that is, the process of receiving signals, “deciding” what signals to send out, and sending them) would replicate the cell’s role in task-performance, even if it leaves out or summarizes many other processes occurring in the cell. Do that for all the cells in the brain involved in task-performance, and you’ve got a model that can match the brain’s capabilities.17
I give a basic overview of the signaling processes in the brain in section 1.5. For the purposes of the mechanistic method, I divide these into three categories:
- (1) Standard neuron signaling. This is the form of signaling in the brain that receives the most attention from neuroscientists and textbooks. In brief: cells called neurons signal to each other using electrical impulses called action potentials or spikes. These action potentials travel down a tail-like projection called an axon, which branches off to form connections called synapses with other neurons. When an action potential from one neuron reaches the synapse between that neuron and another, this can cause the first neuron to release chemicals called neurotransmitters, which in turn cause changes in the second neuron that influence whether it fires. These changes can proceed in part via activity in the neuron’s dendrites – tree-like branches that typically receive signals from other neurons. I use the term spike through synapse to refer to the event of a spike arriving at a synapse.
- (2) Learning. Experience shapes neural signaling in a manner that improves task-performance and stores task-relevant information.18 Where not already covered by (1), I bucket the processes involved in this under “learning.” Salient examples include: changes at synapses that occur over time, other changes to the electrical properties of neurons, and the growth and death of neurons and synapses.
- (3) Other signaling mechanisms. The brain contains a wide variety of signaling mechanisms (or candidate mechanisms) other than those included in the basic picture of standard neuron signaling. These include other types of chemical signals, other types of cells, synapses that don’t work via neurotransmitter release, local electric fields, and other forms of signaling along the axon. Where not already covered by (1) and (2), I lump all of these, known and unknown, under “other signaling mechanisms.”
Here’s a diagram of the basic framework I use for thinking about what models of these processes need to capture:
Here’s the mechanistic method formula that results:
Total FLOP/s = FLOP/s for standard neuron signaling +
FLOP/s for learning +
FLOP/s for other signaling mechanisms
I’m particularly interested in the following argument:
- (I) You can capture standard neuron signaling and learning with somewhere between ~10^13-10^17 FLOP/s overall.
- (II) This is the bulk of the FLOP/s burden (other signaling mechanisms may be important to task-performance, but they won’t require comparable FLOP/s to capture).
Why think (I)? In brief: there are roughly 10^11 neurons in the brain, and roughly 10^14-10^15 synapses. On the estimates that seem most plausible to me, each neuron spikes about 0.1-1 times per second (this is lower than the rate assumed by many other mechanistic method estimates in the literature),19 suggesting ~10^13-10^15 spikes through synapses per second overall.20 So a budget of 10^13-10^17 FLOP/s allows for:
- 1-100 FLOP per spike through synapse, which would cover various simple models of the impact of a spike through synapse on the downstream neuron (~1 FLOP per spike through synapse), with two extra orders of magnitude to allow for some possible complexities.21
- 100-1,000,000 FLOP/s per neuron,22 which covers a variety of simplified models of a neuron’s “decision” about whether to fire (including some that incorporate computation taking place in dendrites) that various arguments suggest would be adequate, and which, at the high end, covers a level of modeling complexity (single-compartment Hodgkin-Huxley models23) that I expect many computational neuroscientists to think unnecessary.24
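The arithmetic behind these two budgets can be sketched as follows. This is only an illustration of the order-of-magnitude multiplication described above, not a brain model; the constant and function names are hypothetical, and the ranges are simply the rough figures quoted in the text.

```python
# Rough inputs from the text, as (low, high) ranges.
NEURONS = 1e11                             # ~10^11 neurons
SYNAPSES = (1e14, 1e15)                    # ~10^14-10^15 synapses
SPIKES_PER_NEURON_PER_SEC = (0.1, 1.0)     # ~0.1-1 spikes per second
FLOP_PER_SPIKE_THROUGH_SYNAPSE = (1, 100)  # 1-100 FLOP per spike through synapse
FLOPS_PER_NEURON = (1e2, 1e6)              # 100-1,000,000 FLOP/s per neuron


def multiply_ranges(a, b):
    """Multiply two (low, high) ranges endpoint by endpoint."""
    return (a[0] * b[0], a[1] * b[1])


# Every synapse receives a spike each time its upstream neuron fires.
spikes_through_synapses = multiply_ranges(SYNAPSES, SPIKES_PER_NEURON_PER_SEC)

# FLOP/s implied by the per-spike-through-synapse allowance.
synapse_budget = multiply_ranges(spikes_through_synapses,
                                 FLOP_PER_SPIKE_THROUGH_SYNAPSE)

# FLOP/s implied by the per-neuron allowance.
neuron_budget = multiply_ranges((NEURONS, NEURONS), FLOPS_PER_NEURON)

for label, (low, high) in [
    ("spikes through synapses per second", spikes_through_synapses),
    ("synapse-level budget (FLOP/s)", synapse_budget),
    ("neuron-level budget (FLOP/s)", neuron_budget),
]:
    print(f"{label}: {low:.0e} to {high:.0e}")
```

Both allowances land on the same 10^13-10^17 FLOP/s span, which is why the text treats that range as covering either way of counting.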
The FLOP/s budgets for learning are a significant source of uncertainty, but various models of learning in the brain plausibly fall within this range as well; and there are some additional reasons – for example, reasons related to the timescales of processes involved in learning – that we might think that learning will require fewer FLOP/s than standard neuron signaling. Various experts I spoke to (though not all) were also sympathetic towards (I).25
What about the other signaling mechanisms at stake in (II)? Here, the question is not whether these mechanisms matter. The question is whether they meaningfully increase a FLOP/s budget that already covers standard neuron signaling and learning. My best guess is that they don’t. This is mostly because:
- My impression is that most experts who have formed opinions on the topic (as opposed to remaining agnostic) do not expect these mechanisms to account for the bulk of the brain’s information-processing, even if some play an important role.26
- Relative to standard neuron signaling, each of the mechanisms I consider is some combination of (a) slower, (b) less spatially-precise, (c) less common in the brain (or, not substantially more common), or (d) less clearly relevant to task-performance.
To be clear: many of the questions at stake in these estimates remain very open. The models and assumptions covered by 10^13-10^17 FLOP/s seem to me reasonable defaults given what we know now. But there are also a variety of ways in which these numbers could be too low, or too high.
In particular, numbers larger than 10^17 FLOP/s might be suggested by:
- Higher-precision temporal dynamics in the brain.27
- Very FLOP/s-intensive deep neural network (DNN) models of neuron behavior (see the report’s discussion of Beniaguev et al. (2020) – a model that could suggest that you need ~10^21 FLOP/s for the brain overall).
- Estimates based on time-steps per relevant variable at synapses, instead of spikes through synapses per second (see discussion here).
- Larger FLOP/s budgets for processes like dendritic computation and learning.28
- Higher estimates of parameters like synapse count or average firing rate.29
- Background expectations that information-processing in biology will be extremely complex, efficient, and/or ill-suited to replication using digital computer hardware.30
Numbers smaller than 10^13 FLOP/s might be suggested by:
- Noise, redundancy, and low-dimensional behavior amongst neurons, which suggest that modeling individual neurons/synapses might be overkill.
- Overestimates of FLOP/s capacity that result from applying analogs of the mechanistic method to human-engineered computers.
- Evolutionary constraints on the brain’s design (e.g., constraints on volume, energy consumption, growth/maintenance requirements, genome size, and speed/reliability of basic elements, as well as an inability to redesign the system from scratch), which suggest the possibility of improvements in efficiency.
Overall, I find the considerations pointing to the adequacy of budgets smaller than 10^13-10^17 FLOP/s more compelling than the considerations pointing to the necessity of larger ones (though it also seems easier, in general, to show that X is sufficient than that X is strictly required – an asymmetry present throughout the report). But the uncertainties in either direction rightly prompt dissatisfaction with the mechanistic method’s robustness.
The functional method
The functional method attempts to identify a portion of the brain whose function we can approximate with artificial systems, and then to scale up to an estimate for the brain as a whole.
Various attempts at this method have been made. I focus on two categories: estimates based on the retina, and estimates based on the visual cortex.
The retina is a thin layer of neural tissue in the eye. It performs the first stage of visual processing, and sends the results to the rest of the brain via spike patterns in the optic nerve – a bundle of roughly a million axons of neurons called retinal ganglion cells.
I consider two types of estimates for the FLOP/s sufficient to replicate retinal function.
- Hans Moravec estimates 10^9 calculations per second, based on the assumption that the retina’s function is to detect edges and motion.31 One problem here is that the retina does a lot more than this (for example, it can anticipate motion, it can signal that a predicted stimulus is absent, and it can adapt to different lighting conditions).32
- Recent deep neural networks used to predict ganglion cell firing patterns suggest higher estimates: ~10^13-10^14 FLOP/s (though I’m very uncertain about these numbers, as they depend heavily on the size of the visual input, and on how these models would scale up to a million ganglion cells).33 These, too, do not yet represent full replications of human retinal computation, but they outperform various other models on natural images.34
Moving from the retina to the whole brain introduces further uncertainty. There are a variety of possible ways of scaling up (e.g., based on mass, volume, neurons, synapses, and energy use), which result in scaling factors between 10^3 and 10^6.35 These factors imply the following ranges for the whole brain:
- Using Moravec’s retina estimate: 10^12-10^15 calculations per second
- Using DNN retina model estimates: 10^16-10^20 FLOP/s
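The two whole-brain ranges above follow from multiplying each retina estimate by the scaling factors. A minimal sketch of that step, with hypothetical names and the rough figures from the text as inputs:

```python
# Retina-to-whole-brain scaling factors quoted in the text: 10^3 to 10^6.
SCALING_FACTORS = (1e3, 1e6)

# Retina estimates as (low, high) ranges. Moravec's is a single figure,
# so low and high coincide.
RETINA_ESTIMATES = {
    "Moravec (calculations per second)": (1e9, 1e9),
    "DNN retina models (FLOP/s)": (1e13, 1e14),
}


def scale_to_whole_brain(retina_range, scaling_range):
    """Pair the lowest estimate with the smallest factor and the
    highest estimate with the largest factor."""
    return (retina_range[0] * scaling_range[0],
            retina_range[1] * scaling_range[1])


whole_brain = {
    name: scale_to_whole_brain(estimate, SCALING_FACTORS)
    for name, estimate in RETINA_ESTIMATES.items()
}

for name, (low, high) in whole_brain.items():
    print(f"{name}: {low:.0e} to {high:.0e}")
```

Note how much of the final spread comes from the three orders of magnitude in the scaling factor rather than from the retina estimates themselves.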
But there are also differences between the retina and the rest of the brain, which weaken the evidence these numbers provide (for example, the retina is less plastic, more specialized, and subject to unique physical constraints).
Overall, I treat the DNN estimates here as some weak evidence that the mechanistic method range above (10^13-10^17 FLOP/s) is too low (and these could yet underestimate the retina’s complexity, or the complexity of the brain relative to the retina). But as noted, I feel very unsure about the estimates themselves. And it seems plausible to me that the relevant models use many more FLOP/s than are required to automate what ganglion cells do (for example, these models reflect specific implementation choices that haven’t been shown necessary; and Moravec’s estimate, even if incomplete in its coverage of all retinal computation, is much lower – see the end of section 3.1.2 for more discussion).
The visual cortex
A different application of the functional method treats deep neural networks trained on vision tasks as automating some portion of the information-processing in the visual cortex – the region of the brain that receives and begins to process visual signals sent from the retina (via the lateral geniculate nucleus).36
Such networks can classify full-color images into 1000 different categories with something like human-level accuracy.37 What’s more, they can be used as state-of-the-art predictors of neural activity in the visual cortex, and the features they detect bear interesting similarities to ones the visual cortex detects (see section 3.2 for discussion).
Using these networks for functional method estimates, though, introduces at least two types of uncertainty. First, there’s clearly a lot happening in the visual cortex other than image classification of the type these models perform. For example: the visual cortex is involved in motor processing, prediction, and learning. Indeed, the idea that different cortical regions are highly specialized for particular tasks seems to have lost favor in neuroscience. And vision as a whole seems closely tied to, for example, behavioral affordances, 3D models of an environment, and high-level interpretations of what’s significant.38
Second, even on the particular task of image classification, available DNN models do not yet clearly match human-level performance. For example:
- They’re vulnerable to adversarial examples and other types of generalization failures.
- They typically use smaller inputs than the visual cortex receives.
- They classify stimuli into a smaller number of categories (indeed, it is unclear to me, conceptually, how to bound the number of categories humans can recognize).
Suppose we try to forge ahead with a functional method estimate, despite these uncertainties. What results?
An EfficientNet-B2 takes 10^9 FLOP to classify a single image (though it may be possible to use even less than this).39 Humans can recognize ~ten images per second; running an EfficientNet-B2 at this frequency would require ~10^10 FLOP/s.40
I estimate that the primary visual cortex (a large and especially well-studied part of the early visual system, also called V1) is ~0.3-3% of the brain’s neurons, and that visual cortex as a whole is ~1-10% (though if we focused on percentage of volume, mass, energy consumption, or synapses, the relevant percentages might be larger).41
We also need to estimate two other parameters, representing the two categories of uncertainty discussed above:
- (1) The percentage of the visual cortex’s information-processing capacity that it devotes to tasks analogous to image classification, when it performs them.42
- (2) The factor increase in FLOP/s required to reach human-level performance on this task (if any), relative to the FLOP/s costs of an EfficientNet-B2 run at 10 Hz.
My estimates for these are very made-up. For (1), I use 1% of V1 as a more conservative estimate, and 10% of the visual cortex as a whole as a more aggressive one, with 1% of the visual cortex as a rough middle. For (2), I use 10× as a low end, and 1000× as a high end, with 100× as a rough middle. See section 3.2.3 for a bit more discussion of these numbers.
Combining these estimates for (1) and (2), we get:
| Estimate type | Assumed percentage of visual cortex capacity used for tasks analogous to image classification, when performed | Implied percentage of the whole brain’s capacity (based on neuron count) | Assumed factor increase in 10 Hz EfficientNet-B2 FLOP/s (10^10) required to reach fully human-level image classification | Resulting FLOP/s estimate for the whole brain |
| --- | --- | --- | --- | --- |
| High-end | 0.3% (1% of V1) | 0.003%-0.03% | 1000× | 3×10^16-3×10^17 |
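The high-end row can be reproduced with a few lines of arithmetic, and the same calculation can be run on the other parameter choices mentioned above. The sketch below is illustrative only: the names are hypothetical, and pairing the "aggressive" share with the 10× factor and the "middle" share with the 100× factor is an assumption about how those numbers combine.

```python
# EfficientNet-B2 at ~10 images per second, per the text.
FLOP_PER_IMAGE = 1e9
IMAGES_PER_SECOND = 10
BASE_FLOPS = FLOP_PER_IMAGE * IMAGES_PER_SECOND  # ~10^10 FLOP/s

# Visual cortex as a share of the brain's neurons: ~1-10%.
VISUAL_CORTEX_SHARE_OF_BRAIN = (0.01, 0.10)

# (share of visual cortex devoted to the task, factor increase in FLOP/s).
# Only the high-end pairing is taken from the table; the other two
# pairings are assumptions made for illustration.
SCENARIOS = {
    "low-end": (0.10, 10),
    "middle": (0.01, 100),
    "high-end": (0.003, 1000),  # 0.3% of visual cortex, i.e. ~1% of V1
}


def whole_brain_flops(share_of_visual_cortex, performance_factor):
    """Scale task-level FLOP/s up to the whole brain by neuron share."""
    task_flops = BASE_FLOPS * performance_factor
    smallest_share = share_of_visual_cortex * VISUAL_CORTEX_SHARE_OF_BRAIN[0]
    largest_share = share_of_visual_cortex * VISUAL_CORTEX_SHARE_OF_BRAIN[1]
    # A larger share of the brain implies a smaller whole-brain estimate.
    return (task_flops / largest_share, task_flops / smallest_share)


results = {name: whole_brain_flops(*params) for name, params in SCENARIOS.items()}

for name, (low, high) in results.items():
    print(f"{name}: {low:.1e} to {high:.1e} FLOP/s")
```

Under these assumptions the high-end scenario comes out at roughly 3×10^16-3×10^17 FLOP/s, matching the table row.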
Overall, I hold these estimates very lightly. The question of how high (2) could go, for example, seems very salient. And the conceptual ambiguities involved in (1) caution against relying on what might appear to be conservative numbers.43
Still, I don’t think these estimates are entirely uninformative. For example, it is at least interesting to me that you need to treat a 10 Hz EfficientNet-B2 as running on e.g. ~0.1% of the FLOP/s of a model that would cover ~1% of V1, in order to get whole brain estimates substantially above 10^17 FLOP/s – the top end of the mechanistic method range I discussed above. This weakly suggests to me that such a range is not way too low.
The limit method
The limit method attempts to upper bound required FLOP/s by appealing to physical limits.
I focus on limits imposed by “Landauer’s principle,” which specifies the minimum energy costs of erasing bits (see section 4.1.1 for more explanation). Standard FLOP (that is, those performed by human-engineered computers) erase bits, which means that an idealized computer running on the brain’s energy budget (~20W) can only perform so many standard FLOP/s: specifically, ~7×10^21 (~10^21 if we assume 8-bit FLOPs, and ~10^19 if we assume current digital multiplier implementations).44
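The ~7×10^21 figure can be recovered from the Landauer bound of k·T·ln 2 joules per erased bit. The sketch below assumes an operating temperature of 310 K (roughly body temperature), which is not stated above, and treats the number of bits erased per FLOP as a free parameter; the value of 8 in the second call is purely illustrative.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # exact SI value of the Boltzmann constant
BRAIN_POWER_W = 20.0              # ~20 W energy budget, per the text
TEMPERATURE_K = 310.0             # assumption: roughly body temperature


def landauer_energy_per_bit(temperature_k):
    """Minimum energy (joules) dissipated by erasing one bit."""
    return BOLTZMANN_J_PER_K * temperature_k * math.log(2)


def max_flops(power_w, temperature_k, bits_erased_per_flop=1.0):
    """Upper bound on FLOP/s if each FLOP erases the given number of bits."""
    erasures_per_second = power_w / landauer_energy_per_bit(temperature_k)
    return erasures_per_second / bits_erased_per_flop


max_bit_erasures_per_second = max_flops(BRAIN_POWER_W, TEMPERATURE_K)
# Illustrative assumption only: 8 bit erasures per FLOP.
bound_with_8_erasures = max_flops(BRAIN_POWER_W, TEMPERATURE_K, 8.0)

print(f"bit erasures per second: {max_bit_erasures_per_second:.1e}")
print(f"FLOP/s at 8 erasures per FLOP: {bound_with_8_erasures:.1e}")
```

With these inputs the first figure comes out at roughly 7×10^21 per second, and the second at a little under 10^21.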
Does this upper bound the FLOP/s required to match the brain’s task-performance? Not on its own, because the brain need not be performing operations that resemble standard FLOPs.45 Indeed, in theory, it appears possible to perform arbitrarily complicated computations with very few bit erasures, with manageable increases in computation and memory burden.46
Absent a simple upper bound, then, the question is what, if anything, we can say about the ratio between the FLOP/s required to match the brain’s task-performance and the maximum bits per second the brain can erase. Various experts I spoke to about the limit method (though not all) were quite confident that the latter far exceed the former.47 They gave various arguments, which I group into:
- Algorithmic arguments (section 4.2.1), which focus on the bits we should expect the brain’s “algorithm” to erase, per FLOP required to replicate it; and
- Hardware arguments (section 4.2.2), which focus on the energy we should expect the brain’s hardware to dissipate, per FLOP required to replicate the computation it implements.
Of these, the hardware arguments seem to me stronger (though they also don’t seem to me to rely very directly on Landauer’s principle in particular). Both, though, appeal to general considerations that apply even if more specific assumptions from other methods are mistaken.
Overall, it seems unlikely to me that required FLOP/s exceeds the bounds suggested by the limit method. This is partly out of deference to various experts; partly because various algorithmic and hardware arguments seem plausible to me (regardless of whether they rely on Landauer’s principle or not); and partly because other methods generally point to lower numbers. But this doesn’t seem like a case of a physical limit imposing a clean upper bound.
The communication method
The communication method attempts to use the communication bandwidth in the brain as evidence about its computational capacity.
Communication bandwidth, here, refers to the speed with which a system can send different amounts of information different distances. This is distinct from the operations per second it can perform (computation). But estimating communication bandwidth might help with computation estimates, because the marginal value of additional computation and communication are related (e.g., too little communication and your computational units sit idle; too few computational units and it becomes less useful to move information around).
The basic form of the argument is roughly:
- The communication bandwidth in the brain is X.
- If the communication bandwidth in the brain is X, then Y FLOP/s is probably enough to match the brain’s task-performance.
I don’t examine attempts to use this method in any detail. But I note some examples in the hopes of inspiring future work.
- Dr. Paul Christiano, one of Open Philanthropy’s technical advisors, offers a loose estimate of the brain’s communication capacity, and suggests that it looks comparable (indeed, inferior) to the communication profile of a V100 GPU. Perhaps, then, the brain’s computational capacity is comparable (or inferior) to a V100 as well.48 This would suggest 10^14 FLOP/s or less for the brain (though I think this argument gets more complicated if you also bring in comparisons based on memory and energy consumption).
- AI Impacts recommends using traversed edges per second (TEPS) – a metric used to assess communication capabilities of human-engineered computers, which measures the time required to perform a certain type of search through a random graph – to quantify the brain’s communication capacity.49 Treating spikes through synapses as traversals of an edge, they estimate ~2×10^13-6×10^14 TEPS for the brain. They then examine the ratio of TEPS to FLOP/s in eight top supercomputers, and find a fairly consistent ~500-600 FLOP/s per TEPS. Scaling up from their TEPS estimate for the brain, they get ~10^16-3×10^17 FLOP/s.
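The TEPS-based scaling in the second example is a single multiplication. A minimal sketch, using the figures quoted above and hypothetical names:

```python
# Brain TEPS estimate quoted in the text: ~2x10^13 to 6x10^14.
BRAIN_TEPS = (2e13, 6e14)

# FLOP/s per TEPS observed across the supercomputers examined: ~500-600.
FLOPS_PER_TEPS = (500, 600)


def teps_to_flops(teps, flops_per_teps):
    """Convert a TEPS figure into FLOP/s at a given FLOP/s-per-TEPS ratio."""
    return teps * flops_per_teps


lowest = teps_to_flops(BRAIN_TEPS[0], FLOPS_PER_TEPS[0])
highest_at_500 = teps_to_flops(BRAIN_TEPS[1], FLOPS_PER_TEPS[0])
highest_at_600 = teps_to_flops(BRAIN_TEPS[1], FLOPS_PER_TEPS[1])

print(f"low end: {lowest:.1e} FLOP/s")
print(f"high end: {highest_at_500:.1e} to {highest_at_600:.1e} FLOP/s")
```

The low end matches the ~10^16 figure, and the high end falls between 3×10^17 and 3.6×10^17 depending on which ratio is used, consistent with the rounded ~3×10^17.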
I haven’t vetted these estimates. And in general, efforts in this vein face a number of issues (see section 5.2 for examples). But I think they may well prove helpful.
Here’s a chart plotting the different estimates I discussed, along with a few others from the report.
As I’ve said, these numbers should be held lightly. They are back-of-the-envelope calculations, offered, in the report, alongside initial discussion of complications and objections. The science here is very far from settled.
Here’s a summary of the main conclusions discussed above:
- Mechanistic estimates suggesting that 10^13–10^17 FLOP/s would be enough to match the human brain’s task-performance seem plausible to me. Some considerations point to higher numbers; some, to lower numbers. Of these, the latter seem to me stronger.
- I give less weight to functional method estimates. However, I take estimates based on the visual cortex as some weak evidence that 10^13–10^17 FLOP/s isn’t much too low. Some estimates based on deep neural network models of retinal neurons point to higher numbers, but I take these as even weaker evidence.
- I think it unlikely that the required number of FLOP/s exceeds the bounds suggested by the limit method. However, I don’t think the method itself is airtight.
- Communication method estimates may well prove informative, but I haven’t vetted them.
And remember, the minimum adequate budget could be lower than all these estimates. The brain is only one example of a system that performs these tasks.
Overall, I think it more likely than not that 10^15 FLOP/s is enough to perform tasks as well as the human brain (given the right software, which may be very hard to create). And I think it unlikely (<10%) that more than 10^21 FLOP/s is required. But there’s no consensus amongst experts.
I offer a few more specific probabilities, keyed to one specific type of brain model, in the appendix. My current best-guess median for the FLOP/s required to run that particular type of model is around 10^15 (recall that none of these numbers are estimates of the FLOP/s uniquely “equivalent” to the brain).
As can be seen from the figure above, the FLOP/s capacities of current computers cover the estimates I find most plausible. However:
- Task-performance requires resources other than FLOP/s (for example, memory and memory bandwidth).
- Performing tasks on a particular machine can introduce further overheads and complications.
- Most importantly, matching the human brain’s task-performance requires actually creating sufficiently capable and computationally efficient AI systems, and this could be extremely (even prohibitively) difficult in practice even with computers that could run such systems in theory. Indeed, as noted above, the FLOP/s required to run a system that does X can be available even while the resources (including data) required to train it remain substantially out of reach. And what sorts of task-performance will result from what sorts of training is itself a further, knotty question.50
So even if my best-guesses are correct, this does not imply that we’ll see AI systems as capable as the human brain anytime soon.
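As a rough illustration of the hardware comparison above, the following Python sketch counts how many machines of each kind would be needed to reach the 10^15 FLOP/s best guess and the 10^21 FLOP/s figure. The hardware numbers are the approximate ones quoted in the introduction and in note 3; the names `HARDWARE_FLOPS`, `units_needed`, and `summarize` are hypothetical, chosen only for this example. The sketch compares raw FLOP/s alone, so it says nothing about memory, bandwidth, overheads, or the difficulty of creating the software.

```python
# Approximate peak FLOP/s figures as quoted in this post (not measured here).
HARDWARE_FLOPS = {
    "V100 GPU": 10**14,
    "Fugaku supercomputer": 4 * 10**17,
    "Google TPU supercomputer": 4 * 10**17,
}

BEST_GUESS_FLOPS = 10**15      # "more likely than not" sufficient
UNLIKELY_ABOVE_FLOPS = 10**21  # <10% credence that more is required


def units_needed(target_flops: int, machine_flops: int) -> int:
    """Machines needed to reach target_flops, counting raw FLOP/s only."""
    return -(-target_flops // machine_flops)  # integer ceiling division


def summarize() -> dict:
    """Map each machine to its unit counts for the two thresholds."""
    return {
        name: (
            units_needed(BEST_GUESS_FLOPS, flops),
            units_needed(UNLIKELY_ABOVE_FLOPS, flops),
        )
        for name, flops in HARDWARE_FLOPS.items()
    }


if __name__ == "__main__":
    for name, (best, high) in summarize().items():
        print(f"{name}: {best} unit(s) for 1e15 FLOP/s, {high} for 1e21 FLOP/s")
```

Integer arithmetic is used here so that the ceiling division is exact; perfect scaling across machines is assumed, which real systems would not achieve.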
- 1. The probabilities reported here should be interpreted as subjective levels of confidence or “credences,” not as claims about objective frequencies, statistics, or “propensities” (see Peterson (2009), Chapter 7, for discussion of various alternative interpretations of probability judgments). See Muehlhauser (2017a), section 2, for discussion of some complexities involved in using these probabilities in practice.
- 2. My academic background is in philosophy.
- 3. Google’s TPU supercomputer, which recently broke records in training ML systems, can also do ~4×10^17 FLOP/s. NVIDIA’s newest SuperPOD can deliver ~7×10^17 FLOP/s of AI performance. The A100, for ~$200,000, can do 5×10^15 FLOP/s.
- 4. See, for example, Moravec (1998), chapter 2; and Kurzweil (2005), chapter 3. See this list from AI Impacts for related forecasts.
- 5. See, for example, Malcolm (2000), Lanier (2000) (“Belief # 5”), and Russell (2019) (p. 78). AI Impacts offers a framework that I find helpful, which uses indifference curves indicating which combinations of hardware and software capability yield the same overall task-performance. This framework (see especially Figure 3) makes clear that the first human-level AI systems could use much more or much less hardware than the amount “equivalent” to the human brain (at least assuming that this amount is not the absolute minimum) – though see Figure 4 for a scenario in which brain-equivalent hardware is a better basis for forecasts.
- 6. Moravec argues here that “under current circumstances, I think computer power is the pacing factor for AI” (see his second reply to Robin Hanson). Kurzweil (2005) devotes all of Chapter 4 to the question of software. Here I am using “software” in a way that includes trained models in addition to hand-coded programs. Some forms of hardware (including neuromorphic hardware, see Mead (1989)) complicate traditional distinctions between hardware and software, but the broader consideration at stake here – e.g., that task-performance requires organizing available computational power in the right way – remains applicable.
- 7. For example: a ResNet-152 uses ~10^10 FLOP to classify an image, but took ~10^19 FLOP (a billion times more) to train, according to Hernandez and Amodei (2018) (see appendix, though see also Hernandez and Brown (2020) for discussion of decreasing training costs for vision models over time).
- 8. Silver et al. (2017): “Over the course of training, 4.9 million games of self-play were generated” (see “Empirical analysis of AlphaGo Zero training”). A bigger version of the model trained on 29 million games.
- 9. The question of what sorts of task-performance will result from what sorts of training is centrally important in this context, and I am not here assuming any particular answers to it.
- 10. The fact that training a model requires running it a lot makes this clear. But there are also more complex relationships between e.g. model size and amount of training data. See Kaplan et al. (2020) and Hestness et al. (2017).
- 11. As I discuss in section 1.4, I assume that non-FLOP resources like memory and memory bandwidth are available in amounts adequate to make full use of the FLOP/s in question, but not in unrealistically extreme abundance.
- 12. Though Dr. Paul Christiano expected there to be at least some tasks such that (a) the brain’s methods of performing them are close to maximally efficient, and (b) these methods use most of the brain’s resources. See Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano, p. 7. I don’t investigate this claim, but if true, it makes data about the brain more directly relevant to the minimum adequate FLOP/s budget.
- 13. The names “mechanistic method” and “functional method” were suggested by our technical advisor Dr. Dario Amodei, though he uses a somewhat more specific conception of the mechanistic method.
- 14. From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Erik De Schutter: “At this point, we have no way to reliably measure the input-output transformation of a neuron, where the input is defined as a specific spatio-temporal pattern of synaptic input” (p. 2). See also Ujfalussy et al. (2018), and Open Philanthropy’s non-verbatim notes from a conversation with Prof. Shaul Druckmann (p. 2).
- 15. From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Eve Marder (p. 4):
It’s been hard to make progress in understanding neural circuits, because in order to know what details matter, you have to know what the circuit is doing, and in most parts of the brain, we don’t know this…It’s not that you can’t make simplifying assumptions. It’s that absent knowledge of what a piece of nervous system needs to be able to do, you have no way of assessing whether you’ve lost something fundamental or not.
- 16. For example: Henry Markram, in a 2018 video (18:28), estimates that running a “real-time molecular simulation of the human-brain” would require 4×10^29 FLOP/s. Sandberg and Bostrom (2008) (p. 80-81) estimate 10^43 FLOP/s for emulating the stochastic behavior of single molecules.
- 17. I first encountered the idea that the computational relevance of processes within the neuron is bottlenecked by intercellular signaling via some arguments offered by Dr. Dario Amodei. See also Open Philanthropy’s non-verbatim notes from a conversation with Prof. Dong Song (p. 2).
- 18. Note that I am here construing learning how to perform new tasks as itself a task that the human brain can perform. Thus, even though we are only focused on the FLOP/s required to run a trained system that can match the human brain’s task-performance (rather than the costs of training one), that trained system itself needs to be capable of learning new tasks as well as the brain does, so the FLOP/s costs of this learning capability need to be included.
- 19. Bostrom (1998) uses 100 Hz (“Hardware requirements”); Malickas (2007) uses 100-1000 Hz; Dix (2005) uses 100 Hz as the “clock speed” (though it’s unclear whether he takes this as the average firing rate); Kurzweil (1999) uses 200 “synaptic transactions” per synapse per second, based on a 5 ms reset time (p. 124); Mead (1990) (p. 1629), Merkle (1989) (“Other estimates”), Freitas (1996) use 10 spikes per second.
- 20. 10^14-10^15 synapses × 0.1-1 spikes through synapses per synapse per second = 10^13-10^15 spikes through synapses per second.
- 21. 10^13-10^15 spikes through synapses per second × 1-100 FLOP per spike through synapse = 10^13-10^17 FLOP/s.
- 22. 10^11 neurons × 10^2-10^6 FLOP/s per neuron = 10^13-10^17 FLOP/s.
- 23. See Izhikevich (2004): “[Hodgkin-Huxley] takes 120 floating point operations to evaluate 0.1 ms of model time (assuming that each exponent takes only ten operations), hence, 1200 operations/1 ms” (p. 1069). 1200 operations/1ms × 1000 ms/s = 1,200,000 operations/s.
- 24. A number of experts we engaged with indicated that many in the field are sympathetic to the adequacy of models less compute-intensive than single-compartment Hodgkin-Huxley (though we have very few comments in this respect publicly documented), and it fits with my impressions more broadly. For particular examples of experts who think models simpler than Hodgkin-Huxley adequate, see the quotes in section 22.214.171.124.
- 25. See sections 126.96.36.199 and sections 2.2.4 for more on expert opinion.
- 26. A number of experts we engaged with indicated that many computational neuroscientists would not emphasize these mechanisms very much (though their comments in this respect are not publicly documented); and the experts I interviewed didn’t tend to emphasize such mechanisms either.
- 27. For example, neurons in the owl auditory system can detect auditory stimulus timing at a precision of less than ten microseconds (see Funabiki et al. (2011), p. 15245). These cases may be sufficiently rare, or require combining a sufficient number of less-precise inputs, that they wouldn’t make much of a difference to the overall budget. However, if they are indicative of a need for much finer temporal precision across the board, they could imply large increases.
- 28. See e.g. Bhalla (2014) for some examples of complex intracellular signaling mechanisms involved in learning and memory.
- 29. In general, I’ve been surprised at how uncertain many very basic facts about the brain appear to be. For example, a 10:1 ratio between glia and neurons was widely accepted until it was corrected to roughly 1:1 (see von Bartheld et al. (2016): “The recently validated isotropic fractionator demonstrates a glia:neuron ratio of less than 1:1 … We review how the claim of one trillion glial cells originated, was perpetuated, and eventually refuted.” (p. 1)).
- 30. From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Erik De Schutter (p. 4):
The brain was not engineered. Rather, it evolved, and evolution works by adding complexity, rather than by simplification… if you want to understand the brain, you need to embrace the fact that it works because it is so complicated. Otherwise, it will be impossible to understand the system.
Various discoveries in biology have altered Prof. Jonas’s sense of the complexity of what biological systems can be doing…The class of computation that Prof. Jonas can imagine a single cell doing now seems multiple orders of magnitude more complex than it did 20 years ago.
Possible resources more readily available to the brain than to digital computers include continuous (or very fine-grained) temporal dynamics, analog physical primitives, or complex biophysical interactions that are cheap for the brain, but difficult to simulate. See e.g. Daniel et al. (2013): “Because analog computation exploits powerful biochemical mathematical basis functions that are naturally present over the entire continuous range of input operation, they are an advantageous alternative to digital logic when resources of device count, space, time or energy are constrained” (p. 619). And from Open Philanthropy’s non-verbatim notes from a conversation with Prof. Rosa Cao (p. 1-2):
Digital computers achieve speed and reliability by ignoring many dimensions of what is happening in the system. In such a context, you only care about whether the voltage in the transistors is above or below a certain threshold, and designers try hard to shield this variable from disruptive physical fluctuations. The brain is built on fairly different principles. Its functional processes are not shielded from the dynamics of the brain’s biochemistry. Rather, the brain exploits this biochemistry to perform efficient computation. This makes the brain difficult to simulate. In nature, biochemical processes like protein-protein interactions just happen, so they are “free” for the brain to run. Simulating them, however, can be quite computationally expensive.
- 31. See Moravec (1988), Chapter 2 (p. 51-74). See also Moravec (1998) and Moravec (2008). Merkle (1989) uses a broadly similar methodology.
- 32. See Gollisch and Meister (2010).
- 33. See Section 3.1.1 for more discussion of how I arrived at these numbers. The estimate for Maheswaranathan et al. (2018) is based on numbers included in Open Philanthropy’s non-verbatim notes from a conversation with Dr. Stephen Baccus (p. 6). I have discussed my estimate for Batty et al. (2017) with two people with ML expertise, but it has not been confirmed by any of the paper’s authors.
- 34. Central gaps include: their accuracy can still be improved, and they’ve only been trained on a narrow class of stimuli. See section 3.1.1 for more discussion. See the papers themselves (Maheswaranathan et al. (2019), and Batty et al. (2017)) for discussion of the improvements they represent over other models. There is also a fuller review of relevant results (along with other neuron modeling results) in section 188.8.131.52.2.
- 35. See section 3.1.2.
- 36. My discussion of this type of functional method estimate is inspired by some arguments suggested by Dr. Dario Amodei, to the effect that ML vision models might be put into productive comparison with parts of the visual cortex (and in particular, conservatively, the primary visual cortex). See also Drexler (2019), which inspired some of Dr. Amodei’s analysis.
- 37. See a chart of progress here. Russakovsky et al. (2014): “Annotator A1 evaluated a total of 1500 test set images …The human error was estimated to be 5.1%.” Though as Karpathy (2014) notes, “human accuracy lives on a tradeoff curve.”
- 38. See Karpathy (2012) for discussion of all of the complex understanding that goes into appreciating a funny picture.
- 39. See the Cerebras whitepaper for some discussion of possible efficiency gains from using hardware that avoids multiplying by 0 (this is a point suggested by Dr. Dario Amodei); and see Ravi (2018) and Frankle and Carbin (2018) for some discussion of compression techniques (though EfficientNet-B2 is already optimized for efficiency).
- 40. From Open Philanthropy’s non-verbatim notes from a conversation with Prof. Won Mok Shim: “There is a fair amount of consensus in the field that the human visual system can recognize about ten images per second (e.g., one image per 100 ms)” (p. 3).
- 41. This is because the cerebellum, which accounts for ~80% of the brain’s neurons (Azevedo et al. (2009), p. 536), is more like ~10% of its volume (Swanson (1995), p. 473), ~10% of its mass (Azevedo et al. (2009), p. 536), <10% of its energy consumption (Howarth et al. (2012) (p. 1232) suggests that energy consumption per gram is higher in the cerebral cortex than in the cerebellar cortex), and plausibly much less than 80% of the synapses (most of the neurons in the cerebellum are cerebellar granule cells, which appear to have a comparatively small number of synapses each: see Llinás et al. (2004), p. 277). See section 3.2.3 for more on my neuron estimates for V1 and the visual cortex as a whole.
- 42. Drexler (2019) (p. 184) appeals to a similar parameter: the percentage of the “immediate neural activity” devoted to a given task.
- 43. Consider: “what fraction of a horse’s legs does a wheelbarrow automate?” (this is a variant of an analogy suggested by Nick Beckstead). It’s not clear that “of course it’s hard to say precisely, but surely at least a millionth, right?” is a sensible answer – and the problem isn’t that the true answer is a billionth instead. It seems possible that comparisons between DNNs and the visual cortex are similar.
- 44. A standard FLOP takes two n-bit numbers, and produces another n-bit number. So absent active steps to save the inputs, you’ve erased at least n bits (see Jared Kaplan’s notes on Statistical Mechanics & Thermodynamics, p. 59). For 8-bit FLOPs, 7×10^21/8 = ~10^21. Hänninen et al. (2011) estimate the bit-erasures implicated by various proposed multiplier implementations. The array multiplier is the most efficient, at 8n^2 for n-bit words (see Table II, p. 2372). For 8-bit words, this is 8 × 8^2 = 8^3 = 512. 7×10^21/512 = ~10^19. Though see Sarpeshkar (1998) for discussion of more efficient, analog implementations (p. 1605).
- 45. Here’s an analogy. Suppose that you’re wondering how many bricks you need to build a bridge across the local river, and you know that a single brick always requires a pound of mortar. You learn that the “old bridge” across the river was built using no more than 100,000 pounds of mortar. If the old bridge is made of bricks, then you can infer that 100,000 bricks is enough. If the old bridge is made of steel, though, you can’t: even assuming that a brick can do anything y units of steel can do, y units of steel might require less (maybe much less) than a pound of mortar, so the old bridge could still be built with more than 100,000*y units of steel. Obviously, the connection between FLOPs, bit-erasures, and the brain’s operations may be tighter than that between bricks, mortar, and steel. But conceptually, the point stands: unless we assume that the brain performs standard FLOPs, moving from bit-erasures to FLOPs requires further arguments.
- 46. See Bennett (1989). From Open Philanthropy’s non-verbatim notes from a conversation with Dr. Jess Riedel: “the algorithm described in Charles Bennett’s 1989 paper ‘Time/Space Trade-Offs for Reversible Computation’ involves slow-downs of at worst a multiplicative factor, around 2-3X as slow” (p. 4). See also Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano: “The algorithmic overhead involved in reversible computing (specifically, the overhead involved in un-computing what you have already computed) is not that bad. Most of the difficulty lies in designing such efficient hardware” (p. 4). That said, there is at least some ongoing disagreement about this, related to the bit-erasures implied by repeatedly reading/writing inputs and outputs. See Wolpert (2019), Open Philanthropy’s non-verbatim notes from a conversation with Prof. David Wolpert (p. 2), and Open Philanthropy’s non-verbatim notes from a conversation with Dr. Jess Riedel (p. 5).
- 47. See Open Philanthropy’s non-verbatim notes from a conversation with Prof. Jared Kaplan, p. 2; Open Philanthropy’s non-verbatim notes from a conversation with Dr. Jess Riedel, p. 2-3; and Open Philanthropy’s non-verbatim notes from a conversation with Dr. Paul Christiano, p. 5. See also Open Philanthropy’s non-verbatim notes from a conversation with Prof. David Wolpert, p. 3-4, for more discussion. For skepticism/lesser confidence, see Open Philanthropy’s non-verbatim notes from a conversation with Dr. Michael Frank, p. 2, and Open Philanthropy’s non-verbatim notes from a conversation with Dr. Stephen Larson, p. 4. Prof. David Wallace was also skeptical that Landauer’s principle could be used to generate an informative upper bound on required FLOP/s. See endnotes in the beginning of section 4 for quotations from the notes in question.
- 48. The principle at work here is something like: if you have two computers comparable on one dimension important to performance (e.g., communication), but you can’t measure how they compare on some other dimension (e.g., computation), then other things equal, your median guess should be that they are comparable on this other dimension as well. Here, the assumption would be that the known dimension reflects the overall skill of the engineer, which was presumably applied to the unknown dimension as well. As an analogy: if all we know is that Bob’s pie crusts are about as good as Sally’s, the best median guess is that they’re comparable pie chefs, and hence that his pie fillings are about as good as hers as well.
- 49. Dr. Paul Christiano also assisted with the AI Impacts TEPS estimate.
- 50. My colleague Ajeya Cotra’s investigation focuses on these issues.
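The order-of-magnitude arithmetic in notes 20-23 and 44 can be checked with a short Python sketch. The input values are the ones given in those notes; the helper names (`exponent`, `multiply_ranges`) and variable names are hypothetical, and rounding to the nearest power of ten on a log scale is my assumption about how the "~" approximations are meant.

```python
import math


def exponent(x: float) -> int:
    """Exponent of the power of ten nearest to x on a log scale."""
    return round(math.log10(x))


def multiply_ranges(a, b):
    """Multiply two (low, high) ranges of positive numbers endpoint-wise."""
    return (a[0] * b[0], a[1] * b[1])


# Notes 20-21: synapses x spikes per synapse per second x FLOP per spike.
synapses = (1e14, 1e15)
spike_rate = (0.1, 1.0)
flop_per_spike = (1.0, 100.0)
spikes_per_s = multiply_ranges(synapses, spike_rate)
synaptic_flops = multiply_ranges(spikes_per_s, flop_per_spike)

# Note 22: neurons x FLOP/s per neuron.
neuron_flops = multiply_ranges((1e11, 1e11), (1e2, 1e6))

# Note 23: Hodgkin-Huxley at 120 FLOP per 0.1 ms of model time.
hh_flops_per_neuron = 120 * 10 * 1000  # per 0.1 ms -> per ms -> per second

# Note 44: bit-erasure bound divided by bit-erasures per 8-bit FLOP.
BIT_ERASURES_PER_S = 7e21
N_BITS = 8
simple_bound = BIT_ERASURES_PER_S / N_BITS                      # n erasures per FLOP
array_multiplier_bound = BIT_ERASURES_PER_S / (8 * N_BITS**2)   # 8n^2 erasures per FLOP

if __name__ == "__main__":
    print("spikes through synapses per second: 10^%d to 10^%d"
          % tuple(exponent(v) for v in spikes_per_s))
    print("synaptic FLOP/s: 10^%d to 10^%d"
          % tuple(exponent(v) for v in synaptic_flops))
    print("per-neuron budget FLOP/s: 10^%d to 10^%d"
          % tuple(exponent(v) for v in neuron_flops))
    print("Hodgkin-Huxley FLOP/s per neuron:", hh_flops_per_neuron)
    print("limit method, n erasures per FLOP: ~10^%d FLOP/s" % exponent(simple_bound))
    print("limit method, array multiplier: ~10^%d FLOP/s" % exponent(array_multiplier_bound))
```

Multiplying the range endpoints separately treats the low and high ends as coinciding across factors, which matches how the notes combine them but likely overstates the spread if the factors are not perfectly correlated.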