Technical Updates to Our Global Health and Wellbeing Cause Prioritization Framework

In 2019, we wrote a blog post about how we think about the “bar” for our giving and how we compare different kinds of interventions to each other using back-of-the-envelope calculations, all within the realm of what we now call Global Health and Wellbeing (GHW). This post updates that one and:

• Explains how we previously compared health and income gains in comparable units. In short, we use a logarithmic model of the utility of income, so a 1% change in income is worth the same to everyone, and a dollar of income is worth 100x more to someone who has 100x less. We measure philanthropic impact in units of the welfare gained by giving a dollar to someone with an annual income of $50,000, which was roughly US GDP per capita when we adopted this framework. Under the logarithmic model, this means we value increasing 100 people’s income by 1% (i.e. a total of 1 natural log unit increase in income) at $50,000. We have previously also valued averting a disability-adjusted life year (DALY; roughly, a year of healthy life lost) at $50,000, so we valued increasing income by one natural-log unit as equal to averting 1 DALY. This would imply that a charity that could avert a DALY for $50 would have a “1,000x” return because the benefits would be $50,000 relative to the costs of $50. (More)
• Reviews our previous “bar” for what level of cost-effectiveness a grant needed to hit to be worth making. Overall, having a single “bar” across multiple very different programs and outcome measures is an attractive feature because equalizing marginal returns across different programs is a requirement for optimizing the overall allocation of resources1, and we are devoted to doing the most good possible with our giving. Prior to 2019, we used a “100x” bar based on the units above, the scalability of direct cash transfers to the global poor, and the roughly 100x ratio of high-income country income to GiveDirectly recipient income. As of 2019, we tentatively switched to thinking of “roughly 1,000x” as our bar for new programs, because that was roughly our estimate of the unfunded margin of the top charities recommended by GiveWell (which we used to be part of and remain closely affiliated with), and we thought we would be able to find enough other opportunities at that cost-effectiveness to hit our overall spending targets. (More)
• Updates our ethical framework to increase the weight on life expectancy gains relative to income gains. We’re continuing to use the log income utility model, but, after reviewing several lines of evidence, we’re doubling the weight on health relative to income in low-income settings, so we will now value a DALY at 2 natural log units of income or $100,000. We’re also updating how we measure the DALY burden of a death; our new approach will accord with GiveWell’s moral weights, which value preventing deaths at very young ages differently than implied by a DALY framework. (More)
• Articulates our tentative “bar” for giving going forward, of roughly 1,000x (which is ~20% lower than our old bar given new units – explanation in footnote2). The (offsetting) changes in the bar come from new units, our available assets growing, more sophisticated modeling of how we expect cost-effectiveness and asset returns to interact over time, growth in GiveWell’s other funding sources, and slightly increased skepticism about our ability to spend as much as needed at much higher levels of cost-effectiveness. Due to the increased assets and lower bar, we’re planning to substantially increase our funding for GiveWell’s recommended charities, which we will write more about next week. However, we still expect most of our medium-term growth in GHW to be in new causes that can take advantage of the leveraged returns to research and advocacy, and could imagine that we’ll eventually find enough room for more funding in those interventions that we will need to raise the bar again. (More)

This post focuses exclusively on how we value different outcomes for humans within Global Health and Wellbeing; when it comes to other outcomes like farm animal welfare or the far future, we practice worldview diversification instead of trying to have a single unified framework for cost-effectiveness analysis.
We think it’s an open question whether we should have more internal “worldviews” to diversify over within the broad Global Health and Wellbeing remit (vs everything being slotted into a unified framework as in this post). This post is unusually technical relative to our others, and we expect it may make sense for most of our usual blog readers to skip it.

1. How we previously compared health and income

We often use “marginal value of a dollar of income to someone with baseline income of $50K” as our unified outcome variable, so by definition giving a dollar to someone with $50k of annual income has a cost-effectiveness of 1x.3 We value income using a logarithmic utility function, so $100 in extra income for a rich person generates as much utility as $1 in extra income for a person with 1/100th the income of that rich person. In order for a grant’s cost-effectiveness to be, say, 1000x, it must be 1000 times more cost-effective than giving a dollar to someone with $50k of annual income.
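To make the log-utility arithmetic concrete, here is a toy Python sketch (the $50,000 benchmark and the 100x example come from the text; the function names are ours, not part of any official framework):

```python
import math

BENCHMARK_INCOME = 50_000  # "1x" = a marginal dollar to someone at $50K/year


def marginal_dollar_multiplier(income: float) -> float:
    """Cost-effectiveness of giving $1 to someone at `income`, in benchmark units.

    Under log utility, the marginal utility of a dollar is proportional to
    1/income, so a dollar is worth (50,000 / income) times the benchmark.
    """
    return BENCHMARK_INCOME / income


def value_of_income_gain(income_before: float, income_after: float) -> float:
    """Dollar-equivalent value (in benchmark units) of an income change:
    natural-log units gained, times $50,000 per log unit."""
    return BENCHMARK_INCOME * (math.log(income_after) - math.log(income_before))


# A person 100x poorer than the benchmark: each dollar is worth 100x as much.
assert marginal_dollar_multiplier(500) == 100.0

# Raising 100 people's incomes by 1% is ~1 natural-log unit in total,
# which the framework values at ~$50,000.
total = 100 * value_of_income_gain(500, 505)
```
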

This can be confusing because GiveWell, which we used to be part of and remain closely affiliated with, uses contributions to GiveDirectly as their baseline unit of “1x.” In our framework, those contributions are very roughly worth 100x, because GiveDirectly recipients are roughly 100x poorer (after overhead) than our $50K income benchmark. This section of our 2019 blog post reviews how our logarithmic income model works and the “100x” calculation.

We quantify health outcomes using disability-adjusted life years (DALYs). The DALY burden of a disease is the sum of the years of life lost (YLL) due to the disease and the weighted years lived with disability (YLD) due to the disease. If you save the life of someone who goes on to live for 10 years, your intervention averted 10 YLLs. If you prevent an illness that would have caused someone to live in a condition 20% as bad as death for 10 years, your intervention averted 20%*10=2 YLDs. (We don’t necessarily endorse the disability weights used to measure YLDs, and in principle we might prefer other methodologies, such as quality-adjusted life years (QALYs) or just focusing on YLLs. We use DALYs because global-health data is much more widely available in DALYs than in other frameworks, especially from the Global Burden of Disease project.)

The health interventions that we support are primarily lifesaving interventions (with the exception of deworming, where the modeled impacts run entirely through economic outcomes) – so although we talk in terms of DALYs, most of our health impact is in YLLs. (For instance, according to the GBD, ~80% of all DALYs in Sub-Saharan Africa are from YLLs, and amongst under-5 children the same figure is 97%. For malaria DALYs the share from YLLs is even higher, at 94% overall and 98% amongst under-5 children.) We also sometimes use quasi-disability-weights for valuing other outcomes (e.g., the harm of a year of prison).
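The DALY accounting above can be sketched in a few lines (a toy helper illustrating the standard YLL + disability-weighted YLD definition; the examples are the ones from the text):

```python
def dalys_averted(yll_years: float = 0.0,
                  yld_years: float = 0.0,
                  disability_weight: float = 0.0) -> float:
    """DALY burden averted = years of life lost averted, plus years lived
    with disability averted weighted by how bad the condition is (0 = full
    health, 1 = as bad as death)."""
    return yll_years + disability_weight * yld_years


# Saving a life with 10 remaining years averts 10 YLLs:
assert dalys_averted(yll_years=10) == 10

# Preventing 10 years lived in a condition 20% as bad as death averts 2 YLDs:
assert dalys_averted(yld_years=10, disability_weight=0.2) == 2.0
```
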
There are different approaches to calculating the YLLs foregone with a death at a given age. For instance, the life-expectancy tables in Kenya suggest that an average child malaria death shortens the child’s life by 68 years (i.e. that is the average remaining life expectancy of a 0-5 year old in Kenya).4 This is not a floor on plausible YLLs – one could imagine that those prevented from dying by some intervention are unusually sick relative to the broader population, and accordingly not likely to live as long as a population life table would suggest – and as explained below GiveWell uses moral weights for child deaths that would be consistent with assuming 51 years of foregone life in the DALY framework (though that is not how they reach the conclusion). On the other extreme, the GBD takes a normative approach to life expectancy, saying in effect that everyone’s life expectancy at birth should be 88 years.5 Therefore any death under age 5 has a DALY burden of at least 84 years under the GBD approach.6 We have previously been inconsistent in this regard – following the methodology of the GBD in much of our own analysis, while deferring to GiveWell’s approach (and moral weights) on their recommendations. (In the analysis further below of our current bar, we assume for simplicity that our status quo approach splits the difference between the GiveWell and GBD approaches, and uses expected life years remaining for someone who lives to age 5 based on national-level life tables for Kenya.) Below, we explain our plan to follow GiveWell’s approach more closely going forward.

Historically we haven’t written much about how we came to value a DALY at $50,000 (or equivalently to one unit of log-income), which is less than many other actors would suggest. We don’t have a well-documented account of this historical choice, but the recollection of one of us (Alexander) is:

• We had seen numbers like $50K cited in the (older) health economics literature. It was also close to the British government’s widely-cited cost-effectiveness threshold of £20k-£30k (roughly$50k as of 2007; less now) per QALY (though that is more based on an opportunity cost frame than a valuation frame).
• A $50K DALY valuation roughly reconciled our relative valuation on the life-saving GiveWell top charities against GiveDirectly (i.e., we had GiveDirectly at 100x in our units and a cost per DALY averted of ~$50 for GiveWell top charities, which, at $50K/DALY, implied a 1,000x ROI, or a ~10x difference with GiveDirectly) with GiveWell’s (which thought their life-saving charities were ~10x as cost-effective as GiveDirectly at the time).7
• Given that we were already prioritizing health heavily relative to income in terms of the total set of interventions we were supporting, opting for a lower valuation than some other parts of the literature seemed conservative.
• Using a number that was significantly larger than GDP per capita (which was ~$50K in the US at the time we adopted this valuation) implied there was more value at stake than total resources in the economy, and that seemed wrong to me at the time. We now think that I was mistaken and there’s no in-principle reason there can’t be substantially more value at stake than the sum of the world economy.

2. Our previous “bar”

It is useful for us to have a single “bar” across multiple very different programs, years, and outcome measures, because equalizing marginal returns across different programs and across time is a requirement for optimizing the overall allocation of resources, and our mission is to give as effectively as we can. The basic idea of this “bar” is that it tells us what level of cost-effectiveness of grant is “good enough” to justify spending on some specific grant or program vs. saving for future years or allocating more funding to another program instead. If instead we set different bars across programs or years (in present value terms, i.e., after adjusting for investment returns), then that would mean we could have more impact by changing the allocation of resources (to put more into the program with the higher bar and less into the program with the lower bar), and we’d, in principle, want to do that. (However, in practice, we often see a case for practicing worldview diversification.)

Prior to 2019, we often used a “100x” bar, based on the units above and the very roughly 100x ratio of $50K to GiveDirectly recipient income (net of transaction costs). We thought “that such giving was quite cost-effective and likely extremely scalable and persistently available, so we should not generally make grants that we expected to achieve less benefit per dollar than that.”

As of 2019, we switched to tentatively thinking of “roughly 1,000x” as our bar for new programs, because that was roughly our estimate of the unfunded margin of the GiveWell top charities, and we thought we would be able to find enough other opportunities at that cost-effectiveness to hit our overall spending targets. We wrote: “Overall, given that GiveWell’s numbers imply something more like “1,000x” than “100x” for their current unfunded opportunities, that those numbers seem plausible (though by no means ironclad), and that they may find yet-more-cost-effective opportunities in the future, it looks like the relevant “bar to beat” going forward may be more like 1,000x than 100x.” However, that bar was rough, and we never made it very precise, in large part because we don’t expect to be able to make back-of-the-envelope calculations that could reliably distinguish between, say, 800x and 1,000x.

3. New moral weights

We now think our previous approach to valuing health placed too little value on lifesaving interventions relative to income interventions. Our new approach values a DALY averted twice as highly, equal to a 2-unit (rather than 1-unit) increase in the natural log of any individual’s income. (This is equivalent to increasing 200 people’s incomes by 1% – i.e., in our favored units, equal to $100K in units of marginal dollars to individuals making $50K.) This updated value is more consistent with empirical estimates of beneficiary preferences, the subjective wellbeing literature, and the practice of other actors in the field.
These are all judgment calls subject to uncertainty, and we could readily imagine revising our weights again in the future based on further argument.

3.1 Beneficiary preferences

There is a large body of research on how people tend to trade off mortality risks against income gains, in particular the Value of a Statistical Life (VSL) literature. Some analyses in this literature use the stated preferences of individuals – i.e. responses to survey questions like, “would you rather reduce your risk of dying this year by 1%, or increase your income this year by $500?” Others use the revealed preferences of individuals: are the study subjects, on average, willing to take a job with a 1% higher rate of annual mortality in exchange for $500 higher annual income?8

The research findings are clearest in high-income countries, where they tend to find that respondents value a year of life expectancy 2.5 to 4 times more than annual income.9 (Since these valuations are mostly based on marginal tradeoffs, and since we model utility as a logarithmic function of income, we can interpret these findings to say that “respondents value a year of life expectancy as much as the utility gained from increasing income by 2.5 to 4 natural-log units.”)

In low-income countries, the evidence is sparser and the findings vary widely.10 For example, in this chart we plot all the estimates found by the literature search in Robinson et al. 2019’s meta-analysis – they searched for all VSL analyses, whether stated or revealed, in any country that had been classified as low- or middle-income in the last 20 years.11 (We divide the VSLs by adult12 life expectancy to get the VSLY – value of a statistical life-year. We express this in terms of local income, which tells us how many units of log-income are worth as much to the respondents as an extra year of life expectancy. We plot stated-preference studies as circles, and revealed-preference studies as diamonds.)
As you can see, the results vary extremely widely. One paper (Mahmud 2009) finds that subjects in rural Bangladesh traded off mortality risk against income in a way that suggests they valued a year of life expectancy at, very roughly, just 62% of annual income. Another paper (Qin et al. 2013) finds that subjects in China (who reported only $840 in annual per-capita income) valued a year of life expectancy at roughly 7x annual income.
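The VSL-to-VSLY conversion used for the chart can be sketched as follows. The input figures below are purely hypothetical placeholders for illustration, not numbers from Mahmud 2009, Qin et al. 2013, or any other cited study:

```python
def vsly_in_income_units(vsl: float,
                         adult_life_expectancy_years: float,
                         annual_income: float) -> float:
    """Convert a Value of a Statistical Life into a value per life-year
    (VSLY), expressed as a multiple of local annual income -- which, under
    log utility and marginal tradeoffs, can be read as natural-log units
    of income per year of life expectancy."""
    vsly = vsl / adult_life_expectancy_years
    return vsly / annual_income


# Hypothetical illustration: a $250,000 VSL, 40 years of adult life
# expectancy, and $1,500 annual income imply valuing a life-year at
# roughly 4.2x annual income.
ratio = vsly_in_income_units(250_000, 40, 1_500)
```
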

In the face of this huge variation, some sources13 recommend estimating LMIC preferences via an “elasticity” approach: statistically estimate a function mapping VSL to income, anchoring it off the better-validated VSL figures in high-income countries. This elasticity approach is reviewed in Appendix A. Mainstream versions of it predict that individuals at the global poverty line would trade off 1 year of life expectancy for anywhere between 0.5x-4x of a year’s income14 (which we can interpret as 0.5 to 4 units of log-income).
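A minimal sketch of the elasticity approach (the functional form is the standard one in this literature; the anchor VSL, incomes, and elasticities below are placeholder values for illustration, not the figures used in Appendix A):

```python
def extrapolated_vsl(vsl_rich: float, income_rich: float,
                     income_poor: float, elasticity: float) -> float:
    """Elasticity extrapolation: VSL is assumed to scale with income raised
    to a fixed elasticity, anchored to a high-income-country estimate."""
    return vsl_rich * (income_poor / income_rich) ** elasticity


# With a $10M anchor VSL at $65K income, modest changes in the assumed
# elasticity imply very different VSLs at a $500 income level -- one source
# of the wide 0.5x-4x range discussed above.
vsl_e10 = extrapolated_vsl(10_000_000, 65_000, 500, 1.0)  # elasticity 1.0
vsl_e15 = extrapolated_vsl(10_000_000, 65_000, 500, 1.5)  # elasticity 1.5
```

Note how sensitive the extrapolation is: moving the elasticity from 1.0 to 1.5 changes the implied VSL at the bottom of the income distribution by an order of magnitude.
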

Our beneficiaries (especially for lifesaving interventions) are mostly in low and lower middle income countries, and for now we’re focusing on this context in setting our moral weights. (As we’ll discuss in Appendix A, there are empirical and theoretical reasons to think that the exchange rate at which people trade off mortality risks against income gains differs systematically across income levels, with richer people valuing mortality more relative to income.)

3.5 Measurement of DALYs

We’re also switching our approach to be more consistent with GiveWell’s framework in how we translate deaths into DALYs. GiveWell assigns moral weights to deaths at various ages, rather than to DALYs. But we can use their moral weights to derive a mapping of deaths to DALYs, by dividing GiveWell’s moral weight for each death by GiveWell’s moral weight for a year lived with disability (which is defined by WHO so as to be equivalent to a DALY).27

The resulting model cares about child mortality more than adult mortality, but not by as much as remaining-population-life-expectancy would suggest. For example, GiveWell places 60% more weight on a child malaria death than on an adult death, and we can fairly straightforwardly interpret their process as counting an average of 32 DALYs per adult malaria death,28 so the GiveWell-based DALY model would implicitly count 32*160% = 51 DALYs for an under-5 malaria death. In contrast, a direct remaining-population-life-expectancy approach in Kenya would count 68 DALYs for an under-5 malaria death29, and the Global Burden of Disease approach (explained above) would count more than 84 DALYs.
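The derivation in the paragraph above is simple enough to check directly (both inputs are the figures quoted in the text; this mirrors, but is not, GiveWell’s own model):

```python
# Implied DALYs per adult malaria death under GiveWell's moral weights,
# and GiveWell's ~60% higher weight on a child death vs. an adult death.
ADULT_MALARIA_DALYS = 32
CHILD_VS_ADULT_WEIGHT = 1.6

# Implied DALYs per under-5 malaria death -- ~51, vs. 68 under the Kenya
# life-table approach and 84+ under the GBD's normative approach.
child_malaria_dalys = ADULT_MALARIA_DALYS * CHILD_VS_ADULT_WEIGHT
```
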

GiveWell has written about the process for reaching their current moral weights here.

We see a normative case for the Global Burden of Disease’s uniform global approach to DALY attribution, but given our commitment to maximizing (expected) counterfactual impact, we think the national life table approach represents a plausible upper bound on attributable DALYs, and even that seems aggressive as an estimate of counterfactual lifespan for children whose lives are saved on the margin (who are presumably less advantaged and less healthy than the national average). Overall, we’re not sure where precisely this consideration should leave us, but it seems to argue for lower numbers.

We also haven’t reached any settled thoughts on the impact of population ethics or the second-order consequences of saving a life (e.g., on economic or population growth) on how to translate between deaths and DALYs.

For now, in order to be more consistent in our practices, we’re going to defer to GiveWell and start to use the number of DALYs that would be implied by extrapolating their moral weights. (In practice, we already defer to GiveWell for their own recommendations, so this would mainly change how we use GBD figures in our BOTECs for other grants, especially in science. In the section immediately below comparing our new values to old values, we assume for simplicity that we were using the national life table approach, which splits the difference between the GiveWell approach and the GBD in describing our status quo practices.) This means fewer DALYs averted per child death averted, which offsets some of the apparent gains from doubling our value on health. We expect to revisit this and try to form a more confident independent view about the balance of all these considerations in the future.

4. Expanding our spending, and modestly lowering our cost-effectiveness bar

We think there are four major buckets of updates that affect our “bar” going forward:

• The change to our weight on health, described just above.
• Other secular changes to GiveWell’s expected cost-effectiveness.
• Cross-cutting changes to our estimate of future available assets.
• Updates to our estimate of the likely “last dollar” cost-effectiveness of our non-GiveWell spending.

We walk through more detail below, but overall these factors leave us with a bar of very roughly 1,000x going forward for now.

4.1 Increased value on health

Overall, our new moral weights put more emphasis on health than we did before, which in some sense increases the amount of value at stake according to our moral framework, and should raise our cost-effectiveness bar, at least expressed in terms of marginal dollars to someone making $50K.

This table shows how we would rate the impact of various interventions, according to our new moral weights. (You can see the calculations in Google Sheets here.) Here’s how to read the first column:

• GiveWell estimates that HKI’s vitamin A supplementation program in Kenya averts 14 child deaths per $100k granted, and also increases incomes by 500 units of log-income.30
• Per GiveWell’s moral weights, this is 7 times more cost-effective than cash transfers to the global poor.
• Open Philanthropy’s old moral weights would have rated this intervention as 747 times more cost-effective than giving a dollar to someone with $50K of income.
• Under our new moral weights, we would rate this intervention as 977 times more cost-effective than giving a dollar to someone making $50K.
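Under the new weights, the first column can be approximately reproduced as follows. This is a rough sketch: we plug in ~51 DALYs per child death (the GiveWell-derived figure from section 3.5), and the result lands near, but not exactly on, the table’s 977x, presumably because the published calculation uses unrounded inputs:

```python
# New moral weights, per the text.
DALY_VALUE = 100_000            # $ per DALY averted
LOG_INCOME_UNIT_VALUE = 50_000  # $ per natural-log unit of income gained
CHILD_DEATH_DALYS = 51          # GiveWell-derived DALYs per child death

# HKI vitamin A example: per $100k granted, ~14 child deaths averted
# and ~500 log-units of income gained.
grant = 100_000
deaths_averted = 14
log_income_units = 500

benefits = (deaths_averted * CHILD_DEATH_DALYS * DALY_VALUE
            + log_income_units * LOG_INCOME_UNIT_VALUE)
cost_effectiveness = benefits / grant  # ~964x, close to the table's 977x
```
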

Note that the update in cost-effectiveness depends on the mix of beneficial outcomes an intervention generates. A charity that just increases income (as in the second column of this table) will have the same cost-effectiveness under our new or old moral framework. An intervention that simply averted DALYs (say, averting 1 DALY for every $100 spent) would be twice as cost-effective under our new moral weights. Because of the change in how we measure DALYs, an intervention that averts child deaths at a fixed rate (say, averting 1 child death for every $5000 spent) would only be roughly 1.5x more cost-effective under our new moral weights than under the old framework.31 The program in column 1 gets some of its moral impact from income interventions and some from averting child deaths, so its cost-effectiveness changes by a weighted average of 1x and 1.5x.

The third and fourth columns show the mix of outcomes that could be achieved by a typical dollar to GiveWell top charities – that is, roughly 50% of its impact from income effects and 50% from mortality effects.32 Our new weights rate the cost-effectiveness of this mix of outcomes ~25% higher than our old weights did.33 (To foreshadow a bit, this means that if we lower the bar by ~20%, simultaneously with changing our moral weights, then our nominal bar will not change.)

For the past few years GiveWell’s rough margin has been interventions that they rate as 10x more cost-effective than cash transfers to the global poor – the third column shows how a dollar could be spent at this margin. This was our prior bar, and you can see that in our old units it was ~1000x. If the only updates were to our valuation on DALYs (and our framework for translating deaths into DALYs), our bar would go to roughly 1300x (our rough old bar expressed in new units).
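The unit conversion here can be sketched with a toy helper (the ~100x GiveDirectly multiplier is from section 1; the 1.3 uplift is our rounding of the increase implied by moving the old ~1,000x bar to roughly 1300x in new units; the function name is ours):

```python
GIVEDIRECTLY_MULTIPLIER = 100  # GiveDirectly ~ 100x in our $50K-benchmark units


def to_op_units(givewell_multiple: float, weights_uplift: float = 1.0) -> float:
    """Convert a GiveWell 'x cash' cost-effectiveness multiple into our
    $50K-benchmark units. `weights_uplift` adjusts for a change in moral
    weights applied to the typical GiveWell mix of outcomes."""
    return givewell_multiple * GIVEDIRECTLY_MULTIPLIER * weights_uplift


# The prior bar: 10x GiveDirectly ~ 1000x in old units.
old_bar = to_op_units(10)

# Roughly the same margin expressed in new units (~30% uplift): ~1300x.
new_units_bar = to_op_units(10, weights_uplift=1.3)
```
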

However, this change to our valuations is not the only change here; below we address others which, taken together, lower our expected cost-effectiveness of our last dollar (assuming the GiveWell mix of outcomes) by roughly 20%. This means our nominal bar is staying roughly constant.

4.2 Changes to GiveWell’s expected cost-effectiveness

Our expectation of future funding for GiveWell top charities, including both our support and that from others, has grown much faster than we would have expected in 2019. We didn’t have a precise model of expected future funding to GiveWell at that point, but very roughly we think it’s reasonable to model expected future funding for GiveWell’s recommendations as having doubled relative to our 2019 expectations. We currently model the GiveWell opportunity set as isoelastic with eta=.375,34 which implies that a doubling of expected funding should reduce marginal cost-effectiveness (and the bar) by 23% (1 – 2^-.375 ≈ 23%).
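The isoelastic arithmetic above can be checked directly (eta and the doubling assumption are from the text):

```python
# Isoelastic model of the GiveWell opportunity set: marginal
# cost-effectiveness is proportional to funding**(-eta). With eta = 0.375,
# a doubling of expected funding lowers the margin by 1 - 2**(-eta).
eta = 0.375
decline_from_doubling = 1 - 2 ** (-eta)  # roughly 23%
```
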

On the other hand, GiveWell has found slightly more cost-effective opportunities over the last year than we would have expected them to. This year, they’re expecting roughly $400M of spending capacity at least ~8x as cost-effective as GiveDirectly according to their modeling. This is roughly as much as we would have expected if they had already explored the space of “8x” opportunities as thoroughly as they’ve explored the opportunities that are ~10x as cost-effective as GiveDirectly.35 Given that they have only recently focused on exploring this space of slightly-less-cost-effective interventions, this is a very promising amount of spending capacity, and suggests the potential for even more capacity in the near future. This should marginally raise our bar.

We’ve also done more sophisticated modeling work on how we expect the cost-effectiveness of direct global health aid and asset returns to interact over time, and how we should optimally spread spending across time to maximize expected impact while hitting Cari and Dustin’s (our main funders’) stated goal of spending down within their lifetimes. We’re still hoping to share that analysis and code, but the top-level conclusion is that for opportunities like the GiveWell top charities and with a ~50 year spenddown target, it’s optimal to spend something like 9% of assets per year. That would imply a significantly faster pace of spending for the assets we expect to recommend to the GiveWell top charities than we’ve reached in the past, which would in turn imply a lowering of the bar.
Two other interesting implications of the model for GiveWell spending are that: (a) we should be trying to get to our optimal spending rate and then spending down with a decreasing dollar amount each year (which may be more a flaw/simplification of the model than an accurate conclusion); and (b) we should expect our bar to fall by roughly 4% per year in nominal terms (largely reflecting asset returns – this helps equalize our “real”36 bar over time).

Overall, we currently expect GiveWell’s marginal cost-effectiveness to end up around 7-8x GiveDirectly (in their units), which, assuming their current distribution across health and income benefits, translates to ~900-1,100x in our new units,37 though our understanding is that GiveWell does not necessarily endorse this extrapolation. Assuming that we continue to support the GiveWell recommendations and have a correctly-implemented uniform bar, that implies a similar bar across all our other work, though it could turn out to be too low if we’re able to find many more cost-effective opportunities in other work.

One complication for extrapolating from the GiveWell bar to our other work is that GiveWell is much more thorough in their cost-effectiveness calculations than we typically are in our back-of-the-envelope calculations, which might mean that the results aren’t really comparable. We linked to some examples of our back-of-the-envelope calculations from our 2019 post, and they compare very unfavorably to the thoroughness of GiveWell’s cost-effectiveness analyses. That said, GiveWell also counts some second-order benefits (e.g., the expected income benefits of health interventions) that we typically don’t, so it isn’t totally obvious which direction this adjustment would end up pointing on net.
(It’s also not clear how we would want to make the appropriate adjustment even in principle, since there’s some division of labor going on where GiveWell has more conservative/skeptical epistemics, but we intentionally don’t consistently apply those epistemics across our work.) Overall, we think we should probably use a somewhat higher bar for our other BOTECs rather than just applying the same bar from GiveWell, but we’re not currently making a mechanical adjustment for this and don’t have a good sense of how big it should be.

4.3 Increases to our estimate of future available assets

Our available funding has increased significantly as a result of stock market moves over the last few years. (This is not independent of the assumption above of GiveWell’s available resources doubling, since that assumes a substantial increase in our giving to their top charities.) We’ve also become more optimistic about future funders contributing to highly-effective opportunities of the sort we may recommend, which would also lower our current bar on the margin. Some of this is driven by the emergence of other billionaires self-identifying with effective altruism, but it also reflects GiveWell’s increased funding from other donors, increasingly concentrated wealth at the top end of the global distribution, and the cryptocurrency boom. (Yes, that is weird, we know.)

By the same logic as above, increasing expected resources should lower the bar, but we don’t have as good of a model for how cost-effectiveness scales with resources in other Global Health and Wellbeing causes as we do for GiveWell recommendations, especially not for causes that we haven’t identified yet. If the only thing that changed were GiveWell autonomously lowering its bar and accordingly having less cost-effective marginal recommendations, we should in principle marginally reallocate away from GiveWell and to other opportunities.
But GiveWell isn’t independently lowering its bar; our overall plans and assessment of our bar contribute to the update. And given the composition of GiveWell’s top charities, made up of scalable, commodities-driven global health interventions, we expect them to have a lower eta (i.e., to decline less in cost-effectiveness with more funding) than opportunities like R&D or advocacy that are more people-intensive (where we have a prior that returns tend to be more like logarithmic, which is more steeply declining than our model of the GiveWell top charities). That should mean that as resources rise, a larger portion of the total should flow to GiveWell. And that is reflected in our most recent plans: we previously wrote that we expected something like 10% of Open Phil assets/spending to go to “straightforward charity” exemplified by the GiveWell top charities, but now anticipate likely giving a modestly higher proportion (which, combined with asset increases, will mean substantially increasing our support for them in dollar terms).

4.4 Updates to our estimate of the likely “last dollar” cost-effectiveness of our non-GiveWell spending

Above, we argued that the GiveWell bar should go down in terms of our old weights and stay roughly nominally flat (at ~1,000x) in our new units. While the first-order implication is that the bar should uniformly decline across all of our Global Health and Wellbeing work, there are some complicating considerations.

Our new higher valuation on health and the GiveWell bar going down make non-GiveWell global health R&D and advocacy opportunities look more promising than before, and we really don’t know how many of these opportunities we could find. If, hypothetically, we could find billions of dollars a year in global health R&D opportunities with marginal cost-effectiveness above the new GiveWell bar, then that should be the bar instead (and, by implication, we shouldn’t fund the GiveWell recommendations going forward).
That said, it seems unlikely a priori that marginal cost-effectiveness for billions of dollars of global health R&D or advocacy spending would end up right in between the old and new GiveWell bars (which fall by one third for child health interventions and 50% for adult health interventions38). So the most likely implications are that either (less likely) our bar has always been too low and we should have always been doing this hypothetical global health R&D or advocacy instead of supporting GiveWell top charities, or (more likely) we will find many good opportunities better than the marginal GiveWell dollar, but not enough that they independently drive our marginal dollar. Our new valuation on health (relative to income) is also a little higher than GiveWell’s, which in principle means that there could be things above our bar but below theirs, though in practice the valuations are close enough that we don’t think this is likely to be a big deal.

More concretely, we’ve been continuing to explore new areas that we think might be more leveraged than the GiveWell top charities, including recently making hires (announcements coming soon!) to lead new programs in global aid policy and South Asian air quality.
The basic thesis on these new causes is to try to “multiply sources of leverage for philanthropic impact (e.g., advocacy, scientific research, helping the global poor) to get more humanitarian impact per dollar (for instance via advocacy around scientific research funding or policies, or scientific research around global health interventions, or policy around global health and development).” With our first hires in these new causes starting next year, it’s too soon for us to have a major update on the marginal cost-effectiveness of spending far out the curve from where we are now, but we have a few modest updates:

• The a priori case for the existence of large-scale leveraged interventions more cost-effective than the GiveWell margin continues to seem compelling to us. The Bill and Melinda Gates Foundation spends well over half a billion dollars a year on global health research and development,[39] and we find it plausible (though far from obvious, and we haven’t been able to get great data on the question) that the marginal dollar there is better than the GiveWell margin.
• But we haven’t found anything obviously more cost-effective than the GiveWell margin and scalable to billions of dollars a year. We’re far from being done looking, and have only covered a small part of the conceptual space, but we’re also not prepared to bet that we will succeed at that scale in the future.
• For a more pessimistic prior, consider that at GiveWell’s bar of 8x GiveDirectly, the cost per outcome-as-good-as-averting-an-under-5 death is about $4,000-$4,500, and the cost per actual under-5 life saved (ignoring other benefits), for a charity focused on saving kids’ lives, is about $6,000-$7,000.[40] There are about 5 million under-5 deaths per year, which implies that they would all be eliminated for $20-35B/year if GiveWell’s cost-effectiveness level could be maintained at arbitrary scale. Total development assistance for health is more than that.
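As a quick check on the arithmetic in the last bullet (all figures taken from the text):

```python
# Back-of-the-envelope check: ~5M under-5 deaths/year, at $4,000-$7,000 per
# death averted (spanning the two cost figures cited above).
deaths_per_year = 5_000_000
cost_low, cost_high = 4_000, 7_000            # $ per under-5 death averted

total_low = deaths_per_year * cost_low        # $20B/year
total_high = deaths_per_year * cost_high      # $35B/year
print(f"${total_low / 1e9:.0f}B-${total_high / 1e9:.0f}B per year")

# At 10x that cost-effectiveness (~$400-$700 per death averted), $3B/year
# would avert roughly 4-7.5M deaths -- i.e., all under-5 mortality.
deaths_at_10x = (3e9 / (cost_high / 10), 3e9 / (cost_low / 10))
print(f"{deaths_at_10x[0] / 1e6:.1f}M-{deaths_at_10x[1] / 1e6:.1f}M deaths averted")
```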
If there were more than $3B/year of room for more funding an order of magnitude more cost-effective than GiveWell’s margin, it would need to be as effective as eliminating all child mortality. This argument isn’t decisive by any means, but it can help give a sense of how hard it would be to beat the GiveWell margin by a lot at massive scale: there just aren’t that many orders of magnitude to work with.

• We still don’t think all of our existing grantmaking necessarily hits this bar, and we are continuing to try to move flexible portfolios towards higher expected-return areas while learning more about our returns, and considering reducing our spending in others.

Our current best guess is that we won’t be able to spend as much as we need to within the Global Health and Wellbeing frame at a significantly higher marginal cost-effectiveness (though the average might be higher) than the GiveWell top charities, so the marginal cost-effectiveness of the GiveWell recommendations continues to be a relevant metric for our overall bar. However, we still expect most of our medium-term growth in GHW to be in new causes that can take advantage of the leveraged returns to research and advocacy, and we could imagine that we’ll eventually find enough room for more funding in those interventions that we will need to raise the bar again. We haven’t done a thorough analysis of the costs of over- vs. under-shooting “the bar” for all of our causes, but one important takeaway from our analysis of that question in the GiveWell setting, which may or may not properly extrapolate, is that saving and collecting investment returns isn’t “free” (or positive) from an impact perspective.
That is because, very roughly, the world is getting better (and accordingly opportunities to improve the world are getting worse) over time, and saved funding doesn’t have that long to compound and has to be spent later at a higher rate given the “spend down within our primary donors’ lifetimes” constraint (which in turn likely means it will be spent at a lower cost-effectiveness further out the annual spending curve). We also think we will be in a much better place to raise more funding from other donors if we’re spending down our existing resources, and the expected benefits of that, along with the ex ante possibility of raising a lot more money, make it theoretically ambiguous whether the costs of under- or over-estimating the “true” bar are higher. Accordingly, we’re just going with our very rough and round current best guess for the bar for now, rather than doing a full expected-value calculation, and we will revisit it in the future as we learn more.

4.5 Bottom line

We are now treating our bar as “roughly 1,000x” (with our new weights on health) for the GiveWell top charities and in our new cause selection and grantmaking, though we retain considerable uncertainty and expect to continue to revisit that over the coming years. For the typical mix of GiveWell interventions, this bar is about 20% lower given our new moral weights.[41] We think it’s important to note that the bar is very rough – we aren’t very confident that it, or the BOTECs we consider against it, are even within a factor of 2 of correct – and we will continue to put considerable weight on factors not included in our rough back-of-the-envelope calculations in making major decisions. Due to this analysis and the lower forward-looking bar, we’re planning to give more to the GiveWell top charities this year and going forward – more on that next week.
5. Appendix A

The available direct service interventions in health, like the ones GiveWell recommends, are far more cost-effective in low-income countries than in high-income countries, so the discussion above focuses on what value we should place on lifesaving interventions in low-income countries. If we were focused instead on saving lives in the developed world, likely via advocacy of some sort, we might trade off differently between lifesaving and income-enhancing interventions – we are uncertain over a range of rich-world DALY valuations between 2-6 units of log-income. There are theoretical and empirical reasons to think that the exchange rate at which people trade off mortality risks against income gains differs systematically across income levels, with richer people valuing mortality more relative to income.

5.1 VSL elasticity to income

The main empirical evidence is from the VSL literature described above. As discussed, economists often attempt to statistically estimate a function mapping VSL to income, anchoring it off the better-validated VSL figures in high-income countries. This literature generally finds that individuals’ willingness to pay for life expectancy increases with income, which is unsurprising – a dollar matters much less to a rich person, so if everyone valued a DALY equally then VSLYs would increase linearly with income. Most reviews also find that this willingness to pay increases at a faster pace than income. For example, Robinson et al. 2019 review the mainstream literature, which finds that the elasticity of VSL to income is between 1.0-1.2 across LICs (and a bit below 1 across the developed world), though Robinson’s own analysis suggests an elasticity of 1.5 for extrapolating to LMICs.[42] An elasticity of 1.2 would mean that if two individuals’ incomes differ by 10%, then on average the dollar value they place on a year of life expectancy will differ by 12%.
Lisa Robinson chaired a commission, sponsored by the Gates Foundation, that recommended the following ensemble approach:

• VSL is anchored to the US at 160x income, with an income elasticity of 1.5 and a lower bound of 20x income.
• VSL is anchored to the US at 160x income, with an income elasticity of 1.0.
• VSL is anchored to the OECD at 100x income, with an income elasticity of 1.0.

These yield the following VSLY estimates for someone at an income 1/100th that of the US:[43]

• 0.5x income, via the lower bound[44]
• 4x income[45]
• 2.5x income[46]

Here again is the chart from our discussion of LIC VSLYs above. As noted above, in this chart we plot all the estimates found by the literature search in Robinson et al. 2019’s meta-analysis – they searched for all VSL analyses, whether stated or revealed, in any country that had been classified as low- or middle-income in the last 20 years.[47] I’ve added a few lines to show various elasticities you could use to predict LMIC VSLYs based on US VSLYs. An elasticity of 1 would say that willingness to pay for a year of life expectancy varies 1:1 with income – combining that with the US VSLY of 4 years of income, we’d predict that individuals in any country are willing, on average, to trade 4 units of log-income for 1 year of life expectancy. If instead we combined an elasticity of 1.1 with the US VSLY, we’d predict that individuals at the global poverty line would be willing to trade roughly 2.5 years of income for a year of life expectancy – this is roughly in line with the VSLYs from IDinsight’s surveys of communities demographically similar to GiveWell beneficiaries. It seems to me that any elasticity between, say, 0.9 and 1.3 is potentially compatible with this data.
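The three ensemble estimates can be reproduced with a short sketch. The anchors, elasticities, and the 1/100 income ratio come from the list above; dividing a VSL by an assumed ~40 remaining life-years to get a VSLY is my assumption for illustration, not a figure from the commission.

```python
# Sketch of the Robinson ensemble above. VSL scales with income**elasticity,
# so VSL expressed as a multiple of *local* income scales with
# income**(elasticity - 1). LIFE_YEARS is an assumed remaining life expectancy.

def vsl_multiple(anchor_mult, elasticity, income_ratio, floor_mult=0.0):
    """VSL as a multiple of local income, extrapolated down from a richer anchor."""
    return max(anchor_mult * income_ratio ** (elasticity - 1), floor_mult)

LIFE_YEARS = 40      # assumption: life-years behind a VSL, to turn VSL into VSLY
ratio = 1 / 100      # income 1/100th of the anchor country's

models = [
    ("US 160x, elasticity 1.5, floor 20x", vsl_multiple(160, 1.5, ratio, floor_mult=20)),
    ("US 160x, elasticity 1.0",            vsl_multiple(160, 1.0, ratio)),
    ("OECD 100x, elasticity 1.0",          vsl_multiple(100, 1.0, ratio)),
]
for name, vsl in models:
    print(f"{name}: VSLY = {vsl / LIFE_YEARS:.1f}x income")
# Reproduces the 0.5x / 4x / 2.5x income estimates listed above.
```

The same formula covers the elasticity-1.1 extrapolation mentioned above: 4 × (1/100)^0.1 ≈ 2.5 years of income per year of life expectancy at the global poverty line.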
Analysts are often interested in the relationship between VSL and national-level income statistics like per-capita GDP – even though we often have measures of the respondents’ incomes – because it is often more practical to apply heuristics to national-level income statistics, especially when deriving VSL estimates to inform national policies. When we analyze VSLY measured in multiples of GNI per capita, rather than respondents’ income, we see a stronger relationship between income and VSLY – this data could be compatible with elasticities between, say, 1.0 and 1.5. This seems to be because the respondents in LMIC VSL studies report lower average incomes than their respective national averages – presumably because the social scientists performing these studies are interested in somewhat poorer populations.[48] We suspect that comparing LMIC VSL estimates to national-level income averages, rather than to respondents’ (lower) incomes, biases analyses like Robinson et al. toward finding higher elasticities of VSL to income.

5.2 Theoretical arguments for DALY valuations to vary by income

One theoretical argument goes like this: if you think we can improve individuals’ lives by improving their incomes, and you also think the moral impact of saving a life varies somewhat with the quality of that life (i.e. it’s better to extend a happy life than a miserable one), then it follows that it is more valuable in theory to extend the life of a typical high-income country resident than that of a typical person at the global poverty line.[49] Many people (including us) find this a deeply concerning line of reasoning – and critically, this theoretical dynamic is in reality swamped by the fact that it is dramatically less expensive to extend the life of someone at the global poverty line, which is why the overwhelming majority of our GHW portfolio is focused on extending lives and increasing incomes in low- and lower-middle-income countries.
5.3 Bringing these considerations together

We’ve been unsettled about how to aggregate these lines of argument, but we ended up concluding that we didn’t need to reach a resolution, because the expected cost of mistakes if we were wrong (i.e., if we assumed 2x and the true answer were 6x, or vice versa) was low.[50] For now, we haven’t decided specifically how to weigh lives-vs-income tradeoffs in high-income countries, and when we face decisions that might depend on the specifics, we will test a range of values between 2 and 6 units of log-income.

How Feasible Is Long-range Forecasting?

How accurate do long-range (≥10yr) forecasts tend to be, and how much should we rely on them? As an initial exploration of this question, I sought to study the track record of long-range forecasting exercises from the past. Unfortunately, my key finding so far is that it is difficult to learn much of value from those exercises, for the following reasons:

1. Long-range forecasts are often stated too imprecisely to be judged for accuracy. [More]
2. Even if a forecast is stated precisely, it might be difficult to find the information needed to check the forecast for accuracy. [More]
3. Degrees of confidence for long-range forecasts are rarely quantified. [More]
4. In most cases, no comparison to a “baseline method” or “null model” is possible, which makes it difficult to assess how easy or difficult the original forecasts were. [More]
5. Incentives for forecaster accuracy are usually unclear or weak. [More]
6. Very few studies have been designed so as to allow confident inference about which factors contributed to forecasting accuracy. [More]
7. It’s difficult to know how comparable past forecasting exercises are to the forecasting we do for grantmaking purposes, e.g. because the forecasts we make are of a different type, and because the forecasting training and methods we use are different.
[More]

We plan to continue to make long-range quantified forecasts about our work so that, in the long run, we might learn something about the feasibility of long-range forecasting, at least for our own case. [More]

1. Challenges to learning from historical long-range forecasting exercises

Most arguments I’ve seen about the feasibility of long-range forecasting are purely anecdotal. If arguing that long-range forecasting is feasible, the author lists a few example historical forecasts that look prescient in hindsight. If arguing that long-range forecasting is difficult or impossible, the author lists a few examples of historical forecasts that failed badly. How can we do better?

The ideal way to study the feasibility of long-range forecasting would be to conduct a series of well-designed prospective experiments testing a variety of forecasting methods on a large number of long-range forecasts of various kinds. However, doing so would require us to wait ≥10 years to get the results of each study and learn from them. To learn something about the feasibility of long-range forecasting more quickly, I decided to try to assess the track record of long-range forecasts from the past.

First, I searched for systematic retrospective accuracy evaluations of large collections of long-range forecasts. I identified a few such studies, but found that they all suffered from many of the limitations discussed below.[1: E.g. Kott & Perconti (2018); Fye et al. (2013); Albright (2002), which I previously discussed here; Parente & Anderson-Parente (2011).] I also collected past examples of long-range forecasting exercises I might evaluate for accuracy myself, but quickly determined that doing so would require more effort than the results would likely be worth. Finally, I reached out to the researchers responsible for a large-scale retrospective analysis with particularly transparent methodology,[2: This was Fye et al. (2013).
See Mullins (2012) for an extended description of the data collection and analysis process, and the attached spreadsheets of all included sources and forecasts and how they were evaluated in the study.] and commissioned them to produce a follow-up study focused on long-range forecasts. Its results were also difficult to learn from, again for some of the reasons discussed below (among others).[3: The commissioned follow-up study is Mullins (2018). A few notes on this study: the study was pre-registered at OSF Registries here. Relative to the pre-registration, Mullins (2018) extracted forecasts from a slightly different set of source documents, because one of the planned source documents …]

1.1 Imprecisely stated forecasts

If a forecast is phrased in a vague or ambiguous way, it can be difficult or impossible to subsequently judge its accuracy.[4: For further discussion of this point, see e.g. Tetlock & Gardner (2015), ch. 3. This can be a problem even for very short-range forecasts, but the challenge is often greater for long-range forecasts, since they often aim to make a prediction about circumstances, technologies, or measures that …] For example, consider the following forecasts:[5: The forecasts in this section are taken from the forecasts spreadsheet attached to Mullins (2018). In some cases they are slight paraphrases of the forecasting statements from the source documents.]
• From 1975: “By 2000, the tracking and data relay satellite system (TDRSS) will acquire and relay data at gigabit rates.”
• From 1980: “The world’s population will increase 55 percent, from 4.1 billion people in 1975 to 6.35 billion in 2000.”
• From 1977: “The average fuel efficiency of automobiles in the US will be 27 to 29 miles per gallon in 2000.”
• From 1972: “The CO2 concentration will reach 380 ppm by the year 2000.”
• From 1987: “In Germany, in the year 1990, 52.0% of women aged 15-64 will be registered as employed.”
• From 1967: “The installed power in the European Economic Community will grow by a factor of a hundred, from a programmed 3,700 megawatts in 1970 to 370,000 megawatts in 2000.”

Broadly speaking, these forecasts were stated with sufficient precision that we can now judge them as correct or incorrect. In contrast, consider the low precision of these forecasts:

• From 1964: “Operation of a central data storage facility with wide access for general or specialized information retrieval will be in use between 1971 and 1991.” What counts as “a central data storage facility”? What counts as “general or specialized information retrieval”? Perhaps most critically, what counts as “wide access”? Given the steady growth of (what we now call) the internet from the late 1960s onward, this forecast might be considered true for different decades depending on whether we interpret “wide access” to refer to access by thousands, or millions, or billions of people.
• From 1964: “In 2000, general immunization against bacterial and viral diseases will be available.” What is meant by “general immunization”? Did the authors mean a universal vaccine? Did they mean widely delivered vaccines protecting against several important and common pathogens? Did they mean a single vaccine that protects against several pathogens?
• From 1964: “In 2000, automation will have advanced further, from many menial robot services to sophisticated, high-IQ machines.” What counts as a “menial robot service,” and how many count as “many”? How widely do those services need to be used? What is a high-IQ machine? Would a machine that can perform well on IQ tests but nothing else count? Would a machine that can outperform humans on some classic “high-IQ” tasks (e.g. chess-playing) count?
• From 1964: “Reliable weather forecasts will be in use between 1972 and 1988.” What accuracy score counts as “reliable”?
• From 1983: “Between 1983 and 2000, large corporate farms that are developed and managed by absentee owners will not account for a significant number of farms.” What counts as a “large” corporate farm? What counts as a “significant number”?

In some cases, even an imprecisely phrased forecast can be judged as uncontroversially true or false, if all reasonable interpretations are true (or false). But in many cases, it’s impossible to determine whether a forecast should be judged as true or false. Unfortunately, it can often require substantial skill and effort to transform an imprecise expectation into a precisely stated forecast, especially for long-range forecasts.[6: Technically, it should be possible to transform almost any imprecise forecast into a precise forecast using a “human judge” approach, but this can often be prohibitively expensive. In a “human judge” approach, one would write down an imprecise forecast, perhaps along with …] In such cases, one can choose to invest substantial effort into improving the precision of one’s forecasting statement, perhaps with help from someone who has developed substantial expertise in methods for addressing this difficulty (e.g. the “Questions team” at Good Judgment Inc.).
Or, one can make the forecast despite its imprecision, to indicate something about one’s expectations, while understanding that it may be impossible to later judge it as true or false. Regardless, the frequent imprecision of historical long-range forecasts makes it difficult to assess them for accuracy.

1.2 Practically uncheckable forecasts

Even if a forecast is stated precisely, it might be difficult to check for accuracy if the information needed to judge the forecast is non-public, difficult to find, untrustworthy, or not available at all. This can be an especially common problem for long-range forecasts, for example because variables that are reliably measured (e.g. by a government agency) when the forecast is made might no longer be reliably measured at the time of the forecast’s “due date.” For example, in the study we recently commissioned,[7: See the forecasts spreadsheet attached to Mullins (2018).] the following forecasts were stated with relatively high precision, but it was nevertheless difficult to find reliable sources of “ground truth” information that could be used to judge the exact claim of the original forecast:

• From 1967: “By the year 2000, the US will include approximately 232 million people age 14 and older.” The commissioned study found two “ground truth” sources for judging this forecast, but some guesswork was still required because the two sources disagreed with each other substantially, and one source had information on the population of those 15 and older but not of those 14 and older.
• From 1980: “In 2000, 400 cities will have passed the million population mark.” In this case there is some ambiguity about what counts as a city, but even setting that aside, some guesswork was still required: the commissioned study found two “ground truth” sources for judging this forecast, but those sources included figures for some years (implying particular average trends that could be extrapolated) and not for 2000 exactly.
1.3 Non-quantified degrees of confidence

In most forecasting exercises I’ve seen, forecasters provide little or no indication of how confident they are in each of their forecasts, which makes it difficult to assess their overall accuracy in a meaningful way. For example, if 50% of a forecaster’s predictions are correct, we would assess their accuracy very differently if they made those forecasts with 90% confidence vs. 50% confidence. If degrees of confidence are not quantified, there is no way to compare the forecaster’s subjective likelihoods to the objective frequencies of events.[8: One recent proposal is to infer forecasters’ probabilities from their imprecise forecasting language, as in Lehner et al. (2012). I would like to see this method validated more extensively before I rely on it.]

Unfortunately, in the long-range forecasting exercises I’ve seen, degrees of confidence are often not mentioned at all. If they are mentioned, forecasters typically use imprecise language such as “possibly” or “likely,” terms which can be used to refer to hugely varying degrees of confidence.[9: E.g. see figure 18 in chapter 12 of Heuer (1999); a replication of that study by Reddit.com user zonination here; Wheaton (2008); Mosteller & Youtz (1990); Mauboussin & Mauboussin (2018) (original results here); table 1 of Mandel (2015). I haven’t vetted these studies.] Such imprecision can sometimes lead to poor decisions,[10: Tetlock & Gardner (2015), ch. 3, gives the following (possible) example: In 1961, when the CIA was planning to topple the Castro government by landing a small army of Cuban expatriates at the Bay of Pigs, President John F. Kennedy turned to the military for an unbiased assessment. The Joint …] and it means that such forecasts cannot be assessed using calibration and resolution measures of accuracy.

1.4 No comparison to a baseline method or null model is feasible

One way to make a large number of correct forecasts is to make only easy forecasts, e.g.
“in 10 years, world population will be larger than 5 billion.” One can also use this strategy to appear impressively well-calibrated, e.g. by making forecasts like “With 50% confidence, when I flip this fair coin it will come up heads.” And because forecasts can vary greatly in difficulty, it can be misleading to compare the accuracy of forecasters who made forecasts about different phenomena.[11: One recent proposal for dealing with this problem is to use Item Response Theory, as described in Bo et al. (2017): The conventional estimates of a forecaster’s expertise (e.g., his or her mean Brier score, based on all events forecast) are content dependent, so people may be assigned higher or …] For example, forecasters making predictions about data-rich domains (e.g. sports or weather) might have better Brier scores than forecasters making predictions about data-poor domains (e.g. novel social movements or rare disasters), but that doesn’t mean the sports and weather forecasters are better or “more impressive” forecasters – it may just be that they have limited themselves to easier-to-forecast phenomena.

To assess the ex ante difficulty of some set of forecasts, one could compare the accuracy of a forecasting exercise’s effortfully produced forecasts against the accuracy of forecasts about the same statements produced by some naive “baseline” method, e.g. a simple poll of broadly educated people (conducted at the time of the original forecasting exercise), or a simple linear extrapolation of the previous trend (if time series data are available for the phenomenon in question). Unfortunately, such naive baseline comparisons are often unavailable. Even if no comparison to the accuracy of a naive baseline method is available, one can sometimes compare the accuracy of a set of forecasts to the accuracy predicted by a “null model” of “random” forecasts.
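A toy version of such a null-model comparison, using a multi-option Brier score (0 is perfect; lower is better). The question wordings are the mutually exclusive, exhaustive kind used in Tetlock’s tournaments; the expert’s probabilities are invented for illustration:

```python
# Toy null-model comparison: score forecasts with a multi-option Brier score
# (sum of squared errors across options) and compare against a "dart-throwing
# chimp" null model that spreads probability evenly over the options.

def brier(probs, outcome_index):
    """Brier score for one question: probs over options, one of which occurred."""
    return sum(
        (p - (1.0 if i == outcome_index else 0.0)) ** 2 for i, p in enumerate(probs)
    )

# Two-option question ("Will X still be President on date D?"); X stays (index 0).
chimp2 = brier([1/2, 1/2], outcome_index=0)    # random null model
expert2 = brier([0.9, 0.1], outcome_index=0)   # invented expert forecast
print(chimp2, expert2)                         # chimp 0.5 vs expert ~0.02

# Three-option question (borders expand / contract / stay the same); "same" occurs.
chimp3 = brier([1/3, 1/3, 1/3], outcome_index=2)
print(round(chimp3, 3))
```

A set of effortful forecasts is only impressive to the extent that it beats the corresponding chimp scores across the same questions.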
For example, for the forecasting tournaments described in Tetlock (2005), all forecasting questions came with answer options that were mutually exclusive and mutually exhaustive, e.g. “Will [some person] still be President on [some date]?” or “Will [some state’s] borders remain the same, expand, or contract by [some date]?”[12: See the Methodological Appendix of Tetlock (2005).] Because of this, Tetlock knew the odds that a “dart-throwing chimp” (i.e. a random forecast) would get each question right (a 50% chance for the first question, a 1/3 chance for the second question). He could then compare the accuracy of expert forecasters to the accuracy of a random-forecast “null model.” Unfortunately, the forecasting questions of the long-range forecasting exercises I’ve seen are rarely set up to allow for the construction of a null model to compare against the (effortful) forecasts produced by the forecasting exercise.[13: This includes the null models used in Fye et al. (2013) and Mullins (2018), which I don’t find convincing.]

1.5 Unclear or weak incentives for accuracy

For most long-range forecasting exercises I’ve seen, it’s either unclear how much incentive there was for forecasters to strive for accuracy, or the incentives for accuracy seem clearly weak. For example, in many long-range forecasting exercises, there seems to have been no concrete plan to check the accuracy of the study’s forecasts at a particular time in the future – and in fact, the forecasts from even the most high-profile long-range forecasting studies I’ve seen were never checked for accuracy (as far as I can tell), at least not by anyone associated with the original study or funded by the same funder(s). Without a concrete plan to check the accuracy of the forecasts, how strong could the incentive for forecaster accuracy be?
Furthermore, long-range forecasting exercises are rarely structured as forecasting tournaments, with multiple individuals, groups, or methods competing to make the most accurate forecasts about the same forecasting questions (or heavily overlapping sets of forecasting questions). As such, there’s no way to compare the accuracy of one individual, group, or method against another, and again it’s unclear whether the forecasters had much incentive to strive for accuracy. Also, some studies that were set up to eventually check the accuracy of the forecasts didn’t use a scoring rule that reliably incentivizes reporting one’s true probabilities, i.e. a proper scoring rule.

1.6 Weak strategy for causal identification

Even if a study passes the many hurdles outlined above, and there are clearly demonstrated accuracy differences between different forecasting methods, it can still be difficult to learn which factors contributed to those accuracy differences if the study was not structured as a randomized controlled trial and no other strong causal identification strategy was available.[14: On the tricky challenge of robust causal identification from observational data, see e.g. Athey & Imbens (2017) and Hernán & Robins (forthcoming).]

1.7 Unclear relevance to our own long-range forecasting

I haven’t yet found a study that (1) evaluates the accuracy of a large collection of somewhat-varied[15: By “somewhat varied,” I mean to exclude studies that are e.g. limited to forecasting variables for which substantial time series data are available, or variables in a very narrow domain such as a handful of macroeconomic indicators or a handful of environmental indicators.] long-range (≥10yr) forecasts and that (2) avoids the limitations above. If you know of such a study, please let me know.
Tetlock’s “Expert Political Judgment” project (EPJ; Tetlock 2005) and his “Good Judgment Project” (GJP; Tetlock & Gardner 2015) might come closest to satisfying those criteria, and that is a major reason we have prioritized learning what we can from Tetlock’s forecasting work specifically (e.g. see here) and have supported his ongoing research. Tetlock’s work hasn’t focused on long-range forecasting specifically, but because it largely (though not entirely) avoids the other limitations above, I will briefly explore what I think we can and can’t learn from his work about the feasibility of long-range forecasting, and use it to explore the more general question of how studies of long-range forecasting can be of unclear relevance to our own forecasting even when they largely avoid the other limitations discussed above.

1.7.1 Tetlock, long-range forecasting, and questions of relevance

Most GJP forecasts had time horizons of 1-6 months,[16: See figure 3 of this December 2015 draft of a paper eventually published (without that figure) as Friedman et al. (2018).] and thus can tell us little about the feasibility of long-range (≥10yr) forecasting.[17: Despite this, I think we can learn a little from GJP about the feasibility of long-range forecasting. Good Judgment Project’s Year 4 annual report to IARPA (unpublished), titled “Exploring the Optimal Forecasting Frontier,” examines forecasting accuracy as a function of …] In Tetlock’s EPJ studies, however, forecasters were asked a variety of questions with forecasting horizons of 1-25 years. (Forecasting horizons of 1, 3, 5, 10, or 25 years were most common.)
Unfortunately, by the time of Tetlock (2005), only a few 10-year forecasts (and no 25-year forecasts) had come due, so Tetlock (2005) only reports accuracy results for forecasts with forecasting horizons he describes as “short-term” (1-2 years) and “long-term” (usually 3-5 years, plus a few longer-term forecasts that had come due).[18: Forecasting horizons are described under “Types of Forecasting Questions” in the Methodological Appendix of Tetlock (2005). The definitions of “short-term” and “long-term” were provided via personal communication with Tetlock, as was the fact that only a few …] The differing accuracy scores for short-term vs. long-term forecasts in EPJ are sometimes used to support a claim that the accuracy of expert predictions declines toward chance five years out.[19: E.g. Tetlock himself says “there is no evidence that geopolitical or economic forecasters can predict anything ten years out beyond the excruciatingly obvious — ‘there will be conflicts’ — and the odd lucky hits that are inevitable whenever lots of forecasters make lots of …”] While it’s true that accuracy declined “toward” chance five years out, the accuracy differences reported in Tetlock (2005) are not as large as I had assumed upon initially hearing this claim (see footnote for details[20: Tetlock (2005) reports both calibration scores and discrimination (aka resolution) scores, explaining that: “A calibration score of .01 indicates that forecasters’ subjective probabilities diverged from objective frequencies, on average, by about 10 percent; a score of .04, an average gap …”]). Fortunately, we might soon be in a position to learn more about long-range forecasting from the EPJ data, since most EPJ forecasts (including most 25-year forecasts) will have resolved by 2022.[21: Personal communication with Phil Tetlock. According to the Acknowledgements section at the back of Tetlock (2005), all EPJ forecasts will come due by 2026.]
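The calibration score in footnote 20 can be made concrete: bucket forecasts by stated probability, compare each bucket’s stated probability with the observed frequency of the events, and take the size-weighted average of the squared gaps. The data below are invented to produce the ~10%-average-divergence (score of .01) case the footnote describes.

```python
# Invented-data sketch of a calibration score: the size-weighted mean squared
# gap between stated probabilities and observed frequencies.
from collections import defaultdict

def calibration_index(forecasts):
    """forecasts: list of (stated_probability, event_occurred) pairs."""
    buckets = defaultdict(list)
    for p, occurred in forecasts:
        buckets[p].append(1.0 if occurred else 0.0)
    n = len(forecasts)
    return sum(len(v) * (p - sum(v) / len(v)) ** 2 for p, v in buckets.items()) / n

# "90%" forecasts that come true only 80% of the time, and "10%" forecasts
# whose events happen 20% of the time: a 10% average divergence, score .01.
data = ([(0.9, True)] * 8 + [(0.9, False)] * 2 +
        [(0.1, True)] * 2 + [(0.1, False)] * 8)
print(round(calibration_index(data), 4))   # 0.01
```

A perfectly calibrated forecaster (90% claims true 90% of the time, and so on) would score 0 on this measure regardless of how bold or timid their probabilities were, which is why EPJ reports discrimination/resolution separately.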
Perhaps more importantly, how analogous are the forecasting questions from EPJ to the forecasting questions we face as a grantmaker, and how similar was the situation of the EPJ forecasters to the situation we find ourselves in? For context, some (paraphrased) representative example “long-term” forecasting questions from EPJ include:[22]

• Two elections from now, will the current majority in the legislature of [some stable democracy] lose its majority, retain its majority, or strengthen its majority?
• In the next five years, will GDP growth rates in [some nation] accelerate, decelerate, or remain about the same?
• Over the next ten years, will defense spending as a percentage of [some nation’s] expenditures rise, fall, or stay about the same?
• In the next [ten/twenty-five] years, will [some state] deploy a nuclear or biological weapon of mass destruction (according to the CIA Factbook)?

A few observations come to mind as I consider analogies and disanalogies between EPJ’s “long-term” forecasting and the long-range forecasting we do as a grantmaker:[23]

• For most of our history, we’ve had the luxury of knowing the results from EPJ and GJP and being able to apply them to our forecasting, which of course wasn’t true for the EPJ forecasters. For example, many of our staff know that it’s often best to start one’s forecast from an available base rate, and that many things probably can’t be predicted with better accuracy than chance (e.g. which party will be in the majority two elections from now).
Many of our staff have also done multiple hours of explicit calibration training, and my sense is that very few (if any) EPJ forecasters are likely to have done calibration training prior to making their forecasts. Several of our staff have also participated in a Good Judgment Inc. forecasting training workshop.

• EPJ forecasting questions were chosen very carefully, such that they (a) were stated precisely enough to be uncontroversially judged for accuracy, (b) came with prepared answer options that were mutually exclusive and collectively exhaustive (or continuous), (c) were amenable to base rate forecasting (though base rates were not provided to the forecasters), and (d) satisfied other criteria necessary for rigorous study design.[24] In contrast, most of our forecasting questions (1) are stated imprecisely (because the factors that matter most to the grant decision are ~impossible or prohibitively costly to state precisely), (2) are formulated very quickly by the forecaster (i.e. the grant investigator) as they fill out our internal grant write-up template, and thus don’t come with pre-existing answer options, and (3) rarely have clear base rate data to learn from. Overall, this might suggest we should (ignoring other factors) expect lower accuracy than was observed in EPJ, e.g. because we formulate questions and make forecasts about them so quickly. It also means that we are less able to learn from the forecasts we make, because many of them are stated too imprecisely to judge for accuracy.

• I’m unsure whether EPJ questions asked about phenomena that are “intrinsically” easier or harder to predict than the phenomena we try to predict. E.g. party control in established democracies changes regularly and is thus very difficult to predict even one or two elections in advance, whereas some of our grantmaking is premised substantially on the continuation of stable long-run trends.
On the other hand, many of our forecasts are (as mentioned above) about phenomena which lack clearly relevant base rate data to extrapolate, or (in some cases) about events that haven’t ever occurred before.

• How motivated were EPJ forecasters to strive for accuracy? Presumably the rigorous setup and concrete plan to measure forecast accuracy provided substantial incentives for accuracy, though on the other hand, the EPJ forecasters knew their answers and accuracy scores would be anonymous. Meanwhile, explicit forecasting is a relatively minor component of Open Phil staffers’ work, and our less rigorous setup means that incentives for accuracy may be weak, but also our (personally identified) forecasts are visible to many other staff.

Similar analogies and disanalogies also arise when comparing our forecasting situation to that of the forecasters who participated in other studies of long-range forecasting. This should not be used as an excuse to avoid drawing lessons from studies when we should, but it does mean that it may be tricky to assess what we should learn about our own situation from even very well-designed studies of long-range forecasting.

2. Our current attitude toward long-range forecasting

Despite our inability to learn much (thus far) about the feasibility of long-range forecasting, and therefore also about best practices for long-range forecasting, we plan to continue to make long-range quantified forecasts about our work so that, in the long run, we might learn something about the feasibility of long-range forecasting, at least for our own case. We plan to say more in the future about what we’ve learned about forecasting in our own grantmaking context, especially after a larger number of our internal forecasts have come due and then been judged for accuracy.

Footnotes

1 E.g. Kott & Perconti (2018); Fye et al. (2013); Albright (2002), which I previously discussed here; Parente & Anderson-Parente (2011).

2 This was Fye et al. (2013).
See Mullins (2012) for an extended description of the data collection and analysis process, and attached spreadsheets of all included sources and forecasts and how they were evaluated in the study.

3 The commissioned follow-up study is Mullins (2018). A few notes on this study:

• The study was pre-registered at OSF Registries here. Relative to the pre-registration, Mullins (2018) extracted forecasts from a slightly different set of source documents, because one of the planned source documents didn’t fit the study’s criteria upon examination, and we needed to identify additional source documents to ensure we could reach our target of ≥400 validated long-range forecasts.
• Three spreadsheets are attached to the PDF of Mullins (2018): one with details on all source documents, one with details on all evaluated forecasts, and one with details on the “ground truth evidence” used to assess the accuracy of each forecast.
• I chose the source documents based on how well they seemed (upon a quick skim) to meet as many of the following criteria as possible (the first two criteria were necessary, the others were ideal but not required):
  • One of the authors’ major goals was to say something about which events/scenarios were more vs. less likely, as opposed to merely aiming to e.g. “paint possible futures.”
  • The authors made forecasts of events/scenarios ≥10yrs away, that were expected to be somewhat different from present reality. (E.g. not “vacuum cleaners will continue to exist.”)
  • The authors expressed varying degrees of confidence for many of their forecasts, quantitatively or at least with terms such as “likely,” “unlikely,” “highly likely,” etc.
  • The authors made some attempt to think about which plans made sense given their forecasts. (I.e., important decisions were at stake, or potentially at stake.)
  • The authors’ language suggests they had some degree of self-awareness about the difficulty of long-range forecasting.
  • The authors seemed to have a decent grasp of not just the domain they were trying to forecast, but also of broadly applicable reasoning tools such as those from economics.
  • The authors made their forecasts after ~1965 (so they had access to a decent amount of “modern” science) but before 2007 (so that we’d have some ≥10yr forecasts evaluable for accuracy).
  • The authors seemed to put substantial effort into their forecasts, e.g. with substantial analysis, multiple lines of argument, thoughtful caveats, engagement with subject-matter experts, etc.
  • The authors were writing for a fairly serious audience with high expectations, e.g. an agency of a leading national government.

Since Mullins (2018) is modeled after Fye et al. (2013), we knew in advance it would have several of the limitations described in this post, but we hoped to learn some things from it anyway, especially given the planned availability of the underlying raw data. Unfortunately, upon completion we discovered additional limitations of the study. For example, Mullins (2018) implicitly interprets all forecasts as “timing forecasts” of the form “event X will first occur in approximately year Y.” This has some advantages (e.g. allowing one to operationalize some notion of “approximately correct”), but it also leads to counterintuitive judgments in many cases:

• In some cases, forecasts that seem to be of the form “X will be true in year Y” are interpreted for evaluation as “event X will first occur in approximately year Y.” For example, consider the following forecast made in 1975: “In 1985, deep-space communication stations on Earth will consist of two 64-meter antennas plus one 26-meter antenna at Goldstone, California; Madrid, Spain; and Canberra, Australia” (Record ID #2001).
This forecast was judged incorrect, and with a temporal forecasting error of 13 years, on the grounds that the forecasted state of affairs was already true 13 years earlier (in 1972), rather than having come to be true in approximately 1985.
• In other cases, forecasts that seem to be of the form “parameter P will have approximately value V in year Y” are interpreted for evaluation as “parameter P will first approximately hit value V in year Y.” For example, consider the following forecast made in 1978: “In Canada, in the year 1990, 55.2% of women aged 15 – 64 will be registered as employed” (Record ID #2748). The forecast was judged as incorrect because the true value in 1990 was 58.5%, and had reached 55% in 1985, just barely outside the “within 30%” rule for judging a forecast as a success. In this example, it seems more reasonable to say that the original forecast was nearly (but not quite) correct for 1990, rather than interpreting the original forecast as being primarily about the timing of when the female labor force participation rate would hit exactly 55.2%. (The forecast is correctly marked as “Mostly realized,” but the analytic setup doesn’t allow this label to have much effect on the top-line quantitative results.)
• Some forecasts aren’t interpretable as timing forecasts at all, and thus shouldn’t have been included when comparing the success rate of the evaluated forecasts against BryceTech’s “null model” (i.e. random forecast) success rate, which assumes forecasts are timing forecasts. Example forecasts that can’t be interpreted as timing forecasts include negative forecasts (e.g. Record ID #2336: “In the year 2000, fusion power will not be a significant source of energy”), no-change forecasts (e.g. Record ID #2364: “The world’s population in the year 2000 will be less than the seven billion”), and whole-period forecasts (e.g.
Record ID #2370: “The continent of Africa will have a population growth rate of 2.7 per cent over the 1965-2000 period”). Many of these forecasts were assigned a temporal forecasting error of 0 despite not being interpretable as timing forecasts.

There are other limits to the data and analysis in Mullins (2018), and we don’t think one should draw major substantive conclusions from it. It may, however, be a useful collection of long-range forecasts that could be judged and analyzed for accuracy using alternate methods. My thanks to Kathleen Finlinson and Bastian Stern for their help evaluating this report.

4 For further discussion of this point, see e.g. Tetlock & Gardner (2015), ch. 3. This can be a problem even for very short-range forecasts, but the challenge is often greater for long-range forecasts, since they often aim to make a prediction about circumstances, technologies, or measures that aren’t yet well-defined at the time the forecast is made.

5 The forecasts in this section are taken from the forecasts spreadsheet attached to Mullins (2018). In some cases they are slight paraphrases of the forecasting statements from the source documents.

6 Technically, it should be possible to transform almost any imprecise forecast into a precise forecast using a “human judge” approach, but this can often be prohibitively expensive. In a “human judge” approach, one would write down an imprecise forecast, perhaps along with some accompanying material about motivations and reasoning and examples of what would and wouldn’t satisfy the intention of the forecast, and then specify a human judge (or panel of judges) who will later decide whether one’s imprecise forecast should be judged true or false (or, each judge could give a Likert-scale rating of “how accurate” or “how clearly accurate” the forecast was). Then, one can make a precise forecast about the future judgment of the judge(s).
The precise forecast, then, would be a forecast both about the phenomenon one wishes to forecast, and about the psychology and behavior of the judge(s). Of course, one’s precise forecast must also account for the possibility that one or more judges will be unwilling or unable to provide a judgment at the required time. An example of this “human judge” approach is the following forecast posted to the Metaculus forecasting platform: “Will radical new ‘low-energy nuclear reaction’ technologies prove effective before 2019?” In this case, the exact (but still somewhat imprecise) forecasting statement was: “By Dec. 31, 2018, will Andrea Rossi/Leonardo/Industrial Heat or Robert Godes/Brillouin Energy have produced fairly convincing evidence (> 50% credence) that their new technology […] generates substantial excess heat relative to electrical and chemical inputs?” Since there remains some ambiguity about e.g. what should count as “convincing evidence,” the question page also specifies that “The bet will be settled by [Huw] Price and [Carl] Shulman by New Years Eve 2018, and in the case of disagreement shall defer to majority vote of a panel of three physicists: Anthony Aguirre, Martin Rees, and Max Tegmark.”

7 See the forecasts spreadsheet attached to Mullins (2018).

8 One recent proposal is to infer forecasters’ probabilities from their imprecise forecasting language, as in Lehner et al. (2012). I would like to see this method validated more extensively before I rely on it.

9 E.g. see figure 18 in chapter 12 of Heuer (1999); a replication of that study by Reddit.com user zonination here; Wheaton (2008); Mosteller & Youtz (1990); Mauboussin & Mauboussin (2018) (original results here); table 1 of Mandel (2015). I haven’t vetted these studies.

10 Tetlock & Gardner (2015), ch. 3, gives the following (possible) example: In 1961, when the CIA was planning to topple the Castro government by landing a small army of Cuban expatriates at the Bay of Pigs, President John F.
Kennedy turned to the military for an unbiased assessment. The Joint Chiefs of Staff concluded that the plan had a “fair chance” of success. The man who wrote the words “fair chance” later said he had in mind odds of 3 to 1 against success. But Kennedy was never told precisely what “fair chance” meant and, not unreasonably, he took it to be a much more positive assessment. Of course we can’t be sure that if the Chiefs had said “We feel it’s 3 to 1 the invasion will fail” that Kennedy would have called it off, but it surely would have made him think harder about authorizing what turned out to be an unmitigated disaster.

11 One recent proposal for dealing with this problem is to use Item Response Theory, as described in Bo et al. (2017): The conventional estimates of a forecaster’s expertise (e.g., his or her mean Brier score, based on all events forecast) are content dependent, so people may be assigned higher or lower “expertise” scores as a function of the events they choose to forecast. This is a serious shortcoming because (a) typically judges do not forecast all the events and (b) their choices of which events to forecast are not random. In fact, one can safely assume that they select questions strategically: Judges are more likely to make forecasts about events in domains where they believe (or are expected to) have expertise or events they perceive to be “easy” and highly predictable, so their Brier scores are likely to be affected by this self-selection that, typically, leads to overestimation of one’s expertise. Thus, all comparisons among people who forecast distinct sets of events are of questionable quality. A remedy to this problem is to compare directly the forecasting expertise based only on the forecasts to the common subset of events forecast by all. But this approach can also run into problems. As the number of forecasters increases, comparisons may be based on smaller subsets of events answered by all and become less reliable and informative.
As an example, consider financial analysts who make predictions regarding future earnings of companies that are traded on the market. They tend to specialize in various areas, so it is practically impossible to compare the expertise of an analyst that focuses on the automobile industry and another that specialize in the telecommunication area, since there is no overlap between their two areas. Any difference between their Brier scores could be a reflection of how predictable one industry is, compared to the other, and not necessarily of the analysts’ expertise and forecasting ability. An IRT model can solve this problem. Assuming forecasters are sampled from a population with some distribution of expertise, a key property of IRT models is invariance of parameters (Hambleton & Jones, 1993): (1) parameters that characterize an individual forecaster are independent of the particular events from which they are estimated; (2) parameters that characterize an event are independent of the distribution of the abilities of the individuals who forecast them (Hambleton, Swaminathan & Rogers, 1991). In other words, the estimated expertise parameters allow meaningful comparisons of all the judges from the same population as long as the events require the same latent expertise (i.e., a unidimensional assumption). …we describe an IRT framework in which one can incorporate any proper scoring rule into the model, and we show how to use weights based on event features in the proper scoring rules. This leads to a model-based method for evaluating forecasters via proper scoring rules, allowing us to account for additional factors that the regular proper scoring rules rarely consider.

I have not evaluated this approach in detail and would like to see it critiqued and validated by other experts. On this general challenge, see also the discussion of “Difficulty-adjusted probability scores” in the Technical Appendix of Tetlock (2005).

12 See the Methodological Appendix of Tetlock (2005).
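The self-selection problem Bo et al. (2017) describe in footnote 11 is easy to see with mean Brier scores. Here is a minimal sketch, using hypothetical forecaster data and the simple restrict-to-common-questions remedy the quote mentions (not their IRT model):

```python
def mean_brier(records):
    """records: dict mapping question id -> (stated probability, 0/1 outcome).
    Returns the mean Brier (squared-error) score over all answered
    questions; lower is better, but it depends on which questions the
    forecaster chose to answer."""
    return sum((p - o) ** 2 for p, o in records.values()) / len(records)

def comparable_mean_brier(a, b):
    """Score two forecasters only on the questions both answered.
    This sidesteps self-selection bias, but the shared subset shrinks
    (and the comparison weakens) as forecasters specialize."""
    common = a.keys() & b.keys()
    if not common:
        raise ValueError("no overlapping questions; scores are not comparable")
    return (
        sum((a[q][0] - a[q][1]) ** 2 for q in common) / len(common),
        sum((b[q][0] - b[q][1]) ** 2 for q in common) / len(common),
    )
```

With two forecasters who answered only partly overlapping question sets, `comparable_mean_brier` compares them on the shared questions alone; the `ValueError` branch is the degenerate case the quote warns about, where specialization leaves no overlap at all.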
13 This includes the null models used in Fye et al. (2013) and Mullins (2018), which I don’t find convincing.

14 On the tricky challenge of robust causal identification from observational data, see e.g. Athey & Imbens (2017) and Hernán & Robins (forthcoming).

15 By “somewhat varied,” I mean to exclude studies that are e.g. limited to forecasting variables for which substantial time series data is available, or variables in a very narrow domain such as a handful of macroeconomic indicators or a handful of environmental indicators.

16 See figure 3 of this December 2015 draft of a paper eventually published (without that figure) as Friedman et al. (2018).

17 Despite this, I think we can learn a little from GJP about the feasibility of long-range forecasting. Good Judgment Project’s Year 4 annual report to IARPA (unpublished), titled “Exploring the Optimal Forecasting Frontier,” examines forecasting accuracy as a function of forecasting horizon in this figure (reproduced with permission): This chart uses an accuracy statistic known as AUC/ROC (see Steyvers et al. 2014) to represent the accuracy of binary, non-conditional forecasts, at different time horizons, throughout years 2-4 of GJP. Roughly speaking, this chart addresses the question: “At different forecasting horizons, how often (on average) were forecasters on ‘the right side of maybe’ (i.e.
above 50% confidence in the binary option that turned out to be correct), where 0.5 represents ‘no better than chance’ and 1 represents ‘always on the right side of maybe’?” For our purposes here, the key results shown above are, roughly speaking, that (1) regular forecasters did approximately no better than chance on this metric at ~375 days before each question closed, (2) superforecasters did substantially better than chance on this metric at ~375 days before each question closed, (3) both regular forecasters and superforecasters were almost always “on the right side of maybe” immediately before each question closed, and (4) superforecasters were roughly as accurate on this metric at ~125 days before each question closed as they were at ~375 days before each question closed. If GJP had involved questions with substantially longer time horizons, how quickly would superforecaster accuracy have declined with longer time horizons? We can’t know, but an extrapolation of the results above is at least compatible with an answer of “fairly slowly.” Of course there remain other questions about how analogous the GJP questions are to the types of questions that we and other actors attempt to make long-range forecasts about.

18 Forecasting horizons are described under “Types of Forecasting Questions” in the Methodological Appendix of Tetlock (2005). The definitions of “short-term” and “long-term” were provided via personal communication with Tetlock, as was the fact that only a few 10-year forecasts could be included in the analysis of Tetlock (2005).

19 E.g. Tetlock himself says “there is no evidence that geopolitical or economic forecasters can predict anything ten years out beyond the excruciatingly obvious — ‘there will be conflicts’ — and the odd lucky hits that are inevitable whenever lots of forecasters make lots of forecasts. These limits on predictability are the predictable results of the butterfly dynamics of nonlinear systems.
In my EPJ research, the accuracy of expert predictions declined toward chance five years out” (Tetlock & Gardner 2015, p. 243).

20 Tetlock (2005) reports both calibration scores and discrimination (aka resolution) scores, explaining that: “A calibration score of .01 indicates that forecasters’ subjective probabilities diverged from objective frequencies, on average, by about 10 percent; a score of .04, an average gap of 20 percent. A discrimination score of .01 indicates that forecasters, on average, predicted about 6 percent of the total variation in outcomes; a score of .04, that they captured 24 percent” (Tetlock 2005, ch. 2). See the book’s Technical Appendix for details on how Tetlock’s calibration and discrimination scores are computed. Given this scoring system, Tetlock’s results on the accuracy of short-term vs. long-term forecasts are:

Sample of forecasts             | Calibration score | Discrimination score
Expert short-term forecasts     | .023              | .027
Expert long-term forecasts      | .026              | .021
Non-expert short-term forecasts | .024              | .023
Non-expert long-term forecasts  | .020              | .021

The data above are from figure 2.4 of Tetlock (2005). I’ve renamed “dilettantes” to “non-experts.” See also this spreadsheet, which contains additional short-term vs. long-term accuracy comparisons in data points estimated from figure 3.2 of Tetlock (2005) using WebPlotDigitizer. See ch. 3 and the Technical Appendix of Tetlock (2005) for details on how to interpret these data points. Also note that there is a typo in the caption for figure 3.2; I confirmed with Tetlock that the phrase which reads “long-term (1, 2, 5, 7…)” should instead be “long-term (1, 3, 5, 7…).”

21 Personal communication with Phil Tetlock. And according to the Acknowledgements section at the back of Tetlock (2005), all EPJ forecasts will come due by 2026.
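Footnote 17 glosses its AUC/ROC chart as asking how often forecasters were on “the right side of maybe.” As a rough illustration of that gloss only (a simple hit-rate proxy, not the AUC/ROC statistic the GJP report actually computes), one could tally the share like this:

```python
def right_side_of_maybe(forecasts):
    """forecasts: list of (probability assigned to 'yes', 0/1 outcome) pairs.
    Returns the fraction of binary forecasts with more than 50% credence
    on the option that turned out true: ~0.5 is chance-level, 1.0 means
    always on the right side of maybe. Forecasts of exactly 0.5 are
    counted as predicting 'no' here -- a simplifying choice of this sketch."""
    hits = sum(1 for p, outcome in forecasts if (p > 0.5) == bool(outcome))
    return hits / len(forecasts)
```

Unlike this hit rate, AUC/ROC is threshold-free (it scores how well the probabilities rank positive cases above negative ones), which is why the report’s curves can sit strictly between 0.5 and 1 even for well-calibrated forecasters.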
22 Here is an abbreviated summary of EPJ’s forecasting questions, drawing and quoting from Tetlock (2005)’s Methodological Appendix:

• Each expert was asked to make short-term and long-term predictions about “each of four nations (two inside and two outside their domains of expertise) on seventeen outcome variables (on average), each of which was typically broken down into three possible futures and thus required three separate probability estimates.” (Experts didn’t respond to all questions, though.)
• Most forecasting questions asked about the possible futures of ~60 nations, clustered into nine regions: the Soviet bloc, the European Union, North America, Central and Latin America, the Arab world, sub-Saharan Africa, China, Northeast Asia, and Southeast Asia.
• Most forecasting questions fell into one of four content categories:
  • Continuity of domestic political leadership: “For established democracies, should we expect after either the next election (short-term) or the next two elections (longer-term) the party that currently has the most representatives in the legislative branch(es) of government will retain this status, will lose this status, or will strengthen its position (separate judgments for bicameral systems)? For democracies with presidential elections, should we expect that after the next election or next two elections, the current incumbent/party will lose control, will retain control with reduced popular support, or will retain control with greater popular support? …For states with shakier track records of competitive elections, should we expect that, in either the next five or ten years, the individuals and (separate judgment) political parties/movements currently in charge will lose control, will retain control but weather major challenges to their authority (e.g., coup attempts, major rebellions), or will retain control without major challenges?
Also, for less stable polities, should we expect the basic character of the political regime to change in the next five or ten years and, if so, will it change in the direction of increased or reduced economic freedom, increased or reduced political freedom, and increased or reduced corruption? Should we expect over the next five or ten years that interethnic and other sectarian violence will increase, decrease, or remain about the same? Finally, should we expect state boundaries — over the next ten or twenty-five years — to remain the same, expand, or contract and — if boundaries do change — will it be the result of peaceful or violent secession by a subnational entity asserting independence or the result of peaceful or violent annexation by another nation-state?”
  • Domestic policy and economic performance: “With respect to policy, should we expect — over the next two or five years — increases, decreases, or essentially no changes in marginal tax rates, central bank interest rates, central government expenditures as percentage of GDP, annual central government operating deficit as percentage of GDP, and the size of state-owned sectors of the economy as percentage of GDP? Should we expect — again over the next two or five years — shifts in government priorities such as percentage of GDP devoted to education or to health care? With respect to economic performance, should we expect — again over the next two or five years — growth rates in GDP to accelerate, decelerate, or remain about the same? What should our expectations be for inflation and unemployment over the next two or five years? Should we expect — over the next five or ten years — entry into or exit from membership in free-trade agreements or monetary unions?”
  • National security and defense policy: “Should we expect — over the next five or ten years — defense spending as a percentage of central government expenditure to rise, fall, or stay about the same?
Should we expect policy changes over the next five to ten years with respect to military conscription, with respect to using military force (or supporting insurgencies) against states, with respect to participation in international peacekeeping operations (contributing personnel), with respect to entering or leaving alliances or perpetuation of status quo, and with respect to nuclear weapons (acquiring such weapons, continuing to try to obtain such weapons, abandoning programs to obtain such weapons or the weapons themselves)?”
  • Special-purpose exercises: In these eight exercises, experts made forecasts about: (1) “the likelihood of twenty-five states acquiring capacity to produce weapons of mass destruction, nuclear or biological, in the next five, ten, or twenty-five years as well as the possibility of states — or subnational terrorist groups — using such weapons”; (2) “whether there would be a war [in the Persian Gulf] (and, if so, how long it would last, how many Allied casualties there would be, whether Saddam Hussein would remain in power, and, if not, whether all or part of Kuwait would remain under Iraqi control)”; (3) the likelihood — over the next three, six, or twelve years — of “both economic reform (rate of divesting state-owned enterprises; degree to which fiscal and monetary policy fit templates of “shock therapy”) and subsequent economic performance (unemployment, inflation, GDP growth)”; (4) the likelihood of “human-caused or -facilitated disasters in the next five, ten, or twenty-five years, including refugee flows, poverty, mass starvation, massacres, and epidemics (HIV prevalence) linked to inadequate public health measures”; (5) adoption of the Euro and “prospects of former Soviet bloc countries, plus Turkey, in meeting [European Union] entry requirements”; (6) who will win the American presidential elections of 1992 and 2000 and by how much; (7) “the overall performance of the NASDAQ (Is it a bubble? If so, when will it pop?)
as well as the revenues, earnings, and share prices of selected ‘New Economy’ firms, including Microsoft, CISCO, Oracle, IBM, HP, Dell, Compaq, Worldcom, Enron, AOL Time Warner, Amazon, and e-Bay”; (8) “CO2 emissions per capita (stemming from burning fossil fuels and manufacturing cement) of twenty-five states over the next twenty-five years, and on the prospects of states actually ratifying an international agreement (Kyoto Protocol) to regulate such emissions.”

23 Some of these observations overlap with the other limitations listed above.

24 On the other criteria, see the Methodological Appendix of Tetlock (2005).

Questions We Ask Ourselves Before Making a Grant

Although we have typically emphasized the importance for effective philanthropy of long-term commitment to causes and getting the right people in place, the most obvious day-to-day decision funders face is whether to support specific potential giving opportunities. As part of our internal guidance for program officers, we’ve collected a series of questions that we like to ask ourselves about potential funding opportunities. This post, which I adapted from the internal guidance for program officers, reviews the value we get from these questions and some of our approaches to answering them. This is a list of questions we’ve provided to staff to help them think about how to structure their internal grant writeups, and that grant reviewers tend to ask themselves as they review grants. We don’t have these questions in our standard template or ask that they be itemized for each grant (there are many individual cases where particular questions are inapplicable or unimportant).

Evaluating a grant’s place in the ecosystem

As we try to determine whether a grant is likely to have a positive impact, we ask ourselves key questions including: What does this grant do for the overall ecosystem of organizations working on the cause in question? What key need is it addressing and how?
What other ways might there be to fill/maintain/expand this need, and does this grant seem like the best way? If not, is it compatible with pursuing other approaches to the same need simultaneously? Examples of ecosystem “needs” might be: intellectually solid analysis of what policies should be (What should sentencing laws be? How can the government improve existing programs to strengthen pandemic preparedness?); “insider” advocacy building support for the right policies (e.g., making the case to legislators, businesses, and other influential leaders); “grassroots” advocacy to build support for the right policies, or building the power of aligned constituencies. That said, we recognize there are stark limits to our own knowledge of “what the field needs,” especially when we’re supporting organizations that primarily serve other organizations in the field (e.g. by providing training or support). In these cases, we try to think about how to create dynamics where a service-providing grantee will be accountable to the organizations that consume its services (e.g. the people who attend the trainings). One specific way we do this is by asking people in our fields whether they’d rather have grants for their own organizations or for the “public good” service provider we’re considering funding. We often ask, “who is the right conceptual ‘buyer’ for these services and do they actually think it would be worth paying this much for?” One thing we generally try to avoid in assessing an ecosystem is creating an elaborate “theory of change” that requires our grantees to work together in highly specific ways as more than the sum of their parts. We tend to think that unless we’re effectively the only funder in a field, our grants are each relatively marginal and can be fairly accurately assessed on their own terms. Sometimes we see opportunities to help grantees coordinate around a particular strategy, but that is not our default approach. 
And, like other funders, we’re open to convening grantees (and non-grantees) to try to help develop shared strategies or goals; we’re just typically more skeptical of the idea that we need to fund all parts of a strategy or ecosystem for the whole to be effective. Evaluating a grantee’s leadership When deciding whether to make a grant and how to structure it, we ask ourselves: Who is being empowered by this grant? Are they an effective leader who we’re happy to bet on? Are they underfunded? What are their strengths and weaknesses? Are we empowering them to do what they already want to do, or are we asking them to do something different than they might have chosen on their own? If the latter, why is that a good idea, and have we considered deferring to their judgment instead? Since our grantees tend to know their work much better than we do, we believe we’re usually better off finding people who share our key goals and letting them work out the details. Many of our favorite grants are in the model of “Find a person who is fantastic at doing something the field needs, and give them no-strings-attached support to help them do more, faster.” Some of the funders we think are most impressive focus much more on supporting the right people than on the details of those people’s plans. (For example, see our post on the Sandler Foundation.) Because we are often trying to support the work of particular people housed within larger organizations, we might place restrictions on funding to make it conditional on those people’s involvement in, and control of, the work. However, once the right people are empowered, we try to be skeptical of our own impulses to narrowly direct how the funding is used. 
If we find ourselves asking a person to spend more time and energy on a specific project we prefer, we ask ourselves, “Are we sure we’re right, or is it possible that they know what the field needs better than we do, and we should be funding them to do their first-choice work?” We are sometimes wary of potential grants where a well-funded organization offers to do a project that’s particularly appealing to us. In many cases, we believe they’re proposing a project they would do with or without our support, and our funding is effectively paying for other things they want to do. We often refer to this concept as “fungibility.” We try to avoid these cases by asking ourselves, “Why can’t this person do the project they’re pitching with money they already have?” If the answer is that they don’t have much money, it’s possible we should be giving unrestricted support. If the answer is that they don’t actually value the project as much as other things they could do, we ask whether there’s a good reason they value it less, and whether we can expect them to do good work on the project if we fund it. Evaluating comparative advantage As we assess whether a specific grantee is the right partner, we believe it’s helpful to ask ourselves whether it is the best organization to do this work, and if its work is the most valuable use of this organization’s additional resources. When we’re considering funding a research organization that wants to do public advocacy around its research, we consider looking for an advocacy organization that could potentially do the same work more effectively. When we’re considering funding an organization that will provide services to other organizations, we tend to ask if those “client” organizations are adequately funded, since they are often better positioned than we are to decide whether new services are worth paying for. The question of comparative advantage applies to us as well. 
We often ask ourselves: Are we the best-placed organization to evaluate and fund this? Who else could fund it, and why aren’t they? If we find ourselves considering a lot of similar grants, especially small ones, we tend to look for opportunities to make a bigger grant and have someone else take the time to strategically regrant. For instance, rather than five small grants to five individuals doing similar activities in five different states, we may be better off finding (or creating) an organization that is well-placed to evaluate those activities, and giving them one big grant. This thinking informed our decision to support The Humane League’s Open Wing Alliance, a new coalition of promising farm animal welfare groups that are trained in corporate campaigning so they can achieve cage-free wins in new countries. For us, determining whether other funders are better positioned to support a project isn’t just about saving money. It’s helpful for checking the basic case for the grant, identifying considerations we might have missed, and continuing to refine our theory of our own comparative advantage as a funder. If we can think of someone who seems like they logically ought to be willing to fund a potential grant, we try to ask them what they think. Sometimes it will turn out that they know more about this type of grant than we do, and they will raise new considerations. Sometimes it will turn out that the grant is not actually a fit for what they do, and we’ll learn more about them and improve our model of them as a funder. Accurate models of other funders help us assess our own comparative advantage, and potentially help the other funders by allowing us to refer potential grantees who seem like a good fit for their goals. Evaluating a grant’s size and duration When we review grantees’ budgets and determine how much funding to provide, we ask ourselves: What are we paying for, above and beyond what’s already funded by others? 
Could we get the same work done with a lot less money? Are we sure this grant shouldn’t be bigger? What is our “willingness to pay” for success here? While sometimes people will ask for much more funding than we think appropriate, we are also often concerned about the times when people ask for less than they should. After conversations with many funders and many nonprofits, some of whom are our grantees and some of whom are not, our best model is that many grantees are constantly trying to guess what they can get funded, won’t ask for as much money as they should ask for, and, in some cases, will not even consider what they would do with some large amount because they haven’t seriously considered the possibility that they might be able to raise it. We’ve had multiple experiences where a grantee asks for X, and we say “What would you do with 2X?” They usually say, “Never thought about it — let me get back to you,” and we often end up with a much better grant in the end. This was the case in our grant to support research at the University of Washington. Through ongoing conversations, the original grant proposal focusing on the development of a universal flu vaccine evolved into an expanded grant incorporating work on a computational protein design system that we believe could have much broader utility if it makes it possible to rapidly design new vaccines or antiviral drugs. We sometimes ask grantees what activities would occur at several different funding levels and consider these scenarios and tradeoffs as part of the decision-making process. When we are considering higher salaries or more ambitious proposals, we need to ask ourselves (and our grantees) how many years we should commit to. Long-term commitments often help grantees plan, so if it seems like we’re funding something that’s going to take several years to play out, we generally aim to commit to more than one year up front. 
In the case of the new Center for Security and Emerging Technology, we think it will take some time to develop expertise on key questions relevant to policymakers and want to give CSET the commitment necessary to recruit key people, so we provided a five-year grant. That said, for small organizations, a few years can be a very long time in terms of learning new things, changing organizational direction and finding new funders, so a two- to three-year commitment will often (in our view) be sufficient to achieve the goal of helping with planning. Evaluating a grant’s cost-effectiveness When deciding whether to make a grant or hold the money in reserve to distribute elsewhere, we think about cost-effectiveness. Do we get good bang-for-the-buck? Grant investigators sometimes include a “back-of-the-envelope-calculation” (“BOTEC”) to roughly estimate the expected cost-effectiveness of the potential grant in their writeup. We recently shared an update, including some example BOTECs, on our thinking about how to evaluate giving aimed at helping people alive today, including a mix of direct aid, policy work, and scientific research. While we previously used unconditional cash transfers to people living in extreme poverty as “the bar” for being willing to make a grant, GiveWell has continued to find more and larger opportunities over time, which has the implication that we may raise “the bar” to something closer to the current estimated cost-effectiveness of GiveWell’s unfunded top charities. “Longtermist” BOTECs focus more on how much a given grant might reduce the chances of a global catastrophic risk. Noting and considering reservations We generally try to ask (and our investigation process encourages asking): If we had a smart, thoughtful friend who thought this grant was going to end up having no impact, what would they tell us and how would we respond? 
Grant investigators include for decision-makers’ consideration the devil’s advocate argument against a grant’s potential impact, and explain why they don’t believe that argument to be decisive. We hope that grant investigators explicitly considering and vocalizing these objections helps reduce the chances that they follow a faulty line of reasoning too far, or that a crucial objection to a grant is never brought up. Evaluating predictions about how a grant will play out We hope to learn from our past grantmaking. Looking back on a grant, how might we determine whether it has gone well or badly? By asking grant investigators to consider this question before recommending a grant, and by encouraging them to make quantified and objectively evaluable predictions about how a grant will play out over time, we try to make our future evaluation of a grant’s performance easier. While we no longer publish these predictions in our public grant writeups, we continue to track them internally and use them in renewal decisions to update our expectations in light of facts on the ground. (More on this practice here.) GiveWell’s Top Charities Are (Increasingly) Hard to Beat Our thinking on prioritizing across different causes has evolved as we’ve made more grants. This post explores one aspect of that: the high bar set by the best global health and development interventions, and what we’re learning about the relative performance of some of our other grantmaking areas that seek to help people today. To summarize: • When we were getting started, we used unconditional cash transfers to people living in extreme poverty (a program run by GiveDirectly) as “the bar” for being willing to make a grant, on the grounds that such giving was quite cost-effective and likely extremely scalable and persistently available, so we should not generally make grants that we expected to achieve less benefit per dollar than that. 
Based on the roughly 100-to-1 ratio of average consumption between the average American and GiveDirectly cash transfer recipients, and a logarithmic model of the utility of money, we call this the “100x bar.” So if we are giving to, e.g., encourage policies that increase incomes for average Americans, we need to increase them by $100 for every $1 we spend to get as much benefit as just giving that $1 directly to GiveDirectly recipients. More.
• GiveWell (which we used to be part of and remain closely affiliated with) has continued to find more and larger opportunities over time, and become more optimistic about finding more cost-effective ones in the future. This has the implication that we should raise “the bar” to something closer to the current estimated cost-effectiveness of GiveWell’s unfunded top charities, which they believe to be in the range of 5-15x more cost-effective than 100x cash transfers, meaning a bar of benefits 500-1,500 times the cost of our grants (which we approximate to a “1,000x bar”). More.
• Since adopting cash transfers as the relevant benchmark for our giving aimed at helping people alive today, we’ve given ~$100M in U.S. policy, ~$100M in scientific research, and ~$300M based on GiveWell recommendations. According to our extremely rough internal calculations, we do expect many of our grants in scientific research and U.S. policy to exceed the “100x bar” represented by unconditional cash transfers, but relatively few to clear a “1,000x bar” roughly corresponding to high-end estimates for GiveWell’s unfunded top charities. This would imply that it’s quite difficult/rare for work in these categories to look more cost-effective than GiveWell’s top charities. However, these calculations are extraordinarily rough and uncertain. More. • In spite of these calculations, we think there are some good arguments to consider in favor of our current grantmaking in these areas. More. • We continue to think it is likely that there are causes aimed at helping people today (potentially including our current ones) that could be more cost-effective than GiveWell’s top charities, and we are hiring researchers to work on finding and evaluating them. More. • We are still thinking through the balance of these considerations. We are not planning any rapid changes in direction. More. Cash transfers to people in extreme poverty In 2015, when we were still part of GiveWell, we wrote: By default, we feel that any given grant of $X should look significantly better than making direct cash transfers (totaling $X) to people who are extremely low-income by global standards – abbreviated as “direct cash transfers.” We believe it will be possible to give away very large amounts, at any point in the next couple of decades, via direct cash transfers, so any grant that doesn’t meet this bar seems unlikely to be worth making…. It’s possible that this standard is too lax, since we might find plenty of giving opportunities in the future that are much stronger than direct cash transfers. 
However, at this early stage, it isn’t obvious how we will find several billion dollars’ worth of such opportunities, and so – as long as total giving remains within the … budget – we prefer to err on the side of recommending grants when we’ve completed an investigation and when they look substantially better than direct cash transfers. It is, of course, often extremely unclear how to compare the good accomplished by a given grant to the good accomplished by direct cash transfers. Sometimes we will be able to do a rough quantitative estimate to determine whether a given grant looks much better, much worse or within the margin of error. (In the case of our top charities, we think that donations to AMF, SCI and Deworm the World look substantially better.) Other times we may have little to go on for making the comparison other than intuition. Still, thinking about the comparison can be informative. For example, when considering grants that will primarily benefit people in the U.S. (such as supporting work on criminal justice reform), benchmarking to direct cash transfers can be a fairly high standard. Based on the idea that the value of additional money is roughly proportional to the logarithm of income,[1] See e.g. Subjective Well‐Being and Income: Is There Any Evidence of Satiation? (archive). For instance, Deaton (2008) and Stevenson and Wolfers (2008) find that the well-being–income relationship is roughly linear-log, such that, while each additional dollar of income yields a diminishing increment in well-being, each proportional increase yields a roughly constant one. And based on the fact that mean American income is around 100x annual consumption for GiveDirectly recipients, we assume that a given dollar is worth ~100x as much to a GiveDirectly recipient as to the average American. Thus, in considering grants that primarily benefit Americans, we look for a better than “100x return” in financial terms (e.g. increased income). 
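To make the log-utility arithmetic concrete, here is a minimal sketch — a simplified illustration, not Open Philanthropy’s actual model; the income figures are the rough ones cited in this post:

```python
# Under log utility u(c) = ln(c), the marginal value of an extra dollar at
# consumption level c is u'(c) = 1/c, so a dollar's value to a poorer
# recipient exceeds its value to a richer one by the ratio of their incomes.
us_income = 34_000  # rough U.S. per capita income (figure cited in this post)
gd_income = 290     # rough annual consumption of a GiveDirectly recipient

multiplier = us_income / gd_income
print(f"~{round(multiplier)}x")  # ~117x, i.e. roughly the "100x" benchmark
```

The same ratio drives the “1,000x” figures later in the post: multiplying this ~100x by GiveWell’s estimated ~5-15x advantage of its top charities over cash transfers.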
Of course, there are always huge amounts of uncertainty in these comparisons, and we try not to take them too literally. To walk through the logic of how this generates a “100x” bar a bit more clearly: • We want to be able to compare philanthropic opportunities that will save the U.S. or state governments money, or increase incomes for average Americans, against opportunities to directly help the global poor (or deliver other benefits) in a somewhat consistent fashion. For instance, we could imagine a hypothetical domestic advocacy opportunity that might be able to save the government $100 million, or increase productivity by $100 million, for a cost of $1 million; we would call that opportunity roughly “100x” because the benefit in terms of income to the average American is $100 for every $1 we spend.[2] We’re eliding a huge amount of complexity here in terms of modeling the domestic welfare impacts of various policy changes, which we recognize. In practice, our calculations are often very crude, though we try to be roughly consistent in considering distributional issues. If we just directly gave a random person in the U.S. $1,000, we’d expect to get “1x” because the benefit to them is equal to the cost to us (ignoring transaction costs). That is, we take our core unit of measurement for this exercise as “dollars to the average American.” Then we face the question: how should we compare transfers to the global poor (or other programs) to transfers to the average American? • GiveWell reports that the income of GiveDirectly recipients averages $0.79 per day[3] (see footnote 33 in GiveWell’s writeup on GiveDirectly) — so approximately $290 per person per year, compared to more than $34,000 per capita per year in the U.S.[4] 2017 average U.S. per capita income was $34,489, per the U.S. Census. 
(archive) This means $34,000 could double one individual’s income for a year in the U.S., or (after ~10% overhead is taken out) double the income of about 106 GiveDirectly recipients for a year.[5] $34,000 / ($288.35 / 0.9) = ~106. Using median U.S. income rather than mean would reduce this ~20% but seems less apt as a comparison since we’re partially modeling foregone spending and taxes are moderately progressive.
• In this context we assume a logarithmic utility function for income, which is a fairly common simplification and assumes that doubling a person’s income contributes the same amount to their well-being regardless of how much income they started with. We think this is a plausible starting point based on evidence from life satisfaction surveys. However, it is worth noting that there are credible arguments that a logarithmic utility function places either too much or too little weight on income at the high end.[7] Too much: there is some evidence of satiation (archive) in terms of self-reported wellbeing even in log terms as incomes get very high by global standards. Additionally, if you think very high incomes carry net negative externalities (e.g., through carbon emissions or excess political influence), the logarithmic model may still place too much weight on marginal income at the high end.

• A logarithmic utility function implies that $1 for someone with 100x less income/consumption is worth 100x as much. This implies direct cash transfers to the extreme global poor go about 100x as far as the same money spent in the U.S., on average, and means any potential grant should create an expected value at least 100x the cost of the grant if it is to be considered a better use of money than such direct cash transfers. • With other causes, in addition to looking at monetary savings or gains, we also use “value of a statistical life” techniques to try to account for health and quality-of-life benefits. That yields more cost-effectiveness estimates, all generally framed in the language of “This seems roughly as good as saving an average American $N for each $1 we spend” or simply “Nx.” Obviously, calculations like this remain deeply uncertain and vulnerable to large mistakes, so we try to not put too much weight on them in any one case. But the general reality that they reflect — of vast global inequalities, and the relative ease of moving money from people who have a lot of it to people who have little — seems quite robust. Although we stopped formally using this 100x benchmark across all of our giving a couple of years ago because of considerations relating to animals and future generations, we have continued to find it a useful benchmark against which “near-termist, human-centric” grants — those that aim to improve the lives of humans on a relatively short time horizon, including a mix of direct aid, policy work, and scientific research — can be measured. The best programs are even harder to beat In 2015, when we first wrote about adopting the cash transfer benchmark, it looked like GiveWell could plausibly “run out” of their more-cost-effective-than-cash giving opportunities. 
At the time, they had three non-cash-transfer top charities they estimated to be in the 5-10x cash range (i.e., 5 to 10 times more cost-effective than cash transfers),[8] This 5-10x cash range translated to roughly ~$2,000-4,000 per “life saved equivalent” in the 2015 cost-effectiveness calculation – XLSX. with ~$145 million of estimated short-term room for more funding. That, plus uncertainty about the amount of weight to put on these figures, led us to adopt the cash transfer benchmark. (In the remainder of this post, I occasionally shorten “cash transfer” to just “cash.”) But by the end of 2018, GiveWell had expanded to seven non-cash-transfer top charities estimated to be in the ~5-15x cash range, with $290 million of estimated short-term room for more funding, and with the top recommended unfilled gaps at ~8x cash transfers.[9] Based on the median results from GiveWell’s final 2018 cost-effectiveness calculation, 8x cash implies a “cost-per outcome as good as saving an under-5 life” of ~$1,500. This is not directly comparable to the figures from 2015 because GiveWell made some changes in the values and framework between the two calculations. If we combine cash transfers at “100x” and large unfilled opportunities at ~5-15x cash transfers, the relevant “bar to beat” going forward may be more like 500-1,500x.[10] Another way to get similarly high overall ROI figures is from comparing GiveWell’s top charity “cost per life saved equivalent” figures to rich world “value of a statistical life” figures: GiveWell estimates that bednets, seasonal malaria chemoprevention, and vitamin A supplementation avert deaths at a small fraction of those rich-world figures. And earlier this year GiveWell suggested that they expected to find more cost-effective opportunities in the future, and they are staffing up in order to do so. Another approach to this question is to ask, how much better than direct cash transfers should we expect the best underfunded interventions to be? 
I find scalable interventions worth ~5-15x cash a bit surprising, but not wildly so. It’s not obvious where to look for a prior on this point, and it seems to correlate strongly with general views about broad market efficiency: if you think broad “markets for doing good” are efficient, finding a scalable ~5-15x baseline intervention might be especially surprising; conversely, if you think markets for doing good are riddled with inefficiencies, you might expect to find many even more cost-effective opportunities. One place to potentially look for priors on this point might be compilations of the cost-effectiveness of various evidence-based interventions. I know of five compilations of the cost-effectiveness of different interventions within a given domain that contain easily available tabulations of the interventions reviewed.[11] Since this post was first written, we came across Five-Hundred Life-Saving Interventions and Their Cost-Effectiveness (archive). For this purpose, I was just curious about the general distribution of the estimates, and didn’t attempt to verify any of them, and was very rough in discarding estimates that were negative or didn’t have numerical answers, which may bias my conclusions. In general, we regard the calculations included in these compilations as challenging and error-prone, and we would caution against over-reliance on them.[12] When we looked closely at one of the calculations in the DCP2, we found serious errors. We haven’t looked closely at the other sources at all. Overall, we expect the project of trying to estimate the cost-effectiveness of many different interventions in uniform terms to be extremely difficult. I made a sheet summarizing the sources’ estimates here. All five distributions appear to be (very roughly) log-normal, with standard deviations of ~0.7-1, implying that a one-standard-deviation increase in cost-effectiveness would equate to a 5-10x improvement. 
However, any errors in these calculations would typically inflate that figure, and we think they are structurally highly error-prone, so these standard deviations likely substantially overstate the true ones.[13]Some discussion of this in the comments of GiveWell’s 2011 post on errors in the DCP2. We don’t know what the mean of the true distribution of cost-effectiveness of global development opportunities might be, but assuming it’s not more than a few times different from cash transfers (in either direction), and that measurement error doesn’t make up more than half of the variance in the cost-effectiveness compilations reviewed above (a non-trivial assumption), then these figures imply we shouldn’t be too surprised to see top opportunities ~5-15x cash. A normal distribution would imply that an opportunity two standard deviations above the mean is in the ~98th percentile. These figures would support more skepticism towards an opportunity from the same rough distribution (evidence-based global health interventions) that is claimed to be even more cost-effective (e.g., 100x or 1,000x cash rather than 10x). Stepping back from the modeling, given the vast difference in treatment costs per person for different interventions (~$5 for bednets, $0.33-~$1 for deworming, ~$250 for cash transfers), it does seem plausible to have large (~10x) differences in cost-effectiveness. Even if scalable global health interventions were much worse than we currently think, and, say, only ~3x as cost-effective as cash transfers, I expect GiveWell’s foray into more leveraged interventions to yield substantial opportunities that are at least several times more cost-effective, pushing back towards ~10x cash transfers as a more relevant future benchmark for unfunded opportunities. 
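The distributional claims above can be checked with standard normal math. A small sketch, assuming the quoted standard deviations are in log10 units (which is what makes one standard deviation correspond to a 5-10x multiple):

```python
from statistics import NormalDist

# If log10(cost-effectiveness) is roughly normal with sd ~0.7-1, then a
# one-sd improvement multiplies cost-effectiveness by 10**sd.
for sd in (0.7, 1.0):
    print(f"sd={sd}: one-sd improvement = {10 ** sd:.1f}x")

# And an opportunity two sds above the mean sits near the ~98th percentile.
print(f"P(Z < 2) = {NormalDist().cdf(2.0):.3f}")  # 0.977
```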
Overall, given that GiveWell’s numbers imply something more like “1,000x” than “100x” for their current unfunded opportunities, that those numbers seem plausible (though by no means ironclad), and that they may find yet-more-cost-effective opportunities in the future, it looks like the relevant “bar to beat” going forward may be more like 1,000x than 100x. Our other grantmaking aimed at helping people today While we think a lot of our “near-termist, human-centric” grantmaking clears the 100x bar, we see less evidence that it will clear a ~1,000x bar. Since we initially adopted the cash transfer benchmark in 2015, we’ve made roughly 300 grants totaling almost $200 million in our near-termist, human-centric focus areas of criminal justice reform, immigration policy, land use reform, macroeconomic stabilization policy, and scientific research. To get a sense of our estimated returns for these grants, we looked at the largest grants and found 33 grants totaling $73M for which the grant investigator conducted an ex ante “back-of-the-envelope-calculation” (“BOTEC”) to roughly estimate the expected cost-effectiveness of the potential grant for Open Philanthropy decision-makers’ consideration. All of these 33 grants were estimated by their investigator to have an expected cost-effectiveness of at least 100x. This makes sense given the existence of our “100x bar.” Of those 33, only eight grants, representing approximately $32 million, had BOTECs of 1,000x or greater. Our large grant to Target Malaria accounts for more than half of that.

Although we don’t typically make our internal BOTECs public, we compiled a set here (redacted somewhat to protect some grantees’ confidentiality) to give a flavor of what they look like. As you can see, they are exceedingly rough, and take at face value many controversial and uncertain claims (e.g., the cost of a prison-year, the benefit of a new housing unit in a supply-constrained area, the impact of monetary policy on wages, the likely impacts of various other policy changes, stated probabilities of our grantees’ work causing a policy change).
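For a flavor of the arithmetic involved, here is a deliberately toy BOTEC in the same spirit; every number below is invented for illustration and corresponds to no actual grant writeup:

```python
# Toy BOTEC for a hypothetical U.S. policy advocacy grant, framed in the
# "dollars to the average American per dollar spent" units used in this post.
grant_cost = 1_000_000        # hypothetical grant size ($)
p_policy_change = 0.05        # assumed probability the grant causes the change
annual_benefit = 200_000_000  # assumed income gains to beneficiaries per year ($)
years = 5                     # assumed duration of the benefit (no discounting)

expected_benefit = p_policy_change * annual_benefit * years
print(f"~{expected_benefit / grant_cost:.0f}x")  # ~50x: under the "100x bar"
```

Note how sensitive the multiple is to the assumed probability of success: doubling it to 0.10 would clear the 100x bar, which is one reason these estimates are exceedingly rough.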

We would guess that these uncertainties would generally lead our BOTECs to be over-optimistic (rather than merely adding unbiased noise) for a variety of reasons:

• Program officers do the calculations themselves, and generally only do the calculations for grants they’re already inclined to recommend. Even if there’s zero cynicism or intentional manipulation to get “above the bar,” grantmakers (including me) seem likely to be more charitable to their grants than others would be.
• Many of these estimates don’t adjust for relatively straightforward considerations that would systematically push towards lower estimated cost-effectiveness, like declining marginal returns to funding at the grantee level, time discounting, or potential non-replicability of the research our policy goals are based on. The comparison with the level of care in the GiveWell cost-effectiveness models on these features is pretty stark.
• Holden made some more general arguments along these lines in 2011.

We think it’s notable that, despite our BOTECs likely being systematically over-optimistic in this way, it’s still rare for us to find grant opportunities in U.S. policy and scientific research that appear to score better than GiveWell’s top charities.

Of course, compared to GiveWell, we make many more grants, to more diverse activities, and with an explicit policy of trying to rely more on program officer judgment than these BOTECs. So the idea that our models look less robust than GiveWell’s is not a surprise — we’ve always expected that to be the case — but combining that with GiveWell’s rising bar is a more substantive update.

Some counter-considerations in favor of our work

As we’re grappling with the considerations above, we don’t want to give short shrift to the arguments in favor of our work. We see two broad categories of arguments in this vein: (a) this work may be substantively better than the BOTECs imply; and (b) it’s a worthwhile experiment.

This work may be better than the BOTECs imply

There are a couple big reasons why Open Phil’s near-termist, human-centric work could turn out to be better than implied by the figures above:

• Values/moral weights. A logarithmic utility function and view that “all lives have equal value” push strongly towards work focused on the global poor. But many people endorse much flatter utility functions in money and the use of context-specific “value of a statistical life” figures, both of which would make work in the U.S. generally look much more attractive. And of course many people think we have stronger normative obligations to attend to our neighbors and fellow citizens, which would also make our non-GiveWell near-termist work look more valuable (though we have historically been skeptical of such normative views). (You could make similar arguments on instrumental rather than normative grounds too, e.g., by arguing that flow-through effects from work in the U.S. would be larger.) Arguably we should put some weight on ideas like these in our worldview diversification process.
• Hits. We are explicitly pursuing a hits-based approach to philanthropy with much of this work, and accordingly might expect just one or two “hits” from our portfolio to carry the whole portfolio. In particular, if one or two of our large science grants ended up 10x more cost-effective than GiveWell’s top charities, our portfolio to date would cumulatively come out ahead. In fact, the dollar-weighted average of the 33 BOTECs we collected above is (modestly) above the 1,000x bar, reflecting our ex ante assessment of that possibility. But the concerns about the informational value of those BOTECs remain, and most of our grants seem noticeably less likely to deliver such “hits.”
• Mistaken analysis. As we’ve noted, we consider our BOTECs to be extremely rough. We think it’s more likely than not that better-quality BOTECs would make the work discussed above look still weaker, relative to GiveWell top charities – but we are far from certain of this, and it could go either way, especially if our policy reform efforts could contribute meaningfully to “tipping points” that lead to accelerating policy changes in the future.
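To make the “hits” arithmetic above concrete, here is a minimal sketch of a dollar-weighted average of cost-effectiveness multiples. The grant amounts and multiples below are made up for illustration; they are not our actual BOTEC figures.

```python
def dollar_weighted_average(grants):
    """Weight each grant's estimated cost-effectiveness multiple by its
    dollar size. A single large 'hit' can pull the whole portfolio
    above a bar that most individual grants miss."""
    total = sum(amount for amount, _ in grants)
    return sum(amount * multiple for amount, multiple in grants) / total

# Hypothetical portfolio: nine $5M grants estimated at 300x, plus one
# $5M grant estimated at 10,000x.
portfolio = [(5_000_000, 300)] * 9 + [(5_000_000, 10_000)]
print(dollar_weighted_average(portfolio))  # 1270.0 — above a 1,000x bar
```

Note that in this toy portfolio, 90% of the grants individually fall well below the 1,000x bar, yet the dollar-weighted average clears it on the strength of one hit.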

It’s a worthwhile experiment

Efforts to Improve the Accuracy of Our Judgments and Forecasts

Our grantmaking decisions rely crucially on our uncertain, subjective judgments — about the quality of some body of evidence, about the capabilities of our grantees, about what will happen if we make a certain grant, about what will happen if we don’t make that grant, and so on.

In some cases, we need to make judgments about relatively tangible outcomes in the relatively near future, as when we have supported campaigning work for criminal justice reform. In others, our work relies on speculative forecasts about the much longer term, as for example with potential risks from advanced artificial intelligence. We often try to quantify our judgments in the form of probabilities — for example, the former link estimates a 20% chance of success for a particular campaign, while the latter estimates a 10% chance that a particular sort of technology will be developed in the next 20 years.

We think it’s important to improve the accuracy of our judgments and forecasts if we can. I’ve been working on a project to explore whether there is good research on the general question of how to make good and accurate forecasts, and/or specialists in this topic who might help us do so. Some preliminary thoughts follow.

In brief:

• There is a relatively thin literature on the science of forecasting.1 It seems to me that its findings so far are substantive and helpful, and that more research in this area could be promising.
• This literature recommends a small set of “best practices” for making accurate forecasts that we are thinking about how to incorporate into our process. It seems to me that these “best practices” are likely to be useful, and surprisingly uncommon given their usefulness.
• In one case, we are contracting to build a simple online application for credence calibration training: training the user to accurately determine how confident they should be in an opinion, and to express this confidence in a consistent and quantified way. I consider this a very useful skill across a wide variety of domains, and one that (it seems) can be learned with just a few hours of training. (Update: This calibration training app is now available.)

I first discuss the last of these points (credence calibration training), since I think it is a good introduction to the kinds of tangible things one can do to improve forecasting ability.

1. Calibration training

An important component of accuracy is called “calibration.” Being “well-calibrated” means that statements (including predictions) you make with 30% confidence are true about 30% of the time, statements you make with 70% confidence are true about 70% of the time, and so on.

Without training, most people are not well-calibrated, but instead overconfident. Statements they make with 90% confidence might be true only 70% of the time, and statements they make with 75% confidence might be true only 60% of the time.2 But it is possible to “practice” calibration by assigning probabilities to factual statements, then checking whether the statements are true, and tracking one’s performance over time. In a few hours, one can practice on hundreds of questions and discover patterns like “When I’m 80% confident, I’m right only 65% of the time; maybe I should adjust so that I report 65% for the level of internally-experienced confidence I previously associated with 80%.”
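The bookkeeping behind this kind of practice is simple: record each statement’s stated confidence and whether it turned out true, then compare stated confidence to the observed hit rate within each confidence level. A sketch in Python, using a made-up track record for illustration:

```python
from collections import defaultdict

def calibration_report(predictions):
    """Group (stated_confidence, was_correct) records by confidence level
    and report the observed hit rate and sample size for each level."""
    buckets = defaultdict(list)
    for confidence, correct in predictions:
        buckets[confidence].append(correct)
    return {
        confidence: (sum(outcomes) / len(outcomes), len(outcomes))
        for confidence, outcomes in sorted(buckets.items())
    }

# Hypothetical log: 20 statements made at 80% confidence (13 true),
# 10 statements made at 60% confidence (6 true).
log = [(0.8, True)] * 13 + [(0.8, False)] * 7 \
    + [(0.6, True)] * 6 + [(0.6, False)] * 4
print(calibration_report(log))
# 0.8 bucket: 13/20 = 0.65 observed vs. 0.80 stated -> overconfident
# 0.6 bucket: 6/10 = 0.60 observed vs. 0.60 stated -> well-calibrated
```

The 0.8 bucket in this example is exactly the “I’m 80% confident but right only 65% of the time” pattern described above.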

I recently attended a calibration training webinar run by Hubbard Decision Research, which was essentially an abbreviated version of the classic calibration training exercise described in Lichtenstein & Fischhoff (1980). It was also attended by two participants from other organizations, who did not seem to be familiar with the idea of calibration and, as expected, were grossly overconfident on the first set of questions.3 But, as the training continued, their scores on the question sets began to improve until, on the final question set, they both achieved perfect calibration.

For me, this was somewhat inspiring to watch. It isn’t often the case that a cognitive skill as useful and domain-general as probability calibration can be trained, with such objectively-measured dramatic improvements, in so short a time.

The research I’ve reviewed broadly supports this impression. For example:

• Rieber (2004) lists “training for calibration feedback” as his first recommendation for improving calibration, and summarizes a number of studies indicating both short- and long-term improvements on calibration.4 In particular, decades ago, Royal Dutch Shell began to provide calibration for their geologists, who are now (reportedly) quite well-calibrated when forecasting which sites will produce oil.5
• Since 2001, Hubbard Decision Research has trained over 1,000 people across a variety of industries. Analyzing the data from these participants, Doug Hubbard reports that 80% of people achieve perfect calibration (on trivia questions) after just a few hours of training. He also claims that, according to his data and at least one controlled (but not randomized) trial, this training predicts subsequent real-world forecasting success.6

I should note that calibration isn’t sufficient by itself for good forecasting. For example, you can be well-calibrated on a set of true/false statements, for which about half the statements happen to be true, simply by responding “True, with 50% confidence” to every statement. This performance would be well-calibrated but not very informative. Ideally, an expert would assign high confidence to statements that are likely to be true, and low confidence to statements that are unlikely to be true. An expert that can do so is not just well-calibrated, but also exhibits good “resolution” (sometimes called “discrimination”). If we combine calibration and resolution, we arrive at a measure of accuracy called a “proper scoring rule.”7 The calibration trainings described above sometimes involve proper scoring rules, and likely train people to be well-calibrated while exhibiting at least some resolution, though the main benefit they seem to have (based on the research and my observations) pertains to calibration specifically.
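The Brier score is one such proper scoring rule: the mean squared error between stated probabilities and binary outcomes. A minimal sketch of how it rewards resolution on top of calibration, using the “always say 50%” example above:

```python
def brier_score(forecasts):
    """Mean squared error between stated probabilities and binary
    outcomes (1 = true, 0 = false). Lower is better; answering 50%
    on everything earns exactly 0.25 on any question set."""
    return sum((p - outcome) ** 2 for p, outcome in forecasts) / len(forecasts)

# A forecaster who always says 50% is well-calibrated on a half-true
# question set, but earns only the uninformative baseline score...
hedger = [(0.5, 1), (0.5, 0), (0.5, 1), (0.5, 0)]
# ...while one who assigns high confidence to the true statements and
# low confidence to the false ones shows resolution too, and scores
# much better.
resolver = [(0.9, 1), (0.1, 0), (0.9, 1), (0.1, 0)]
print(brier_score(hedger))    # 0.25
print(brier_score(resolver))  # 0.01
```

Both forecasters here are perfectly calibrated; only the second is informative, and the proper scoring rule separates them.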

The primary source of my earlier training in calibration was a game intended to automate the process. The Open Philanthropy Project is now working with developers to create a more extensive calibration training game for training our staff; we will also make the game available publicly.

2. Further advice for improving judgment accuracy

Below I list some common advice for improving judgment and forecasting accuracy (in the absence of strong causal models or much statistical data) that has at least some support in the academic literature, and which I find intuitively likely to be helpful.8

1. Train probabilistic reasoning: In one especially compelling study (Chang et al. 2016), a single hour of training in probabilistic reasoning noticeably improved forecasting accuracy.9 Similar training has improved judgmental accuracy in some earlier studies,10 and is sometimes included in calibration training.11
2. Incentivize accuracy: In many domains, incentives for accuracy are overwhelmed by stronger incentives for other things, such as incentives for appearing confident, being entertaining, or signaling group loyalty. Some studies suggest that accuracy can be improved merely by providing sufficiently strong incentives for accuracy such as money or the approval of peers.12
3. Think of alternatives: Some studies suggest that judgmental accuracy can be improved by prompting subjects to consider alternate hypotheses.13
4. Decompose the problem: Another common recommendation is to break each problem into easier-to-estimate sub-problems.14
5. Combine multiple judgments: Often, a weighted (and sometimes “extremized”15) combination of multiple subjects’ judgments outperforms the judgments of any one person.16
6. Correlates of judgmental accuracy: According to some of the most compelling studies on forecasting accuracy I’ve seen,17 correlates of good forecasting ability include “thinking like a fox” (i.e. eschewing grand theories for attention to lots of messy details), strong domain knowledge, general cognitive ability, and high scores on “need for cognition,” “actively open-minded thinking,” and “cognitive reflection” scales.
7. Prediction markets: I’ve seen it argued, and find it intuitive, that an organization might improve forecasting accuracy by using prediction markets. I haven’t studied the performance of prediction markets yet.
8. Learn a lot about the phenomena you want to forecast: This one probably sounds obvious, but I think it’s important to flag, to avoid leaving the impression that forecasting ability is more cross-domain/generalizable than it is. Several studies suggest that accuracy can be boosted by having (or acquiring) domain expertise. A commonly-held hypothesis, which I find intuitively plausible, is that calibration training is especially helpful for improving calibration, and that domain expertise is helpful for improving resolution.18
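Item 5 above (combining and “extremizing” judgments) can be sketched concretely. One common transform averages the forecasters’ probabilities and then pushes the result away from 0.5; the exponent used here is a hypothetical tuning parameter for illustration, not a figure from the studies cited.

```python
def extremized_mean(probs, a=2.5):
    """Average several forecasters' probabilities, then 'extremize'
    the result away from 0.5. The rationale: several independent
    forecasters leaning the same way is jointly stronger evidence
    than any one forecaster's probability reflects. a=1 recovers
    the plain average; a > 1 extremizes."""
    p = sum(probs) / len(probs)
    return p ** a / (p ** a + (1 - p) ** a)

# Three forecasters lean the same way; the extremized combination
# is more confident than their simple average of 0.70.
combined = extremized_mean([0.65, 0.70, 0.75])
print(combined)  # roughly 0.89
```

A balanced input (e.g. forecasters evenly split around 50%) is left at 0.5, since there is no shared direction to amplify.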

Another interesting takeaway from the forecasting literature is the degree to which – and consistency with which – some experts exhibit better accuracy than others. For example, tournament-level bridge players tend to show reliably good accuracy, whereas TV pundits, political scientists, and professional futurists seem not to.19 A famous recent result in comparative real-world accuracy comes from a series of IARPA forecasting tournaments, in which ordinary people competed with each other and with professional intelligence analysts (who also had access to expensively-collected classified information) to forecast geopolitical events. As reported in Tetlock & Gardner’s Superforecasting, forecasts made by combining (in a certain way) the forecasts of the best-performing ordinary people were (repeatedly) more accurate than those of the trained intelligence analysts.

3. How commonly do people seek to improve the accuracy of their subjective judgments?

Certainly many organizations, from financial institutions (e.g. see Fabozzi 2012) to sports teams (e.g. see Moneyball), use sophisticated quantitative models to improve the accuracy of their estimates. But the question I’m asking here is: In the absence of strong models and/or good data, when decision-makers must rely almost entirely on human subjective judgment, how common is it for those decision-makers to explicitly invest substantial effort into improving the (objectively-measured) accuracy of those subjective judgments?

Overall, my impression is that the answer to this question is “Somewhat rarely, in most industries, even though the techniques listed above are well-known to experts in judgment and forecasting accuracy.”

Why do I think that? It’s difficult to get good evidence on this question, but I provide some data points in a footnote.20

4. Ideas we’re exploring to improve accuracy for GiveWell and Open Philanthropy Project staff

Below is a list of activities, aimed at improving the accuracy of our judgments and forecasts, that are either ongoing, under development, or under consideration at GiveWell and the Open Philanthropy Project:

• As noted above, we have contracted a team of software developers to create a calibration training web/phone application for staff and public use. (Update: This calibration training app is now available.)
• We encourage staff to participate in prediction markets and forecasting tournaments such as PredictIt and Good Judgment Open, and some staff do so.
• Both the Open Philanthropy Project and GiveWell recently began to make probabilistic forecasts about our grants. For the Open Philanthropy Project, see e.g. our forecasts about recent grants to Philip Tetlock and CIWF. For GiveWell, see e.g. forecasts about recent grants to Evidence Action and IPA. We also make and track some additional grant-related forecasts privately. The idea here is to be able to measure our accuracy later, as those predictions come true or are falsified, and perhaps to improve our accuracy from past experience. So far, we are simply encouraging predictions without putting much effort into ensuring their later measurability.
• We’re going to experiment with some forecasting sessions led by an experienced “forecast facilitator” – someone who helps elicit forecasts from people about the work they’re doing, in a way that tries to be as informative and helpful as possible. This might improve the forecasts mentioned in the previous bullet point.

I’m currently the main person responsible for improving forecasting at the Open Philanthropy Project, and I’d be very interested in further ideas for what we could do.

Three Key Issues I’ve Changed My Mind About

Philanthropy – especially hits-based philanthropy – is driven by a large number of judgment calls. At the Open Philanthropy Project, we’ve explicitly designed our process to put major weight on the views of individual leaders and program officers in decisions about the strategies we pursue, causes we prioritize, and grants we ultimately make. As such, we think it’s helpful for individual staff members to discuss major ways in which our personal thinking has changed, not only about particular causes and grants, but also about our background worldviews.

I recently wrote up a relatively detailed discussion of how my personal thinking has changed about three interrelated topics: (1) the importance of potential risks from advanced artificial intelligence, particularly the value alignment problem; (2) the potential of many of the ideas and people associated with the effective altruism community; (3) the properties to look for when assessing an idea or intervention, and in particular how much weight to put on metrics and “feedback loops” compared to other properties. My views on these subjects have changed fairly dramatically over the past several years, contributing to a significant shift in how we approach them as an organization.

I’ve posted my full writeup as a personal Google doc. A summary follows.

I first encountered the idea of potential risks from advanced artificial intelligence – and in particular, the value alignment problem – in 2007. There were aspects of this idea I found intriguing, and aspects I felt didn’t make sense. The most important question, in my mind, was “Why are there no (or few) people with relevant-seeming expertise who seem concerned about the value alignment problem?”

I initially guessed that relevant experts had strong reasons for being unconcerned, and were simply not bothering to engage with people who argued for the importance of the risks in question. I believed that the tool-agent distinction was a strong candidate for such a reason. But as I got to know the AI and machine learning communities better, saw how Superintelligence was received, heard reports from the Future of Life Institute’s safety conference in Puerto Rico, and updated on a variety of other fronts, I changed my view.

I now believe that there simply is no mainstream academic or other field (as of today) that can be considered to be “the locus of relevant expertise” regarding potential risks from advanced AI. These risks involve a combination of technical and social considerations that don’t pertain directly to any recognizable near-term problems in the world, and aren’t naturally relevant to any particular branch of computer science. This is a major update for me: I’ve been very surprised that an issue so potentially important has, to date, commanded so little attention – and that the attention it has received has been significantly (though not exclusively) due to people in the effective altruism community.

More detail on this topic

2. Changing my mind about the effective altruism (EA) community

I’ve had a longstanding interest in the effective altruism community. I identify as part of this community, and I share some core values with it (in particular, the goal of doing as much good as possible). However, for a long time, I placed very limited weight on the views of a particular subset1 of the people I encountered through this community. This was largely because they seemed to have a tendency toward reaching very unusual conclusions based on seemingly simple logic unaccompanied by deep investigation. I had the impression that they tended to be far more willing than I was to “accept extraordinary claims without extraordinary evidence” in some sense, a topic I’ve written about several times (here, here and here).

A number of things have changed.

• Potential risks from advanced AI, discussed above, is one topic I’ve changed my mind about: I previously saw this as a strange preoccupation of the EA community, and now see it as a major case where the community was early to highlight an important issue.
• More generally, I’ve seen the outputs from a good amount of cause selection work at the Open Philanthropy Project. I now believe that the preponderance of the causes that I’ve seen the most excitement about in the effective altruism community are outstanding by our criteria of importance, neglectedness and tractability. These causes include farm animal welfare and biosecurity and pandemic preparedness in addition to potential risks from advanced artificial intelligence. They aren’t the only outstanding causes we’ve identified, but overall, I’ve increased my estimate of how well excitement from the effective altruism community predicts what I will find promising after more investigation.
• I’ve seen EA-focused organizations make progress on galvanizing interest in effective altruism and growing the community. I’ve seen some effects of this directly, including more attention, donors, and strong employee candidates for GiveWell and the Open Philanthropy Project.
• I’ve gotten to know some community members better generally, and my views on some general topics (below) have changed in ways that have somewhat reduced my skepticism of the kinds of ideas effective altruists pursue.

I now feel the EA community contains the closest thing the Open Philanthropy Project has to a natural “peer group” – a set of people who consistently share our basic goal (doing as much good as possible), and therefore have the potential to help with that goal in a wide variety of ways, including both collaboration and critique. I also value other sorts of collaboration and critique, including from people who question the entire premise of doing as much good as possible, and can bring insights and abilities that we lack. But people who share our basic premises have a unique sort of usefulness as both collaborators and critics, and I’ve come to feel that the effective altruism community is the most logical place to find such people.

This isn’t to say I support the effective altruism community unreservedly; I have concerns and objections regarding many ideas associated with it and some of the specific people and organizations within it. But I’ve become more positive compared to my early impressions.

More detail on this topic

3. Changing my mind about general properties of promising ideas and interventions

Of the topics discussed here, this one is the hardest to trace the evolution of my thinking on, and the hardest to summarize.

I used to think one should be pessimistic about any intervention or idea that doesn’t involve helpful “feedback loops” (trying something, seeing how it goes, making small adjustments, and trying again many times) or useful selective processes (where many people try different ideas and interventions, and the ones that are successful in some tangible way become more prominent, powerful, and imitated). I was highly skeptical of attempts to make predictions and improve the world based primarily on logic and reflection, when unaccompanied by strong feedback loops and selective processes.

I still think these things (feedback loops, selective processes) are very powerful and desirable; that we should be more careful about interventions that don’t involve them; that there is a strong case for preferring charities (such as GiveWell’s top charities) that are relatively stronger in terms of these properties; and that much of the effective altruism community, including the people I’ve been most impressed by, continues to underweight these considerations. However, I have moderated significantly in my view. I now see a reasonable degree of hope for having strong positive impact while lacking these things, particularly when using logical, empirical, and scientific reasoning.

Learning about the history of philanthropy – and learning more about history more broadly – has been a major factor in changing my mind. I’ve come across many cases where a philanthropist, or someone else, seems to have had remarkable prescience and/or impact primarily through reasoning and reflection. Even accounting for survivorship bias, my impression is that these cases are frequent and major enough that it is worth trying to emulate this sort of impact. This change in viewpoint has both influenced and been influenced by the two topics discussed above.

More detail on this topic

4. Conclusion

Over the last several years, I have become more positive on the cause of potential risks from advanced AI, on the effective altruism community, and on the general prospects for changing the world through relatively speculative, long-term projects grounded largely in intellectual reasoning (sometimes including reasoning that leads to “wacky” ideas) rather than direct feedback mechanisms. These changes in my thinking have been driven by a number of factors, including by each other.

One cross-cutting theme is that I’ve become more interested in arguments with the general profile of “simple, logical argument with no clear flaws; has surprising and unusual implications; produces reflexive dissent and discomfort in many people.” I previously was very suspicious of arguments like this, and expected them not to hold up on investigation. However, I now think that arguments of this form are generally worth paying serious attention to until and unless flaws are uncovered, because they often represent positive innovations.

The changes discussed here have caused me to shift from being a skeptic of supporting work on potential risks from advanced AI and effective altruism organizations to being an advocate, which in turn has been a major factor in the Open Philanthropy Project’s taking on work in these areas. As discussed at the top of this post, I believe that sort of relationship between personal views and institutional priorities is appropriate given the work we’re doing.

I’m not certain that I’ve been correct to change my mind in the ways described here, and I still have a good deal of sympathy for people whose current views are closer to my former ones. But hopefully I have given a sense of where the changes have come from.

More detail is available here:

Some Key Ways in Which I’ve Changed My Mind Over the Last Several Years