Technical Updates to Our Global Health and Wellbeing Cause Prioritization Framework

In 2019, we wrote a blog post about how we think about the “bar” for our giving and how we compare different kinds of interventions to each other using back-of-the-envelope calculations, all within the realm of what we now call Global Health and Wellbeing (GHW). This post updates that one and:

  • Explains how we previously compared health and income gains in comparable units. In short, we use a logarithmic model of the utility of income, so a 1% change in income is worth the same to everyone, and a dollar of income is worth 100x more to someone who has 100x less. We measure philanthropic impact in units of the welfare gained by giving a dollar to someone with an annual income of $50,000, which was roughly US GDP per capita when we adopted this framework. Under the logarithmic model, this means we value increasing 100 people’s income by 1% (i.e. a total of 1 natural log unit increase in income) at $50,000. We have previously also valued averting a disability-adjusted life year (DALY; roughly, a year of healthy life lost) at $50,000, so we valued increasing income by one natural-log unit as equal to averting 1 DALY. This would imply that a charity that could avert a DALY for $50 would have a “1,000x” return because the benefits would be $50,000 relative to the costs of $50. (More)
  • Reviews our previous “bar” for what level of cost-effectiveness a grant needed to hit to be worth making. Overall, having a single “bar” across multiple very different programs and outcome measures is an attractive feature because equalizing marginal returns across different programs is a requirement for optimizing the overall allocation of resources1, and we are devoted to doing the most good possible with our giving. Prior to 2019, we used a “100x” bar based on the units above, the scalability of direct cash transfers to the global poor, and the roughly 100x ratio of high-income country income to GiveDirectly recipient income. As of 2019, we tentatively switched to thinking of “roughly 1,000x” as our bar for new programs, because that was roughly our estimate of the unfunded margin of the top charities recommended by GiveWell (which we used to be part of and remain closely affiliated with), and we thought we would be able to find enough other opportunities at that cost-effectiveness to hit our overall spending targets. (More)
  • Updates our ethical framework to increase the weight on life expectancy gains relative to income gains. We’re continuing to use the log income utility model, but, after reviewing several lines of evidence, we’re doubling the weight on health relative to income in low-income settings, so we will now value a DALY at 2 natural log units of income or $100,000. We’re also updating how we measure the DALY burden of a death; our new approach will accord with GiveWell’s moral weights, which value preventing deaths at very young ages differently than implied by a DALY framework. (More)
  • Articulates our tentative “bar” for giving going forward, of roughly 1,000x (which is ~20% lower than our old bar given new units – explanation in footnote2). The (offsetting) changes in the bar come from new units, our available assets growing, more sophisticated modeling of how we expect cost-effectiveness and asset returns to interact over time, growth in GiveWell’s other funding sources, and slightly increased skepticism about our ability to spend as much as needed at much higher levels of cost-effectiveness. Due to the increased assets and lower bar, we’re planning to substantially increase our funding for GiveWell’s recommended charities, which we will write more about next week. However, we still expect most of our medium-term growth in GHW to be in new causes that can take advantage of the leveraged returns to research and advocacy, and could imagine that we’ll eventually find enough room for more funding in those interventions that we will need to raise the bar again. (More)

 

This post focuses exclusively on how we value different outcomes for humans within Global Health and Wellbeing; when it comes to other outcomes like farm animal welfare or the far future, we practice worldview diversification instead of trying to have a single unified framework for cost-effectiveness analysis. We think it’s an open question whether we should have more internal “worldviews” that we diversify across within the broad Global Health and Wellbeing remit (vs everything being slotted into a unified framework as in this post).

This post is unusually technical relative to our others, and we expect it may make sense for most of our usual blog readers to skip it.

1. How we previously compared health and income

We often use “marginal value of a dollar of income to someone with baseline income of $50K” as our unified outcome variable, so by definition giving a dollar to someone with $50k of annual income has a cost-effectiveness of 1x.3 We value income using a logarithmic utility function, so $100 in extra income for a rich person generates as much utility as $1 in extra income for a person with 1/100th the income of that rich person. In order for a grant’s cost-effectiveness to be, say, 1000x, it must be 1000 times more cost-effective than giving a dollar to someone with $50k of annual income.

This can be confusing because GiveWell, which we used to be part of and remain closely affiliated with, uses contributions to GiveDirectly as their baseline unit of “1x.” In our framework, those contributions are very roughly worth 100x, because GiveDirectly recipients are roughly 100x poorer (after overhead) than our $50K income benchmark. This section of our 2019 blog post reviews how our logarithmic income model works and the “100x” calculation.
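
To make this concrete, here is a minimal sketch of the arithmetic described above (the ~$500 recipient income is just our $50K benchmark divided by 100, not GiveDirectly’s actual figure):

```python
import math

BENCHMARK_INCOME = 50_000  # our unit: the value of a marginal dollar to someone making $50K/year
OLD_DALY_VALUE = 50_000    # previous framework: 1 DALY = 1 natural-log unit of income = $50K

def value_of_marginal_dollar(income):
    """Under log utility, a marginal dollar is worth BENCHMARK_INCOME / income
    times as much as a marginal dollar to someone at the benchmark income."""
    return BENCHMARK_INCOME / income

def value_of_income_increase(pct_increase, n_people=1):
    """Value (in benchmark-dollar units) of raising incomes by pct_increase;
    under log utility this doesn't depend on the recipients' income level."""
    return n_people * math.log(1 + pct_increase) * BENCHMARK_INCOME

print(value_of_marginal_dollar(500))        # -> 100.0: a dollar to someone making ~$500/year is ~"100x"
print(value_of_income_increase(0.01, 100))  # -> ~49,750: 100 people x 1% ~= 1 log unit ~= $50K of value
print(OLD_DALY_VALUE / 50)                  # -> 1000.0: averting a DALY for $50 was a "1,000x" return
```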

We quantify health outcomes using disability-adjusted life years (DALYs). The DALY burden of a disease is the sum of the years of life lost (YLL) due to the disease, and the weighted years lived with disability (YLD) due to the disease. If you save the life of someone who goes on to live for 10 years, your intervention averted 10 YLLs. If you prevent an illness that would have caused someone to live in a condition 20% as bad as death for 10 years, your intervention averted 20%*10=2 YLDs. (We don’t necessarily endorse the disability weights used to measure YLDs, and in principle we might prefer other methodologies, such as quality-adjusted life years (QALYs) or just focusing on YLLs. We use DALYs because global-health data is much more widely available in DALYs than in other frameworks, especially from the Global Burden of Disease project.)
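
A tiny sketch of the DALY accounting in the examples above:

```python
def dalys_averted(ylls, yld_episodes=()):
    """DALYs averted = years of life lost averted plus disability-weighted years
    lived with disability averted. yld_episodes is a sequence of (disability_weight, years) pairs."""
    return ylls + sum(weight * years for weight, years in yld_episodes)

print(dalys_averted(ylls=10))                           # saving a life with 10 years remaining -> 10 DALYs
print(dalys_averted(ylls=0, yld_episodes=[(0.2, 10)]))  # preventing 10 years at 20%-as-bad-as-death -> 2.0 DALYs
```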

The health interventions that we support are primarily lifesaving interventions (with the exception of deworming where the modeled impacts run entirely through economic outcomes) – so although we talk in terms of DALYs, most of our health impact is in YLLs. (For instance, according to the GBD, ~80% of all DALYs in Sub-Saharan Africa are from YLLs, and amongst under-5 children the same figure is 97%. For malaria DALYs the share from YLLs is even higher, at 94% overall and 98% amongst under-5 children.) We also sometimes use quasi-disability-weights for valuing other outcomes (e.g., the harm of a year of prison).

There are different approaches to calculating the YLLs foregone with a death at a given age. For instance, the life-expectancy tables in Kenya suggest that an average child malaria death shortens the child’s life by 68 years (i.e. that is the average remaining life expectancy of a 0-5 year old in Kenya).4 This is not a floor on plausible YLLs – one could imagine that those prevented from dying by some intervention are unusually sick relative to the broader population, and accordingly not likely to live as long as a population life table would suggest – and as explained below GiveWell uses moral weights for child deaths that would be consistent with assuming 51 years of foregone life in the DALY framework (though that is not how they reach the conclusion). On the other extreme, the GBD takes a normative approach to life expectancy, saying in effect that everyone’s life expectancy at birth should be 88 years.5 Therefore any death under age 5 has a DALY burden of at least 84 years under the GBD approach.6 We have previously been inconsistent in this regard – following the methodology of the GBD in much of our own analysis, while deferring to GiveWell’s approach (and moral weights) on their recommendations. (In the analysis further below of our current bar, we assume for simplicity that our status quo approach splits the difference between the GiveWell and GBD approaches, and uses expected life years remaining for someone who lives to age 5 based on national-level life tables for Kenya.) Below, we explain our plan to follow GiveWell’s approach more closely going forward.

Historically we haven’t written much about how we came to value a DALY at $50,000 (or equivalently to one unit of log-income), which is less than many other actors would suggest. We don’t have a well-documented account of this historical choice, but the recollection of one of us (Alexander) is:

  • We had seen numbers like $50K cited in the (older) health economics literature. It was also close to the British government’s widely-cited cost-effectiveness threshold of £20k-£30k (roughly $50k as of 2007; less now) per QALY (though that is more based on an opportunity cost frame than a valuation frame).
  • A $50K DALY valuation roughly reconciled our relative valuation on the life-saving GiveWell top charities against GiveDirectly (i.e., we had GiveDirectly at 100x in our units and a cost per DALY averted of ~$50 for GiveWell top charities, which, at $50K/DALY, implied a 1,000x ROI, or a ~10x difference with GiveDirectly) with GiveWell’s (which thought their life-saving charities were ~10x as cost-effective as GiveDirectly at the time).7
  • Given that we were already prioritizing health heavily relative to income in terms of the total set of interventions we were supporting, opting for a lower valuation than some other parts of the literature seemed conservative.
  • Using a number that was significantly larger than GDP per capita (which was ~$50K in the US at the time we adopted this valuation) implied there was more value at stake than total resources in the economy, and that seemed wrong to me at the time. We now think that I was mistaken and there’s no in-principle reason there can’t be substantially more value at stake than the sum of the world economy.

2. Our previous “bar”

It is useful for us to have a single “bar” across multiple very different programs, years, and outcome measures, because equalizing marginal returns across different programs and across time is a requirement for optimizing the overall allocation of resources, and our mission is to give as effectively as we can. The basic idea of this “bar” is that it tells us what level of cost-effectiveness of grant is “good enough” to justify spending on some specific grant or program vs. saving for future years or allocating more funding to another program instead. If instead we set different bars across programs or years (in present value terms, i.e., after adjusting for investment returns), then that would mean we could have more impact by changing the allocation of resources (to put more into the program with the higher bar and less into the program with the lower bar), and we’d, in principle, want to do that. (However, in practice, we often see a case for practicing worldview diversification.)

Prior to 2019, we often used a “100x” bar, based on the units above and the very roughly 100x ratio of $50K to GiveDirectly recipient income (net of transaction costs). We thought “that such giving was quite cost-effective and likely extremely scalable and persistently available, so we should not generally make grants that we expected to achieve less benefit per dollar than that.”

As of 2019, we switched to tentatively thinking of “roughly 1,000x” as our bar for new programs, because that was roughly our estimate of the unfunded margin of the GiveWell top charities, and we thought we would be able to find enough other opportunities at that cost-effectiveness to hit our overall spending targets. We wrote: “Overall, given that GiveWell’s numbers imply something more like “1,000x” than “100x” for their current unfunded opportunities, that those numbers seem plausible (though by no means ironclad), and that they may find yet-more-cost-effective opportunities in the future, it looks like the relevant “bar to beat” going forward may be more like 1,000x than 100x.” However, that bar was rough, and we never made it very precise in large part because we don’t expect to be able to make back-of-the-envelope calculations that could reliably distinguish between, say, 800x and 1,000x.

3. New moral weights

We now think our previous approach to valuing health placed too little value on lifesaving interventions relative to income interventions. Our new approach values a DALY averted twice as highly, equal to a 2-unit (rather than 1-unit) increase in the natural log of any individual’s income. (This is equivalent to increasing 200 people’s incomes by 1% – i.e., in our favored units, equal to $100K in units of marginal dollars to individuals making $50K.) This updated value is more consistent with empirical estimates of beneficiary preferences, the subjective wellbeing literature, and the practice of other actors in the field.

These are all judgment calls subject to uncertainty, and we could readily imagine revising our weights again in the future based on further argument.

3.1 Beneficiary preferences

There is a large body of research on how people tend to trade off mortality risks against income gains, in particular the Value of a Statistical Life (VSL) literature. Some analyses in this literature use the stated preferences of individuals – i.e. responses to survey questions like, “would you rather reduce your risk of dying this year by 1%, or increase your income this year by $500?” Others use the revealed preferences of individuals: are the study subjects, on average, willing to take a job with a 1% higher rate of annual mortality, in exchange for $500 higher annual income?8

The research findings are clearest in high-income countries, where they tend to find that respondents value a year of life expectancy 2.5 to 4 times more than annual income.9 (Since these valuations are mostly based on marginal tradeoffs, and since we model utility as a logarithmic function of income, we can interpret these findings to say that “respondents value a year of life expectancy as much as the utility gained from increasing income by 2.5 to 4 natural-log units.”)

In low-income countries, the evidence is sparser and the findings vary widely.10 For example, in this chart we plot all the estimates found by the literature search in Robinson et al. 2019’s meta-analysis – they searched for all VSL analyses, whether stated or revealed, in any country that had been classified as low- or middle-income in the last 20 years.11 (We divide the VSLs by adult12 life expectancy to get the VSLY – value of a statistical life-year. We express this in terms of local income, which tells us how many units of log-income are worth as much to the respondents as an extra year of life expectancy. We plot stated-preference studies as circles, and revealed-preference studies as diamonds.)

 

[Figure: VSLY estimates from the Robinson et al. 2019 literature search, expressed as multiples of respondents’ annual income – stated-preference studies shown as circles, revealed-preference studies as diamonds]

 

As you can see, the results vary extremely widely. One paper (Mahmud 2009) finds that subjects in rural Bangladesh traded off mortality risk against income in a way that suggests they valued a year of life expectancy at, very roughly, just 62% of annual income. Another paper (Qin et al. 2013) finds that subjects in China (who reported only $840 in annual per-capita income) valued a year of life expectancy at roughly 7x annual income.
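
The conversion behind the chart is simple; here is a sketch with made-up inputs (not the actual values from either study):

```python
def vsly_in_log_income_units(vsl_dollars, annual_income, remaining_adult_life_expectancy):
    """Divide a VSL by remaining life expectancy to get a VSLY, then express it as a
    multiple of the respondents' annual income; under log utility, that multiple is
    also the value of a life-year in natural-log units of income."""
    vsly = vsl_dollars / remaining_adult_life_expectancy
    return vsly / annual_income

# Hypothetical numbers: a VSL of $80,000 among respondents earning $1,000/year,
# with ~40 years of remaining adult life expectancy, implies a life-year is valued
# at ~2 years of income, i.e. ~2 natural-log units.
print(vsly_in_log_income_units(80_000, 1_000, 40))   # -> 2.0
```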

In the face of this huge variation, some sources13 recommend estimating LMIC preferences via an “elasticity” approach: statistically estimate a function mapping VSL to income, anchoring it off the better-validated VSL figures in high-income countries. This elasticity approach is reviewed in Appendix A. Mainstream versions of it predict that individuals at the global poverty line would trade off 1 year of life expectancy for anywhere between 0.5x-4x of a year’s income14 (which we can interpret as 0.5 to 4 units of log-income).

Our beneficiaries (especially for lifesaving interventions) are mostly in low and lower middle income countries, and for now we’re focusing on this context in setting our moral weights. (As we’ll discuss in Appendix A, there are empirical and theoretical reasons to think that the exchange rate at which people trade off mortality risks against income gains differs systematically across income levels, with richer people valuing mortality more relative to income.)

Rather than estimate beneficiary preferences based on the academic literature, why not ask beneficiaries directly? In 2019, GiveWell commissioned a survey among individuals who are demographically similar to their beneficiaries: families in rural Kenya and Ghana, with annual consumption of roughly $300 per capita.15 IDInsight conducted this survey, and their analysis suggested that these respondents would value an extra year of life expectancy as much as 2.1-3.8 years of income, which we can interpret as 2.1-3.8 units of log-income. (See GiveWell’s post for a discussion of the limitations of this study).16

3.2 Subjective wellbeing

Research on subjective wellbeing suggests that we can improve life satisfaction by increasing incomes, but it also seems to indicate that such gains are small compared to the baseline wellbeing experienced by individuals even at low levels of income. We interpret this as suggestive evidence that a year of life expectancy is worth substantially more than 1 unit of log-income.

For example, this chart from Stevenson and Wolfers 201317 attempts to harmonize different surveys into one measure of “life satisfaction”. It suggests that you’d have to increase income by a factor of at least 64 in order to double reported life satisfaction on the harmonized scale (starting from the baseline level of life satisfaction experienced by the average individual at the global poverty line).18 Note, of course, that the units of this scale are somewhat arbitrary, so you have to add some strong assumptions to think that a doubling on this scale indicates a doubling in actual subjective wellbeing.19

 

[Figure: life satisfaction vs. income across countries, from Stevenson and Wolfers 2013]

 

We could use these “life satisfaction” units to estimate an exchange rate between increases in life expectancy and increases in income. The chart suggests that, for someone near the global poverty line, you’d have to increase income by roughly 64x in order to get twice as much life satisfaction at any moment. You could conclude that a 64x increase in income (for one year) is worth as much life satisfaction as an extra year of life. In that case, a logarithmic utility function suggests valuing a DALY at roughly 4 units of log-income, since ln(64) ≈ 4.2.20

We don’t want to over-interpret this – especially since the measure of “life satisfaction” uses a somewhat arbitrary scale and you could get different results depending on your assumptions about how reported satisfaction translates to actual subjective wellbeing.21 But it does seem like suggestive evidence that the old weights placed too much weight on income, and too little value on saving lives.

3.3 Other actors’ values

Doubling the weight we place on health also brings us into better alignment with other actors in the field. For example, GiveWell’s implied valuation of a YLL from preventing an adult death from malaria is 1.6 units of marginal log-income.22 The Lancet Commission on Investing in Health recommended valuing DALYs in LMICs at 2.3 times per-capita GDP.23 Until recently, the World Health Organization recommended that health interventions be considered “cost-effective” if the cost of averting a DALY was less than 3x per-capita GDP.24 Other actors in high-income countries use widely-ranging methodologies; some would give numbers as high as 4x per-capita income25 or as low as 0.7x per-capita income.26

In particular, we don’t want our moral weights to stray too far from GiveWell’s. This is substantially for pragmatic reasons: most of our lifesaving interventions run through them, and a statement of our valuations that was out of line with our spending on the GiveWell recommendations doesn’t seem like it would be accurate. But it is also partially due to some epistemic deference: GiveWell’s moral weights are the result of thoughtful deliberative processes, and we share some meta-ethical and epistemic approaches with them. While we are generally aware of each other’s reasoning, if we invested more time in understanding each other’s thinking, we expect we would likely come closer to agreeing on moral weights. Thus, some preemptive deference to their moral weights makes sense to us.

3.4 Aggregating these considerations

The literature on subjective wellbeing, and our attempts to estimate beneficiary preferences, both suggest to us that we should rate lifesaving impacts significantly more highly than under our old weights. On the other hand, pushing in favor of a less aggressive increase:

  • We don’t want to be too far off from GiveWell’s current valuation of 1.6 units of log-income.
  • The theoretical arguments in Appendix A suggest that high DALY valuations in low-income settings would be inconsistent with developed-world VSL evidence.
  • We don’t want to change our moral weights too much in any one year, to avoid “whipsawing” if we later determine that this change was mistaken.
  • Given that we already prioritize health interventions more highly than other philanthropists, erring towards a lower valuation seems “conservative.”

 

There is huge uncertainty/disagreement across and between lines of evidence – including between us and across the broader GHW cause prioritization research team – and any given choice of ultimate valuations seems fairly arbitrary, so we also prefer a visibly rough/round number that reflects the arbitrariness/uncertainty.

Taking all of these considerations together, we’re doubling our DALY valuation to $100,000, i.e. 2 units of log-income. We expect to continue to revisit this in future years and could readily envision major further updates.

3.5 Measurement of DALYs

We’re also switching our approach to be more consistent with GiveWell’s framework in how we translate deaths into DALYs. GiveWell assigns moral weights to deaths at various ages, rather than to DALYs. But we can use their moral weights to derive a mapping of deaths to DALYs, by dividing GiveWell’s moral weight for each death by GiveWell’s moral weight for a year lived with disability (which is defined by WHO so as to be equivalent to a DALY).27

The resulting model cares about child mortality more than adult mortality, but not by as much as remaining-population-life-expectancy would suggest. For example, GiveWell places 60% more weight on a child malaria death than on an adult death, and we can fairly straightforwardly interpret their process as counting an average of 32 DALYs per adult malaria death,28 so the GiveWell-based DALY model would implicitly count 32*160% = 51 DALYs for an under-5 malaria death. In contrast, a direct remaining-population-life-expectancy approach in Kenya would count 68 DALYs for an under-5 malaria death29, and the Global Burden of Disease approach (explained above) would count more than 84 DALYs.

GiveWell has written about the process for reaching their current moral weights here.

We see a normative case for the Global Burden of Disease’s uniform global approach to DALY attribution, but given our commitment to maximizing (expected) counterfactual impact, we think the national life table approach represents a plausible upper bound on attributable DALYs, and even that seems aggressive as an estimate of counterfactual lifespan for children whose lives are saved on the margin (who are presumably less advantaged and less healthy than the national average). Overall, we’re not sure where precisely this consideration should leave us, but it seems to argue for lower numbers.

We also haven’t reached any settled thoughts on the impact of population ethics or the second-order consequences of saving a life (e.g., on economic or population growth) on how to translate between deaths and DALYs.

For now, in order to be more consistent in our practices, we’re going to defer to GiveWell and start to use the number of DALYs that would be implied by extrapolating their moral weights. (In practice, we already defer to GiveWell for their own recommendations, so this would mainly change how we use GBD figures in our BOTECs for other grants, especially in science. In the section immediately below comparing our new values to old values, we assume for simplicity that we were using the national life table approach, which splits the difference between the GiveWell approach and the GBD in describing our status quo practices.) This means fewer DALYs averted per child death averted, which offsets some of the apparent gains from doubling our value on health. We expect to revisit this and try to form a more confident independent view about the balance of all these considerations in the future.

4. Expanding our spending, and modestly lowering our cost-effectiveness bar

We think there are four major buckets of updates that affect our “bar” going forward:

  • The change to our weight on health, described just above.
  • Other secular changes to GiveWell’s expected cost-effectiveness.
  • Cross-cutting changes to our estimate of future available assets.
  • Updates to our estimate of the likely “last dollar” cost-effectiveness of our non-GiveWell spending.

 

We walk through more detail below, but overall these factors leave us with a bar of very roughly 1,000x going forward for now.

4.1 Increased value on health

Overall, our new moral weights put more emphasis on health than we did before, which in some sense increases the amount of value at stake according to our moral framework, and should raise our cost-effectiveness bar, at least expressed in terms of marginal dollars to someone making $50K.

 

[Table: how we would rate the cost-effectiveness of example interventions under GiveWell’s moral weights, our old moral weights, and our new moral weights – see the linked Google Sheet for the underlying calculations]

 

This table shows how we would rate the impact of various interventions, according to our new moral weights. (You can see the calculations in Google Sheets here.) Here’s how to read the first column:

  • GiveWell estimates that HKI’s vitamin A supplementation program in Kenya averts 14 child deaths per $100k granted, and also increases incomes by 500 units of log-income.30
  • Per GiveWell’s moral weights, this is 7 times more cost-effective than cash transfers to the global poor.
  • Open Philanthropy’s old moral weights would have rated this intervention as 747 times more cost-effective than giving a dollar to someone with $50K of income.
  • Under our new moral weights, we would rate this intervention as 977 times more cost-effective than giving a dollar to someone making $50K.
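
As a rough sanity check, here is a back-of-the-envelope sketch of that first column under our old and new weights. It is not the linked spreadsheet: the 68 and 51 DALYs-per-child-death figures are the life-table and GiveWell-derived values discussed in sections 1 and 3.5, and the small differences from the 747x and 977x figures above reflect rounding and the more detailed inputs in the actual calculation.

```python
BENCHMARK = 50_000            # $ of benefit per "1x" dollar (a marginal dollar at $50K income)
SPEND = 100_000               # grant size in this example
CHILD_DEATHS_AVERTED = 14
INCOME_LOG_UNITS = 500        # income benefits, in natural-log units of income

def cost_effectiveness(daly_value, dalys_per_child_death):
    health_benefit = CHILD_DEATHS_AVERTED * dalys_per_child_death * daly_value
    income_benefit = INCOME_LOG_UNITS * BENCHMARK
    return (health_benefit + income_benefit) / SPEND

print(cost_effectiveness(daly_value=50_000, dalys_per_child_death=68))   # old weights -> ~730x
print(cost_effectiveness(daly_value=100_000, dalys_per_child_death=51))  # new weights -> ~960x
```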

 

Note that the update in cost-effectiveness depends on the mix of beneficial outcomes an intervention generates. A charity that just increases income (as in the second column of this table) will have the same cost-effectiveness under our new or old moral framework. An intervention that simply averted DALYs (say, averting 1 DALY for every $100 spent) would be twice as cost-effective under our new moral weights. Because of the change in how we measure DALYs, an intervention that averts child deaths at a fixed rate (say, averting 1 child death for every $5000 spent) would only be roughly 1.5x more cost-effective under our new moral weights than under the old framework.31 The program in column 1 gets some of its moral impact from income interventions and some from averting child deaths, so its cost-effectiveness changes by a weighted average of 1x and 1.5x.

The third and fourth columns show the mix of outcomes that could be achieved by a typical dollar to GiveWell top charities – that is, roughly 50% of its impact from income effects and 50% from mortality effects.32 Our new weights rate the cost-effectiveness of this mix of outcomes ~25% higher than our old weights did.33 (To foreshadow a bit, this means that if we lower the bar by ~20%, simultaneously with changing our moral weights, then our nominal bar will not change.)
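
The arithmetic behind those multipliers, as a short sketch (again using the 68 and 51 DALYs-per-child-death figures from sections 1 and 3.5):

```python
OLD_DALY_VALUE, NEW_DALY_VALUE = 50_000, 100_000
OLD_DALYS_PER_CHILD_DEATH, NEW_DALYS_PER_CHILD_DEATH = 68, 51  # life-table vs. GiveWell-derived

income_multiplier = 1.0                                   # income-only interventions are unchanged
daly_multiplier = NEW_DALY_VALUE / OLD_DALY_VALUE         # -> 2.0 for pure DALY interventions
child_death_multiplier = (NEW_DALYS_PER_CHILD_DEATH * NEW_DALY_VALUE) / (
    OLD_DALYS_PER_CHILD_DEATH * OLD_DALY_VALUE)           # -> 1.5 for child-mortality interventions

# A GiveWell-like mix (~50% income, ~50% child mortality) becomes ~25% more valuable,
# so the old ~1,000x bar is ~1,250-1,300x when expressed in the new units.
mix_multiplier = 0.5 * income_multiplier + 0.5 * child_death_multiplier
print(mix_multiplier)          # -> 1.25
print(1000 * mix_multiplier)   # -> 1250.0
```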

For the past few years GiveWell’s rough margin has been interventions that they rate as 10x more cost-effective than cash transfers to the global poor – the third column shows how a dollar could be spent at this margin. This was our prior bar, and you can see that in our old units it was ~1000x. If the only updates were to our valuation on DALYs (and our framework for translating deaths into DALYs), our bar would go to roughly 1300x (our rough old bar expressed in new units).

However, this change to our valuations is not the only change here; below we address others which, taken together, lower our expected cost-effectiveness of our last dollar (assuming the GiveWell mix of outcomes) by roughly 20%. This means our nominal bar is staying roughly constant.

4.2 Changes to GiveWell’s expected cost-effectiveness

Our expectation of future funding for GiveWell top charities, including both our support and that from others, has grown much faster than we would have expected in 2019. We didn’t have a precise model of expected future funding to GiveWell at that point, but very roughly we think it’s reasonable to model expected future funding for GiveWell’s recommendations as having doubled relative to our 2019 expectations. We currently model the GiveWell opportunity set as isoelastic with eta=.375,34 which implies that a doubling of expected funding should reduce marginal cost-effectiveness (and the bar) by 23% (1 – 2^-.375 ≈ 23%).
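
As a quick check of that figure (just the isoelastic assumption, not the full model):

```python
ETA = 0.375   # curvature of the GiveWell opportunity set under the isoelastic assumption

# Marginal cost-effectiveness scales as funding ** -ETA, so doubling expected funding
# multiplies the marginal cost-effectiveness (and hence the bar) by 2 ** -ETA.
print(1 - 2 ** -ETA)   # -> ~0.23, i.e. roughly a 23% reduction
```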

On the other hand, GiveWell has found slightly more cost-effective opportunities over the last year than we would have expected them to. This year, they’re expecting roughly $400M of spending capacity at least ~8x as cost-effective as GiveDirectly according to their modeling. This is roughly as much as we would have expected if they had already explored the space of “8x” opportunities as thoroughly as they’ve explored the opportunities that are ~10x as cost-effective as GiveDirectly.35 Given that they have only recently focused on exploring this space of slightly-less-cost-effective interventions, this is a very promising amount of spending capacity, and suggests the potential for even more capacity in the near future. This should marginally raise our bar.

We’ve also done more sophisticated modeling work on how we expect the cost-effectiveness of direct global health aid and asset returns to interact over time, and how we should optimally spread spending across time to maximize expected impact while hitting Cari and Dustin’s (our main funders) stated goal of spending down within their lifetimes. We’re still hoping to share that analysis and code, but the top level conclusion is that for opportunities like the GiveWell top charities and with a ~50 year spenddown target, it’s optimal to spend something like 9% of assets per year. That would imply a significantly faster pace of spending for the assets we expect to recommend to the GiveWell top charities than we’ve reached in the past, which would in turn imply a lowering of the bar. Two other interesting implications of the model for GiveWell spending are that: (a) we should be trying to get to our optimal spending and then spending down with a decreasing dollar amount each year (which may be more a flaw/simplification of the model than an accurate conclusion); and (b) we should expect our bar to fall by roughly 4% per year in nominal terms (largely reflecting asset returns – this helps equalize our “real”36 bar over time).
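
We haven’t shared that model yet, but to illustrate the general shape of the exercise, here is a toy version. The return, decay, and horizon parameters below are illustrative assumptions of ours, not the inputs to the actual model, and the toy optimum will not match the ~9% figure except by coincidence.

```python
import numpy as np

ETA = 0.375           # curvature of returns to annual spending, as above
ASSET_RETURN = 0.03   # assumed real return on saved assets
DECAY = 0.04          # assumed annual decline in opportunity quality as the world improves
YEARS = 50            # spend-down horizon

def total_impact(spend_fraction, assets=1.0):
    """Impact from spending a constant fraction of remaining assets each year,
    with isoelastic returns to annual spending and decaying opportunities."""
    impact = 0.0
    for t in range(YEARS):
        spend = assets if t == YEARS - 1 else spend_fraction * assets  # exhaust assets in the final year
        impact += (1 - DECAY) ** t * spend ** (1 - ETA) / (1 - ETA)
        assets = (assets - spend) * (1 + ASSET_RETURN)
    return impact

fractions = np.linspace(0.01, 0.30, 59)
best = fractions[np.argmax([total_impact(f) for f in fractions])]
print(f"impact-maximizing constant spending fraction in this toy model: {best:.1%}")
```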

Overall, we currently expect GiveWell’s marginal cost-effectiveness to end up around 7-8x GiveDirectly (in their units), which, assuming their current distribution across health and income benefits, translates to ~900-1,100x in our new units,37 though our understanding is that GiveWell does not necessarily endorse this extrapolation. Assuming that we continue to support the GiveWell recommendations and have a correctly-implemented uniform bar, that implies a similar bar across all our other work, though it could turn out to be too low if we’re able to find many more cost-effective opportunities in other work.

One complication for extrapolating from the GiveWell bar to our other work is that GiveWell is much more thorough in their cost-effectiveness calculations than we typically are in our back-of-the-envelope calculations, which might mean that the results aren’t really comparable. We linked to some examples of our back-of-the-envelope calculations from our 2019 post, and they compare very unfavorably to the thoroughness of GiveWell’s cost-effectiveness analyses. That said, GiveWell also counts some second-order benefits (e.g., the expected income benefits of health interventions) that we typically don’t, so it isn’t totally obvious which direction this adjustment would end up pointing on net. (It’s also not clear how we would want to make the appropriate adjustment even in principle since there’s some division of labor going on where GiveWell has more conservative/skeptical epistemics, but we intentionally don’t consistently apply those epistemics across our work.) Overall, we think we should probably use a somewhat higher bar for our other BOTECs rather than just applying the same bar from GiveWell, but we’re not currently making a mechanical adjustment for this and don’t have a good sense of how big it should be.

4.3 Increases to our estimate of future available assets

Our available funding has increased significantly as a result of stock market moves over the last few years. (This is not independent of the assumption above of GiveWell’s available resources doubling, since that assumes a substantial increase in our giving to their top charities.)

We’ve also become more optimistic about future funders contributing to highly-effective opportunities of the sort we may recommend, which would also lower our current bar on the margin. Some of this is driven by the emergence of other billionaires self-identifying with effective altruism, but it also reflects GiveWell’s increased funding from other donors, increasingly concentrated wealth at the top end of the global distribution, and the cryptocurrency boom. (Yes, that is weird, we know.)

By the same logic as above, increasing expected resources should lower the bar, but we don’t have as good of a model for how cost-effectiveness scales with resources in other Global Health and Wellbeing causes as we do for GiveWell recommendations, especially not for causes that we haven’t identified yet.

If the only thing that changed were GiveWell autonomously lowering its bar and accordingly having less cost-effective marginal recommendations, we should in principle marginally reallocate away from GiveWell and to other opportunities. But GiveWell isn’t independently lowering its bar; our overall plans and assessment of our bar contribute to the update. And given the composition of GiveWell’s top charities, made up of scalable, commodities-driven global health interventions, we expect them to have a lower eta (i.e., decline less in cost-effectiveness with more funding) than opportunities like R&D or advocacy that are more people-intensive (where we have a prior that returns tend to be more like logarithmic, which is more steeply declining than our model of the GiveWell top charities). That should mean that as resources rise, a larger portion of the total should flow to GiveWell. And that is reflected in our most recent plans: we previously wrote that we expected something like 10% of Open Phil assets/spending to go to “straightforward charity” exemplified by the GiveWell top charities, but now anticipate likely giving a modestly higher proportion (which, combined with asset increases, will mean substantially increasing our support for them in dollar terms).

4.4 Updates to our estimate of the likely “last dollar” cost-effectiveness of our non-GiveWell spending

Above, we argued that the GiveWell bar should go down in terms of our old weights and stay roughly nominally flat (at ~1,000x) in our new units. While the first-order implication is that the bar should uniformly decline across all of our Global Health and Wellbeing work, there are some complicating considerations.

Our new higher valuation on health and the GiveWell bar going down makes non-GiveWell global health R&D and advocacy opportunities look more promising than before, and we really don’t know how many of these opportunities we could find. If, hypothetically, we could find billions of dollars a year in global health R&D opportunities with marginal cost-effectiveness above the new GiveWell bar, then that should be the bar instead (and, by implication, we shouldn’t fund the GiveWell recommendations going forward). That said, it seems unlikely a priori that marginal cost-effectiveness for billions of dollars of global health R&D or advocacy spending would end up right in between the old and new GiveWell bars (which fall by one third for child health interventions and 50% for adult health interventions38), so the most likely implications of this are that either (less likely) our bar has always been too low and we should have always been doing this hypothetical global health R&D or advocacy instead of supporting GiveWell top charities, or (more likely) we will find many good opportunities better than the marginal GiveWell dollar but not enough that they independently drive our marginal dollar. Our new valuation on health (relative to income) is also a little higher than GiveWell’s, which in principle means that there could be things above our bar but below theirs, though in practice the valuations are close enough that we don’t think this is likely to be a big deal.

More concretely, we’ve been continuing to explore new areas that we think might be more leveraged than the GiveWell top charities, including recently making hires (announcements coming soon!) to lead new programs in global aid policy and South Asian air quality. The basic thesis on these new causes is to try to “multiply sources of leverage for philanthropic impact (e.g., advocacy, scientific research, helping the global poor) to get more humanitarian impact per dollar (for instance via advocacy around scientific research funding or policies, or scientific research around global health interventions, or policy around global health and development).” With our first hires in these new causes starting next year, it’s too soon for us to have a major update on the marginal cost-effectiveness of spending far out the curve from where we are now, but a few modest updates:

  • The a priori case for the existence of large scale leveraged interventions more cost-effective than the GiveWell margin continues to seem compelling to us. The Bill and Melinda Gates Foundation spends well over half a billion dollars a year on global health research and development39 and we find it plausible (though far from obvious, and haven’t been able to get great data on the view) that the marginal dollar there is better than the GiveWell margin.
  • But we haven’t found anything obviously more cost-effective than the GiveWell margin and scalable to billions of dollars a year. We’re far from being done looking, and have only covered a small part of the conceptual space, but we’re also not prepared to bet that we will succeed at that scale in the future.
  • For a more pessimistic prior, consider that at GiveWell’s bar of 8x GiveDirectly, the cost per outcome-as-good-as-averting-an-under-5 death is about $4000-$4500, and the cost per actual under-5 life saved (ignoring other benefits) for a charity focused on saving kids’ lives is about $6000-$7000.40 There are about 5 million under-5 deaths per year, which implies that they would all be eliminated for $20-35B/year if GiveWell’s cost-effectiveness level could be maintained to an arbitrary scale. Total development assistance for health is more than that. If there were more than $3B/year of room-for-more-funding an order of magnitude more cost-effective than GiveWell’s margin, it would need to be as effective as eliminating all child mortality. This argument isn’t decisive by any means but can help give a sense of how hard it would be to beat the GiveWell margin by a lot at massive scale: there just aren’t that many orders of magnitude to work with. (A sketch of this arithmetic follows the list.)
  • We still don’t think all of our existing grantmaking necessarily hits this bar, and are continuing to try to move flexible portfolios towards higher expected-return areas while learning more about our returns and considering reducing our spending in others.
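
The arithmetic referenced in the bullet above, using the figures as stated (which of course carry their own uncertainty):

```python
UNDER5_DEATHS_PER_YEAR = 5_000_000
COST_PER_DEATH_EQUIVALENT = (4_000, 4_500)    # at ~8x GiveDirectly, per outcome-as-good-as an under-5 death
COST_PER_ACTUAL_LIFE_SAVED = (6_000, 7_000)   # per actual under-5 life saved, ignoring other benefits

low = UNDER5_DEATHS_PER_YEAR * COST_PER_DEATH_EQUIVALENT[0] / 1e9
high = UNDER5_DEATHS_PER_YEAR * COST_PER_ACTUAL_LIFE_SAVED[1] / 1e9
print(f"eliminating all under-5 deaths at GiveWell-margin cost-effectiveness: ~${low:.0f}B-${high:.0f}B/year")

# $3B/year at 10x the GiveWell margin would have to do as much good as ~$30B/year at the margin,
# i.e. roughly as much as eliminating under-5 mortality entirely.
print(3 * 10)   # -> 30 ($B/year-equivalent at the GiveWell margin)
```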

 

Our current best guess is that we won’t be able to spend as much as we need to within the Global Health and Wellbeing frame at a significantly higher marginal cost-effectiveness rate (though average might be higher) than the GiveWell top charities, so the marginal cost-effectiveness of the GiveWell recommendations continues to be a relevant metric for our overall bar. However, we still expect most of our medium-term growth in GHW to be in new causes that can take advantage of the leveraged returns to research and advocacy, and could imagine that we’ll eventually find enough room for more funding in those interventions that we will need to raise the bar again.

We haven’t done a thorough analysis of the costs of over- vs under-shooting “the bar” for all of our causes, but one important takeaway from our analysis of that for the GiveWell setting, which may or may not properly extrapolate, is that saving and collecting investment returns isn’t “free” (or positive) from an impact perspective. That is because, very roughly, the world is getting better (and accordingly opportunities to improve the world are getting worse) over time, and saved funding doesn’t have that long to compound and has to be spent later at a higher rate given the “spend down within our primary donors’ lifetimes” constraint (which in turn likely means it will be spent at a lower cost-effectiveness further out the annual spending curve). We also think we will be in a much better place to raise more funding from other donors if we’re spending down our existing resources, and the expected benefits of that, along with the ex ante possibility of raising a lot more money, makes it theoretically ambiguous whether the costs of under- or over-estimating the “true” bar are higher. Accordingly we’re just going with our very rough and round current best guess for the bar for now, rather than doing a full expected-value calculation, and we will revisit in the future as we learn more.

4.5 Bottom line

We are now treating our bar as “roughly 1,000x” (with our new weights on health) for the GiveWell top charities and in our new cause selection and grantmaking, though we retain considerable uncertainty and expect to continue to revisit that over the coming years. For the typical mix of GiveWell interventions, this bar is about 20% lower given our new moral weights.41

We think it’s important to note that the bar is very rough – we aren’t very confident that it, or the BOTECs we consider against it, are even within a factor of 2 of correct – and we will continue to put considerable weight on factors not included in our rough back-of-the-envelope calculations in making major decisions.

Due to this analysis and the lower forward-looking bar, we’re planning to give more to the GiveWell top charities this year and going forward – more on that next week.

5. Appendix A

The available direct service interventions in health, like the ones GiveWell recommends, are far more cost-effective in low-income countries than high-income countries, so the discussion above focuses on what value we should place on lifesaving interventions in low-income countries. If we were focused instead on saving lives in the developed world, likely via advocacy of some sort, we might trade off differently between lifesaving vs income-enhancing interventions – we are uncertain over a range of rich-world DALY valuations between 2-6 units of log-income.

There are theoretical and empirical reasons to think that the exchange rate at which people trade off mortality risks against income gains differs systematically across income levels, with richer people valuing mortality more relative to income.

5.1 VSL elasticity to income

The main empirical evidence is from the VSL literature described above. As discussed, economists often attempt to statistically estimate a function mapping VSL to income, anchoring it off the better-validated VSL figures in high-income countries.

This literature generally finds that individuals’ willingness to pay for life expectancy increases with income, which is unsurprising – a dollar matters a lot less to a rich person, so if everyone valued a DALY equally then VSLYs would increase linearly with income. Most reviews also find that this willingness to pay increases at a faster pace than income. For example, Robinson et al. 2019 review the mainstream literature, which finds that the elasticity of VSL to income is between 1.0-1.2 across LICs (and a bit below 1 across the developed world), though Robinson’s own analysis suggests an elasticity of 1.5 for extrapolating to LMICs.42 An elasticity of 1.2 would mean that if two individuals’ income differs by 10%, then on average the dollar value they place on a year of life expectancy will differ by 12%.

Lisa Robinson chaired a commission, sponsored by the Gates Foundation, that recommended the following ensemble approach:

  • VSL is anchored to US at 160x income, with an income elasticity of 1.5, and a lower bound of 20x income.
  • VSL is anchored to US at 160x income, with an income elasticity of 1.0
  • VSL is anchored to OECD at 100x income, with an income elasticity of 1.0

 

These yield the following VSLY estimates for someone at an income 1/100th that of the US43:

  • 0.5x income, via lower bound44
  • 4x income45
  • 2.5x income46
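
A minimal sketch of how those three figures are derived (the ~40-year remaining adult life expectancy divisor is our illustrative assumption; the precise inputs are in the footnotes):

```python
def vsly_multiple_of_income(anchor_vsl_multiple, income_ratio, elasticity,
                            lower_bound_multiple=None, remaining_life_expectancy=40):
    """Extrapolate a VSL anchor (a multiple of the anchor country's income) to a country
    whose income is income_ratio times the anchor's, then convert to a VSLY expressed
    as a multiple of the target country's income."""
    vsl_multiple = anchor_vsl_multiple * income_ratio ** (elasticity - 1)
    if lower_bound_multiple is not None:
        vsl_multiple = max(vsl_multiple, lower_bound_multiple)
    return vsl_multiple / remaining_life_expectancy

print(vsly_multiple_of_income(160, 0.01, 1.5, lower_bound_multiple=20))  # -> 0.5 (the 20x lower bound binds)
print(vsly_multiple_of_income(160, 0.01, 1.0))                           # -> 4.0
print(vsly_multiple_of_income(100, 0.01, 1.0))                           # -> 2.5
```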

 

Here again is the chart from our discussion of LIC VSLYs above. As noted above, in this chart we plot all the estimates found by the literature search in Robinson et al. 2019’s meta-analysis – they searched for all VSL analyses, whether stated or revealed, in any country that had been classified as low- or middle-income in the last 20 years.47

I’ve added a few lines to show various elasticities you could use to predict LMIC VSLYs based on US VSLYs. An elasticity of 1 would say that willingness to pay for a year of life expectancy varies 1:1 with income – combining that with the US VSLY of 4 years of income, we’d predict that individuals in any country are willing, on average, to trade 4 units of log-income for 1 year of life expectancy. If instead we combined an elasticity of 1.1 with the US VSLY, we’d predict that individuals at the global poverty line would be willing to trade roughly 2.5 years of income for a year of life expectancy – this is roughly in line with the VSLYs from IDInsight’s surveys of communities demographically similar to GiveWell beneficiaries.

It seems to me that any elasticity between, say, 0.9 and 1.3 is potentially compatible with this data.

 

[Figure: the same LMIC VSLY estimates, with added lines showing the predictions of different income elasticities anchored to the US VSLY]

 

Analysts are often interested in the relationship between VSL and national-level income statistics like per-capita GDP, even though we often have measures of the respondents’ incomes, because it is often more practical to apply heuristics to national-level income statistics, especially when deriving VSL estimates to inform national policies. When we analyze VSLY measured in multiples of GNI per capita, rather than respondents’ income, we see a stronger relationship between income and VSLY – this data could be compatible with elasticities between, say, 1.0 and 1.5. This seems to be because the respondents in LMIC VSL studies report lower average incomes than their respective national average – presumably because the social scientists performing these studies are interested in somewhat poorer populations.48 We suspect that comparing LMIC VSL estimates to national-level income averages, rather than to respondents’ (lower) incomes, biases analyses like Robinson et al. toward finding higher elasticities of VSL to income.

 

[Figure: LMIC VSLY estimates expressed as multiples of GNI per capita rather than of respondents’ reported income]

5.2 Theoretical arguments for DALY valuations to vary by income

One theoretical argument would go like this: If you think we can improve individuals’ lives by improving their incomes, and you also think the moral impact of saving a life varies somewhat with the quality of that life (i.e. it’s better to extend a happy life than a miserable life), then it follows that it is more valuable in theory to extend the life of a typical high-income country resident than that of a typical person at the global poverty line.49 Many people (including us) find this a deeply concerning line of reasoning – and critically, this theoretical dynamic is in reality swamped by the fact that it is dramatically less expensive to extend the life of someone at the global poverty line, which is why the overwhelming majority of our GHW portfolio is focused on extending lives and increasing incomes in low and lower middle income countries.

5.3 Bringing these considerations together

We’ve been unsettled about how to aggregate these lines of argument, but ended up concluding that we didn’t need to reach a resolution on this because the expected costs of mistakes if we were wrong (i.e., if we assumed 2x and the true answer were 6x, or vice versa) were low.50

For now, we haven’t decided specifically how to weigh lives-vs-income tradeoffs in high-income countries, and when we face decisions that might depend on the specifics, will test a range of values between 2 and 6 units of log-income.

How Feasible Is Long-range Forecasting?

How accurate do long-range (≥10yr) forecasts tend to be, and how much should we rely on them?

As an initial exploration of this question, I sought to study the track record of long-range forecasting exercises from the past. Unfortunately, my key finding so far is that it is difficult to learn much of value from those exercises, for the following reasons:

  1. Long-range forecasts are often stated too imprecisely to be judged for accuracy. [More]
  2. Even if a forecast is stated precisely, it might be difficult to find the information needed to check the forecast for accuracy. [More]
  3. Degrees of confidence for long-range forecasts are rarely quantified. [More]
  4. In most cases, no comparison to a “baseline method” or “null model” is possible, which makes it difficult to assess how easy or difficult the original forecasts were. [More]
  5. Incentives for forecaster accuracy are usually unclear or weak. [More]
  6. Very few studies have been designed so as to allow confident inference about which factors contributed to forecasting accuracy. [More]
  7. It’s difficult to know how comparable past forecasting exercises are to the forecasting we do for grantmaking purposes, e.g. because the forecasts we make are of a different type, and because the forecasting training and methods we use are different. [More]

We plan to continue to make long-range quantified forecasts about our work so that, in the long run, we might learn something about the feasibility of long-range forecasting, at least for our own case. [More]

1. Challenges to learning from historical long-range forecasting exercises

Most arguments I’ve seen about the feasibility of long-range forecasting are purely anecdotal. If arguing that long-range forecasting is feasible, the author lists a few example historical forecasts that look prescient in hindsight. But if arguing that long-range forecasting is difficult or impossible, the author lists a few examples of historical forecasts that failed badly. How can we do better?

The ideal way to study the feasibility of long-range forecasting would be to conduct a series of well-designed prospective experiments testing a variety of forecasting methods on a large number of long-range forecasts of various kinds. However, doing so would require us to wait ≥10 years to get the results of each study and learn from them.

To learn something about the feasibility of long-range forecasting more quickly, I decided to try to assess the track record of long-range forecasts from the past. First, I searched for systematic retrospective accuracy evaluations for large collections of long-range forecasts. I identified a few such studies, but found that they all suffered from many of the limitations discussed below.[1]E.g. Kott & Perconti (2018); Fye et al. (2013); Albright (2002), which I previously discussed here; Parente & Anderson-Parente (2011).

I also collected past examples of long-range forecasting exercises I might evaluate for accuracy myself, but quickly determined that doing so would require more effort than the results would likely be worth. Finally, I reached out to the researchers responsible for a large-scale retrospective analysis with particularly transparent methodology,[2]This was Fye et al. (2013). See Mullins (2012) for an extended description of the data collection and analysis process, and attached spreadsheets of all included sources and forecasts and how they were evaluated in the study. and commissioned them to produce a follow-up study focused on long-range forecasts. Its results were also difficult to learn from, again for some of the reasons discussed below (among others).[3]The commissioned follow-up study is Mullins (2018). A few notes on this study: The study was pre-registered at OSF Registries here. Relative to the pre-registration, Mullins (2018) extracted forecasts from a slightly different set of source documents, because one of the planned source documents …

1.1 Imprecisely stated forecasts

If a forecast is phrased in a vague or ambiguous way, it can be difficult or impossible to subsequently judge its accuracy.[4]For further discussion of this point, see e.g. Tetlock & Gardner (2015), ch. 3. This can be a problem even for very short-range forecasts, but the challenge is often greater for long-range forecasts, since they often aim to make a prediction about circumstances, technologies, or measures that …

For example, consider the following forecasts:[5]The forecasts in this section are taken from the forecasts spreadsheet attached to Mullins (2018). In some cases they are slight paraphrases of the forecasting statements from the source documents.

  • From 1975: “By 2000, the tracking and data relay satellite system (TDRSS) will acquire and relay data at gigabit rates.”
  • From 1980: “The world’s population will increase 55 percent, from 4.1 billion people in 1975 to 6.35 billion in 2000.”
  • From 1977: “The average fuel efficiency of automobiles in the US will be 27 to 29 miles per gallon in 2000.”
  • From 1972: “The CO2 concentration will reach 380 ppm by the year 2000.”
  • From 1987: “In Germany, in the year 1990, 52.0% of women aged 15 – 64 will be registered as employed.”
  • From 1967: “The installed power in the European Economic Community will grow by a factor of a hundred from a programmed 3,700 megawatts in 1970 to 370,000 megawatts in 2000.”

Broadly speaking, these forecasts were stated with sufficient precision to now judge them as correct or incorrect.

In contrast, consider the low precision of these forecasts:

  • From 1964: “Operation of a central data storage facility with wide access for general or specialized information retrieval will be in use between 1971 and 1991.” What counts as “a central data storage facility”? What counts as “general or specialized information retrieval”? Perhaps most critically, what counts as “wide access”? Given the steady growth of (what we now call) the internet from the late 1960s onward, this forecast might be considered true for different decades depending on whether we interpret “wide access” to refer to access by thousands, or millions, or billions of people.
  • From 1964: “In 2000, general immunization against bacterial and viral diseases will be available.” What is meant by “general immunization?” Did the authors mean a universal vaccine? Did they mean widely-delivered vaccines protecting against several important and common pathogens? Did they mean a single vaccine that protects against several pathogens?
  • From 1964: “In 2000, automation will have advanced further, from many menial robot services to sophisticated, high-IQ machines.” What counts as a “menial robot service,” and how many count as “many”? How widely do those services need to be used? What is a high-IQ machine? Would a machine that can perform well on IQ tests but nothing else count? Would a machine that can outperform humans on some classic “high-IQ” tasks (e.g. chess-playing) count?
  • From 1964: “Reliable weather forecasts will be in use between 1972 and 1988.” What accuracy score counts as “reliable”?
  • From 1983: “Between 1983 and 2000, large corporate farms that are developed and managed by absentee owners will not account for a significant number of farms.” What counts as a “large” corporate farm? What counts as a “significant number”?

In some cases, even an imprecisely phrased forecast can be judged as uncontroversially true or false, if all reasonable interpretations are true (or false). But in many cases, it’s impossible to determine whether a forecast should be judged as true or false.

Unfortunately, it can often require substantial skill and effort to transform an imprecise expectation into a precisely stated forecast, especially for long-range forecasts.[6]Technically, it should be possible to transform almost any imprecise forecast into a precise forecast using a “human judge” approach, but this can often be prohibitively expensive. In a “human judge” approach, one would write down an imprecise forecast, perhaps along with … Continue reading

In such cases, one can choose to invest substantial effort into improving the precision of one’s forecasting statement, perhaps with help from someone who has developed substantial expertise in methods for addressing this difficulty (e.g. the “Questions team” at Good Judgment Inc.). Or, one can make the forecast despite its imprecision, to indicate something about one’s expectations, while understanding that it may be impossible to later judge as true or false.

Regardless, the frequent imprecision of historical long-range forecasts makes it difficult to assess them for accuracy.

1.2 Practically uncheckable forecasts

Even if a forecast is stated precisely, it might be difficult to check for accuracy if the information needed to judge the forecast is non-public, difficult to find, untrustworthy, or not available at all. This can be an especially common problem for long-range forecasts, for example because variables that are reliably measured (e.g. by a government agency) when the forecast is made might no longer be reliably measured at the time of the forecast’s “due date.”

For example, in the study we recently commissioned,[7] the following forecasts were stated with relatively high precision, but it was nevertheless difficult to find reliable sources of “ground truth” information that could be used to judge the exact claim of the original forecast:

  • From 1967: “By the year 2000, the US will include approximately 232 million people age 14 and older.” The commissioned study found two “ground truth” sources for judging this forecast, but some guesswork was still required because the two sources disagreed with each other substantially, and one source had info on the population of those 15 and older but not of those 14 and older.
  • From 1980: “In 2000, 400 cities will have passed the million population mark.” In this case there is some ambiguity about what counts as a city, but even if we set that aside, the commissioned study found two “ground truth” sources for judging this forecast, but some guesswork was still required because those sources included figures for some years (implying particular average trends that could be extrapolated) but not for 2000 exactly.

1.3 Non-quantified degrees of confidence

In most forecasting exercises I’ve seen, forecasters provide little or no indication of how confident they are in each of their forecasts, which makes it difficult to assess their overall accuracy in a meaningful way. For example, if 50% of a forecaster’s predictions are correct, we would assess their accuracy very differently if they made those forecasts with 90% confidence vs. 50% confidence. If degrees of confidence are not quantified, there is no way to compare the forecaster’s subjective likelihoods to the objective frequencies of events.[8]

Unfortunately, in the long-range forecasting exercises I’ve seen, degrees of confidence are often not mentioned at all. If they are mentioned, forecasters typically use imprecise language such as “possibly” or “likely,” terms which can be used to refer to hugely varying degrees of confidence.[9] Such imprecision can sometimes lead to poor decisions,[10] and means that such forecasts cannot be assessed using calibration and resolution measures of accuracy.
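To make this concrete, here is a minimal sketch (in Python, using made-up forecasts rather than data from any study discussed here) of the kind of calibration check that becomes possible once degrees of confidence are quantified: group forecasts by stated probability and compare each group’s stated confidence to the fraction that came true.

    from collections import defaultdict

    # Hypothetical quantified forecasts: (stated probability, whether the event occurred).
    forecasts = [
        (0.9, True), (0.9, True), (0.9, False), (0.9, True),
        (0.6, True), (0.6, False), (0.6, False),
        (0.5, True), (0.5, False),
    ]

    # Group forecasts by stated probability and compare to the observed frequency.
    by_confidence = defaultdict(list)
    for prob, occurred in forecasts:
        by_confidence[prob].append(occurred)

    for prob in sorted(by_confidence):
        outcomes = by_confidence[prob]
        observed = sum(outcomes) / len(outcomes)
        print(f"Stated {prob:.0%} confidence: {observed:.0%} came true (n={len(outcomes)})")

    # With only verbal labels like "likely" or "possibly," no such comparison is possible.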

1.4 No comparison to a baseline method or null model is feasible

One way to make a large number of correct forecasts is to make only easy forecasts, e.g. “in 10 years, world population will be larger than 5 billion.” One can also use this strategy to appear impressively well-calibrated, e.g. by making forecasts like “With 50% confidence, when I flip this fair coin it will come up heads.” And because forecasts can vary greatly in difficulty, it can be misleading to compare the accuracy of forecasters who made forecasts about different phenomena.[11]

For example, forecasters making predictions about data-rich domains (e.g. sports or weather) might have better Brier scores than forecasters making predictions about data-poor domains (e.g. novel social movements or rare disasters), but that doesn’t mean that the sports and weather forecasters are better or “more impressive” forecasters — it may just be that they have limited themselves to easier-to-forecast phenomena.

To assess the ex ante difficulty of some set of forecasts, one could compare the accuracy of a forecasting exercise’s effortfully produced forecasts against the accuracy of forecasts about the same statements produced by some naive “baseline” method, e.g. a simple poll of broadly educated people (conducted at the time of the original forecasting exercise), or a simple linear extrapolation of the previous trend (if time series data are available for the phenomenon in question). Unfortunately, such naive baseline comparisons are often unavailable.
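As an illustration of what one such naive baseline might look like, here is a minimal sketch (using made-up time series data, not data from any of the studies discussed here) of a linear-extrapolation baseline forecast:

    import numpy as np

    # Hypothetical annual observations of some quantity, available at the time of forecasting.
    years = np.array([1965, 1970, 1975, 1980])
    values = np.array([3.3, 3.7, 4.1, 4.4])  # e.g. world population in billions

    # Naive baseline: fit a straight line to the historical data and extrapolate it forward.
    slope, intercept = np.polyfit(years, values, deg=1)
    baseline_forecast_2000 = slope * 2000 + intercept
    print(f"Linear-extrapolation baseline for 2000: {baseline_forecast_2000:.2f}")

An effortful forecasting exercise demonstrates skill only to the extent that it beats simple baselines like this one on the same questions.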

Even if no comparison to the accuracy of a naive baseline method is available, one can sometimes compare the accuracy of a set of forecasts to the accuracy predicted by a “null model” of “random” forecasts. For example, for the forecasting tournaments described in Tetlock (2005), all forecasting questions came with answer options that were mutually exclusive and mutually exhaustive, e.g. “Will [some person] still be President on [some date]?” or “Will [some state’s] borders remain the same, expand, or contract by [some date]?”[12]

Because of this, Tetlock knew the odds that a “dart-throwing chimp” (i.e. a random forecast) would get each question right (50% chance for the first question, 1/3 chance for the second question). Then, he could compare the accuracy of expert forecasters to the accuracy of a random-forecast “null model.” Unfortunately, the forecasting questions of the long-range forecasting exercises I’ve seen are rarely set up to allow for the construction of a null model to compare against the (effortful) forecasts produced by the forecasting exercise.[13]
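For questions with mutually exclusive and exhaustive answer options, a random-forecast null model is straightforward to write down. The sketch below (in Python, with made-up forecasts) compares the Brier scores of some hypothetical forecasts against a “dart-throwing chimp” that spreads probability evenly across the options; it illustrates the idea rather than reproducing the analysis of any study mentioned above.

    import numpy as np

    def brier(probs, outcome_index):
        """Multi-option Brier score: sum of squared errors across the answer options."""
        outcome = np.zeros(len(probs))
        outcome[outcome_index] = 1.0
        return float(np.sum((np.asarray(probs) - outcome) ** 2))

    # Hypothetical forecasts: (probabilities over the answer options, index of the option that occurred).
    forecasts = [
        ([0.7, 0.3], 0),       # e.g. "still president" vs. "not"
        ([0.2, 0.5, 0.3], 1),  # e.g. "borders the same / expand / contract"
        ([0.6, 0.2, 0.2], 2),
    ]

    expert_scores = [brier(p, i) for p, i in forecasts]
    chimp_scores = [brier([1 / len(p)] * len(p), i) for p, i in forecasts]  # uniform "random" forecast

    print(f"Mean expert Brier score: {np.mean(expert_scores):.3f} (lower is better)")
    print(f"Mean null-model Brier score: {np.mean(chimp_scores):.3f}")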

1.5 Unclear or weak incentives for accuracy

For most long-range forecasting exercises I’ve seen, it’s either unclear how much incentive there was for forecasters to strive for accuracy, or the incentives for accuracy seem clearly weak.

For example, in many long-range forecasting exercises, there seems to have been no concrete plan to check the accuracy of the study’s forecasts at a particular time in the future — and in fact, the forecasts from even the most high-profile long-range forecasting studies I’ve seen were never checked for accuracy (as far as I can tell), at least not by anyone associated with the original study or funded by the same funder(s). Without a concrete plan to check the accuracy of the forecasts, how strong could the incentive for forecaster accuracy be?

Furthermore, long-range forecasting exercises are rarely structured as forecasting tournaments, with multiple individuals, groups, or methods competing to make the most accurate forecasts about the same forecasting questions (or heavily overlapping sets of forecasting questions). As such, there’s no way to compare the accuracy of one individual or group or method against another, and again it’s unclear whether the forecasters had much incentive to strive for accuracy.

Also, some studies that were set up to eventually check the accuracy of the forecasts made didn’t use a scoring rule that reliably incentivized reporting one’s true probabilities, i.e. a proper scoring rule.
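As a quick illustration of what “proper” means here (a generic point about scoring rules, not a description of any particular study above): under the Brier score, a forecaster’s expected penalty is minimized by reporting the probability they actually believe, so honest reporting is the best strategy.

    import numpy as np

    true_prob = 0.7  # the probability the forecaster actually believes

    def expected_brier(reported, true_prob):
        """Expected Brier penalty for a binary event, given the forecaster's true belief."""
        return true_prob * (reported - 1) ** 2 + (1 - true_prob) * (reported - 0) ** 2

    reports = np.linspace(0, 1, 101)
    penalties = [expected_brier(r, true_prob) for r in reports]
    best_report = reports[int(np.argmin(penalties))]
    print(f"Expected penalty is minimized by reporting {best_report:.2f}")  # matches the true belief, 0.70

    # A scoring rule without this property can reward reporting probabilities
    # that differ from one's true beliefs.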

1.6 Weak strategy for causal identification

Even if a study passes the many hurdles outlined above, and there are clearly demonstrated accuracy differences between different forecasting methods, it can still be difficult to learn about which factors contributed to those accuracy differences if the study was not structured as a randomized controlled trial, and no other strong causal identification strategy was available.[14]

1.7 Unclear relevance to our own long-range forecasting

I haven’t yet found a study that (1) evaluates the accuracy of a large collection of somewhat-varied[15] long-range (≥10yr) forecasts and that (2) avoids the limitations above. If you know of such a study, please let me know.

Tetlock’s “Expert Political Judgment” project (EPJ; Tetlock 2005) and his “Good Judgment Project” (GJP; Tetlock & Gardner 2015) might come closest to satisfying those criteria, and that is a major reason we have prioritized learning what we can from Tetlock’s forecasting work specifically (e.g. see here) and have supported his ongoing research.

Tetlock’s work hasn’t focused on long-range forecasting specifically, but because it largely (though not entirely) avoids the other limitations above, I will briefly explore what I think we can and can’t learn from it about the feasibility of long-range forecasting, and use it to illustrate the more general point that studies of long-range forecasting can be of unclear relevance to our own forecasting even when they avoid the other limitations discussed above.

1.7.1 Tetlock, long-range forecasting, and questions of relevance

Most GJP forecasts had time horizons of 1-6 months,[16] and thus can tell us little about the feasibility of long-range (≥10yr) forecasting.[17]

In Tetlock’s EPJ studies, however, forecasters were asked a variety of questions with forecasting horizons of 1-25 years. (Forecasting horizons of 1, 3, 5, 10, or 25 years were most common.) Unfortunately, by the time of Tetlock (2005), only a few 10-year forecasts (and no 25-year forecasts) had come due, so Tetlock (2005) only reports accuracy results for forecasts with forecasting horizons he describes as “short-term” (1-2 years) and “long-term” (usually 3-5 years, plus a few longer-term forecasts that had come due).[18]

The differing accuracy scores for short-term vs. long-term forecasts in EPJ are sometimes used to support a claim that the accuracy of expert predictions declines toward chance five years out.[19]

While it’s true that accuracy declined “toward” chance five years out, the accuracy differences reported in Tetlock (2005) are not as large as I had assumed upon initially hearing this claim (see footnote for details[20]). Fortunately, we might soon be in a position to learn more about long-range forecasting from the EPJ data, since most EPJ forecasts (including most 25-year forecasts) will have resolved by 2022.[21]

Perhaps more importantly, how analogous are the forecasting questions from EPJ to the forecasting questions we face as a grantmaker, and how similar was the situation of the EPJ forecasters to the situation we find ourselves in?

For context, some (paraphrased) representative example “long-term” forecasting questions from EPJ include:[22]

  • Two elections from now, will the current majority in the legislature of [some stable democracy] lose its majority, retain its majority, or strengthen its majority?
  • In the next five years, will GDP growth rates in [some nation] accelerate, decelerate, or remain about the same?
  • Over the next ten years, will defense spending as a percentage of [some nation’s] expenditures rise, fall, or stay about the same?
  • In the next [ten/twenty-five] years, will [some state] deploy a nuclear or biological weapon of mass destruction (according to the CIA Factbook)?

A few observations come to mind as I consider analogies and disanalogies between EPJ’s “long-term” forecasting and the long-range forecasting we do as a grantmaker:[23]

  • For most of our history, we’ve had the luxury of knowing the results from EPJ and GJP and being able to apply them to our forecasting, which of course wasn’t true for the EPJ forecasters. For example, many of our staff know that it’s often best to start one’s forecast from an available base rate, and that many things probably can’t be predicted with better accuracy than chance (e.g. which party will be in the majority two elections from now). Many of our staff have also done multiple hours of explicit calibration training, and my sense is that very few (if any) EPJ forecasters are likely to have done calibration training prior to making their forecasts. Several of our staff have also participated in a Good Judgment Inc. forecasting training workshop.
  • EPJ forecasting questions were chosen very carefully, such that they (a) were stated precisely enough to be uncontroversially judged for accuracy, (b) came with prepared answer options that were mutually exclusive and collectively exhaustive (or continuous), (c) were amenable to base rate forecasting (though base rates were not provided to the forecasters), and (d) satisfied other criteria necessary for rigorous study design.[24] In contrast, most of our forecasting questions (1) are stated imprecisely (because the factors that matter most to the grant decision are ~impossible or prohibitively costly to state precisely), (2) are formulated very quickly by the forecaster (i.e. the grant investigator) as they fill out our internal grant write-up template, and thus don’t come with pre-existing answer options, and (3) rarely have clear base rate data to learn from. Overall, this might suggest we should (ignoring other factors) expect lower accuracy than was observed in EPJ, e.g. because we formulate questions and make forecasts about them so quickly. It also means that we are less able to learn from the forecasts we make, because many of them are stated too imprecisely to judge for accuracy.
  • I’m unsure whether EPJ questions asked about phenomena that are “intrinsically” easier or harder to predict than the phenomena we try to predict. E.g. party control in established democracies changes regularly and is thus very difficult to predict even one or two elections in advance, whereas some of our grantmaking is premised substantially on the continuation of stable long-run trends. On the other hand, many of our forecasts are (as mentioned above) about phenomena which lack clearly relevant base rate data to extrapolate, or (in some cases) about events that haven’t ever occurred before.
  • How motivated were EPJ forecasters to strive for accuracy? Presumably the rigorous setup and concrete plan to measure forecast accuracy provided substantial incentives for accuracy, though on the other hand, the EPJ forecasters knew their answers and accuracy scores would be anonymous. Meanwhile, explicit forecasting is a relatively minor component of Open Phil staffers’ work, and our less rigorous setup means that incentives for accuracy may be weak, but also our (personally identified) forecasts are visible to many other staff.

Similar analogies and disanalogies also arise when comparing our forecasting situation to that of the forecasters who participated in other studies of long-range forecasting. This should not be used as an excuse to avoid drawing lessons from such studies when we should, but it does mean that it may be tricky to assess what we should learn about our own situation from even very well-designed studies of long-range forecasting.

2. Our current attitude toward long-range forecasting

Despite our inability to learn much (thus far) about the feasibility of long-range forecasting, and therefore also about best practices for it, we plan to continue to make long-range quantified forecasts about our work so that, in the long run, we might learn something about the feasibility of long-range forecasting, at least in our own case. We plan to say more in the future about what we’ve learned about forecasting in our own grantmaking context, especially after a larger number of our internal forecasts have come due and been judged for accuracy.

Footnotes

1 E.g. Kott & Perconti (2018); Fye et al. (2013); Albright (2002), which I previously discussed here; Parente & Anderson-Parente (2011).
2 This was Fye et al. (2013). See Mullins (2012) for an extended description of the data collection and analysis process, and attached spreadsheets of all included sources and forecasts and how they were evaluated in the study.
3 The commissioned follow-up study is Mullins (2018). A few notes on this study:

  • The study was pre-registered at OSF Registries here. Relative to the pre-registration, Mullins (2018) extracted forecasts from a slightly different set of source documents, because one of the planned source documents didn’t fit the study’s criteria upon examination, and we needed to identify additional source documents to ensure we could reach our target of ≥400 validated long-range forecasts.
  • Three spreadsheets are attached to the PDF of Mullins (2018): one with details on all source documents, one with details on all evaluated forecasts, and one with details on the “ground truth evidence” used to assess the accuracy of each forecast.
  • I chose the source documents based on how well they seemed (upon a quick skim) to meet as many of the following criteria as possible (the first two criteria were necessary, the others were ideal but not required):
    • One of the authors’ major goals was to say something about which events/scenarios were more vs. less likely, as opposed to merely aiming to e.g. “paint possible futures.”
    • The authors made forecasts of events/scenarios ≥10yrs away, that were expected to be somewhat different from present reality. (E.g. not “vacuum cleaners will continue to exist.”)
    • The authors expressed varying degrees of confidence for many of their forecasts, quantitatively or at least with terms such as “likely,” “unlikely,” “highly likely,” etc.
    • The authors made some attempt to think about which plans made sense given their forecasts. (I.e., important decisions were at stake, or potentially at stake.)
    • The authors’ language suggests they had some degree of self-awareness about the difficulty of long-range forecasting.
    • The authors seemed to have a decent grasp of not just the domain they were trying to forecast, but also of broadly applicable reasoning tools such as those from economics.
    • The authors made their forecasts after ~1965 (so they had access to a decent amount of “modern” science) but before 2007 (so that we’d have some ≥10yr forecasts evaluable for accuracy).
    • The authors seemed to put substantial effort into their forecasts, e.g. with substantial analysis, multiple lines of argument, thoughtful caveats, engagement with subject-matter experts, etc.
    • The authors were writing for a fairly serious audience with high expectations, e.g. an agency of a leading national government.

Since Mullins (2018) is modeled after Fye et al. (2013), we knew in advance it would have several of the limitations described in this post, but we hoped to learn some things from it anyway, especially given the planned availability of the underlying raw data. Unfortunately, upon completion we discovered additional limitations of the study.

For example, Mullins (2018) implicitly interprets all forecasts as “timing forecasts” of the form “event X will first occur in approximately year Y.” This has some advantages (e.g. allowing one to operationalize some notion of “approximately correct”), but it also leads to counterintuitive judgments in many cases:

  • In some cases, forecasts that seem to be of the form “X will be true in year Y” are interpreted for evaluation as “event X will first occur in approximately year Y.” For example, consider the following forecast made in 1975: “In 1985, deep-space communication stations on Earth will consist of two 64-meter antennas plus one 26-meter antenna at Goldstone, California; Madrid, Spain; and Canberra, Australia” (Record ID #2001). This forecast was judged incorrect, with a temporal forecasting error of 13 years, on the grounds that the forecasted state of affairs was already true 13 years earlier (in 1972), rather than having come to be true in approximately 1985.
  • In other cases, forecasts that seem to be of the form “parameter P will have approximately value V in year Y” are interpreted for evaluation as “parameter P will first approximately hit value V in year Y.” For example, consider the following forecast made in 1978: “In Canada, in the year 1990, 55.2% of women aged 15 – 64 will be registered as employed” (Record ID #2748). The forecast was judged as incorrect because the true value in 1990 was 58.5%, and the value had first reached 55% in 1985, just barely outside the “within 30%” rule for judging a forecast as a success (a rough sketch of this style of timing-based scoring appears after this list). In this example, it seems more reasonable to say that the original forecast was nearly (but not quite) correct for 1990, rather than interpreting the original forecast as being primarily about the timing of when the female labor force participation rate would hit exactly 55.2%. (The forecast is correctly marked as “Mostly realized,” but the analytic setup doesn’t give this label much room to affect the top-line quantitative results.)
  • Some forecasts aren’t interpretable as timing forecasts at all, and thus shouldn’t have been included when comparing the success rate of the evaluated forecasts against BryceTech’s “null model” (i.e. random forecast) success rate, which assumes forecasts are timing forecasts. Example forecasts that can’t be interpreted as timing forecasts include negative forecasts (e.g. Record ID #2336: “In the year 2000, fusion power will not be a significant source of energy”), no-change forecasts (e.g. Record ID #2364: “The world’s population in the year 2000 will be less than the seven billion”), and whole-period forecasts (e.g. Record ID #2370: “The continent of Africa will have a population growth rate of 2.7 per cent over the 1965-2000 period”). Many of these forecasts were assigned a temporal forecasting error of 0 despite not being interpretable as timing forecasts.
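To make the “timing forecast” interpretation concrete, here is a rough sketch (my own reconstruction of the general idea, not code from Mullins (2018) or Fye et al. (2013)) of scoring a forecast as a success only when the event first occurs within some tolerance (here, 30% of the forecast horizon) of the predicted year:

    def timing_score(year_made, year_forecast, year_event_first_true, tolerance=0.3):
        """Score a forecast interpreted as "event X will first occur in approximately year Y".

        The forecast counts as a success if the event first became true within
        tolerance * (forecast horizon) years of the predicted year. This is a
        reconstruction of the general idea, not the exact rule used in the studies.
        """
        horizon = year_forecast - year_made
        temporal_error = abs(year_event_first_true - year_forecast)
        return temporal_error, temporal_error <= tolerance * horizon

    # The 1978 forecast discussed above: ~55% female employment predicted for 1990,
    # with that level first reached in 1985.
    error, success = timing_score(1978, 1990, 1985)
    print(error, success)  # 5 years of temporal error; 5 > 0.3 * 12, so judged a failure

Note how a forecast that was nearly correct for the stated year can still be scored as a failure once it is reinterpreted as a timing forecast.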

There are other limits to the data and analysis in Mullins (2018), and we don’t think one should draw major substantive conclusions from it. It may, however, be a useful collection of long-range forecasts that could be judged and analyzed for accuracy using alternate methods.

My thanks to Kathleen Finlinson and Bastian Stern for their help evaluating this report.

4 For further discussion of this point, see e.g. Tetlock & Gardner (2015), ch. 3. This can be a problem even for very short-range forecasts, but the challenge is often greater for long-range forecasts, since they often aim to make a prediction about circumstances, technologies, or measures that aren’t yet well-defined at the time the forecast is made.
5 The forecasts in this section are taken from the forecasts spreadsheet attached to Mullins (2018). In some cases they are slight paraphrases of the forecasting statements from the source documents.
6 Technically, it should be possible to transform almost any imprecise forecast into a precise forecast using a “human judge” approach, but this can often be prohibitively expensive. In a “human judge” approach, one would write down an imprecise forecast, perhaps along with some accompanying material about motivations and reasoning and examples of what would and wouldn’t satisfy the intention of the forecast, and then specify a human judge (or panel of judges) who will later decide whether one’s imprecise forecast should be judged true or false (or, each judge could give a Likert-scale rating of “how accurate” or “how clearly accurate” the forecast was). Then, one can make a precise forecast about the future judgment of the judge(s). The precise forecast, then, would be a forecast both about the phenomenon one wishes to forecast, and about the psychology and behavior of the judge(s). Of course, one’s precise forecast must also account for the possibility that one or more judges will be unwilling or unable to provide a judgment at the required time.

An example of this “human judge” approach is the following forecast posted to the Metaculus forecasting platform: “Will radical new ‘low-energy nuclear reaction’ technologies prove effective before 2019?” In this case, the exact (but still somewhat imprecise) forecasting statement was: “By Dec. 31, 2018, will Andrea Rossi/Leonardo/Industrial Heat or Robert Godes/Brillouin Energy have produced fairly convincing evidence (> 50% credence) that their new technology […] generates substantial excess heat relative to electrical and chemical inputs?” Since there remains some ambiguity about e.g. what should count as “convincing evidence,” the question page also specifies that “The bet will be settled by [Huw] Price and [Carl] Shulman by New Years Eve 2018, and in the case of disagreement shall defer to majority vote of a panel of three physicists: Anthony Aguirre, Martin Rees, and Max Tegmark.”

7 See the forecasts spreadsheet attached to Mullins (2018).
8 One recent proposal is to infer forecasters’ probabilities from their imprecise forecasting language, as in Lehner et al. (2012). I would like to see this method validated more extensively before I rely on it.
9 E.g. see figure 18 in chapter 12 of Heuer (1999); a replication of that study by Reddit.com user zonination here; Wheaton (2008); Mosteller & Youtz (1990); Mauboussin & Mauboussin (2018) (original results here); table 1 of Mandel (2015). I haven’t vetted these studies.
10 Tetlock & Gardner (2015), ch. 3, gives the following (possible) example:

In 1961, when the CIA was planning to topple the Castro government by landing a small army of Cuban expatriates at the Bay of Pigs, President John F. Kennedy turned to the military for an unbiased assessment. The Joint Chiefs of Staff concluded that the plan had a “fair chance” of success. The man who wrote the words “fair chance” later said he had in mind odds of 3 to 1 against success. But Kennedy was never told precisely what “fair chance” meant and, not unreasonably, he took it to be a much more positive assessment. Of course we can’t be sure that if the Chiefs had said “We feel it’s 3 to 1 the invasion will fail” that Kennedy would have called it off, but it surely would have made him think harder about authorizing what turned out to be an unmitigated disaster.

11 One recent proposal for dealing with this problem is to use Item Response Theory, as described in Bo et al. (2017):

The conventional estimates of a forecaster’s expertise (e.g., his or her mean Brier score, based on all events forecast) are content dependent, so people may be assigned higher or lower “expertise” scores as a function of the events they choose to forecast. This is a serious shortcoming because (a) typically judges do not forecast all the events and (b) their choices of which events to forecast are not random. In fact, one can safely assume that they select questions strategically: Judges are more likely to make forecasts about events in domains where they believe (or are expected to) have expertise or events they perceive to be “easy” and highly predictable, so their Brier scores are likely to be affected by this self-selection that, typically, leads to overestimation of one’s expertise. Thus, all comparisons among people who forecast distinct sets of events are of questionable quality.

A remedy to this problem is to compare directly the forecasting expertise based only on the forecasts to the common subset of events forecast by all. But this approach can also run into problems. As the number of forecasters increases, comparisons may be based on smaller subsets of events answered by all and become less reliable and informative. As an example, consider financial analysts who make predictions regarding future earnings of companies that are traded on the market. They tend to specialize in various areas, so it is practically impossible to compare the expertise of an analyst that focuses on the automobile industry and another that specialize in the telecommunication area, since there is no overlap between their two areas. Any difference between their Brier scores could be a reflection of how predictable one industry is, compared to the other, and not necessarily of the analysts’ expertise and forecasting ability. An IRT model can solve this problem. Assuming forecasters are sampled from a population with some distribution of expertise, a key property of IRT models is invariance of parameters (Hambleton & Jones, 1993): (1) parameters that characterize an individual forecaster are independent of the particular events from which they are estimated; (2) parameters that characterize an event are independent of the distribution of the abilities of the individuals who forecast them (Hambleton, Swaminathan & Rogers, 1991). In other words, the estimated expertise parameters allow meaningful comparisons of all the judges from the same population as long as the events require the same latent expertise (i.e., a unidimensional assumption).

…we describe an IRT framework in which one can incorporate any proper scoring rule into the model, and we show how to use weights based on event features in the proper scoring rules. This leads to a model-based method for evaluating forecasters via proper scoring rules, allowing us to account for additional factors that the regular proper scoring rules rarely consider.

I have not evaluated this approach in detail and would like to see it critiqued and validated by other experts.

On this general challenge, see also the discussion of “Difficulty-adjusted probability scores” in the Technical Appendix of Tetlock (2005).

12 See the Methodological Appendix of Tetlock (2005).
13 This includes the null models used in Fye et al. (2013) and Mullins (2018), which I don’t find convincing.
14 On the tricky challenge of robust causal identification from observational data, see e.g. Athey & Imbens (2017) and Hernán & Robins (forthcoming).
15 By “somewhat varied,” I mean to exclude studies that are e.g. limited to forecasting variables for which substantial time series data is available, or variables in a very narrow domain such as a handful of macroeconomic indicators or a handful of environmental indicators.
16 See figure 3 of this December 2015 draft of a paper eventually published (without that figure) as Friedman et al. (2018).
17 Despite this, I think we can learn a little from GJP about the feasibility of long-range forecasting. Good Judgment Project’s Year 4 annual report to IARPA (unpublished), titled “Exploring the Optimal Forecasting Frontier,” examines forecasting accuracy as a function of forecasting horizon in this figure (reproduced with permission):

[Figure: AUC as a function of forecasting horizon and type of forecaster]

This chart uses an accuracy statistic known as AUC, the area under the ROC curve (see Steyvers et al. 2014), to represent the accuracy of binary, non-conditional forecasts, at different time horizons, throughout years 2-4 of GJP. Roughly speaking, this chart addresses the question: “At different forecasting horizons, how often (on average) were forecasters on ‘the right side of maybe’ (i.e. above 50% confidence in the binary option that turned out to be correct), where 0.5 represents ‘no better than chance’ and 1 represents ‘always on the right side of maybe’?”

For our purposes here, the key results shown above are, roughly speaking, that (1) regular forecasters did approximately no better than chance on this metric at ~375 days before each question closed, (2) superforecasters did substantially better than chance on this metric at ~375 days before each question closed, (3) both regular forecasters and superforecasters were almost always “on the right side of maybe” immediately before each question closed, and (4) superforecasters were roughly as accurate on this metric at ~125 days before each question closed as they were at ~375 days before each question closed.
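For readers unfamiliar with the statistic, here is a minimal sketch (on made-up forecasts, using scikit-learn; it has nothing to do with the GJP data itself) of how an AUC value is computed for a set of binary probability forecasts:

    from sklearn.metrics import roc_auc_score

    # Hypothetical binary forecasts: predicted probabilities and whether each event occurred.
    predicted_probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.2, 0.1]
    outcomes = [1, 1, 0, 1, 0, 1, 0, 0]

    # An AUC of 0.5 means "no better than chance"; 1.0 means the forecaster perfectly
    # separated events that happened from events that didn't.
    print(f"AUC: {roc_auc_score(outcomes, predicted_probs):.2f}")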

If GJP had involved questions with substantially longer time horizons, how quickly would superforecaster accuracy have declined with longer time horizons? We can’t know, but an extrapolation of the results above is at least compatible with an answer of “fairly slowly.”

Of course there remain other questions about how analogous the GJP questions are to the types of questions that we and other actors attempt to make long-range forecasts about.

18 Forecasting horizons are described under “Types of Forecasting Questions” in the Methodological Appendix of Tetlock (2005). The definitions of “short-term” and “long-term” were provided via personal communication with Tetlock, as was the fact that only a few 10-year forecasts could be included in the analysis of Tetlock (2005).
19 E.g. Tetlock himself says “there is no evidence that geopolitical or economic forecasters can predict anything ten years out beyond the excruciatingly obvious — ‘there will be conflicts’ — and the odd lucky hits that are inevitable whenever lots of forecasters make lots of forecasts. These limits on predictability are the predictable results of the butterfly dynamics of nonlinear systems. In my EPJ research, the accuracy of expert predictions declined toward chance five years out” (Tetlock & Gardner 2015, p. 243).
20 Tetlock (2005) reports both calibration scores and discrimination (aka resolution) scores, explaining that: “A calibration score of .01 indicates that forecasters’ subjective probabilities diverged from objective frequencies, on average, by about 10 percent; a score of .04, an average gap of 20 percent. A discrimination score of .01 indicates that forecasters, on average, predicted about 6 percent of the total variation in outcomes; a score of .04, that they captured 24 percent” (Tetlock 2005, ch. 2). See the book’s Technical Appendix for details on how Tetlock’s calibration and discrimination scores are computed.
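For intuition, the standard (Murphy-decomposition style) calibration and discrimination indices can be computed roughly as follows; this is a generic textbook version, not necessarily Tetlock’s exact implementation, which is described in his Technical Appendix.

    import numpy as np
    from collections import defaultdict

    def calibration_and_discrimination(probs, outcomes):
        """Generic calibration and discrimination indices. probs: forecast probabilities;
        outcomes: 1 if the event occurred, else 0. Lower calibration is better;
        higher discrimination is better."""
        probs, outcomes = np.asarray(probs, dtype=float), np.asarray(outcomes, dtype=float)
        base_rate = outcomes.mean()
        bins = defaultdict(list)
        for p, o in zip(probs, outcomes):
            bins[p].append(o)  # group forecasts by stated probability
        n = len(probs)
        calibration = sum(len(os) * (p - np.mean(os)) ** 2 for p, os in bins.items()) / n
        discrimination = sum(len(os) * (np.mean(os) - base_rate) ** 2 for os in bins.values()) / n
        return calibration, discrimination

    # Made-up example: ten forecasts at 90% and 10% confidence.
    probs = [0.9] * 5 + [0.1] * 5
    outcomes = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
    print(calibration_and_discrimination(probs, outcomes))  # approximately (0.01, 0.09)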

Given this scoring system, Tetlock’s results on the accuracy of short-term vs. long-term forecasts are:

Sample of forecasts                Calibration score    Discrimination score
Expert short-term forecasts        .023                 .027
Expert long-term forecasts         .026                 .021
Non-expert short-term forecasts    .024                 .023
Non-expert long-term forecasts     .020                 .021

The data above are from figure 2.4 of Tetlock (2005). I’ve renamed “dilettantes” to “non-experts.”

See also this spreadsheet, which contains additional short-term vs. long-term accuracy comparisons in data points estimated from figure 3.2 of Tetlock (2005) using WebPlotDigitizer. See ch. 3 and the Technical Appendix of Tetlock (2005) for details on how to interpret these data points. Also note that there is a typo in the caption for figure 3.2; I confirmed with Tetlock that the phrase which reads “long-term (1, 2, 5, 7…)” should instead be “long-term (1, 3, 5, 7…).”

21 Personal communication with Phil Tetlock. And according to the Acknowledgements section at the back of Tetlock (2005), all EPJ forecasts will come due by 2026.
22 Here is an abbreviated summary of EPJ’s forecasting questions, drawing and quoting from Tetlock (2005)’s Methodological Appendix:

  • Each expert was asked to make short-term and long-term predictions about “each of four nations (two inside and two outside their domains of expertise) on seventeen outcome variables (on average), each of which was typically broken down into three possible futures and thus required three separate probability estimates.” (Experts didn’t respond to all questions, though.)
  • Most forecasting questions asked about the possible futures of ~60 nations, clustered into nine regions: the Soviet bloc, the European Union, North America, Central and Latin America, the Arab world, sub-Saharan Africa, China, Northeast Asia, and Southeast Asia.
  • Most forecasting questions fell into one of four content categories:
    • Continuity of domestic political leadership: “For established democracies, should we expect after either the next election (short-term) or the next two elections (longer-term) the party that currently has the most representatives in the legislative branch(es) of government will retain this status, will lose this status, or will strengthen its position (separate judgments for bicameral systems)? For democracies with presidential elections, should we expect that after the next election or next two elections, the current incumbent/party will lose control, will retain control with reduced popular support, or will retain control with greater popular support? …For states with shakier track records of competitive elections, should we expect that, in either the next five or ten years, the individuals and (separate judgment) political parties/movements currently in charge will lose control, will retain control but weather major challenges to their authority (e.g., coup attempts, major rebellions), or will retain control without major challenges? Also, for less stable polities, should we expect the basic character of the political regime to change in the next five or ten years and, if so, will it change in the direction of increased or reduced economic freedom, increased or reduced political freedom, and increased or reduced corruption? Should we expect over the next five or ten years that interethnic and other sectarian violence will increase, decrease, or remain about the same? Finally, should we expect state boundaries — over the next ten or twenty-five years — to remain the same, expand, or contract and — if boundaries do change — will it be the result of peaceful or violent secession by a subnational entity asserting independence or the result of peaceful or violent annexation by another nation-state?”
    • Domestic policy and economic performance: “With respect to policy, should we expect — over the next two or five years — increases, decreases, or essentially no changes in marginal tax rates, central bank interest rates, central government expenditures as percentage of GDP, annual central government operating deficit as percentage of GDP, and the size of state-owned sectors of the economy as percentage of GDP? Should we expect — again over the next two or five years — shifts in government priorities such as percentage of GDP devoted to education or to health care? With respect to economic performance, should we expect — again over the next two or five years — growth rates in GDP to accelerate, decelerate, or remain about the same? What should our expectations be for inflation and unemployment over the next two or five years? Should we expect — over the next five or ten years — entry into or exit from membership in free-trade agreements or monetary unions?”
    • National security and defense policy: “Should we expect — over the next five or ten years — defense spending as a percentage of central government expenditure to rise, fall, or stay about the same? Should we expect policy changes over the next five to ten years with respect to military conscription, with respect to using military force (or supporting insurgencies) against states, with respect to participation in international peacekeeping operations (contributing personnel), with respect to entering or leaving alliances or perpetuation of status quo, and with respect to nuclear weapons (acquiring such weapons, continuing to try to obtain such weapons, abandoning programs to obtain such weapons or the weapons themselves)?”
    • Special-purpose exercises: In these eight exercises, experts made forecasts about: (1) “the likelihood of twenty-five states acquiring capacity to produce weapons of mass destruction, nuclear or biological, in the next five, ten, or twenty-five years as well as the possibility of states — or subnational terrorist groups — using such weapons”; (2) “whether there would be a war [in the Persian Gulf] (and, if so, how long it would last, how many Allied casualties there would be, whether Saddam Hussein would remain in power, and, if not, whether all or part of Kuwait would remain under Iraqi control)”; (3) the likelihood — over the next three, six, or twelve years — of “both economic reform (rate of divesting state-owned enterprises; degree to which fiscal and monetary policy fit templates of “shock therapy”) and subsequent economic performance (unemployment, inflation, GDP growth)”; (4) the likelihood of “human-caused or -facilitated disasters in the next five, ten, or twenty-five years, including refugee flows, poverty, mass starvation, massacres, and epidemics (HIV prevalence) linked to inadequate public health measures”; (5) adoption of the Euro and “prospects of former Soviet bloc countries, plus Turkey, in meeting [European Union] entry requirements”; (6) who will win the American presidential elections of 1992 and 2000 and by how much; (7) “the overall performance of the NASDAQ (Is it a bubble? If so, when will it pop?) as well as the revenues, earnings, and share prices of selected ‘New Economy’ firms, including Microsoft, CISCO, Oracle, IBM, HP, Dell, Compaq, Worldcom, Enron, AOL Time Warner, Amazon, and e-Bay”; (8) “CO2 emissions per capita (stemming from burning fossil fuels and manufacturing cement) of twenty-five states over the next twenty-five years, and on the prospects of states actually ratifying an international agreement (Kyoto Protocol) to regulate such emissions.”
23 Some of these observations overlap with the other limitations listed above.
24 On the other criteria, see the Methodological Appendix of Tetlock (2005).

Questions We Ask Ourselves Before Making a Grant

Although we have typically emphasized the importance for effective philanthropy of long-term commitment to causes and getting the right people in place, the most obvious day-to-day decision funders face is whether to support specific potential giving opportunities. As part of our internal guidance for program officers, we’ve collected a series of questions that we like to ask ourselves about potential funding opportunities, covering topics such as a grant’s place in its ecosystem, the grantee’s leadership, our comparative advantage, the grant’s size and duration, its cost-effectiveness, our reservations, and predictions about how the grant will play out.

This post, which I adapted from the internal guidance for program officers, reviews the value we get from these questions and some of our approaches to answering them.

This is a list of questions we’ve provided to staff to help them think about how to structure their internal grant writeups, and that grant reviewers tend to ask themselves as they review grants. We don’t have these questions in our standard template or ask that they be itemized for each grant (there are many individual cases where particular questions are inapplicable or unimportant).

Evaluating a grant’s place in the ecosystem

As we try to determine whether a grant is likely to have a positive impact, we ask ourselves key questions including: What does this grant do for the overall ecosystem of organizations working on the cause in question? What key need is it addressing and how? What other ways might there be to fill/maintain/expand this need, and does this grant seem like the best way? If not, is it compatible with pursuing other approaches to the same need simultaneously?

Examples of ecosystem “needs” might be: intellectually solid analysis of what policies should be (What should sentencing laws be? How can the government improve existing programs to strengthen pandemic preparedness?); “insider” advocacy building support for the right policies (e.g., making the case to legislators, businesses, and other influential leaders); “grassroots” advocacy to build support for the right policies; or building the power of aligned constituencies.

That said, we recognize there are stark limits to our own knowledge of “what the field needs,” especially when we’re supporting organizations that primarily serve other organizations in the field (e.g. by providing training or support). In these cases, we try to think about how to create dynamics where a service-providing grantee will be accountable to the organizations that consume its services (e.g. the people who attend the trainings). One specific way we do this is by asking people in our fields whether they’d rather have grants for their own organizations or for the “public good” service provider we’re considering funding. We often ask, “who is the right conceptual ‘buyer’ for these services, and do they actually think it would be worth paying this much for?”

One thing we generally try to avoid in assessing an ecosystem is creating an elaborate “theory of change” that requires our grantees to work together in highly specific ways so that the whole is more than the sum of its parts. We tend to think that unless we’re effectively the only funder in a field, our grants are each relatively marginal and can be fairly accurately assessed on their own terms. Sometimes we see opportunities to help grantees coordinate around a particular strategy, but that is not our default approach. And, like other funders, we’re open to convening grantees (and non-grantees) to try to help develop shared strategies or goals; we’re just typically more skeptical of the idea that we need to fund all parts of a strategy or ecosystem for the whole to be effective.

Evaluating a grantee’s leadership

When deciding whether to make a grant and how to structure it, we ask ourselves: Who is being empowered by this grant? Are they an effective leader who we’re happy to bet on? Are they underfunded? What are their strengths and weaknesses? Are we empowering them to do what they already want to do, or are we asking them to do something different than they might have chosen on their own? If the latter, why is that a good idea, and have we considered deferring to their judgment instead?

Since our grantees tend to know their work much better than we do, we believe we’re usually better off finding people who share our key goals and letting them work out the details. Many of our favorite grants are in the model of “Find a person who is fantastic at doing something the field needs, and give them no-strings-attached support to help them do more, faster.” Some of the funders we think are most impressive focus much more on supporting the right people than on the details of those people’s plans. (For example, see our post on the Sandler Foundation.)

Because we are often trying to support the work of particular people housed within larger organizations, we might place restrictions on funding to make it conditional on those people’s involvement in, and control of, the work. However, once the right people are empowered, we try to be skeptical of our own impulses to narrowly direct how the funding is used. If we find ourselves asking a person to spend more time and energy on a specific project we prefer, we ask ourselves, “Are we sure we’re right, or is it possible that they know what the field needs better than we do, and we should be funding them to do their first-choice work?”

We are sometimes wary of potential grants where a well-funded organization offers to do a project that’s particularly appealing to us. In many cases, we believe they’re proposing a project they would do with or without our support, and our funding is effectively paying for other things they want to do. We often refer to this concept as “fungibility.” We try to avoid these cases by asking ourselves, “Why can’t this person do the project they’re pitching with money they already have?” If the answer is that they don’t have much money, it’s possible we should be giving unrestricted support. If the answer is that they don’t actually value the project as much as other things they could do, we ask whether there’s a good reason they value it less, and whether we can expect them to do good work on the project if we fund it.

Evaluating comparative advantage

As we assess whether a specific grantee is the right partner, we believe it’s helpful to ask ourselves whether it is the best organization to do this work, and whether this work is the most valuable use of that organization’s additional resources. When we’re considering funding a research organization that wants to do public advocacy around its research, we consider looking for an advocacy organization that could potentially do the same work more effectively. When we’re considering funding an organization that will provide services to other organizations, we tend to ask if those “client” organizations are adequately funded, since they are often better positioned than we are to decide whether new services are worth paying for.

The question of comparative advantage applies to us as well. We often ask ourselves: Are we the best-placed organization to evaluate and fund this? Who else could fund it, and why aren’t they?

If we find ourselves considering a lot of similar grants, especially small ones, we tend to look for opportunities to make a bigger grant and have someone else take the time to strategically regrant. For instance, rather than five small grants to five individuals doing similar activities in five different states, we may be better off finding (or creating) an organization that is well-placed to evaluate those activities, and giving them one big grant. This thinking informed our decision to support The Humane League’s Open Wing Alliance, a new coalition of promising farm animal welfare groups that are trained in corporate campaigning so they can achieve cage-free wins in new countries.

For us, determining whether other funders are better positioned to support a project isn’t just about saving money. It’s helpful for checking the basic case for the grant, identifying considerations we might have missed, and continuing to refine our theory of our own comparative advantage as a funder.

If we can think of someone who seems like they logically ought to be willing to fund a potential grant, we try to ask them what they think. Sometimes it will turn out that they know more about this type of grant than we do, and they will raise new considerations. Sometimes it will turn out that the grant is not actually a fit for what they do, and we’ll learn more about them and improve our model of them as a funder. Accurate models of other funders help us assess our own comparative advantage, and potentially help the other funders by allowing us to refer potential grantees who seem like a good fit for their goals.

Evaluating a grant’s size and duration

When we review grantees’ budgets and determine how much funding to provide, we ask ourselves: What are we paying for, above and beyond what’s already funded by others? Could we get the same work done with a lot less money? Are we sure this grant shouldn’t be bigger? What is our “willingness to pay” for success here?

While sometimes people will ask for much more funding than we think appropriate, we are also often concerned about the times when people ask for less than they should. After conversations with many funders and many nonprofits, some of whom are our grantees and some of whom are not, our best model is that many grantees are constantly trying to guess what they can get funded, won’t ask for as much money as they should ask for, and, in some cases, will not even consider what they would do with some large amount because they haven’t seriously considered the possibility that they might be able to raise it. We’ve had multiple experiences where a grantee asks for X, and we say “What would you do with 2X?” They usually say, “Never thought about it — let me get back to you,” and we often end up with a much better grant in the end. This was the case in our grant to support research at the University of Washington. Through ongoing conversations, the original grant proposal focusing on the development of a universal flu vaccine evolved into an expanded grant incorporating work on a computational protein design system that we believe could have much broader utility if it makes it possible to rapidly design new vaccines or antiviral drugs. We sometimes ask grantees what activities would occur at several different funding levels and consider these scenarios and tradeoffs as part of the decision-making process.

When we are considering higher salaries or more ambitious proposals, we need to ask ourselves (and our grantees) how many years we should commit to. Long-term commitments often help grantees plan, so if it seems like we’re funding something that’s going to take several years to play out, we generally aim to commit to more than one year up front. In the case of the new Center for Security and Emerging Technology, we think it will take some time to develop expertise on key questions relevant to policymakers and want to give CSET the commitment necessary to recruit key people, so we provided a five-year grant. That said, for small organizations, a few years can be a very long time in terms of learning new things, changing organizational direction and finding new funders, so a two- to three-year commitment will often (in our view) be sufficient to achieve the goal of helping with planning.

Evaluating a grant’s cost-effectiveness

When deciding whether to make a grant or hold the money in reserve to distribute elsewhere, we think about cost-effectiveness. Do we get good bang-for-the-buck?

Grant investigators sometimes include a “back-of-the-envelope-calculation” (“BOTEC”) to roughly estimate the expected cost-effectiveness of the potential grant in their writeup.

We recently shared an update, including some example BOTECs, on our thinking about how to evaluate giving aimed at helping people alive today, including a mix of direct aid, policy work, and scientific research. While we previously used unconditional cash transfers to people living in extreme poverty as “the bar” for being willing to make a grant, GiveWell has continued to find more and larger opportunities over time, which has the implication that we may raise “the bar” to something closer to the current estimated cost-effectiveness of GiveWell’s unfunded top charities.

“Longtermist” BOTECs focus more on how much a given grant might reduce the chances of a global catastrophic risk.

Noting and considering reservations

We generally try to ask (and our investigation process encourages asking): If we had a smart, thoughtful friend who thought this grant was going to end up having no impact, what would they tell us and how would we respond?

Grant investigators include for decision-makers’ consideration the devil’s advocate argument against a grant’s potential impact, and explain why they don’t believe that argument to be decisive. We hope that grant investigators explicitly considering and vocalizing these objections helps reduce the chances that they follow a faulty line of reasoning too far, or that a crucial objection to a grant is never brought up.

Evaluating predictions about how a grant will play out

We hope to learn from our past grantmaking. Looking back on a grant, how might we determine whether it has gone well or badly?

By asking grant investigators to consider this question before recommending a grant, and by encouraging them to make quantified and objectively evaluable predictions about how a grant will play out over time, we try to make our future evaluation of a grant’s performance easier.

While we no longer publish these predictions in our public grant writeups, we continue to track them internally and use them in renewal decisions to update our expectations in light of facts on the ground.

(More on this practice here.)

GiveWell’s Top Charities Are (Increasingly) Hard to Beat

Our thinking on prioritizing across different causes has evolved as we’ve made more grants. This post explores one aspect of that: the high bar set by the best global health and development interventions, and what we’re learning about the relative performance of some of our other grantmaking areas that seek to help people today.

To summarize:

  • When we were getting started, we used unconditional cash transfers to people living in extreme poverty (a program run by GiveDirectly) as “the bar” for being willing to make a grant, on the grounds that such giving was quite cost-effective and likely extremely scalable and persistently available, so we should not generally make grants that we expected to achieve less benefit per dollar than that. Based on the roughly 100-to-1 ratio of average consumption between the average American and GiveDirectly cash transfer recipients, and a logarithmic model of the utility of money, we call this the “100x bar.” So if we are giving to, e.g., encourage policies that increase incomes for average Americans, we need to increase them by $100 for every $1 we spend to get as much benefit as just giving that $1 directly to GiveDirectly recipients. More.
  • GiveWell (which we used to be part of and remain closely affiliated with) has continued to find more and larger opportunities over time, and become more optimistic about finding more cost-effective ones in the future. This has the implication that we should raise “the bar” to something closer to the current estimated cost-effectiveness of GiveWell’s unfunded top charities, which they believe to be in the range of 5-15x more cost-effective than 100x cash transfers, meaning a bar of benefits 500-1,500 times the cost of our grants (which we approximate to a “1,000x bar”). More.
  • Since adopting cash transfers as the relevant benchmark for our giving aimed at helping people alive today, we’ve given ~$100M in U.S. policy, ~$100M in scientific research, and ~$300M based on GiveWell recommendations. According to our extremely rough internal calculations, we do expect many of our grants in scientific research and U.S. policy to exceed the “100x bar” represented by unconditional cash transfers, but relatively few to clear a “1,000x bar” roughly corresponding to high-end estimates for GiveWell’s unfunded top charities. This would imply that it’s quite difficult/rare for work in these categories to look more cost-effective than GiveWell’s top charities. However, these calculations are extraordinarily rough and uncertain. More.
  • In spite of these calculations, we think there are some good arguments to consider in favor of our current grantmaking in these areas. More.
  • We continue to think it is likely that there are causes aimed at helping people today (potentially including our current ones) that could be more cost-effective than GiveWell’s top charities, and we are hiring researchers to work on finding and evaluating them. More.
  • We are still thinking through the balance of these considerations. We are not planning any rapid changes in direction. More.

Cash transfers to people in extreme poverty

In 2015, when we were still part of GiveWell, we wrote:

By default, we feel that any given grant of $X should look significantly better than making direct cash transfers (totaling $X) to people who are extremely low-income by global standards – abbreviated as “direct cash transfers.” We believe it will be possible to give away very large amounts, at any point in the next couple of decades, via direct cash transfers, so any grant that doesn’t meet this bar seems unlikely to be worth making….

It’s possible that this standard is too lax, since we might find plenty of giving opportunities in the future that are much stronger than direct cash transfers. However, at this early stage, it isn’t obvious how we will find several billion dollars’ worth of such opportunities, and so – as long as total giving remains within the … budget – we prefer to err on the side of recommending grants when we’ve completed an investigation and when they look substantially better than direct cash transfers.

It is, of course, often extremely unclear how to compare the good accomplished by a given grant to the good accomplished by direct cash transfers. Sometimes we will be able to do a rough quantitative estimate to determine whether a given grant looks much better, much worse or within the margin of error. (In the case of our top charities, we think that donations to AMF, SCI and Deworm the World look substantially better.) Other times we may have little to go on for making the comparison other than intuition. Still, thinking about the comparison can be informative. For example, when considering grants that will primarily benefit people in the U.S. (such as supporting work on criminal justice reform), benchmarking to direct cash transfers can be a fairly high standard. Based on the idea that the value of additional money is roughly proportional to the logarithm of income,[1] and the fact that mean American income is around 100x annual consumption for GiveDirectly recipients, we assume that a given dollar is worth ~100x as much to a GiveDirectly recipient as to the average American. Thus, in considering grants that primarily benefit Americans, we look for a better than “100x return” in financial terms (e.g. increased income). Of course, there are always huge amounts of uncertainty in these comparisons, and we try not to take them too literally.

To walk through the logic of how this generates a “100x” bar a bit more clearly:

  • We want to be able to compare philanthropic opportunities that will save the U.S. or state governments money, or increase incomes for average Americans, against opportunities to directly help the global poor (or deliver other benefits) in a somewhat consistent fashion. For instance, we could imagine a hypothetical domestic advocacy opportunity that might be able to save the government $100 million, or increase productivity by $100 million, for a cost of $1 million; we would call that opportunity roughly “100x” because the benefit in terms of income to the average American is $100 for every $1 we spend.[2] If we just directly gave a random person in the U.S. $1,000, we’d expect to get “1x” because the benefit to them is equal to the cost to us (ignoring transaction costs). That is, we take our core unit of measurement for this exercise as “dollars to the average American.” Then we face the question: how should we compare transfers to the global poor (or other programs) to transfers to the average American?
  • GiveWell reports that the income of GiveDirectly recipients averages $0.79 per day[3] — so approximately $290 per person per year, compared to more than $34,000 per capita per year in the U.S.[4] This means $34,000 could double one individual’s income for a year in the U.S., or (after ~10% overhead is taken out) double the income of about 106 GiveDirectly recipients for a year.[5] (A minimal worked sketch of this arithmetic appears after this list.)
  • In this context we assume a logarithmic utility function for income, which is a fairly common simplification and assumes that doubling a person’s income contributes the same amount to their well-being regardless of how much income they started with. We think this is a plausible starting point based on evidence from life satisfaction surveys.[6] However, it is worth noting that there are credible arguments that a logarithmic utility function places either too much or too little weight on income at the high end.[7]

    • A logarithmic utility function implies that $1 for someone with 100x less income/consumption is worth 100x as much. This implies direct cash transfers to the extreme global poor go about 100x as far as the same money spent in the U.S., on average, and means any potential grant should create an expected value at least 100x the cost of the grant if it is to be considered a better use of money than such direct cash transfers.
    • With other causes, in addition to looking at monetary savings or gains, we also use “value of a statistical life” techniques to try to account for health and quality-of-life benefits. That yields more cost-effectiveness estimates, all generally framed in the language of “This seems roughly as good as saving an average American $N for each $1 we spend” or simply “Nx.”
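
To make the arithmetic above concrete, here is a minimal sketch of the “100x” calculation in Python, using the rough figures quoted in this post; the variable names and structure are ours, for illustration only.

```python
# Rough sketch of the "100x bar" arithmetic, using the approximate figures above.

us_income = 34_000          # mean U.S. per capita income, 2017 (rounded down from ~$34,489)
recipient_income = 288.35   # GiveDirectly recipient consumption per person per year (~$0.79/day)
overhead = 0.10             # rough share of a transfer lost to delivery costs

# Under log utility, a marginal dollar is worth roughly (income ratio) times more
# to someone with that much less income:
print(round(us_income / recipient_income))                      # ~118x before overhead

# Equivalently, $34,000 doubles one average American's income for a year, or the
# incomes of roughly this many GiveDirectly recipients after overhead:
print(round(us_income / (recipient_income / (1 - overhead))))   # ~106, i.e. roughly "100x"
```

Because doubling an income adds the same increment of log utility whoever it belongs to, the count of doublings (about 106 here) can be read directly as the benefit ratio relative to spending the same amount on one average American.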

    Obviously, calculations like this remain deeply uncertain and vulnerable to large mistakes, so we try to not put too much weight on them in any one case. But the general reality that they reflect — of vast global inequalities, and the relative ease of moving money from people who have a lot of it to people who have little — seems quite robust.

    Although we stopped formally using this 100x benchmark across all of our giving a couple of years ago because of considerations relating to animals and future generations, we have continued to find it a useful benchmark against which “near-termist, human-centric” grants — those that aim to improve the lives of humans on a relatively short time horizon, including a mix of direct aid, policy work, and scientific research — can be measured.

    The best programs are even harder to beat

    In 2015, when we first wrote about adopting the cash transfer benchmark, it looked like GiveWell could plausibly “run out” of their more-cost-effective-than-cash giving opportunities. At the time, they had three non-cash-transfer top charities they estimated to be in the 5-10x cash range (i.e., 5 to 10 times more cost-effective than cash transfers),[8] with ~$145 million of estimated short-term room for more funding. That, plus uncertainty about the amount of weight to put on these figures, led us to adopt the cash transfer benchmark. (In the remainder of this post, I occasionally shorten “cash transfer” to just “cash.”) But by the end of 2018, GiveWell had expanded to seven non-cash-transfer top charities estimated to be in the ~5-15x cash range, with $290 million of estimated short-term room for more funding, and with the top recommended unfilled gaps at ~8x cash transfers.[9] If we combine cash transfers at “100x” and large unfilled opportunities at ~5-15x cash transfers, the relevant “bar to beat” going forward may be more like 500-1,500x.[10] And earlier this year GiveWell suggested that they expected to find more cost-effective opportunities in the future, and they are staffing up in order to do so.

    Another approach to this question is to ask, how much better than direct cash transfers should we expect the best underfunded interventions to be? I find scalable interventions worth ~5-15x cash a bit surprising, but not wildly so. It’s not obvious where to look for a prior on this point, and the answer seems to correlate strongly with general views about broad market efficiency: if you think broad “markets for doing good” are efficient, finding a scalable intervention at ~5-15x cash might be especially surprising; conversely, if you think markets for doing good are riddled with inefficiencies, you might expect to find many even more cost-effective opportunities.

    One place to potentially look for priors on this point might be compilations of the cost-effectiveness of various evidence-based interventions. I know of five compilations of the cost-effectiveness of different interventions within a given domain that contain easily available tabulations of the interventions reviewed:[11]

    For this purpose, I was interested only in the general distribution of the estimates: I didn’t attempt to verify any of them, and I was very rough in discarding estimates that were negative or didn’t have numerical answers, which may bias my conclusions. In general, we regard the calculations included in these compilations as challenging and error-prone, and we would caution against over-reliance on them.[12]

    I made a sheet summarizing the sources’ estimates here. All five distributions appear to be (very roughly) log-normal, with standard deviations of ~0.7-1 in log base 10, implying that a one-standard-deviation increase in cost-effectiveness equates to roughly a 5-10x improvement. However, any errors in these calculations would typically inflate that figure, and we think the calculations are structurally highly error-prone, so these standard deviations likely substantially overstate the true ones.[13]

    We don’t know what the mean of the true distribution of cost-effectiveness of global development opportunities might be, but assuming it’s not more than a few times different from cash transfers (in either direction), and that measurement error doesn’t make up more than half of the variance in the cost-effectiveness compilations reviewed above (a non-trivial assumption), then these figures imply we shouldn’t be too surprised to see top opportunities ~5-15x cash. A normal distribution would imply that an opportunity two standard deviations above the mean is in the ~98th percentile. These figures would support more skepticism towards an opportunity from the same rough distribution (evidence-based global health interventions) that is claimed to be even more cost-effective (e.g., 100x or 1,000x cash rather than 10x).
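
As a rough plausibility check on that reasoning, here is a small sketch with assumed numbers: an observed spread in the middle of the ~0.7-1 range, measurement error at exactly half the variance, and a mean near cash transfers. None of these figures come from the compilations themselves.

```python
import math
from statistics import NormalDist

# Assume the observed estimates are log-normal with a standard deviation of 0.85
# in log base 10 (the middle of the ~0.7-1 range discussed above).
observed_sd = 0.85
print(10 ** observed_sd)   # ~7: one observed standard deviation is roughly a 7x difference

# If measurement error accounts for half of the observed variance, the true
# standard deviation shrinks by a factor of sqrt(0.5):
true_sd = observed_sd * math.sqrt(0.5)
print(10 ** true_sd)       # ~4: one "true" standard deviation is roughly a 4x difference

# With the mean of the true distribution assumed to sit near cash transfers, a
# ~15x-cash opportunity is about two true standard deviations above the mean
# (near the 98th percentile), while a claimed 100x-cash opportunity would be a
# far more extreme draw:
print(math.log10(15) / true_sd)                      # ~2.0 standard deviations
print(NormalDist().cdf(math.log10(15) / true_sd))    # ~0.97
print(math.log10(100) / true_sd)                     # ~3.3 standard deviations
```

The exact numbers depend heavily on the assumed spread and error share, so we would treat this only as a sanity check on the shape of the argument.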

    Stepping back from the modeling, given the vast difference in treatment costs per person for different interventions (~$5 for bednets, ~$0.33-1 for deworming, ~$250 for cash transfers), it does seem plausible to have large (~10x) differences in cost-effectiveness.

    Even if scalable global health interventions were much worse than we currently think, and, say, only ~3x as cost-effective as cash transfers, I expect GiveWell’s foray into more leveraged interventions to yield substantial opportunities that are at least several times more cost-effective, pushing back towards ~10x cash transfers as a more relevant future benchmark for unfunded opportunities.

    Overall, given that GiveWell’s numbers imply something more like “1,000x” than “100x” for their current unfunded opportunities, that those numbers seem plausible (though by no means ironclad), and that they may find yet-more-cost-effective opportunities in the future, it looks like the relevant “bar to beat” going forward may be more like 1,000x than 100x.

    Our other grantmaking aimed at helping people today

    While we think a lot of our “near-termist, human-centric” grantmaking clears the 100x bar, we see less evidence that it will clear a ~1,000x bar.

    Since we initially adopted the cash transfer benchmark in 2015, we’ve made roughly 300 grants totaling almost $200 million in our near-termist, human-centric focus areas of criminal justice reform, immigration policy, land use reform, macroeconomic stabilization policy, and scientific research. To get a sense of our estimated returns for these grants, we looked at the largest grants and found 33 grants totaling $73M for which the grant investigator conducted an ex ante “back-of-the-envelope-calculation” (“BOTEC”) to roughly estimate the expected cost-effectiveness of the potential grant for Open Philanthropy decision-makers’ consideration.

    All of these 33 grants were estimated by their investigator to have an expected cost-effectiveness of at least 100x. This makes sense given the existence of our “100x bar.” Of those 33, only eight grants, representing approximately $32 million, had BOTECs of 1,000x or greater. Our large grant to Target Malaria accounts for more than half of that.

    Although we don’t typically make our internal BOTECs public, we compiled a set here (redacted somewhat to protect some grantees’ confidentiality) to give a flavor of what they look like. As you can see, they are exceedingly rough, and take at face value many controversial and uncertain claims (e.g., the cost of a prison-year, the benefit of a new housing unit in a supply-constrained area, the impact of monetary policy on wages, the likely impacts of various other policy changes, stated probabilities of our grantees’ work causing a policy change).
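
To give a sense of the structure (though not the content) of these calculations, here is a purely hypothetical BOTEC for an imaginary U.S. policy grant; every number below is invented for illustration and does not come from any actual writeup.

```python
# Hypothetical BOTEC for an imaginary U.S. policy grant. All inputs are invented.

grant_cost = 1_000_000           # dollars
p_policy_change = 0.10           # assumed chance the grantee's work tips the policy
share_of_credit = 0.5            # assumed fraction of a win attributable to this grant
annual_benefit = 1_000_000_000   # assumed yearly benefit to average Americans if the policy passes
years_of_effect = 5              # assumed years before the counterfactual catches up

expected_benefit = p_policy_change * share_of_credit * annual_benefit * years_of_effect
print(expected_benefit / grant_cost)   # 250.0, i.e. "250x": above the 100x bar, well below 1,000x
```

Even a toy version like this leans entirely on the kinds of face-value inputs just described, which is part of why we treat the outputs so cautiously.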

    We would guess that these uncertainties would generally lead our BOTECs to be over-optimistic (rather than merely adding unbiased noise) for a variety of reasons:

    • Program officers do the calculations themselves, and generally only do the calculations for grants they’re already inclined to recommend. Even if there’s zero cynicism or intentional manipulation to get “above the bar,” grantmakers (including me) seem likely to be more charitable to their grants than others would be.
    • Many of these estimates don’t adjust for relatively straightforward considerations that would systematically push towards lower estimated cost-effectiveness, like declining marginal returns to funding at the grantee level, time discounting, or potential non-replicability of the research our policy goals are based on. The comparison with the level of care in the GiveWell cost-effectiveness models on these features is pretty stark.
    • Holden made some more general arguments along these lines in 2011.

    We think it’s notable that, even though our BOTECs are likely systematically over-optimistic in this way, it’s still rare for us to find grant opportunities in U.S. policy and scientific research that appear to score better than GiveWell’s top charities.

    Of course, compared to GiveWell, we make many more grants, to more diverse activities, and with an explicit policy of trying to rely more on program officer judgment than these BOTECs. So the idea that our models look less robust than GiveWell’s is not a surprise — we’ve always expected that to be the case — but combining that with GiveWell’s rising bar is a more substantive update.

    Some counter-considerations in favor of our work

    As we’re grappling with the considerations above, we don’t want to give short shrift to the arguments in favor of our work. We see two broad categories of arguments in this vein: (a) this work may be substantively better than the BOTECs imply; and (b) it’s a worthwhile experiment.

    This work may be better than the BOTECs imply

    There are a couple big reasons why Open Phil’s near-termist, human-centric work could turn out to be better than implied by the figures above:

    • Values/moral weights. A logarithmic utility function and view that “all lives have equal value” push strongly towards work focused on the global poor. But many people endorse much flatter utility functions in money and the use of context-specific “value of a statistical life” figures, both of which would make work in the U.S. generally look much more attractive. And of course many people think we have stronger normative obligations to attend to our neighbors and fellow citizens, which would also make our non-GiveWell near-termist work look more valuable (though we have historically been skeptical of such normative views). (You could make similar arguments on instrumental rather than normative grounds too, e.g., by arguing that flow-through effects from work in the U.S. would be larger.) Arguably we should put some weight on ideas like these in our worldview diversification process.
    • Hits. We are explicitly pursuing a hits-based approach to philanthropy with much of this work, and accordingly might expect just one or two “hits” to carry the whole portfolio. In particular, if one or two of our large science grants ended up 10x more cost-effective than GiveWell’s top charities, our portfolio to date would cumulatively come out ahead. In fact, the dollar-weighted average of the 33 BOTECs we collected above is (modestly) above the 1,000x bar, reflecting our ex ante assessment of that possibility (a hypothetical illustration of this dollar-weighting appears after this list). But the concerns about the informational value of those BOTECs remain, and most of our grants seem noticeably less likely to deliver such “hits.”
    • Mistaken analysis. As we’ve noted, we consider our BOTECs to be extremely rough. We think it’s more likely than not that better-quality BOTECs would make the work discussed above look still weaker, relative to GiveWell top charities – but we are far from certain of this, and it could go either way, especially if our policy reform efforts could contribute meaningfully to “tipping points” that lead to accelerating policy changes in the future.
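
To illustrate how a dollar-weighted average can clear the bar even when most individual grants do not, here is a hypothetical portfolio; the sizes and multipliers below are invented and are not our actual BOTECs.

```python
# Hypothetical portfolio: (grant size in dollars, BOTEC estimate in "Nx" terms).
grants = [
    (20_000_000, 3_000),   # one large science grant with a very high estimate
    (10_000_000, 400),
    (5_000_000, 250),
    (5_000_000, 150),
]

total_dollars = sum(size for size, _ in grants)
weighted_avg = sum(size * botec for size, botec in grants) / total_dollars
print(weighted_avg)   # 1650.0: above 1,000x even though three of the four grants are below it
```

A single sufficiently large “hit” dominates the weighted average, which is how a portfolio can come out ahead overall even if most grants fall short of the bar.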

    It’s a worthwhile experiment

    Our near-termist, human-centric giving since adopting the cash benchmark can be broken into roughly three groups: ~$100M for U.S. policy, ~$100M for scientific research, and ~$300M based on GiveWell recommendations in global health and development. We think given the amount of giving we anticipate doing in the future, an experimental effort of that scale is worth running. As we’ve discussed before, we see many benefits from giving in multiple different kinds of causes that are not fully captured by the impact of the grants themselves, including:

    • Learning. If our only giving were in long-termist and animal-oriented causes and to the GiveWell top charities, we think we’d learn a lot less about effective giving writ large and the full suite of tools available to a philanthropist aiming to effect change. We think that would make us less effective overall (though we give limited weight to this consideration).
    • Developing a concrete track record. We see a lot of value for ourselves and others in working in some areas with relatively short feedback loops, where it’s easier to observe whether our giving is achieving its intended impact. We would like it to be possible for us and others to recognize whether we are achieving any of our desired impacts, and that looks far more likely in our near-termist human-centric causes than in the bulk of the other causes we work on.
    • Option value. Developing staff capacity to work in many (different kinds of) causes provides the ability to adjust if our desired worldview allocation changes over time (which seems quite possible).
    • Helping other donors. Giving in a diverse set of causes increases our long-run odds of having large effects on the general dialogue around philanthropy, since we could provide tangibly useful information to a larger set of donors.

    We see a number of other practical benefits to working in a broad variety of causes, including presenting an accurate public-facing picture of our values and making our organization a more appealing place to work.

    Finally, it is worth noting that while we think GiveWell’s cost-effectiveness estimates are (far) more reliable than the very rough BOTECs we have done, we do not think their estimates (or any cost-effectiveness estimates we’ve ever seen) can be taken literally, or even used with much confidence.

    It should be possible to outperform the GiveWell top charities

    Although this post describes some doubts about how some of our giving to date may compare to the GiveWell top charities, we continue to think it should be possible to achieve more cost-effective results than the current GiveWell top charities via advocacy or scientific research funding rather than direct services. To the extent that there is a single overarching update here (and we are uncertain that there is), we think it is likely to be against the possibility of achieving sufficient leverage via advocacy or scientific research aimed at benefiting people in the U.S. or other wealthy countries alone. We have only explored a small portion of the space of possible causes in this broad area, and continue to expect that advocacy or scientific research, perhaps more squarely aimed at the global poor, could have outsized impacts. Indeed, GiveWell seems to agree this is possible, with their expansion into considering advocacy opportunities within global health and development.

    As we look more closely at our returns to date and going forward, we’re also interested in exploring other causes that may have especially high returns. One hypothesis we’re interested in exploring is the idea of combining multiple sources of leverage for philanthropic impact (e.g., advocacy, scientific research, helping the global poor) to get more humanitarian impact per dollar (for instance via advocacy around scientific research funding or policies, or scientific research around global health interventions, or policy around global health and development). Additionally, on the advocacy side, we’re interested in exploring opportunities outside the U.S.; we initially focused on U.S. policy for epistemic rather than moral reasons, and expect most of the most promising opportunities to be elsewhere.

    If this sounds interesting, you should consider applying: we’re hiring for researchers to help.

    Conclusion

    We are still in the process of thinking through the implications of these claims, and we are not planning any rapid changes to our grantmaking at this time. We currently plan to continue making grants in our current focus areas at approximately the same level as we have for the last few years while we try to come to more confident conclusions about the balance of considerations above. As Holden outlined in a recent blog post, a major priority for the next couple years is building out our impact evaluation function. We expect that will help us develop a more confident read on our impact in our most mature portfolio areas, and accordingly will place us in a better position to approach big programmatic decisions. We will hopefully improve the overall quality of our BOTECs in other ways as well.

    If, after building out this impact evaluation function and applying it to our work to date, we decided to substantially reduce or wind down our giving in any of our current focus areas, we’d do so gradually and responsibly, with ample warning and a year or more of additional funding (as much as we feel is necessary for a responsible transition) to our key partner organizations. We have no current plans to do this, and we know funders communicating openly about this kind of uncertainty is unusual and can be unnerving, but our hope is that sharing our latest thinking will be useful for others.

    Finally, we’re planning to write more at a later date about the cost-effectiveness of our “long-termist” and animal-inclusive grantmaking and the implications for our future resource allocation.

    Footnotes

    1 See e.g. Subjective Well‐Being and Income: Is There Any Evidence of Satiation? (archive)

    For instance Deaton (2008) and Stevenson and Wolfers (2008) find that the well-being–income relationship is roughly a linear-log relationship, such that, while each additional dollar of income yields a greater increment to measured happiness for the poor than for the rich, there is no satiation point.

    2 We’re eliding a huge amount of complexity here in terms of modeling the domestic welfare impacts of various policy changes, which we recognize. In practice, our calculations are often very crude, though we try to be roughly consistent in considering distributional issues and weighing whether incomes are increasing due to productivity changes, prevented waste, or other causes.
    3 See footnote 33 in GiveWell’s writeup on GiveDirectly.
    4 2017 average U.S. per capita income was $34,489, per the U.S. Census. (archive)
    5 $34,000 / ($288.35 / 0.9) = ~106. Using median U.S. income rather than mean would reduce this ~20% but seems less apt as a comparison since we’re partially modeling foregone spending and taxes are moderately progressive.
    6 See Economic Growth and Subjective Well-Being: Reassessing the Easterlin Paradox. (archive)
    7 Too much: there is some evidence of satiation (archive) in terms of self-reported wellbeing even in log terms as incomes get very high by global standards. Additionally, if you think very high incomes carry net negative externalities (e.g., through carbon emissions or excess political influence), you may even think additional income at the high end should be treated as negative. Finally, placing high moral weight on marginal consumption for high-income people seems to imply that their lives “have a lot more value in them” or “are worth a lot more,” which seems problematic.

    Too little: people continue to exercise substantial effort to increase their own income, even at high levels, and there seem to be obvious benefits beyond subjective wellbeing that accrue to them from doing so (such as increased lifespan or educational access). Additionally, if you’re discounting income or consumption logarithmically or more, even very small positive spillovers from high income people to others (e.g., through employment, charity, or bequests) could swamp the first order effects in a utility calculation.

    8 This 5-10x cash range translated to roughly ~$2,000-4,000 per “life saved equivalent” in the 2015 cost-effectiveness calculation – XLSX.
    9 Based on the median results from GiveWell’s final 2018 cost-effectiveness calculation, 8x cash implies a “cost-per outcome as good as saving an under-5 life” of ~$1,500. This is not directly comparable to the figures from 2015 because GiveWell made some changes in the values and framework used in their cost-effectiveness calculation, which affect both the outcome measures and the comparisons between them.
    10 Another way to get similarly high overall ROI figures is from comparing GiveWell’s top charity “cost per life saved equivalent” figures to rich world “value of a statistical life” figures:

    To be clear, this calculation violates the standard assumptions of value of a statistical life, one of which is that the value of a life depends on the income of a person who lives it, and is not endorsed by GiveWell (which has a more complicated moral weights system for comparing outcomes).

    11 Since this post was first written, we came across Five-Hundred Life-Saving Interventions and Their Cost-Effectiveness (archive).
    12 When we looked closely at one of the calculations in the DCP2, we found serious errors. We haven’t looked closely at the other sources at all. Overall, we expect the project of trying to estimate the cost-effectiveness of many different interventions in uniform terms to be extremely difficult and error-prone, so we don’t mean to endorse these specific estimates.
    13 Some discussion of this in the comments of GiveWell’s 2011 post on errors in the DCP2.

Projects, People and Processes

One of the challenges of large-scale philanthropy is: how can a small number of decision-makers (e.g., donors) find a large number of giving opportunities that they understand well enough to feel good about funding?

Most of the organizations I’ve seen seem to use some combination of project-based, people-based, and process-based approaches to delegation. To illustrate these, I’ll use the hypothetical example of a grant to fund research into new malaria treatments. I use the term “Program Officers” to refer to the staff primarily responsible for making recommendations to decision-makers.

  • Project-based approaches: the decision-makers hire Program Officers to look for projects; decision-makers ultimately evaluate the projects themselves. Thus, decision-makers delegate the process of searching for potential grants, but don’t delegate judgment and decision-making. For example, a Program Officer might learn about proposed research on new malaria treatments and then make a presentation to a donor or foundation Board, explaining how the project will work, and trying to convince the donor or Board that it is likely to succeed.
  • People-based approaches: decision-makers delegate essentially everything to trusted individuals. They look for staff they trust, and then defer heavily to them. For example, a Program Officer might become convinced of the merits of research on new malaria treatments and propose a grant, with the funder deferring to their judgment despite not knowing the details of the proposed research.
  • Process-based approaches: the decision-makers establish consistent, systematic criteria for grants, and processes that aim to meet these criteria. Decisions are often made by aggregating opinions from multiple grant reviewers. For example, a donor might solicit proposals for research on new malaria treatments, assemble a technical review board, ask each reviewer to rate each proposal on several criteria, and use a pre-determined aggregation system to make the final decisions about which grants are funded. Government funders such as the National Institutes of Health often use such approaches. These approaches often seek to minimize the need for individual judgment, effectively delegating to a process.

These different classifications can also be useful in thinking about how Program Officers relate to grantees. Program Officers can recommend grants based on being personally convinced of a particular project; recommend grants based primarily on the people involved, deferring heavily on the details of those people’s plans; or recommend grants based on processes that they set up to capture certain criteria.

This post discusses how I currently see the pros and cons of each, and what our current approach is. In large part, we find the people-based approach ideal for the kind of hits-based giving we’re focused on. But we use elements of project-based evaluation (and to a much lesser degree, process-based evaluation) as well – largely in order to help us better evaluate people over time.

    Projects, processes, and people

    In some ways, project-based approaches are the most intuitive, and probably the easiest method for a funder to feel confident in by default. Organizations propose specific activities to Program Officers, who try to identify the ones that will appeal to decision-makers and make the case for them. Thus, decision-makers delegate the process of searching for potential grants, but don’t delegate judgment and decision-making.

    I think the fundamental problem with project-based approaches is that the decision-maker generally has much less knowledge and context than the Program Officer (who in turn has much less than the organization they’re evaluating). This is a general problem with any kind of “top-down” decision-making process, but I think it is a particularly severe problem for philanthropy, because:

    • There’s often a very small number of people (a wealthy individual or family) who are trying to give away a large amount of money.
    • They often do not personally have extensive background knowledge for the causes they’re working in, and do not work full-time on any particular cause (and in many cases do not work full-time on philanthropy generally).
    • Decisions usually can’t be subject to any straightforward or quick performance measurement. It usually takes a good deal of subjective judgment to decide whether a grant is going well.

    From what I’ve seen, the result can often be that Program Officers recommend the giving opportunities they think they can easily justify, rather than the giving opportunities they personally think are best. This risks wasting much of the expertise and deep context Program Officers bring to the job. I think this is a major problem when trying to do hits-based giving, for reasons outlined previously.

    People-based approaches are the opposite in some sense. Grants are made based on trust in the people recommending them, rather than based on agreement with the specific activities proposed. This is “bottom-up” where project-based giving is “top-down”; it delegates judgments to individuals, where project-based giving doesn’t delegate judgment at all. I think the advantages here are fairly clear: the people with the most expertise and context are the ones who lead the decision-making. People-based approaches seem likely, to me, to achieve the best results when done well.

    However, I also see major challenges to people-based approaches:

    • Everything comes down to picking the right people. I generally consider it very hard to evaluate people, and don’t know of any reliable and reasonably quick way to do so. Our experiences recruiting have generally left us feeling that the only good way to evaluate someone is to work with them for an extended period of time, and we’ve heard similar sentiments when seeking advice from other organizations.
    • The more one defers to people, the more one is leaving the possibility open that they might make decisions arbitrarily, based on e.g. relationships rather than merit.
    • I also think there is at least some tension between taking people-based approaches – at least pure people-based approaches that focus on “general impressiveness” of people – and pursuing neglected causes. If one simply finds the most impressive people (in a general sense) and defers to them, one is likely to end up supporting people who have already had success and achieved influence, and one is likely to end up supporting fields that are already popular. I believe that we are bringing a fairly novel angle to philanthropy – focusing on important, neglected, tractable causes. This angle is novel enough that we don’t think we should be restricting ourselves to working with people who fully share it (this would greatly reduce the pool of people to choose from). But if we instead evaluate people only for general impressiveness, we risk losing much of this angle, and risk doing status-quo-biased giving.

    Process-based approaches are another approach to scaling understanding. Rather than try to evaluate each project, or defer to individuals’ judgment, the funder sets up processes that try to capture high-level criteria. For example, many National Institutes of Health (NIH) grants seek to optimize on criteria such as the significance of the work, the experience/training/track records of the investigator(s), the degree to which a project is innovative, etc. Process-based approaches often (as with the NIH) involve systematic aggregation of a large number of individual judgments, reducing reliance on any one individual’s judgment.

    I think the appeal of process-based approaches is that they can integrate expertise and deep context into decisions more reliably than project-based approaches (which rely on the judgment of the decision-maker(s)), while also avoiding the disadvantages listed for people-based approaches: difficulty of choosing people, risks of arbitrariness and conflicts of interest, difficulty maintaining fidelity to unusual angles on giving that the decision-maker(s) might have. However, I believe that process-based approaches bring their own problems:

    • Well-defined criteria and processes make it possible for potential grantees to “game the system” – coming up with grant proposals that are designed to get through the process rather than to propose and communicate the best work possible. And once gaming the system becomes possible, it may quickly become necessary in a competitive environment.
    • Process-based approaches tend (I believe) to be fairly slow, unpredictable and inconvenient from a grantee’s perspective. They make it hard to have honest, informative conversations with potential grantees about their likelihood of getting funded. They also tend to be rigid: unable to support work that’s very different from (but perhaps better than) what the process designers anticipated.
    • Process-based approaches tend to minimize the role of individual judgment, often by aggregating many judgments. But the more they do this, the less room they leave for the kind of high-risk, high-reward, creative, unusual decision-making we associate with hits-based giving.

    My current view is that process-based approaches can be excellent for funders seeking to minimize risk (of being perceived as unfair, of supporting low-quality work, of supporting work for the wrong reasons). Government funders often fit this description. But process-based approaches seem much less appealing for hits-based giving, unless they are very carefully designed by people with strong expertise and context with the specific goal of pursuing “hits.”

    Note that the above classifications are fairly simplified, and many funders’ decision-making processes have elements of more than one.

    Our current approach

    Our current approach is based on the idea that people-based giving is ideal for the kind of hits-based work we’re trying to do. Specifically, our ideal is to find people who make decisions as we would, if we had more expertise, context and time for decision-making. (Here “we” refers to myself and Cari Tuna, currently the people who sign off on Open Philanthropy Project grants.) We also encourage our Program Officers to take this attitude when seeking potential grantees, though they can ultimately choose whatever mix they want of project-, people- and process-based approaches for making recommendations to us.

    Our ideal can be pursued at different levels of breadth. We can set particular focus areas, then try to find Program Officers who make decisions as we would for each focus area. Program Officers can then try to find people who make decisions as they would for particular sub-areas of the focus area they work in; for example, after identifying corporate cage-free reforms as a promising sub-area of farm animal welfare, we sought to support a set of people working on cage-free reforms.

    The major challenge of this approach is determining which people we want to trust, and in what domains. We might find a particular person to be a great representative of our values in one area, but a poor representative in another. This is where a more project-based mentality comes in.

    • We try to understand a focus area well enough to have fairly detailed discussions of strategy during our hiring process (for more on what we look for, see this post).
    • Our grantmaking process involves Program Officers’ writing up the case for each grant and answering a number of key questions. We try to have enough general knowledge about the area they’re working in to evaluate the case they’re making at a high level. We ask critical questions and try to find ways in which we can learn from each other.

    A fairly common dynamic with new Program Officers has been that we know far less about their field than they do, and we often learn about the field when we question the parts of a grant writeup we find counterintuitive; at the same time, we often (at first) have better-developed views on philanthropy-specific topics, such as assessing room for more funding. Our goal over time is to reach increasing common understanding with Program Officers and increasingly defer to their judgment.

    One principle we’ve been experimenting with is a “50/40/10” rule:

    • We want to have fairly good understanding of, and high excitement about, at least 50% of a Program Officer’s portfolio.
    • We want another 40% of the portfolio to fit the description, “We can see how this might appear very exciting if we had more context, though we don’t feel personally convinced.”
    • We’re willing to defer entirely to the Program Officer on the remaining 10% of the portfolio, even if we can’t see the case for it at all (as long as the downside risks are manageable).

    The idea here is to try to stay synced up with Program Officers on enough of their work – using a “project-based” approach – to continually justify our confidence in them as decision-makers, while allowing a lot of leeway for them to use their own judgment and recommend grants that require deep expertise and context to appreciate.

    Similar principles apply to how we support and evaluate grantees. Even when the main reason we’re supporting an organization is as a bet on the people involved, we still find it helpful to have an outline of the projects they plan on. This helps us evaluate whether these people are aligned with our goals and whether our funds will help them do much more than they could have otherwise. But once we’ve determined that the proposed activities seem promising and sensible, we tend to provide support with no strings attached, in case plans change.

    Ultimately, I think that our work tends to look very “project-based” in a sense: we put a lot of effort into learning about our focus areas and we ask Program Officers a lot of questions about their recommended grants, and Program Officers in turn tend to ask potential grantees a lot of questions about the specifics of their plans. But the intent of this is more to spot-check our alignment and understanding than to comprehensively understand the grants. When a particular question is hard to resolve, we tend to defer to the people with the most expertise and context. We know we’ll never have the whole picture, and our goal is to understand enough of it to extrapolate the rest – by trusting the right people for the right purposes.

New Report on Early Field Growth

As part of our research on the history of philanthropy, I recently investigated several case studies of early field growth, especially those in which philanthropists purposely tried to grow the size and impact of a (typically) young and small field of research or advocacy.

The full report includes brief case studies of bioethics, cryonics, molecular nanotechnology, neoliberalism, the conservative legal movement, American geriatrics, American environmentalism, and animal advocacy. My key takeaways are:

  • Most of the “obvious” methods for building up a young field have been tried, and those methods often work. For example, when trying to build up a young field of academic research, it often works to fund workshops, conferences, fellowships, courses, professorships, centers, requests for proposals, etc. Or when trying to build up a new advocacy community, it often works to fund student clubs, local gatherings, popular media, etc.
  • Fields vary hugely along several dimensions, including (1) primary sources of funding (e.g. large philanthropists, many small donors, governments, companies), (2) whether engaged philanthropists were “active” or “passive” in their funding strategy, and (3) how much the growth of the field can be attributed to endogenous factors (e.g. explicit movement-building work) vs. exogenous factors (e.g. changing geopolitical conditions).


Besides these major takeaways, I also learned many more specific things about particular fields. For example:

  • The rise of bioethics seems to be a case study in the transfer of authority over a domain (medical ethics) from one group (doctors) to another (bioethicists), in large part due to the first group’s relative neglect of that domain. [More]
  • In the case of cryonics and molecular nanotechnology, plausibly growth-stunting adversarial dynamics arose between advocates of these young fields and the scientists in adjacent fields (cryobiology and chemistry, respectively). These adversarial dynamics seem to have arisen, in part, due to the young fields’ early focus on popular outreach prior to doing much scientific or technical work, and their disparagement of those in adjacent fields. [More]
  • The rise of neoliberalism is a victory for an explicit strategy of decades-long investment in the academic development and intellectual spread of a particular set of ideas, though this model may not work as well for ideas that don’t happen to benefit a naturally well-resourced set of funders, as neoliberal ideas did for large corporations and their wealthy owners. [More]
  • A small group of funders of the conservative legal movement managed to critique their own (joint) strategy, change course, and succeed as a result. [More]
  • The rise of the environmental and animal advocacy movements contrasts sharply with the cases above, both because they grew mostly via a large network of small funders rather than a small network of large funders, and because many of those movements’ activities do not materially benefit any funder or political actor (e.g. in the case of wilderness preservation or campaigns against factory farming). [More]

For more detail, see the full report.

Radical Empathy

One theme of our work is trying to help populations that many people don’t feel are worth helping at all. We’ve seen major opportunities to improve the welfare of factory-farmed animals, because so few others are trying to do it. When working on immigration reform, we’ve seen big debates about how immigration affects wages for people already in the U.S., and much less discussion of how it affects immigrants. Even our interest in global health and development is fairly unusual: many Americans may agree that charitable dollars go further overseas, but prefer to give domestically because they so strongly prioritize people in their own country compared to people in the rest of the world.[1]For example, according to data from Giving USA, only approximately 4% of US giving in 2015 was focused on international aid. (Reported by Charity Navigator here.)

The question, “Who deserves empathy and moral concern?” is central for us. We think it’s one of the most important questions for effective giving, and generally. Unfortunately, we don’t think we can trust conventional wisdom and intuition on the matter: history has too many cases where entire populations were dismissed, mistreated and deprived of basic rights for reasons that fit the conventional wisdom of the time but today look indefensible. Instead, we aspire to radical empathy: working hard to extend empathy to everyone it should be extended to, even when it is unusual or seems strange to do so.

To clarify the choice of terminology:

  • “Radical” is intended as the opposite of “traditional” or “conventional.” It doesn’t necessarily mean “extreme” or “all-inclusive”: we don’t extend empathy to everyone and everything (this would leave us essentially no basis for making decisions about morality). It refers to working hard to make the best choices we can, without anchoring to convention.
  • “Empathy” is intended to capture the idea that one could imagine oneself in another’s position, and recognizes the other as having experiences that are worthy of consideration. It is not intended to refer to literally feeling what another feels, and is therefore distinct from the “empathy” critiqued in Against Empathy (a book that acknowledges the multiple meanings of the term and explicitly focuses on one).

Conventional wisdom and intuition aren’t good enough

In The Expanding Circle, Peter Singer discusses how, over the course of history, “The circle of altruism has broadened from the family and tribe to the nation and race … to all human beings” (and adds that “The process should not stop there”).[2]Page 120.

By today’s standards, the earliest cases he describes are striking:

At first [the] insider/outsider distinction applied even between the citizens of neighboring Greek city-states; thus there is a tombstone of the mid-fifth century B.C. which reads:

This memorial is set over the body of a very good man. Pythion, from Megara, slew seven men and broke off seven spear points in their bodies . . . This man, who saved three Athenian regiments . . . having brought sorrow to no one among all men who dwell on earth, went down to the underworld felicitated in the eyes of all.

This is quite consistent with the comic way in which Aristophanes treats the starvation of the Greek enemies of the Athenians, starvation which resulted from the devastation the Athenians had themselves inflicted. Plato, however, suggested an advance on this morality: he argued that Greeks should not, in war, enslave other Greeks, lay waste their lands or raze their houses; they should do these things only to non-Greeks. These examples could be multiplied almost indefinitely. The ancient Assyrian kings boastfully recorded in stone how they had tortured their non-Assyrian enemies and covered the valleys and mountains with their corpses. Romans looked on barbarians as beings who could be captured like animals for use as slaves or made to entertain the crowds by killing each other in the Colosseum. In modern times Europeans have stopped treating each other in this way, but less than two hundred years ago some still regarded Africans as outside the bounds of ethics, and therefore a resource which should be harvested and put to useful work. Similarly Australian aborigines were, to many early settlers from England, a kind of pest, to be hunted and killed whenever they proved troublesome.[3]Pages 112-113.

The end of the quote transitions to more recent, familiar failures of morality. In recent centuries, extreme racism, sexism and other forms of bigotry – including slavery – have been practiced explicitly and without apology, and often widely accepted by the most respected people in society.

From today’s vantage point, these seem like extraordinarily shameful behaviors, and people who were early to reject them – such as early abolitionists and early feminists – look to have done extraordinary amounts of good. But at the time, looking to conventional wisdom and intuition wouldn’t necessarily have helped avoid the shameful behaviors or seek out the helpful ones.

Today’s norms seem superior in some respects. For example, racism is much more rarely explicitly advocated (which is not to say that it is rarely practiced). However, we think today’s norms are still fundamentally inadequate for the question of who deserves empathy and moral concern. One sign of this is the discourse in the U.S. around immigrants, which tends to avoid explicit racism but often to embrace nationalism – to exclude or downplay the rights and concerns of people who aren’t American citizens (and even more so, people who aren’t in the U.S. but would like to be).

Intellect vs. emotion

I sometimes hear the sentiment that moral atrocities tend to come from thinking of morality abstractly, losing sight of the basic emotional basis for empathy, and distancing oneself from the people one’s actions affect.

I think this is true in some cases, but importantly false in others. People living peaceful lives are often squeamish about violence, but it seems that this squeamishness can be overcome disturbingly quickly with experience. There are ample examples throughout history where large numbers of “conventional” people casually and even happily practiced direct cruelty and violence to those whose rights they didn’t recognize.[4]

Today, watching the casualness with which factory farm workers handle animals (as shown in this gruesome video), I doubt that people would eat much less meat if they had to kill animals themselves. I don’t think the key is whether people see and feel the consequences of their actions. More important is whether they recognize those their actions affect as fellow persons, meriting moral consideration.

On the flipside, there seems to be at least some precedent for using logical reasoning to reach moral conclusions that look strikingly prescient in retrospect. For example, see Wikipedia on Jeremy Bentham, who is known for basing his morality on the straightforward, quantitative logic of utilitarianism:

He advocated individual and economic freedom, the separation of church and state, freedom of expression, equal rights for women, the right to divorce, and the decriminalising of homosexual acts. [My note: he lived from 1748-1832, well before most of these views were common.] He called for the abolition of slavery, the abolition of the death penalty, and the abolition of physical punishment, including that of children. He has also become known in recent years as an early advocate of animal rights.

Aspiring to radical empathy

Who deserves empathy and moral concern? To the extent that we get this question wrong, we risk making atrocious choices. If we can get it right to an unusual degree, we might be able to do outsized amounts of good.

Unfortunately, we don’t think it is necessarily easy to get it right, and we’re far from confident that we are doing so. But here are a few principles we try to follow, in making our best attempt:

Acknowledge our uncertainty. For example, we’re quite unsure of where animals should fit into our moral framework. My own reflections and reasoning about philosophy of mind have, so far, seemed to indicate against the idea that e.g. chickens merit moral concern. And my intuitions value humans astronomically more. However, I don’t think either my reflections or my intuitions are highly reliable, especially given that many thoughtful people disagree. And if chickens do indeed merit moral concern, the amount and extent of their mistreatment is staggering. With worldview diversification in mind, I don’t want us to pass up the potentially considerable opportunities to improve their welfare.

I think the uncertainty we have on this point warrants putting significant resources into farm animal welfare, as well as working to generally avoid language that implies that only humans are morally relevant.[5]

That said, I don’t feel uncertain about all of our unusual choices. I’m confident that differences in geography, nationality, and race ought not affect moral concern, and our giving should reflect this.

Be extremely careful about too quickly dismissing “strange” arguments on this topic. Relatively small numbers of people argue that insects, and even some algorithms run on today’s computers, merit moral concern. It’s easy and intuitive to laugh these viewpoints off, since they seem so strange on their face and have such radical implications. But as argued above, I think we should be highly suspicious of our instincts to dismiss unusual viewpoints on who merits moral concern. And the stakes could certainly be high if these viewpoints turn out to be more reasonable than they appear at first.

So far I remain unconvinced that insects, or any algorithms run on today’s computers, are strong candidates for meriting moral concern. But I think it’s important to keep an open mind.

Explore the idea of supporting deeper analysis. Luke Muehlhauser is currently exploring the current state of research and argumentation on the question of who merits moral concern (which he calls the question of moral patienthood). It’s possible that if we identify gaps in the literature, and opportunities to become better informed, we’ll recommend funding further work. In the near future, work along these lines could affect our priorities within farm animal welfare – for example, it could affect how we prioritize work focused on improving treatment of fish. Ideally, our views on moral patienthood would be informed by an extensive literature drawing on as much deep reflection, empirical investigation and principled argumentation as possible.

Don’t limit ourselves to the “frontier.” Widely recognized problems still do a great deal of damage. In our work we often find ourselves focusing on unconventional targets for charitable giving, such as farm animal welfare and potential risks from advanced artificial intelligence. This is because we often find that opportunities to do disproportionate amounts of good are in areas that have been, in our view, relatively neglected by others. However, our goal is to do the most good we can, not to seek out and support those causes which are most “radical” in our present society. When we see great opportunities to play a role in addressing harms in more widely-acknowledged areas — for example, in the U.S. criminal justice system — we take them.

Footnotes

1 For example, according to data from Giving USA, only approximately 4% of US giving in 2015 was focused on international aid. (Reported by Charity Navigator here.)
2 Page 120.
3 Pages 112-113.
4 Many examples available in the first chapter of Better Angels of our Nature.
5 As a side note, it is often tricky to avoid such language. We generally use the term “persons” when we want to refer to beings that merit moral concern, without pre-judging whether such beings are human and also without causing too much distraction for casual readers. A more precise term is “moral patients.”

Worldview Diversification

In principle, we try to find the best giving opportunities by comparing many possibilities. However, many of the comparisons we’d like to make hinge on very debatable, uncertain questions.

For example:

  • Some people think that animals such as chickens have essentially no moral significance compared to that of humans; others think that they should be considered comparably important, or at least 1-10% as important. If you accept the latter view, farm animal welfare looks like an extraordinarily outstanding cause, potentially to the point of dominating other options: billions of chickens are treated incredibly cruelly each year on factory farms, and we estimate that corporate campaigns can spare over 200 hens from cage confinement for each dollar spent. But if you accept the former view, this work is arguably a poor use of money.
  • Some have argued that the majority of our impact will come via effects on the long-term future. If true, this could be an argument that reducing global catastrophic risks has overwhelming importance, or that accelerating scientific research does, or that improving the overall functioning of society via policy does. Given how difficult it is to make predictions about the long-term future, it’s very hard to compare work in any of these categories to evidence-backed interventions serving the global poor.
  • We have additional uncertainty over how we should resolve these sorts of uncertainty. We could try to quantify our uncertainties using probabilities (e.g. “There’s a 10% chance that I should value chickens 10% as much as humans”), and arrive at a kind of expected value calculation for each of many broad approaches to giving. But most of the parameters in such a calculation would be very poorly grounded and non-robust, and it’s unclear how to weigh calculations with that property. In addition, such a calculation would run into challenges around normative uncertainty (uncertainty about morality), and it’s quite unclear how to handle such challenges.

In this post, I’ll use “worldview” to refer to a set of highly debatable (and perhaps impossible to evaluate) beliefs that favor a certain kind of giving. One worldview might imply that evidence-backed charities serving the global poor are far more worthwhile than either of the types of giving discussed above; another might imply that farm animal welfare is; another might imply that global catastrophic risk reduction is. A given worldview represents a combination of views, sometimes very difficult to disentangle, such that uncertainty between worldviews is constituted by a mix of empirical uncertainty (uncertainty about facts), normative uncertainty (uncertainty about morality), and methodological uncertainty (e.g. uncertainty about how to handle uncertainty, as laid out in the third bullet point above). Some slightly more detailed descriptions of example worldviews are in a footnote.[1]

A challenge we face is that we consider multiple different worldviews plausible. We’re drawn to multiple giving opportunities that some would consider outstanding and others would consider relatively low-value. We have to decide how to weigh different worldviews, as we try to do as much good as possible with limited resources.

When deciding between worldviews, there is a case to be made for simply taking our best guess[2] and sticking with it. If we did this, we would focus exclusively on animal welfare, or on global catastrophic risks, or on global health and development, or on another category of giving, with no attention to the others. However, that’s not the approach we’re currently taking.

Instead, we’re practicing worldview diversification: putting significant resources behind each worldview that we find highly plausible. We think it’s possible for us to be a transformative funder in each of a number of different causes, and we don’t – as of today – want to pass up that opportunity by focusing exclusively on one cause and getting rapidly diminishing returns.

This post outlines the reasons we practice worldview diversification. In a nutshell:

  • I will first discuss the case against worldview diversification. When seeking to maximize expected positive impact, without being worried about the “risk” of doing no good, there is a case that we should simply put all available resources behind the worldview that our best-guess thinking favors.
  • I will then list several reasons for practicing worldview diversification, in situations where (a) we have high uncertainty and find multiple worldviews highly plausible; (b) there would be strongly diminishing returns if we put all our resources behind any one worldview.
    • First, under a set of basic assumptions including (a) and (b) above, worldview diversification can maximize expected value.
    • Second, if we imagined that different worldviews represented different fundamental values (not just different opinions, such that one would ultimately be “the right one” if we had perfect information), and that the people holding different values were trying to reach agreement on common principles behind a veil of ignorance (explained more below), it seems likely that they would agree to some form of worldview diversification as a desirable practice for anyone who ends up with outsized resources.
    • Practicing worldview diversification means developing staff capacity to work in many causes. This provides option value (the ability to adjust if our best-guess worldview changes over time). It also increases our long-run odds of having large effects on the general dialogue around philanthropy, since we can provide tangibly useful information to a larger set of donors.
    • There are a number of other practical benefits to working in a broad variety of causes, including the opportunity to use lessons learned in one area to improve our work in another; presenting an accurate public-facing picture of our values; and increasing the degree to which, over the long run, our expected impact matches our actual impact. (The latter could be beneficial for our own, and others’, ability to evaluate how we’re doing.)
  • Finally, I’ll briefly discuss the key conditions under which worldview diversification seems like a good idea, and give some rough notes on how we currently implement it in practice.

Note that worldview diversification is simply a broad term for putting significant resources behind multiple worldviews – it does not mean anything as specific as “divide resources evenly between worldviews.” This post discusses benefits of worldview diversification, without saying exactly how (or to what degree) one should allocate resources between worldviews. In the future, we hope to put more effort into reflecting on – and discussing – which specific worldviews we find most compelling and how we weigh them against each other.

Also note that this post focuses on deciding how to allocate resources between different plausible already-identified causes, not on the process for identifying promising causes.

The case against worldview diversification

It seems likely that if we had perfect information and perfect insight into our own values, we’d see that some worldviews are much better guides to giving than others. For a relatively clear example, consider GiveWell’s top charities vs. our work so far on farm animal welfare:

  • GiveWell estimates that its top charity (Against Malaria Foundation) can prevent the loss of one year of life for every $100 or so.
  • We’ve estimated that corporate campaigns can spare over 200 hens from cage confinement for each dollar spent. If we roughly imagine that each hen gains two years of 25%-improved life, this is equivalent to one hen-life-year for every $0.01 spent.
  • If you value chicken life-years equally to human life-years, this implies that corporate campaigns do about 10,000x as much good per dollar as top charities. If you believe that chickens do not suffer in a morally relevant way, this implies that corporate campaigns do no good.[3]
  • One could, of course, value chickens while valuing humans more. If one values humans 10-100x as much, this still implies that corporate campaigns are a far better use of funds (100-1,000x). If one values humans astronomically more, this implies that top charities are a far better use of funds. It seems unlikely that the ratio would be in the precise, narrow range needed for these two uses of funds to have similar cost-effectiveness. (The sketch after this list works through the arithmetic.)
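
To make the arithmetic in the bullets above concrete, here is a minimal sketch. The dollar figures are the rough estimates quoted above; the moral-weight parameter, and the specific way of converting “two years of 25%-improved life” into life-year equivalents, are illustrative assumptions rather than a formal model of ours.

```python
# Rough sketch of the cost-effectiveness comparison above. The figures are the
# rough estimates quoted in the bullets; "moral_weight" (the value of one chicken
# life-year relative to one human life-year) is the contested parameter one varies.

COST_PER_HUMAN_LIFE_YEAR = 100.0   # ~$100 per year of life via GiveWell's top charity
HENS_SPARED_PER_DOLLAR = 200       # hens spared from cage confinement per dollar
YEARS_PER_HEN = 2                  # assumed years of improved life per hen
IMPROVEMENT = 0.25                 # assumed size of the welfare improvement (25%)

# $1 buys 200 * 2 * 0.25 = 100 "hen-life-year equivalents", i.e. roughly $0.01 each.
cost_per_hen_life_year = 1.0 / (HENS_SPARED_PER_DOLLAR * YEARS_PER_HEN * IMPROVEMENT)

def campaigns_vs_top_charities(moral_weight):
    """Good per dollar from corporate campaigns relative to top charities,
    if a chicken life-year is worth `moral_weight` human life-years."""
    return (COST_PER_HUMAN_LIFE_YEAR / cost_per_hen_life_year) * moral_weight

for weight in (1.0, 0.1, 0.01, 0.0):
    print(f"moral weight {weight}: campaigns are "
          f"{campaigns_vs_top_charities(weight):,.0f}x top charities")
# Equal weight: ~10,000x.  Humans valued 10-100x more: 1,000x-100x.  Zero weight: 0x.
```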

I think similar considerations broadly apply to other comparisons, such as reducing global catastrophic risks vs. improving policy, though quantifying such causes is much more fraught.

One might therefore imagine that there is some “best worldview” (if we had perfect information) that can guide us to do far more good than any of the others. And if that’s right, one might argue that we should focus exclusively on a “best guess worldview”[4] in order to maximize how much good we do in expected value terms. For example, if we think that one worldview seems 10,000x better than the others, even a 1-10% chance of being right would still imply that we can do much more good by focusing on that worldview.

This argument presumes that we are “risk neutral”: that our goal is only to maximize the expected value of how much good we do. That is, it assumes we are comfortable with the “risk” that we make the wrong call, put all of our resources into a misguided worldview, and ultimately accomplish very little. Being risk neutral to such a degree often seems strange to people who are used to investing metaphors: investors rarely feel that the possibility of doubling one’s money fully compensates for the possibility of losing it all, and they generally use diversification to reduce the variance of their returns (they aren’t just focused on expected returns). However, we don’t have the same reasons to fear failure that for-profit investors do. There are no special outsized consequences for “failing to do any good,” as there are for going bankrupt, so it’s a risk we’re happy to take as long as it’s balanced by the possibility of doing a great deal of good. The Open Philanthropy Project aims to be risk neutral in the way laid out here, though there are some other reasons (discussed below) that putting all our eggs in one basket could be problematic.

The case for worldview diversification

I think the case for worldview diversification largely hinges on a couple of key factors:

Strong uncertainty about which worldviews are most reasonable. We recognize that any given worldview might turn out to look misguided if we had perfect information – but even beyond that, we believe that any given worldview might turn out to look misguided if we reflected more rationally on the information that is available. In other words, we feel there are multiple worldviews that each might qualify for “what we should consider the best worldview to be basing our giving on, and the worldview that conceptually maximizes our expected value, if we thought more intelligently about the matter.” We could imagine someday finding any of these worldviews to be the best-seeming one. We feel this way partly because we see intelligent, reasonable people who are aware of the arguments for each worldview and still reject it.

Some people recognize that their best-guess worldview might be wrong, but still think that it is clearly the best to bet on in expected-value terms. For example, some argue that focusing on the far future is best even if there is a >99% chance that the arguments in favor of doing so are misguided, because the value of focusing on the far future is so great if the arguments turn out to be valid. In effect, these people seem to be leaving open no realistic possibility of changing their minds on this front. We have a different kind of uncertainty, which I find difficult to model formally but which is probably something along the lines of cluster thinking. All things considered – including things like our uncertainty about our fundamental way of modeling expected value – I tend to think of the different plausible worldviews as being in the same ballpark of expected value.

Diminishing returns to putting resources behind any given worldview. When looking at a focus area such as farm animal welfare or potential risks from advanced AI, it seems to me that giving in the range of tens of millions of dollars per year (over the next decade or so) can likely fund the best opportunities, help relevant fields and disciplines grow, and greatly improve the chances that the cause pulls in other sources of funding (both private donors and governments). Giving much more than this would hit strongly diminishing returns. For causes like these, I might roughly quantify my intuition by saying that (at the relevant margin) giving 10x as much would only accomplish about 2x as much. (There are other causes where this dynamic does not apply nearly as much; for example, we don’t see much in the way of diminishing returns when it comes to supporting cash transfers to the global poor.)

With these two factors in mind, there are a number of arguments for worldview diversification.

Expected value

When accounting for strong uncertainty and diminishing returns, worldview diversification can maximize expected value even when one worldview looks “better” than the others in expectation. One way of putting this is that if we were choosing between 10 worldviews, and one were 5x as good as the other nine, investing all our resources in that one would – at the relevant margin, due to the “diminishing returns” point – be worse than spreading across the ten.[5]
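
The toy model below reproduces the arithmetic in footnote 5. The returns function is purely illustrative: the exponent is chosen only so that spending 10x as much buys roughly 2x as much good, matching the rough intuition described above, and is not a model we actually use.

```python
import math

# Toy diminishing-returns model: spending m (in multiples of $Y) in a worldview
# buys m ** ALPHA units of good, normalized so $Y buys 1 unit ("X" in the footnote).
# ALPHA is chosen so that 10x the spending buys ~2x the good (an illustrative assumption).
ALPHA = math.log10(2)  # ~0.301, so good(10) == 2 * good(1)

def good(multiple_of_Y):
    return multiple_of_Y ** ALPHA

# Ten worldviews; the "best" one looks 5x as good per unit of spending as the other nine.
values = [5.0] + [1.0] * 9

concentrated = values[0] * good(10)              # all $10Y in the best worldview: 5 * 2 = 10
diversified = sum(v * good(1) for v in values)   # $Y in each worldview: 5 + 9 = 14

print(f"concentrated: {concentrated:.1f}X, diversified: {diversified:.1f}X")
# Diversifying comes out ~1.4x better under these (highly stylized) assumptions.
```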

I think this dynamic is enhanced by the fact that there is so much we don’t know, and any given worldview could turn out to be much better or much worse than it appears for subtle and unanticipated reasons, including those related to flow-through effects.[6]

It isn’t clear to me how much sense it makes to think in these terms. Part of our uncertainty about worldviews is our uncertainty about moral values: to a significant degree, different worldviews might be incommensurate, in that there is no meaningful way to compare “good accomplished” between them. Some explicit frameworks have been proposed for dealing with uncertainty between incommensurate moral systems,[7] but we have significant uncertainty about how useful these frameworks are and how to use them.

Note that the argument in this section only holds for worldviews with reasonably similar overall expected value. If one believes that a particular worldview points to giving opportunities that are orders of magnitude better than others’, this likely outweighs the issue of diminishing returns.

The ethics of the “veil of ignorance”

Another case for worldview diversification derives from, in some sense, the opposite approach. Rather than thinking of different worldviews as different “guesses” at how to do the most good, such that each has an expected value and they are ultimately compared in the same terms, presume that different worldviews represent the perspectives of different people[8] with different, incommensurable values and frameworks. For example, it may be the case that some people care as deeply about animals as they do about people, while others don’t value animal welfare at all, and that no amount of learning or reflection would change any of this. When choosing between worldviews, we’re choosing which sorts of people we most identify and sympathize with, and we have strong uncertainty on the matter.

One way of thinking about the ethics of how people with different values should interact with each other is to consider a kind of veil of ignorance: imagine the agreements such people would come to about how they should use resources, if they were negotiating before knowing how many resources each of them would individually have available.[9] One such agreement might be: “If one of us ends up with access to vastly more resources than the others, that person should put some resources into the causes most important to each of us – up to some point of diminishing returns – rather than putting all the resources into that person’s own favorite cause.” Each person might accept (based on the diminishing returns model above) that if they end up with vastly more resources than the others, this agreement will end up making them worse off, but only by 50%; whereas if someone else ends up with vastly more resources, this agreement will end up making them far better off.

This is only a rough outline of what an appealing principle might look like. Additional details might be added, such as “The person with outsized resources should invest more in areas where they can be more transformative, e.g. in more neglected areas.”

We see multiple appealing worldviews that seem to have relatively few resources behind them, and we have the opportunity to have a transformative impact according to multiple such worldviews. Taking this opportunity is the ethical thing to do in the sense that it reflects an agreement we would have made under a “veil of ignorance,” and it means that we can improve the world greatly according to multiple different value sets that we feel uncertain between. I think that considering and putting weight on “veil of ignorance” based ethical concerns such as this one is a generally good heuristic for consequentialists and non-consequentialists alike, especially when one does not have a solid framework for comparing “expected good accomplished” across different options.

Capacity building and option value

Last year, we described our process of capacity building:

Our goals, and our efforts, have revolved around (a) selecting focus areas; (b) hiring people to lead our work in these areas (see our most recent update); (c) most recently, working intensively with new hires and trial hires on their early proposed grant recommendations.

Collectively, we think of these activities as capacity building. If we succeed, the end result will be an expanded team of people who are (a) working on well-chosen focus areas; (b) invested (justifiably) with a great deal of trust and autonomy; (c) capable of finding many great giving opportunities in the areas they’re working on.

In addition to building internal capacity (staff), we are hoping to support the growth of the fields we work in, and to gain knowledge over time that makes us more effective at working in each cause. Collectively, all of this is “capacity building” in the sense that it will, in the long run, improve our ability to give effectively at scale. There are a number of benefits to building capacity in a variety of causes that are appealing according to different worldviews (i.e., to building capacity in criminal justice reform, farm animal welfare, biosecurity and pandemic preparedness and more).

One benefit is option value. Over time, we expect that our thinking on which worldviews are most appealing will evolve. For example, I recently discussed three key issues I’ve changed my mind about over the last several years, with major implications for how promising I find different causes. It’s very possible that ten years from now, some particular worldview (and its associated causes) will look much stronger to us than the others – and that it won’t match our current best guess. If this happens, we’ll be glad to have invested in years of capacity building so we can quickly and significantly ramp up our support.

Another long-term benefit is that we can be useful to donors with diverse worldviews. If we worked exclusively in causes matching our “best guess” worldview, we’d primarily be useful to donors with the same best guess; if we do work corresponding to all of the worldviews we find highly compelling, we’ll be useful to any donor whose values and approach are broadly similar to ours. That’s a big difference: I believe there are many people with fundamentally similar values to ours, but different best guesses on some highly uncertain but fundamental questions – for example, how to value reducing global catastrophic risks vs. accelerating scientific research vs. improving policy.

With worldview diversification, we can hope to appeal to – and be referred to – any donor looking to maximize the positive impact of their giving. Over the long run, I think this means we have good prospects for making many connections via word-of-mouth, helping many donors give more effectively, and affecting the general dialogue around philanthropy.

Other benefits to worldview diversification

Worldview diversification means working on a variety of causes that differ noticeably from each other. There are a number of practical benefits to this.

We can use lessons learned in one area to improve our work in another. For example:

  • Some of the causes we work in are very neglected and “thin,” in the sense that there are few organizations working on them. Others were chosen for reasons other than neglectedness, and have many organizations working on them. Understanding the latter can give us a sense for what kinds of activities we might hope to eventually support in the former.
  • Some of the causes we work on involve very long-term goals with little in the way of intermediate feedback (this tends to be true of efforts to reduce global catastrophic risks). In other causes, we can more reasonably expect to see progress and learn from our results (for example, criminal justice reform, which we selected largely for its tractability).
  • Different causes have different cultures, and by working in a number of disparate ones, we work with a number of Program Officers whose different styles and approaches can inform each other.

It is easier for casual observers (such as the press) to understand our values and motivations. Some of the areas we work in are quite unconventional for philanthropy, and we’ve sometimes come across people who question our motivations. By working in a broad variety of causes, some of which are easier to see the case for than others, we make it easier for casual observers to discern the pattern behind our choices and get an accurate read on our core values. Since media coverage affects many people’s preconceptions, this benefit could make a substantial long-term difference to our brand and credibility.

Over the long run, our actual impact will better approximate our expected impact. Our hits-based giving approach means that in many cases, we’ll put substantial resources into a cause even though we think it’s more likely than not that we’ll fail to have any impact. (Potential risks from artificial intelligence is one such cause.) If we put all our resources behind our best-guess worldview, we might never have any successful grants even if we make intelligent, high-expected-value grants. Conversely, we might “get lucky” and appear far more reliably correct and successful than we actually are. In either case, our ability to realistically assess our own track record, and learn from it, is severely limited. Others’ ability to assess our work, in order to decide how much weight they should put on our views, is as well.

Worldview diversification lessens this problem, to a degree. If we eventually put substantial resources into ten very different causes, then we can reasonably hope to get one or more “hits” even if each cause is a long shot. If we get no “hits,” we have some evidence that we’re doing something wrong, and if we get one or more, this is likely to help our credibility.

We’re still ultimately making a relatively small number of “bets,” and there are common elements to the reasoning and approach we bring to each, so the benefit we get on this front is limited.

Morale and recruiting. Working in a variety of causes makes our organization a more interesting place to work. It means that our work remains exciting and motivating even as our views and our “best guesses” shift, and even when there is little progress on a particular cause for a long time. It means that our work resonates with more people, broadening the community we can engage with positively. This point wouldn’t be enough by itself to make the case for worldview diversification, but it is a factor in my mind, and I’d be remiss not to mention it.

When and how should one practice worldview diversification?

As discussed above, the case for worldview diversification relies heavily on two factors: (a) we have high uncertainty and find multiple worldviews highly plausible; (b) there would be strongly diminishing returns if we put all our resources behind any one worldview. Some of the secondary benefits discussed in the previous section are also specific to a public-facing organization with multiple staff. I think worldview diversification makes sense for relatively large funders, especially those with the opportunity to have a transformative impact according to multiple different highly appealing worldviews. I do not think it makes sense for an individual giving $100 or even $100,000 per year. I also do not think it makes sense for someone who is highly confident that one cause is far better than the rest.

We haven’t worked out much detail regarding the “how” of worldview diversification. In theory, one might be able to develop a formal approach that accounts for both the direct benefits of each potential grant and the myriad benefits of worldview diversification in order to arrive at conclusions about how much to allocate to each cause. One might also incorporate considerations like “I’m not sure whether worldviews A and B are commensurate or not; there’s an X% chance they are, in which case we should allocate one way, and a Y% chance they aren’t, in which case we should allocate another way.” But while we’ve discussed these sorts of issues, we haven’t yet come up with a detailed framework along these lines. Nor have we thoroughly reflected on, and explicitly noted, which specific worldviews we find most compelling and how we weigh them against each other.

We will likely put in more effort on this front in the coming year, though it won’t necessarily lead to a complete or satisfying account of our views and framework. For now, some very brief notes on our practices to date:

Currently, we tend to invest resources in each cause up to the point where it seems like there are strongly diminishing returns, or the point where it seems the returns are clearly worse than what we could achieve by reallocating the resources – whichever comes first. A bit more specifically:

  • In terms of staff capacity, so far it seems to me that there is a huge benefit to having one full-time staffer working on a given cause, supported by 1-3 other staff who spend enough time on the cause to provide informed feedback. Allocating additional staff beyond this seems generally likely to have rapidly diminishing returns, though we are taking a case-by-case approach and allocating additional staff to a cause when it seems like this could substantially improve our grantmaking.
  • In terms of money, so far we have tried to roughly benchmark potential grants against direct cash transfers; when it isn’t possible to make a comparison, we’ve often used heuristics such as “Does this grant seem reasonably likely to substantially strengthen an important aspect of the community of people/organizations working on this cause?” as a way to very roughly and intuitively locate the point of strongly diminishing returns. We tend to move forward with any grant that we understand the case for reasonably well and that seems – intuitively, heuristically – strong by the standards of its cause/associated worldview (and appears at least reasonably likely, given our high uncertainty, to be competitive with grants in other causes/worldviews, including cash transfers). For causes that seem particularly promising, and/or neglected (such that we can be particularly transformative in them), we use the lower bar of funding “reasonably strong” opportunities; for other causes, we tend more to look for “very strong” opportunities. This approach is far from ideal, but has the advantage that it is fairly easy to execute in practice, given that we currently have enough resources to move forward with all grants fitting these descriptions.

As noted above, we hope to put more thought into these issues in the coming year. Ideas for more principled, systematic ways of practicing worldview diversification would be very interesting to us.

Footnotes

1

  • One might fully accept total utilitarianism, plus the argument in Astronomical Waste, as well as some other premises, and believe that work on global catastrophic risks has far higher expected value than work on other causes.
  • One might accept total utilitarianism and the idea that the moral value of the far future overwhelms other considerations – but also believe that our impact on the far future is prohibitively hard to understand and predict, and that the right way to handle radical uncertainty about our impact is to instead focus on improving the world in measurable, robustly good ways. This view could be consistent with a number of different opinions about which causes are most worth working on.
  • One might put some credence in total utilitarianism and some credence in the idea that we have special duties to persons who live in today’s society, suffer unjustly, and can benefit tangibly and observably from our actions. Depending on how one handles the “normative uncertainty” between the two, this could lead to a variety of different conclusions about which causes to prioritize.

Any of the above could constitute a “worldview” as I’ve defined it. Views about the moral weight of animals vs. humans could additionally complicate the points above.

2 Specifically, our best guess about which worldview or combination of worldviews is most worth operating on in order to accomplish as much good as possible. This isn’t necessarily the same as which worldview is most likely to represent a set of maximally correct beliefs, values and approaches; it could be that a particular worldview is only 20% likely to represent a set of maximally correct beliefs, values, and approaches, but that if it does, following it would lead to >100x the positive impact of following any other worldview. If such a thing were true (and knowable), then this would be the best worldview to operate on.
3 (Bayesian adjustments should attenuate this difference to some degree, though it’s unclear how much, if you believe – as I do – that both estimates are fairly informed and reasonable though far from precise or reliable. I will put this consideration aside here.)
4 Specifically, our best guess about which worldview is most worth operating on in order to accomplish as much good as possible. This isn’t necessarily the same as which worldview is most likely to represent a set of maximally correct beliefs, values and approaches; it could be that a particular worldview is only 20% likely to represent a set of maximally correct beliefs, values, and approaches, but that if it does, following it would lead to >100x the positive impact of following any other worldview. If such a thing were true (and knowable), then this would be the best worldview to operate on.
5 Specifically, say X is the amount of good we could accomplish by investing $Y in any of the nine worldviews other than the “best” one, and imagine that $Y is around the point of diminishing returns where investing 10x as much only accomplishes 2x as much good. This would then imply that putting $Y into each of the ten worldviews would have good accomplished equal to 14X (5X for the “best” one, X for each of the other nine), while putting $10*Y into the “best” worldview would have good accomplished equal to 10X. So the diversified approach is about 1.4x as good by these assumptions.
6 For example, say we return to the above hypothetical (see previous footnote) but also imagine that our estimates of the worldviews’ value include some mistakes, such that an unknown one of the ten actually has 1000X value and another unknown one actually has 0 value at the relevant margin. (The diminishing returns continue to work the same way.) Then putting $Y into each of the ten worldviews would have good accomplished equal to at least 1008X while putting $10*Y into the “best” worldview would have good accomplished equal to about 208X (the latter is 2*(10%*1000X + 10%*0X + 80%*5X)). While in the previous case the diversified approach looked about 1.4X as good, here it looks nearly 5x as good.
7 For example, see MacAskill 2014.
8 I’m using the term “people” for simplicity, though in theory I could imagine extending the analysis in this section to the value systems of animals etc.
9 I recognize that this setup has some differences with the well-known “veil of ignorance” proposed by Rawls, but still think it is useful for conveying intuitions in this case.

Efforts to Improve the Accuracy of Our Judgments and Forecasts

Our grantmaking decisions rely crucially on our uncertain, subjective judgments — about the quality of some body of evidence, about the capabilities of our grantees, about what will happen if we make a certain grant, about what will happen if we don’t make that grant, and so on.

In some cases, we need to make judgments about relatively tangible outcomes in the relatively near future, as when we have supported campaigning work for criminal justice reform. In others, our work relies on speculative forecasts about the much longer term, as for example with potential risks from advanced artificial intelligence. We often try to quantify our judgments in the form of probabilities — for example, the former link estimates a 20% chance of success for a particular campaign, while the latter estimates a 10% chance that a particular sort of technology will be developed in the next 20 years.

We think it’s important to improve the accuracy of our judgments and forecasts if we can. I’ve been working on a project to explore whether there is good research on the general question of how to make good and accurate forecasts, and/or specialists in this topic who might help us do so. Some preliminary thoughts follow.

In brief:

  • There is a relatively thin literature on the science of forecasting.1 It seems to me that its findings so far are substantive and helpful, and that more research in this area could be promising.
  • This literature recommends a small set of “best practices” for making accurate forecasts that we are thinking about how to incorporate into our process. It seems to me that these “best practices” are likely to be useful, and surprisingly uncommon given that.
  • In one case, we are contracting to build a simple online application for credence calibration training: training the user to accurately determine how confident they should be in an opinion, and to express this confidence in a consistent and quantified way. I consider this a very useful skill across a wide variety of domains, and one that (it seems) can be learned with just a few hours of training. (Update: This calibration training app is now available.)

I first discuss the last of these points (credence calibration training), since I think it is a good introduction to the kinds of tangible things one can do to improve forecasting ability.

1. Calibration training

An important component of accuracy is called “calibration.” If you are “well-calibrated,” what that means is that statements (including predictions) you make with 30% confidence are true about 30% of the time, statements you make with 70% confidence are true about 70% of the time, and so on.

Without training, most people are not well-calibrated, but instead overconfident. Statements they make with 90% confidence might be true only 70% of the time, and statements they make with 75% confidence might be true only 60% of the time.2 But it is possible to “practice” calibration by assigning probabilities to factual statements, then checking whether the statements are true, and tracking one’s performance over time. In a few hours, one can practice on hundreds of questions and discover patterns like “When I’m 80% confident, I’m right only 65% of the time; maybe I should adjust so that I report 65% for the level of internally-experienced confidence I previously associated with 80%.”
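As an illustration of what this practice-and-track loop might look like, here is a minimal sketch that tabulates hit rates by stated confidence level from a log of confidence-tagged answers. The data and the choice of confidence buckets are invented purely for the example.

```python
from collections import defaultdict

# Each record: (stated confidence that a statement is true, whether it turned out true).
# The data here are invented purely for illustration.
log = [
    (0.9, True), (0.9, False), (0.9, True), (0.9, True),
    (0.8, True), (0.8, False), (0.8, False), (0.8, True),
    (0.6, True), (0.6, False), (0.6, False), (0.6, True),
]

by_confidence = defaultdict(list)
for confidence, correct in log:
    by_confidence[confidence].append(correct)

for confidence in sorted(by_confidence, reverse=True):
    outcomes = by_confidence[confidence]
    hit_rate = sum(outcomes) / len(outcomes)
    print(f"stated {confidence:.0%} -> right {hit_rate:.0%} of the time "
          f"({len(outcomes)} statements)")

# A well-calibrated forecaster's hit rates track the stated confidences; a gap like
# "stated 80%, right 50% of the time" is the overconfidence pattern described above.
```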

I recently attended a calibration training webinar run by Hubbard Decision Research, which was essentially an abbreviated version of the classic calibration training exercise described in Lichtenstein & Fischhoff (1980). It was also attended by two participants from other organizations, who did not seem to be familiar with the idea of calibration and, as expected, were grossly overconfident on the first set of questions.3 But, as the training continued, their scores on the question sets began to improve until, on the final question set, they both achieved perfect calibration.

For me, this was somewhat inspiring to watch. It isn’t often the case that a cognitive skill as useful and domain-general as probability calibration can be trained, with such objectively-measured dramatic improvements, in so short a time.

The research I’ve reviewed broadly supports this impression. For example:

  • Rieber (2004) lists “training for calibration feedback” as his first recommendation for improving calibration, and summarizes a number of studies indicating both short- and long-term improvements on calibration.4 In particular, decades ago, Royal Dutch Shell began to provide calibration training for their geologists, who are now (reportedly) quite well-calibrated when forecasting which sites will produce oil.5
  • Since 2001, Hubbard Decision Research has trained over 1,000 people across a variety of industries. Analyzing the data from these participants, Doug Hubbard reports that 80% of people achieve perfect calibration (on trivia questions) after just a few hours of training. He also claims that, according to his data and at least one controlled (but not randomized) trial, this training predicts subsequent real-world forecasting success.6

I should note that calibration isn’t sufficient by itself for good forecasting. For example, you can be well-calibrated on a set of true/false statements, for which about half the statements happen to be true, simply by responding “True, with 50% confidence” to every statement. This performance would be well-calibrated but not very informative. Ideally, an expert would assign high confidence to statements that are likely to be true, and low confidence to statements that are unlikely to be true. An expert that can do so is not just well-calibrated, but also exhibits good “resolution” (sometimes called “discrimination”). If we combine calibration and resolution, we arrive at a measure of accuracy called a “proper scoring rule.”7 The calibration trainings described above sometimes involve proper scoring rules, and likely train people to be well-calibrated while exhibiting at least some resolution, though the main benefit they seem to have (based on the research and my observations) pertains to calibration specifically.
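
For concreteness, here is a minimal sketch of one standard proper scoring rule, the Brier score, together with its well-known Murphy decomposition into a calibration (“reliability”) term and a resolution term. The post above doesn’t specify which scoring rule the trainings use, and the forecasts below are invented for illustration.

```python
from collections import defaultdict

# Each record: (forecast probability that an event happens, outcome as 1 or 0).
# The data here are invented purely for illustration.
forecasts = [(0.9, 1), (0.9, 1), (0.9, 0), (0.7, 1), (0.7, 0),
             (0.3, 0), (0.3, 0), (0.3, 1), (0.1, 0), (0.1, 0)]

n = len(forecasts)
brier = sum((p - outcome) ** 2 for p, outcome in forecasts) / n  # lower is better

# Murphy decomposition: Brier = reliability - resolution + uncertainty, where
# reliability measures (mis)calibration (lower is better), resolution measures
# discrimination (higher is better), and uncertainty depends only on the base rate.
base_rate = sum(outcome for _, outcome in forecasts) / n
groups = defaultdict(list)
for p, outcome in forecasts:
    groups[p].append(outcome)

reliability = sum(len(o) * (p - sum(o) / len(o)) ** 2 for p, o in groups.items()) / n
resolution = sum(len(o) * (sum(o) / len(o) - base_rate) ** 2 for o in groups.values()) / n
uncertainty = base_rate * (1 - base_rate)

print(f"Brier {brier:.3f} = reliability {reliability:.3f} "
      f"- resolution {resolution:.3f} + uncertainty {uncertainty:.3f}")
```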

The primary source of my earlier training in calibration was a game intended to automate the process. The Open Philanthropy Project is now working with developers to create a more extensive calibration training game for training our staff; we will also make the game available publicly.

2. Further advice for improving judgment accuracy

Below I list some common advice for improving judgment and forecasting accuracy (in the absence of strong causal models or much statistical data) that has at least some support in the academic literature, and which I find intuitively likely to be helpful.8

  1. Train probabilistic reasoning: In one especially compelling study (Chang et al. 2016), a single hour of training in probabilistic reasoning noticeably improved forecasting accuracy.9 Similar training has improved judgmental accuracy in some earlier studies,10 and is sometimes included in calibration training.11
  2. Incentivize accuracy: In many domains, incentives for accuracy are overwhelmed by stronger incentives for other things, such as incentives for appearing confident, being entertaining, or signaling group loyalty. Some studies suggest that accuracy can be improved merely by providing sufficiently strong incentives for accuracy such as money or the approval of peers.12
  3. Think of alternatives: Some studies suggest that judgmental accuracy can be improved by prompting subjects to consider alternate hypotheses.13
  4. Decompose the problem: Another common recommendation is to break each problem into easier-to-estimate sub-problems.14
  5. Combine multiple judgments: Often, a weighted (and sometimes “extremized”15) combination of multiple subjects’ judgments outperforms the judgments of any one person.16 (A minimal sketch of one such combination rule appears after this list.)
  6. Correlates of judgmental accuracy: According to some of the most compelling studies on forecasting accuracy I’ve seen,17 correlates of good forecasting ability include “thinking like a fox” (i.e. eschewing grand theories for attention to lots of messy details), strong domain knowledge, general cognitive ability, and high scores on “need for cognition,” “actively open-minded thinking,” and “cognitive reflection” scales.
  7. Prediction markets: I’ve seen it argued, and find it intuitive, that an organization might improve forecasting accuracy by using prediction markets. I haven’t studied the performance of prediction markets yet.
  8. Learn a lot about the phenomena you want to forecast: This one probably sounds obvious, but I think it’s important to flag, to avoid leaving the impression that forecasting ability is more cross-domain/generalizable than it is. Several studies suggest that accuracy can be boosted by having (or acquiring) domain expertise. A commonly-held hypothesis, which I find intuitively plausible, is that calibration training is especially helpful for improving calibration, and that domain expertise is helpful for improving resolution.18
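
As a sketch of what a weighted, extremized combination might look like: averaging in log-odds space and then pushing the aggregate away from 50% is a common approach in the forecast-aggregation literature, but the particular functional form and the value a = 2.5 below are illustrative assumptions, not something taken from the studies cited above.

```python
import math

def combine_extremized(probabilities, weights=None, a=2.5):
    """Combine several forecasters' probabilities for the same binary event.

    Takes a weighted average in log-odds space, then "extremizes" by multiplying
    the combined log-odds by a > 1, pushing the aggregate away from 50%.
    Assumes every probability is strictly between 0 and 1.
    """
    if weights is None:
        weights = [1.0] * len(probabilities)
    total = sum(weights)
    mean_log_odds = sum(
        w * math.log(p / (1 - p)) for w, p in zip(weights, probabilities)
    ) / total
    return 1 / (1 + math.exp(-a * mean_log_odds))

# Three forecasters lean the same way; the extremized aggregate leans further.
print(combine_extremized([0.65, 0.70, 0.75]))  # ~0.89, vs. a simple average of 0.70
```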

Another interesting takeaway from the forecasting literature is the degree to which – and consistency with which – some experts exhibit better accuracy than others. For example, tournament-level bridge players tend to show reliably good accuracy, whereas TV pundits, political scientists, and professional futurists seem not to.19 A famous recent result in comparative real-world accuracy comes from a series of IARPA forecasting tournaments, in which ordinary people competed with each other and with professional intelligence analysts (who also had access to expensively-collected classified information) to forecast geopolitical events. As reported in Tetlock & Gardner’s Superforecasting, forecasts made by combining (in a certain way) the forecasts of the best-performing ordinary people were (repeatedly) more accurate than those of the trained intelligence analysts.

3. How commonly do people seek to improve the accuracy of their subjective judgments?

Certainly many organizations, from financial institutions (e.g. see Fabozzi 2012) to sports teams (e.g. see Moneyball), use sophisticated quantitative models to improve the accuracy of their estimates. But the question I’m asking here is: In the absence of strong models and/or good data, when decision-makers must rely almost entirely on human subjective judgment, how common is it for those decision-makers to explicitly invest substantial effort into improving the (objectively-measured) accuracy of those subjective judgments?

Overall, my impression is that the answer to this question is “Somewhat rarely, in most industries, even though the techniques listed above are well-known to experts in judgment and forecasting accuracy.”

Why do I think that? It’s difficult to get good evidence on this question, but I provide some data points in a footnote.20

4. Ideas we’re exploring to improve accuracy for GiveWell and Open Philanthropy Project staff

Below is a list of activities, aimed at improving the accuracy of our judgments and forecasts, that are either ongoing, under development, or under consideration at GiveWell and the Open Philanthropy Project:

  • As noted above, we have contracted a team of software developers to create a calibration training web/phone application for staff and public use. (Update: This calibration training app is now available.)
  • We encourage staff to participate in prediction markets and forecasting tournaments such as PredictIt and Good Judgment Open, and some staff do so.
  • Both the Open Philanthropy Project and GiveWell recently began to make probabilistic forecasts about our grants. For the Open Philanthropy Project, see e.g. our forecasts about recent grants to Philip Tetlock and CIWF. For GiveWell, see e.g. forecasts about recent grants to Evidence Action and IPA. We also make and track some additional grant-related forecasts privately. The idea here is to be able to measure our accuracy later, as those predictions come true or are falsified, and perhaps to improve our accuracy from past experience. So far, we are simply encouraging predictions without putting much effort into ensuring their later measurability.
  • We’re going to experiment with some forecasting sessions led by an experienced “forecast facilitator” – someone who helps elicit forecasts from people about the work they’re doing, in a way that tries to be as informative and helpful as possible. This might improve the forecasts mentioned in the previous bullet point.

I’m currently the main person responsible for improving forecasting at the Open Philanthropy Project, and I’d be very interested in further ideas for what we could do.

Three Key Issues I’ve Changed My Mind About

Philanthropy – especially hits-based philanthropy – is driven by a large number of judgment calls. At the Open Philanthropy Project, we’ve explicitly designed our process to put major weight on the views of individual leaders and program officers in decisions about the strategies we pursue, causes we prioritize, and grants we ultimately make. As such, we think it’s helpful for individual staff members to discuss major ways in which our personal thinking has changed, not only about particular causes and grants, but also about our background worldviews.

I recently wrote up a relatively detailed discussion of how my personal thinking has changed about three interrelated topics: (1) the importance of potential risks from advanced artificial intelligence, particularly the value alignment problem; (2) the potential of many of the ideas and people associated with the effective altruism community; (3) the properties to look for when assessing an idea or intervention, and in particular how much weight to put on metrics and “feedback loops” compared to other properties. My views on these subjects have changed fairly dramatically over the past several years, contributing to a significant shift in how we approach them as an organization.

I’ve posted my full writeup as a personal Google doc. A summary follows.

1. Changing my mind about potential risks from advanced artificial intelligence

I first encountered the idea of potential risks from advanced artificial intelligence – and in particular, the value alignment problem – in 2007. There were aspects of this idea I found intriguing, and aspects I felt didn’t make sense. The most important question, in my mind, was “Why are there no (or few) people with relevant-seeming expertise who seem concerned about the value alignment problem?”

I initially guessed that relevant experts had strong reasons for being unconcerned, and were simply not bothering to engage with people who argued for the importance of the risks in question. I believed that the tool-agent distinction was a strong candidate for such a reason. But as I got to know the AI and machine learning communities better, saw how Superintelligence was received, heard reports from the Future of Life Institute’s safety conference in Puerto Rico, and updated on a variety of other fronts, I changed my view.

I now believe that there simply is no mainstream academic or other field (as of today) that can be considered to be “the locus of relevant expertise” regarding potential risks from advanced AI. These risks involve a combination of technical and social considerations that don’t pertain directly to any recognizable near-term problems in the world, and aren’t naturally relevant to any particular branch of computer science. This is a major update for me: I’ve been very surprised that an issue so potentially important has, to date, commanded so little attention – and that the attention it has received has been significantly (though not exclusively) due to people in the effective altruism community.

More detail on this topic

2. Changing my mind about the effective altruism (EA) community

I’ve had a longstanding interest in the effective altruism community. I identify as part of this community, and I share some core values with it (in particular, the goal of doing as much good as possible). However, for a long time, I placed very limited weight on the views of a particular subset1 of the people I encountered through this community. This was largely because they seemed to have a tendency toward reaching very unusual conclusions based on seemingly simple logic unaccompanied by deep investigation. I had the impression that they tended to be far more willing than I was to “accept extraordinary claims without extraordinary evidence” in some sense, a topic I’ve written about several times (here, here and here).

A number of things have changed.

  • Potential risks from advanced AI, discussed above, is one topic I’ve changed my mind about: I previously saw this as a strange preoccupation of the EA community, and now see it as a major case where the community was early to highlight an important issue.
  • More generally, I’ve seen the outputs from a good amount of cause selection work at the Open Philanthropy Project. I now believe that most of the causes I’ve seen the most excitement about in the effective altruism community are outstanding by our criteria of importance, neglectedness, and tractability. These causes include farm animal welfare, and biosecurity and pandemic preparedness, in addition to potential risks from advanced artificial intelligence. They aren’t the only outstanding causes we’ve identified, but overall, I’ve increased my estimate of how well excitement from the effective altruism community predicts what I will find promising after more investigation.
  • I’ve seen EA-focused organizations make progress on galvanizing interest in effective altruism and growing the community. I’ve seen some effects of this directly, including more attention, donors, and strong employee candidates for GiveWell and the Open Philanthropy Project.
  • I’ve gotten to know some community members better generally, and my views on some general topics (below) have changed in ways that have somewhat reduced my skepticism of the kinds of ideas effective altruists pursue.

I now feel the EA community contains the closest thing the Open Philanthropy Project has to a natural “peer group” – a set of people who consistently share our basic goal (doing as much good as possible), and therefore have the potential to help with that goal in a wide variety of ways, including both collaboration and critique. I also value other sorts of collaboration and critique, including from people who question the entire premise of doing as much good as possible, and can bring insights and abilities that we lack. But people who share our basic premises have a unique sort of usefulness as both collaborators and critics, and I’ve come to feel that the effective altruism community is the most logical place to find such people.

This isn’t to say I support the effective altruism community unreservedly; I have concerns and objections regarding many ideas associated with it and some of the specific people and organizations within it. But I’ve become more positive compared to my early impressions.

More detail on this topic

3. Changing my mind about general properties of promising ideas and interventions

Of the topics discussed here, this is the one for which the evolution of my thinking is hardest to trace, and the hardest to summarize.

I used to think one should be pessimistic about any intervention or idea that doesn’t involve helpful “feedback loops” (trying something, seeing how it goes, making small adjustments, and trying again many times) or useful selective processes (where many people try different ideas and interventions, and the ones that are successful in some tangible way become more prominent, powerful, and imitated). I was highly skeptical of attempts to make predictions and improve the world based primarily on logic and reflection, when unaccompanied by strong feedback loops and selective processes.

I still think these things (feedback loops, selective processes) are very powerful and desirable; that we should be more careful about interventions that don’t involve them; that there is a strong case for preferring charities (such as GiveWell’s top charities) that are relatively stronger in terms of these properties; and that much of the effective altruism community, including the people I’ve been most impressed by, continues to underweight these considerations. However, I have moderated significantly in my view. I now see a reasonable degree of hope for having strong positive impact while lacking these things, particularly when using logical, empirical, and scientific reasoning.

Learning about the history of philanthropy – and learning more about history more broadly – has been a major factor in changing my mind. I’ve come across many cases where a philanthropist, or someone else, seems to have had remarkable prescience and/or impact primarily through reasoning and reflection. Even accounting for survivorship bias, my impression is that these cases are frequent and major enough that it is worth trying to emulate this sort of impact. This change in viewpoint has both influenced and been influenced by the two topics discussed above.

More detail on this topic

4. Conclusion

Over the last several years, I have become more positive on the cause of potential risks from advanced AI, on the effective altruism community, and on the general prospects for changing the world through relatively speculative, long-term projects grounded largely in intellectual reasoning (sometimes including reasoning that leads to “wacky” ideas) rather than direct feedback mechanisms. These changes in my thinking have been driven by a number of factors, including by each other.

One cross-cutting theme is that I’ve become more interested in arguments with the general profile of “simple, logical argument with no clear flaws; has surprising and unusual implications; produces reflexive dissent and discomfort in many people.” I previously was very suspicious of arguments like this, and expected them not to hold up on investigation. However, I now think that arguments of this form are generally worth paying serious attention to until and unless flaws are uncovered, because they often represent positive innovations.

The changes discussed here have caused me to shift from being a skeptic of supporting work on potential risks from advanced AI and effective altruism organizations to being an advocate, which in turn has been a major factor in the Open Philanthropy Project’s taking on work in these areas. As discussed at the top of this post, I believe that sort of relationship between personal views and institutional priorities is appropriate given the work we’re doing.

I’m not certain that I’ve been correct to change my mind in the ways described here, and I still have a good deal of sympathy for people whose current views are closer to my former ones. But hopefully I have given a sense of where the changes have come from.

More detail is available here:

Some Key Ways in Which I’ve Changed My Mind Over the Last Several Years