## New Shallow Investigations: Telecommunications and Civil Conflict Reduction

We recently published two shallow investigations on potential focus areas on the Effective Altruism Forum. Shallow investigations, which are part of our cause selection process, are mainly intended as quick writeups for internal audiences and aren’t optimized for public consumption. However, we’re sharing these two publicly in case others find them useful.

The default outcome for shallow investigations is that we do not move forward to a deeper investigation or grantmaking, though we investigate further when results are particularly promising.

If you have thoughts or questions on either of these investigations, please use this feedback form or leave a comment on the EA Forum.

## Telecommunications in Low and Middle-Income Countries (LMICs)

By Research Fellow Lauren Gilbert (EA Forum link)

• Lauren finds that expanding cellular phone and internet access appears to cost-effectively increase incomes. Randomized trials and quasi-experimental studies in LMICs showed that gaining internet access led to substantial increases in income, with high social returns on investment.
• We find these reported effects surprisingly large, and are continuing to dig into them more.
• Lauren estimates that 3-9% of the world’s population do not have access to cellular service, and ~40% of the world’s population either have no access to mobile internet or do not use it. Lauren finds that the biggest barrier to usage is the cost of devices and coverage. These coverage gaps and costs are shrinking over time.
• A large majority of spending on telecommunications is private/commercial, with a smaller amount of philanthropic spending. While the private investments are large, they aren’t as focused as a philanthropist might be on improving access for poor and rural communities.
• Philanthropists could potentially help improve access by subsidizing investments in cell phone towers to improve coverage, and in internet cables to reduce the cost of internet. Lauren’s rough back-of-the-envelope calculation suggests that these investments may be cost-effective. A funder could also potentially lobby for policy changes to reduce costs — for example, reducing tariffs on imported electronics or changing the rules around how spectrum can be licensed.

## Civil Conflict Reduction

Also by Lauren Gilbert (EA Forum link)

• Civil conflict is a very important problem. Lauren estimates that civil wars directly and indirectly cause the loss of around half as many disability-adjusted life years as malaria and neglected tropical diseases combined. Civil wars also substantially impede economic growth, mostly in countries that are already very poor.
• While civil conflict is important and arguably neglected, it isn’t clear how tractable it is. However, some interventions have shown promise.
• Lauren finds some evidence that UN peacekeeping missions are effective, and argues philanthropists could lobby for more funding.
• Some micro-level interventions, such as mediation or cognitive behavioral therapy, also have suggestive empirical evidence behind them. Philanthropists could fund more research into these interventions.

## Incoming Program Officer for Effective Altruism Community Building (Global Health and Wellbeing): James Snowden

Earlier this year, I wrote that Open Philanthropy was looking for someone to help us direct funding for our newest cause area:

We are searching for a program officer to help us launch a new grantmaking program. The program would support projects and organizations in the effective altruism community (EA) with a focus on improving global health and wellbeing (GHW) […] We’re looking to hire someone who is very familiar with the EA community, has ideas about how to grow and develop it, and is passionate about supporting projects in global health and wellbeing.

Today, I’m excited to announce that we’ve hired a Program Officer who exemplifies these qualities: James Snowden.

James spent his last 5+ years as a researcher and program officer at GiveWell; in the latter role, he led GiveWell’s work on policy and advocacy. Before GiveWell, he worked at the Centre for Effective Altruism and as a strategy consultant. He holds a B.A. in philosophy, politics and economics from Oxford University and an M.Sc. in philosophy and economics from the London School of Economics.

You can hear a sample of how James thinks in this podcast interview, and read some of his work on GiveWell’s blog.

## What James might work on

We’ve already made a few grants to EA projects that were highly promising from a GHW perspective, including Charity Entrepreneurship (supporting the creation of new animal welfare charities) and Founders Pledge (increasing donations from entrepreneurs to outstanding charities).

Areas of potential interest include:

• Increasing engagement with effective altruism in a broader range of countries
• Encouraging charitable donations to effective organizations working in GHW-related areas
• Incubating new ideas for highly impactful charities
• Creating resources to facilitate impactful career decisions within GHW

We still have a lot of growth ahead of us and will be expanding to start more programs in the coming months and years — check out our jobs page if you’re interested in helping drive that growth!

## Report on Social Returns to Productivity Growth

Historically, economic growth has had huge social benefits, lifting billions out of poverty and improving health outcomes around the world. This leads some to argue that accelerating economic growth, or at least productivity growth,[1] should be a major philanthropic and social priority going forward.[2]

I’ve written a report in which I evaluate this view in order to inform Open Philanthropy’s Global Health and Wellbeing (GHW) grantmaking. Specifically, I use a relatively simple model to estimate the social returns to directly funding research and development (R&D). I focus on R&D spending because it seems like a particularly promising way to accelerate productivity growth, but I think broadly similar conclusions would apply to other innovative activities.

My estimate, which draws heavily on the methodology of Jones and Summers (2020), asks two primary questions:

1. How much would a little bit of extra R&D today increase people’s incomes into the future, holding fixed the amount of R&D conducted at later times?[3]
2. How much welfare is produced by this increase in income?

In brief, I find that:

• The social returns to marginal R&D are high, but typically not as high as the returns in other areas we’re interested in. Measured in our units of impact (where “1x” is giving cash to someone earning $50k/year) I estimate that the cost-effectiveness of funding R&D is 45x. This is ~4% as impactful as the (roughly 1,000x) GHW bar for funding.
• Put another way, I estimate that $20 billion to “average” R&D has the same welfare benefit as increasing the incomes of 180 million people by 10% each for one year.
• That said, the best R&D projects might have much higher returns. So could projects aimed at increasing the amount of R&D (for example, improving science policy).
• This estimate is very rough, and I could readily imagine it being off by a factor of 2-3 in either direction, even before accounting for the limitations below.
• Returns to R&D were plausibly much higher in the past. This is because R&D was much more neglected, and because of feedback loops where R&D increased the amount of R&D occurring at later times.
• My estimate has many important limitations. For example, it omits potential downsides to R&D (e.g. increasing global catastrophic risks), and it focuses on a specific scenario in which historical rates of return to R&D continue to apply even as population growth stagnates.
• Alternative scenarios might change the bottom line. For instance, R&D today might speed up the development of some future technology that drastically accelerates R&D progress. This would significantly increase the returns to R&D, but in my view would also strengthen the case for Open Phil to focus on reducing risks from that technology rather than accelerating its development.
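
The two headline figures above (45x, and $20 billion being equivalent to a 10% income boost for 180 million people) can be cross-checked with a few lines of arithmetic. The sketch below is not taken from the report: it assumes welfare is logarithmic in income, so that a small gain of d dollars to someone earning y is worth roughly d / y welfare units, and it treats a 10% income rise as worth 0.10 units per person. The function names are hypothetical.

```python
# Assumption (not stated in the summary above): log utility of income,
# approximated to first order, so $d to someone earning $y is worth d / y.

BASELINE_INCOME = 50_000  # the "1x" benchmark: cash to someone earning $50k/year
RD_MULTIPLIER = 45        # estimated cost-effectiveness of marginal R&D
GHW_BAR = 1_000           # approximate GHW funding bar, in the same units


def welfare_from_spending(dollars, multiplier):
    """Welfare units from spending `dollars` at `multiplier` times the 1x benchmark."""
    return dollars * multiplier / BASELINE_INCOME


def welfare_from_income_boost(people, fraction):
    """Welfare units from raising `people` incomes by `fraction` for one year."""
    return people * fraction


rd_welfare = welfare_from_spending(20e9, RD_MULTIPLIER)
boost_welfare = welfare_from_income_boost(180e6, 0.10)

print(f"R&D welfare units:          {rd_welfare:,.0f}")
print(f"Income-boost welfare units: {boost_welfare:,.0f}")
print(f"R&D as a share of the bar:  {RD_MULTIPLIER / GHW_BAR:.3f}")
```

Under these assumptions both quantities come out at about 18 million welfare units, which is consistent with the equivalence stated in the bullet.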

Overall, the model implies that the best R&D-related projects might be above our GHW bar, but it also leaves us relatively skeptical of arguments that accelerating innovation should be the primary social priority going forward.

In the full report, I also discuss:

• How alternative scenarios might affect social returns to R&D.
• What these returns might have looked like in the year 1800.
• How my estimates compare to those of economics papers that use statistical techniques to estimate returns to R&D growth.
• The ways in which my current views differ from those of certain thinkers in the Progress Studies movement.

Footnotes

↑1 If environmental constraints require that we reduce our use of various natural resources, productivity growth can allow us to maintain our standards of living while using fewer of these scarce inputs.

↑2 For example: in Stubborn Attachments, Tyler Cowen argues that the best way to improve the long-run future is to maximize the rate of sustainable economic growth. A similar view is held by many of those involved in the Progress Studies community.

↑3 An example of an intervention causing a temporary boost in R&D activity would be to fund some researchers for a limited period of time. Another example would be to bring forward in time a policy change that permanently increases the number of researchers.

Last year, we recommended $300 million of grants to GiveWell’s evidence-backed, cost-effective recommendations in global health and development, up from $100 million the year before. We recently decided that our total allocation for this year will be $350 million.

That’s a $50 million increase over last year, significantly driven by GiveWell’s impressive progress on finding more cost-effective opportunities. We expected GiveWell to identify roughly $430 million of 2022 “room for more funding” in opportunities at least 8 times more cost effective than cash transfers to the global poor. Instead, we estimate that GiveWell is on track to identify at least $600 million in such opportunities. However, due to reductions in our asset base, the growth in our commitment this year is smaller than the $200 million increase we had projected last fall.

The rest of this post:

• Reviews our framework for allocating funding across opportunities to maximize impact.
• Discusses how changes in asset values influence the appropriate distribution of funding across our program areas.
• Explains how we chose this year’s allocation to GiveWell, given our asset changes and GiveWell’s significant progress in finding more cost-effective opportunities.
• Shares Alexander’s personal thoughts on why GiveWell seems like an unusually compelling opportunity for individual donors this year.

## Our framework for allocating funding across opportunities

We want to help others as much as we can with the resources available to us. When choosing how much funding to allocate (whether to GiveWell’s recommendations or to other areas in global health and wellbeing), we think about how the choice will affect our funding options in future years.

If we had estimates of the cost-effectiveness of every grant we could possibly make this year, and similarly for each of the next 50 years, then we could take the following approach to maximize the impact of our spending and estimate our optimal threshold for cost-effectiveness:

1. Rank all the opportunities (all 50 years’ worth[1]) in terms of their expected cost-effectiveness.[2]
2. Set aside funds for the most cost-effective grant, then the next most cost-effective, and so on, going down the list until we’d allocated all our resources.
3. Look at the marginal grant that exhausts our resources, and use its estimated cost-effectiveness as the “bar” for all our other opportunities.
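
The three steps above amount to a greedy allocation over a ranked list. The following is a minimal sketch of that idea, not the model we actually use: the `Opportunity` class, the `find_bar` function, and every number in the example are hypothetical, and costs are assumed to be already discounted to present value.

```python
from dataclasses import dataclass


@dataclass
class Opportunity:
    name: str
    cost: float                # present-value cost in dollars (assumed pre-discounted)
    cost_effectiveness: float  # multiplier in "x" units


def find_bar(opportunities, budget):
    """Fund opportunities from most to least cost-effective until the budget runs out.

    Returns the cost-effectiveness of the marginal (last funded) grant, which
    serves as the "bar", together with the list of (name, amount funded).
    """
    ranked = sorted(opportunities, key=lambda o: o.cost_effectiveness, reverse=True)
    remaining = budget
    funded = []
    bar = None
    for opp in ranked:
        if remaining <= 0:
            break
        spend = min(opp.cost, remaining)  # the marginal grant may be partly funded
        funded.append((opp.name, spend))
        remaining -= spend
        bar = opp.cost_effectiveness
    return bar, funded


# Entirely made-up opportunity set, for illustration only.
example = [
    Opportunity("A", 100e6, 2000),
    Opportunity("B", 200e6, 1500),
    Opportunity("C", 300e6, 1100),
    Opportunity("D", 400e6, 900),
]

bar_large, _ = find_bar(example, 500e6)
bar_small, _ = find_bar(example, 300e6)
print(bar_large, bar_small)
```

In this toy example the smaller budget is exhausted higher up the list, so the marginal grant (and therefore the bar) is more cost-effective than under the larger budget.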

Of course, our cost-effectiveness estimates and predictions about the future aren’t nearly certain enough or granular enough to actually list every future opportunity. But we have done a very rough, abstract, assumptions-driven exercise (which we hope to publish in the next year) aimed at the same goals of figuring out what our “bar” should be and how to fund the most cost-effective opportunities across time.

One stylized finding from this exercise is that we can maximize our expected impact by setting a cost-effectiveness “bar” such that in any given year we spend roughly 8-10% of our remaining assets (of the assets we plan to eventually spend on interventions like GiveWell’s recommended charities).[3]

## How changes in assets influence optimal allocation across portfolio areas

In the hypothetical exercise above, we ranked each potential grant by its cost-effectiveness and then allocated our resources to the top grants, moving down the list until our resources were exhausted. If our assets were to shrink in value, they’d be exhausted further up the list, meaning that we’d fund fewer grants. Equivalently, since our cost-effectiveness “bar” is defined by the marginal grant, fewer resources means setting a higher bar for cost-effectiveness.

While we don’t have the information to actually do this exercise at the grant level across years, we can think about the implications for different portfolio areas. Some portfolios or categories of interventions will have many grants with cost-effectiveness near the bar; those grants could be ruled in or out by small fluctuations in the bar, so our giving in those categories will be especially sensitive to changes in asset values. Others will have most of their grants far above or far below the bar, which means our giving in those categories will not be very sensitive to changes in asset values.

GiveWell’s recommendations are an example of the first kind of grantmaking category, and so in theory our giving to GiveWell top charities should be especially sensitive to our asset values. That’s because, compared to most of our grantees, GiveWell’s work is highly elastic: its cost-effectiveness is not very sensitive to the annual scale of funding.

For example:

• If a charity focused on distributing bednets receives an extra $10 million, they can probably buy a lot more distribution, because this kind of work is highly scalable. Bednets are relatively easy to purchase and distribute; one might be able to spend a large amount of marginal funding at a similar level of cost-effectiveness by (for example) expanding to a new location with only slightly lower malaria prevalence.
• In other words, if the first $10 million in funding to this organization has a cost-effectiveness of 1100x, the next $10 million might be 1000x, because it buys a similar amount of distribution in a new location.
• By contrast, if a charity focused on researching new malaria treatments receives an extra $10 million, they may have a harder time “buying” more research, because this work isn’t as scalable. Even if the funding would pay for a dozen new researchers, there may simply not be enough relevant specialized candidates on the job market, which makes it hard to spend the money as effectively.
• In other words, even if the first $10 million in funding to this organization has a cost-effectiveness of 2000x, the next $10 million might be 500x, because it buys much less research.

This difference in elasticity means that a moderate change in our bar could rule out a lot of GiveWell grants while ruling out fewer grants in our other Global Health and Wellbeing (GHW) cause areas, which are more focused on research and advocacy.

## How we chose this year’s allocation to GiveWell

Due to the recent stock market decline, our available assets have declined by about 40% since the end of last year,[4] which changes the optimal allocation across causes. All else being equal, our planned 2022 allocation to GiveWell should respond more than proportionally, while our allocation to less elastic portfolio areas should respond less than proportionally.

However, asset values are not the only thing that’s changed since we last projected our GiveWell support. As we noted above, GiveWell has found much more cost-effective opportunities over the last nine months than we or they were expecting.

Incorporating that update, and adjusting for various constraints on our current opportunity set,[5] our model suggests that the optimal cost-effectiveness bar for our Global Health and Wellbeing spending is roughly 1100-1200x in our units.[6] That’s consistent with us giving roughly $250-$450 million to GiveWell top charities this year, depending on various assumptions about the opportunities GiveWell finds and their fundraising from other sources.[7] We don’t want our giving to create incentive problems by funging other donors,[8] so we’re committing now to a number in the middle of that range.

GiveWell’s increasingly cost-effective opportunity set means that, even though our available assets have fallen since the end of last year, and our allocation to GiveWell should respond more than proportionally, our planned 2022 support to GiveWell top charities has fallen (from our original tentative plan) by less than asset values, and still grown in absolute terms.

Grants we’ve made to GiveWell’s recommendations this year include:

• Up to $64.7 million to Dispensers for Safe Water to install, maintain, and promote use of chlorine dispensers at rural or remote water collection sites. They project that this funding “will see us providing over 10% of Uganda’s population, and over 15% of Malawi’s, with access to safe water”.
• $14 million to Evidence Action to scope and scale potentially cost-effective interventions that don’t have clear existing implementers. In expectation, GiveWell believes this grant will open up roughly $40 million per year in new cost-effective funding opportunities by 2025.
• $8.2 million to Fortify Health to expand its partnerships with flour mills in India. The organization provides equipment and materials which allow its partner mills to fortify wheat flour with iron, folic acid, and vitamin B12 in an effort to reduce health issues like anemia and cognitive impairment.
• $5 million to PATH to support ministries of health in Ghana, Kenya, and Malawi in the implementation of the RTS,S malaria vaccine. GiveWell believes that financing this opportunity will speed up implementation by roughly a year in the areas covered by the grant, and could be similarly cost-effective to Malaria Consortium’s seasonal malaria chemoprevention program (a GiveWell top charity).

When our GiveWell spending growth is combined with growth in other Global Health and Wellbeing causes, including new programs in South Asian Air Quality and Global Aid Policy, we plan to spend more in 2022 than in any previous year, both in absolute terms and as a share of assets. We expect that growth to continue beyond this year.

## Alexander’s final thoughts

When we published our initial plans for 2022, we were excited by GiveWell’s progress and eager to fund their future recommendations. Over the last nine months, they have exceeded our high expectations, which is why we are continuing to grow our support.

For other donors: I (Alexander) think it’s worth noting that GiveWell’s recommendations look very cost-effective this year. Compared to last year, we’d guess that their marginal recommendation this year will be ~20% more impactful; they also expect to have a sizable funding gap this year.[9] A few years ago, my wife and I contributed to a donor-advised fund to save for an exceptional future donation opportunity. This year, in addition to our standard annual giving, we plan to recommend half of the balance to GiveWell’s recommendations. I think that other donors interested in cost-effective and evidence-backed giving opportunities should take a close look at GiveWell this year.

Footnotes

↑1 The 50-year horizon ensures that this analysis takes into account our funders’ desire to spend down their assets within their lifetimes.

↑2 Of course, we’d have to discount future costs by the expected rate of return on assets, since we only need to set aside 1 dollar now in order to fund 1+r dollars of grantmaking next year. For example, if we expect a compounding 7% rate of return, a grant we’d make for $1 million in 2022 would need to have roughly the same estimated impact as a grant we’d make for $2 million in 2032 for the two grants to be similarly ranked, because we could get roughly $2 million in 2032 by investing the $1 million now.

↑3 This is the output of a Monte Carlo simulation approach, which we hope to write more about once we’re more confident in the model and findings. We simulate the interaction of many factors that could make our future spending more or less cost-effective. For example, some of our most cost-effective philanthropic opportunities will shrink over time as child mortality and global poverty decline, and the entry of new funders might mean there is less need for our spending in the future. On the other hand, waiting longer means we have more resources due to investment returns, and additional research might reveal new opportunities. Our current best estimate is that, for interventions like GiveWell’s recommended charities, our optimal strategy is to spend 8-10% of relevant assets each year. This means that whatever level of assets we have, our cost-effectiveness “bar” for GiveWell-like interventions should be set so that the opportunities above this bar in the next year add up to 8-10% of such assets.

↑4 This just reflects a decline in the market; our main donors are still planning to give away virtually all of their wealth within their lifetimes.

↑5 The “abstract, assumptions-driven” model we described above assumes that our grantmaking opportunity set is fully isoelastic (i.e. cost-effectiveness and scale trade off at a fixed proportional rate, no matter how much or how quickly we scale up) and doesn’t distinguish between different categories of GHW spending. More realistic constraints would, among other effects, limit how much we can spend in the next few years on non-GiveWell opportunities. For example, if we set the goal “quadruple our planned spending in 2023”, we’d probably have to set a much lower bar for that year. By contrast, if our goal was “quadruple our planned spending over the next decade”, we probably wouldn’t have to lower the bar as much (since we’d have more time to build a strategy around the new goal).

↑6 We refer to our bar using a multiplier. For example, using a “1000X” bar would mean we wanted each dollar of funding to have 1000 times as much impact as giving $1 to someone earning $50,000 per year. Our current bar is higher than the one we discussed last year because it reflects the update from assets declining and GiveWell finding more cost-effective opportunities than we expected, both of which raise the optimal bar.

↑7 Even without those sources of uncertainty, we’d have a fairly wide range of roughly-optimal amounts we could give this year. Our analysis suggests that giving $100 million more or less than the truly “optimal” amount to GiveWell top charities only reduces our overall impact by roughly as much as if we’d spent $5 million of our GHW assets to have zero impact. The impact of small deviations from the optimal path is small because, if we were perfectly optimally allocated across categories (and years) of spending, the marginal cost-effectiveness would be equalized for each category; that is, an extra dollar of giving would accomplish the same amount of good, no matter which year or category we allocated it to. Therefore, if we get our allocations a bit wrong, but are still near optimal, those deviations don’t reduce our impact too much. (Very roughly, they reduce our impact by the amount misallocated, times half the difference in marginal cost-effectiveness between the misallocated category and our overall “bar”. That’s because impact is the area under the cost-effectiveness curve, which is roughly approximated by a triangle at these scales.)

↑8 For more on this point, see the “coordination issues” section of this post.

↑9 I feel some responsibility for the gap. I think that originally discussing our tentatively planned $500 million allocation for 2022, along with GiveWell’s related disclosure of their expectation that they’d be rolling funding forward on the margin in 2021, led some donors to hold off on donations they might otherwise have made. While we still think it was correct to share our projections with GiveWell in order for them to plan correctly, we should have done more, privately and publicly, to emphasize that our plans were tentative, and that GiveWell could readily end up with more exciting grant opportunities than funding to fill them.
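
Two pieces of arithmetic in the footnotes above can be checked directly: the 7% compounding example, and the triangle approximation for the cost of a misallocation. The sketch below is illustrative only. The function names are hypothetical, and the "implied gap" is simply what the triangle approximation yields when combined with the $100 million and $5 million figures quoted in the footnote; it is not a number from our model.

```python
def future_value(amount, rate, years):
    """Value of `amount` invested now at a compounding annual `rate` for `years`."""
    return amount * (1 + rate) ** years


def misallocation_loss(amount_misallocated, gap):
    """Triangle approximation: lost impact, in (dollars x multiplier) units.

    `gap` is the difference in marginal cost-effectiveness between the
    misallocated category and the overall bar.
    """
    return 0.5 * amount_misallocated * gap


def implied_gap_fraction(amount_misallocated, equivalent_wasted):
    """Gap, as a fraction of the bar, at which the triangle loss equals
    spending `equivalent_wasted` dollars at the bar for zero impact.

    Solves 0.5 * amount * gap = wasted * bar for gap / bar.
    """
    return 2 * equivalent_wasted / amount_misallocated


# $1 million invested for ten years (2022 to 2032) at 7%.
print(f"{future_value(1e6, 0.07, 10):,.0f}")

# Footnote figures: $100 million misallocated costs about as much as $5 million wasted.
print(implied_gap_fraction(100e6, 5e6))
```

Read this way, the footnote’s figures correspond to marginal cost-effectiveness drifting by roughly a tenth of the bar over a $100 million deviation, which is why moderate deviations from the optimum are cheap.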

## How accurate are our predictions?

When investigating a grant, Open Philanthropy staff often make probabilistic predictions about grant-related outcomes they care about, e.g. “I’m 70% confident the grantee will achieve milestone #1 within 1 year.” This allows us to learn from the success and failure of our past predictions and get better over time at predicting what will happen if we make one grant vs. another, pursue one strategy vs. another, etc. We hope that this practice will help us make better decisions and thereby enable us to help others as much as possible with our limited time and funding.[1]

[1] Here is a fuller list of reasons we make explicit quantified forecasts and later check them for accuracy, as described in an internal document by Luke Muehlhauser: There is some evidence that making and checking quantified forecasts can help you improve the accuracy of your predictions over time, …

Thanks to the work of many people, we now have some data on our forecasting accuracy as an organization. In this blog post, I will:

1. Explain how our internal forecasting works.
2. Present some key statistics about the volume and accuracy of our predictions.
3. Discuss several caveats and sources of bias in our forecasting data: predictions are typically scored by the same person that made them, our set of scored forecasts is not a random or necessarily representative sample of all our forecasts, and all hypotheses discussed here are exploratory.

## 1. How we make and check our forecasts

Grant investigators at Open Philanthropy recommend grants via an internal write-up. This write-up typically includes the case for the grant, reservations and uncertainties about it, and logistical details, among other things. One of the (optional) sections in that write-up is reserved for making predictions.

The prompt looks like this (we’ve included sample answers):

Do you have any new predictions you’re willing to make for this grant? […] A quick tip is to scan your write-up for expectations or worries you could make predictions about. […]

| Predictions: With X% confidence… | Predictions: …I predict that (yes/no or confidence interval prediction)… | Predictions: …by time Y (ideally a date, not e.g. “in one year”) | Scoring (you can leave this blank until you’re able to score): Score (please stick to True / False / Not Assessed) | Scoring: Comments or caveats about your score |
|---|---|---|---|---|
| 30% | The grantee will produce outcome Z | End of 2021 | | |

After a grant recommendation is submitted and approved, the predictions in that table are logged into our Salesforce database for future scoring (as true or false). If the grant is renewed, scoring typically happens during the renewal investigation phase, since that’s when the grant investigator will be collecting information about how the original grant went. If the grant is not renewed, grant investigators are asked to score their predictions after they come due.[2] Scores are then logged into our database, and that information is used to produce calibration dashboards for individual grant investigators and teams of investigators working in the same focus area.

[2] In some rare cases, it’s possible for the people managing the database to score predictions using information available to them. However, predictions tend to be very in-the-weeds, so scoring them typically requires input from the grant investigators who made them.

A user’s calibration dashboard (in Salesforce) looks like this:

The calibration curve tells the user where they are well-calibrated vs. overconfident vs. underconfident. If a forecaster is well-calibrated for a given forecast “bucket” (e.g. forecasts they made with 65%-75% confidence), then the percent of forecasts that resolved as “true” should match that bucket’s confidence level (e.g. they should have come true 65%-75% of the time). On the chart, their observed calibration (the red dot) should be close to perfect calibration (the gray dot) for that bucket.[3] If it’s not, then the forecaster may be overconfident or underconfident for that bucket. For example, if things they predict with 65%-75% confidence happen only 40% of the time, they are overconfident in that bucket. (A bucket can also be empty if the user hasn’t made any forecasts within that confidence range.)

[3] The horizontal coordinate of the gray dots is calculated by averaging the confidence of all the predictions in each bin. Note that this is in general different from the midpoint of the bin; for example, if there are only two predictions in the 45%-55% bin and they have 46% and 48% confidence, …

Each bucket also shows a 90% credible interval (the blue line) that indicates how strong the evidence is that the forecaster’s calibration in that bucket matches their observed calibration, based on how many predictions they’ve made in that bucket. As a rule of thumb, if the credible interval overlaps with the line of perfect calibration, that means there’s no strong evidence that they are miscalibrated in that bucket. As a user makes more predictions, the blue lines shrink, giving that user a clearer picture of their calibration.

In the future, we hope to add more features to these dashboards, such as more powerful filters and additional metrics of accuracy (e.g. Brier scores).

## 2. Results

#### 2.1 Key takeaways

1. We’ve made 2850 predictions so far. 743 of these have come due and been scored as true or false. [more]
2. Overall, we are reasonably well-calibrated, except for being overconfident about the predictions we make with 90%+ confidence. [more]
3. The organization-wide Brier score (measuring both calibration and resolution) is .217, which is somewhat better than chance (.250). This requires careful interpretation, but in short we think that our reasonably good Brier score is mostly driven by good calibration, while resolution has more room for improvement (but this may not be worth the effort). [more]
4. About half (49%) of our predictions have a time horizon of ≤2 years, and only 13% of predictions have a time horizon of ≥4 years. There’s no clear relationship between accuracy and time horizon, suggesting that shorter-range forecasts aren’t inherently easier, at least among the short- and long-term forecasts we’re choosing to make. [more]

#### 2.2 How many predictions have we made?

As of March 16, 2022, we’ve made 2850 predictions. Of the 1345 that are ready to be scored, we’ve thus far assessed 743 of them as true or false. (Many “overdue” predictions will be scored when the relevant grant comes up for renewal.) Further details are in a footnote.[4]

What kinds of predictions do we make? Here are some examples:

• “[20% chance that] at least one human challenge trial study is conducted on a COVID-19 vaccine candidate [by Jul 1, 2022]”
• “[The grantee] will play a lead role… in securing >20 new global cage-free commitments by the end of 2019, improving the welfare of >20M hens if implemented”
• “[70% chance that] by Jan 1, 2018, [the grantee] will have staff working in at least two European countries apart from [the UK]”
• “60% chance [the grantee] will hire analysts and other support staff within 3 months of receiving this grant and 2-3 senior associates and a comms person within 6-9 months of receiving this grant”
• “70% chance that the project identifies ~100 geographically diverse advocates and groups for re-grants”
• “[80% chance that] we will want to renew [this grant]”
• “75% chance that [an expert we trust] will think [the grantee’s] work is ‘very good’ after 2 years”

Some focus areas[5] are responsible for most predictions, but this is mainly driven by the number of grant write-ups produced for each focus area. The number of predictions per grant write-up ranges from 3 to 8 and is similar across focus areas. Larger grants tend to have more predictions attached to them. We averaged about 1 prediction per $1 million moved, with significant differences across grants and focus areas.

#### 2.3 Calibration

Good predictors should be calibrated. If a predictor is well-calibrated, that means that things they expect to happen with 20% confidence do in fact happen roughly 20% of the time, things they expect with 80% confidence happen roughly 80% of the time, and so on.[6] Our organization-wide calibration curve looks like this:

To produce this plot, prediction confidences were binned in 10% increments. For example, the leftmost dot summarizes all predictions made with 0%-10% confidence. It appears at the 6% confidence mark because that’s the average confidence of predictions in the 0%-10% range, and it shows that 12% of those predictions came true. The dashed gray line represents perfect calibration.
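The binning described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the dashboards: the function name `calibration_bins` and the sample predictions are hypothetical, and it assumes confidences are stored as whole percentages, with bins open on the left and closed on the right (as described in the footnote that lists the per-bin counts).

```python
from collections import defaultdict


def calibration_bins(predictions):
    """Summarize (confidence_pct, came_true) pairs in 10-point bins.

    Bins are open on the left and closed on the right, so a 30% prediction
    falls in the 20-30 bin and a 20% prediction falls in the 10-20 bin.
    Returns rows of (bin label, mean confidence, % that came true, count).
    """
    grouped = defaultdict(list)
    for pct, came_true in predictions:
        # Shift by one so that exact multiples of 10 land in the lower bin.
        index = max(pct - 1, 0) // 10
        grouped[index].append((pct, came_true))

    rows = []
    for index in sorted(grouped):
        members = grouped[index]
        # Horizontal position of the dot: average stated confidence.
        mean_conf = sum(p for p, _ in members) / len(members)
        # Vertical position of the dot: share of predictions that came true.
        pct_true = 100 * sum(1 for _, t in members if t) / len(members)
        label = f"{index * 10}-{index * 10 + 10}"
        rows.append((label, mean_conf, pct_true, len(members)))
    return rows


# Hypothetical sample data, for illustration only.
sample = [(46, True), (48, False), (30, True), (20, False)]
for row in calibration_bins(sample):
    print(row)
```

Working with integer percentages avoids floating-point surprises at bin edges; a version that takes probabilities in [0, 1] would need to handle rounding at the boundaries explicitly.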

The vertical black lines are 90% credible intervals around the point estimates for each bin. If the bar is wider, that generally means we’re less sure about our calibration for that confidence range because we have fewer data points in that confidence range.[7] All the bins have at least 40 resolved predictions except the last one, which only has 8 – hence the wider interval. A table with the number of true / false predictions in each bin can be found in a footnote.[8]
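As a rough illustration of how such an interval can be computed: with a uniform prior, a bin with T true and F false predictions has a Beta(T+1, F+1) posterior (as stated in the footnote on these intervals). The sketch below reports an equal-tailed 90% interval from that posterior. The equal-tailed choice, the function names, and the bisection approach are assumptions made for this example and may differ from what the dashboards do; in practice a library routine such as SciPy's `scipy.stats.beta.ppf` would be the more usual tool. The counts used are those of the 40-50 and 90-100 bins in the footnoted table.

```python
from math import comb


def beta_cdf(x, a, b):
    """CDF of Beta(a, b) for positive integer a and b.

    Uses the identity I_x(a, b) = P(Binomial(a + b - 1, x) >= a),
    which is exact when both parameters are integers.
    """
    n = a + b - 1
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))


def beta_quantile(q, a, b, tol=1e-9):
    """Invert the CDF by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def credible_interval(true_count, false_count, level=0.90):
    """Equal-tailed credible interval under a uniform prior."""
    a, b = true_count + 1, false_count + 1
    tail = (1 - level) / 2
    return beta_quantile(tail, a, b), beta_quantile(1 - tail, a, b)


# Counts taken from the per-bin table in the footnotes.
for label, t, f in [("40-50", 69, 82), ("90-100", 5, 3)]:
    low, high = credible_interval(t, f)
    print(f"{label}: {low:.2f} to {high:.2f}")
```

The bin with only 8 resolved predictions produces a much wider interval than the bin with 151, which is the pattern the plot shows.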

The plot shows that Open Philanthropy is reasonably well-calibrated as a whole, except for predictions we made with 90%+ confidence (those events only happened slightly more than half the time) and possibly also in the 70%-80% range (those events happened slightly less than 70% of the time). In light of this, the “typical” Open Phil predictor should be less bold and push predictions that feel “almost certain” towards a lower number.[9]

#### 2.4 Brier scores and resolution

On top of being well calibrated, good predictors should give high probability to events that end up happening and low probability to events that don’t. This isn’t captured by calibration. For example, imagine a simplified world in which individual stocks go up and down in price but the overall value of the stock market stays the same, and there aren’t any trading fees. In this world, one way to be well-calibrated is to make predictions about whether randomly chosen stocks will go up or down over the next month, and for each prediction just say “I’m 50% confident it’ll go up.” Since a randomly chosen stock will indeed go up over the next month about 50% of the time (and down the other 50% of the time), you’ll achieve perfect calibration! This good calibration will spare you from the pain of losing money, but it won’t help you make any money either. However, you will make lots of money if you can predict with 60% (calibrated) confidence which stocks will go up vs. down, and you’ll make even more money if you can predict with 80% calibrated confidence which stocks will go up vs. down. If you could do that, then your stock predictions would be not just well-calibrated but also have good “resolution.”

A metric that captures both aspects of what makes a good predictor is the Brier score (also explained in a footnote[10]). The most illustrative examples are:

1. A perfect predictor (100% confidence on things that happen, 0% confidence on things that don’t) would get a Brier score of 0.
2. A perfect anti-predictor (0% confidence on things that happen, 100% confidence on things that don’t) would get a score of 1.
3. A predictor that always predicts 50% would get a score of 0.25 (assuming the events they predict happen half the time). Thus, a score higher than 0.25 means someone’s accuracy is no better than if they simply guessed 50% for everything.

The mean Brier score across all our predictions is 0.217, and the median is 0.160. (Remember, lower is better.) 75% of focus area Brier scores are under 0.25 (i.e. they’re better than chance).[11] This rather modest[12] Brier score together with overall good calibration implies our forecasts have low resolution.[13]
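The three reference points above can be checked directly from the definition of the Brier score for binary events (the mean squared difference between forecast probabilities and 0/1 outcomes). The sketch below is illustrative only; the function name `brier_score` and the toy outcomes are hypothetical.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(forecasts)


# Toy data in which exactly half of the events happen.
outcomes = [1, 0, 1, 0]

perfect = brier_score([1.0, 0.0, 1.0, 0.0], outcomes)  # always right
anti = brier_score([0.0, 1.0, 0.0, 1.0], outcomes)     # always wrong
always_half = brier_score([0.5] * 4, outcomes)         # 50% on everything

print(perfect, anti, always_half)
```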
Luke’s intuition on why there’s a significant difference in performance between these two dimensions of accuracy is that good calibration can probably be achieved through sheer reflection and training, just by being aware of the limits of one’s own knowledge, whereas resolution requires gathering and evaluating information about the topic at hand and carefully using it to produce a quantified forecast, something our grant investigators aren’t typically doing in much detail (most of our forecasts are produced in seconds or minutes). If this explanation is right, getting better Brier scores would require spending significantly more time on each forecast. We’re uncertain whether this would be worth the effort, since calibration alone can be fairly useful for decision-making and is probably much less costly to achieve, and our grant investigators have many other responsibilities besides making predictions.

#### 2.5 Longer time horizons don’t hurt accuracy

Almost half of all our predictions are made less than 2 years before they will resolve (e.g. the prediction might be “X will happen within two years”),[14] with ~75% being less than 3 years out. Very few predictions are about events decades into the future.

It’s reasonable to assume that (all else equal) the longer the time horizon, the harder it is to make accurate predictions.[15] However, our longer-horizon forecasts are about as accurate as our shorter-horizon forecasts. A possible explanation is question selection. Grant investigators may be less willing to produce long-range forecasts about things that are particularly hard to predict because the inherent uncertainty looks insurmountable. This may not be the case for short-range forecasts, since for these most of the information is already available.[16] In other words, we might be choosing which specific things to forecast based on how difficult we think they are to forecast regardless of their time horizon, which could explain why our accuracy doesn’t vary much by time horizon.

## 3. Caveats and sources of bias

There are several reasons why our data and analyses could be biased. While we don’t think these issues undermine our forecasting efforts entirely, we believe it’s important for us to explain them in order to clarify how strong the evidence is for any of our claims. The main issues we could identify are:

1. Predictions are typically written and then later scored by the same person, because the grant investigator who made each prediction is typically also our primary point of contact with the relevant grantee, from whom we typically learn which predictions came true vs. false. This may introduce several biases. For example, predictors may choose events that are inherently easier to predict. Or, they may score ambiguous predictions in a way that benefits their accuracy score. Both things could happen subconsciously.
2. There may be selection effects on which predictions have been scored. For example, many predictions have overdue scores, i.e. they are ready to be evaluated but have not been scored yet. The main reason for this is that some predictions are associated with active grants, i.e. grants that may be renewed in the future. When this happens, our current process is to leave them unscored until the grant investigator writes up the renewal, during which they are prompted to score past predictions. It shouldn’t be assumed that these unscored predictions are a random sample of all predictions, so excluding them from our analyses may introduce some hard-to-understand biases.
3. The analyses presented here are completely exploratory. All hypotheses were put forward after looking at the data, so this whole exercise should be better thought of as “narrative speculations” rather than “scientific hypothesis testing.”

## Footnotes

1 Here is a fuller list of reasons we make explicit quantified forecasts and later check them for accuracy, as described in an internal document by Luke Muehlhauser:

1. There is some evidence that making and checking quantified forecasts can help you improve the accuracy of your predictions over time, which in theory should improve the quality of our grantmaking decisions (on average, in the long run).
2. Quantified predictions can enable clearer communication between grant investigators and decision-makers. For example, if you just say it “seems likely” the grantee will hit their key milestone, it’s unclear whether you mean a 55% chance or a 90% chance.
3. Explicit quantified predictions can help you assess grantee performance relative to initial expectations, since it’s easy to forget exactly what you expected them to accomplish, and with what confidence, unless you wrote down your expectations when you originally made the grant.
4. The impact of our work is often difficult to measure, so it can be difficult for us to identify meaningful feedback loops that can help us learn how to be more effective and hold ourselves accountable to our mission to help others as much as possible. In the absence of clear information about the impact of our work (which is often difficult to obtain in a philanthropic setting), we can sometimes at least learn how accurate our predictions were and hold ourselves accountable to that. For example, we might never know whether our grant caused a grantee to succeed at X and Y, but we can at least check whether the things we predicted would happen did in fact happen, with roughly the frequencies we predicted.

2 In some rare cases, it’s possible for the people managing the database to score predictions using information available to them. However, predictions tend to be very in-the-weeds, so scoring them typically requires input from the grant investigators who made them.
3 The horizontal coordinate of the gray dots is calculated by averaging the confidence of all the predictions in each bin. Note that this is in general different from the midpoint of the bin; for example, if there are only two predictions in the 45%-55% bin and they have 46% and 48% confidence, respectively, then the point of perfect calibration in that bin would be 47%, not 50%.
4 Our stats as of 2022-03-16 are as follows (italics means the percentage is taken over scored predictions, not total):

| Status | Number | % |
|---|---|---|
| Scored | | |
| True | 382 | *45%* |
| False | 361 | *42%* |
| Not Assessed | 115 | *13%* |
| Total Scored | 858 | 30% |
| Not scored | | |
| Not Yet Due | 1,448 | 51% |
| Overdue | 487 | 17% |
| Missing End Date | 57 | 2% |
| Total Not Scored | 1992 | 70% |
| Total | 2850 | 100% |

Some categories in the table above deserve further comments:

• Not Assessed: There are several reasons why some predictions are not assessed:
    • Some predictions had vague / subjective resolution criteria (so that it was unclear whether the event happened or not).
    • We didn’t check some predictions because it would have taken too much time or effort to do so.
    • Some predictions were premised on a condition that wasn’t fulfilled (e.g. “if X happens, the grantee will achieve Y”, if X never happens).
    • Some predictions were about grants that didn’t happen.

  We don’t yet have systematic data to determine which of these reasons are more prevalent, but we may be able to say more about this in the future.
• Overdue: Some predictions have overdue scores because they are associated with active grants that may be renewed in the future. In these cases, we don’t request scores from grant investigators until they write up the renewal grant. There may also be some scores we haven’t logged yet due to lack of capacity.
• Missing End Date: Predictions with no end date can’t be scored as False (because the event may still happen in the future). We’re currently working with grant investigators to log reasonable end dates for these.

5 We’re leaving out focus areas with less than $10M moved in the subsequent analyses. The excluded focus areas are South Asian Air Quality, History of Philanthropy, and Global Health and Wellbeing.
6 This sentence and some other explanatory language in this report are borrowed from an internal guide about forecasting written by Luke Muehlhauser.
7 These intervals assume a uniform prior over (0, 1). This means that, for a bin with T true predictions and F false predictions, the intervals are calculated using a Beta(T+1, F+1) distribution.
8 Detailed calibration data for each bin are provided below. Note that intervals are open to the left and closed to the right; a 30% prediction would be included in the 20-30 bin, but a 20% prediction would be included in the 10-20 bin.

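| Confidence [%] | True | False | Total |
|---|---|---|---|
| 0-10 | 5 | 39 | 44 |
| 10-20 | 10 | 32 | 42 |
| 20-30 | 20 | 53 | 73 |
| 30-40 | 24 | 36 | 60 |
| 40-50 | 69 | 82 | 151 |
| 50-60 | 64 | 36 | 100 |
| 60-70 | 86 | 44 | 130 |
| 70-80 | 65 | 29 | 94 |
| 80-90 | 34 | 7 | 41 |
| 90-100 | 5 | 3 | 8 |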

9 However, given that there is high variance in calibration across predictors, this may not be the best idea in all cases. For personal advice, predictors may wish to refer to their own calibration curve, or their team’s curve.
10 For binary events, the Brier score can be defined as

$$BS\,=\,\frac{1}{n} \sum_{i=1}^n (p_i\,-\,Y_i)^2$$

where $$i = 1,\ldots,n$$ ranges over events, $$p_i$$ is the forecasted probability that the i-th event resolves True, and $$Y_i$$ is the actual outcome of the i-th event (1 if True, 0 if False). A predictor that knows the base rate, $$b$$, of future events and predicts that on every event gets a Brier score of $$b\,(1\,-\,b)$$. For example, if $$b$$ = 50% (as is roughly the case for us), the expected Brier score is 0.25. A perfect predictor (100% confidence on things that happen, 0% confidence on things that don’t) would get a Brier score of 0. A predictor that is perfectly anticorrelated with reality (predicts the exact opposite of a perfect predictor) would get a score of 1. The Brier score can be decomposed into a sum of 3 components as

$$BS\,=\,E\left[(p\,-\,P[Y|p])^2\right]\,-\,E\left[(P[Y|p]\,-\,b)^2\right]\,+\,b\,(1\,-\,b)$$

where $$E$$ denotes expectation, $$p$$ is the forecasted probability of the event $$Y$$, $$P[Y|p]$$ is the actual probability of $$Y$$ given that the forecasted probability was $$p$$, and $$b$$ is the base rate of $$Y$$. The components can be interpreted as follows:

1. The first one measures miscalibration. It is the mean squared error between forecasted and actual probabilities. It ranges from 0 (perfect calibration) to 1 (worst).
2. The second one measures resolution. It is the expected improvement of one’s forecasts over the blind strategy that always outputs the base rate. It ranges from 0 (worst) to b(1-b) (best).
3. The third one measures the inherent uncertainty of the events being forecasted. It is just the variance, b(1-b), of a binary event that happens with probability b.

In practice, because it is unlikely that any two events have the same forecasted probability, $$P[Y | p]$$ is calculated by binning forecasts and averaging within each bin, i.e. the empirical estimate is $$P[Y | p]$$ = (# of true predictions in that bin) / (total # of predictions in that bin). This is exactly what we do in our dashboards.
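To make the binned estimate concrete, the sketch below recomputes the resolution and uncertainty terms from the per-bin counts of true and false predictions listed in the calibration-data footnote. It is an illustration under stated assumptions, not the dashboard code: the names `BIN_COUNTS` and `resolution_and_uncertainty` are hypothetical, and the miscalibration term is left out because it needs each bin's mean stated confidence, which that table does not report.

```python
# (true, false) counts for each 10-point confidence bin, from 0-10 up to
# 90-100, copied from the per-bin calibration table in the footnotes.
BIN_COUNTS = [
    (5, 39), (10, 32), (20, 53), (24, 36), (69, 82),
    (64, 36), (86, 44), (65, 29), (34, 7), (5, 3),
]


def resolution_and_uncertainty(bin_counts):
    """Return the resolution and uncertainty terms of the decomposition.

    P[Y | p] is estimated per bin as (# true) / (# predictions in the bin),
    and each bin is weighted by its share of all scored predictions.
    """
    total = sum(t + f for t, f in bin_counts)
    base_rate = sum(t for t, _ in bin_counts) / total
    resolution = sum(
        (t + f) * (t / (t + f) - base_rate) ** 2
        for t, f in bin_counts
        if t + f > 0  # skip empty bins
    ) / total
    uncertainty = base_rate * (1 - base_rate)
    return resolution, uncertainty


resolution, uncertainty = resolution_and_uncertainty(BIN_COUNTS)
print(f"resolution = {resolution:.3f}, uncertainty = {uncertainty:.3f}")
```

With these counts the resolution comes out at roughly 0.037 and the uncertainty at roughly 0.25, in line with the figures quoted elsewhere in these footnotes.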

11 A score of 0.25 is a reasonable baseline in our case because the base rate for past predictions happens to be very close to 50%. This means that predictors in the future could state 50% confidence on all predictions and, assuming the base rate stays the same (i.e. the population of questions that predictors sample from is stable over time), get close to perfect calibration without achieving any resolution.
12 For comparison, first-year participants in the Good Judgment Project (GJP) that were not given any training got a score of 0.21 (appears as 0.42 in table 4 here; Tetlock et al. scale their Brier score such that, for binary questions, we’d need to multiply our scores by 2 to get numbers with the same meaning). The Metaculus community averages 0.150 on binary questions as of this writing (May 2022). Both comparisons have very obvious caveats: the population of questions on GJP or Metaculus is very different from ours and both platforms calculate average Brier scores over time, taking into account updates to the initial forecast, while our grant investigators only submit one forecast and never try to refine it later.
13 For a base rate of 50%, resolution ranges from 0 (worst) to 0.25 (best). OP’s resolution is 0.037.
14 A caveat about this data: I’m taking the difference between ‘End Date’ (i.e. when a prediction is ready to be assessed) and ‘Investigation Close Date’ (the date the investigator submitted their request for conditional approval). This underestimates the time span between forecast and resolution because predictions are made before the investigation closes. It also explains why some time deltas are slightly negative: most likely, the grant investigator wrote the prediction long before submitting the write-up for conditional approval.
15 This is in line with evidence from GJP and (less so) Metaculus showing that accuracy drops as time until question resolution increases. However, note that the opposite holds for PredictionBook, i.e. Brier scores tend to get better the longer the time horizon. Our working hypothesis to explain this paradoxical result is that, when users get to select the questions they forecast on (as they do on PredictionBook), they will only pick “easy” long-range questions. When the questions are chosen by external parties (as in GJP), they tend to be more similar in difficulty across time horizons. Metaculus sits somewhere in the middle, with community members posting most questions and opening them to the public. We may be able to test this hypothesis in the future by looking at data from Hypermind, which should fall closer to GJP than to the others because questions on the platform are commissioned by external parties.
16 This selection effect could come about through several mechanisms. One such mechanism could be picking well-defined processes more often in long-range forecasts than in short-range ones. In those cases, what matters is not the calendar time elapsed between start and end but the number and complexity of steps in the process. For example, a research grant may contain predictions about the likely output of that research (some finding or publication) that can’t be scored until the research has been conducted. If the research was delayed for some reason, or if it happens earlier than expected due to e.g. a sudden influx of funding, that doesn’t change the intrinsic difficulty of predicting anything about the research outcomes themselves.

## Announcing the launch of our new website

We’ve just launched a new version of our website. We think the new design will make our content easier to navigate, so that readers have an easier time learning about our work and our thinking.

As part of the launch, we’ve updated language on a number of core pages to better reflect how our work has evolved in the years since our previous website was created.

This includes updates to our mission statement, which had been in place since our incubation as a project of GiveWell. The new statement is more concise, and we think it better reflects the breadth of our work:

“Our mission is to help others as much as we can with the resources available to us.”

• The ability to sort and filter much of our published content, including blog posts, research reports, and notable lessons.
• Statistics on our giving in each of our focus areas.
• A new page explaining the difference between our two grantmaking portfolios (Global Health & Wellbeing and Longtermism).
• New pages for our newest focus areas, South Asian Air Quality and Global Aid Policy.

If you experience any issues using the new site, or want to suggest a change, we would appreciate your feedback! Contact [email protected] to get in touch.

## Open Philanthropy’s Cause Exploration Prizes

At Open Philanthropy, we aim to give as effectively as we can. To find the best opportunities, we’ve looked at many different causes, some of which have become our current focus areas.

Even after a decade of research, we think there are many excellent grantmaking ideas we haven’t yet uncovered. So we’ve launched the Cause Exploration Prizes around a set of questions that will help us explore new areas.

We’re most interested in responses to our open prompt: “What new cause area should Open Philanthropy consider funding?”

We also have prompts in the following areas:

We’re looking for responses of up to 5,000 words that clearly convey your findings. It’s fine to use bullet points and informal language. For more detail, see our guidance for authors. To submit, go to this page.

We hope that the Prizes help us to:

• Identify new cause areas and funding strategies.
• Develop our thinking on how best to measure impact.
• Find people who might be a good fit to work with us in the future.

You can read more about the Cause Exploration Prizes on our dedicated website. You’ll also be able to read all of the submissions on the Effective Altruism Forum later this summer – stay tuned!

All work must be submitted by 11:00 pm PDT on August 4, 2022.

You are almost certainly eligible! We think these questions can be approached from many directions, and you don’t need to be an expert or have a PhD to apply.

There’s a $25,000 prize for the top submission, and three $15,000 prizes. Anyone who wins one of these prizes will be invited to present their work to Open Philanthropy’s cause prioritization team in San Francisco (and be compensated for their time and travel). And we will follow up with authors if their work contributes to Open Phil grantmaking decisions!

We will also award twenty honorable mentions ($500), and a participation award ($200) for the first 200 submissions made in good faith and not awarded another prize.

All submissions will be shared on the Effective Altruism Forum to allow others to learn from them. If participants prefer, their submission can be published anonymously, and we can handle the logistics of posting to the Forum. See more detail here.

For full eligibility requirements and prize details, see our rules and FAQs.

## Our Progress in 2021 and Plans for 2022

This post compares our progress with the goals we set forth a year ago, and lays out our plans for the coming year.

In brief:

• We recommended over $400 million worth of grants in 2021. The bulk of this came from recommendations to support GiveWell’s top charities and from our major current focus areas. [More]
• We launched several new program areas (South Asian air quality, global aid policy, and effective altruism community building with a focus on global health and wellbeing) and have either hired or are currently hiring program officers to lead each of those areas. [More]
• We revisited the case for our US policy causes: spinning out our criminal justice reform program as an independent organization, making exit grants in US macroeconomic stabilization policy, and updating our approaches to land use reform and immigration policy. [More]
• We shared our latest framework for evaluating global health and wellbeing interventions, as well as several reports on key topics in potential risk from advanced AI. [More]

## Continued grantmaking

Last year, we wrote:

“We expect to continue grantmaking in potential risks from advanced AI, biosecurity and pandemic preparedness, criminal justice reform, farm animal welfare, scientific research, and effective altruism, as well as recommending support for GiveWell’s top charities. We expect that the total across these areas will be over $200 million.”

We wound up recommending over $400 million across those areas. Some highlights:

### Plans for 2022

This year, we expect to continue grantmaking in potential risks from advanced AI, biosecurity and pandemic preparedness, South Asian air quality, global aid policy, farm animal welfare, scientific research, and effective altruism community building (focused on both longtermism and global health and wellbeing), as well as recommending support for GiveWell’s top charities. We aim to roughly double the amount we recommend this year relative to last year, and triple it by 2025.

## New program areas

We launched two new program areas this year: South Asian Air Quality and Global Aid Policy. We’re thrilled to have hired Santosh Harish and Norma Altshuler to lead these programs, and we look forward to sharing some of the grants they’re making in next year’s annual review.

We also announced plans to launch another new program area: supporting the effective altruism community with a focus on global health and wellbeing. We are still in the process of hiring a program officer to lead this area.

### Plans for 2022

This year, our global health and wellbeing cause prioritization team aims to launch three more new program areas where we can find scalable opportunities above our bar, and to continue laying the groundwork for more growth in future years.

## Revisiting our older US policy causes

We made our initial selection of US policy causes in 2014 and 2015. We chose criminal justice reform, macroeconomic stabilization policy, immigration policy, and land use reform in order to try to get experience across a variety of causes that stood out on different criteria (immigration on importance, CJR on tractability, land use reform and macro on neglectedness). We’ve learned a lot from our experience funding in these fields, but over time have updated towards a more unified ROI framework that lets us more explicitly compare across causes. (Also, the world has changed a lot over the last 7 years.)
We gave an initial update on our revised thinking back in 2019, and we are still evaluating our performance and plans for the future as of 2022. On our new website, we are no longer referring to these causes as full “focus areas” because we do not have full-time staff leading any of them. But our particular plans for the future vary across the four causes:

• We spun out our criminal justice reform program into an independent organization, Just Impact, which we supported with $50 million in seed funding.
• After the rapid recovery of the U.S. from the most recent recession, we decided to wind down our giving to U.S. grantees in macroeconomic policy. (We made some exit grants, as we often do when we decide not to renew support to organizations we’ve supported for a long time.) We currently expect to continue to support regranting on this issue within Europe via Dezernat Zukunft, but will revisit depending on how economic and policy conditions evolve. We hope to write more about our thinking on and lessons learned in this area in the future.
• On land use reform: we recently completed an updated review on the performance of our grantees and the valuation of a marginal housing unit in key supply-constrained regions, which made us think that our returns to date have been well above our bar and that there is room for expansion. We’ve commissioned an outside review of our analysis; pending the results of that review, we’re considering hiring someone to lead a bigger portfolio in this space.
• On immigration policy:
• We have never had a clear theory of how to change the political economy to be supportive of substantially larger immigration flows, which is what would be necessary to achieve the global poverty improvements that motivate our interest in this issue. Accordingly, our recent spending has been lower than in macro or land use reform.
• Over the last few years, the US political climate for immigration reform has come to look even less promising than when we initially explored this space.
• We’re currently planning to continue supporting Michael Clemens’ work (which is what motivated our interest in this cause), make occasional opportunistic grants that fit with our overall ROI framework, and explore whether we should have a program around STEM immigration. But we are not planning to do more on US immigration policy per se.

## New approaches to funding

This year, we created a number of new programs to openly solicit funding requests from individuals, groups, and organizations. This represents a different approach from the proactive searching and networking we use to find most of our grants, and we are excited by the potential for these programs to unearth strong opportunities we wouldn’t have found otherwise.

## Request for proposals: Help Open Philanthropy quantify biological risk

Open Philanthropy is seeking proposals from those interested in contributing to a research project informing and estimating biosecurity-relevant numbers and ‘base rates’. We welcome proposals from both research organizations and individuals (at any career stage, including undergraduate and postgraduate students). The work can be structured via contract or grant.

The application deadline is June 5th.

## Background

How likely is a biological catastrophe? Do the biggest risks come from states or terrorists? Accidents or intentional attacks?

These parameters are directly decision-relevant for Open Philanthropy’s biosecurity and pandemic preparedness strategy. They determine the degree to which we prioritize biosecurity compared to other causes, and inform how we prioritize different biosecurity interventions (e.g. do we focus on lab safety to reduce accidents, or push for better DNA synthesis screening to impede terrorists?).

One way of estimating biological risk that we do not recommend is ‘threat assessment’—investigating various ways that one could cause a biological catastrophe. This approach may be valuable in certain situations, but the information hazards involved make it inherently risky. In our view, the harms outweigh the benefits in most cases.

A second, less risky approach is to abstract away most biological details and instead consider general ‘base rates’. The aim is to estimate the likelihood of a biological attack or accident using historical data and base rates of analogous scenarios, and of risk factors such as warfare or terrorism. A few examples include:

• Estimating the rate at which states or terrorist groups have historically sought biological, chemical, nuclear, or radiological weapons.
• Forecasting the risk of great power war over the next 50 years (combining historical data with current geopolitical trends).
• Estimating the rate at which lab leaks have occurred in state programs.
• Enumerating possible ‘phase transitions’ that would cause a radical departure from relevant historical base rates, e.g. total collapse of the taboo on biological weapons, such that they become a normal part of military doctrine.

This information allows us to better estimate the probability of biological catastrophe in a variety of scenarios, with some hypothetical concrete examples described in the next section.
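
As a minimal sketch of what a base-rate estimate can look like in practice, the Python snippet below treats events as arriving at a constant annual rate. That Poisson assumption is an illustrative simplification, not something the historical record guarantees. The function names and the example counts are hypothetical placeholders, not real data.

```python
import math


def annual_base_rate(event_count: int, years_observed: float) -> float:
    """Point estimate of the annual rate: observed events per year of history."""
    if years_observed <= 0:
        raise ValueError("years_observed must be positive")
    if event_count < 0:
        raise ValueError("event_count cannot be negative")
    return event_count / years_observed


def zero_event_upper_bound(years_observed: float, confidence: float = 0.95) -> float:
    """Upper bound on the annual rate when no events were observed.

    Under a constant-rate (Poisson) assumption, the chance of seeing zero
    events in `years_observed` years is exp(-rate * years_observed). Setting
    that equal to (1 - confidence) and solving for the rate gives the bound.
    """
    if years_observed <= 0:
        raise ValueError("years_observed must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return -math.log(1.0 - confidence) / years_observed


if __name__ == "__main__":
    # Hypothetical inputs for illustration only: 4 events in 80 years.
    print(annual_base_rate(event_count=4, years_observed=80))
    # Hypothetical inputs: no events seen in 100 years of records.
    print(zero_event_upper_bound(years_observed=100))
```

The second function matters because many of the events of interest have never been observed: an empty record still constrains the rate, but only as loosely as the length of the record allows.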

The biosecurity team at Open Philanthropy (primarily Damon) is currently working on developing such models. However, given the broad scope of the work, we would be keen to see additional research on this question. While we are interested in independent models that attempt to estimate the overall risk of a global biological catastrophe (GCBR), we are particularly keen on projects that simply collect huge amounts of relevant historical data, or thoroughly explore one or more sub-questions.

One aspect of this approach we find particularly exciting is that it is threat-agnostic and thus relevant across a wide range of scenarios, foreseen and unforeseen. It also helps us to think about the misuse of other dangerous transformative technologies, such as atomically precise manufacturing or AI.

We are therefore calling for applications from anyone who would be interested in spending up to the next four months (or possibly longer) helping us better understand these aspects of biorisk. We could imagine successful proposals from many different backgrounds—for example, a history undergraduate looking for a summer research project, a postdoc or superforecaster looking for part-time work, or a quantitative research organization with an entire full-time team.

## What is our goal?

We are interested in supporting work that will help us better quantitatively estimate the risk of a GCBR without creating information hazards. To do this, we can imagine treating the biological details of a threat as a ‘black box’ and instead quantifying risk within hypothetical scenarios like the following:

Scenario 1 (states or big terrorist groups): In January 2025, scientists publish a paper that inadvertently provides the outlines for a biological agent that, if created and released, would kill hundreds of millions of people. Creating the pathogen would require a team of 10 PhD-level biologists working full time for 2 years, and a budget of $10 million.

Scenario 2 (small terrorist groups or lone wolf): Same as Scenario 1, but creating the pathogen requires a single PhD-level biologist working full time for 2 years, and a budget of $1 million.

For each of these scenarios, what is the ‘annual rate’ at which the biological agent is created and released (either accidentally or intentionally)?

We are interested in research proposals that aim to estimate this rate or estimate rates in similar scenarios. Proposals don’t need to do this directly, but could instead aim to quantitatively understand ‘upstream’ aspects of the world that could affect the risk of catastrophe.
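
To make the ‘annual rate’ framing concrete, the sketch below shows one way such a rate could be turned into a probability over a longer horizon, and how rates from separate sources could be added together. It assumes a constant rate and independent sources, both of which are simplifying assumptions for illustration. The function names, source labels, and numbers are hypothetical and are not estimates of any real risk.

```python
import math
from typing import Mapping


def prob_at_least_one(annual_rate: float, horizon_years: float) -> float:
    """Probability of one or more events over the horizon at a constant rate."""
    if annual_rate < 0 or horizon_years < 0:
        raise ValueError("annual_rate and horizon_years must be non-negative")
    return 1.0 - math.exp(-annual_rate * horizon_years)


def combined_annual_rate(rates_by_source: Mapping[str, float]) -> float:
    """Total annual rate across sources, assuming the sources are independent.

    Independent Poisson processes add, so the combined rate is a plain sum.
    """
    if any(rate < 0 for rate in rates_by_source.values()):
        raise ValueError("rates must be non-negative")
    return sum(rates_by_source.values())


if __name__ == "__main__":
    # Placeholder rates chosen only to show the arithmetic.
    placeholder_rates = {"accidental": 0.004, "intentional": 0.006}
    total = combined_annual_rate(placeholder_rates)
    # A 1% annual rate held for 50 years gives roughly a 39% cumulative chance.
    print(total, prob_at_least_one(total, horizon_years=50))
```

Splitting the rate by source mirrors the distinction drawn above between accidental and intentional release, so a project that informs only one of the components can still feed into the overall estimate.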

The scenarios serve as a concrete litmus test for the kinds of proposals we are interested in: if a proposal wouldn’t directly help a researcher estimate the annual risk rate in these toy scenarios, it is unlikely to be of interest to us. In particular, purely qualitative approaches, such as trying to understand specific case studies in detail, or trying to understand the ideological drivers of terrorism, are unlikely to be a good fit. Similarly, very theoretical approaches, such as those with empirically unverifiable parameters or unfalsifiable elements, are also unlikely to be successful.

## What a successful proposal might look like

An ideal proposal might take on one or more of the following:

• Create a database of (a well-scoped class of) terrorist attacks, particularly those that required a significant degree of planning, expertise, and/or resources
• Forecast the number of wars over the next few decades
• Estimate the likelihood that WMDs are used in those wars
• Estimate the likelihood that biological weapons specifically are used in those wars
• Create a database of (a well-scoped class of) dangerous actors:
• This might include terrorist groups, cults, paramilitary groups, or rebel factions
• The database would include, for each group, estimates of their size, budget, ideological commitments, and similar information
• Covering ‘all’ of them could be overly ambitious, so it may make more sense to scope narrowly: a complete database for a limited time period and geographic region (for instance) may be more useful than an incomplete one with greater scope
• Estimate the amount of resources, such as money, person-hours, equipment, and expertise, that have historically been used by state bioweapon programs. Could include further breakdowns based on:
• Purpose of the spending (e.g. offensive vs defensive, strategic vs tactical vs assassination)
• Technical focus of the work (e.g. on pathogens themselves vs on delivery systems)
• Nature of the pathogens (e.g. contagious vs non-contagious, targeting humans vs agriculture).
• Estimate the future fraction of resources spent in bioweapons programs devoted to contagious vs non-contagious weapons.
• Perhaps one could analyze cyberweapon development, comparing the fraction of targeted weapons to those that are designed to create widespread economic havoc.
• Estimating the fraction of military resources that get spent on ‘absolutely insane stuff’, e.g. mind control, slowing down the Earth’s rotation, or weapons that could be catastrophic and have very unclear military use (even if they were possible).
• Create a database of historical biological accidents
• Conduct large expert surveys asking which countries worldwide would seek nuclear weapons if they didn’t require rare materials, only cost $10 million (for the whole program), and required only 10 scientists
• A version of this, but asking how many countries would pursue an omnicidal ‘cobalt bomb’ for similar costs (both with and without the assumption that the regular $10 million nuke option is available)
• Perhaps also repeated for historical eras to get a larger ‘sample size’
• Quantitatively scope ‘fads in terrorism’ both in ideology and methodology. For example, analyzing the extent to which tactics like suicide bombing, vehicle ramming, plane hijacking, etc. ‘took off’ after one or two successful demonstrations, or the extent to which ISIS inspired lone wolves.
• Create a database of the most impressive technical feats accomplished by terrorist groups, or non-state actors such as criminal gangs (e.g. bank heists, drug smuggling with submarines, etc.)
• Quantitatively estimate how likely the taboo on biological weapons is to totally collapse, perhaps based on past taboos collapsing
• Estimate the rate at which terrorist groups become compromised by government surveillance, destroyed, or disbanded
• Survey experts to assess the likelihood that a military will follow omnicidal orders or other catastrophic actions in various situations, such as a nuclear first strike, or in response to a nuclear attack
• Estimate the rate at which state secrets, such as information on biological or nuclear weaponry, are leaked. This includes both information about the existence of programs, and also leaks of technical information, research, or blueprints.
• Create a database of actions committed by countries or para-state groups that strongly violate international norms and treaties (e.g. genocide, seeking of WMDs, sponsoring of terrorist attacks, violating arms control)
• Estimate the fraction of individuals who, if given the opportunity, would choose to commit very destructive acts
• Estimate the number of biologists worldwide with various different technical skills, and their level of access to funds and equipment
• Estimate the fraction of motivated people who could spend years of their time doing something “impressive” by themselves (e.g. building a complicated technical item like a nuclear reactor, or hacking a secure target without being traced).
• Forecast the size and budgets of major biotech industry players, by country, company, and/or specific R&D focus
• Model the probability that a regime develops and/or deploys biological weapons. This might entail:
• Making a database of countries under strong “existential pressure” (real or perceived), and investigating which did and did not seek deterrence of a similar nature.
• Enumerating historical dictatorships to categorize their decision-making, particularly with regard to acquiring WMDs or committing atrocities.
• Creating a historical database getting at the question of what fraction of wars have at least one faction that would ‘take the world with them’ if given the opportunity (e.g. Hitler in bunker scenarios).
• Note that negative examples may be very informative: cases in which there may have been strong pressure to develop or use WMDs or other deplorable strategies, but warfare stayed conventional (Saddam Hussein in 2004, or Ukraine in 2022).

We are interested in proposals of any length or scope, ranging from a full time 4-6+ month commitment to a small 10-hour project. In some instances, we might respond to a proposal by suggesting a closer ongoing collaboration.

## How do I apply?

Applications are via this Google form, and are due on Sunday, June 5th, at 11:59 pm PDT. You’ll be required to submit:

• CVs of any project team members
• A research proposal, up to two pages, outlining what you would like to investigate and why. This should include a rough estimate of the project timeline and a budget proposal to account for your time along with any project costs.
• If applying as an organization, information about your research organization. Organizations can submit multiple separate proposals if desired; please use one application but keep the budgets separate for each project in the budget document.

We expect to fund between $500,000 and $2 million worth of proposals, depending on the quality and scope of proposals. In exceptional circumstances, we could expand this amount substantially.

If you would like to provide anonymous or non-anonymous feedback to Open Philanthropy’s Biosecurity & Pandemic Preparedness team relevant to this project, please use this form.

## Acknowledgements

Thank you to Carl Shulman for initially suggesting this research approach and providing comments. We appreciate additional comments/advice from many others, particularly Chris Bakerlee, Rocco Casagrande, and Gregory Lewis.

## New grantmaking program: supporting the effective altruism community around Global Health and Wellbeing

We are searching for a program officer to help us launch a new grantmaking program. The program would support projects and organizations in the effective altruism community (EA) with a focus on improving global health and wellbeing (GHW).

### Background

We have an existing program in effective altruism community growth. Like our new program, the existing program is focused on community building — increasing the number of people working to do as much good as possible with their time and resources, and helping them to work more effectively. However, the existing program evaluates grants through the lens of longtermism, focusing on projects that aim to raise the chance of a very long-lasting and positive future (including by reducing risks from existential catastrophes).

By contrast, the new program will focus on projects related to areas in our GHW portfolio, which is focused on improving health and wellbeing for humans and animals around the world. This portfolio currently includes our work on global health and development, farm animal welfare, global aid policy, South Asian air quality, and scientific research; our cause prioritization team is also actively working to identify additional cause areas.

### Why we’re looking for a program officer

We view the EA community as pursuing a similar endeavor to Open Philanthropy; people in the community aim to do as much good as possible, and consider a broad range of approaches. We think there are many projects within this community that don’t fit the longtermist focus of our existing program, but do have an expected impact above the “bar” we use to evaluate GHW grants.

We’ve already made some grants to EA projects that were highly promising from a GHW perspective, including Charity Entrepreneurship (supporting the creation of new animal welfare charities) and Founders Pledge (increasing donations from entrepreneurs to outstanding charities).

However, hiring a program officer to focus on this category will give us the capacity to significantly expand our grantmaking, develop our knowledge and strategy in this area, and evaluate the progress of our grantees.

We’re looking to hire someone who is very familiar with the EA community, has ideas about how to grow and develop it, and is passionate about supporting projects in global health and wellbeing. To see more details and apply, visit our job description.

### How the program will operate

Our founding program officer will have at least $10 million in available funds to allocate in their first year, and funding could grow significantly from there depending on the volume of good opportunities they find.

The new program will not impact our funding for EA community projects with a longtermist focus, and we don’t expect to reduce our grantmaking in that area.

While both EA programs will operate independently, they may co-fund grants when there is a strong case for impact as assessed through both our GHW and longtermist frameworks.