## Does putting kids in school now put money in their pockets later? Revisiting a natural experiment in Indonesia

Open Philanthropy’s Global Health and Wellbeing team continues to investigate potential areas for grantmaking. One of those is education in poorer countries. These countries have massively expanded schooling in the last half century, but many of their students still lack basic numeracy and literacy.

To support the team’s assessment of the scope for doing good through education, I reviewed prominent research on the effect of schooling on how much children earn after they grow up. Here, I will describe my reanalysis of a study published by Esther Duflo in 2001. It finds that a big primary schooling expansion in Indonesia in the 1970s caused boys to go to school more — by 0.25–0.40 years on average over their childhoods — and boosted their wages as young adults, by 6.8–10.6% per extra year of schooling.

I reproduced the original findings, introduced some technical changes, ran fresh tests, and thought hard about what is generating the patterns in the data. I wound up skeptical that the paper made its case. I think building primary schools probably led more kids to finish primary school (which is not a given in poor regions of a poor country). I’m less sure that it lifted pay in adulthood.

Key points behind this conclusion:

• The study’s “margins of error” — the indications of uncertainty — are too narrow. The reasons are several and technical. I hold this view mostly because, in the 21 years since the study was published, economists including Duflo have improved collective understanding of how to estimate uncertainty in these kinds of studies.
• The reported impact on wages does not clearly persist through life, at least according to a method I constructed to look for a statistical fingerprint of the school-building campaign.
• Under the study’s methods, normal patterns in Indonesian pay scales and the allocation of school funding can generate the appearance of an impact even if there was none.
• Switching to a modern method which filters out that mirage also erases the statistical results of the study.

My full report is here. Data and code (to the extent shareable) are here.

## Background

The Indonesia study started out as the first chapter of Esther Duflo’s Ph.D. thesis in 1999. It appeared in final form in the prestigious American Economic Review in 2001, which marked Duflo as a rising star. Within economics, the paper was emblematic of an ascendant emphasis on exploiting natural experiments in order to identify cause and effect (think Freakonomics).

Here, the natural experiment was a sudden campaign to build tens of thousands of three-room schoolhouses across Indonesia. The country’s dictator, Suharto, launched the big push with a Presidential Instruction (Instruksi Presiden, or Inpres) in late 1973, soon after the first global oil shock sent revenue pouring into the nation’s treasury. I suspect that Suharto wanted not only to improve the lot of the poor, but also to consolidate the control of his government — which had come to power through a bloody coup in 1967 — over the ethnically fractious population of the far-flung and colonially constructed nation.

I live near the Library of Congress, so I biked over there to peruse a copy of that 1973 presidential instruction. It reminded me of James Scott’s Seeing Like a State, which is about how public bureaucracies impose homogenizing paradigms on the polities they strive to control. After the legal text come neat tables decreeing how many schools are to be built in each regency. (Regencies are the second-level administrative unit in Indonesia, below provinces.) After the tables come pages of architectural plans, like the one at the top of this post.

The instruction even specifies the design of the easels, chairs, and desks. Here’s a desk:

Sure enough, if you search Google images for “Inpres Sekolah Dasar” (Inpres primary school), you’ll find those schools and those desks (source):

The Inpres campaign doubled the stock of primary schools in the country in just six years. Economists call that a “schooling shock”.

## Methods and results

The Duflo study looks for reverberations of this educational earthquake in data from a household survey that the Indonesian government fielded in 1995. By 1995, the first kids who went to the new schools had grown up and started working. The study examines whether boys with more opportunity to attend a new school, by virtue of how young they were and where they lived, actually went to school more and then earned more. (The study restricts to men because they more uniformly engage in paid employment or self-employment across their careers, which enhances comparability across age groups. Separately, Duflo studied effects on girls.)

To perform this calculation, the study takes difference-in-differences. It looks not at whether men from regencies that got more schools earned more—a difference—but whether the pay differential between young and old men in 1995 was narrower for natives of regencies that got more schools, which is a difference in differences. (More precisely, wages are taken in logarithms, so the “pay gap” is a ratio.)

Why look at that? Regencies ranged along a spectrum in how many new schools they got per child. To understand the study’s theory of measurement, I like to split the spectrum into regencies that got fewer schools and those that got more. If the Inpres schools did increase future pay, here’s how the world would look in this framing:

| Regencies getting fewer schools per child in 1970s | Regencies getting more schools per child in 1970s |
| --- | --- |
| Older natives too old to have gone to new schools | Older natives too old to have gone to new schools |
| Fewer young natives could have gone to the schools | More young natives could have gone to the schools |
| Fewer young natives get pay boost in adulthood | More young natives get pay boost in adulthood |
| Larger young-old pay gap among natives in 1995 | Smaller young-old pay gap among natives in 1995 |

The bottom line (as it were): natives of places that got more schools in the 1970s would exhibit a smaller young-old pay gap in 1995. That is the correlation that the Duflo study looks for…

…and finds. The study (Table 4, panel A) calculates that each additional planned Inpres school, per 1,000 children in a regency, increased boys’ future wage earnings by about 1.5%. The 1.5% number pertains to employees, meaning people who work for other people. (The government surveyors in 1995 didn’t ask self-employed people, including farmers, how much they earned, so they fall out of this analysis.)

By the same method, the study calculates that those future workers spent a fifth of a year more in school during childhood for each Inpres school built per 1,000 children. I think of that finding as an extra year in school for every fifth boy.

If an extra fifth of a year of schooling bumped wages by an average of 1.5%, then a full year would have increased them by about 5 × 1.5% = 7.5%.
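A minimal sketch of that arithmetic may help. The wage and schooling numbers below are invented for illustration, not values from the survey; only the structure of the comparison follows the study:

```python
# Hypothetical mean log hourly wages in 1995, by regency group and cohort.
# (Illustrative numbers only -- not taken from the 1995 survey.)
fewer_schools = {"old": 6.50, "young": 6.20}   # fewer Inpres schools per child
more_schools  = {"old": 6.40, "young": 6.13}   # more Inpres schools per child

# The young-old pay gap within each group (in log points).
gap_fewer = fewer_schools["young"] - fewer_schools["old"]   # about -0.30
gap_more  = more_schools["young"]  - more_schools["old"]    # about -0.27

# Difference in differences: how much *smaller* is the young-old gap
# where more schools were built?
did_wages = gap_more - gap_fewer   # about +0.03, i.e. a ~3% wage effect

# If the same comparison applied to years of schooling gives, say,
# +0.2 years, the implied return per year of schooling is the ratio.
did_schooling = 0.2
return_per_year = did_wages / did_schooling   # about 0.15, i.e. ~15% per year

print(round(did_wages, 2), round(return_per_year, 2))
```

In the study itself these gaps come out of a regression with controls rather than simple cell means, but the logic of the comparison is the same.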

## Association and causation

That 7.5% payoff rate for a year in the classroom is known as the “return to schooling”. Economists have estimated it thousands of times using data from various times and places. Yet among all those estimates, Duflo’s stands out. It comes from the developing world, which is where most people live. It comes from a big schooling expansion, which adds realism if you’re interested in national-level education policy. And the use of difference-in-differences gives the study a certain rigor, for it rules out some potential critiques. Few other studies can check all those boxes (though some can — see this from Kenya or this from India).

To expand on that last strength: If the Duflo study had only computed differences, then, for example, a simple finding that men from regencies that got more schools earned more, if presented as evidence of impact, could be easily challenged. Maybe everything just costs more — and everyone earns more — in the megalopolis of Jakarta; and maybe Jakarta, as the capital, got more schools per capita. Then we would not need to believe that Inpres schools made a difference in order to explain why men from regencies that got more schools earned more. On the other hand, if urban inflation raised everyone’s wages within Jakarta by the same amount, then the young-old pay gap would be the same in Jakarta and beyond. It would not be misleadingly associated with the number of schools each regency got. And that, as I said, is what the Duflo study actually checks.

Notice that “if urban inflation…” in the previous paragraph. Despite the rigor of difference-in-differences, you still need to assume something nontrivial about the world in order to fully buy the study’s findings.

Fortunately, the Duflo analysis contains a potentially more compelling basis for proving impact. It has to do with timing. (Duflo (2004, p. 350) sees an additional virtue in this natural experiment: “Identification is made possible because the allocation rule for schools is known (more schools were built in places with low initial enrollment rates).” But an allocation rule is no less endogenous for being known. And Duflo (2001, Table 2) shows that the non-enrollment rate was a secondary correlate of allocation. It matters even less after data corrections; see figure 1 of my write-up.) Think of the opportunity to go to one of the new Inpres schools as a medicine. The dose of educational opportunity depended on kids’ ages. Approximately speaking, those 12 and up in 1974 were too old to get any of this schoolhouse-shaped drug, for they had aged out before any new schools got built. Kids who were 11 in 1974 could get a one-year dose before aging out, at least if they lived near one of the new schools. Kids who were 10 could get a two-year dose. And so on. Because every year more neighborhoods and villages got new schools, well into the 1980s, the average schooling opportunity continued rising for younger and younger kids.

So the graph of Inpres schooling opportunity looks like this:

If we found a similar bend around age 12 in other data, such as on earnings in 1995, that would look like the fingerprint of Inpres carrying through, from cause to effect. And that is exactly what the Duflo study suggests happened, if with statistical noise. This graph is from the study:

Each dot in this graph is a measurement of the association framed in that table I showed above, between the young-old gap in schooling or pay and how many schools a regency was to receive per child. In that framing, we expect no association for the oldest men in the study, for all were too old to have gone to the new schools. But it should start to emerge—the dots should start to rise—as we scan to men who were 12 or younger in 1974. Duflo wrote:

> These coefficients fluctuate around 0 until age 12 and start increasing after age 12. As expected, the program had no effect on the education of cohorts not exposed to it, and it had a positive effect on the education of younger cohorts.

Looking at that graph I wondered: do the trends really bend around age 12? Or should they be seen as straight? Because of the noise, neither characterization completely nails it; the question is whether one model clearly out-fits the other. If the overall trends were straight and long-term, perhaps they had little to do with Inpres. Just as in my reanalyses of Hoyt Bleakley’s studies of hookworm and malaria eradication, I set out to probe this question with a mathematical test.

## Starting the reanalysis

I started my quest with a request for the study’s data and computer code. Ironically, Duflo is now the editor of the journal that published her paper. That puts her in charge of enforcing the data- and code-sharing policy that applied to her study. Sure enough, she promptly sent me files for reproducing most of the results. (Duflo’s license to the 1995 survey data did not permit her to share it. But through the gratefully appreciated assistance of Daniel Feenberg, I indirectly accessed the copy licensed by the NBER. Separately, IPUMS International hosts a large subset.)

Once I had anchored myself in exact reproduction, I made changes to the code. Most owe to the passage of time: methods in empirical economics have improved since 2001, and Indonesian men of the generation in the Duflo study have continued tracing their way through life (and through government survey data).

While my biggest question going in was about timing, I stumbled on another first-order issue: an alternative explanation for the numerical findings.

I’ll explain a few technical concerns first, as non-technically as I can, then move to that alternative explanation and the search for bends in trends.

### Data corrections

Some numbers in the Duflo study come from government documents published in the 1970s: presidential instructions and reports on Indonesia’s 1971 census. At the Library of Congress, I scanned pages in these books and double-checked the numbers Duflo sent me. In my experience, it is normal for such a check to expose errors, and normal for them not to affect conclusions much—as happened here. For about a tenth of regencies, my figures for planned schools per 1,000 children differ from Duflo’s. (See my GitHub repo.)

### Clustering

It’s a truism that the larger your sample, the more precise your statistics. The margin of error is tighter if you poll 1,000 people than if you poll 10. But margins of error must themselves be estimated, and determining the effective sample size for this purpose is often a head-scratcher. Should we view a study of the impact of state air pollution rules on asthma rates as being about 50 states or, say, 50 million people? The answer can radically affect how precise we take the results to be. One rule of thumb: the effective sample size is the number of treatment units. There are 50 states, with 50 air pollution laws, so 50 is your number, not 50 million.

In a striking turnabout, soon after finalizing the Indonesia study, Duflo coauthored a paper raising doubts about the methods she had just used: “How Much Should We Trust Differences-in-Differences Estimates?” This new paper was not purely destructive, for it demonstrated the value of a particular mathematical correction, called clustering, which allows one to crunch data on millions of individuals while computing margins of error as if the sample is much smaller. Under the influence of that paper, in returning to the Indonesia study, I cluster standard errors by regency. This widens confidence ranges by a factor of two or three.
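A toy illustration of why clustering widens the margins of error. This deliberately extreme example is mine, not the estimator used in the reanalysis: when outcomes are identical within a regency, 300 survey rows carry only three independent pieces of information, and a naive standard error of the mean badly overstates precision.

```python
import math

# Toy data: 3 regencies ("clusters"), 100 survey respondents each.
# Within a regency everyone has the same outcome, so the 300 rows
# carry only 3 independent observations. (Invented numbers.)
clusters = {"A": [1.0] * 100, "B": [2.0] * 100, "C": [3.0] * 100}
values = [v for vs in clusters.values() for v in vs]
n = len(values)
mean = sum(values) / n

# Naive standard error of the mean: treats all 300 points as independent.
var = sum((v - mean) ** 2 for v in values) / (n - 1)
se_naive = math.sqrt(var / n)

# Cluster-robust standard error of the mean: sums residuals within each
# cluster first, so perfectly correlated observations don't shrink the SE.
cluster_sums = [sum(v - mean for v in vs) for vs in clusters.values()]
se_cluster = math.sqrt(sum(s ** 2 for s in cluster_sums)) / n

print(round(se_naive, 3), round(se_cluster, 3))  # 0.047 vs 0.471
```

Real data are not perfectly correlated within regencies, which is why clustering widens the confidence ranges here by a factor of two or three rather than ten.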

### Overrepresentation of wealthy families

Governments run many surveys (to track how much people work, how healthy they are, how much they pay for housing, etc.). Some surveys are censuses, which ideally entail knocking on everyone’s door, and even reaching the people who don’t have doors. But finding all those people, asking them lots of questions, and collating the answers all costs money. This is why most surveys, like polls, take samples.

As soon as one gives up on surveying everyone, the question arises: what is the best way to allocate surveying resources to get the most accurate statistical picture? Often, it is not to take a plain random sample, as when pollsters dial random phone numbers. It can be better to split the sample into strata — urban and rural, rich and poor. If some strata are known from censuses to be more homogeneous, then governments can get more precision for the money by sampling those strata less and others more. In a history of one of Indonesia’s national surveys, Parjung Surbakti explains it well:

> The fact that an orange taken from a truckload of oranges all coming from the same orchard is sweet, gives adequate evidence to conclude that all the oranges in the truck are sweet. In this example, a very small sample size can provide an accurate conclusion about a large population when the population is homogeneous. It would be a different story if the oranges came from a number of orchards and consisted of different varieties. Then a sample of size 10 might not give as accurate a conclusion as that of the previous example. However, if the truckload of oranges can be sorted by varieties, i.e., the population is stratified, then sampling once again may be made more efficient.

It seems that in Indonesia, poorer people are thought to be more like oranges from the same orchard. For the government surveyors disproportionately visit wealthy households, where wealth is indicated by possessions such as toilets and diplomas.

The Indonesia survey data used in the Duflo study are accompanied by weights to document the oversampling of some groups. They indicate, say, that each surveyed household with a toilet stands for 100 others while each without stands for 200. However, the Duflo study mostly does not incorporate these weights. (This is not documented in the text but is made plain in the code.) As a result, wealthier people are overrepresented.

Whether such weights should in general be factored in is a confusing question, so much so that three respected economists wrote “What Are We Weighting For?” to dispel their colleagues’ befuddlement. Here, my concern is that the data are being tilted on the basis of the outcomes of interest. People with more education and higher incomes were more likely to get a knock on the door from a surveyor in 1995 and thus to appear in the Duflo analysis. Imagine a study of the impact of smoking in which people who live are oversampled at the expense of people who die. That would make smoking look safer than it is.

That is why I prefer to incorporate the weights in the Indonesia analysis. For technical reasons, this not only shifts the impact estimates, but further widens margins of error. (For intuition, imagine concentrating all weight on a few observations. This effectively slashes the sample size.)
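Both effects of weighting can be seen in a tiny made-up survey. The wages and weights below are invented; the weight says how many households each respondent stands for:

```python
# Toy survey in which wealthy households were oversampled.
# (Invented wages and weights, for illustration only.)
sample = [
    (4000, 100),  # wealthier respondent: stands for 100 households
    (4000, 100),  # wealthier respondent: stands for 100 households
    (600,  200),  # poorer respondent: stands for 200 households
]

# Ignoring the weights overrepresents the wealthy...
unweighted_mean = sum(wage for wage, _ in sample) / len(sample)

# ...while weighting recovers the population average.
weighted_mean = (sum(wage * wt for wage, wt in sample)
                 / sum(wt for _, wt in sample))

# Kish effective sample size: unequal weights also shrink the
# information content of the sample, which widens margins of error.
n_eff = (sum(wt for _, wt in sample) ** 2
         / sum(wt ** 2 for _, wt in sample))

print(round(unweighted_mean), round(weighted_mean), round(n_eff, 2))
```

Here the unweighted mean overstates the population's average wage, and the three unequally weighted observations carry the information of fewer than three equally weighted ones.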

### Instability of ratios

I quoted the estimate that the Inpres campaign raised wages by 7.5% per year of extra schooling. That is a ratio: a 1.5% wage boost divided by 0.2 years (a fifth of a year) of extra schooling. Because the numbers going into the ratio are themselves averages from samples of Indonesians, each comes with its own margin of error. The true value of the schooling increase in the full population might be 0.3 years or 0.1 years — or 0.0. And if, as far as the math goes, there’s a nontrivial chance that Inpres led to zero additional years of schooling, then there’s a nontrivial chance that the ratio of wage increase to schooling increase is infinite.

The point is not that I think the return to schooling could be infinity (or negative infinity), but that ratios emerging from this sort of analysis can range wildly. Standard methods for computing margins of error can underestimate this uncertainty.
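To see the instability concretely, hold the wage effect at the study's point estimate and let the schooling effect slide toward zero, as its confidence interval allows (the alternative schooling values below are hypothetical):

```python
wage_effect = 0.015      # ~1.5% wage boost per planned school per 1,000 children
schooling_effect = 0.20  # +0.20 years of schooling, the point estimate

point_ratio = wage_effect / schooling_effect   # about 0.075: the 7.5% return

# If the schooling effect's confidence interval reaches toward zero,
# the implied return per year of schooling explodes.
for schooling in (0.30, 0.20, 0.10, 0.05, 0.01):
    print(f"schooling effect {schooling:.2f} -> implied return {wage_effect / schooling:.3f}")
```

A denominator that cannot be statistically distinguished from zero means the ratio's confidence interval can stretch toward infinity, which is exactly what methods like the Anderson-Rubin test are built to acknowledge.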

Since Duflo wrote about Indonesia, economists have made a lot of progress in recognizing and working around this devil in the details, which is called “weak identification”. In my reanalysis, I marshal a modern method called the wild-bootstrapped Anderson-Rubin test, which happens to be performed by a cool program I wrote. Like the clustering and weighting corrections, the new method widens the uncertainty bands around the estimated return to schooling.

### Bottom line after incorporating the technical comments

After I fix data errors, cluster, and compensate for the oversampling of wealthy households, it is surprisingly unclear whether Inpres caused boys to spend any more time in school. And because dividing by a number that is hard to distinguish from zero produces unstable results, the impact on wages per extra year in school is even less clear. Where Duflo brackets that 7.5% schooling return rate with a 95% confidence range of 1–15%, I widen to a huge span, –44% to +164%. (Here, I express these results as percentages rather than log points, i.e., as exp(x) – 1 where x is a primary statistical result.)

To be fair, that wide range can mislead. My 70% confidence range is 0–23%. I conclude that incorporating the technical comments into the core Duflo analysis leaves it weakly favoring the view that Inpres-stimulated schooling raised wages.

## The alternative explanation: wage scale dilation

As I wrangled with those technicalities and worked to answer my original question about trend bending, I discovered another reason to doubt the Duflo study’s results. And once I did, I realized that Clément de Chaisemartin and Xavier D’Haultfœuille had already pointed to the heart of the issue. It turns out that some more mundane patterns in the data, when fed into the difference-in-differences machine, can produce the same statistical results.

Here are some universal truths, or at least as close as you get to that in economics:

1. Those who go to school more earn more, on average.
2. The earnings gap between the more- and less-schooled rises with age.

As an example of the second point, in the 1995 Indonesia data, the average employed, college-educated 21-year-old man earned 744 rupiah per hour, only 18% more than the 633 rupiah earned by a contemporary who dropped out of school before fourth grade. But at age 61, in the same data, the hourly pay for the primary school dropout was basically unchanged, 642 rupiah, while pay for the college graduate was more than six times higher than at age 21, at 4,852 rupiah.

This graph shows more fully how the wage scale widened with age among employed Indonesian men in 1995:

I imagine the people behind the bottom curve as farm workers whose pay had little to do with age. And I imagine that the top curve traces how the elite ascend the ranks of big corporations and government.

The widening of the wage scale feeds into the Duflo study in a mind-bending way. Suppose (correctly) that poorer regencies — the ones that produced more day laborers and fewer doctors and lawyers — received more Inpres schooling funding per child. Then we would see:

| Regencies getting fewer schools per child in the 1970s | Regencies getting more schools per child in the 1970s |
| --- | --- |
| Natives better off on average | Natives poorer on average |
| More kids grow up to be CEOs | More kids grow up to be day laborers |
| Average pay rises a lot during career | Average pay rises little during career |
| Larger young-old pay gap among natives in 1995 | Smaller young-old pay gap among natives in 1995 |

This table starts and ends in the same places as the earlier table: getting more schools means a smaller young-old pay gap. But it goes by a different route. Nowhere does the new scenario assume or require that the Inpres school-building campaign had any effect. (Formally, I am suggesting a violation of the parallel trends assumption required for causal interpretation of difference-in-differences results.) Thus, the study’s methods could lead to the conclusion that Inpres schools raised wages even if they did not.

You might push back against my skepticism: I’m undercutting an argument that schooling increases earnings by invoking the universal truth, confirmed in the graph above, that pay and education go hand in hand — which itself seems like powerful evidence that education increases earnings!

To which I reply: The Duflo study strives not merely to prove that education raises wages, but to measure the impact more sharply. It invokes a natural experiment to remove sources of statistical bias, such as my urban inflation hypothetical. It matters whether the natural experiment is working as intended.

## Fresh findings

To probe whether wage scale dilation is generating the Duflo study’s results — and to return to the search for bent trends — I pursued three strategies:

• As foreshadowed, I tested mathematically whether the trend really bends in that Duflo graph I showed earlier. The wage scale dilation theory amplifies the importance of this check: I have no reason to think that wage scale dilation suddenly kicks in at a particular age, so a clear bend in the trend of those differences-in-differences would favor the Duflo explanation as laid out in that first table above.
• I deployed a newer statistical method called changes-in-changes, which should be immune to wage scale dilation.
• I followed up later on the same generation of Indonesian men, in data from 2005, 2010, and 2013–14 (a selection dictated by whether the surveys asked the needed questions and whether the answers are publicly available). One reason was to see whether the reported link between schooling and earnings was consistent over men’s careers, or a one-off in 1995.

To convey the results, I’ll show you some graphs. All are constructed like that Duflo graph I showed before. But I’ve incorporated the technical fixes, such as correcting data errors, and added some visual elements.

First comes my update of the “education” contour in the Duflo graph. Again, a precise statement of the meaning of the dots — blue in mine, black in the Duflo graph — is a mouthful. Each shows how much the young-old gap in total years spent in school, among men of a particular age, was associated with how intense the Inpres program was in their home regency. (More precisely, each dot shows, for men who were a particular age in 1974, how much their total years spent in school increased for each additional Inpres school per 1,000 children in their native district, relative to the benchmark group, here taken to be those aged 2 in 1974. The sample is restricted to wage earners.) Around the dots I added gray bands to depict 95% confidence intervals. (Figure 1 of the Duflo study also shows confidence intervals, but for the schooling contour only.) They remind us that because of noise in the data, each dot could have landed a bit lower or higher than it does. And I fit a line to the data, in red, while allowing it to kink at age 12:

The schooling trend hardly bends. From the standpoint of this search for the fingerprint, it’s not clear that building Inpres schools contributed to rising school attendance. A statistical test for whether the slopes of the two red segments differ returns a p-value of 0.60, which I printed in the upper left. That high probability means that a bend this small could easily have happened by chance, because of statistical noise, if the true line did not bend at all. (The same test applied to the uncorrected original returns a p-value of 0.09. See figure 6 of my write-up.)
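For the curious, here is the skeleton of such a bend test: regress the cohort estimates on age in 1974 plus a "hinge" term that turns on below age 12, so the hinge coefficient measures the change in slope. This is a noiseless toy version with made-up numbers; the real test works on estimated coefficients, with clustering supplying the p-value:

```python
def fit_ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gaussian elimination."""
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    for i in range(k):                       # forward elimination with pivoting
        p = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        b[i], b[p] = b[p], b[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            A[r] = [a - f * c for a, c in zip(A[r], A[i])]
            b[r] -= f * b[i]
    coef = [0.0] * k
    for i in reversed(range(k)):             # back substitution
        coef[i] = (b[i] - sum(A[i][j] * coef[j] for j in range(i + 1, k))) / A[i][i]
    return coef

ages = list(range(2, 25))                    # ages in 1974, as in the Duflo graph
# Toy cohort effects: zero for those 12 and up in 1974, rising 0.05 per
# year of extra exposure for younger kids. (Invented, noiseless data.)
effects = [0.05 * max(0, 12 - a) for a in ages]

# Design matrix: intercept, linear age trend, and hinge below age 12.
X = [[1.0, float(a), float(max(0, 12 - a))] for a in ages]
intercept, slope, slope_change = fit_ols(X, effects)
print(round(slope, 6), round(slope_change, 6))  # slope 0, slope change 0.05
```

With real, noisy estimates the question becomes whether `slope_change` differs from zero by more than its (clustered) standard error allows, which is where the printed p-values come from.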

Now, the Inpres program built primary schools. So did it at least get more kids to finish primary school? Quite possibly. In the next graph, the vertical axis pertains to the share of workers in 1995 who had finished primary school rather than the total years they spent in school. Now the trend more clearly accelerates around age 12. The p-value for the bend is reassuringly low, at 0.01:

It’s a strange pair of findings: boys finished primary school more but didn’t go to school more? I think the first finding is closer to the truth. The surveyors in 1995 didn’t actually ask people how many years they went to school, but rather the highest grade they attended. Probably, when the schools were first built, some kids who were officially too old to attend them went anyway, rather than going to junior high schools farther away. (The primary school gross enrollment ratio, which is the ratio of the number of kids attending primary school to the number of official primary school age, temporarily surpassed 100% in the 1980s. See Suharti, “Trends in education in Indonesia,” figure 2.5.) Even if they spent exactly the same number of years in school, the study would have coded them as having spent fewer, since on paper they only got as far as sixth grade rather than seventh or eighth. (In fact, the Duflo paper, page 804, finds a slight fall in secondary school attainment.)

If Inpres at least got more boys through primary school, did that suffice to raise their pay in adulthood? The next graph gets at that possibility by switching the vertical axis to wages. Again, the trend bends up, with a reassuringly low p-value:

If we assume that the deflection in the primary schooling trend caused the deflection in the wage trend, then we can divide the second by the first to gauge the rate of impact. Unfortunately, the first (the increase in primary school completion) still does not differ from zero with enough certainty to stabilize the ratio. In my paper (Table 7, panel B, column 2) I calculate that finishing primary school changed wages in adulthood by somewhere between –12% and +∞, as a 95% confidence range. (The statistical method cannot rule out with 95% confidence that the Inpres schooling campaign had zero effect on the rate of primary school completion, and thus that the impact on wages per unit of gain in primary schooling was infinite.)

Another source of doubt: when I checked on the same generation of men later in life, the upward bend in the wage trend didn’t persist as strongly. In 2005 (when the men aged 2–24 in 1974 had reached ages 33–55) and 2010 (ages 38–60), the line bends slightly downward. (For reasons of data availability, wages are defined differently in different years. In 1995 and 2010, they are the log hourly wage for wage workers. In 2005, they are log hourly wages as imputed from a model calibrated to 1995 data. In 2013–14, they are log typical monthly pay from all sources, …) In 2013–14 (ages 41–64), it bends more significantly upward. Rather than showing you all of those here, I’ll average them in a single graph. (The regressions behind this plot pool the data from all post-1995 follow-ups. The dependent variable is the one defined within each survey sample. Year dummies are added as controls.)

The red line bends upward, with a p-value of 0.23. I’m not one to mechanically dismiss a finding as “insignificant” when the p-value exceeds 0.05. At face value, p = 0.23 means there’s less than a 1-in-4 chance of a bend this big in the data if the true pattern is no bend at all. On the other hand, I could have put the finding to an even more rigorous test by including more of the control variables used in the Duflo study. (One of these, Inpres water and sewer spending, could plausibly generate a trend break.)

Separately, I ran the changes-in-changes method I mentioned, the one that should be immune to wage scale dilation. This approach finds no wage boost from Inpres.

## Conclusion

To recap:

• A representative result from the Duflo study is that Inpres-stimulated schooling increased future wages of boys by 7.5% per year of school, with a 95% confidence range of 1–15%.
• Technical adjustments widen that range hugely. The main reason is that it is surprisingly unclear whether the Inpres school construction led boys to go to school more. Dividing any wage increase by a number that cannot be confidently distinguished from zero makes for instability.
• It is more plausible that the program caused more boys to finish primary school.
• There’s another way to explain why the study finds that Inpres increased adult earnings. It is rooted in two facts: over their careers, more-educated people see their pay rise more (wage scale dilation); and poorer regencies got more schools per child.
• The changes-in-changes method, which is in effect designed to rule out wage scale dilation as an explanation, finds no wage boost from Inpres.
• On the other hand, an apparent fingerprint of Inpres, the trend bend, holds up fairly well in the wage data of 1995 despite my technical tweaks. And wage scale dilation would not be expected to cause such a bend.
• The fingerprint persists weakly later in life.

The Duflo study concludes:

> The findings reported here are important because they show that an unusually large government-administered intervention was effective in increasing both education and wages in Indonesia.

I am confident that, in retrospect, that reading is overconfident. But I wouldn’t swing to the opposite extreme of no confidence. It seems more likely than not that building all those schools (and hiring all those teachers) got more kids into school. And the big push may have left light fingerprints in the wage numbers decades later. Meanwhile, it is conceivable that the conservatism of the changes-in-changes method, which makes it less prone to generating false positives, also makes it more prone to generating false negatives.

Still, the rate of return to Inpres-stimulated schooling — wage gains per additional unit of schooling — is quite unclear.

One’s judgment about whether basic education in developing countries is a good thing should not hinge solely on the answers emerging from this study, nor even on the questions it asks. It could be that Inpres schools indeed made a large difference in Indonesia, but that the “natural experiment” was just not strong enough for the signal to shine through the noise. Or — more likely — the problem is that, as Lant Pritchett puts it, schooling ain’t learning. Maybe the Inpres schooling campaign was better at getting kids behind desks than knowledge into their heads. If billions of kids are passing through school and not learning much, there is huge room for improvement.

Moreover, higher pay is not the only reason to send kids to school. As I write, Duflo and coauthors are using randomly allocated scholarships in Ghana — an artificial rather than natural experiment — to research a wide array of potential consequences of secondary schooling, for girls as well as boys. Do the girls go on to have fewer unwanted pregnancies? Do fewer of their children die in the first year of life?

I am struck by how often the findings from studies of “natural experiments” fray under stress. An appreciation for that fact may explain why, soon after completing her dissertation, Esther Duflo became a champion of running actual experiments, such as the scholarship experiment in Ghana. Discarding some of what she learned in school would eventually pay high returns. It made Duflo the second woman to receive a Nobel prize in economics. And it drove her profession to produce more credible research.

Footnotes

↑1 See Table 1, panel B, of the paper.
↑2 The study restricts to men because they more uniformly engage in paid employment or self-employment across their careers, which enhances comparability across age groups. Separately, Duflo studied effects on girls.
↑3 More precisely, wages are taken in logarithms, so the “pay gap” is a ratio.
↑4 Duflo (2004, p. 350) sees an additional virtue in this natural experiment: “Identification is made possible because the allocation rule for schools is known (more schools were built in places with low initial enrollment rates).” But an allocation rule is no less endogenous for being known. And Duflo (2001, Table 2) shows that the non-enrollment rate was a secondary correlate of allocation. It matters even less after data corrections (figure 1 of my write-up).
↑5 Duflo’s license to the 1995 survey data did not permit her to share it. But through the gratefully appreciated assistance of Daniel Feenberg, I indirectly accessed the copy licensed by the NBER. Separately, IPUMS International hosts a large subset.
↑6 This is not documented in the text but is made plain in the code.
↑7 For intuition, imagine concentrating all weight on a few observations. This effectively slashes sample size.
↑8 Here, I express these results as percentages rather than log points, i.e. as exp(x) – 1 where x is a primary statistical result.
↑9 Formally, I am suggesting a violation of the parallel trends assumption required for causal interpretation of difference-in-differences results.
↑10 Each dot shows, for men who were a particular age in 1974, how much their total years spent in school increased for each additional Inpres school per 1,000 children in one’s native district, relative to the benchmark group, here taken to be those aged 2 in 1974. The sample is restricted to wage earners.
↑11 Figure 1 of the Duflo study also shows confidence intervals, but for the schooling contour only.
↑12 The same test applied to the uncorrected original returns a p-value of 0.09. See figure 6 of my write-up.
↑13 The primary school gross enrollment ratio, which is the ratio of the number of kids attending primary school to the number of children of official primary school age, temporarily surpassed 100% in the 1980s. Suharti, “Trends in education in Indonesia,” Figure 2.5.
↑14 In fact, the Duflo paper (page 804) finds a slight fall in secondary school attainment.
↑15 The statistical method cannot rule out with 95% confidence that the Inpres schooling campaign had zero effect on the rate of primary school completion, and thus that the impact on wages per unit of gain in primary schooling was infinite.
↑16 For reasons of data availability, wages are defined differently in different years. In 1995 and 2010, they are the log hourly wage for wage workers. In 2005, they are log hourly wages as imputed from a model calibrated to 1995 data. In 2013–14, they are log typical monthly pay from all sources, including self-employment. The 2010 data have the disadvantage of being coded only by regency of workplace, not regency of birth.
↑17 The regressions behind this plot pool the data from all post-1995 follow-ups. The dependent variable is the one defined within each survey sample. Year dummies are added as controls.
↑18 One of these, Inpres water and sewer spending, could plausibly generate a trend break.

## Cause Exploration Prizes: Announcing our prizes

We were gratified to receive over 150 good-faith submissions to Open Philanthropy’s Cause Exploration Prizes, where we invited people to suggest a new area for us to support or respond to our suggested questions. We hoped that these submissions would help us find new ways to carry out our mission — helping others as much as possible with the resources available to us.

You can read them on the EA Forum. Below, we highlight the submissions to which we are awarding major prizes and honorable mentions.

We’re awarding these prizes to entries that we thought engaged well with our prompts and helped us to better understand the questions and issues they addressed. We have not investigated each and every claim made in these entries, and the awarding of a prize does not imply that we necessarily endorse their claims or arguments as correct.

### Future plans

As we stated in our announcement, this was a trial process for us. We’re grateful to those who sent us feedback and suggestions for how to improve. At this stage, we don’t know if or when we will repeat a process like this. We might write a public update later this year on what we have learned from this exercise and any plans to repeat this or a similar exercise again.

### Thank you

We are grateful to Lizka Vaintrob and the other operators of the EA Forum, and to everyone who engaged with or submitted an entry to the Cause Exploration Prizes for making this possible.

## New Shallow Investigations: Telecommunications and Civil Conflict Reduction

We recently published two shallow investigations on potential focus areas to the Effective Altruism Forum. Shallow investigations, which are part of our cause selection process, are mainly intended as quick writeups for internal audiences and aren’t optimized for public consumption. However, we’re sharing these two publicly in case others find them useful.

The default outcome for shallow investigations is that we do not move forward to a deeper investigation or grantmaking, though we investigate further when results are particularly promising.

If you have thoughts or questions on either of these investigations, please use this feedback form or leave a comment on the EA Forum.

## Telecommunications in Low and Middle-Income Countries (LMICs)

By Research Fellow Lauren Gilbert (EA Forum link)

• Lauren finds that expanding cellular phone and internet access appears to cost-effectively increase incomes. Randomized trials and quasi-experimental studies in LMICs showed that gaining internet access led to substantial increases in income, with high social returns on investment.
• We find these reported effects surprisingly large, and are continuing to dig into them more.
• Lauren estimates that 3-9% of the world’s population do not have access to cellular service, and ~40% of the world’s population either have no access to mobile internet or do not use it. Lauren finds that the biggest barrier to usage is the cost of devices and coverage. These coverage gaps and costs are shrinking over time.
• A large majority of spending on telecommunications is private/commercial, with a smaller amount of philanthropic spending. While the private investments are large, they aren’t as focused as a philanthropist might be on improving access for poor and rural communities.
• Philanthropists could potentially help improve access by subsidizing investments in cell phone towers to improve coverage, and in internet cables to reduce the cost of internet. Lauren’s rough back-of-the-envelope calculation suggests that these investments may be cost-effective. A funder could also potentially lobby for policy changes to reduce costs — for example, reducing tariffs on imported electronics or changing the rules around how spectrum can be licensed.

## Civil Conflict Reduction

Also by Lauren Gilbert (EA Forum link)

• Civil conflict is a very important problem. Lauren estimates that civil wars directly and indirectly cause the loss of around 1/2 as many disability-adjusted life years as malaria and neglected tropical diseases combined. Civil wars also substantially impede economic growth, mostly in countries that are already very poor.
• While civil conflict is important and arguably neglected, it isn’t clear how tractable it is. However, some interventions have shown promise.
• Lauren finds some evidence that UN peacekeeping missions are effective, and argues philanthropists could lobby for more funding.
• Some micro-level interventions, such as mediation or cognitive behavioral therapy, also have suggestive empirical evidence behind them. Philanthropists could fund more research into these interventions.

## Incoming Program Officer for Effective Altruism Community Building (Global Health and Wellbeing): James Snowden

Earlier this year, I wrote that Open Philanthropy was looking for someone to help us direct funding for our newest cause area:

We are searching for a program officer to help us launch a new grantmaking program. The program would support projects and organizations in the effective altruism community (EA) with a focus on improving global health and wellbeing (GHW) […] We’re looking to hire someone who is very familiar with the EA community, has ideas about how to grow and develop it, and is passionate about supporting projects in global health and wellbeing.

Today, I’m excited to announce that we’ve hired a Program Officer who exemplifies these qualities: James Snowden.

James spent his last 5+ years as a researcher and program officer at GiveWell; in the latter role, he led GiveWell’s work on policy and advocacy. Before GiveWell, he worked at the Centre for Effective Altruism and as a strategy consultant. He holds a B.A. in philosophy, politics and economics from Oxford University and an M.Sc. in philosophy and economics from the London School of Economics.

You can hear a sample of how James thinks in this podcast interview, and read some of his work on GiveWell’s blog.

## What James might work on

We’ve already made a few grants to EA projects that were highly promising from a GHW perspective, including Charity Entrepreneurship (supporting the creation of new animal welfare charities) and Founders Pledge (increasing donations from entrepreneurs to outstanding charities).

Areas of potential interest include:

• Increasing engagement with effective altruism in a broader range of countries
• Encouraging charitable donations to effective organizations working in GHW-related areas
• Incubating new ideas for highly impactful charities
• Creating resources to facilitate impactful career decisions within GHW

We still have a lot of growth ahead of us and will be expanding to start more programs in the coming months and years — check out our jobs page if you’re interested in helping drive that growth!

## Report on Social Returns to Productivity Growth

Historically, economic growth has had huge social benefits, lifting billions out of poverty and improving health outcomes around the world. This leads some to argue that accelerating economic growth, or at least productivity growth,[1] should be a major philanthropic and social priority going forward.[2]

I’ve written a report in which I evaluate this view in order to inform Open Philanthropy’s Global Health and Wellbeing (GHW) grantmaking. Specifically, I use a relatively simple model to estimate the social returns to directly funding research and development (R&D). I focus on R&D spending because it seems like a particularly promising way to accelerate productivity growth, but I think broadly similar conclusions would apply to other innovative activities.

My estimate, which draws heavily on the methodology of Jones and Summers (2020), asks two primary questions:

1. How much would a little bit of extra R&D today increase people’s incomes into the future, holding fixed the amount of R&D conducted at later times?[3]
2. How much welfare is produced by this increase in income?

In brief, I find that:

• The social returns to marginal R&D are high, but typically not as high as the returns in other areas we’re interested in. Measured in our units of impact (where “1x” is giving cash to someone earning $50k/year), I estimate that the cost-effectiveness of funding R&D is 45x. This is ~4% as impactful as the (roughly 1,000x) GHW bar for funding.
• Put another way, I estimate that $20 billion to “average” R&D has the same welfare benefit as increasing the incomes of 180 million people by 10% each for one year.
• That said, the best R&D projects might have much higher returns. So could projects aimed at increasing the amount of R&D (for example, improving science policy).
• This estimate is very rough, and I could readily imagine it being off by a factor of 2-3 in either direction, even before accounting for the limitations below.
• Returns to R&D were plausibly much higher in the past. This is because R&D was much more neglected, and because of feedback loops where R&D increased the amount of R&D occurring at later times.
• My estimate has many important limitations. For example, it omits potential downsides to R&D (e.g. increasing global catastrophic risks), and it focuses on a specific scenario in which historical rates of return to R&D continue to apply even as population growth stagnates.
• Alternative scenarios might change the bottom line. For instance, R&D today might speed up the development of some future technology that drastically accelerates R&D progress. This would significantly increase the returns to R&D, but in my view would also strengthen the case for Open Phil to focus on reducing risks from that technology rather than accelerating its development.
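As a sanity check on these units, the “45x” figure and the “$20 billion ≈ 180 million people × 10% for a year” equivalence line up under a standard log-utility assumption. This is my own back-of-the-envelope reconstruction, not the report’s actual model:

```python
import math

BASELINE = 50_000        # "1x" = $1 given to someone earning $50k/year
spend, ce = 20e9, 45     # $20B of R&D at the estimated 45x

# Value of the R&D spending, in "dollars to a $50k earner" equivalents:
value_rnd = spend * ce

# Value of raising 180M incomes by 10% for one year, assuming log utility:
# a 10% raise is worth ln(1.1) utils per person, and a marginal dollar at
# $50k is worth 1/50k utils, so divide to convert utils into "1x" units.
value_income = 180e6 * math.log(1.10) / (1 / BASELINE)

print(f"{value_rnd:.2e}  {value_income:.2e}")  # both are roughly 9e11
```

The two quantities agree to within about 5%, so the two framings in the summary are mutually consistent under this (assumed) utility function.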

Overall, the model implies that the best R&D-related projects might be above our GHW bar, but it also leaves us relatively skeptical of arguments that accelerating innovation should be the primary social priority going forward.

In the full report, I also discuss:

• How alternative scenarios might affect social returns to R&D.
• What these returns might have looked like in the year 1800.
• How my estimates compare to those of economics papers that use statistical techniques to estimate returns to R&D growth.
• The ways in which my current views differ from those of certain thinkers in the Progress Studies movement.

Footnotes

↑1 If environmental constraints require that we reduce our use of various natural resources, productivity growth can allow us to maintain our standards of living while using fewer of these scarce inputs.
↑2 For example: in Stubborn Attachments, Tyler Cowen argues that the best way to improve the long-run future is to maximize the rate of sustainable economic growth. A similar view is held by many of those involved in the Progress Studies community.
↑3 An example of an intervention causing a temporary boost in R&D activity would be to fund some researchers for a limited period of time. Another example would be to bring forward in time a policy change that permanently increases the number of researchers.

## Our 2022 Allocation to GiveWell’s Recommendations

Last year, we recommended $300 million of grants to GiveWell’s evidence-backed, cost-effective recommendations in global health and development, up from $100 million the year before. We recently decided that our total allocation for this year will be $350 million. That’s a $50 million increase over last year, significantly driven by GiveWell’s impressive progress on finding more cost-effective opportunities. We expected GiveWell to identify roughly $430 million of 2022 “room for more funding” in opportunities at least 8 times more cost-effective than cash transfers to the global poor. Instead, we estimate that GiveWell is on track to identify at least $600 million in such opportunities. However, due to reductions in our asset base, the growth in our commitment this year is smaller than the $200 million increase we had projected last fall.

The rest of this post:

• Reviews our framework for allocating funding across opportunities to maximize impact. (More)
• Discusses how changes in asset values influence the appropriate distribution of funding across our program areas. (More)
• Explains how we chose this year’s allocation to GiveWell, given our asset changes and GiveWell’s significant progress in finding more cost-effective opportunities. (More)
• Shares Alexander’s personal thoughts on why GiveWell seems like an unusually compelling opportunity for individual donors this year. (More)

## Our framework for allocating funding across opportunities

We want to help others as much as we can with the resources available to us. When choosing how much funding to allocate — whether to GiveWell’s recommendations, or to other areas in global health and wellbeing — we think about how the choice will affect our funding options in future years.
If we had estimates of the cost-effectiveness of every grant we could possibly make this year, and similarly for each of the next 50 years, then we could take the following approach to maximize the impact of our spending and estimate our optimal threshold for cost-effectiveness:

1. Rank all the opportunities (all 50 years’ worth[1]) in terms of their expected cost-effectiveness.[2]
2. Set aside funds for the most cost-effective grant, then the next most cost-effective, and so on, going down the list until we’d allocated all our resources.
3. Look at the marginal grant that exhausts our resources, and use its estimated cost-effectiveness as the “bar” for all our other opportunities.
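The three steps above can be sketched as a simple greedy routine. This is my own illustration of the logic, not Open Philanthropy’s actual model, and the opportunity set and budget are hypothetical:

```python
def allocate(opportunities, budget):
    """Steps 1-3: rank by cost-effectiveness, fund down the list,
    and read the "bar" off the marginal (last-funded) grant.

    opportunities: list of (cost_effectiveness, cost) pairs.
    Returns (funded opportunities, bar).
    """
    ranked = sorted(opportunities, key=lambda o: o[0], reverse=True)
    funded, spent, bar = [], 0.0, None
    for ce, cost in ranked:
        if spent + cost > budget:
            continue  # skip grants the remaining budget can't cover
        funded.append((ce, cost))
        spent += cost
        bar = ce  # the marginal grant sets the bar
    return funded, bar

# Hypothetical set: (cost-effectiveness in "x" units, cost in $M)
opps = [(2000, 5), (1500, 10), (1200, 20), (1000, 30), (800, 40)]
funded, bar = allocate(opps, budget=35)
print(funded, bar)  # funds the 2000x, 1500x, and 1200x grants; bar = 1200x
```

Every other opportunity is then judged against the 1200x bar: anything below it would displace a grant that does more good per dollar.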

Of course, our cost-effectiveness estimates and predictions about the future aren’t nearly certain enough or granular enough to actually list every future opportunity. But we have done a very rough, abstract, assumptions-driven exercise (which we hope to publish in the next year) aimed at the same goals of figuring out what our “bar” should be and how to fund the most cost-effective opportunities across time.

One stylized finding from this exercise is that we can maximize our expected impact by setting a cost-effectiveness “bar” such that in any given year we spend roughly 8-10% of our remaining assets (of the assets we plan to eventually spend on interventions like GiveWell’s recommended charities).[3]
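To get a feel for what an 8-10% annual spend rate implies for the asset base, here is a toy trajectory. The 7% investment return is my own illustrative assumption, not a figure from the post:

```python
# Toy spend-down path: spend 9% of assets each year, then grow the
# remainder at an assumed (hypothetical) 7% annual return.
assets, spend_rate, annual_return = 100.0, 0.09, 0.07
path = [assets]
for year in range(50):
    assets = assets * (1 - spend_rate) * (1 + annual_return)
    path.append(assets)

print(round(path[10], 1))   # share of starting assets remaining after 10 years
print(round(path[-1], 1))   # share remaining after 50 years
```

Under these assumptions roughly a quarter of the starting assets remain after 50 years: a fixed spend share draws the portfolio down gradually rather than exhausting it on a fixed date.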

## How changes in assets influence optimal allocation across portfolio areas

In the hypothetical exercise above, we ranked each potential grant by its cost-effectiveness and then allocated our resources to the top grants, moving down the list until our resources were exhausted. If our assets were to shrink in value, they’d be exhausted further up the list, meaning that we’d fund fewer grants. Equivalently, since our cost-effectiveness “bar” is defined by the marginal grant, fewer resources means setting a higher bar for cost-effectiveness.

While we don’t have the information to actually do this exercise at the grant level across years, we can think about the implications for different portfolio areas. Some portfolios or categories of interventions will have many grants with cost-effectiveness near the bar; those grants could be ruled in or out by small fluctuations in the bar, so our giving in those categories will be especially sensitive to changes in asset values. Others will have most of their grants far above or far below the bar, which means our giving in those categories will not be very sensitive to changes in asset values.

GiveWell’s recommendations are an example of the first kind of grantmaking category, and so in theory our giving to GiveWell top charities should be especially sensitive to our asset values. That’s because, compared to most of our grantees, GiveWell’s work is highly elastic: its cost-effectiveness is not very sensitive to the annual scale of funding.

For example:

• If a charity focused on distributing bednets receives an extra $10 million, they can probably buy a lot more distribution, because this kind of work is highly scalable. Bednets are relatively easy to purchase and distribute; one might be able to spend a large amount of marginal funding at a similar level of cost-effectiveness by (for example) expanding to a new location with only slightly lower malaria prevalence.
• In other words, if the first $10 million in funding to this organization has a cost-effectiveness of 1100x, the next $10 million might be 1000x, because it buys a similar amount of distribution in a new location.
• By contrast, if a charity focused on researching new malaria treatments receives an extra $10 million, they may have a harder time “buying” more research, because this work isn’t as scalable. Even if the funding would pay for a dozen new researchers, there may simply not be enough relevant specialized candidates on the job market, which makes it hard to spend the money as effectively.
• In other words, even if the first $10 million in funding to this organization has a cost-effectiveness of 2000x, the next $10 million might be 500x, because it buys much less research.
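A small sketch of why this elasticity difference matters, reusing the hypothetical tranche numbers from the bullets above:

```python
# Funding tranches as (cost-effectiveness in "x" units, cost in $M),
# per the hypothetical bednet vs. research examples above.
bednets = [(1100, 10), (1000, 10)]    # elastic: CE declines slowly with scale
research = [(2000, 10), (500, 10)]    # inelastic: CE declines sharply

def funded(tranches, bar):
    """Total $M funded: every tranche at or above the bar."""
    return sum(cost for ce, cost in tranches if ce >= bar)

for bar in (900, 1050):  # a moderate rise in the bar...
    print(bar, funded(bednets, bar), funded(research, bar))
# ...halves the bednet funding (20 -> 10) but leaves the research
# funding unchanged (10 -> 10): elastic categories absorb bar shifts.
```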

This difference in elasticity means that a moderate change in our bar could rule out a lot of GiveWell grants while ruling out fewer grants in our other Global Health and Wellbeing (GHW) cause areas, which are more focused on research and advocacy.

## How we chose this year’s allocation to GiveWell

Due to the recent stock market decline, our available assets have declined by about 40% since the end of last year,[4] which changes the optimal allocation across causes. All else being equal, our planned 2022 allocation to GiveWell should respond more than proportionally, while our allocation to less elastic portfolio areas should respond less than proportionally.

However, asset values are not the only thing that’s changed since we last projected our GiveWell support. As we noted above, GiveWell has found much more cost-effective opportunities over the last nine months than we or they were expecting.

Incorporating that update, and adjusting for various constraints on our current opportunity set,[5] our model suggests that the optimal cost-effectiveness bar for our Global Health and Wellbeing spending is roughly 1100-1200x in our units.[6] That’s consistent with us giving roughly $250-$450 million to GiveWell top charities this year, depending on various assumptions about the opportunities GiveWell finds and their fundraising from other sources.[7] We don’t want our giving to create incentive problems by funging other donors,[8] so we’re committing now to a number in the middle of that range.

GiveWell’s increasingly cost-effective opportunity set means that, even though our available assets have fallen since the end of last year, and our allocation to GiveWell should respond more than proportionally, our planned 2022 support to GiveWell top charities has fallen by less than asset values (relative to our original tentative plan), and has still grown in absolute terms.
Grants we’ve made to GiveWell’s recommendations this year include:

• Up to $64.7 million to Dispensers for Safe Water to install, maintain, and promote use of chlorine dispensers at rural or remote water collection sites. They project that this funding “will see us providing over 10% of Uganda’s population, and over 15% of Malawi’s, with access to safe water”.
• $14 million to Evidence Action to scope and scale potentially cost-effective interventions that don’t have clear existing implementers. In expectation, GiveWell believes this grant will open up roughly$40 million per year in new cost-effective funding opportunities by 2025.
• $8.2 million to Fortify Health to expand its partnerships with flour mills in India. The organization provides equipment and materials which allow its partner mills to fortify wheat flour with iron, folic acid, and vitamin B12 in an effort to reduce health issues like anemia and cognitive impairment.
• $5 million to PATH to support ministries of health in Ghana, Kenya, and Malawi in the implementation of the RTS,S malaria vaccine. GiveWell believes that financing this opportunity will speed up implementation by roughly a year in the areas covered by the grant, and could be similarly cost-effective to Malaria Consortium’s seasonal malaria chemoprevention program (a GiveWell top charity).

When our GiveWell spending growth is combined with growth in other Global Health and Wellbeing causes, including new programs in South Asian Air Quality and Global Aid Policy, we plan to spend more in 2022 than in any previous year, both in absolute terms and as a share of assets. We expect that growth to continue beyond this year.

## Alexander’s final thoughts

When we published our initial plans for 2022, we were excited by GiveWell’s progress and eager to fund their future recommendations. Over the last nine months, they have exceeded our high expectations, which is why we are continuing to grow our support.

For other donors: I (Alexander) think it’s worth noting that GiveWell’s recommendations look very cost-effective this year. Compared to last year, we’d guess that their marginal recommendation this year will be ~20% more impactful; they also expect to have a sizable funding gap this year.[9]

A few years ago, my wife and I contributed to a donor-advised fund to save for an exceptional future donation opportunity. This year, in addition to our standard annual giving, we plan to recommend half of the balance to GiveWell’s recommendations. I think that other donors interested in cost-effective and evidence-backed giving opportunities should take a close look at GiveWell this year.

Footnotes

↑1 The 50-year horizon ensures that this analysis takes into account our funders’ desire to spend down their assets within their lifetimes.
↑2 Of course, we’d have to discount future costs by the expected rate of return on assets, since we only need to set aside 1 dollar now in order to fund 1+r dollars of grantmaking next year. For example, if we expect a compounding 7% rate of return, a grant we’d make for $1 million in 2022 would need to have roughly the same estimated impact as a grant we’d make for $2 million in 2032 for the two grants to be similarly ranked — because we could get roughly $2 million in 2032 by investing the $1 million now.
↑3 This is the output of a Monte Carlo simulation approach, which we hope to write more about once we’re more confident in the model and findings. We simulate the interaction of many factors that could make our future spending more or less cost-effective. For example, some of our most cost-effective philanthropic opportunities will shrink over time as child mortality and global poverty decline, and the entry of new funders might mean there is less need for our spending in the future. On the other hand, waiting longer means we have more resources due to investment returns, and additional research might reveal new opportunities. Our current best estimate is that, for interventions like GiveWell’s recommended charities, our optimal strategy is to spend 8-10% of relevant assets each year. This means that whatever level of assets we have, our cost-effectiveness “bar” for GiveWell-like interventions should be set so that the opportunities above this bar in the next year add up to 8-10% of such assets.
↑4 This just reflects a decline in the market; our main donors are still planning to give away virtually all of their wealth within their lifetimes.
↑5 The “abstract, assumptions-driven” model we described above assumes that our grantmaking opportunity set is fully isoelastic (i.e. cost-effectiveness and scale trade off at a fixed proportional rate, no matter how much or how quickly we scale up) and doesn’t distinguish between different categories of GHW spending. More realistic constraints would, among other effects, limit how much we can spend in the next few years on non-GiveWell opportunities. For example, if we set the goal “quadruple our planned spending in 2023”, we’d probably have to set a much lower bar for that year. By contrast, if our goal was “quadruple our planned spending over the next decade”, we probably wouldn’t have to lower the bar as much (since we’d have more time to build a strategy around the new goal).
↑6 We refer to our bar using a multiplier — for example, using a “1000X” bar would mean we wanted each dollar of funding to have 1000 times as much impact as giving $1 to someone earning $50,000 per year. Our current bar is higher than the one we discussed last year because it reflects the update from assets declining and GiveWell finding more cost-effective opportunities than we expected, both of which raise the optimal bar.
↑7 Even without those sources of uncertainty, we’d have a fairly wide range of roughly-optimal amounts we could give this year. Our analysis suggests that giving $100 million more or less than the truly “optimal” amount to GiveWell top charities only reduces our overall impact by roughly as much as if we’d spent $5 million of our GHW assets to have zero impact. The impact of small deviations from the optimal path is small because, if we were perfectly optimally allocated across categories (and years) of spending, the marginal cost-effectiveness would be equalized for each category — that is, an extra dollar of giving would accomplish the same amount of good, no matter which year or category we allocated it to. Therefore, if we get our allocations a bit wrong, but are still near optimal, those deviations don’t reduce our impact too much. (Very roughly, they reduce our impact by the amount misallocated, times half the difference in marginal cost-effectiveness between the misallocated category and our overall “bar”. That’s because impact is the area under the cost-effectiveness curve, which is roughly approximated by a triangle at these scales.)
↑8 For more on this point, see the “coordination issues” section of this post.
↑9 I feel some responsibility for the gap. I think that originally discussing our tentatively planned $500 million allocation for 2022, along with GiveWell’s related disclosure of their expectation that they’d be rolling funding forward on the margin in 2021, led some donors to hold off on donations they might otherwise have made. While we still think it was correct to share our projections with GiveWell in order for them to plan correctly, we should have done more, privately and publicly, to emphasize that our plans were tentative, and that GiveWell could readily end up with more exciting grant opportunities than funding to fill them.

## How accurate are our predictions?

When investigating a grant, Open Philanthropy staff often make probabilistic predictions about grant-related outcomes they care about, e.g. “I’m 70% confident the grantee will achieve milestone #1 within 1 year.” This allows us to learn from the success and failure of our past predictions and get better over time at predicting what will happen if we make one grant vs. another, pursue one strategy vs. another, etc. We hope that this practice will help us make better decisions and thereby enable us to help others as much as possible with our limited time and funding.[1]

Thanks to the work of many people, we now have some data on our forecasting accuracy as an organization. In this blog post, I will:

1. Explain how our internal forecasting works. [more]
2. Present some key statistics about the volume and accuracy of our predictions. [more]
3. Discuss several caveats and sources of bias in our forecasting data: predictions are typically scored by the same person that made them, our set of scored forecasts is not a random or necessarily representative sample of all our forecasts, and all hypotheses discussed here are exploratory. [more]

## 1. How we make and check our forecasts

Grant investigators at Open Philanthropy recommend grants via an internal write-up. This write-up typically includes the case for the grant, reservations and uncertainties about it, and logistical details, among other things. One of the (optional) sections in that write-up is reserved for making predictions.

The prompt looks like this (we’ve included sample answers):

Do you have any new predictions you’re willing to make for this grant? […] A quick tip is to scan your write-up for expectations or worries you could make predictions about. […]

The first three columns make up the prediction; the last two are for scoring (you can leave them blank until you’re able to score):

| With X% confidence… | …I predict that (yes/no or confidence interval prediction)… | …by time Y (ideally a date, not e.g. “in one year”) | Score (please stick to True / False / Not Assessed) | Comments or caveats about your score |
| --- | --- | --- | --- | --- |
| 30% | The grantee will produce outcome Z | End of 2021 | | |

After a grant recommendation is submitted and approved, the predictions in that table are logged into our Salesforce database for future scoring (as true or false). If the grant is renewed, scoring typically happens during the renewal investigation phase, since that’s when the grant investigator will be collecting information about how the original grant went. If the grant is not renewed, grant investigators are asked to score their predictions after they come due.[2] Scores are then logged into our database, and that information is used to produce calibration dashboards for individual grant investigators and teams of investigators working in the same focus area.

A user’s calibration dashboard (in Salesforce) looks like this:

The calibration curve tells the user where they are well-calibrated vs. overconfident vs. underconfident. If a forecaster is well-calibrated for a given forecast “bucket” (e.g. forecasts they made with 65%-75% confidence), then the percent of forecasts that resolved as “true” should match that bucket’s confidence level (e.g. they should have come true 65%-75% of the time). On the chart, their observed calibration (the red dot) should be close to perfect calibration (the gray dot) for that bucket.[3] If it’s not, then the forecaster may be overconfident or underconfident for that bucket — for example, if things they predict with 65%-75% confidence happen only 40% of the time (overconfidence). (A bucket can also be empty if the user hasn’t made any forecasts within that confidence range.)

Each bucket also shows a 90% credible interval (the blue line) that indicates how strong the evidence is that the forecaster’s calibration in that bucket matches their observed calibration, based on how many predictions they’ve made in that bucket. As a rule of thumb, if the credible interval overlaps with the line of perfect calibration, that means there’s no strong evidence that they are miscalibrated in that bucket. As a user makes more predictions, the blue lines shrink, giving that user a clearer picture of their calibration.
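As a rough illustration of where these intervals come from: footnote 7 below notes that they assume a uniform prior over (0, 1), so a bucket with T true and F false predictions gets a Beta(T+1, F+1) posterior. The sketch below is our own illustration in Python, not the actual dashboard code; it recovers the 90% interval by numerically integrating that posterior on a grid.

```python
def beta_credible_interval(t, f, level=0.90, grid=50_000):
    """Central credible interval for a bucket's true hit rate.

    Assumes a uniform Beta(1, 1) prior, so the posterior after
    t true and f false predictions is Beta(t + 1, f + 1). The
    quantiles are found by accumulating the (normalized) posterior
    density over a grid on (0, 1).
    """
    a, b = t + 1, f + 1
    xs = [(i + 0.5) / grid for i in range(grid)]
    # Unnormalized Beta(a, b) density at each grid point.
    pdf = [x ** (a - 1) * (1 - x) ** (b - 1) for x in xs]
    total = sum(pdf)
    lo_target, hi_target = (1 - level) / 2, (1 + level) / 2
    cum, lo, hi = 0.0, None, None
    for x, p in zip(xs, pdf):
        cum += p / total
        if lo is None and cum >= lo_target:
            lo = x
        if hi is None and cum >= hi_target:
            hi = x
    return lo, hi
```

For example, feeding in the 70%-80% bucket from the table in footnote 8 (65 true, 29 false) gives an interval that brackets the observed 69% hit rate; buckets with fewer predictions give correspondingly wider intervals.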

In the future, we hope to add more features to these dashboards, such as more powerful filters and additional metrics of accuracy (e.g. Brier scores).

## 2. Results

#### 2.1 Key takeaways

1. We’ve made 2850 predictions so far. 743 of these have come due and been scored as true or false. [more]
2. Overall, we are reasonably well-calibrated, except for being overconfident about the predictions we make with 90%+ confidence. [more]
3. The organization-wide Brier score (measuring both calibration and resolution) is .217, which is somewhat better than chance (.250). This requires careful interpretation, but in short we think that our reasonably good Brier score is mostly driven by good calibration, while resolution has more room for improvement (but this may not be worth the effort). [more]
4. About half (49%) of our predictions have a time horizon of ≤2 years, and only 13% of predictions have a time horizon of ≥4 years. There’s no clear relationship between accuracy and time horizon, suggesting that shorter-range forecasts aren’t inherently easier, at least among the short- and long-term forecasts we’re choosing to make. [more]

#### 2.2 How many predictions have we made?

As of March 16, 2022, we’ve made 2850 predictions. Of the 1345 that are ready to be scored, we’ve thus far assessed 743 of them as true or false. (Many “overdue” predictions will be scored when the relevant grant comes up for renewal.) Further details are in a footnote.[4]

What kinds of predictions do we make? Here are some examples:

• “[20% chance that] at least one human challenge trial study is conducted on a COVID-19 vaccine candidate [by Jul 1, 2022]”
• “[The grantee] will play a lead role… in securing >20 new global cage-free commitments by the end of 2019, improving the welfare of >20M hens if implemented”
• “[70% chance that] by Jan 1, 2018, [the grantee] will have staff working in at least two European countries apart from [the UK]”
• “60% chance [the grantee] will hire analysts and other support staff within 3 months of receiving this grant and 2-3 senior associates and a comms person within 6-9 months of receiving this grant”
• “70% chance that the project identifies ~100 geographically diverse advocates and groups for re-grants”
• “[80% chance that] we will want to renew [this grant]”
• “75% chance that [an expert we trust] will think [the grantee’s] work is ‘very good’ after 2 years”

Some focus areas[5] are responsible for most predictions, but this is mainly driven by the number of grant write-ups produced for each focus area. The number of predictions per grant write-up ranges from 3 to 8 and is similar across focus areas. Larger grants tend to have more predictions attached to them. We averaged about 1 prediction per $1 million moved, with significant differences across grants and focus areas.

#### 2.3 Calibration

Good predictors should be calibrated. If a predictor is well-calibrated, that means that things they expect to happen with 20% confidence do in fact happen roughly 20% of the time, things they expect with 80% confidence happen roughly 80% of the time, and so on.[6] Our organization-wide calibration curve looks like this:

To produce this plot, prediction confidences were binned in 10% increments. For example, the leftmost dot summarizes all predictions made with 0%-10% confidence. It appears at the 6% confidence mark because that’s the average confidence of predictions in the 0%-10% range, and it shows that 12% of those predictions came true. The dashed gray line represents perfect calibration.

The vertical black lines are 90% credible intervals around the point estimates for each bin. If the bar is wider, that generally means we’re less sure about our calibration for that confidence range because we have fewer data points in that confidence range.[7] All the bins have at least 40 resolved predictions except the last one, which only has 8 – hence the wider interval. A table with the number of true / false predictions in each bin can be found in a footnote.[8]
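To make the binning concrete, here is a minimal sketch (ours, not the production Salesforce code) of how a calibration curve like the one above can be computed from a list of (confidence, outcome) pairs, with bins open on the left and closed on the right as in footnote 8:

```python
import math

def calibration_curve(predictions, width=0.10):
    """Bin (confidence, came_true) pairs for a calibration plot.

    Bins are open on the left and closed on the right, so a 0.30
    prediction lands in the 20-30% bin and a 0.20 prediction in the
    10-20% bin. For each non-empty bin we return the average stated
    confidence (the x-coordinate of the gray "perfect calibration"
    dot), the observed fraction that came true (the red dot), and
    the bin count.
    """
    bins = {}
    for conf, came_true in predictions:
        # ceil(conf / width) - 1 gives the 0-based index for a
        # left-open, right-closed interval.
        idx = max(0, math.ceil(conf / width) - 1)
        bins.setdefault(idx, []).append((conf, came_true))
    curve = []
    for idx in sorted(bins):
        pairs = bins[idx]
        avg_conf = sum(c for c, _ in pairs) / len(pairs)
        hit_rate = sum(t for _, t in pairs) / len(pairs)
        curve.append((avg_conf, hit_rate, len(pairs)))
    return curve
```

Note that the gray dot sits at the bin’s average confidence, not its midpoint, matching footnote 3.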

The plot shows that Open Philanthropy is reasonably well-calibrated as a whole, except for predictions we made with 90%+ confidence (those events only happened slightly more than half the time) and possibly also in the 70%-80% range (those events happened slightly less than 70% of the time). In light of this, the “typical” Open Phil predictor should be less bold and push predictions that feel “almost certain” towards a lower number.[9]

#### 2.4 Brier scores and resolution

On top of being well calibrated, good predictors should give high probability to events that end up happening and low probability to events that don’t. This isn’t captured by calibration. For example, imagine a simplified world in which individual stocks go up and down in price but the overall value of the stock market stays the same, and there aren’t any trading fees. In this world, one way to be well-calibrated is to make predictions about whether randomly chosen stocks will go up or down over the next month, and for each prediction just say “I’m 50% confident it’ll go up.” Since a randomly chosen stock will indeed go up over the next month about 50% of the time (and down the other 50% of the time), you’ll achieve perfect calibration! This good calibration will spare you from the pain of losing money, but it won’t help you make any money either. However, you will make lots of money if you can predict with 60% (calibrated) confidence which stocks will go up vs. down, and you’ll make even more money if you can predict with 80% calibrated confidence which stocks will go up vs. down. If you could do that, then your stock predictions would be not just well-calibrated but also have good “resolution.”

A metric that captures both aspects of what makes a good predictor is the Brier score (also explained in a footnote[10]). The most illustrative examples are:

1. A perfect predictor (100% confidence on things that happen, 0% confidence on things that don’t) would get a Brier score of 0.
2. A perfect anti-predictor (0% confidence on things that happen, 100% confidence on things that don’t) would get a score of 1.
3. A predictor that always predicts 50% would get a score of 0.25 (assuming the events they predict happen half the time). Thus, a score higher than 0.25 means someone’s predictions are less accurate than if they had simply guessed 50% for everything.
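The three cases above can be verified with a few lines of Python (a sketch of the standard formula, not code from our systems):

```python
def brier_score(forecasts):
    """Mean squared error between forecast probabilities and outcomes.

    `forecasts` is a list of (probability, outcome) pairs, where
    outcome is 1 if the event happened and 0 otherwise. Lower is
    better: 0 is a perfect predictor, 0.25 matches always guessing
    50% on events with a 50% base rate, and 1 is a perfect
    anti-predictor.
    """
    return sum((p - y) ** 2 for p, y in forecasts) / len(forecasts)

assert brier_score([(1.0, 1), (0.0, 0)]) == 0.0   # perfect predictor
assert brier_score([(0.5, 1), (0.5, 0)]) == 0.25  # always 50%
assert brier_score([(0.0, 1), (1.0, 0)]) == 1.0   # perfect anti-predictor
```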

The mean Brier score across all our predictions is 0.217, and the median is 0.160. (Remember, lower is better.) 75% of focus area Brier scores are under 0.25 (i.e. they’re better than chance).[11]

This rather modest[12] Brier score together with overall good calibration implies our forecasts have low resolution.[13] Luke’s intuition on why there’s a significant difference in performance between these two dimensions of accuracy is that good calibration can probably be achieved through sheer reflection and training, just by being aware of the limits of one’s own knowledge, whereas resolution requires gathering and evaluating information about the topic at hand and carefully using it to produce a quantified forecast, something our grant investigators aren’t typically doing in much detail (most of our forecasts are produced in seconds or minutes). If this explanation is right, getting better Brier scores would require spending significantly more time on each forecast. We’re uncertain whether this would be worth the effort, since calibration alone can be fairly useful for decision-making and is probably much less costly to achieve, and our grant investigators have many other responsibilities besides making predictions.
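Footnote 10 gives the calibration/resolution/uncertainty decomposition behind this claim. A sketch of the empirical version (our own illustration, binning forecasts as that footnote describes) looks like this:

```python
def brier_decomposition(forecasts, width=0.10):
    """Empirical Murphy decomposition: BS = reliability - resolution + uncertainty.

    Forecasts are binned by stated probability; P[Y|p] is estimated
    as the hit rate within each bin. The identity holds exactly when
    all forecasts in a bin share the same probability, and only
    approximately otherwise.
    """
    n = len(forecasts)
    base_rate = sum(y for _, y in forecasts) / n
    bins = {}
    for p, y in forecasts:
        idx = min(int(p / width), int(1 / width) - 1)
        bins.setdefault(idx, []).append((p, y))
    reliability = resolution = 0.0
    for pairs in bins.values():
        k = len(pairs)
        avg_p = sum(p for p, _ in pairs) / k      # mean forecast in bin
        hit_rate = sum(y for _, y in pairs) / k   # empirical P[Y|p]
        reliability += k * (avg_p - hit_rate) ** 2 / n
        resolution += k * (hit_rate - base_rate) ** 2 / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty
```

With a 50% base rate, uncertainty is 0.25 and a well-calibrated forecaster’s reliability term is near zero, so a Brier score near 0.25 − resolution is exactly the pattern described above.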

#### 2.5 Longer time horizons don’t hurt accuracy

Almost half of all our predictions are made less than 2 years before they will resolve (e.g. the prediction might be “X will happen within two years”),[14] with ~75% being less than 3 years out. Very few predictions are about events decades into the future.

It’s reasonable to assume that (all else equal) the longer the time horizon, the harder it is to make accurate predictions.[15] However, our longer-horizon forecasts are about as accurate as our shorter-horizon forecasts.

A possible explanation is question selection. Grant investigators may be less willing to produce long-range forecasts about things that are particularly hard to predict because the inherent uncertainty looks insurmountable. This may not be the case for short-range forecasts, since for these most of the information is already available.[16] In other words, we might be choosing which specific things to forecast based on how difficult we think they are to forecast regardless of their time horizon, which could explain why our accuracy doesn’t vary much by time horizon.

## 3. Caveats and sources of bias

There are several reasons why our data and analyses could be biased. While we don’t think these issues undermine our forecasting efforts entirely, we believe it’s important for us to explain them in order to clarify how strong the evidence is for any of our claims. The main issues we could identify are:

1. Predictions are typically written and then later scored by the same person, because the grant investigator who made each prediction is typically also our primary point of contact with the relevant grantee, from whom we typically learn which predictions came true vs. false. This may introduce several biases. For example, predictors may choose events that are inherently easier to predict. Or, they may score ambiguous predictions in a way that benefits their accuracy score. Both things could happen subconsciously.
2. There may be selection effects on which predictions have been scored. For example, many predictions have overdue scores, i.e. they are ready to be evaluated but have not been scored yet. The main reason for this is that some predictions are associated with active grants, i.e. grants that may be renewed in the future. When this happens, our current process is to leave them unscored until the grant investigator writes up the renewal, during which they are prompted to score past predictions. It shouldn’t be assumed that these unscored predictions are a random sample of all predictions, so excluding them from our analyses may introduce some hard-to-understand biases.
3. The analyses presented here are completely exploratory. All hypotheses were put forward after looking at the data, so this whole exercise should be better thought of as “narrative speculations” rather than “scientific hypothesis testing.”

Footnotes

1 Here is a fuller list of reasons we make explicit quantified forecasts and later check them for accuracy, as described in an internal document by Luke Muehlhauser:

1. There is some evidence that making and checking quantified forecasts can help you improve the accuracy of your predictions over time, which in theory should improve the quality of our grantmaking decisions (on average, in the long run).
2. Quantified predictions can enable clearer communication between grant investigators and decision-makers. For example, if you just say it “seems likely” the grantee will hit their key milestone, it’s unclear whether you mean a 55% chance or a 90% chance.
3. Explicit quantified predictions can help you assess grantee performance relative to initial expectations, since it’s easy to forget exactly what you expected them to accomplish, and with what confidence, unless you wrote down your expectations when you originally made the grant.
4. The impact of our work is often difficult to measure, so it can be difficult for us to identify meaningful feedback loops that can help us learn how to be more effective and hold ourselves accountable to our mission to help others as much as possible. In the absence of clear information about the impact of our work (which is often difficult to obtain in a philanthropic setting), we can sometimes at least learn how accurate our predictions were and hold ourselves accountable to that. For example, we might never know whether our grant caused a grantee to succeed at X and Y, but we can at least check whether the things we predicted would happen did in fact happen, with roughly the frequencies we predicted.
2 In some rare cases, it’s possible for the people managing the database to score predictions using information available to them. However, predictions tend to be very in-the-weeds, so scoring them typically requires input from the grant investigators who made them.
3 The horizontal coordinate of the gray dots is calculated by averaging the confidence of all the predictions in each bin. Note that this is in general different from the midpoint of the bin; for example, if there are only two predictions in the 45%-55% bin and they have 46% and 48% confidence, respectively, then the point of perfect calibration in that bin would be 47%, not 50%.
4 Our stats as of 2022-03-16 are as follows (italics means the percentage is taken over scored predictions, not total):

| Status | | Number | % |
| --- | --- | --- | --- |
| Scored | True | 382 | 45% |
| | False | 361 | 42% |
| | Not Assessed | 115 | 13% |
| | Total Scored | 858 | 30% |
| Not scored | Not Yet Due | 1,448 | 51% |
| | Overdue | 487 | 17% |
| | Missing End Date | 57 | 2% |
| | Total Not Scored | 1,992 | 70% |
| Total | | 2,850 | 100% |

Some categories in the table above deserve further comments:

• Not Assessed: There are several reasons why some predictions are not assessed:
• Some predictions had vague / subjective resolution criteria (so that it was unclear whether the event happened or not).
• We didn’t check some predictions because it would have taken too much time or effort to do so.
• Some predictions were premised on a condition that wasn’t fulfilled (e.g. “if X happens, the grantee will achieve Y”, if X never happens).
• Some predictions were about grants that didn’t happen.

We don’t yet have systematic data to determine which of these reasons are more prevalent, but we may be able to say more about this in the future.

• Overdue: Some predictions have overdue scores because they are associated with active grants that may be renewed in the future. In these cases, we don’t request scores from grant investigators until they write up the renewal grant. There may also be some scores we haven’t logged yet due to lack of capacity.
• Missing End Date: Predictions with no end date can’t be scored as False (because the event may still happen in the future). We’re currently working with grant investigators to log reasonable end dates for these.
5 We’re leaving out focus areas with less than $10M moved in the subsequent analyses. The excluded focus areas are South Asian Air Quality, History of Philanthropy, and Global Health and Wellbeing.

6 This sentence and some other explanatory language in this report are borrowed from an internal guide about forecasting written by Luke Muehlhauser.

7 These intervals assume a uniform prior over (0, 1). This means that, for a bin with T true predictions and F false predictions, the intervals are calculated using a Beta(T+1, F+1) distribution.

8 Detailed calibration data for each bin are provided below. Note that intervals are open to the left and closed to the right; a 30% prediction would be included in the 20-30 bin, but a 20% prediction would be included in the 10-20 bin.

| Confidence [%] | True | False | Total |
| --- | --- | --- | --- |
| 0-10 | 5 | 39 | 44 |
| 10-20 | 10 | 32 | 42 |
| 20-30 | 20 | 53 | 73 |
| 30-40 | 24 | 36 | 60 |
| 40-50 | 69 | 82 | 151 |
| 50-60 | 64 | 36 | 100 |
| 60-70 | 86 | 44 | 130 |
| 70-80 | 65 | 29 | 94 |
| 80-90 | 34 | 7 | 41 |
| 90-100 | 5 | 3 | 8 |

9 However, given that there is high variance in calibration across predictors, this may not be the best idea in all cases. For personal advice, predictors may wish to refer to their own calibration curve, or their team’s curve.

10 For binary events, the Brier score can be defined as

$$BS\,=\,\frac{1}{n} \sum_{i=1}^n (p_i\,-\,Y_i)^2$$

where $$i = 1,\ldots,n$$ ranges over events, $$p_i$$ is the forecasted probability that the i-th event resolves True, and $$Y_i$$ is the actual outcome of the i-th event (1 if True, 0 if False). A predictor that knows the base rate, b, of future events and predicts it on every event gets a Brier score of b(1 − b). For example, if b = 50% (as is roughly the case for us), the expected Brier score is 0.25. A perfect predictor (100% confidence on things that happen, 0% confidence on things that don’t) would get a Brier score of 0. A predictor that is perfectly anticorrelated with reality (predicts the exact opposite as a perfect predictor) would get a score of 1.
The Brier score can be decomposed into a sum of 3 components as

$$BS\,=\,E(p\, -\,P[Y|p])^2\,-\,E(P[Y|p]\,-\,b)^2\,+\,b\,(1\,-\,b)$$

where $$E$$ denotes expectation, $$p$$ is the forecasted probability of the event $$Y$$, $$P[Y|p]$$ is the actual probability of $$Y$$ given that the forecasted probability was $$p$$, and $$b$$ is the base rate of $$Y$$. The components can be interpreted as follows:

1. The first one measures miscalibration. It is the mean squared error between forecasted and actual probabilities. It ranges from 0 (perfect calibration) to 1 (worst).
2. The second one measures resolution. It is the expected improvement of one’s forecasts over the blind strategy that always outputs the base rate. It ranges from 0 (worst) to b(1 − b) (best).
3. The third one measures the inherent uncertainty of the events being forecasted. It is just the entropy of a binary event that happens with probability b.

In practice, because it is unlikely that any two events have the same forecasted probability, $$P[Y | p]$$ is calculated by binning forecasts and averaging within each bin, i.e. the empirical estimate is $$P[Y | p]$$ = (# of true predictions in that bin) / (total # of predictions in that bin). This is exactly what we do in our dashboards.

11 A score of 0.25 is a reasonable baseline in our case because the base rate for past predictions happens to be very close to 50%. This means that predictors in the future could state 50% confidence on all predictions and, assuming the base rate stays the same (i.e. the population of questions that predictors sample from is stable over time), get close to perfect calibration without achieving any resolution.

12 For comparison, first-year participants in the Good Judgment Project (GJP) that were not given any training got a score of 0.21 (appears as 0.42 in table 4 here; Tetlock et al. scale their Brier score such that, for binary questions, we’d need to multiply our scores by 2 to get numbers with the same meaning).
The Metaculus community averages 0.150 on binary questions as of this writing (May 2022). Both comparisons have very obvious caveats: the population of questions on GJP or Metaculus is very different from ours, and both platforms calculate average Brier scores over time, taking into account updates to the initial forecast, while our grant investigators only submit one forecast and never try to refine it later.

13 For a base rate of 50%, resolution ranges from 0 (worst) to 0.25 (best). OP’s resolution is 0.037.

14 A caveat about this data: I’m taking the difference between ‘End Date’ (i.e. when a prediction is ready to be assessed) and ‘Investigation Close Date’ (the date the investigator submitted their request for conditional approval). This underestimates the time span between forecast and resolution because predictions are made before the investigation closes. This explains the fact that some time deltas are slightly negative. The most likely explanation for this is that the grant investigator wrote the prediction long before submitting the write-up for conditional approval.

15 This is in line with evidence from GJP and (less so) Metaculus showing that accuracy drops as time until question resolution increases. However, note that the opposite holds for PredictionBook, i.e. Brier scores tend to get better the longer the time horizon. Our working hypothesis to explain this paradoxical result is that, when users get to select the questions they forecast on (as they do on PredictionBook), they will only pick “easy” long-range questions. When the questions are chosen by external parties (as in GJP), they tend to be more similar in difficulty across time horizons. Metaculus sits somewhere in the middle, with community members posting most questions and opening them to the public.
We may be able to test this hypothesis in the future by looking at data from Hypermind, which should fall closer to GJP than to the others because questions on the platform are commissioned by external parties.

16 This selection effect could come about through several mechanisms. One such mechanism could be picking well-defined processes more often in long-range forecasts than in short-range ones. In those cases, what matters is not the calendar time elapsed between start and end but the number and complexity of steps in the process. For example, a research grant may contain predictions about the likely output of that research (some finding or publication) that can’t be scored until the research has been conducted. If the research was delayed for some reason, or if it happens earlier than expected due to e.g. a sudden influx of funding, that doesn’t change the intrinsic difficulty of predicting anything about the research outcomes themselves.

## Announcing the launch of our new website

We’ve just launched a new version of our website. We think the new design will make our content easier to navigate, so that readers have an easier time learning about our work and our thinking.

As part of the launch, we’ve updated language on a number of core pages to better reflect how our work has evolved in the years since our previous website was created. This includes updates to our mission statement, which had been in place since our incubation as a project of GiveWell. The new statement is more concise, and we think it better reflects the breadth of our work: “Our mission is to help others as much as we can with the resources available to us.”

Other updates include:

• The ability to sort and filter much of our published content, including blog posts, research reports, and notable lessons.
• Statistics on our giving in each of our focus areas.
• A new page explaining the difference between our two grantmaking portfolios (Global Health & Wellbeing and Longtermism).
• New pages for our newest focus areas, South Asian Air Quality and Global Aid Policy.

If you experience any issues using the new site, or want to suggest a change, we would appreciate your feedback! Contact [email protected] to get in touch.

## Open Philanthropy’s Cause Exploration Prizes

Update: We’ve now announced the recipients of these prizes.

At Open Philanthropy, we aim to give as effectively as we can. To find the best opportunities, we’ve looked at many different causes, some of which have become our current focus areas. Even after a decade of research, we think there are many excellent grantmaking ideas we haven’t yet uncovered. So we’ve launched the Cause Exploration Prizes around a set of questions that will help us explore new areas.

We’re most interested in responses to our open prompt: “What new cause area should Open Philanthropy consider funding?” We also have prompts in the following areas:

We’re looking for responses of up to 5,000 words that clearly convey your findings. It’s fine to use bullet points and informal language. For more detail, see our guidance for authors. To submit, go to this page.

We hope that the Prizes help us to:

• Identify new cause areas and funding strategies.
• Develop our thinking on how best to measure impact.
• Find people who might be a good fit to work with us in the future.

You can read more about the Cause Exploration Prizes on our dedicated website. You’ll also be able to read all of the submissions on the Effective Altruism Forum later this summer – stay tuned!

## Prizes, rules, and deadlines

All work must be submitted by 11:00 pm PDT on August 4, 2022.

You are almost certainly eligible! We think these questions can be approached from many directions, and you don’t need to be an expert or have a PhD to apply.

There’s a $25,000 prize for the top submission, and three $15,000 prizes.
Anyone who wins one of these prizes will be invited to present their work to Open Philanthropy’s cause prioritization team in San Francisco (and be compensated for their time and travel). And we will follow up with authors if their work contributes to Open Phil grantmaking decisions!

We will also award twenty honorable mentions ($500), and a participation award ($200) for the first 200 submissions made in good faith and not awarded another prize.

All submissions will be shared on the Effective Altruism Forum to allow others to learn from them. If participants prefer, their submission can be published anonymously, and we can handle the logistics of posting to the Forum. See more detail here.

For full eligibility requirements and prize details, see our rules and FAQs. If you have any questions not answered by that page, contact us at [email protected].

## Our Progress in 2021 and Plans for 2022

This post compares our progress with the goals we set forth a year ago, and lays out our plans for the coming year. In brief:

• We recommended over $400 million worth of grants in 2021. The bulk of this came from recommendations to support GiveWell’s top charities and from our major current focus areas. [More]
• We launched several new program areas — South Asian air quality, global aid policy, and effective altruism community building with a focus on global health and wellbeing — and have either hired or are currently hiring program officers to lead each of those areas. [More]
• We revisited the case for our US policy causes — spinning out our criminal justice reform program as an independent organization, making exit grants in US macroeconomic stabilization policy, and updating our approaches to land use reform and immigration policy. [More]
• We shared our latest framework for evaluating global health and wellbeing interventions, as well as several reports on key topics in potential risk from advanced AI. [More]

## Continued grantmaking

Last year, we wrote:

We expect to continue grantmaking in potential risks from advanced AI, biosecurity and pandemic preparedness, criminal justice reform, farm animal welfare, scientific research, and effective altruism, as well as recommending support for GiveWell’s top charities. We expect that the total across these areas will be over $200 million.

We wound up recommending over $400 million across those areas. Some highlights:

### Plans for 2022

This year, we expect to continue grantmaking in potential risks from advanced AI, biosecurity and pandemic preparedness, South Asian air quality, global aid policy, farm animal welfare, scientific research, and effective altruism community building (focused on both longtermism and global health and wellbeing), as well as recommending support for GiveWell’s top charities.

We aim to roughly double the amount we recommend this year relative to last year, and triple it by 2025.

## New program areas

We launched two new program areas this year: South Asian Air Quality and Global Aid Policy. We’re thrilled to have hired Santosh Harish and Norma Altshuler to lead these programs, and we look forward to sharing some of the grants they’re making in next year’s annual review.

We also announced plans to launch another new program area: supporting the effective altruism community with a focus on global health and wellbeing. We are still in the process of hiring a program officer to lead this area.

### Plans for 2022

This year, our global health and wellbeing cause prioritization team aims to launch three more new program areas where we can find scalable opportunities above our bar, and to continue laying the groundwork for more growth in future years.

## Revisiting our older US policy causes

We made our initial selection of US policy causes in 2014 and 2015. We chose criminal justice reform, macroeconomic stabilization policy, immigration policy, and land use reform in order to try to get experience across a variety of causes that stood out on different criteria (immigration on importance, CJR on tractability, land use reform and macro on neglectedness).

We’ve learned a lot from our experience funding in these fields, but over time have updated towards a more unified ROI framework that lets us more explicitly compare across causes. (Also, the world has changed a lot over the last 7 years.) We gave an initial update on our revised thinking back in 2019, and we are still evaluating our performance and plans for the future as of 2022.

On our new website, we are no longer referring to these causes as full “focus areas” because we do not have full-time staff leading any of them. But our particular plans for the future vary across the four causes:

• We spun out our criminal justice reform program into an independent organization — Just Impact, which we supported with $50 million in seed funding.
• After the rapid recovery of the U.S. from the most recent recession, we decided to wind down our giving to U.S. grantees in macroeconomic policy. (We made some exit grants, as we often do when we decide not to renew support to organizations we’ve supported for a long time.) We currently expect to continue to support regranting on this issue within Europe via Dezernat Zukunft, but will revisit depending on how economic and policy conditions evolve. We hope to write more about our thinking on and lessons learned in this area in the future.
• On land use reform: we recently completed an updated review of the performance of our grantees and the valuation of a marginal housing unit in key supply-constrained regions, which made us think that our returns to date have been well above our bar and that there is room for expansion. We’ve commissioned an outside review of our analysis; pending the results of that review, we’re considering hiring someone to lead a bigger portfolio in this space.
• On immigration policy:
  • We have never had a clear theory of how to change the political economy to be supportive of substantially larger immigration flows, which is what would be necessary to achieve the global poverty improvements that motivate our interest in this issue. Accordingly, our recent spending has been lower than in macro or land use reform.
  • Over the last few years, the US political climate for immigration reform has come to look even less promising than when we initially explored this space.
  • We’re currently planning to continue supporting Michael Clemens’ work (which is what motivated our interest in this cause), make occasional opportunistic grants that fit with our overall ROI framework, and explore whether we should have a program around STEM immigration.
But we are not planning to do more on US immigration policy per se.

## New approaches to funding

This year, we created a number of new programs to openly solicit funding requests from individuals, groups, and organizations. This represents a different approach from the proactive searching and networking we use to find most of our grants, and we are excited by the potential for these programs to unearth strong opportunities we wouldn’t have found otherwise.

The largest such program is our Regranting Challenge, which will allocate $150 million in funding to the grantmaking budgets for one to five outstanding programs at other foundations. That program is closed to new submissions, but we’ve listed many programs that are open to submissions on our “How to Apply for Funding” page.

Groups eligible for at least one open program include:

## Worldview investigations

Last year, we wrote:

Our worldview investigations team is now working on:

• More thoroughly assessing and writing up what sorts of risks transformative AI might pose and what that means for today’s priorities.
• Updating our internal views of certain key values, such as the estimated economic value of a disability-adjusted life year (DALY) and the possible spillover benefits from economic growth, that inform what we have thus far referred to as our “near-termist” cause prioritization work.

New work we’ve published in these areas:

### Other published work

This section lists other work we’ve published this year on our research and grantmaking, including work published by Open Phil staff on the Effective Altruism Forum.

Holden started a blog, Cold Takes. His Most Important Century series incorporates much of Open Phil’s research into an argument that we could be living in the most important century of all time for humanity, and includes a guest post from Ajeya Cotra on why aligning AI could be difficult with modern deep learning methods.

Open Phil staff were also interviewed for a variety of articles and podcasts:

## Our new structure and grantmaking terminology

In June, Alexander was promoted to co-CEO alongside Holden. Alexander currently oversees the grantmaking in our Global Health and Wellbeing portfolio, while Holden oversees the grantmaking in our Longtermism portfolio.

These portfolios represent a new way of describing our work:

• “Global Health and Wellbeing” (GHW) describes our work to increase health and/or wellbeing worldwide.
• “Longtermism” describes our work to raise the probability of a very long-lasting and positive future.

See this post for more on the two portfolios.

## Hiring and other capacity building

Since our previous update, another 27 full-time hires have joined our team. We won’t list them all here, but you can see them on our team page.

Anya Hunt published articles on Open Phil’s approach to recruiting and our plans for hiring in 2022 (see below).

### Plans for 2022: Hiring more people than ever

As we scale up our grantmaking, we’ll need to grow our staff to match. Accordingly, we plan to hire more than 30 people this year, and over 100 people in the next four years.

This represents massive growth compared to past years, which is an exciting opportunity and an immense challenge. If you’d like to help us achieve our hiring goals — and thus, all of our other goals — we are hiring recruiters and senior recruiters!

The growth also means that if you want to work at Open Phil, you’ll have more chances to apply this year than ever before. We will continue to post open positions on our Working at Open Phil page, and we encourage you to check it out! If you don’t see something you want to apply for, you can fill out our General Application, and we’ll reach out if we post a position we think might be a good fit.

Finally, we’re always looking for referrals. If you refer someone and we hire them, we’ll pay you $5,000!