Behavioral treatments for insomnia

Published: August 2016; updated November 2016

It is widely believed, and seems likely, that regular, high-quality sleep is important for personal performance and well-being, as well as for public safety and other important outcomes. Unfortunately, many people are unable to fall asleep as quickly as desired and/or unable to stay asleep as long as desired — a condition known as insomnia. Reliable and scalable treatments for insomnia could bring substantial humanitarian benefit.

The Open Philanthropy Project hopes to survey a wide range of potential cause areas in the social sciences, only some of which will turn out to look promising enough to warrant deeper investigation and potential grantmaking. We chose to conduct a brief, surface-level investigation of the evidence for the effectiveness of standard behavioral treatments for insomnia because we thought it might turn out to be promising enough to warrant deeper investigation, and because it seemed to be a well-contained topic on which we could experiment with variations on our process for generating such reports (more on this below). We might or might not investigate non-behavioral treatments for insomnia later.

I (Luke Muehlhauser) had two goals for this project: to identify the most well-regarded behavioral treatments for insomnia, and then to evaluate the state of the evidence on those treatments’ effectiveness. I did not attempt to closely examine any studies I found; mostly, I evaluated only “surface features” such as what methods the study authors claim to have used.

My overall conclusion, described in more detail here, is that I don’t think we have strong evidence to suggest that standard behavioral treatments for insomnia are effective at ≥1mo after treatment.

To increase the speed with which we can survey the evidence concerning many different potential cause areas, we decided not to invest as much time on exposition and thoroughness as we have for some other investigations.

If you know of studies or reviews which seem like they should have been mentioned or cited in this report, or which are important but were published after the initial release of this report, please send them to socialscienceupdates+insomnia@openphilanthropy.org along with your comments, if any.

My process

For this report, I will describe my literature search process in less detail than I did for my carbs-obesity report. This time, we experimented with a different investigation process that we hoped would require less overhead on our part but still allow the report to be vetted for accuracy by Open Philanthropy Project staff and by external readers. To that end, the sections describing my tentative conclusions about treatments for insomnia are footnoted with vettable probability statements such as “To be more precise: I am X% confident that my spreadsheet of RCTs on this topic includes at least Y% of RCTs on this topic which have features A, B, and C.” We suspect these probability statements will be useful only for some readers. I explain our motivations for providing these probability statements in more detail in a footnote.1
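One way such probability statements could eventually be vetted is to compare stated confidence with how often the claims turned out to be true. The sketch below is a minimal illustration in Python, not the procedure used for this report; the function names and the example forecasts are hypothetical placeholders.

```python
from collections import defaultdict

def calibration_table(forecasts):
    """Map each stated probability to the observed fraction of true claims."""
    buckets = defaultdict(list)
    for stated_probability, came_true in forecasts:
        buckets[stated_probability].append(1 if came_true else 0)
    return {p: sum(hits) / len(hits) for p, hits in sorted(buckets.items())}

def brier_score(forecasts):
    """Mean squared gap between stated probability and outcome (lower is better)."""
    return sum((p - (1 if ok else 0)) ** 2 for p, ok in forecasts) / len(forecasts)

# Placeholder forecasts: (stated probability, whether the claim checked out).
# These values are invented for illustration only.
example_forecasts = [
    (0.7, True), (0.7, True), (0.7, False),
    (0.9, True), (0.9, True),
]

table = calibration_table(example_forecasts)
score = brier_score(example_forecasts)
```

With only a handful of vetted claims, the observed fractions are very noisy, so a table like this becomes informative only after many statements have been checked.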

Here is a brief account of my literature search process. First, I searched for general overview articles on treatments for insomnia,2 and quickly learned that the relevant literature is organized under the heading of “sleep medicine.” I used these general overview articles to familiarize myself with the standard concepts, treatments, and outcome measures used in the field.

Once I learned that standard insomnia treatments have been tested by many randomized controlled trials (RCTs),3 I decided to focus only on the evidence for treatment effectiveness from RCTs. I then searched for systematic reviews (SRs) of RCTs for insomnia treatments, and found that the literature could be divided into three major categories of treatments: psychological/behavioral treatments for insomnia (I’ll call them “BTIs”), pharmacological treatments for insomnia (“PTIs”), and alternative treatments for insomnia (“ATIs”) such as acupuncture. For now, I decided to investigate only BTIs.

To survey SR-included RCTs testing the effectiveness of commonly-tested BTIs, I did the following:

  1. I made a spreadsheet (here) of all the SRs of RCTs (plus other studies, in some cases) testing the effectiveness of any kind of insomnia treatment, published online before October 2015. I found ~70 such SRs, and I think this is a fairly complete list.4
  2. I identified the SRs on this list that focused entirely or mostly on BTIs (rather than PTIs or ATIs).
  3. Two GiveWell staff members5 identified all unique RCTs across these SRs,6 and identified which ones met certain criteria discussed below, for example which outcome measures were used at each study’s last follow-up assessment. I spot-checked their work.7 Our spreadsheet of SR-included RCTs is available here.
  4. I quickly reviewed the RCTs matching certain criteria, and wrote my tentative conclusions below.
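The step of identifying unique RCTs across SRs, and of finding the RCTs included by the most SRs, can be sketched as follows. This is a hedged illustration in Python rather than the process the staff actually followed (which was done by hand in a spreadsheet); the SR and trial labels are invented placeholders, and the `normalize` helper is a hypothetical simplification of real citation matching.

```python
from collections import Counter

# Placeholder data: each SR maps to the RCTs it includes.
# These labels are invented and are not taken from the actual spreadsheet.
sr_to_rcts = {
    "SR-A": ["Trial 1 (2001)", "Trial 2 (2006)", "Trial 3 (2009)"],
    "SR-B": ["trial 2 (2006) ", "Trial 3 (2009)"],
    "SR-C": ["Trial 3 (2009)", "Trial 4 (2014)"],
}

def normalize(label: str) -> str:
    """Collapse case and whitespace so the same trial matches across SRs."""
    return " ".join(label.lower().split())

def count_sr_inclusions(mapping: dict) -> Counter:
    """Count, for each unique RCT, how many SRs include it."""
    counts = Counter()
    for rcts in mapping.values():
        # A set avoids double-counting a trial listed twice within one SR.
        counts.update({normalize(r) for r in rcts})
    return counts

counts = count_sr_inclusions(sr_to_rcts)
unique_rcts = sorted(counts)               # the deduplicated list of RCTs
most_included = counts.most_common(1)[0]   # RCT included by the most SRs
```

In practice, matching trials across SRs is harder than string normalization, because the same trial may be cited under different papers or years, which is one reason spot-checking is useful.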

An initial draft of this report was internally vetted in April and May of 2016 by Sarah Ward, which led to a few minor error corrections. We considered publishing the details of the vet and the edits it prompted, but in this case doing so would have been prohibitively time-expensive, and we decided not to do so.

How insomnia treatments are studied

First let me set the stage for my later substantive claims about BTIs by explaining some basic concepts related to BTIs, as explained by recent narrative reviews on the topic.8

A patient with insomnia can’t fall asleep as quickly as they’d like, and/or can’t stay asleep as long as they’d like. Insomnia with one or more obvious medical, psychiatric, or environmental causes (e.g. acute pain) is known as comorbid or secondary insomnia; otherwise the condition is known as primary insomnia. Common BTIs for primary or comorbid insomnia include:

  • Sleep restriction: Instructions to limit time spent in bed to as close as possible to actual sleep time.9
  • Stimulus control: Instructions which aim to strengthen the mental and physiological association between the bed and sleep, and to establish a regular sleeping schedule. E.g.: “Go to bed only when sleepy,” “Get out of bed when unable to sleep,” “No napping,” and “Arise at the same time every morning.”10
  • Sleep hygiene: Education about health practices (e.g. diet, exercise, substance use) and environmental factors (e.g. noise, light, temperature) that may affect sleep success.11
  • Relaxation training: Procedures aimed at reducing arousal, muscle tension, and thoughts that may interfere with sleep, e.g. meditation and progressive muscle relaxation. Most of these procedures require some initial training and practice.12
  • Cognitive therapy: Psychotherapy aimed at treating anxiety about sleep problems and reframing false beliefs about insomnia.13
  • Cognitive behavioral therapy for insomnia (CBT-I): A combination of several different treatments from the above list, perhaps most commonly of sleep restriction, stimulus control, and sleep hygiene.14

Hereafter, I’ll refer to these BTIs as “standard” BTIs.15

CBT-I appears to be the most commonly-discussed BTI in the research literature, and is plausibly the most common BTI in clinical practice. It can be delivered on an individual basis or in a group setting, via self-help (with or without phone support), and via computerized delivery (with or without phone support). It can also be delivered simultaneously with PTIs and ATIs.16

In RCTs testing the effectiveness of standard BTIs, night-time sleep outcomes are typically measured with one or more of the following measures:

  • Polysomnography (PSG): PSG combines objective measures of brain activity, eye movement, muscle activity and perhaps also heart rhythm, respiration, blood oxygen saturation, and other measures. PSG is widely considered the “gold standard” measure of sleep, but it has several disadvantages. It is expensive, complicated to interpret, requires some adaptation by the patient (people aren’t used to sleeping with wires attached to them), and is usually (but not always) administered at a sleep lab rather than at home.17
  • Actigraphy (ACT): An actigraph is a watch-like device that uses an accelerometer to record movement during the night. Supposedly (I haven’t checked), it correlates well with PSG on at least two key variables — total sleep time (TST) and sleep efficiency (SE: percentage of time in bed spent asleep) — in healthy subjects, but agreement rates are lower in patients with insomnia. Actigraphy is less expensive and more convenient than PSG, and can easily be used at home.18
  • Sleep diary (SD): Subjects are asked to fill out a daily diary of sleep outcomes, usually including TST, SE, sleep onset latency (SOL: how long it took to fall asleep), wake-time after initial sleep onset (WASO), and perhaps other variables. Usually, subjects are asked to self-report these outcomes in the morning, for the previous night’s sleep. Supposedly (I haven’t checked), SD is known to be less accurate than PSG or ACT, but it seems to be the most common measure of sleep outcomes in RCTs of BTIs.19
  • Questionnaires: A variety of standardized questionnaires are available to measure sleep outcomes, the most common of which is probably the Pittsburgh sleep quality index (PSQI). The PSQI includes 19 questions about sleep quality over the past month, and results in a total score for overall sleep quality as well as 7 component scores (e.g. sleep duration and sleep quality). I haven’t checked how valid and reliable this measure is.20
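To make the two outcome variables concrete, the sketch below computes TST and SE from a single sleep diary entry. It is a minimal illustration in Python: the `DiaryNight` class and its field names are hypothetical, and approximating TST as time in bed minus SOL and WASO is a simplifying assumption (individual studies may define these quantities somewhat differently).

```python
from dataclasses import dataclass

@dataclass
class DiaryNight:
    """One night's self-reported sleep diary entry, all values in minutes."""
    time_in_bed: float  # from going to bed to getting up
    sol: float          # sleep onset latency
    waso: float         # wake-time after initial sleep onset

def total_sleep_time(night: DiaryNight) -> float:
    """TST, approximated here as time in bed minus SOL and WASO."""
    return night.time_in_bed - night.sol - night.waso

def sleep_efficiency(night: DiaryNight) -> float:
    """SE: percentage of time in bed spent asleep."""
    if night.time_in_bed <= 0:
        raise ValueError("time_in_bed must be positive")
    return 100.0 * total_sleep_time(night) / night.time_in_bed

# Invented example: 8 hours in bed, 30 minutes to fall asleep, 45 minutes awake.
night = DiaryNight(time_in_bed=480, sol=30, waso=45)
tst = total_sleep_time(night)  # 405 minutes
se = sleep_efficiency(night)   # 84.375 percent
```

Note that SE can improve without any gain in TST if time in bed is reduced, which is worth keeping in mind when a treatment such as sleep restriction directly changes time in bed.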

In the section below, I focus on measurements of TST and SE, because these two variables seem (to me) to capture the most relevant outcome information without requiring that I check the results for a cumbersomely long list of outcome variables, and because they are two of the most commonly measured outcome variables.

How effective are commonly tested BTIs?

To quickly assess the likely effectiveness of commonly tested BTIs, I looked only at SR-included RCTs testing the effectiveness of standard BTIs for adults (or mostly adults). Approximately 180 unique RCTs were included across all the SRs I found (published online before October 2015).21

Long-term effectiveness, measured objectively

First, I looked at RCTs that (1) measured the long-term (≥6mo) effectiveness of one or more BTIs, and that (2) used at least one objective measure of sleep (PSG or ACT) during the last follow-up measurement.22
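The two screening criteria just described can be expressed as a simple filter. The following Python sketch is only an illustration of the logic; the `Trial` class, its field names, and the placeholder trials are assumptions, not the structure of the actual spreadsheet.

```python
from dataclasses import dataclass, field

OBJECTIVE_MEASURES = {"PSG", "ACT"}  # polysomnography and actigraphy

@dataclass
class Trial:
    label: str
    last_followup_months: float
    measures_at_last_followup: set = field(default_factory=set)

def is_long_term_objective(trial: Trial, min_months: float = 6.0) -> bool:
    """True if the last follow-up is long-term and used an objective measure."""
    long_term = trial.last_followup_months >= min_months
    objective = bool(trial.measures_at_last_followup & OBJECTIVE_MEASURES)
    return long_term and objective

# Placeholder trials, invented for illustration ("SD" = sleep diary).
trials = [
    Trial("Placeholder A", 12, {"PSG", "SD"}),
    Trial("Placeholder B", 6, {"SD"}),
    Trial("Placeholder C", 3, {"ACT"}),
    Trial("Placeholder D", 8, {"ACT"}),
]

selected = [t.label for t in trials if is_long_term_objective(t)]
```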

These criteria yielded ~20 RCTs. Unfortunately, only 7 of these RCTs had a neutral control, retained it through a follow-up period of at least 6 months (thus allowing meaningful comparisons between active treatment and neutral control at that follow-up), and reported objectively-measured TST or SE at that follow-up.23 The results of these studies, focusing on objectively-measured TST and SE, are:

Study [last follow-up] | Treatment conditions at last follow-up | Participants (at last follow-up) | Objectively measured TST and SE at last follow-up
Lichstein et al. (2001) [12mo] | Relaxation therapy vs. sleep compression vs. placebo desensitization | 74 subjects from Memphis, 59 or older, chronic primary insomnia, no sleep apnea, no sleep medications, plus some other criteria | PSG: TST and SE were worse for relaxation therapy subjects than placebo subjects. Sleep compression subjects averaged ~40 more minutes of TST than placebo subjects, and ~7 percentage points greater sleep efficiency. Statistical significance of these differences not reported.24
Wu et al. (2006) [8mo] | CBT-I vs. placebo tablets25 | 36 subjects26 from an unspecified location (Beijing?), chronic primary insomnia, no sleep apnea, no sleep medication, plus some other criteria | PSG: CBT-I subjects averaged ~53 more minutes of TST than controls, and ~10 percentage points greater sleep efficiency. Statistical significance of these differences not reported.27
Berger et al. (2009) [12mo] | CBT-I vs. healthy eating instructions28 | 155 female subjects from the U.S. Midwest, breast cancer-related fatigue, receiving chemotherapy, no pre-cancer insomnia or sleep apnea, plus some other criteria | ACT: CBT-I patients averaged 16 more minutes of TST than controls, and 1 percentage point higher “sleep percent after onset.” Statistical significance of these differences not reported.29
Espie et al. (2008) [6mo] | CBT-I vs. treatment as usual (TAU) | 106 subjects from Scotland, chronic insomnia, diagnosed with cancer, no sleep apnea, plus some other criteria | ACT: No effect of CBT-I over TAU for either TST or SE.30
Edinger et al. (2005) [6mo] | CBT-I vs. sleep hygiene vs. TAU | 20 subjects from an unspecified location (near Durham, NC?), insomnia and fibromyalgia but not other comorbidities, no sleep apnea, plus some other criteria | ACT: No group differences for either TST or SE.31
McCurry et al. (2014) [18mo] | CBT for pain vs. CBT for pain and insomnia vs. education only | 320 subjects, members of a health maintenance organization in Washington state (“Group Health”), aged 60 or older, had received care for osteoarthritis at Group Health in the past 3 years, with chronic pain and insomnia, no sleep apnea, plus some other criteria32 | ACT: TST not measured. No differences between groups for SE.33
Lichstein et al. (2013) [12mo] | CBT vs. placebo biofeedback vs. withdrawal | 61 subjects from (or near) Memphis, diagnosed with hypnotic-dependent insomnia, aged 50 or older, no sleep apnea, plus some other criteria | PSG: Probably no group differences for either TST or SE (but statistical significance not reported).34

The results of these studies are inconsistent. Though four of these seven studies did not report the statistical significance of the comparisons that most interested me,35 I would guess (based on effect sizes) that the first two studies (as listed above) each found positive effects of at least one BTI on TST and SE (objectively measured, at last follow-up) that are both statistically significant and large enough to be “practically” significant, whereas the last five studies (as listed above) found no statistically or practically significant effects (objectively measured, at last follow-up).

Moreover, these seven trials were only moderately pragmatic in design.36 For example, subject eligibility was usually tightly restricted, resulting in a sample of subjects that is not especially representative of the population we’d like to treat for insomnia. All else equal, I consider pragmatic trials to provide stronger evidence of broad intervention effectiveness than explanatory trials do, for reasons described here.

Medium-term effectiveness, measured objectively

Perhaps it is too much to hope that we would have good evidence that BTIs are effective ≥6mo after treatment. What if we look at standard BTIs’ effects on objectively measured TST and SE at the last follow-up occurring ≥1mo and ≤3mo after treatment?

These criteria yielded 16 RCTs. Unfortunately, only 3 of these RCTs had a neutral control, retained it through the designated follow-up period, and reported objectively-measured TST or SE at the designated follow-up period.37 The results of these studies, focusing on objectively measured TST and SE, are:

Study [last follow-up ≥1mo and ≤3mo after treatment] | Treatment conditions at designated follow-up | Participants (at designated follow-up) | Objectively measured TST and SE at designated follow-up
Wu et al. (2006) [3mo] | CBT-I vs. placebo tablets38 | 36 subjects39 from an unspecified location (Beijing?), chronic primary insomnia, no sleep apnea, no sleep medication, plus some other criteria | PSG: CBT-I subjects averaged ~71 more minutes of TST than controls, and ~17 percentage points greater sleep efficiency. Statistical significance of these differences not reported.40
Lovato et al. (2014) [3mo] | CBT-I vs. wait list control | 99 subjects from near Adelaide, South Australia, chronic insomnia, no sleep apnea, plus some other criteria | ACT: CBT-I subjects averaged ~30 fewer minutes of TST than controls. No difference for sleep efficiency.41
Berger et al. (2009) [3mo] | CBT-I vs. healthy eating instructions | 160 female subjects from the U.S. Midwest, breast cancer-related fatigue, receiving chemotherapy, no pre-cancer insomnia or sleep apnea, plus some other criteria | ACT: CBT-I patients averaged 10 more minutes of TST than controls, and 1 percentage point higher “sleep percent after onset.” Statistical significance of these differences not reported.42

The results of these studies are inconsistent. The first study listed above reported a CBT-I advantage that is plausibly practically and statistically significant, whereas the other two studies did not. Moreover, as with the RCTs summarized in the previous section, these three trials were only moderately pragmatic in design.43

Immediate effectiveness, measured via self-report

Finally, what if we look at self-reported TST and SE, immediately after treatment? This is the kind of summary statistic typically reported in meta-analyses of RCTs on the topic. Here are the findings from the most recent (2015-2016) SRs of RCTs of standard BTIs I reviewed:

SR | Focus of the SR | Included RCTs | Basic results for self-reported TST and SE, at post-treatment
Johnson et al. (2016) | CBT-I for cancer survivors | 8 | “CBT-I resulted in a 15.5% improvement in SE relative to control conditions.” TST not reported.
Geiger-Brown et al. (2015) | CBT-I for comorbid insomnia | 23 | Standardized mean difference for TST was .25 and for SE was .93.44
Koffel et al. (2015) | Group CBT-I | 8 | Mean effect size for TST was -.04 and for SE was .84.45
Ho et al. (2015) | Self-help CBT-I | 20 | Mean effect size for TST was .24 and for SE was .80.46
Zacharie et al. (2015) | Internet-delivered CBT-I | 11 | Hedges’ g for TST was .29 and for SE was .58.47
Trauer et al. (2015) | CBT-I, excluding studies focused on comorbid insomnia | 20 | “TST improved by 7.61… minutes, and SE improved by 9.91%.”

In short, these SRs tend to report practically relevant average effects on SE but not so much for TST.
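For readers unfamiliar with the effect size metrics quoted above, the sketch below shows one common way a standardized mean difference and Hedges’ g are computed from group summary statistics. It is a minimal illustration in Python using the usual pooled-standard-deviation formula and the standard small-sample correction; the SRs above may have used different variants, and the input numbers are invented rather than taken from any study.

```python
import math

def cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    return (mean_t - mean_c) / math.sqrt(pooled_var)

def hedges_g(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Cohen's d multiplied by the usual approximate small-sample correction."""
    d = cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c)
    correction = 1 - 3 / (4 * (n_t + n_c) - 9)
    return d * correction

# Invented example: post-treatment SE of 85% (treatment) vs. 80% (control),
# both groups with a standard deviation of 10 points and 20 subjects each.
d = cohens_d(85, 80, 10, 10, 20, 20)
g = hedges_g(85, 80, 10, 10, 20, 20)
```

In this invented example, a 5-point difference in SE with a 10-point standard deviation corresponds to d = 0.5, and the correction shrinks it only slightly because it matters mainly for very small samples.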

However, I don’t summarize more details from these SRs, or summarize details from any SR with an official publication date earlier than 2015, because I don’t weight their meta-analytic findings very heavily in my consideration of the evidence, for two reasons.

First, I don’t trust the accuracy of self-reported sleep diary measurements. In part, this is because some (but not all) narrative reviews on insomnia report that sleep diaries are considered a less accurate measure of sleep than PSG or ACT.48 And while I couldn’t find any SRs of studies comparing self-report and objective measures of sleep in adults, I did find two SRs of studies comparing self-report (or parent-report) of sleep and objective measures in children and adolescents, and both of those SRs reported low correspondence between self-report/parent-report and objective measures.49 Moreover, both a priori reasoning about self-report measures and empirical reviews of the accuracy of self-report measures (across multiple domains) lead me to be suspicious of self-reported measures of sleep.50 Finally, it’s my impression, from the dozens of studies I skim-read for this investigation, that objective and self-report measures of sleep often disagree, with the self-report measures typically showing more beneficial effects of treatment than objective measures show.51

Second, I’m interested in lasting effects of treatment, not immediate post-treatment effects.

Finally, a point that applies to studies using either self-report measures or objective measures or both: I expect few to no RCTs on this topic to be both high quality and highly pragmatic.52

My overall tentative conclusion

Standard BTIs have only rarely been tested against a neutral control at ≥1mo follow-up using objective measures of TST or SE in RCTs, and these results are inconsistent, with most such studies showing no practically significant effect of treatment at the follow-ups I checked. Moreover, I would guess that standard BTIs have never been tested in this way in a high-quality, highly pragmatic RCT. Given this, and given that I have many reasons to be suspicious of self-report measures of sleep quality, I don’t think we have strong evidence to suggest that standard BTIs are effective at ≥1mo.

I would be quite surprised if a more thorough search for RCTs testing the effectiveness of standard BTIs challenged this tentative conclusion.53

If I were to substantially change my mind about this upon further investigation, my guess is that the most likely reasons for this change of mind would be:

  1. There turn out to be reasons to think self-reported sleep diary measurements of sleep are more accurate than I currently suspect they are, and a well-designed recent meta-analysis of RCTs relying on sleep diary measurements shows substantial positive effects of standard BTIs at ≥1mo (when sleep diary measures are used), in a variety of populations and contexts.
  2. There is at least one well-conducted, highly pragmatic RCT which shows that a standard BTI improves sleep outcomes at ≥1mo (using objective measures), but I didn’t find this RCT in my search. If I found one well-conducted pragmatic RCT of this nature, that could be more persuasive to me than meta-analyses of many small, weak, mostly explanatory RCTs, for reasons described here.

Despite my skepticism about the state of the evidence on the effectiveness of standard BTIs, I continue to suggest some standard BTIs (in particular sleep restriction and sleep hygiene) to insomnia sufferers who ask me for advice. I make this suggestion not based on scientific evidence, but based on my intuitive priors about which interventions seem to me like they might work, and the fact that these interventions are usually cheap to try.

In other words, my personal recommendation that insomnia sufferers at least try the sleep restriction and sleep hygiene treatments is given from the following perspective: “The effectiveness evidence in this area is weak. But sleep restriction and sleep hygiene seem intuitively to me like they might help at least some insomnia sufferers, and that’s not true of most possible insomnia treatments one could propose (e.g. various herbal treatments, about which I have no intuitions concerning effectiveness). If you’ve got insomnia, you might as well try sleep restriction and sleep hygiene and see whether they help you. But if I wanted to predict how much human welfare (via insomnia reduction) would accrue if someone spent several million dollars improving or scaling the delivery of standard BTIs, I would say I have no idea because the scientific evidence is too weak to allow me to make that kind of judgment, even as a guesstimate.”

What might I recommend funding in this area?

Obviously, I would want to investigate this topic more deeply before making any funding recommendations. But if I had to guess, on the basis of what I know now, which funding recommendations I’d make upon investigating further, I would guess I’d end up recommending something like the following.

Before anyone funds the first large, expensive, highly pragmatic RCT on this topic, I think we should make sure we’ve got an accurate and ecologically valid measure of sleep, and I’m worried that current actigraphs aren’t accurate enough, even if they’re more accurate than sleep diaries. So, I’d be curious to learn more about the feasibility of developing a night-time sleep measure that will strongly agree with PSG for approximately all populations and conditions. It seems to me like this might be feasible, plausibly via a combination method: e.g. perhaps a comfortable-to-wear headband or skullcap, plus an improved actigraph, and maybe also some little device that listens to one’s breathing throughout the night (or even something similar to this micro-CPAP device54 but only for measuring respiration). Basically: if the startup incubator X (formerly Google X) wanted to build a highly accurate measure of sleep that didn’t require attaching wires to people, what would they build?

If we had a highly accurate measure that subjects could use relatively cheaply at home — either because a new measure was developed or because actigraphy looks more accurate to me upon deeper investigation than it does now — then my next step would probably be to recommend a relatively small pre-registered RCT with ≥6mo follow-up, open data, blinding for everything that can be blinded, and so on — just to see if we could get some preliminary good news about person-delivered CBT-I vs. computerized CBT-I vs. placebo once we’re using an accurate and ecologically valid sleep measure and checking off basic methodological boxes like pre-registration. I’d also want to make sure more development effort goes into the computerized CBT-I intervention than is usually the case.

If one such RCT was promising, or perhaps only if a few such RCTs were promising, then I might be ready to recommend a large, well-designed, multi-site, highly pragmatic RCT with ≥6mo follow-up, testing the effectiveness of person-delivered CBT-I vs. computerized CBT-I vs. placebo.

I have very little sense of how much these things would cost. My guess is that if a better measure of sleep (of the sort I described) can be developed, it could be developed for $2M-$20M. I would guess that the “relatively small” RCTs I suggested might cost $1M-$5M each, whereas I would guess that a large, pragmatic RCT of the sort I described could cost $20M-$50M. But these numbers are just pulled from vague memories of conversations I’ve had with people about how much certain kinds of product development and RCT implementation cost, and my estimates could easily be off by a large factor, and maybe even an order of magnitude.
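To show how these rough guesses combine, the sketch below adds up the low and high ends of each range for a given number of “relatively small” RCTs. It is only illustrative arithmetic in Python: the number of small RCTs is an assumption (the text says only “one” or “a few”), and the result inherits all the uncertainty of the underlying guesses.

```python
# Cost ranges from the paragraph above, in millions of US dollars (low, high).
SLEEP_MEASURE = (2, 20)
SMALL_RCT_EACH = (1, 5)
LARGE_PRAGMATIC_RCT = (20, 50)

def total_cost_range(n_small_rcts: int) -> tuple:
    """Sum the low and high ends for the measure, small RCTs, and large RCT."""
    low = SLEEP_MEASURE[0] + n_small_rcts * SMALL_RCT_EACH[0] + LARGE_PRAGMATIC_RCT[0]
    high = SLEEP_MEASURE[1] + n_small_rcts * SMALL_RCT_EACH[1] + LARGE_PRAGMATIC_RCT[1]
    return (low, high)

# Assumption for illustration: three small RCTs before the large one.
low_m, high_m = total_cost_range(3)  # (25, 85), i.e. $25M-$85M
```

Adding the endpoints this way gives the widest plausible range, and it does not capture the possibility that the individual guesses are off by a large factor.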

Sources

Adamo et al. (2009)
Airing micro-CPAP
Amazon
Barnow & Greenberg (2014)
Barsevick et al. (2010)
Bastien et al. (2012)
Bauer & Blunden (2008)
Belanger et al. (2007)
Berger et al. (2009)
Berry (2011)
Bhandari & Wagner (2006)
Bound et al. (2001)
Bryant et al. (2014)
Chan (2009)
Cochrane Collaboration Risk of Bias Tool
Currie et al. (2000)
Donaldson & Grant-Vallone (2002)
DynaMed
Edinger et al. (2005)
Espie et al. (2008)
Fayers & Machin (2016)
Fernandez-Ballesteros & Botella (2007)
Fiorentino et al. (2010)
Geiger-Brown et al. (2015)
Geretsegger et al. (2012)
Google Scholar
Gorber et al. (2007)
Gorber et al. (2009)
Groves et al. (2009)
Hauri (1981)
Ho et al. (2015)
Hoch et al. (2001)
Hodge et al. (2012)
Johnson et al. (2016)
Jungquist et al. (2012)
Koffel et al. (2015)
Kowalski et al. (2012)
Kryger et al. (2016), 6th Edition
Kryger et al. (2016), 5th Edition
Kuncel et al. (2005)
Lichstein et al. (2001)
Lichstein et al. (2012)
Lichstein et al. (2013)
Loudon et al. (2015)
Lovato et al. (2014)
Luke Muehlhauser, Insomnia treatment SRs
McCurry et al. (2007)
McCurry et al. (2014)
Meyer et al. (2009)
Miller et al. (2015)
Morin (2010)
Morin & Benca (2012)
Morin et al. (1999)
Open Philanthropy Project, RCTs included in SRs on behavioral treatments of insomnia
Payne et al. (2008)
Perlis et al. (2010)
PRECIS-2
Price et al. (2011)
Prince et al. (2008)
Schwarz et al. (2008)
Smith (2011)
Smith et al. (2002)
Stalans (2012)
Stone et al. (1999)
Stone et al. (2007)
Streiner & Norman (2008)
Suziedelyte & Johar (2013)
Taylor et al. (2014)
Thomas & Frankenberg (2002)
Trauer et al. (2015)
UpToDate
Vitiello et al. (2013)
Wikipedia, X
Wood & McCall (2013)
Wu et al. (2006)
Zacharie et al. (2015)
  • 1. Both GiveWell and the Open Philanthropy Project aim to communicate the evidence and reasoning behind our claims in a transparent manner whenever such transparency is not cost-prohibitive (see here, here, and here). One way to achieve such transparency is to provide support (from the scientific literature, expert interviews, or other sources) for nearly all our substantive claims, and provide a relatively comprehensive account of our literature search process. This is the usual approach of GiveWell intervention reports, but we can’t justify the expense of that approach for “shallow” investigations such as this report. Still, I have tried to write this report in a way that makes it easy for the reader to discern what kind of support I think I have for (nearly) every substantive claim I make. These “kinds” of support can include: other reports we’ve written, careful analysis of one or more studies, shallow analysis of one or more studies, verifiable facts I can easily provide sources for, verifiable facts I can’t easily provide sources for, expert opinion I feel comfortable assessing, expert opinion I can’t easily assess, probabilistic statements about the state of the literature given the literature searches I conducted, and more. The primary “innovation” of this report is its footnoted probability statements, which are much less costly for me to produce than a thorough account of every literature search I conducted and every paper I looked at, but which (we hope) still provide some indication of what kind of support I think I have for my statements about the state of the literature (given the searches I conducted). Of course, the reader cannot assume that my probability statements are well-calibrated, in the sense that things I state with 60% confidence will on average be true 60% of the time, things I state with 80% confidence will on average be true 80% of the time, and so on. 
I have undergone substantial calibration training, and appear to be well-calibrated on trivia questions, but I don’t know how well-calibrated I am on the types of probabilistic claims made in the footnotes of this report. But such probability statements do at least give some indication of the support I think I have for my statements about the literature in the main text of this report. Moreover, they are precise enough that they can be checked for accuracy by Open Philanthropy Project staff or by external readers, which makes it possible (eventually, over the long run) to learn how well-calibrated I am when making such predictions and, if I turn out to be poorly calibrated on such claims, then future vetting of those claims will also make it possible to improve my calibration.
  • 2. Most of my searches use Google Scholar, because in my own experiments I have found that it (1) uses a more comprehensive database of sources than e.g. PubMed or EMBASE, and that it (2) more reliably brings the most relevant results to the top of the search results pages than other literature search engines do. In addition to Google Scholar keyword searches, I use backward and forward citation searches. (A backward citation search checks for sources cited by a given source, whereas a forward citation search checks for later-published sources that cite a given source.) I also search Amazon for academic books on the topic. I also check UpToDate and DynaMed.
  • 3. The way I use the term, an RCT can have either a placebo control or an “active” control or both.
  • 4. To be more precise: I’m 70% confident there are fewer than 5 SRs on this topic, that I did not find, published online before October 2015, which include at least 5 RCTs testing the effectiveness of one or more treatments for insomnia.
  • 5. Karalyn Lacey and Tracy Williams collectively put more than 30 hours of work into this project. My thanks to them both!
  • 6. I also used this table of RCTs to identify additional SRs. Specifically, I identified the RCTs cited by the most SRs, and then checked Google Scholar for sources citing those RCTs and using SR-related keywords.
  • 7. The chance of errors in our spreadsheet of SR-included RCTs was considered when stating my probabilities (later in this document) that there are RCTs of various types that I did not review. So, too, was the chance that some RCTs we could not retrieve (and thus could not evaluate) actually do meet certain criteria discussed below.
  • 8. E.g. Morin (2010); Morin & Benca (2012); Lichstein et al. (2012); Perlis et al. (2010); Miller et al. (2015); Bastien et al. (2012); Berry (2011), ch. 25; Wood & McCall (2013). Note that some of my sources in this report are chapters from the 5th edition of Kryger et al.’s Principles and Practice of Sleep Medicine. The 6th edition was released shortly after I finished a first draft of this report. I skimmed some of the insomnia chapters of the 6th edition once it was released, and doing so did not shift my tentative conclusions.
  • 9. Morin (2010) defines sleep restriction as “A method designed to restrict time spent in bed… as close as possible to the actual sleep time, thereby strengthening the homeostatic sleep drive” (p. 867). Lichstein et al. (2012) defines sleep restriction as “Prescribed time in bed is abruptly reduced to match total sleep time to gain better sleep consolidation” (p. 454).
  • 10. Morin (2010) defines stimulus control as “A set of instructions designed to reinforce the association between the bed and bedroom with sleep and to re-establish a consistent sleep-wake schedule” and lists the following instructions: “Go to bed only when sleepy,” “Get out of bed when unable to sleep,” “Use the bed/bedroom for sleep only (no reading, watching TV, etc.),” “Arise at the same time every morning,” and “No napping” (p. 867).
  • 11. Morin (2010) defines sleep hygiene education as “General guidelines about health practices (e.g., diet, exercise, substance use) and environmental factors (e.g., light, noise, excessive temperature) that may promote or interfere with sleep” (p. 867).
  • 12. Morin (2010) defines relaxation training as “Clinical procedures (e.g., progressive muscle relaxation, meditation) aimed at reducing autonomic arousal, muscle tension, and intrusive thoughts interfering with sleep. Most relaxation procedures require some professional guidance initially and daily practice over a period of a few weeks” (p. 867).
  • 13. Morin (2010) defines cognitive therapy (for insomnia) as “Psychological approach using socratic questioning and behavioral experiments to reduce excessive worrying about sleep and to reframe faulty beliefs about insomnia and its daytime consequences” (p. 867).
  • 14. Morin (2010) defines CBT (for insomnia) as “A multimodal intervention combining some of the above cognitive and behavioral… procedures” (p. 867).
  • 15. Physical exercise is an example of another BTI, but the narrative reviews of BTIs I found tended to say little or nothing about physical exercise as a treatment for insomnia, and instead focused on the “standard” BTIs I’ve listed here.
  • 16. I found many or several RCTs testing each of these types of CBT-I.
  • 17. Miller et al. (2015) explains: “Sleep and wake states are measured by EEG whereby electrodes on the scalp record electrical brain activity… The recording of brain activity by EEG is only one aspect of the overall diagnostic sleep study. The study can gather other information about the body during sleep, using polysomnography (PSG), which simply means ‘many sleep recordings’ in addition to EEG, and the following measures can also determine sleep: eye movement measured by electrooculogram (EOG) and muscle tone measured via the electromyogram (EMG) on the chin… [Some] disorders can be distinguished using any number of measures, including respiratory effort and airflow, snoring, body position, heart rate, oxygen saturation levels, and limb and jaw movements over the course of the night” (p. 69). In Table 4.1 (p. 67), Miller et al. (2015) describes PSG as the “gold standard measure of sleep” and lists the following disadvantages: “expensive,” “difficult to assess,” “discomfort,” “first-night effect,” and “requires analysis and interpretation.”
  • 18. Miller et al. (2015) explains: “Actigraphy is cost effective and more convenient than a full PSG… and it can be repeated across many nights to build an ecologically valid assessment of sleep without the first-night effect of PSG… Actigraphs are typically watch-like devices worn on the nondominant hand, and they use an accelerometer to record movement over a given threshold… actigraphy has demonstrated high agreement rates with PSG data for TST and sleep efficiency variables in healthy subjects… However, these variables have been found to have lower agreement rates in patients with [obstructive sleep apnea] and insomnia” (pp. 74-76).
  • 19. Miller et al. (2015) explains: “Sleep diaries are widely used in sleep science… Self-monitoring of sleep through a sleep diary… normally includes the following estimated measures: sleep onset latency (SOL), wake-time after initial sleep onset (WASO), TST, total time spent in bed, sleep efficiency… and a numerical estimation of overall sleep quality. Normally, patients are asked to complete the diary before they commence their day and refer back to the previous night’s sleep with approximations to the nearest 5min… When compared to gold standard PSG and actigraphy, sleep diaries tend to be less accurate…” (p. 77). Similarly, Wood & McCall (2013) report: “Sleep diaries have been compared with PSG and ACT in depressed insomniacs… While PSG and ACT sleep parameters had positive correlations, significant differences were observed between sleep diaries and PSG. This study suggests that ACT is a better reflection of PSG sleep than sleep diaries” (p. 796).
  • 20. Miller et al. (2015) explains: “Sleep can be profiled subjectively through self-report questionnaire measures… The Pittsburgh sleep quality index… is one of the most widely used self-report measures for the assessment of sleep quality… This is a … retrospective assessment of sleep quality and sleep disturbance over a 1-month period. Patients score 19 individual items… Seven components are derived from the individual item scores, and a total score of overall sleep quality, ranging from 0 to 21, is obtained by adding the 7 component scores” (p. 78).
  • 21. I did not look at some SRs that included RCTs on all kinds of treatments (rather than focusing on BTIs): Belanger et al. (2007), McCurry et al. (2007), and Smith et al. (2002). Finally, I did not look at SRs published earlier than Morin et al. (1999), the first SR of nonpharmacological treatments for insomnia conducted by a task force appointed by the American Academy of Sleep Medicine. These facts are taken into account when stating my probabilities (later in this document) that there are RCTs of various types that I did not review.
  • 22. See below for more details on why I focused on objective measures of sleep.
  • 23. “Neutral control” is contrasted with a “positive” control, i.e. another active treatment. I excluded Hauri (1981) from the tally of RCTs because, though it was included in one of the SRs I found, it does not test one of the standard modern BTIs, but instead tests biofeedback treatments. I also excluded Hoch et al. (2001) from the tally of RCTs, because it is only a pilot study. Jungquist et al. (2012) was excluded from the tally of RCTs because while it collected the relevant data, it does not seem to show how ACT-measured TST or SE scores differed between the treatment and control groups at the relevant follow-up (6mo). Table 2 reports comparisons between sleep diary and ACT measures, and Table 3 reports diary-measured group means, but neither table compares ACT-measured group means. But I might be misunderstanding their tables, and I did not contact the authors to clarify. In any case, it is a very small study, with only 20 subjects at the month follow-up. I decided to include Lichstein et al. (2001) even though it appears that all participants (including those in the placebo condition) were given sleep hygiene instructions at the start of the study (“All participants received sleep hygiene instructions in the first treatment sessions”). Note also that although the spreadsheet of RCTs accompanying this report checked for RCTs satisfying my criteria for the last ≥6mo follow-up (and not necessarily for those satisfying my criteria at any ≥6mo follow-up), I remain fairly confident in my statement that “only 7 of these RCTs had a neutral control, retained it through a follow-up period of at least 6 months (thus allowing meaningful comparisons between active treatment and neutral control at that follow-up), and reported objectively-measured TST or SE at that follow-up,” and my uncertainty about this is incorporated into the probabilistic judgments I make elsewhere in this report.
  • 24. The paper doesn’t seem to report whether these differences are statistically significant, and I did not take the time myself to compute their statistical significance and check whether e.g. distributional assumptions were met. The discussion section makes the general claim that “Our main findings are that psychological treatments for insomnia in older adults are effective, but this conclusion does not stand without qualification. These results were obtained in the sleep diaries but not in the PSG data.” Another caveat to this result is that the PSG adaptation period at follow-up was only one night long, which doesn’t intuitively seem to me like enough time to adapt to (1) sleeping in a lab instead of one’s home, and (2) sleeping with lots of wires connected to one’s body.
  • 25. This trial also included pharmacotherapy treatments that are not listed here because they are not BTIs.
  • 26. The study reports that 71 subjects completed the treatment protocol (36 in the CBT-I and placebo conditions), but doesn’t say whether any subjects dropped out between post-treatment and the last follow-up measurements.
  • 27. See Table 1. The paper doesn’t seem to report whether these differences (between CBT-I and placebo) are statistically significant, and I did not take the time myself to compute their statistical significance and check whether e.g. distributional assumptions were met.
  • 28. The treatment condition is called the “Individualized Sleep Promotion Plan,” but this turns out to be a combination of stimulus control, sleep restriction, relaxation therapy, and sleep hygiene, which is consistent with what is normally called CBT-I. I counted the “healthy eating” control condition as a “neutral” control rather than a positive control, since healthy eating coaching is not a treatment targeted very directly at sleep.
  • 29. The published paper refers to this information as being in “Table B,” but Table B was never published. The numbers I provide here — both for outcomes and for subject count (counting only subjects measured by ACT) — are taken from personal communication with the study’s lead author, Dr. Ann Berger, in January 2016. The paper doesn’t report whether these differences are statistically significant, and I did not take the time myself to compute their statistical significance and check whether e.g. distributional assumptions were met.
  • 30. See Table 4.
  • 31. See Table 3.
  • 32. Details about the subjects of this RCT are provided on p. 948 of an earlier paper, Vitiello et al. (2013).
  • 33. “…benefits for insomnia observed over 9 mo were reduced at 18 mo and did not achieve statistical significance for any group comparison” (p. 302).
  • 34. See p. 792.
  • 35. This tally does not include McCurry et al. (2014), which reported follow-up data for SE but never collected TST measurements.
  • 36. To be more precise: If three professionally trained users of the PRECIS-2 tool used that tool to assess the pragmaticness of these seven trials, I’m 70% confident that none of these trials would achieve an average domain score of 3.7 or higher (after averaging the domain scores from each of the three PRECIS-2 users). PRECIS-2 has 9 domains, each scored from 1 (very explanatory) to 5 (very pragmatic). In some cases, one or more domains are not applicable to a particular trial, and that is why I state my prediction in terms of average domain score rather than total score. The paper introducing the PRECIS-2 tool, Loudon et al. (2015), provides ratings and reasoning for four example trials, and gives Geretsegger et al. (2012) as an example of a relatively pragmatic trial design (average domain score: 4) and Price et al. (2011) as a relatively explanatory trial design (average domain score: 3.44).
  • 37. Currie et al. (2000) is excluded from this tally because, though it used ACT at follow-up, it did not report ACT-measured SE or TST at follow-up. I also excluded Taylor et al. (2014), Fiorentino et al. (2010), and Hoch et al. (2001) because they are pilot studies. Barsevick et al. (2010) is excluded because it tested an uncommon BTI called EASE (the study’s results were null, anyway). Payne et al. (2008) is excluded from this tally because its intervention is exercise. Jungquist et al. (2012) was excluded from the tally of RCTs because while it collected the relevant data, it does not seem to show how ACT-measured TST or SE scores differed between the treatment and control groups at the relevant follow-up (3mo). Table 2 reports comparisons between sleep diary and ACT measures, and Table 3 reports diary-measured group means, but neither table compares ACT-measured group means. But I might be misunderstanding their tables, and I did not contact the authors to clarify. In any case, it is a very small study, with only 20 subjects at the 3 month follow-up.
  • 38. This trial also included pharmacotherapy treatments that are not listed here because they are not BTIs.
  • 39. The study reports that 71 subjects completed the treatment protocol (36 in the CBT-I and placebo conditions), but doesn’t say whether any subjects dropped out between post-treatment and the 3mo follow-up measurements.
  • 40. See Table 1. The paper doesn’t seem to report whether these differences (between CBT-I and placebo) are statistically significant, and I did not take the time myself to compute their statistical significance and check whether e.g. distributional assumptions were met.
  • 41. See Table 2.
  • 42. As stated about this study in the previous table, the numbers I provide here — both for outcomes and for subject count (counting only subjects measured by ACT) — are taken from personal communication with the study’s lead author, Dr. Ann Berger, in January 2016. The paper doesn’t report whether these differences are statistically significant, and I did not take the time myself to compute their statistical significance and check whether e.g. distributional assumptions were met.
  • 43. To be more precise: If three professionally trained users of the PRECIS-2 tool used that tool to assess the pragmaticness of these three trials, I’m 75% confident that none of these trials would achieve an average domain score of 3.7 or higher (after averaging the domain scores from each of the three PRECIS-2 users).
  • 44. See Table 3.
  • 45. See Table 4.
  • 46. See Table S3.
  • 47. See Table 2, which claims these numbers were “adjusted for publication bias” using the trim and fill method.
  • 48. For example, Miller et al. (2015) cites “Poor correlation with PSG” as a disadvantage of sleep diary measures, and the chapter later states that “When compared to gold standard PSG and actigraphy, sleep diaries tend to be less accurate…” Bastien et al. (2012) is less clear, and doesn’t comment on the accuracy of sleep diary measurements relative to PSG or ACT except to say (somewhat confusingly) that sleep diary estimates “yield a reliable and valid index of insomnia, even though they do not reflect absolute values obtained from polysomnography.” I searched Google Scholar for studies comparing sleep diary measurements to PSG or ACT measurements and found dozens of studies, but no SRs of such studies (none focused on adult subjects, anyway).
  • 49. Bauer & Blunden (2008) found 17 studies matching their criteria, and concluded that there is “a great deal of variance between subjective and objective sleep reports.” Hodge et al. (2012) focused on studies of children with autism spectrum disorders, found 11 studies matching their criteria, and concluded that “with the exception of sleep latency, parents’ reports of children’s sleep are not consistently associated with objective measures of children’s sleep.” I am 70% confident that there is no other such SR published online before October 2015.
  • 50. I am still researching the accuracy of self-report measures across multiple domains, and might or might not produce a separate report on the topic. In the meantime, I only have time to point to some of the sources that have informed my preliminary judgments on this question, without further comment or argument at this time. Sources that provide theoretical considerations and non-systematic evidence in favor of substantial a priori concern about the accuracy of self-report measures include Stone et al. (1999); Groves et al. (2009), especially section 7.3; Stalans (2012); Schwarz et al. (2008). Broad (and in some cases, systematic) empirical reviews (or unusually large-scale primary studies) comparing self-report measures to “gold standard” objective measures include Bryant et al. (2014); Gorber et al. (2007); Prince et al. (2008); Bhandari & Wagner (2006); Gorber et al. (2009); Adamo et al. (2009); Kowalski et al. (2012); Kuncel et al. (2005); Bound et al. (2001); Meyer et al. (2009); Barnow & Greenberg (2014). Finally, one cherry-picked primary study I found disheartening with regard to the accuracy of self-report was Suziedelyte & Johar (2013). Please keep in mind that this is only a preliminary list of sources: I have not evaluated any of them closely, they may be unrepresentative of the literature on self-report as a whole, and I can imagine having a different impression of the typical accuracy of self-report measures if and when I complete my report on the accuracy of self-report measures. My uncertainty about the eventual outcome of that investigation is accounted for in the predictions I have made in other footnotes in this report. For those interested in the topic, I list some additional general sources I found useful, again without comment or argument at this time: Stone et al. (2007); Smith (2011); Fernandez-Ballesteros & Botella (2007); Donaldson & Grant-Vallone (2002); Thomas & Frankenberg (2002); Chan (2009); Streiner (2008); Fayers & Machin (2016), ch. 19.
  • 51. To be more precise, I’ll describe two tests. First test: Select at random 20 RCTs from our spreadsheet of SR-included RCTs which reported both SD-measured TST and SE and objectively-measured TST and SE at ≥1mo follow-up (if there are multiple such follow-ups, select one at random for each study). From among these studies, select only those RCTs which reported a positive effect of a standard BTI on SD-measured TST and SE at that follow-up. I’m 70% confident that at least 35% of these RCTs will fail to report a positive effect of the chosen standard BTI on objectively-measured TST and/or SE at the chosen follow-up. Second test: Among the 20 RCTs selected at random for the first test (and using the same chosen follow-ups), select the RCTs which reported a positive effect of a standard BTI on objectively-measured TST and SE at that follow-up. I’m 70% confident that fewer than 35% of these RCTs will fail to report a positive effect of the chosen standard BTI on SD-measured TST and/or SE at the chosen follow-up.
  • 52. To be more precise, I’ll describe two tests. The test for study quality is this: If three professionally trained users of the Cochrane Risk of Bias Tool (from version 5.1 of the Cochrane Handbook) used that tool to assess the quality of an RCT, and the RCT received a “low risk” rating for at least 4 of the 7 categories of risk from at least two of the users of the tool, I’d consider it a “high quality” study (for the purposes of this footnote only), otherwise I wouldn’t consider it high quality (for the purposes of this footnote). The Risk of Bias Tool’s 7 categories are: random sequence generation, allocation concealment, blinding of participants and personnel (I’m assuming studies on BTIs must fail on this category, because BTIs cannot be blinded to participants), blinding of outcome assessment, incomplete outcome data, selective reporting, and other bias. The test for pragmaticness is this: If three professionally trained users of the PRECIS-2 tool used that tool to assess the pragmaticness of an RCT, and it achieved an average domain score of 3.9 or higher (after averaging the domain scores from each of the three PRECIS-2 users), I’d call that a “highly pragmatic” trial, otherwise not. I’m 80% confident that no RCT — that I found or didn’t find, which tests a standard BTI against a neutral control, and was published before October 2015 — would pass both these tests for high quality and high pragmaticness.
  • 53. A more precise statement of the 2nd sentence in my conclusion paragraph can be found in the previous footnote. To be more precise about the 1st sentence in my conclusion paragraph: I’m 75% confident that there are fewer than 12 RCTs, published online before October 2015, that I didn’t describe in this report, that find a positive effect of a standard BTI (over a neutral control condition) at ≥1mo follow-up using PSG or ACT measures of TST or SE. (Note that I found at least 5 RCTs which I didn’t mention in this report because they had a final follow-up period >3mo and
  • 54. Even if the linked and still under-development micro-CPAP device doesn’t actually work, it seems plausible that a similar small device devoted exclusively to measuring respiration might work.
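For readers who want to check the two tests in footnote 52 mechanically, here is a minimal Python sketch of how I read them. Several things in it are assumptions rather than part of either tool: the function names, the dictionary layout for ratings, and two interpretive choices. First, I assume "at least 4 of the 7 categories from at least two of the users" means that at least two raters each rate at least 4 categories as low risk. Second, I assume the average domain score is computed by averaging each applicable PRECIS-2 domain across raters and then averaging across domains, skipping domains marked not applicable. Other readings are possible and could give different answers for borderline trials.

```python
from statistics import mean

# The 7 risk-of-bias categories, as listed in footnote 52.
ROB_CATEGORIES = [
    "random sequence generation",
    "allocation concealment",
    "blinding of participants and personnel",
    "blinding of outcome assessment",
    "incomplete outcome data",
    "selective reporting",
    "other bias",
]


def is_high_quality(rob_ratings, min_low_risk=4, min_raters=2):
    """Footnote 52 quality test (under the assumed reading described above).

    rob_ratings: one dict per rater, mapping category name to a rating
    string such as "low", "high", or "unclear" (hypothetical layout).
    """
    raters_passing = 0
    for rating in rob_ratings:
        low_count = sum(1 for c in ROB_CATEGORIES if rating.get(c) == "low")
        if low_count >= min_low_risk:
            raters_passing += 1
    return raters_passing >= min_raters


def average_domain_score(precis_ratings):
    """Average PRECIS-2 domain score across raters and applicable domains.

    precis_ratings: one dict per rater, mapping domain name to a score
    from 1 to 5, or None when the domain is not applicable.
    """
    domains = set().union(*(r.keys() for r in precis_ratings))
    per_domain = []
    for d in domains:
        scores = [r[d] for r in precis_ratings if r.get(d) is not None]
        if scores:  # skip domains no rater could score
            per_domain.append(mean(scores))
    return mean(per_domain)


def is_highly_pragmatic(precis_ratings, threshold=3.9):
    """Footnote 52 pragmaticness test: average domain score >= threshold."""
    return average_domain_score(precis_ratings) >= threshold


def passes_both_tests(rob_ratings, precis_ratings):
    """True only if a trial is both "high quality" and "highly pragmatic"."""
    return is_high_quality(rob_ratings) and is_highly_pragmatic(precis_ratings)
```

The same `average_domain_score` function could be used for the predictions in footnotes 36 and 43 by passing `threshold=3.7` to `is_highly_pragmatic`. Any ratings fed into these functions would have to come from actual trained raters; the sketch only encodes the decision rule.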