Note: details section (“Actually”) updated 21-OCT-2019. Additions shown in blue, deletions in strikethrough.

 

In brief: A good-faith, high-power attempt to reproduce a previously-observed finding.

Slightly longer: A good-faith attempt to reproduce a previously-observed finding, with a sample large enough to find the effect if it is there.

We will presume you already know what a replication is.  So… what is “high quality”? Ideal first, then actual.

Ideally

An ideal high-quality study would be definitive: if there is an effect, it will be found, and if not, it won’t. No single study can reach this ideal, but if we ran a bunch of replications, all in different labs, we might come close. Multiple replications smooth out individual quirks or mistakes, and jointly they provide a sample large enough for nearly 100% power to detect a real effect – even if it is notably smaller than the original paper claimed, as most are. That’s not practical for SCORE, but amazingly it has been done at smaller scale – see, for example, the Many Labs 2 study.

Such a study would come close to measuring what we really want to know: can we trust the claim? SCORE’s program description says:

Confidence scores are quantitative measures that should enable a … consumer of SBS [Social and Behavioral Science] research to understand the degree to which a particular claim or result is likely to be reproducible or replicable.¹

This ideal definition abstracts away from the limits of any particular replication, which itself may get an unlucky draw.  Therefore, DARPA suggested SCORE forecasters consider this ideal scenario:

If you’re now trying to specify sampling distributions for studies, stop.  This is the ideal. The point is to imagine sufficient effort that everyone would agree the result is a reliable Confidence Score. If all our forecasters kept this frame in mind, and forecast well, we would get the desired Confidence Scores. In fact, if we’ve idealized properly, it should be a very good estimate of how much they believe the claim is true. 

But full disclosure: our accuracy isn’t based on the ideal, because ain’t nobody got time to run 100 replications each on 150+ claims. (At only $10K apiece, that would be $150M!)  It’s based on one high-quality replication.  The forecast may be similar, but it’s unlikely to be the same.  

So what are you really forecasting? What is “high quality” really? 

Actually

SCORE’s TA1 team will run a single high-quality replication of each selected claim.  We assume good faith and competence, so “high quality” amounts to “high power”. The power must be high enough that a failed replication is informative about the truth of the claim. Simonsohn (2015) argues for 2.5x the original sample size – a good rule but sometimes wasteful or practically unreachable.

TA1 will use the two-stage approach from SSRP, noting cases where it is not feasible.

The sampling strategy is similar to the one used in the original studies (students or other easily accessible adult subject pools) and is described in the Replication Reports for each replication posted at OSF (https://osf.io/pfdyw/). For sample sizes, a two-stage procedure was used, with 90% power to detect 75% of the original effect size at the 5% level (two-sided test) in Stage 1; if the effect was not significant in the original direction in Stage 1, a second data collection was carried out with 90% power to detect 50% of the original effect size at the 5% level (two-sided test) in the pooled Stage 1 and Stage 2 data.
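To make the arithmetic concrete, here is a minimal sketch of that two-stage rule for a simple two-group design with a standardized effect size (Cohen’s d). The choice of a t-test power calculation, the statsmodels library, and the example effect size are illustrative assumptions on our part, not TA1’s actual procedure or code.

    # Minimal sketch of the two-stage sample-size rule, assuming a simple
    # two-group comparison and a standardized effect size (Cohen's d).
    # Illustrative only; not the TA1 team's actual code.
    from statsmodels.stats.power import TTestIndPower

    def two_stage_sample_sizes(original_d, alpha=0.05, power=0.90):
        """Return (Stage 1 n per group, pooled n per group after Stage 2)."""
        solver = TTestIndPower()
        # Stage 1: 90% power to detect 75% of the original effect (two-sided).
        n_stage1 = solver.solve_power(effect_size=0.75 * original_d,
                                      alpha=alpha, power=power,
                                      alternative='two-sided')
        # Stage 2 (run only if Stage 1 is not significant in the original
        # direction): the pooled Stage 1 + Stage 2 sample must give 90% power
        # to detect 50% of the original effect.
        n_pooled = solver.solve_power(effect_size=0.50 * original_d,
                                      alpha=alpha, power=power,
                                      alternative='two-sided')
        return int(round(n_stage1)), int(round(n_pooled))

    # Hypothetical original study reporting d = 0.5:
    n1, n_pooled = two_stage_sample_sizes(0.5)
    print(n1, n_pooled - n1)  # Stage 1 n per group; extra n per group for Stage 2

Note that under this rule the Stage 2 “top-up” is simply the pooled target minus what Stage 1 already collected.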

Previous plan below, struck out.

Therefore the TA1 team gets three estimates of the replication sample size, and aims for the middle:

  • Minimum power = the smallest of: the sample size giving 95% power to detect (p < .05) the original effect size, the sample size giving 80% safeguard power, and 2.5×N (2.5 times the original sample size)
  • Target power = the middle of those three estimates
  • Aspirational power = the largest of those three estimates

All studies will be conducted to one of these high, but imperfect, standards, aiming for at least Target Power.²  
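For concreteness, here is a rough sketch of how those three estimates could be computed for the same simple two-group design. The safeguard step powers the study against the lower confidence bound of the original effect, in the spirit of Perugini et al. (2014); the specific inputs below are hypothetical.

    # Sketch of the struck-out plan: three candidate sample sizes, sorted so
    # the smallest is Minimum, the middle Target, and the largest Aspirational.
    # Illustrative assumptions throughout; not TA1's actual code.
    from statsmodels.stats.power import TTestIndPower

    def three_estimates(original_d, d_ci_lower, original_n_per_group, alpha=0.05):
        solver = TTestIndPower()
        # (a) 95% power to detect the original effect size as published.
        n_high_power = solver.solve_power(effect_size=original_d, alpha=alpha,
                                          power=0.95, alternative='two-sided')
        # (b) 80% "safeguard" power against the lower confidence bound of the
        #     original effect (Perugini et al., 2014).
        n_safeguard = solver.solve_power(effect_size=d_ci_lower, alpha=alpha,
                                         power=0.80, alternative='two-sided')
        # (c) Small telescopes (Simonsohn, 2015): 2.5 times the original sample.
        n_telescopes = 2.5 * original_n_per_group
        minimum, target, aspirational = sorted([n_high_power, n_safeguard,
                                                n_telescopes])
        return minimum, target, aspirational

    # Hypothetical original: d = 0.5, 95% CI lower bound 0.2, n = 40 per group.
    print([round(n) for n in three_estimates(0.5, 0.2, 40)])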

Should this affect your forecast?  How? See What do I forecast?


Extra: Where do the power estimates come from?

As noted above, TA1 is using the two-stage approach from SSRP, but the following may still be of interest.

  • High power (Open Science Collaboration, 2015) aims for 95% power to detect the original effect size. This is the most straightforward approach, but it presumes that the original effect-size estimate is credible.
  • Safeguard power (Perugini et al., 2014) bases power estimates on the lower bound of the confidence interval of the original estimate. This is a useful strategy for precisely measured original effects while still taking into account possible inflation of published effect sizes. It is a less productive strategy (i.e., it demands extremely large samples) for imprecisely estimated effects, particularly those whose confidence interval has a lower bound close to zero (an original p-value close to .05).
  • Small telescopes (Simonsohn, 2015) bases power estimates on the original sample size, providing a very simple rule of using 2.5× the original sample. This complements safeguard power in that it is (relatively) practical for imprecisely estimated effects whose confidence-interval lower bound is close to zero; a toy comparison follows this list. However, it is less productive – and counterproductive for resource management – for original studies that were precisely estimated and highly significant. It is also not applicable to nested sampling such as the multi-level models common in some areas of social-behavioral research (e.g., education).
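To see that trade-off numerically, the toy comparison below (hypothetical numbers, same two-group assumptions as the sketches above) shows how the safeguard sample size balloons as the lower confidence bound of the original effect approaches zero, while the small-telescopes rule stays fixed at 2.5× the original N.

    # Toy comparison: safeguard power vs. small telescopes as the original
    # effect's lower confidence bound shrinks. Hypothetical numbers only.
    from statsmodels.stats.power import TTestIndPower

    solver = TTestIndPower()
    original_n_per_group = 40  # hypothetical original study

    for d_lower in (0.40, 0.20, 0.10, 0.05):
        n_safeguard = solver.solve_power(effect_size=d_lower, alpha=0.05,
                                         power=0.80, alternative='two-sided')
        n_telescopes = 2.5 * original_n_per_group
        print(f"lower bound d = {d_lower:.2f}: safeguard n ~ {n_safeguard:.0f} "
              f"per group; small telescopes n = {n_telescopes:.0f} per group")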

¹ It says “DoD consumer” because DARPA is in the US Dept. of Defense, but clearly applies generally.

² A few exceptions to the minimum power may be approved “if the replication serves the broader interests of representativeness, coverage of domains, and completion of sufficient numbers of replications.”  As the TA1 group is keenly aware of the importance of high power, we treat this as negligible.