What do we mean by a high-quality replication?
In brief: A good-faith, high-power attempt to reproduce a previously-observed finding.
Slightly longer: A good-faith attempt to reproduce a previously-observed finding, with a sample large enough to find the effect if it is there.
We will presume you already know what a replication is. Our project will consider two kinds of replications:
Direct replications involve testing the original claim by gathering new data. This is the canonical case, and the market only asks about these. Surveys also ask about…
Data-analytic replications involve testing the original claim using a similar, pre-existing dataset, for example the same economic indicator, but 5 years later.
So… what is “high quality”? Ideal first, then actual.
An ideal high-quality study would be definitive: if there is an effect, it will be found, and if not, it won’t. It would have perfect control, high-powered tests, large sample size….
No single study can reach this ideal, but a set of replications run in different labs might come close. Multiple replications smooth out individual quirks or mistakes, and jointly they provide a sample large enough to have nearly 100% power to detect a real effect, even one notably smaller than the original paper claimed (as most are). That’s not practical for our project, but amazingly it has been done at smaller scale: see for example the Many Labs 2 study.
(For data-analytic replications, you would want different datasets more than different labs, though it’s clear there is a lab effect as well.)
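To make the "nearly 100% power" point concrete, here is a rough sketch of how power grows when several labs' samples are pooled. It is only an illustration under stated assumptions: the effect is a correlation tested with the Fisher z approximation, the true effect (r = 0.15) is half of a hypothetical claimed r = 0.30, each hypothetical lab recruits 100 participants, and differences between labs are ignored. None of these numbers come from the project.

```python
from math import atanh, sqrt
from statistics import NormalDist

NORMAL = NormalDist()


def power_to_detect(r_true, n, alpha=0.05):
    """Approximate power of a two-sided test of a correlation (Fisher z).

    Ignores the negligible chance of a significant result in the wrong direction.
    """
    z_crit = NORMAL.inv_cdf(1 - alpha / 2)
    return NORMAL.cdf(atanh(r_true) * sqrt(n - 3) - z_crit)


# Hypothetical numbers, for illustration only.
r_true = 0.15                                        # half of a claimed r = 0.30
single_lab = power_to_detect(r_true, n=100)          # one lab, 100 participants
pooled_labs = power_to_detect(r_true, n=20 * 100)    # 20 labs pooled

print(f"one lab: {single_lab:.2f}")
print(f"20 labs: {pooled_labs:.4f}")
```

Under these assumptions a single lab would miss the smaller-than-claimed effect more often than not, while the pooled sample would almost never miss it.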
Such a study would come close to measuring what we really want to know: can we trust the claim? SCORE’s program description says:
This ideal definition abstracts away from the limits of any particular replication, which itself may get an unlucky draw. Therefore, DARPA suggested forecasters consider this ideal scenario:
If you’re now trying to specify sampling distributions for studies, stop. This is the ideal. The point is to imagine sufficient effort that everyone would agree the result is a reliable Confidence Score. If all our forecasters kept this frame in mind, and forecast well, we would get the desired Confidence Scores. In fact, if we’ve idealized properly, it should be a very good estimate of how much they believe the claim is true.
But full disclosure: our accuracy isn’t based on the ideal, because ain’t nobody got time to run 100 replications each on 150+ claims. (At only $10K apiece, that would be $150M!) It’s based on one high-quality replication. The forecast may be similar, but it’s unlikely to be the same.
So what are you really forecasting? What is “high quality” really?
The Center for Open Science will run a single high-quality replication of each selected claim. We assume good faith and competence, so “high quality” amounts to “high power”. The power must be high enough that a failed replication is informative about the truth of the claim. Simonsohn (2015) argues for 2.5× the original sample size, a good rule but one that is sometimes wasteful or practically unreachable. Therefore the replicating team makes three estimates of the replication sample size and aims for the middle one:
- Minimum power = smallest of 95% power to detect (p < .05) the original effect size, 80% safeguard power for 80% power, and 2.5 × N
- Target power = middle of 95% power to detect (p < .05) the original effect size, 80% safeguard power for 80% power, and 2.5 × N
- Aspirational power = largest of 95% power to detect (p < .05) the original effect size, 80% safeguard power for 80% power, and 2.5 × N
All studies will be conducted to one of these high, but imperfect, standards, aiming for at least Target Power.² (For data-analytic replications, the guidelines will inform the selection of the dataset.)
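The three-way rule can be sketched in a few lines. This is not the replicating team's actual procedure, only a minimal illustration that assumes the original effect is a correlation r from a sample of size N, uses the Fisher z approximation for power, and reads "80% safeguard" as the one-sided 80% lower confidence bound on r. All function names are hypothetical.

```python
from math import atanh, ceil, inf, sqrt, tanh
from statistics import NormalDist

NORMAL = NormalDist()


def n_for_power(r, power, alpha=0.05):
    """Approximate sample size to detect correlation r (two-sided, Fisher z)."""
    if r <= 0:
        return inf  # no finite sample can be planned around a non-positive effect
    z_total = NORMAL.inv_cdf(1 - alpha / 2) + NORMAL.inv_cdf(power)
    return ceil((z_total / atanh(r)) ** 2 + 3)


def safeguard_effect(r, n, level=0.80):
    """One-sided lower confidence bound on r (assumed reading of 'safeguard')."""
    return tanh(atanh(r) - NORMAL.inv_cdf(level) / sqrt(n - 3))


def replication_sample_sizes(r_orig, n_orig):
    """Sort the three candidate sample sizes into minimum/target/aspirational."""
    candidates = sorted([
        n_for_power(r_orig, power=0.95),                            # high power
        n_for_power(safeguard_effect(r_orig, n_orig), power=0.80),  # safeguard
        ceil(2.5 * n_orig),                                         # 2.5 x N
    ])
    return dict(zip(("minimum", "target", "aspirational"), candidates))


# Hypothetical original study: r = 0.30 observed with N = 100.
print(replication_sample_sizes(r_orig=0.30, n_orig=100))
```

Which rule ends up as the minimum, target, or aspirational value depends on the original study: the ordering is decided per claim, not fixed in advance.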
Should this affect your forecast? How? See “What do I forecast?”
Extra: Where do the power estimates come from?
- High-power (Open Science Collaboration, 2015) uses 95% power to detect the original effect size. This is the most straightforward approach, but it presumes that the original effect size estimate is credible.
- Safeguard power (Perugini et al., 2014) bases power estimates on the lower bound of the precision of the original estimate. This is a useful strategy for precisely measured original effects while still taking into account possible inflation of published effect sizes. It is a less productive strategy (i.e., it demands an extremely large sample) for imprecisely estimated effects, particularly those with a lower bound of the confidence interval close to .05.
- Small telescopes (Simonsohn, 2015) bases power estimates on the original sample size, providing a very simple rule of using 2.5× the original sample. This complements safeguard power in that it is (relatively) practical for highly imprecisely estimated effects with a lower bound of the confidence interval close to .05. However, it is less productive (and counterproductive for resource management) for original studies that were precisely estimated and highly significant. It is also not applicable to nested sampling, such as the multi-level models common in some areas of social-behavioral research (e.g., education).
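The trade-off between safeguard power and small telescopes can be seen with two made-up original studies that report the same effect with very different precision. As a rough sketch only, it assumes a correlation effect size, the Fisher z approximation, and a one-sided 80% lower bound as the safeguard effect. The helper names and the example studies are hypothetical.

```python
from math import atanh, ceil, inf, sqrt
from statistics import NormalDist

NORMAL = NormalDist()


def safeguard_n(r_orig, n_orig, level=0.80, power=0.80, alpha=0.05):
    """Sample size for `power` at the one-sided `level` lower bound of r_orig."""
    # Work on the Fisher z scale: shrink the estimate by its sampling error.
    z_lower = atanh(r_orig) - NORMAL.inv_cdf(level) / sqrt(n_orig - 3)
    if z_lower <= 0:
        return inf  # lower bound is not positive: no finite sample suffices
    z_total = NORMAL.inv_cdf(1 - alpha / 2) + NORMAL.inv_cdf(power)
    return ceil((z_total / z_lower) ** 2 + 3)


def small_telescopes_n(n_orig):
    """Simonsohn's rule of thumb: 2.5 times the original sample size."""
    return ceil(2.5 * n_orig)


# Two hypothetical originals with the same r = 0.30 but different precision.
for label, n_orig in (("imprecise original", 45), ("precise original", 1000)):
    print(f"{label} (N = {n_orig}): "
          f"safeguard -> {safeguard_n(0.30, n_orig)}, "
          f"2.5 x N -> {small_telescopes_n(n_orig)}")
```

For the small, barely significant original, the safeguard rule asks for far more participants than 2.5 × N. For the large, precise original, the ordering flips and 2.5 × N becomes the expensive option.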
¹ It says “DoD consumer” because DARPA is in the US Dept. of Defense, but it clearly applies generally.
² A few exceptions to the minimum power may be approved “if the replication serves the broader interests of representativeness, coverage of domains, and completion of sufficient numbers of replications.” As the TA1 group is keenly aware of the importance of high power, we treat this as negligible.