# What do we mean by high quality replication?

[Updated Jan-2020 for the new unified definition starting in Feb-2020 (Round 6).]

*In brief:* A good-faith, high-powered attempt to reproduce a previously observed finding.

**Slightly longer**: A good-faith attempt to reproduce a previously observed finding, with a sample large enough to find the effect if it is there.

We will presume you already know what a replication is. [Feb. 2020: see that page for the new unified replication definition and the legacy direct/data distinction.]

So… what is “*high quality*”? Ideal first, then actual.

### Ideally

An *ideal* high-quality study would be definitive: if there is an effect, it will be found, and if not, it won’t. It would have perfect control, high-powered tests, large sample size….

No single study can reach this ideal, but if we ran a bunch of replications, all in different labs, we might come close. Multiple replications smooth out individual quirks or mistakes, and jointly they provide a sample large enough for nearly 100% power to detect a real effect – even one notably smaller than the original paper claimed, as most are. That’s not practical for our project, but amazingly it *has* been done at smaller scale — see for example the Many Labs 2 study.

(For data-analytic replications, you would want different *datasets* more than different labs – though it’s clear there is a lab effect as well.)

Such a study would come close to measuring what we *really* want to know: can we trust the claim? SCORE’s program description says:

> Confidence scores are quantitative measures that should enable a ... consumer of SBS [Social and Behavioral Science] research to understand the degree to which a particular claim or result is likely to be reproducible or replicable.¹

This ideal definition abstracts away from the limits of any particular replication, which itself may get an unlucky draw. Therefore, DARPA suggested forecasters consider this ideal scenario:

> Assume 100 replications of this study were performed, and a weighted average of their results (i.e., a meta-analysis) was calculated. What is the probability that this average effect will be in the same direction as the original claim? For example, if the original result was that increasing X is related to increasing Y, would the average replication result show the same type of relationship (as opposed to either no relationship or a relationship in the opposite direction)?

If you’re now trying to specify sampling distributions for studies, stop. This is the ideal. The point is to imagine sufficient effort that everyone would agree the result is a reliable Confidence Score. If all our forecasters kept this frame in mind, and forecast well, we would get the desired Confidence Scores. In fact, if we’ve idealized properly, it should be a very good estimate of how much they believe the claim is *true*.
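This thought experiment can be checked with a quick simulation. The sketch below (Python, with entirely hypothetical numbers for the true effect and the per-study sample size) draws effect estimates for 100 replications per meta-analysis and checks how often their average lands in the original direction:

```python
# Monte Carlo sketch of the "100 replications" thought experiment.
# The true effect and sample size below are hypothetical illustrations.
import numpy as np

rng = np.random.default_rng(0)
true_d = 0.2                     # assumed true standardized effect (smaller than claimed)
n_per_group = 50                 # assumed per-group sample size in each replication
se = np.sqrt(2 / n_per_group)    # s.e. of Cohen's d (normal approximation)

n_sims = 10_000
# Each row: one meta-analysis averaging the estimates from 100 replications.
avg_effects = rng.normal(true_d, se, size=(n_sims, 100)).mean(axis=1)
print((avg_effects > 0).mean())  # fraction of meta-analyses in the original direction
```

For these assumed numbers the printed fraction is essentially 1.0: averaging 100 replications shrinks the standard error so much that the direction of a real effect is almost never missed.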

But full disclosure: our accuracy isn’t based on the ideal, because *ain’t nobody got time to run 100 replications each on 150+ claims*. (At only $10K apiece, that would be $150M!) It’s based on one high-quality replication. The forecast may be similar, but it’s unlikely to be the same.

So what are you really forecasting? What is “high quality” really?

### Actually

The Center for Open Science will run a single high-quality replication of each selected claim. We assume good faith and competence, so “high quality” amounts to “high power”: power high enough that a failed replication is informative about the truth of the claim. Simonsohn (2015) argues for 2.5× the original sample size – a good rule, but sometimes wasteful or practically unreachable.

[Teal and ~~strikethrough~~ added 2020-01-03] For SCORE, statistical power planning will use a two-stage approach. Stage 1 is an initial study with 90% power to detect 75% of the original effect size at the 5% level (two-sided). If that effect is not significant in the original direction, a second data collection (Stage 2) will follow, sized so that the pooled two-stage data have 90% power to detect 50% of the original effect size at the 5% level (two-sided). Second-stage data collections are contingent on replication-lab feasibility constraints (e.g., money, time, participants); infeasible cases will be tracked and reported. ~~Therefore the replicating team makes three estimates of replication sample size, and aims for the middle:~~

* ~~**Minimum power** = **smallest** of: 95% power to detect (*p* < .05) the original effect size; 80% safeguard power for 80% power; and 2.5 × N.~~
* ~~**Target power** = **middle** of: 95% power to detect (*p* < .05) the original effect size; 80% safeguard power for 80% power; and 2.5 × N.~~
* ~~**Aspirational power** = **largest** of: 95% power to detect (*p* < .05) the original effect size; 80% safeguard power for 80% power; and 2.5 × N.~~

~~All studies will be conducted to one of these high, but imperfect, standards, aiming for at least Target Power.² (For data analytic replications, the guidelines will inform the selection of dataset.)~~
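For concreteness, the two-stage plan can be sketched numerically. The snippet below uses a standard normal-approximation sample-size formula for a two-sided, two-sample comparison; both that simplification and the assumed original effect size (Cohen’s *d* = 0.5) are illustrative, not SCORE’s actual procedure:

```python
# Sketch of the two-stage power plan under illustrative assumptions:
# a two-sample design and a hypothetical original effect size.
import math
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.90):
    """Per-group n for a two-sided two-sample test (normal approximation)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil(2 * (z / d) ** 2)

d_orig = 0.5                           # assumed original effect size (Cohen's d)
stage1 = n_per_group(0.75 * d_orig)    # Stage 1: 90% power at 75% of d_orig
pooled = n_per_group(0.50 * d_orig)    # pooled target: 90% power at 50% of d_orig
stage2 = pooled - stage1               # extra collection only if Stage 1 misses

print(stage1, pooled, stage2)          # → 150 337 187 for these assumptions
```

Note how halving the detectable effect size more than doubles the pooled sample: required *n* scales with 1/*d*², which is why Stage 2 is triggered only when Stage 1 fails.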

Should this affect your forecast? How? See What do I forecast?

### Extra: Where do the power estimates come from?

* **High power** (Open Science Collaboration, 2015) uses 95% power to detect the original effect size. This is the most straightforward approach, but it presumes that the original effect-size estimate is credible.
* **Safeguard power** (Perugini et al., 2014) bases the power estimate on the lower bound of the confidence interval of the original estimate. This is a useful strategy for precisely measured original effects, while still accounting for possible inflation of published effect sizes. It is less productive (i.e., demands extreme sample sizes) for imprecisely estimated effects, particularly those with a confidence-interval lower bound close to .05.
* **Small telescopes** (Simonsohn, 2015) bases the power estimate on the original sample size, giving the very simple rule of 2.5× the original sample. This complements safeguard power: it is (relatively) practical for imprecisely estimated effects with a confidence-interval lower bound close to .05, but less productive – and counterproductive for resource management – for original studies that were precisely estimated and highly significant. It is also not applicable to nested sampling, such as the multi-level models common in some areas of social-behavioral research (e.g., education).
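To see how the three rules diverge, the sketch below computes each recommended per-group sample size for a single hypothetical original study (per-group *n* = 40, *d* = 0.6). The normal-approximation formula and the particular confidence-bound convention used for safeguard power are illustrative assumptions, not the exact published procedures:

```python
# Illustrative comparison of the three power rules for one hypothetical study.
import math
from scipy.stats import norm

def n_for_power(d, alpha=0.05, power=0.95):
    """Per-group n for a two-sided two-sample test (normal approximation)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return math.ceil(2 * (z / d) ** 2)

n_orig, d_orig = 40, 0.6              # assumed original study (per-group n, Cohen's d)

# High power: 95% power at the original effect size.
high_power = n_for_power(d_orig, power=0.95)

# Safeguard power: 80% power at a lower confidence bound of d
# (here the lower limit of a two-sided 80% CI -- an illustrative convention).
se_d = math.sqrt(2 / n_orig)
d_safe = d_orig - norm.ppf(0.90) * se_d
safeguard = n_for_power(d_safe, power=0.80)

# Small telescopes: simply 2.5x the original sample.
small_telescopes = math.ceil(2.5 * n_orig)

print(high_power, small_telescopes, safeguard)   # → 73 100 160 here
```

For this precisely estimated, highly significant original study, the high-power rule is cheapest and safeguard power is most demanding; for a noisier original the ordering can reverse, which is why the three estimates are computed and compared.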

¹ It says “DoD consumer” because DARPA is in the US Dept. of Defense, but clearly applies generally.

² A few exceptions to the minimum power may be approved “if the replication serves the broader interests of representativeness, coverage of domains, and completion of sufficient numbers of replications.” As the TA1 group is keenly aware of the importance of high power, we treat this as negligible.
