What a Year at Replication Markets!

Sep 29 2020

Dear Forecasters:

Wow! What a year it’s been! Did you know you completed 54,000 surveys and 42,000 trades on 3,000 studies? Plus the current COVID-19 Round 11? That’s stunning.

Replication Markets Recap: The View From 3,000 Claims

The forecasts are in. The results are not. But there’s quite a bit we can say – if you don’t mind preliminary results.

First, you are a professional bunch. Not only did 2/3 of forecasts come from academia (and 1/3 from PhDs), but over time you streamlined from 642 mildly active forecasters to 70ish highly active “pro” forecasters. As the number of forecasters dropped, overall activity increased!

Round 0

Second, many of you participated in Round 0, a small “meta” round asking what you expected, before you saw the claims. We documented those group forecasts in our RSOS paper: back in August 2019, you expected replication rates to rise from about 40% in 2009 to about 60% in 2018. You expected economics to beat psychology and education. Collectively, those were your prior beliefs.

Rounds 1-10

After seeing the claims, you were a little more optimistic. You thought the claims had a 50-60% replication chance, depending on how we aggregated. As we’ve seen in the past, markets were more extreme than survey means. And as expected, the SSR extremized the surveys. As we applied our pre-registered aggregations to both surveys and markets, correlations increased.

Round 0 vs. Rounds 1-10

The most striking finding so far is how much your individual forecasts departed from your prior expectations. As blogger Alvaro de Menard observed, when we aggregate the claim-by-claim forecasts, you (a) were much more optimistic in general, but (b) saw no improvement over time. Also, you rated Education and Sociology much higher than you had expected.

We Bet You Did Well

We’d bet heavily you’re at least 70% accurate. Not just because prior markets were about that accurate, but also because we trained a classifier using only your forecasts, and tested it on 400 other replications. It was 70% accurate.

By combining human and machine methods, it looks like we can near 80% accuracy – but we’re starting to suspect a ceiling set by replication noise: actual replication power sets a hard limit on expected accuracy. Most past replications were designed for 90+% power, but that’s assuming no systemic bias. The world is rarely so kind.
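To make the ceiling concrete, here is a minimal sketch, not our actual analysis. It assumes a simple two-outcome model: a claim is either true or false, a true claim replicates with probability equal to the replication's power, and a false claim "replicates" at a false-positive rate. The function name, the 50% base rate, and the 5% false-positive rate are all illustrative assumptions.

```python
def accuracy_ceiling(base_rate, power, false_positive_rate=0.05):
    """Best expected accuracy for a forecaster who knows which claims are true.

    Illustrative model (an assumption, not the project's method):
      - base_rate: fraction of claims that are actually true
      - power: chance a true claim replicates
      - false_positive_rate: chance a false claim replicates anyway
    """
    for name, value in (("base_rate", base_rate), ("power", power),
                        ("false_positive_rate", false_positive_rate)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")

    # Even an all-knowing forecaster must pick one outcome per claim,
    # so they pick the more likely one and are right that often.
    hit_if_true = max(power, 1.0 - power)
    hit_if_false = max(1.0 - false_positive_rate, false_positive_rate)
    return base_rate * hit_if_true + (1.0 - base_rate) * hit_if_false


if __name__ == "__main__":
    # Hypothetical inputs: half the claims true, varying effective power.
    for power in (0.9, 0.7):
        print(f"power={power:.1f} -> ceiling={accuracy_ceiling(0.5, power):.3f}")
```

Under these made-up inputs, the ceiling is about 92.5% at the designed 90% power and drops to about 82.5% if effective power is really 70%. So noisier replications lower the best achievable score no matter how good the forecasts are.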

SSR Sanity Checks

The riskiest part of the project was relying on SSR for prizes – before results were known. But so far, surveys and markets are well-correlated, SSR seems to be making survey forecasts look more like market forecasts, and we are not seeing obvious failure signs: people who take more surveys still earn the same per survey, on average, as those who take fewer. Encouraging!

Market Starting Prices

Market starting prices were set by a simple rule using p-values that was nearly as accurate as the crowd on past replications. Here’s how that played out: after we filled in 635 missing p-values(!), it turned out over 1/3 of the papers were weak, p≥.01, with a corresponding start price of 30%. Ouch!
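For the curious, a rule like this can be sketched as a small lookup. Only the weakest tier (p≥.01 gives a 30% start price) comes from the numbers above. The function name, the two stronger tiers, and their prices are hypothetical placeholders chosen for illustration, not the project's documented values.

```python
def starting_price(p_value, tiers=((0.001, 0.80), (0.01, 0.40)), weak_price=0.30):
    """Map a claim's p-value to a market starting price.

    weak_price=0.30 for p >= .01 is taken from the text.
    The `tiers` thresholds and prices are HYPOTHETICAL placeholders.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError(f"p_value must be between 0 and 1, got {p_value}")

    # Tiers are ordered from strongest evidence to weakest;
    # the first threshold the p-value falls under sets the price.
    for threshold, price in tiers:
        if p_value < threshold:
            return price
    return weak_price


if __name__ == "__main__":
    # Made-up p-values, purely to show the mapping.
    for p in (0.0004, 0.004, 0.03):
        print(f"p={p} -> start price {starting_price(p):.0%}")
```

A step rule like this is easy to apply to thousands of claims, which matters when hundreds of p-values have to be filled in by hand first.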

I think we’ve used this clip before, but…

Closing

Thank you so much! You’ve been amazing – you are amazing! It’s been a privilege to work with you on SCORE Phase 1, and now that we’ve seen the Task3 teams present their initial machine learning work, we’re more excited than ever about the prospects for scaling up human + machine research reliability ratings. What to expect from here:

We’re eagerly waiting to hear if we continue to Phase 2. Stay tuned!
Most replication results should arrive by 30-Nov, the end of Phase 1.
We will pay out Round prizes as Rounds complete.
Follow us here for results and write-ups.

Non-SCORE work: We hope to launch a COVID side-project (non-SCORE) in mid-October, with $14K in prizes. Please sign up below for our distribution list to be notified of this and other upcoming work! And consider following our individual researchers and other projects, including: