Wow! What a year it’s been! Did you know you completed 54,000 surveys and 42,000 trades on 3,000 studies? Plus the current COVID-19 Round 11? That’s stunning.
Second, many of you participated in Round 0, a small “meta” round asking what you expected before you saw the claims. We documented those group forecasts in our RSOS paper: back in August 2019, you expected replication rates to rise from about 40% in 2009 to about 60% in 2018. You expected economics to beat psychology and education. Collectively, those were your prior beliefs.
After seeing the claims, you were a little more optimistic. You thought the claims had a 50-60% replication chance, depending on how we aggregated. As we’ve seen in the past, markets were more extreme than survey means. And as expected, the SSR extremized the surveys. As we applied our pre-registered aggregations to both surveys and markets, correlations increased.
The most striking finding so far is how much your individual forecasts departed from your prior expectations. As blogger Alvaro de Menard observed, when we aggregate the claim-by-claim forecasts, you (a) were much more optimistic in general, but (b) saw no improvement over time. Also, you rated Education and Sociology much higher than you expected.
We’d bet heavily you’re at least 70% accurate. Not just because prior markets were about that accurate, but because we trained a classifier using only your forecasts and tested it on 400 other replications. It was 70% accurate.
The riskiest part of the project was relying on SSR for prizes – before results were known. But so far, surveys and markets are well-correlated, SSR seems to be making survey forecasts look more like market forecasts, and we are not seeing obvious failure signs: people who take more surveys still earn the same per survey, on average, as those who take fewer. Encouraging!
Market Starting Prices
Market starting prices were set by a simple rule using p-values that was nearly as accurate as the crowd on past replications. Here’s how that played out: after we filled in 635 missing p-values(!), it turned out that over 1/3 of the papers were weak, p≥.01, with a corresponding start price of 30%. Ouch!
I think we’ve used this clip before, but…
Closing
Thank you so much! You’ve been amazing – you are amazing! It’s been a privilege to work with you on SCORE Phase 1, and now that we’ve seen the Task3 teams present their initial machine learning work, we’re more excited than ever about the prospects for scaling up human + machine research reliability ratings. What to expect from here:
- We’re eagerly waiting to hear if we continue to Phase 2. Stay tuned!
- Most replication results should arrive by 30-Nov, the end of Phase 1.
- We will pay out Round prizes as Rounds complete.
- Follow us here for results and write-ups.
Charles Twardy, PhD, Principal Investigator, Replication Markets, Jacobs