Surrogate scoring rules (SSR) is the method we used to score forecasters for accuracy in our survey. We encourage all forecasters to watch our short video about the method. This document has three sections: (1) Overview, (2) Implementation, (3) Example.

SSR Overview

Surrogate scoring rules enable us to score forecasts’ accuracy using strictly proper scoring rules without relying on the events’ realized outcomes. Strictly proper scoring rules (SPSR) are standard measures of prediction accuracy. Given a prediction p and a realized event outcome Y, an SPSR assigns the prediction p a score S(p, Y). For example, consider a prediction p=0.6 about an event Y (e.g., “It rains tomorrow.”). If it rains tomorrow, we denote Y=1, and if it doesn’t rain tomorrow, we denote Y=0. Suppose that, after tomorrow, we know Y is 1. Under the SPSR, the forecaster then gets a score S(0.6, 1). A commonly used choice for the function S(p, Y) is the Brier score, S(p, Y)=(p-Y)². In the rain example, S(0.6, 1)=(0.6-1)²=0.16. If your prediction is perfect (p=Y), it gets a score of zero, meaning that it does not differ from the event outcome. If your prediction is the exact opposite (p=1-Y), it gets a score of one, meaning that it is as far from the event outcome as possible. The smaller the Brier score, the more accurate your prediction is.
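As a small illustration of the Brier score described above, the following Python sketch evaluates the rain example. The function name brier_score is chosen here for illustration only and is not part of the project's code.

```python
def brier_score(p: float, y: int) -> float:
    """Brier score S(p, Y) = (p - Y)^2 for a binary outcome; lower is better."""
    return (p - y) ** 2


# The rain example from the text: prediction p = 0.6.
print(round(brier_score(0.6, 1), 4))  # score if it rains (Y = 1)
print(round(brier_score(0.6, 0), 4))  # score if it does not rain (Y = 0)

# The two extremes mentioned in the text.
print(brier_score(1.0, 1))  # perfect prediction
print(brier_score(0.0, 1))  # exactly opposite prediction
```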


In our project, however, we do not know the outcomes of the claims at the time of scoring survey predictions. Therefore, we generate a surrogate event outcome for each claim and use this surrogate outcome together with SPSR to score the predictions for accuracy. Our method is hence called surrogate scoring rules. 


To generate a surrogate event outcome for a claim in order to score your prediction, we first take the mean of all forecasters’ predictions on that claim, excluding yours. We denote this mean as q. We generate the surrogate event outcome Y’=1 for that claim with probability q (which means Y’=0 with probability 1-q). This surrogate event outcome may not be the same as the true event outcome. We therefore model the relationship between the surrogate event outcome Y’ and the true event outcome Y using two error rate parameters, e0 and e1.

    e0 represents the probability of Y’=1 when Y=0.

    e1 represents the probability of Y’=0 when Y=1.

Then, we assign your prediction p a surrogate score S'(p, Y’), where

    S'(p, Y’) = ((1-e0)·S(p, 1) - e1·S(p, 0)) / (1-e0-e1)  if Y’=1,

and

    S'(p, Y’) = ((1-e1)·S(p, 0) - e0·S(p, 1)) / (1-e0-e1)  if Y’=0.
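The construction above can be sketched in Python as follows. This is a minimal sketch, not the project's implementation: the function names, the sample predictions, and the error rates e0=0.2 and e1=0.3 are illustrative assumptions, and the Brier score stands in for the base rule S(p, Y). The denominator is 1-e0-e1, which is the value consistent with the surrogate scores shown in the worked example later in this document.

```python
import random


def brier(p, y):
    """Base scoring rule S(p, Y); the Brier score is used here for illustration."""
    return (p - y) ** 2


def leave_one_out_mean(claim_predictions, i):
    """Mean q of all forecasters' predictions on one claim, excluding forecaster i."""
    others = [p for j, p in enumerate(claim_predictions) if j != i]
    return sum(others) / len(others)


def sample_surrogate_outcome(q, rng):
    """Draw the surrogate outcome: Y' = 1 with probability q, otherwise Y' = 0."""
    return 1 if rng.random() < q else 0


def surrogate_score(score, p, y_surrogate, e0, e1):
    """Surrogate score S'(p, Y') built from the base score S and error rates e0, e1."""
    denom = 1.0 - e0 - e1
    if denom == 0:
        raise ValueError("e0 + e1 must differ from 1")
    s1, s0 = score(p, 1), score(p, 0)
    if y_surrogate == 1:
        return ((1 - e0) * s1 - e1 * s0) / denom
    return ((1 - e1) * s0 - e0 * s1) / denom


# Hypothetical predictions of four forecasters on a single claim.
claim_predictions = [0.6, 0.7, 0.5, 0.8]
q = leave_one_out_mean(claim_predictions, 0)   # mean of the other three
rng = random.Random(0)                         # seeded only for repeatability
y_surrogate = sample_surrogate_outcome(q, rng)
print(q, y_surrogate, surrogate_score(brier, 0.6, y_surrogate, e0=0.2, e1=0.3))
```

When e0=e1=0, the surrogate score reduces to the base score S(p, Y’), which is a quick sanity check on the formulas.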

Our theorem shows that if we know the correct values of e0 and e1, then the mean surrogate score you get will be equal to the true mean SPSR score you get when you answer a sufficiently large number of claims. To obtain values for e0 and e1, we assume that e0 is the same for all claims in Phase I, and likewise e1. We then use the “method of moments” to estimate these values from the collected predictions. A detailed introduction to surrogate scoring rules can be found in the paper Surrogate Scoring Rules. In our project, we set S(p, Y) to be a rank-sum scoring rule. Unlike the Brier score, a higher rank-sum score is better: forecasts with higher rank-sum scores are considered to be more accurate.
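The property behind this statement can be checked numerically. The sketch below assumes that Y’ really is a noisy copy of Y with the stated error rates e0 and e1, and averages the surrogate score over Y’ for a fixed true outcome; the result should equal the base score S(p, Y). The function names and the values e0=0.2, e1=0.3 are illustrative assumptions, and the Brier score of p=0.6 is used as the base score.

```python
def surrogate_score(s1, s0, y_surrogate, e0, e1):
    """S'(p, Y') given the base scores s1 = S(p, 1) and s0 = S(p, 0)."""
    denom = 1 - e0 - e1
    if y_surrogate == 1:
        return ((1 - e0) * s1 - e1 * s0) / denom
    return ((1 - e1) * s0 - e0 * s1) / denom


def expected_surrogate_score(s1, s0, y_true, e0, e1):
    """E[S'(p, Y') | Y = y_true] under P(Y'=1 | Y=0) = e0 and P(Y'=0 | Y=1) = e1."""
    prob_surrogate_one = (1 - e1) if y_true == 1 else e0
    return (prob_surrogate_one * surrogate_score(s1, s0, 1, e0, e1)
            + (1 - prob_surrogate_one) * surrogate_score(s1, s0, 0, e0, e1))


# Brier scores of the prediction p = 0.6 from the overview.
s1, s0 = (0.6 - 1) ** 2, (0.6 - 0) ** 2
for y_true, true_score in ((1, s1), (0, s0)):
    recovered = expected_surrogate_score(s1, s0, y_true, e0=0.2, e1=0.3)
    # The last two printed values should agree for each true outcome.
    print(y_true, round(true_score, 6), round(recovered, 6))
```

This only verifies the algebra of the surrogate score formulas for known error rates; it says nothing about how well e0 and e1 are estimated in practice.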

Implementation of SSR in Phase I:

  1. We first remove predictions from users who completed fewer than 5 predictions in each round.
  2. For an arbitrary user i, we compute e0 and e1 using the method of moments, following Section 5 of the paper Surrogate Scoring Rules.
  3. Consider a single batch of claims and a user i who completed that batch. For each claim, we compute for user i the aforementioned mean prediction q from the predictions of other users on the claim. User i’s surrogate outcome for the claim is then treated as being drawn from a Bernoulli(q) distribution.
  4. For each claim in this batch, we compute S'(p, 1) and S'(p, 0) for user i according to the surrogate score formulas. In the formulas, S(p, Y) is set to be the rank-sum score of user i’s prediction p on this claim, which is calculated based on all predictions user i made within the batch. The computation of the rank-sum score for a single prediction can be found in Section 2.5 of the paper Linear scoring rules for probabilistic binary classification. User i’s score for this claim is the expected surrogate rank-sum score 𝔼_{Y’~Bernoulli(q)}[S'(p, Y’)] = q·S'(p, 1) + (1-q)·S'(p, 0).
  5. User i’s score for the batch is the total score (total expected surrogate rank-sum score) the user received on all claims in the batch. We denote this score for user i in the batch as SSR_i^{rank-sum}.
  6. For any other user j completing the batch, we compute SSR_j^{rank-sum} in the same way. We then rank all users in that batch according to their SSR^{rank-sum} scores, from the highest to the lowest, and award our prizes to the top users. Before ranking, we remove all users who have not completed all claims in the batch, so every ranked user has made predictions on all claims in the batch.
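Steps 3 to 6 can be sketched in Python as follows. This is a hedged sketch under several assumptions rather than the project's code: e0 and e1 are taken as already estimated, a missing prediction is represented by None, the rank-sum score of a prediction is its rank value if Y=1 and 0 if Y=0 (as described in the example section), and q is averaged over every other user who predicted on the claim. All function and variable names are hypothetical.

```python
def rank_values(preds):
    """Rank value of each prediction: (# strictly smaller) - (# strictly larger)."""
    return [sum(o < p for o in preds) - sum(o > p for o in preds) for p in preds]


def surrogate_pair(s1, s0, e0, e1):
    """Return (S'(p, 1), S'(p, 0)) from the base scores S(p, 1) and S(p, 0)."""
    denom = 1 - e0 - e1
    return ((1 - e0) * s1 - e1 * s0) / denom, ((1 - e1) * s0 - e0 * s1) / denom


def batch_ranking(predictions, e0, e1):
    """predictions maps user -> list with one prediction (or None) per claim.

    Returns (user, score) pairs sorted from the highest score to the lowest.
    """
    n_claims = len(next(iter(predictions.values())))
    scores = {}
    for user, preds in predictions.items():
        if any(p is None for p in preds):
            continue  # step 6: only users who completed every claim are ranked
        ranks = rank_values(preds)
        total = 0.0
        for c in range(n_claims):
            # Step 3: mean prediction q of the other users on this claim.
            others = [ps[c] for u, ps in predictions.items()
                      if u != user and ps[c] is not None]
            q = sum(others) / len(others)
            # Step 4: rank-sum score is the rank value if Y = 1 and 0 if Y = 0.
            s_one, s_zero = surrogate_pair(ranks[c], 0, e0, e1)
            total += q * s_one + (1 - q) * s_zero
        scores[user] = total  # step 5: total over all claims in the batch
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical batch of three claims; user "c" skipped one claim and is not ranked.
batch = {"a": [0.9, 0.2, 0.6], "b": [0.7, 0.4, 0.5], "c": [0.8, None, 0.3]}
print(batch_ranking(batch, e0=0.2, e1=0.3))
```

The sketch assumes at least one other user predicted on every claim; otherwise q is undefined and the division fails.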

To make this implementation more concrete, we provide an example showing how a user’s SSR_i^{rank-sum} is computed.

An Example:

Suppose there is a batch of 4 claims and 5 users with the following predictions. We demonstrate how the surrogate score of user 1 is computed in this batch.
          User 1   User 2   User 3   User 4   User 5
Claim 1   0.8      0.7      0.6      0.6      0.9
Claim 2   0.3      0.1      0.1      0.2      0.4
Claim 3   0.4      0.1      0.2      0.4      0.3
Claim 4   0.5      0.3      0.4      0.4      0.5
  1. Suppose we have already estimated the values of e0 and e1. Let’s say e0=0.2, e1=0.3.
  2. For user 1, the mean q of other users’ predictions on Claim 1 to Claim 4 is 0.7, 0.2, 0.25, 0.4 respectively. For example, for Claim 1, q=(0.7+0.6+0.6+0.9)/4=0.7.
  3. For user 1, we then need to compute the original rank-sum score S(p, Y) for both potential cases, Y=0 and Y=1, for each of her predictions. (The original rank-sum score for a user is the sum of the rank-sum scores of all her predictions given the ground truth of each prediction.) To compute the original rank-sum score S(p, Y) for a prediction, we first need to compute a rank value for each prediction from user 1. The rank value of a prediction p within the array of predictions made by a user is the number of predictions strictly smaller than p, minus the number of predictions strictly larger than p. For example, 0.8 has three smaller predictions and no larger ones, so its rank value is 3-0=3. So, predictions 0.8, 0.3, 0.4, 0.5 from user 1 get rank values 3, -3, -1, 1 respectively. Given the rank value of each prediction, the rank-sum score S(p, Y) for each of these predictions is the rank value if Y=1, and 0 if Y=0. User 1 therefore has the following original rank-sum score for each prediction she made.
          Y=1    Y=0
Claim 1    3      0
Claim 2   -3      0
Claim 3   -1      0
Claim 4    1      0
  4. According to the surrogate scoring rule formulas, user 1 gets the following surrogate score depending on the surrogate event outcome.
          Y’=1    Y’=0
Claim 1    4.8    -1.2
Claim 2   -4.8     1.2
Claim 3   -1.6     0.4
Claim 4    1.6    -0.4
  5. We then compute the expected surrogate score for user 1 as if Y’ ~ Bernoulli(q) for each claim, where q is the mean prediction for the claim calculated in Step 2. User 1 gets an expected surrogate score for the four claims as follows.
          𝔼_{Y’~Bernoulli(q)}[S'(p, Y’)]
Claim 1   0.7×4.8 + 0.3×(-1.2) = 3
Claim 2   0.2×(-4.8) + 0.8×1.2 = 0
Claim 3   0.25×(-1.6) + 0.75×0.4 = -0.1
Claim 4   0.4×1.6 + 0.6×(-0.4) = 0.4
  6. For this batch, the final surrogate rank-sum score for user 1 is the total expected surrogate score user 1 gets for the four claims. Thus, user 1’s score for the batch is 3+0+(-0.1)+0.4 = 3.3.
  7. We compute the surrogate rank-sum score for other users in the same way. Then, we rank users according to their final surrogate rank-sum scores. The higher a user’s score is, the more accurate that user’s forecasts are believed to be, based on our surrogate scoring method.
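The arithmetic of this example can be retraced with a short Python script. It is a sketch that follows the steps above for user 1 only, using the prediction table and the values e0=0.2 and e1=0.3 from the example; the variable names are chosen here for illustration.

```python
E0, E1 = 0.2, 0.3

# Rows are Claims 1 to 4; columns are Users 1 to 5 (the table from the example).
PREDICTIONS = [
    [0.8, 0.7, 0.6, 0.6, 0.9],
    [0.3, 0.1, 0.1, 0.2, 0.4],
    [0.4, 0.1, 0.2, 0.4, 0.3],
    [0.5, 0.3, 0.4, 0.4, 0.5],
]
USER = 0  # column index of user 1

own = [row[USER] for row in PREDICTIONS]

# Step 2: mean q of the other users' predictions on each claim.
q = []
for row in PREDICTIONS:
    others = [p for j, p in enumerate(row) if j != USER]
    q.append(sum(others) / len(others))

# Step 3: rank value = (# strictly smaller) - (# strictly larger) among own predictions.
ranks = [sum(o < p for o in own) - sum(o > p for o in own) for p in own]

# Step 4: surrogate scores; S(p, 1) is the rank value and S(p, 0) is 0.
denom = 1 - E0 - E1
s_surrogate_1 = [((1 - E0) * r - E1 * 0) / denom for r in ranks]
s_surrogate_0 = [((1 - E1) * 0 - E0 * r) / denom for r in ranks]

# Step 5: expected surrogate score per claim under Y' ~ Bernoulli(q).
expected = [qc * a + (1 - qc) * b for qc, a, b in zip(q, s_surrogate_1, s_surrogate_0)]

# Step 6: total over the batch.
total = sum(expected)

print([round(v, 4) for v in q])
print(ranks)
print([round(v, 4) for v in expected])
print(round(total, 4))  # 3.3
```

Changing USER to another column index gives the corresponding batch score for another user, which is what Step 7 needs before ranking.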