As you may have heard, I recently came into possession of previously undisclosed material for a 2013 paper by John Cook and other members of Skeptical Science. The paper claimed to find a 97% consensus on global warming.
That number was reached by having a group of people read the abstracts (summaries) of ~12,000 scientific papers and say whether each endorsed or rejected the consensus. Each abstract was rated twice, and some had a third rater come in as a tie-break. The total number of ratings was 26,848, done by 24 people. Twelve of them, combined, contributed only 873 ratings. That means the other 12 people did approximately 26,000 ratings.
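For concreteness, that last figure is just simple arithmetic on the numbers quoted above (a minimal sketch in Python):

```python
# Figures quoted above from the Cook et al. rating data
total_ratings = 26848    # all first, second, and tie-break ratings combined
least_active_12 = 873    # ratings contributed by the 12 least active raters

# Ratings done by the 12 most active raters
most_active_12 = total_ratings - least_active_12
print(most_active_12)    # 25975, i.e. roughly 26,000
```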
Cook et al. have only discussed results related to the ~27,000 ratings. They have never discussed results broken down by individual raters. They have, in fact, refused to share the data which would allow such a discussion to take place. This is troubling. Biases in individual raters are always a problem when having people analyze text.
Biases can arise because of differences in worldviews, differences in how people understand the rating system, or any number of other things. These biases don’t mean the raters are bad people or even bad raters. It just means their ratings represent different things. If you take no steps to address that, your ratings can wind up looking like this:
This image shows the ratings broken down by individual rater for the Cook et al. paper. The columns go from zero to seven. Zero meant no rating was given. The other values were defined as:
1 Explicitly endorses and quantifies AGW as 50+%
2 Explicitly endorses but does not quantify or minimise
3 Implicitly endorses AGW without minimising it
4 No Position
5 Implicitly minimizes/rejects AGW
6 Explicitly minimizes/rejects AGW but does not quantify
7 Explicitly minimizes/rejects AGW as less than 50%
The circles in each column are colored according to rater. Their size indicates the number of times the rater selected that endorsement level. Their position on the y-axis represents the percentage of ratings by that rater which fell on that level.
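To make the construction of that figure concrete, here is a minimal sketch of how such a per-rater breakdown could be computed and plotted in Python. The file name and column names (ratings.csv, rater_id, endorsement_level) are hypothetical stand-ins; the actual data uses its own labels.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical layout: one row per rating, with a rater id and an
# endorsement level from 0 (no rating given) to 7.
ratings = pd.read_csv("ratings.csv")  # columns: rater_id, endorsement_level

# How many times each rater chose each level...
counts = ratings.groupby("rater_id")["endorsement_level"].value_counts()
# ...and what percentage of that rater's own ratings each level represents.
percents = counts / counts.groupby(level="rater_id").transform("sum") * 100

# One set of circles per rater: x = endorsement level, y = percentage of
# the rater's ratings, marker size = raw count of ratings at that level.
for rater in ratings["rater_id"].unique():
    plt.scatter(counts.loc[rater].index, percents.loc[rater],
                s=counts.loc[rater], alpha=0.6, label=str(rater))

plt.xlabel("Endorsement level")
plt.ylabel("Percent of rater's ratings")
plt.show()
```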
As you can see, these circles do not line up. Some circles are higher than others, meaning those raters were more likely to pick that particular value. Some circles are lower than others, meaning those raters were less likely to pick that particular value. That shows the raters were biased. If they weren’t, the circles would have lined up.
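One standard way to put a number on “the circles do not line up” is a chi-square test of independence between rater and rating level: if raters were interchangeable, the distribution of ratings would not depend on who did them. A minimal sketch, reusing the hypothetical ratings table from above:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Contingency table: rows are raters, columns are endorsement levels,
# cells count how many times each rater chose each level.
table = pd.crosstab(ratings["rater_id"], ratings["endorsement_level"])

# Null hypothesis: the rating distribution does not depend on which
# rater handled the abstract.
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.3g}")
```

A small p-value would indicate the raters’ distributions differ by more than chance alone would explain.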
Now then, the authors of the paper did take a step to try to address this issue. When two raters gave different ratings to the same abstract, they were given the opportunity to discuss the disagreement and modify their ratings. This reduced the biases present in the ratings, making the data look like this:
As you can see, the post-reconciliation data has no zero ratings. It also has fewer biases. Fewer is not none, however. The problem of bias still clearly exists. That problem will necessarily affect the study’s results. The biases of raters whose circles are largest will necessarily influence the results more than those of raters whose circles are smaller.
To see why this is a problem, remember each circle’s size depends largely on how active a rater was. Had different raters been more active, the larger circles would have been in different locations. That means the combined result would have been in a different location as well.
To demonstrate, I’ve created a simple image. Its layout is the same as the last figure, but it shows the data for the 12 most active raters combined (yellow). It also shows what the combined result would have been if the activity of those 12 raters had been reversed (red):
Even this simple test produces readily identifiable differences. That shows the bias in raters affects the final results. It’s true this particular test resulted in differences favoring the Cook et al. results, but that doesn’t make it acceptable. Bias influencing results isn’t acceptable, and a different test could have produced a different pattern.
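For readers who want to try that kind of test themselves, here is a minimal sketch of the weight-reversal idea, again using the hypothetical ratings table from above. It keeps each rater’s own distribution fixed and only swaps how much weight (activity) each of the 12 most active raters carries:

```python
import pandas as pd

# Each rater's distribution: what fraction of their ratings fell on
# each endorsement level (rows = raters, columns = levels).
props = pd.crosstab(ratings["rater_id"], ratings["endorsement_level"],
                    normalize="index")

# How many ratings each rater actually did, most active first.
activity = ratings["rater_id"].value_counts()
top12 = activity.head(12)

# Observed combined distribution for the 12 most active raters:
# each rater's distribution weighted by how many ratings they did.
observed = props.loc[top12.index].mul(top12, axis=0).sum() / top12.sum()

# Counterfactual: reverse the activity among those same 12 raters (the
# most active rater gets the least active one's workload, and so on),
# keeping each rater's own distribution unchanged.
reversed_weights = pd.Series(top12.values[::-1], index=top12.index)
counterfactual = (props.loc[top12.index].mul(reversed_weights, axis=0).sum()
                  / top12.sum())

print(pd.DataFrame({"observed %": observed * 100,
                    "reversed activity %": counterfactual * 100}))
```

If the two columns printed at the end differ noticeably, the combined result depends on which raters happened to do the most work.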
Regardless, we now know the results of the Cook et al. paper are influenced by the raters’ individual biases. That’s a problem in and of itself, but it raises a larger question. All the people involved in this study belong to the same group (Skeptical Science). All of these people know each other, talk to one another and have similar overall views related to global warming.
If biases within such a homogeneous group can influence their results, what would the results have been if a different group had done the ratings? How would we know which results are right?
Update: It’s worth pointing out the paper explicitly said, “Each abstract was categorized by two independent, anonymized raters.” That would have mitigated concerns of bias if true. However, it’s difficult to see how a small group of friends can be considered “independent” of one another. That’s especially true when the group actively talked to one another (on a forum run by the lead author), even about how to rate specific papers, while the “independent” ratings were going on. This issue was first noted here, and it’s highly relevant when considering issues of bias.