I made damning criticisms of the Skeptical Science consensus paper within days of it coming out. I’ve never received much response from them. I suspect it’s because every time they respond to criticisms, they make themselves look even worse.
John Cook and his associates made this important claim in their paper:
Each abstract was categorized by two independent, anonymized raters.
Using independent raters is a way of combating potential biases. As an obvious example, people who support a position may be more inclined to think papers support the same position. The best way to combat that is to get many people with many different worldviews. Cook and associates did it by using “independent” raters.
The problem is that their use of “independent” bears no resemblance to how any sensible person would use the word. All of the paper’s raters were active participants at Skeptical Science. They routinely talked to one another on the Skeptical Science forum, run by the lead author of the paper. They even had a page in which many of them were labeled as part of “The Skeptical Science Team” for years before the project began.
No sensible person would describe themselves as part of a team of ~20 people and then claim to be independent of the other team members.
Beyond that, the raters actively discussed ratings and how to interpret the rules with one another. The project’s forum had posts with titles like, “how to rate: Cool Dudes: The Denial Of Climate Change…” This led one of the authors, Sarah Green, to say:
But, this is clearly not an independent poll, nor really a statistical exercise. We are just assisting in the effort to apply defined criteria to the abstracts with the goal of classifying them as objectively as possible.
Disagreements arise because neither the criteria nor the abstracts can be 100% precise. We have already gone down the path of trying to reach a consensus through the discussions of particular cases. From the start we would never be able to claim that ratings were done by independent, unbiased, or random people anyhow.
The authors of the paper admit that they discussed specific ratings while performing them, but they claim the above quote is “out of context” and say:
Discussion of the methodology of categorising abstract text formed part of the training period in the initial stages of the rating period. When presented to raters, abstracts were selected at random from a sample size of 12,464. Hence for all practical purposes, each rating session was independent from other rating sessions. While a few example abstracts were discussed for the purposes of rater training and clarification of category parameters, the ratings and raters were otherwise independent.
They provide no explanation as to how context could change the meaning of Sarah Green’s comment. They provide no context that does change the meaning. All they do is acknowledge that they discussed specific ratings and defend it by saying those discussions were “for the purposes of rater training.” The evidence shows that is false.
One topic’s title was, “second opinion??” In no way does that imply training is involved. The same is true of the topic creator’s post, which merely cites a paper’s title and summary while asking:
A poor translation, but I think it’s saying we’re lucky to have AGW because it offsets dangerous global cooling; except where it says humans have caused cooling since 1950. So, does it support or reject AGW?
There is no desire for training there. It is simply one rater asking other raters for their opinion about which rating to pick. The same is true of another topic, where a rater simply asks:
True in my experience, but how do I rate it??
These discussions are clearly not for training purposes.
As for the claim that discussions of rater guidelines during the rating process “formed part of the training period in the initial stages of the rating period,” that is contradicted by the fact that their “Official TCP Guidelines” topic saw active discussion of how to interpret the rules up until March 15th.
John Cook made a graph on March 15, showing how many ratings the top raters had done:
It shows over 15,000 ratings had been performed during what the authors now call the “initial stages of the rating period.” There were fewer than 30,000 total ratings. That means the authors are defending their active discussion of how to interpret the rating guidelines by claiming the “initial stages of the rating period” covered more than half of the ratings they did.
It’s no wonder they didn’t respond more than a year ago when I called them out on their supposed independence. At least, not publicly. We can see they discussed the post in private (taken from the list of links they posted in their forum):