If you don’t agree with me, you’re stupid.
Not really. Even if I’m right, that doesn’t mean you have to agree with me. I could, theoretically, be wrong. And even if I’m not, you’d only be a fool to actively disagree with me. You wouldn’t be a fool to be ambivalent, though. Not taking a position is different from taking a position opposite mine.
Why do I point out the obvious? It’s not because I like saying I’m right. I’m too humble for that. I just want to make it easy to understand how some critics of John Cook’s paper examining the “consensus” on global warming went way off base. The criticism first came up here:
The implication of such error, i.e. of inter-observer agreement and reliability, can be calculated. In the Cook group data, kappa is 0.08 (p <<< 0.05). The Cook rating method is essentially completely unreliable. The paper authors’ ratings matched Cook’s for only 38% of abstracts. A kappa score of 0.8 is considered ‘excellent’; score less than 0.2 indicates worthless output.
In reality, the author ratings are the weakest link: they invalidate the conclusions of the paper. It is evident the reviewers have not looked at the data themselves: they would have seen through the trickery employed.
The issue is simple. Cook and his associates rated summaries (abstracts) of many papers, attempting to determine how many endorsed a particular “consensus.” Authors of some of those papers did the same, only with the whole papers (instead of summaries). The results agreed only 38% of the time.
That sounds bad. According to the quote above, it “invalidate[s] the conclusions of the paper.” Only, it doesn’t. A lack of agreement ~62% of the time tells us nothing about how much actual disagreement there is, that is, how often the two ratings took opposite positions.
The quoted post uses this example:
A bird reserve hires a fresh enthusiast and puts him to do a census. The amateur knows there are 3 kinds of birds in the park. He accompanies an experienced watcher. The watcher counts 6 magpies, 4 ravens and 2 starlings. The new hire gets 6 magpies, 3 ravens and 3 starlings. Great job, right?
No, and here’s how. The new person was not good at identification. He mistook every bird for everything else. He got his total the same as the expert but by chance.
If one looks just at aggregates, one can be fooled into thinking the agreement between birders to be an impressive 92%. In truth, the match is abysmal: 25%. Interestingly this won’t come out unless the raw data is examined.
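For what it’s worth, the arithmetic behind those two figures is easy to reproduce. Below is a minimal Python sketch; since the quoted post gives only per-species totals, the bird-by-bird confusion matrix is an assumed assignment consistent with those totals, and the “aggregate” figure is taken as the overlap of the two birders’ per-species counts.

```python
# Illustrative only: the quoted post gives per-species totals, not bird-by-bird
# data, so this confusion matrix is one assumed assignment consistent with those
# totals (expert: 6 magpies, 4 ravens, 2 starlings; amateur: 6/3/3).
# Rows = expert's identification, columns = amateur's identification.
confusion = {
    "magpie":   {"magpie": 2, "raven": 2, "starling": 2},
    "raven":    {"magpie": 2, "raven": 1, "starling": 1},
    "starling": {"magpie": 2, "raven": 0, "starling": 0},
}

total = sum(sum(row.values()) for row in confusion.values())  # 12 birds

# "Aggregate" agreement: compare only the per-species totals each birder reports
# (the overlap of the two counts), ignoring which individual bird was which.
expert_totals  = {sp: sum(row.values()) for sp, row in confusion.items()}
amateur_totals = {sp: sum(row[sp] for row in confusion.values()) for sp in confusion}
aggregate = sum(min(expert_totals[sp], amateur_totals[sp]) for sp in confusion) / total

# Bird-by-bird agreement: how often the two identifications actually matched.
per_bird = sum(confusion[sp][sp] for sp in confusion) / total

print(f"aggregate agreement: {aggregate:.0%}")  # ~92%
print(f"per-bird agreement:  {per_bird:.0%}")   # 25%
```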
But it’s easy to see why this example is misleading. In it, the only way the two individuals can fail to agree is for one to actively contradict the other; there is no “don’t know” option. That’s not the case with Cook et al’s paper. Their paper specifically points out:
Among self-rated papers not expressing a position on AGW in the abstract, 53.8% were self-rated as endorsing the consensus.
Rating an abstract as “not expressing a position” is akin to taking no position. That is, Cook et al found that in 53.8% of the cases where their raters took no position, the papers’ own authors said the full paper endorsed a particular “consensus.” That’s no surprise. A full paper has more information than its abstract, so of course fewer full papers will take no position than abstracts do.
Let’s go back to the bird watcher example. Suppose the fresh enthusiast in the example knew his limitations. Suppose he knew what a raven looked like, but not what a magpie or starling looked like. He saw 12 birds, rated four as ravens and said he didn’t know what the other eight were. His agreement with the skilled bird watcher would only be 33%. Would that “invalidate” anything?
Of course not. That they failed to agree ~67% of the time (much like the ~62% in the Cook data) doesn’t mean anyone was wrong. It doesn’t mean the fresh enthusiast was stupid. It doesn’t mean we can’t trust the answers he did give. All it means is that when you measure different things (here, the ornithological knowledge of two different people), you get different answers.
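To put numbers on that, here is the same kind of sketch for the raven-only scenario. The bird-by-bird data is again made up to match the example, but it makes the key distinction explicit: raw agreement is low, yet there is not a single case where the two watchers actively contradict each other.

```python
# The same 12 birds, but now the amateur only identifies ravens and says
# "don't know" for everything else. The bird-by-bird data is made up to
# match the example in the post.
expert  = ["magpie"] * 6 + ["raven"] * 4 + ["starling"] * 2
amateur = ["unknown"] * 6 + ["raven"] * 4 + ["unknown"] * 2

pairs = list(zip(expert, amateur))
agreement = sum(e == a for e, a in pairs) / len(pairs)

# An outright contradiction is when both give a definite species and they differ.
contradictions = sum(e != a for e, a in pairs if a != "unknown") / len(pairs)

print(f"raw agreement:           {agreement:.0%}")       # 33%
print(f"outright contradictions: {contradictions:.0%}")  # 0%
```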
So when Richard Tol publishes a paper criticizing the Cook et al paper, saying (in part):
No less than 63% of abstract ratings differ from the paper ratings, 25% differ by more than 1 point, and 5% by more than 2 points
Realize it doesn’t tell us anything useful. A failure to agree 63% of the time doesn’t automatically indicate a problem. If it did, Cook et al themselves wouldn’t have highlighted the fact that there were substantial differences between the two sets of ratings. That means the only real problem Tol highlights on this point is:
0.7% of ratings were rejections in one case and endorsement in the other (Fig. S21).
But outright disagreement less than 1% of the time is hardly notable. It certainly doesn’t “invalidate the conclusions of the paper.” It doesn’t mean “trickery [was] employed.” The only “trickery” here comes from some critics of Cook et al. They are the ones saying that if two people don’t agree, one of them must be stupid.
They can dress it up with statistical tests. They can pontificate about things like Cohen’s kappa or Krippendorff’s alpha. It doesn’t matter. It’s all nonsense. It’s logically no better than if they said, “If you’re not with us, you’re against us.”
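For anyone who wants to see just how little such a test proves here, you can compute Cohen’s kappa for the raven-only birder from the sketch above. This uses the standard formula, kappa = (p_o - p_e) / (1 - p_e), on the made-up example data; kappa comes out around 0.25, close to the supposedly “worthless” threshold of 0.2 and nowhere near the 0.8 called “excellent,” even though the amateur never misidentified a single bird.

```python
from collections import Counter

# Same made-up data as the raven-only example above.
expert  = ["magpie"] * 6 + ["raven"] * 4 + ["starling"] * 2
amateur = ["unknown"] * 6 + ["raven"] * 4 + ["unknown"] * 2
n = len(expert)

# Cohen's kappa = (p_o - p_e) / (1 - p_e): observed agreement corrected for
# the agreement expected by chance, given each rater's own category totals.
p_o = sum(e == a for e, a in zip(expert, amateur)) / n

expert_counts, amateur_counts = Counter(expert), Counter(amateur)
categories = set(expert_counts) | set(amateur_counts)
p_e = sum((expert_counts[c] / n) * (amateur_counts[c] / n) for c in categories)

kappa = (p_o - p_e) / (1 - p_e)
print(f"observed agreement p_o = {p_o:.2f}")    # 0.33
print(f"chance agreement   p_e = {p_e:.2f}")    # 0.11
print(f"Cohen's kappa          = {kappa:.2f}")  # 0.25
```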
Wait for it…