It’s always weird when someone accuses you of making an incredibly stupid mistake based upon absolutely nothing. It’s even weirder when they write an entire blog post about it. But it only reaches the truly absurd when, just a few days prior, that person says:
Often, though, when someone thinks that everyone else has forgotten something, it’s more likely that everyone else hasn’t, but they’ve considered it and realised something that the first person hasn’t considered.
I know there’s a difference between saying “everyone” has made a mistake and saying one person has, but still, it’s fun to keep that remark in mind when you look at this post by blogger Anders. Paragraph after paragraph of this post is devoted to challenging an idea I first expressed here, regarding this image:
That image shows the ratings in the 2013 Cook et al consensus paper, broken down by individual rater and rating. In my first description of what it showed, I specifically said:
It’s worth pointing out that after the raters finished doing their ratings, raters whose ratings disagreed had a chance to communicate regarding their disagreements. They could then change their ratings if they wished. The results I’m showing are for after this reconciliation phase. That is, even after giving a massive number of responses, the results did not converge. Even talking to one another didn’t make their results converge.
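For concreteness, the kind of per-rater breakdown the image shows could be generated along these lines. This is only a sketch with an invented file and column layout, not the actual data format or the code behind the image:

```python
import pandas as pd

# Hypothetical layout: one row per (abstract, rater) pair with the final,
# post-reconciliation endorsement level. File and column names here are
# my assumptions, not the paper's actual data format.
ratings = pd.read_csv("ratings.csv")  # assumed columns: rater_id, endorsement

# Share of each rater's ratings at each endorsement level (1-7). If the
# raters applied the criteria the same way, the rows would look alike.
breakdown = pd.crosstab(ratings["rater_id"], ratings["endorsement"],
                        normalize="index")
print(breakdown.round(3))
```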
Anders refers to this image and the conclusion drawn from it, quoting the Cook et al paper and saying:
“Each abstract was categorized by two independent, anonymized raters. A team of 12 individuals completed 97.4% (23 061) of the ratings; an additional 12 contributed the remaining 2.6% (607). Initially, 27% of category ratings and 33% of endorsement ratings disagreed. Raters were then allowed to compare and justify or update their rating through the web system, while maintaining anonymity. Following this, 11% of category ratings and 16% of endorsement ratings disagreed; these were then resolved by a third party.”
Okay, so different raters produced different ratings initially, which were later reconciled or resolved by a third party. So, it seems that Brandon and Richard are simply pointing out something that was both acknowledged by the original paper and, presumably, entirely expected. Why have each abstract rated by two people and then have a reconciliation procedure if you expected the raters’ initial ratings to agree? Remember, the goal was to rate a sample of abstracts using people who read and assessed each abstract. The raw data was the abstracts and the output was the final, reconciled ratings. Finding some – essentially entirely expected – issues with some intermediate data doesn’t tell you that the final ratings are “wrong”.
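For reference, the disagreement figures Anders quotes are simple pairwise mismatch rates: each abstract was rated twice, and a pair counts as a disagreement whenever the two ratings differ. Here is a minimal sketch of that calculation, with a hypothetical file and column layout of my own invention:

```python
import pandas as pd

# Hypothetical layout: one row per abstract, with both raters' endorsement
# ratings side by side. Column and file names are assumptions.
pairs = pd.read_csv("paired_ratings.csv")  # assumed: rating_a, rating_b

# Fraction of abstracts on which the two raters gave different ratings.
# On the initial ratings this would correspond to the ~33% endorsement
# figure; on the ratings after the comparison step, the ~16% figure.
disagreement = (pairs["rating_a"] != pairs["rating_b"]).mean()
print(f"endorsement disagreement: {disagreement:.1%}")
```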
All Anders has done is claim that the image I said was for post-reconciliation ratings is actually for pre-reconciliation ratings. He doesn’t accuse me of lying or anything in the process; he’s simply unaware of what I said about the image. That is, his entire post was based upon him assuming he knew things about the data that were completely untrue. Consider what I told Richard Tol:
Richard Tol, repeating this for the pre-reconciliation ratings is on my list of things to do. I’m not in a rush though. The large number of 0s in those ratings makes the data set awkward to do calculations on.
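Incidentally, those 0s are easy enough to set aside before doing any arithmetic. A minimal sketch, again with assumed names, and assuming the 0s are placeholders rather than substantive ratings:

```python
import pandas as pd

# Hypothetical layout again; the key assumption is that a 0 marks
# something other than a substantive 1-7 endorsement rating.
pre = pd.read_csv("pre_reconciliation.csv")  # assumed column: endorsement

substantive = pre[pre["endorsement"] != 0]
print(f"dropped {len(pre) - len(substantive)} zero entries")
print(substantive["endorsement"].describe())
```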
Had Anders bothered to check what I said when I first described this image, he’d have known better than to write what he did. Tol and I both clearly knew the data I showed was not the data Anders claimed I showed. Even if he hadn’t read the comments section where I described the image, he could have read my follow-up post discussing it, which specifically said:
They seem to disagree quite a bit, and that’s after they went through an entire stage of the study where they talked to one another about their disagreements. If only idiots wouldn’t know what was being asked, why was there so much disagreement? Were the people doing the study idiots?
In other words, the only person who has mixed up anything regarding the data and how it should be interpreted is Anders himself. In case there is any doubt about that, here is what I’d have gotten had I made the same image for the pre-reconciliation ratings:
As you can see, the rater biases were even greater prior to the reconciliation phase. An additional difference is that the pre-reconciliation data has ratings of 0-7 while the post-reconciliation data has ratings of 1-7, which makes it impossible to mix the two up.
A higher quality version of the image can be found here. The x-axis in it is wrong, though: it says 1-8 instead of 0-7. I didn’t notice that until after I uploaded it, and I don’t think it matters enough to fix.
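As an aside, one crude way to check that the biases really were larger before reconciliation would be to compare each rater’s mean rating against the overall mean in both data sets. A minimal sketch, under the same assumed file layout as above, using mean offset as a rough stand-in for bias:

```python
import pandas as pd

def rater_offsets(path: str) -> pd.Series:
    """Each rater's mean endorsement minus the overall mean, as a crude
    stand-in for rater bias. File layout and column names are assumptions."""
    df = pd.read_csv(path)
    df = df[df["endorsement"] != 0]  # 0s appear only pre-reconciliation
    return df.groupby("rater_id")["endorsement"].mean() - df["endorsement"].mean()

# If the biases shrank during reconciliation, the post offsets should be
# visibly less spread out than the pre offsets.
print(rater_offsets("pre_reconciliation.csv").sort_values())
print(rater_offsets("post_reconciliation.csv").sort_values())
```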