I’m Stupid Because Someone Didn’t Read What I Wrote

It’s always weird when someone accuses you of making an incredibly stupid mistake based upon absolutely nothing. It’s even weirder when they write an entire blog post about it. But it only reaches the truly absurd when, just a few days prior, that person says:

Often, though, when someone thinks that everyone else has forgotten something, it’s more likely that everyone else hasn’t, but they’ve considered it and realised something that the first person hasn’t considered.


I know there’s a difference between saying “everyone” has made a mistake and saying one person has, but still, it’s fun to keep that remark in mind when you look at this post by blogger Anders. Paragraph after paragraph of this post is devoted to challenging an idea I first expressed here, regarding this image:

[Image: 5-10-tease]

That image shows the ratings in the 2013 Cook et al consensus paper, broken down by individual rater and rating. In my first description of what it showed, I specifically said:

It’s worth pointing out after the raters finished doing their ratings, raters whose ratings disagreed had a chance to communicate regarding their disagreements. They could then change their ratings if they wished. The results I’m showing are for after this reconciliation phase. That is, even after giving a massive number of responses, the results did not converge. Even talking to one another didn’t make their results converge.

Anders refers to this image and the conclusion drawn from it, quoting the Cook et al paper and saying:

“Each abstract was categorized by two independent, anonymized raters. A team of 12 individuals completed 97.4% (23 061) of the ratings; an additional 12 contributed the remaining 2.6% (607). Initially, 27% of category ratings and 33% of endorsement ratings disagreed. Raters were then allowed to compare and justify or update their rating through the web system, while maintaining anonymity. Following this, 11% of category ratings and 16% of endorsement ratings disagreed; these were then resolved by a third party.”

Okay, so different raters produced different ratings initially, which were later reconciled or resolved by a third party. So, it seems that Brandon and Richard are simply pointing out something that was both acknowledged by the original paper and, presumably, entirely expected. Why have each abstract rated by two people and then have a reconciliation procedure if you expected the raters’ initial ratings to agree? Remember, the goal was to rate a sample of abstracts using people who read and assessed each abstract. The raw data was the abstracts and the output was the final, reconciled ratings. Finding some – essentially entirely expected – issues with some intermediate data doesn’t tell you that the final ratings are “wrong”.

As you can see, all Anders has done is claim the image I said was for post-reconciliation ratings is for pre-reconciliation ratings. He doesn’t accuse me of lying or anything in the process. He’s simply unaware of what I said about the image. That is, his entire post was based upon him assuming he knew things about the data that were completely untrue.

A humorous aspect to this is that Richard Tol, whom Anders responded to in that post, specifically suggested I “repeat this for the pre-conciliation ratings.” I responded, telling him:

Richard Tol, repeating this for the pre-conciliation ratings is on my list of things to do. I’m not in a rush though. The large number of 0s in those ratings makes the data set awkward to do calculations on.

Had Anders bothered to look at what I said when I first described this image, he’d have known better than to write what he’s written. Tol and I both clearly knew the data I showed was not the data Anders claimed I showed. Even if he hadn’t read the comments section where I described the image, he could have read my follow-up post discussing it, which specifically said:

They seem to disagree quite a bit, and that’s after they went through an entire stage of the study where they talked to one another about their disagreements. If only idiots wouldn’t know what was being asked, why was there so much disagreement? Were the people doing the study idiots?

In other words, the only person who has mixed up anything regarding the data and how it should be interpreted is Anders himself. In case there is any doubt about that, here is what I’d have gotten had I made the same image for the pre-reconciliation ratings:

[Image: 5-10-pre-reconciliation]

As you can see, the rater biases were even greater prior to the reconciliation phase. An additional difference is that the pre-reconciliation data has ratings of 0-7 while the post-reconciliation data has ratings of 1-7, which makes it impossible to mix the two up.
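
Incidentally, that difference gives a trivial way to check which version of the ratings a given file contains. Here is a minimal sketch, assuming a hypothetical CSV export with an endorsement_level column (the real files may be laid out differently):

```python
import csv

def is_pre_reconciliation(path, column="endorsement_level"):
    """Guess whether a ratings file holds pre-reconciliation data.

    Pre-reconciliation ratings run 0-7 while post-reconciliation
    ratings run 1-7, so any 0 in the column marks the file as
    pre-reconciliation data.
    """
    with open(path, newline="") as f:
        return any(int(row[column]) == 0 for row in csv.DictReader(f))
```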

A higher quality version of the image can be found here. The x-axis in it is wrong though. It says 1-8 instead of 0-7. I didn’t notice that until after I uploaded it, and I don’t think it matters enough to fix.


5 comments

  1. Anders is currently either playing stupid or being stupid, saying:

    Just for clarity, Brandon is pointing out that the figure was actually based on post-reconciliation data. So, maybe it’s not strictly the initial ratings, but some intermediate stage. I don’t see how that changes anything, given that the quote from the paper that I include in the post also includes that there was still disagreement between raters after they were given an opportunity to reconcile their ratings.

    First off, completely failing to understand what you’re talking about is obviously an important issue. The idea that nobody should care Anders didn’t bother to actually look at anything I wrote is just silly. Second, the point of the images I’ve shown is to show that raters didn’t just disagree with one another; they disagreed in biased ways. Nothing in the Cook et al paper indicates that.

    If raters are biased, the results depend upon which raters were most active. This is easy to see if we consider the extreme (a short sketch after the comments works through the numbers). Suppose there were only two raters, one who always picked “Endorse AGW” and one who always picked “Reject AGW.” If both were equally active, the results would show an equal number of Endorse and Reject ratings. If one was more active than the other, the results would favor his bias. The ultimate results would thus be dependent upon rater bias. Anders acts as though he doesn’t understand that point. I’m not sure how he could fail to. Anyone who gives this issue any real thought should.

    Of course, anyone who gives this issue any real thought wouldn’t keep making things up about it. From the same comment:

    These then went to a third party, but of course this stage is not included in the figure (otherwise they’d all agree).

    The data I’m displaying does include tie-break ratings. I even pointed this out in the comments on the original post. That shows Anders didn’t just fail to research his initial post. Even now, after realizing he had simply made things up in that post, he has posted a new claim which shows he still hasn’t researched his arguments.

    I guess it’s hardly surprising that a person who knows nothing about the basic facts of an issue would find the arguments unconvincing.

  2. Shub Niggurath, yup. There’s a rater (who only rated 112 abstracts) whose initial ratings were 50% at level 3 and 32% at level 4. After reconciliation, his ratings were 29% at level 3 and 62% at level 4.

    Richard Tol is also right to be more concerned about the orange rater. His initial ratings were 16% at level 3 and 79% at level 4. After reconciliation, they were 20% at level 3 and 73% at level 4. With how many ratings he did, those differences amount to ~200 ratings.
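
The two-rater example from my first comment above can be made concrete with a minimal sketch. The rater behaviour and counts here are hypothetical, chosen only to show how the combined tally tracks rater activity:

```python
from collections import Counter

def tally(ratings_per_rater):
    """Combine each rater's individual ratings into one overall tally."""
    total = Counter()
    for ratings in ratings_per_rater.values():
        total.update(ratings)
    return total

# One rater always picks "Endorse AGW", the other always "Reject AGW".
endorser = ["Endorse AGW"] * 500
rejecter = ["Reject AGW"] * 500

# Equally active raters give an even split.
print(tally({"A": endorser, "B": rejecter}))
# Counter({'Endorse AGW': 500, 'Reject AGW': 500})

# If the endorsing rater does three times as much work, the combined
# result leans toward his bias even though neither rater changed.
print(tally({"A": endorser * 3, "B": rejecter}))
# Counter({'Endorse AGW': 1500, 'Reject AGW': 500})
```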
