A Small Oddity

I was looking over the Cook et al consensus paper data I recently obtained, and I noticed a small oddity. I’m curious if anyone can offer an explanation.

Each abstract examined in this paper was rated by two people. Once those ratings were complete, raters were allowed to discuss their disagreements with one another through anonymous forms, updating their ratings if they so desired. After this step, raters still disagreed about what 1870 (16%) abstracts said about the consensus. These disagreements were resolved by a “TiebreakRater.”

The oddity is 7% of the time the TiebreakRater disagreed with both of the original raters. Can anyone explain that to me? I know 7% of 16% is tiny (~1%), but should we really trust the other tie breaks?
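
For scale, the arithmetic behind that ~1% can be sketched in a few lines of Python. The inputs are the rounded figures quoted above, so the outputs are approximate, and the variable names are mine, purely for illustration.

```python
# Rounded figures quoted in the post; the variable names are illustrative only.
disputed_abstracts = 1870        # abstracts still in dispute after reconciliation
share_disputed = 0.16            # those abstracts as a share of all abstracts
share_tiebreak_neither = 0.07    # tie-breaks that matched neither original rating

# Share of ALL abstracts where the tie-break matched neither rating.
share_of_all = share_disputed * share_tiebreak_neither

# Approximate number of abstracts that share represents.
approx_count = disputed_abstracts * share_tiebreak_neither

print(f"{share_of_all:.1%} of all abstracts")
print(f"about {approx_count:.0f} abstracts")
```

Because both percentages are rounded, the count is only good to within a handful of abstracts.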

I don’t know. What I do know is it is interesting to look at these tie-breaks. This is their distribution:

  1   2   3   4   5   6   7
 13 210 871 749  24   2   1

That’s pretty skewed. That’s not necessarily bad though. Abstracts in certain categories may be more difficult to rate. As a way to check, we can look at the distribution of the ratings which disagreed (divided by two and rounded to make them more comparable):

  0   1   2   3   4   5   6   7
  8  46 279 786 708  31   8   2

The distribution is a bit different. The tie-breaks seem more conservative, slightly skewed to the endorsement side. That’s interesting, but for the sake of thoroughness, we shouldn’t combine each set of ratings. We should look at them individually:

  0   1   2   3   4   5   6   7
 13  69 336 812 586  39  12   3
  4  24 222 760 831  23   5   1
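
As a sanity check, the combined distribution shown earlier can be recovered by averaging these two rows. This is only a sketch: it assumes the top row is the first round of ratings, and that exact halves were rounded to the nearest even number (which is what Python’s built-in `round()` does).

```python
# Counts copied from the two rows above (ratings 0 through 7).
# Treating the top row as the first round is an assumption based on the text.
first_round = [13, 69, 336, 812, 586, 39, 12, 3]
second_round = [4, 24, 222, 760, 831, 23, 5, 1]

# Python's round() sends exact halves to the nearest even integer
# (8.5 -> 8, 46.5 -> 46), which is what the combined row appears to use.
combined = [round((a + b) / 2) for a, b in zip(first_round, second_round)]

print(combined)
print(sum(first_round), sum(second_round))
```

Each row sums to 1870, one rating per disputed abstract, and the averaged row matches the combined figures given earlier.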

There are significant differences. The first round of ratings is dramatically different from the second. That shows how raters rated abstracts changed over time. Tie-break ratings favored the second round of ratings because the tie-breaks were done after the second round of ratings. We can see this bias by looking at agreement rates:

Tie Break Rating/First Rater – 31%
Tie Break Rating/Second Rater – 62%
Tie Break Rating/No Rater – 7%
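
For anyone wanting to run the same check on their own data, here is a minimal sketch of how rates like these could be tallied. The function name and the toy triples are hypothetical, not the real ratings. It assumes each disputed abstract is stored as a (first rating, second rating, tie-break rating) triple in which the first two ratings differ.

```python
from collections import Counter

def tiebreak_agreement(rows):
    """Classify each (first, second, tiebreak) triple by which rating the tie-break matched."""
    counts = Counter()
    for first, second, tiebreak in rows:
        if tiebreak == first:
            counts["first"] += 1
        elif tiebreak == second:
            counts["second"] += 1
        else:
            counts["neither"] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no rows to tally")
    return {key: counts[key] / total for key in ("first", "second", "neither")}

# Hypothetical toy triples, NOT the real data.
example_rows = [(3, 4, 4), (2, 3, 3), (4, 3, 4), (3, 4, 2), (4, 3, 3)]
rates = tiebreak_agreement(example_rows)
print(rates)
```

The three shares always sum to one, so a high “neither” share has to come at the expense of the other two.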

Since we’re talking about biases, it’s worth pointing out this may not just be a bias introduced by raters rating abstracts differently over time. Another issue might be who performed the tie-breaks. Suppose all of the tie-breaks were done by one person. Couldn’t that influence the results?

It’s an idea worth considering. Unfortunately, the University of Queensland has threatened to sue me if I examine the data to look into it. Fortunately, I know that threat is complete bunk. Even if you believe I can’t disclose the data I have, there is no question I can look at it.

When I do, I see 681 of these 1870 disagreements were resolved by a single rater. That rater had only performed 711 ratings prior to the tie-break stage. Somehow a person who performed ~3% of the original ratings wound up performing ~36% of the tie-break ratings. I think that might have affected the results. Don’t you?

And don’t you wish we could investigate things like this without having to worry about being sued?
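
For what it’s worth, the shares quoted above for that single tie-break rater are easy to check. This sketch uses only the numbers given in this post; the ~3% is taken as quoted, so the ratio is approximate, and the variable names are just for illustration.

```python
# Figures quoted in the post; the variable names are illustrative only.
tiebreaks_total = 1870          # disagreements sent to a tie-break
tiebreaks_by_rater = 681        # of those, resolved by the one rater
original_share_quoted = 0.03    # the post's "~3%" of original ratings (rounded)

tiebreak_share = tiebreaks_by_rater / tiebreaks_total
overrepresentation = tiebreak_share / original_share_quoted

print(f"tie-break share: {tiebreak_share:.1%}")
print(f"roughly {overrepresentation:.0f}x the rater's share of original ratings")
```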



  1. Brandon:

    Does the tiebreaker involve a vote by all of the other raters?

    I could easily see two raters disagreeing, and then 5 tiebreakers all agreeing that no it wasn’t 3 or 4 it was really 5.

    Or do you think the tiebreaker is a single person?

  2. Each individual tie-breaker rating was definitely done by a single person.

    An interesting point is of all the tie-break ratings, 53% were done by one of two people. ~80% of them were done by only four people; ~90% by five.

  3. Brandon,
    I think his knowledge makes it quite difficult to determine whether a 7% disagreement is large or small. If they’d just gotten another person to come in blind, and the other person pulled (1-7) out of a hat, he would disagree 6/7 times, so that would be a quite low disagreement. But 7% of the time, knowing the two other ratings and their arguments the third guy disagrees with both? Strikes me as high if there is an objectively correct answer. But I’m not sure how you figure out what the correct rate of agreement would be when there is a “correct” answer.

    Presumably, the rate at which people disagree tells us something about the degree of subjectivity.

    Suppose you had college guys rate how good looking women were on a scale of 1-7. Then, when there was disagreement a tie breaker person came along and rated? How often would they all agree or disagree?

    Now suppose you had college guys rate whether a particular mathematical arithmetical process was ‘done properly’ on a scale of 1-7. How often would all disagree?

    (Just to see there is a context where the math example might not be nonsense, maybe intermediate determinations are for rounding. Like ask people whether 10/pi=3 is completely right (1), totally wrong (7), or somewhere in between. Then ask about 10/pi=3.2, or 10/pi=3.18309, which to the degree shown is incorrect rounding, since the correctly rounded value is 3.18310. On the other hand, if you are doing a computation, using 3.18309 will often get you answers that are closer to correct than using 3.2, which is rounded ‘correctly’ but is further from the really, truly ‘correct’ answer.)

    I tend to think that a case where 7% of people who want to cooperate with each other don’t agree even after a reconciliation process suggests that the issue is not entirely objective and one is measuring the raters to a large extent. But… well… who knows.

  4. Perhaps a sample of the papers used in the study could be tested to determine the level of bias.

  5. lucia, I agree. Another problem is the amount of uncertainty would vary by category. A tie-breaker could disagree 7% of the time on most ratings while refusing to ever pick a 5, 6 or 7. Even if you considered that 7% to be low, the tie-breakers’ ratings would indicate a significant problem.

    Man Bearpig, I’ve actually been toying with the idea of collecting the ~4,000 abstracts the “consensus” is based on and creating my own web-rating system for them. I would use sensible categories that let us clearly define what any “consensus” we found would be.

    The problem is the effort required. It’d take a fair amount of time and work to set that up, plus a bit of money for the site. I’m not sure if it’d be worthwhile.

  6. Now that I think about it, I did collect all of their abstracts and titles once upon a time. Putting them into a database wouldn’t be too much trouble. That means the only real hurdle would be creating the web interface. That’s not as bad as I thought.

    I’m still not sure I’d want to spend the time or money on it though.

  7. After holding this secret data for 3 weeks the only headline discovery we get is about “a small oddity” concerning tie-breaks in 3% of the papers. If there was anything more to be found we would have heard about it by now.
    I conclude there is nothing incriminating whatsoever in the secret data.

    Sounds like UQ has made a mountain out of a molehill here, just to be spiteful to a known critic.

  8. Andrew McRae, I can’t make sense of your comment. I haven’t had this data for three weeks (it’s been almost two), the tie-breaks in question aren’t “in 3% of the papers,” and I first announced this data with a problem as serious, if not more serious, than the one I’ve highlighted in this post. I’m also not sure who the “known critic” is supposed to be, but that’s less important.

    I won’t say the data I have shows a “smoking gun,” but it definitely shows more than I’ve discussed in my posts. The most insightful commentary on it was probably that in the comments section of another post where I discussed a bunch of Kappa scores.

  9. “Somehow a person who performed ~3% of the original ratings wound up performing ~36% of the tie-break ratings.”

    That seems to me rather more like an “override” than a tie-breaking vote. Given the personalities involved and the fact that this paper seems to have had a pre-defined conclusion, I’d say an override is the more likely action.

  10. Anthony Watts, I’m not sure if I’d call it an “override” because I’m not sure what that term would mean. This wasn’t overriding an agreed upon position or anything. It was just disregarding other people’s views.

    A problem I see is there were a ton of ways this project could be abused. That there were so many different problems means it’s difficult to judge the impact of any particular one. A single rater with malicious intent had tons of different ways he could have biased the results. It’s difficult to try to check for them.

    I think this is a perfect situation for a re-analysis. We should examine the ~4,000 abstracts Cook et al rated as “Endorse AGW” and see what they actually say. Once that’s done, we can say, “The popularly accepted consensus on global warming shows…”

    Of course, what the results actually show probably won’t reflect the real views of the “consensus,” but it’s not our problem if abstract ratings are a bad proxy. Cook et al was widely accepted. The people who accepted it can’t criticize us for using it as well. (Well, they can, but it’d be hypocritical.)

  11. “It was just disregarding other people’s views.” Well, that fits my definition of “override” in this case.

    It seems likely the outcome was steered in a desired direction by that person.

  12. That’s possible. I just prefer not to decide something was “desired” without knowing anything about the person in question. Everyone has their biases. Some are based on desires, but many aren’t. A lot of our biases are tied to how we’ve been taught to interpret things.

    In any event, I don’t think examining the ratings can resolve this problem. All the people were from the same social group. The data shows individual biases were a problem, but it can’t show us anything about the effect of group biases. It’s quite plausible the results would have been notably different if a different group of raters had done the ratings.

    Speaking of which, I’ve been thinking about what it’d take to set up a system to crowd-source ratings of some of the Cook et al abstracts. The Skeptical Science crowd likes to tell people the “correct” response is to do a study of their own. Maybe we should.

    What do you think?

  13. Anthony, you are correct. The tie-breaker is just an arbitrary third rating. If an abstract got a ‘3’ and a ‘4’ and required a tie-break, the third person would step in and randomly declare the abstract to be either ‘3’ or ‘4’.

    If they had had a second, independent tie-breaker, they could have even fully recovered the original 33% disagreement rate! 🙂 And this is the key to the whole problem. The 33% error rate is still fully in the final ratings. Only it cannot be seen because it has no rating set to compare it with.

    Also, if the outcome was steered in a desired direction, you would have abstract ratings change slowly over time (say, from neutral to ‘support AGW’). This is not seen in the data. From Brandon’s preliminary results and the data available openly, what seems to have happened is a handful of high-productivity raters’ scores dominated the picture and low-frequency raters were overridden. The proportions of ratings, however, remain the same with each round of rating. This graph shows this: http://nigguraths.wordpress.com/?attachment_id=4173.

  14. The vast majority of climate studies are funded by entities favoring an AGW result, so of course the numbers will be skewed, though undoubtedly the 97% figure is too high. I get why this is fun, but serious scrutiny of the TCP methodology is a waste of time, as the fix was in from the beginning.

  15. RH makes an important point. Science journals are openly talking about heretics and betrayals. Government review boards openly have enviro-extremists with vested financial interests in climate catastrophes. Government funding is poured out onto studies that favor a CO2 obsession and climate catastrophe. Governments fund billions to companies that claim to have so-called alternative energies with no accountability for results.
    There should be no surprise that studies favor the extreme point of view of the climate obsessed.

  16. Brandon, regarding Sense and Sensibility…
    Sorry, I hadn’t overridden the default on my NoScript plugin the first time so it prevented the javascript on your blog from loading any more than the last 5 or 6 blog posts on the front page, thus I did not see your earlier post (A Tease) with the coloured circles diagram.
    I refer to this one https://hiizuru.files.wordpress.com/2014/05/5-9-tease_fix.png

    The differences in circle radius are due to some keen raters rating way more papers than other raters, right? (Circles that are small in one category are usually small in the others too.) So the bias is just in the spread of the vertical percentage ordinate, which doesn’t look “dramatic” to me. Ignore the purple dot guy as an outlier, it’s at most a range of 20% in a minimum 50% consensus on cat 4.
    Damned if I can see any big skew between raters just from those pretty circles, but perhaps all will become clear with time.

    I think your earlier re-totalling of the official figures based on their long-form descriptions of categories was far more detrimental to the Cooked-up consensus than this new bit of minor disagreement amongst warmists at the hind end of an international multi-level 3-stage pro-warmism paper filtering process. RH was keeping sight of the bigger picture.

    Anyhow, hang in there, perhaps some interesting results will be forthcoming, and from the “Doubles Down” response it looks worth it just for the feather ruffling in PR terms if not the science. Popcorn sales are through the roof.

  17. Andrew McRae, it’s important to remember that image is for after the reconciliation phase where raters examined and discussed their disagreements. That there are clear biases even after that is a problem, but if we’re just talking about biases in general, the first image in this post is more relevant.

    That shows what differences there were prior to people talking them out. They’re fairly significant. Given those differences stem from individual biases, one must wonder what the spread would be like if the raters weren’t all from Skeptical Science.

    I definitely think other issues are more damning, but this one is important because it shows the rating system wasn’t clear to the raters. That’s a problem in and of itself, and it supports the more damning criticisms.
