A Tease

I have some interesting news to share, but before I can, I need to do a number of things. In the meantime, I’ll offer you a teaser of what I’ve uncovered. If you think about it, you should be able to guess what this image is:


5-9-2014 Edit: The original version of this image squared counts rather than taking their square roots. I’ve replaced it with a properly scaled version.



  1. Not sure, an odd number of items with large blob(s) in the middle. Something to do with answers on a Likert scale with 7 choices, 4 neutral?

  2. I mean the middle number 4 being a neutral choice. Although I guess that is tautological really since Likert scales always have the middle choice being neutral AFAIK.

  3. @leopard

    Note that “forced choice” Likert Scales have an even number of choices, and so don’t have a central “neutral” choice. And modified Likert scales can have a “don’t know/don’t care” option in addition to or instead of a central neutral choice.

  4. @Jonathan Jones.

    Thanks, so to clarify I mean “Likert scales always have the middle choice being neutral” if they have an odd number of choices, and the middle one is not “don’t know/don’t care” (AFAIK) 🙂

    I’ll have a further punt and guess that these 7 items are a representation of the ‘level of endorsement of AGW’ choices from that soon to be classic paper Quantifying the consensus on anthropogenic global warming in the scientific literature ?

  5. I’m going to guess that this is the distribution of the Cook et al “97%” database. The abscissa is his 1-7 scale of “endorsement”, the ordinate is the percentage in each category. The color scale represents some discrete attribute X. The area, or perhaps the radius, of each circle is proportional to the number of abstracts binned in a particular category with a particular value of X….So what might X be then? There are clues in the figure: There are perhaps 20 values of X. The two biggest circles account for a very large fraction of the total, and about half are mere dots, even for category 4, so don’t represent many abstracts. I don’t think that X is the year of publication, because while there were definitely more papers included for later years, the variation over years is not as wide as the size of circles indicate. So — wild guess — X is the rater.

  6. Alright. I’ve done all the work I needed to do to make sure I have everything I think I have, as well as to confirm its validity. I can now say HaroldW’s guess was correct. The graph in this post shows the individual raters’ ratings in the Cook et al consensus paper. That’s right. I have a copy of the paper’s data with the rater id for each rating included. This copy also has timestamps, though at a daily resolution (I don’t know if the time of ratings was recorded or not).

    I made the image in this post mostly to show I have the data. However, I also had a secondary purpose. One of the first things I checked upon getting a hold of this data was how the distribution of ratings varied amongst raters. This image shows the result. The most striking part is the results for categories 2-3. There are dramatic differences in how likely raters were to pick those categories. That shows the raters had different ideas as to what each category meant, or they had different biases when reading abstracts.

    It’s worth pointing out after the raters finished doing their ratings, raters whose ratings disagreed had a chance to communicate regarding their disagreements. They could then change their ratings if they wished. The results I’m showing are for after this reconciliation phase. That is, even after giving a massive number of responses, the results did not converge. Even talking to one another didn’t make their results converge.

    This is especially troubling as the discrepancies this image shows were “resolved” solely by having a tie-break in which a third person got to make the final call. That third person was within the same group, meaning their decisions would reflect the same discrepancies this image shows. They would have no more inherent validity than either of the disagreeing ratings.

  7. I’m not disclosing that right now.

    I’ve sent John Cook an e-mail alerting him to what material I have, offering him an opportunity to give me reasons I should refrain from releasing it or particular parts of it. I figure a day or two to address any potential privacy concerns should be enough.

    His response will determine how much information I provide. No obligations were placed upon me regarding any of the material I have, but I don’t see any compelling reason to provide information about how I got it either. I’d need a better reason than just satisfying people’s curiosity.

    Plus, I still have to decide what I want to do with the material.

  8. Cook has been told by Tol several times that no one wants the actual identities of the raters. I absolutely agree. The id-tagged anonymised data are all that are needed for analysis. The raters can be anonymised easily: rater ‘X’=1, rater ‘Y’=2, and so forth.
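
A minimal sketch of the relabelling described above, assuming the data is a list of (rater name, rating) rows. The function name `anonymise_raters` and the sample rows are hypothetical and do not come from the Cook et al data.

```python
def anonymise_raters(rater_names):
    """Map each distinct rater name to an integer id, in order of first appearance."""
    ids = {}
    for name in rater_names:
        if name not in ids:
            ids[name] = len(ids) + 1
    return ids


# Hypothetical (rater, rating) rows; real names are replaced by 'X' and 'Y'.
ratings = [("X", 4), ("Y", 3), ("X", 2)]

ids = anonymise_raters(name for name, _ in ratings)
anonymised = [(ids[name], rating) for name, rating in ratings]

print(ids)
print(anonymised)
```

Ids assigned in order of first appearance can still leak the order in which raters show up in the file, so shuffling the names before numbering them may be preferable.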

    Cook wanted to keep rater identity locked up. There are two reasons: one, to protect from identifying specific ratings and patterns with specific people. Two, to prevent the biased, variable and subjective nature of the classification exercise from coming out.

    My view: It is ok to grant him the first. I have no interest in knowing what Dana Nuccitelli called a certain abstract, or what Andy Skuce called another one.

    What matters, however, is how variable the classification is between raters. In other words, there is no ‘consensus’ behind the consensus. As you point out, such high bias and variability in inter-observer ratings demonstrates the rank unfitness of the system they devised to extract so-called ‘consensus’ information from the undifferentiated mass of text their search strategy pulled up.

    I think you should wait for Cook’s response. It is his data. I doubt, however, he would have any sound reasons why de-identified rater ids/tag meta-data should not be released.

  9. By the way, the graphs I posted on Twitter earlier, pertain to the same question.

    With the data you put together, this is what comes out: http://nigguraths.wordpress.com/?attachment_id=4169

    In the first limb, every abstract got rated twice in the Cook project, by two different volunteers. These ratings are the naive, initial result of Cook’s volunteers taking their self-devised system and applying it on abstracts.

    In effect, with volunteers as a whole, there are two independent ratings (observations) for the same set of abstracts. This obviates possible objections about the application of inter-observer reproducibility and correlation measures such as kappa. The data substrate is the same, both observers apply, putatively, the same criteria, and we have two sets of observations.

    The graph shows kappa (in blue) and percent agreement (in red) between the two sets, arranged in increasing order by year. I smoothed it for easy readability. I know only an idiot smooths derived statistics such as kappa, but this is only for readability and no more. Both have been calculated for abstracts in blocks of 30.

    The yawning gap between inter-observer concordance and agreement, however, is clearly evident. The low kappa scores show the severe subjective nature of the rating system. The paper’s results carry no meaningful value.
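
For readers unfamiliar with the two statistics, here is a minimal sketch of percent agreement and Cohen's kappa for two sets of ratings of the same abstracts. The ratings below are invented for illustration and are not taken from the Cook et al data; the function names are likewise hypothetical.

```python
from collections import Counter


def percent_agreement(first, second):
    """Fraction of items on which the two sets of ratings match."""
    matches = sum(a == b for a, b in zip(first, second))
    return matches / len(first)


def cohens_kappa(first, second):
    """Cohen's kappa: agreement corrected for agreement expected by chance.

    Undefined (division by zero) if chance agreement is exactly 1.
    """
    n = len(first)
    observed = percent_agreement(first, second)
    first_counts = Counter(first)
    second_counts = Counter(second)
    # Chance agreement: product of the two marginal proportions, per category.
    expected = sum(first_counts[c] * second_counts[c] for c in first_counts) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented ratings on the 1-7 scale, dominated by the neutral category 4.
first_ratings = [4, 4, 4, 4, 3, 3, 2, 4, 4, 3]
second_ratings = [4, 4, 4, 3, 3, 4, 3, 4, 4, 4]

agreement = percent_agreement(first_ratings, second_ratings)
kappa = cohens_kappa(first_ratings, second_ratings)
print(agreement, kappa)
```

When one category dominates both sets, a lot of agreement is expected by chance alone, which is why kappa can sit well below raw percent agreement.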

  10. I’m not sure if I’d agree to anonymizing the raters. I’m waiting until I see what John Cook says before I decide. My issue with it is the material I have now was apparently available to all the raters (after the ratings were done). I find it difficult to see how one would argue privacy is a concern if no effort was ever made to ensure it. Unless there was some sort of agreement to disclose one’s ratings to the other raters, but nobody else, I don’t see where the expectation of privacy arises.

    Incidentally, I’m not sure how useful it’d be anyway. In their forum, they kept some track of how many ratings they did. That’d allow a number of them to be identified even if their ids were anonymized (this goes back to the lack of concern for privacy in the study). Also, I can’t see how one would be harmed by having their ratings attributed specifically to them (as we already know who the entire group participating was).

    But we’ll see what (if anything) Cook says. I said I’d give him the weekend. If I don’t hear anything tonight, I’ll try contacting him via Twitter/Skeptical Science. I may try having someone else from SkS get his attention for me. I don’t want him to simply overlook the e-mail I sent.

    By the way, there is some value in associating ids and names. We have comments from many of the people who participated in the study. It could be useful to try to match up biases in the ratings with people’s stated views.

  11. Wow, what do I win? 😉

    I didn’t pay much attention to the Cook et al. paper when it came out, nor after. In fact, I hadn’t seen the paper at all, and only downloaded it today in order to check if my first guess (that each circle corresponded to a year) was reasonable. So I wasn’t aware that rater-based statistics hadn’t previously been available. I might have tried to guess some other set with 20-ish members if I had known!

  12. You win a cut of the money I’m getting paid for this. So zero dollars 😉

    That explains things. I was surprised you got it so quickly. I figured most people wouldn’t guess that since nobody has had this data before. The reasoning you used was spot on, but people rarely follow the clues in such a direct way.

  13. I’ve replaced the original teaser image (though it’s still available on the server) with a fixed version. I apparently mixed up my original equation and squared the counts instead of taking their square roots. That’s obviously the wrong way to do area weighting. Scaling is largely arbitrary so it wasn’t “wrong,” but the new version gives a more informative portrayal.
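
The scaling issue can be sketched as follows, with made-up counts. For a circle's area to be proportional to a count, its radius has to be proportional to the square root of the count; using the squared count as the radius makes the area grow with the fourth power of the count. The helper names are hypothetical.

```python
import math


def radius_for_count(count, scale=1.0):
    """Radius of a circle whose area is proportional to count."""
    return scale * math.sqrt(count)


def area(radius):
    return math.pi * radius ** 2


counts = [1, 4, 100]  # made-up rating counts

# Square-root scaling: area ratios track the count ratios.
sqrt_radii = [radius_for_count(c) for c in counts]
area_ratios = [area(r) / area(sqrt_radii[0]) for r in sqrt_radii]

# Squared counts used as radii: areas grow as count ** 4.
squared_radii = [c ** 2 for c in counts]
squared_area_ratios = [area(r) / area(squared_radii[0]) for r in squared_radii]

print(area_ratios)
print(squared_area_ratios)
```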

    I’ve also uploaded a version which scales circles by the total number of ratings by that individual, not the number for each particular value. I think it’s handy for seeing the biases in response distributions:

  14. Brandon: You’re a star.

    Please create a table with the ratings by rater ID. Then you can do a chi-squared test for the equality of proportions to see whether, God forbid, some raters had a systematic bias.

  15. The images in this post and in the comment right above yours show there are unquestionably systematic biases according to rater. That said, I created a matrix (named e) of the post-reconciliation ratings while making those images. If I understand what you’re asking me to do, all it’d take is a single line: chisq.test(e). The results:

    Pearson’s Chi-squared test

    data: e
    X-squared = 509.1092, df = 138, p-value < 2.2e-16

    That’s calculated over all 24 raters. Given how few ratings were done by some participants, it seems worth testing over subsets. Here is the same test calculated for the 12, 10 and 8 most active participants:

    data: e[c(1:12)]
    X-squared = 36.3241, df = 11, p-value = 0.0001495

    data: e[c(1:10)]
    X-squared = 28.9231, df = 9, p-value = 0.0006677

    data: e[c(1:8)]
    X-squared = 15.0169, df = 7, p-value = 0.03578
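
The call above is R. As a rough illustration of what a chi-squared test on a rater-by-category table computes, here is a Python sketch with an invented 3-by-3 table. The function name `chi_squared_independence` and the counts are assumptions for illustration only. For a table with r raters and c categories the degrees of freedom are (r-1)(c-1), which matches df = 138 for 24 raters and 7 categories. By the same formula, a table restricted to 12 raters would have 66 degrees of freedom. The subset results above report df of 11, 9 and 7, which is what a one-dimensional test on 12, 10 and 8 values would give. In R, single-index subsetting of a matrix selects individual cells rather than whole rows or columns, so the subset calls may not have selected whole raters.

```python
def chi_squared_independence(table):
    """Pearson chi-squared statistic and degrees of freedom for a contingency
    table given as a list of rows (raters), each a list of category counts.

    Assumes every row and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, df


# Invented counts: 3 raters (rows) by 3 categories (columns).
ratings = [
    [10, 20, 30],
    [20, 20, 20],
    [30, 20, 10],
]

stat, df = chi_squared_independence(ratings)

# Restricting to the first two raters means slicing whole rows.
stat_top, df_top = chi_squared_independence(ratings[:2])
print(stat, df, stat_top, df_top)
```

The sketch stops at the statistic and degrees of freedom. A p-value would come from comparing the statistic against a chi-squared distribution with that many degrees of freedom.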

  16. Looks like 97% consensus is going inverted pear shaped. You’ll soon be hated by the SSer’s even more than me. I agree with Tol, don’t bother with names. We all know who the players are, but that really isn’t important. What IS important is testing for a systemic bias, because that has teeth. I look forward to Tol interpreting the numbers you’ve provided.

  17. Thanks Brandon

    Great: As I suspected, individual raters did different things, making a mockery of the entire study.

    Would be good to repeat this for the pre-reconciliation ratings.

    There was a two-stage reconciliation process. First, the original raters discussed their ratings. If they continued to disagree, a third rater stepped in and overruled the original ratings. Who were the third raters?

  18. Richard asks ‘Who were the third raters?’

    Surely the answer is obvious.

    They are all third raters.

  19. Anthony Watts, on the issue of names, I’m not clear what could be hidden as the list of participants has been fully known for some time, and we have a fair amount of information about how many ratings various raters had done by certain points. Little new information would be exposed by not anonymizing the ids. I’d have to not disclose ids, anonymized or otherwise, at all.

    Richard Tol, repeating this for the pre-reconciliation ratings is on my list of things to do. I’m not in a rush though. The large number of 0s in those ratings makes the data set awkward to do calculations on.

    The third ratings were actually included in this analysis. They were done by the same people who did the first and second rating. The only requirement was nobody rate the same abstract twice. One thing I found interesting is John Cook made a file which showed columns for three ratings, plus a tie-break rating. I’m not sure if that means he had considered doing three rounds of rating or what. I do know there are a couple of entries in the third ratings column (explaining why some things got rated four times), but I don’t know why.

    One thing to note is different people participated more at different times. That would suggest some people had higher proportions of tie-break ratings than others (the data supports this). That’s interesting as tie-break ratings, not being subject to further review, were effectively worth more than the other ratings. Rather than looking just at total ratings, it might be worth looking at how many ratings were “accepted.”

  20. Brandon: Exactly.

    It is true that the tie-breakers had more influence.

    I had already shown that the tie-breaks were biased.

    Is the same true for the tie-breakers?

    It would also be good to compute (a) the number of initial ratings per rater, (b) the number of final ratings per rater, and (c) the ratio of b/a. This would reveal how many people were effectively rating.
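
The three quantities could be tallied along these lines. This is only a sketch under an assumed record layout: each abstract is a record listing the raters who gave the initial ratings and the raters whose rating ended up as the final one. The field names `initial_raters` and `final_raters` and the sample records are hypothetical. How a "final" rating is attributed when two raters end up agreeing is also an assumption.

```python
from collections import Counter

# Hypothetical records, one per abstract.
records = [
    {"initial_raters": [1, 2], "final_raters": [1, 2]},  # raters agreed
    {"initial_raters": [1, 3], "final_raters": [2]},     # tie-break by rater 2
    {"initial_raters": [2, 3], "final_raters": [2, 3]},  # agreed after discussion
]

# (a) number of initial ratings per rater
initial_counts = Counter(r for rec in records for r in rec["initial_raters"])

# (b) number of final ratings per rater
final_counts = Counter(r for rec in records for r in rec["final_raters"])

# (c) ratio b/a, for raters with at least one initial rating
acceptance_ratio = {
    rater: final_counts[rater] / n for rater, n in initial_counts.items()
}

print(dict(initial_counts), dict(final_counts), acceptance_ratio)
```

A ratio above 1 marks a rater whose tie-breaks add to their own surviving ratings. A rater who only did tie-breaks has no initial ratings and would need separate handling.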

  21. I went ahead and uploaded high quality versions of the images I’m using. They should make it easier to spot differences in the raters. You can find the image used in this post here. It has all circles scaled to the count of the data they reflect.

    You can find a second image here. In this one, the circles are scaled by the count of ratings given by the rater, not the ratings given for a category. That means the circles have the same size for all seven categories. I like this image because of the pattern it creates when the proportions line up. You can see what I’m talking about in the bottom right (category seven). It looks like a target with an almost perfectly lined up set of rings. It’s easy to compare that alignment to the other six to see how differently the raters responded.

  22. Brandon
    Since you have the rater ids you are best placed to calculate inter-observer concordance between all possible pairs of raters.

    For example, rater ‘1’ and rater ‘2’ did abstracts 1, 3, 4, 5, 7. Rater ‘1’ and ‘3’ did abstracts 2, 6, 8, 9, 10, and rater ‘2’ and rater ‘3’ did abstracts 11, 12, 13, 14, 15. You can estimate inter-observer agreement between pairs 1*2, 1*3 and 2*3. This would take care of all the questions you raised over application of kappa. It would be a straightforward application with no necessity to draw unorthodox conclusions from the test.
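
The pairing step can be sketched like this, assuming a mapping from abstract id to the ratings each rater gave it. The variable names and ratings are invented. The sketch reports plain percent agreement per pair; any concordance measure, kappa included, could be applied to each pair's list instead.

```python
from collections import defaultdict
from itertools import combinations

# Invented data: abstract id -> {rater id: rating}
ratings_by_abstract = {
    1: {1: 3, 2: 3},
    3: {1: 4, 2: 3},
    2: {1: 4, 3: 4},
    6: {1: 2, 3: 4},
    11: {2: 4, 3: 4},
    12: {2: 4, 3: 4},
}


def paired_ratings(by_abstract):
    """Group ratings by rater pair: (rater_a, rater_b) -> list of (rating_a, rating_b)."""
    pairs = defaultdict(list)
    for by_rater in by_abstract.values():
        for rater_a, rater_b in combinations(sorted(by_rater), 2):
            pairs[(rater_a, rater_b)].append((by_rater[rater_a], by_rater[rater_b]))
    return pairs


def agreement(pair_list):
    """Fraction of shared abstracts on which the two raters gave the same rating."""
    return sum(a == b for a, b in pair_list) / len(pair_list)


pair_agreement = {
    pair: agreement(shared)
    for pair, shared in paired_ratings(ratings_by_abstract).items()
}
print(pair_agreement)
```

Pairs that share only a handful of abstracts give noisy estimates, so the number of shared abstracts is worth reporting next to each figure.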

  23. Brandon –
    I agree that your fixed version — setting the radius of each circle proportional to the square root of the number — is more helpful to the viewer than the original. Although with that version I might well have stuck with my first guess that the ratings were segregated by year.

    And I’m not content with some unspecified cut of the zero dollars. I want a full 25%. Plus a piece of the rights to the musical play which must surely follow. 😉

  24. Naming names may be very important. The 97% number has been cited widely and it has been pumped all over the net by assorted SkS gunsels and hangers on. It has been an extraordinarily misleading and successful bit of agitprop.

    If you are now in a position to prove bias and scientific incoherence, attaching names to the deception and its subsequent promotion will hasten the day when the study and its proponents are discredited.

    This is not tiddlywinks – if you feel justified in obtaining data Cook has been unwilling to share, there is no justification for withholding parts of that data in some misguided attempt to protect the reputation of people who have collaborated in the creation of this misleading study.

  25. Shub Niggurath, while I have already started on looking at things like that, that actually wouldn’t do anything to address my concerns about the application of kappa testing. The issue I had with using that test was it was used to compare abstract ratings to paper ratings. That meant it showed differences in ratings done over different things. That’s unremarkable.

    HaroldW, I can’t just give you 25%. I have costs to cover. The best I can offer is 25% of the profits.

  26. Jay Currie, if the study actually displayed interest in protecting its participants’ privacy, I’d probably do the same. Exposing the participants of a study can cause harm. We’re not obligated to refrain from causing any harm, but I think we should consider any harm we may cause.

    In this case, I can’t see any. If we know the raters had various biases, and we know who the raters were, there’s no meaningful harm in associating certain raters with certain biases.

  27. Brandon, well done with this. I recognise a similar terrier-like grip on the issue as Steve McIntyre has on his targets. I’m also impressed with your very generous offer to HaroldW of 25% of the profits; most would be happy with the 3% left over from the consensus.
    As to the “raters” – the time has come to make these people think about who they are getting into bed with. Their actions have huge consequences, and they had better weigh up the cost of the “noble cause” a bit more accurately in future.

  28. johnbuk, you shouldn’t think too much of my generosity. Nobody knows what my expenses are, and nobody knows what I’m being paid.* That is, unless someone bought the material off me. Even then, without some sort of audit, nobody would actually be able to verify things well enough to know I was giving him 25%.

  29. It’s been 24 hours since I sent John Cook that e-mail. I’ve confirmed he is aware of this issue, so he has presumably seen my e-mail. It would seem he’s in no rush to respond, if he’s going to respond at all.

  30. Let’s say you don’t release their names.

    According to the Lewandowsky Principle, you could then write a paper claiming that the participants in this study are all half baked corn nuts who also believe (insert the conspiracy theory of your choosing). If/when they object to your paper and claim that their privacy was invaded, all you have to do is deny that you did anything unethical and say that some other random person gave you permission to do it. You then will be fully justified in whining about it for WEEKS on your blog while all of us write articles for print elsewhere painting you as a noble martyr unfairly judged and silenced by Big Green!!

  31. I know no one is asking for my opinion, and I know offering my opinion is worth even less, but I agree with Jay Currie at May 10, 2014 at 11:44 am.

    Here’s why.

    The 97% consensus thing has been used as a weapon to backhand earnest questioners off the table of public discourse. It’s the big vault door that separates the civilized from the uncivilized. It’s been the rocket that lit the daytime sky of science declared to be our future.

    So when there’s a problem with it, like with the “O” Ring on the Challenger, protecting the name of the manufacturer, or the people who declared it safe, is a criminal act.

    Lives, careers, and reputations have already been destroyed. Future generations are going to have to live with policies made by people who believe this consensus statement. The stakes on this are life and death, especially if our winters get colder.

    The public does not know how this consensus came about and it deserves to.

  32. Brandon, your concerns are just that, concerns. You could be overly concerned about things or you could understand the conclusion sought to be reached by application of kappa to ratings on supposed different things. It was and continues to be up to you.

    However, with the new data, you have an opportunity to address them. How does the Cook classification system fare when ratings/observations for the *same things* are performed by the same pairs of observers? It would be a trivially simple thing to answer this question.

    Why is kappa just as bad between volunteers rating the same thing – abstracts, as it was between volunteers and authors comparing supposedly different things – abstracts and papers? The interpretation is, and ought to be slightly different in each case.

    Assuming authors give opinion based on the full content of the paper, kappa serves to check the claim that abstracts can perform as a proxy for the full paper. When kappa is tested between volunteers, it serves as the commonly understood measure of inter-observer reproducibility.

    In both instances, the underlying point is that the validity of a classification scheme and its reproducibility have to be assessed on a one-to-one correspondence basis. They should not be assessed by adding up totals. In both instances the reason for poor concordance is the same, and this is known to all: the loosely conceived and worded nature of ‘consensus’ and the non-discriminant nature of the survey instrument used to ‘detect’ it.

    These are the totals for the ‘consensus’ for the initial 1st, 2nd ratings, reconciled 1st and 2nd ratings and final ratings:

    (141+1068+3079)/(141+1068+3079+59+29+11) = 97.74%

    (72+915+2589)/(72+915+2589+60+16+13) = 97.57%

    (15+122+1033+2924)/(15+122+1033+2924+60+24+12) = 97.71%

    (4+72+945+2729)/(4+72+945+2729+61+20+8) = 97.68%

    (64+922+2910)/(64+922+2910+54+15+9) = 98.04%

    Apparently, there is a 33% disagreement rate between the first two. This drops to 0 by the final from all the hard work. Clearly, this makes no difference to the totals! The error reconciliation process merely moves the same abstracts between each other and the waste-basket ‘4’ category. The error is right there. Just that it is not called so.
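
The percentages in this comment are endorsements divided by endorsements plus rejections, with the neutral category 4 left out. A short sketch of that arithmetic, using the counts exactly as listed in the comment; the function name and the grouping into lists are assumptions for illustration.

```python
def endorsement_share(endorse_counts, reject_counts):
    """Endorsements as a fraction of endorsements plus rejections."""
    endorse = sum(endorse_counts)
    return endorse / (endorse + sum(reject_counts))


# (endorsing counts, rejecting counts) for each of the five totals in the comment
totals = [
    ([141, 1068, 3079], [59, 29, 11]),
    ([72, 915, 2589], [60, 16, 13]),
    ([15, 122, 1033, 2924], [60, 24, 12]),
    ([4, 72, 945, 2729], [61, 20, 8]),
    ([64, 922, 2910], [54, 15, 9]),
]

shares = [round(100 * endorsement_share(e, r), 2) for e, r in totals]
print(shares)
```

Because category 4 appears in neither the numerator nor the denominator, abstracts moving into or out of it change the ratio only through the much smaller shifts they cause in the endorse and reject counts.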

  33. Shub Niggurath, remarks like this are pathetic and rude:

    Brandon, your concerns are just that, concerns. You could be overly concerned about things or you could understand the conclusion sought to be reached by application of kappa to ratings on supposed different things. It was and continues to be up to you.

    That’s an obvious false dilemma that serves no purpose but to discredit my view. The reality is a kappa score calculated between data sets which measure different things cannot prove any problem exists. At most, all it can show is different data sets show different things – a completely unremarkable conclusion.

    You do yourself a disservice when you say things like:

    Assuming authors give opinion based on the full content of the paper, kappa serves to check the claim that abstracts can perform as a proxy for the full paper

    Nobody ever claimed numerical abstract ratings could proxy numerical paper ratings. In fact, Cook et al explicitly stated they could not. Cook et al explicitly said we should expect numerical paper ratings to show a different distribution than numerical abstract ratings.

    Your test would hold it is a problem a paper might be rated a one while its abstract is only rated a two. That’s silly. A paper has more information than an abstract. It is unremarkable a rating done with greater information may be different than one done with less information.

    This is no better than Richard Tol’s earlier, peculiar argument that finding patterns in sorted data proves there is a problem with the data. Arguments like these just strengthen the Cook et al position by making their critics look like fools.

    As does misrepresenting people and being rude to them simply because you disagree with them.

  34. “Nobody ever claimed numerical abstract ratings could proxy numerical paper ratings.”

    Different from what I said. Abstract content is certainly a proxy for any paper. This is the underlying assumption of the Cook paper. The study could not have been undertaken without this assumption. The authors use papers, abstracts and literature inter-changeably in the paper.

    They draw distinctions between the two in the volunteers vs authors rating section. Differences between paper and abstract ratings are certainly to be expected. Agreements by mere chance can be expected as well. The sum concordance between the two should be greater than chance could explain, if less by an amount allowing for information discrepancy between paper and abstracts. But the kappa obtained is significantly lower. This challenges the starting assumption.

  35. Shub Niggurath, you claim that is “[d]ifferent from what [you] said,” but that is the only topic I’ve ever discussed in relation to kappa scores. If you’re not talking about anything I’ve ever talked about, I’m not sure why you’re addressing the discussion to me.
