Introduction for the Upcoming TCP Release

As you may have heard, I recently came into possession of previously undisclosed material for a 2013 paper by John Cook and others of Skeptical Science. The paper claimed to find a 97% consensus on global warming in the scientific literature.

That number was reached by having a group of people read the abstracts (summaries) of ~12,000 scientific papers and say whether each endorsed or rejected the consensus. Each abstract was rated twice, and some had a third rater come in as a tie-break. The total number of ratings was 26,848, made by 24 people. Twelve of them, combined, contributed only 873 ratings. That means the other 12 people did the remaining ~26,000 ratings (25,975, to be exact).

Cook et al. have only discussed results for the 26,848 ratings as a whole. They have never discussed results broken down by individual rater. They have, in fact, refused to share the data which would allow such a discussion. This is troubling. Biases in individual raters are always a problem when having people analyze text.

Biases can arise from differences in worldview, differences in how people understand the rating system, or any number of other things. These biases don’t mean the raters are bad people or even bad raters. It just means their ratings represent different things. If you take no steps to address that, your ratings can wind up looking like this:


This image shows the ratings broken down by individual rater for the Cook et al. paper. The columns go from zero to seven. Zero meant no rating was given. The others were given as:

1 Explicitly endorses and quantifies AGW as 50+%
2 Explicitly endorses but does not quantify or minimise
3 Implicitly endorses AGW without minimising it
4 No Position
5 Implicitly minimizes/rejects AGW
6 Explicitly minimizes/rejects AGW but does not quantify
7 Explicitly minimizes/rejects AGW as less than 50%

The circles in each column are colored according to rater. Their size indicates the number of times the rater selected that endorsement level. Their position on the y-axis represents the percentage of ratings by that rater which fell on that level.

As you can see, these circles do not line up. Some circles are higher than others, meaning those raters were more likely to pick that particular value. Some circles are lower than others, meaning those raters were less likely to pick that particular value. That shows the raters were biased. If they weren’t, the circles would have lined up.
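The comparison behind this figure is simple to sketch. Below is a minimal Python illustration with made-up ratings (the rater names and values are hypothetical, not from the study): each rater's counts are converted to percentages so raters with different activity levels can be compared on the same axis, just as the circles are.

```python
from collections import Counter

def level_percentages(values, levels=range(8)):
    """Percentage of a rater's ratings that fall on each endorsement level."""
    counts = Counter(values)
    total = len(values)
    return {lvl: 100 * counts.get(lvl, 0) / total for lvl in levels}

# Hypothetical ratings from two raters (levels 0-7, as in the paper)
ratings = {
    "rater_A": [4, 4, 3, 4, 2, 4, 3, 4, 4, 3],
    "rater_B": [3, 3, 3, 4, 3, 2, 3, 3],
}

for rater, vals in ratings.items():
    # Show nonzero percentages only, for readability
    print(rater, {lvl: p for lvl, p in level_percentages(vals).items() if p})
```

If these raters were unbiased, their percentage profiles would roughly match; here rater_A leans toward level 4 and rater_B toward level 3, which is exactly the kind of mismatch the circles in the figure show.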

Now then, the authors of the paper did take a step to try to address this issue. When two raters gave different ratings to the same abstract, they were given the opportunity to discuss the disagreement and modify their ratings. This reduced the biases present in the ratings, making the data look like this:


As you can see, the post-reconciliation data has no zero ratings. It also has fewer biases. Fewer is not none, however. The problem of bias still clearly exists, and it will necessarily affect the study’s results. The biases of raters whose circles are largest will necessarily influence the results more than those of raters whose circles are smaller.

To see why this is a problem, remember each circle’s size is dependent largely upon how active a rater was. Had different raters been more active, the larger circles would have been in different locations. That means the combined result would have been in a different location as well.

To demonstrate, I’ve created a simple image. Its layout is the same as the last figure, but it shows the data for the 12 most active raters combined (yellow). It also shows what the combined result would have been if the activity of those 12 raters had been reversed (red):


There are readily identifiable differences even in this simple test. That shows rater bias affects the final results. It’s true this particular test produced differences favoring the Cook et al. results, but that doesn’t make it okay. Bias influencing results isn’t okay, and a different test could have produced a different pattern.

Regardless, we now know the results of the Cook et al. paper are influenced by the raters’ individual biases. That’s a problem in and of itself, but it raises a larger question. All the people involved in this study belong to the same group (Skeptical Science). All of these people know each other, talk to one another and have similar overall views related to global warming.

If biases within such a homogeneous group can influence their results, what would the results have been if a different group had done the ratings? How would we know which results are right?

Update: It’s worth pointing out the paper explicitly said, “Each abstract was categorized by two independent, anonymized raters.” That would have mitigated concerns of bias if true. However, it’s difficult to see how a small group of friends can be considered “independent” of one another. That’s especially true when the group actively talked to one another (on a forum run by the lead author), even about how to rate specific papers, while the “independent” ratings were going on. This issue was first noted here, and it’s highly relevant when considering issues of bias.



  1. Reich.Eschhaus, you may think pointing out that people doing a survey of literature failed to account for bias in surveyors is “crying.” You may even think people have “impossible expectations” when they expect literature surveys to account for such bias. If so, you’re wrong.

    The problem discussed in this post is a widely known problem. Its solutions are widely known as well. People who do work like this professionally know how to account for it. One of the simplest steps to take is to train your raters. You tell them what’s expected of them, then you give them examples to test whether they understood what was wanted. You repeat the process until all of your raters are on the same page. It’s simple, easy to do, and it makes your results far better. Yet Cook et al didn’t even bother to try to train their raters.

    There are many other things one can do to address this problem. Professionals do them all the time. Research students are taught them as part of their courses. Test graders deal with them every year.

    I can’t imagine how you’d think it’s having “impossible expectations” to expect people publishing in a scientific journal to take basic steps to ensure the quality of their data. You’re practically calling Cook et al incompetent. That’s the only reason they wouldn’t be able to do something so simple.

  2. I was re-visiting some old material in preparation for another post when I came across something rather interesting. The issue of bias was brought up in the Skeptical Science forums (in a topic titled A question of bias). In it, John Cook said:

    Once all papers have been rated twice, I’m going to do an analysis calculating the average of all SkS raters – will be interesting to see who is the most “warmist”. Not sure what that will tell us but hey, it’s an opportunity for an interesting graph so why not?

    In other words, Cook planned to do the same analysis I did in this post. That means he was fully aware of the issue, and he knew how to look for it. Despite that, he never said anything about it in public.

  3. Brandon, I don’t know whether Cook ever got back to you or not, but I think it would be appropriate to anonymize the evaluators, in the event you release the still-incomplete data set that you have.

  4. Carrick, he did respond to me. His response was… weird. About the only thing of value he said is something I pointed out in my e-mail to him. Namely, even if I anonymized the rater IDs, it wouldn’t protect the raters’ identities. Enough information is available to identify a number of those raters even if their IDs are anonymized.

    On a related note, the identities of the raters are all known. I struggle to see how any harm would be caused by associating names with a particular set of ratings. That, combined with the fact anonymizing the data would not be very effective, makes me question the value of doing it.

    More to the point, I’d say Cook’s response to me actually frees me from any obligation to anonymize the data. I provided him the opportunity to request I take steps to protect people’s identities. He chose not to. You can’t (sensibly) complain about confidentiality being broken if you chose not to try to protect that confidentiality.

    A similar note goes for other participants in the study. I know at least several of them are aware of my offer to address any privacy concerns they may have. None have made any effort to take me up on it.

    I find it all very confusing. At this point, I’m showing a greater concern for the study participants’ privacy than the people who ran the study.

    Speaking of which, I’m going to follow-up with Cook one last time (he ignored me after his first e-mail). If he continues ignoring the offer, there’s no room for him to assert confidentiality later on.

  5. Brandon, I’m not sure what value is served by anonymizing the evaluators in any case. One reason you would normally do this is to prevent them from influencing each other.

    However, in this case, they know each other, and clearly did influence the ratings in a number of the papers that were discussed.

    Did Cook give an argument for why he wanted to protect the identity of the evaluators?

    Also, do you know if there is an IRB approval associated with this?

    If there is not, there are no ethical issues that I can think of with releasing the reviewers’ names.

  6. Nope and nope. John Cook didn’t say anything to indicate I should ensure confidentiality. As for an IRB approval, I’m pretty sure there wasn’t one. Cook never referenced one when designing the project in the forum, and I’d imagine he’d have needed to reference one, if it existed, to ensure he followed its guidelines. The only way I can see it making sense that he got one is if he somehow got it retroactively (which I’m not sure is even possible).

    By the way, you should check out my newest post. It’s fairly relevant to your exchange with Dana Nuccitelli.

  7. The other thing that occurred to me is that the breach of anonymity occurred due to additional information made available by SkS. Releasing the data by itself, with anonymized IDs, could not possibly breach anonymity.

  8. It’s not supposed to be possible to get retroactive IRB approval, but I think it does happen sometimes.

  9. I’m a little sympathetic on that point because it’s not John Cook’s fault he got hacked. Well, his stupid security holes are his fault, but at least he could argue he tried to ensure the data’s confidentiality.

    Of course, he should never have provided information to the raters about each others’ progress in the first place.

  10. I’m not sure if it matters, but John Cook did mention wanting to get advice from a university’s legal department. That suggests they may have given some sort of approval for the project, meaning there ought to be some sort of ethical review agreement.

  11. Notwithstanding the unwitting global release of data via the hack, if you discussed your role with individuals who weren’t bound by confidentiality agreements, as was done here, I’d say you have a very weak case to expect confidentiality at that point.

    Having been involved in numerous cases where data confidentiality had to be maintained (including HIPAA), I can guarantee we didn’t have a group chat discussing the confidential information involving third parties, even third parties we thought we could trust.

    In fact, there was generally an “is it okay to discuss this with XXX present?” (“yes, our contract covers all employees….”)

    If you scraped the data using a tool he provided, though, Cook’s pretty much screwed.

  12. I’m curious. If John Cook did get IRB approval for this, how would one go about getting a copy of it? Can you just ask for it? If so, would you ask the University of Queensland (who he lists on the paper as his affiliation), or could he have gotten it somewhere else?

  13. @Carrick
    I’ve been asking for the rater IDs for almost a year now. The standard answer is that it would be illegal (their words) to release even IDs because the various leaks mean that we now know (a) the names of the raters and (b) how many abstracts they rated. Putting the three together is easy, and would apparently violate the terms under which they agreed to do the rating.

    There is an interesting twist to this: Lax security is as much a breach as is explicit release.

    Ethics approvals can be requested from the relevant committee. I guess that is U Queensland (Cook’s employer), but it may be U Western Australia (where Cook’s doing his PhD).

  14. Richard Tol:

    I’ve been asking for the rater IDs for almost a year now. The standard answer is that it would be illegal (their words) to release even IDs because the various leaks mean that we now know (a) the names of the raters and (b) how many abstracts they rated. Putting the three together is easy, and would apparently violate the terms under which they agreed to do the rating.

    Stranger and stranger.

    Cook and his institute could easily have entered into a confidentiality agreement with Richard Tol that allowed Richard access to Cook’s data in such a way as to protect the identity of the evaluators (and the authors of the study).

    There is an interesting twist to this: Lax security is as much a breach as is explicit release.

    Very much so.

    But I would like to see a copy of the consent form that the evaluators signed. I believe Cook is obligated to produce this.

  15. Congratulations on the work you’ve put into this. I didn’t comment on the first article on the subject because I didn’t understand the graphic, even after your explanation. Now I do, and I’m sorry to say I don’t think it supports your conclusion about rater bias. Or rather, I think the faults you have identified in this ghastly paper have been wrongly described, which leaves you open to all sorts of misleading and irrelevant criticisms.

    You say:
    “Some circles are lower than others, meaning those raters were less likely to pick that particular value. That shows the raters were biased. If they weren’t, the circles would have lined up.”
    The question of bias can only arise if there is one and only one correct “score” for each abstract. But the whole structure of the paper is based on the premiss that this is a subjective evaluation exercise, with evaluations subject to correction on consultation.
    Cook seems to hesitate between treating his definitions as points on a Likert scale, and as discrete “bins” which are logically exclusive and exhaustive. They are neither.
    Look at his definitions. The words “implicitly”, “explicitly” and “minimise” can only be interpreted subjectively. Did any abstract explicitly say “This paper explicitly/implicitly minimises/doesn’t minimise anthropogenic global warming”? Of course not. Only the raters said that, and they have the right to disagree.
    Furthermore, the definitions are not exhaustive. For instance, there is no bin for those who endorse AGW but minimise it, which is the most common sceptic position. Definitions 2 and 3 accept the possibility of endorsing AGW while minimising it, while definitions 5-7 seem to treat “minimise” and “reject” as synonymous, and so on. It’s neither a Likert scale nor a series of discrete and exhaustive bins, but a splodgy Venn diagram full of holes. So the fact that your circles don’t overlap perfectly is simply a function of Cook’s incompetent definition of what he’s trying to do.
    Cook, like Lewandowsky, is simply incapable of conducting a simple survey or analysis of data. As with Lewandowsky’s Moon Hoax paper, the problems of bungled conception and methodology are logically prior to any statistical errors. “97% of scientists say..” should read: “24 friends of the author think 97% of scientists say..”.

    Please publish the names of the 24. Then we can get down to some serious ridiculing.

  16. @geoffchambers

    It’s neither a Likert scale nor a series of discrete and exhaustive bins, but a splodgy Venn diagram full of holes.

    I think that is an excellent description, better than I can say. I totally agree, but my “Before any fancy stats analysis look at the questions; they are totally rubbish” doesn’t cut it. 🙂

  17. Brandon,

    What happens if you decompose the above graph(s) by publication year/period? In other words, could you produce (say) four copies of the graph(s), each covering a different five-year period: 1991-1995; 1996-2000, 2001-2005, 2006-2010?

    Depending on how the abstracts were randomly assigned, the observed differences (especially for the less active raters who might not have worked through the full sample) may in large part reflect evolving consensus over time, rather than any kind of systematic biases between raters.

  18. @Carrick
    I indeed repeatedly offered to sign just such an agreement.

    I also offered to write the code in R or Stata or Matlab for them to run so that I could see the test results without seeing the data.

  19. Richard Tol:

    I indeed repeatedly offered to sign just such an agreement.

    In point of fact, Cook has an ethical obligation to cooperate with you on this, when it does not conflict with his ethical obligations to his evaluators.

    Leaving aside that he has never established that an actual obligation was undertaken between him and his evaluators (let him produce a blank version of the consent form if he wants to rigorously establish this), the fact that he could have protected them and allowed his research to be evaluated independently, but failed, speaks both poorly of him and his thesis advisor.

  20. geoffchambers, I’m afraid you’re wrong on this point:

    The question of bias can only arise if there is one and only one correct “score” for each abstract.

    Bias doesn’t require the answer chosen be incorrect. You can be biased toward a particular answer even if that answer is legitimate. I’d wager most biases lead people to pick justifiable options.

    That said, I agree with your point:

    Furthermore, the definitions are not exhaustive. For instance, there is no bin for those who endorse AGW but minimise it, which is the most common sceptic position.

    I discussed this back when the paper was new. I wrote about the overlap in categories a number of times, showing even the strongest “Reject AGW” category can overlap with explicit endorsement of AGW. I even showed Cook et al were aware of how vague their definitions were, having discussed it in their forum. The approach they used was, amusingly, described as “Ari’s P0rno approach,” a reference to the famous quote by Justice Potter Stewart.

    Cook seems to hesitate between treating his definitions as points on a Likert scale, and as discrete “bins” which are logically exclusive and exhaustive. They are neither.

    Yup. I think they made the rating scale symmetrical because people are so used to thinking of scales that way. The problem is a Likert scale is one-dimensional, but they were looking at a question with two dimensions. You can’t measure quantification and endorsement levels with a single, symmetrical scale. I discussed this back when the paper was new too. I even showed how you could make a real rating system for this issue.

    That was quite a few links, all to things I’ve written. I kind of feel like I’m just promoting myself… on my own blog.

  21. @Carrick
    Not just Cook and Lew.

    I appealed the decision with Ove Hoegh-Guldberg, Cook’s boss, and Peter Hoj, vice-chancellor of U Queensland; and with Dan Kammen, editor of ERL and Peter Knight, publisher of ERL.

  22. Grant McDermott, we’re told the abstracts were randomly assigned to raters, so they should have an unbiased distribution with regard to publication date. I’ve seen nothing to make me question that so far, so I have little motivation to spend much time on it.

    That said, since I haven’t released the data, I understand people can’t do tests for themselves. I’m happy to do them. The problem is separating the data based on various things like publication year takes a bit of work because that data wasn’t stored with the ratings. I’d have to combine data from multiple files. That’s not a problem, but there are a couple steps between reading the data and generating the images which I haven’t automated yet. I have to do a number of things every time I change the data input.

    It’d be easier on me if I could know all the tests people would like me to run. That’d let me prepare a single data set that can work for all of them. Are there any other things you’d like me to check? How about anyone else?

  23. By the way, I should point out John Cook said he wanted to get legal advice about this issue, but he didn’t ask me to refrain from releasing the data. I told him if he wanted me to wait while he sought legal advice, he should say so. He ignored me. I waited a day and reminded him he had still not requested I refrain from releasing the data. He still didn’t make that request. I then sent him another e-mail specifically telling him if he wanted me to refrain from releasing the data while he sought legal advice, he needed to request I do so. (He finally did.)

    I find that mind-boggling. I can’t understand why I had to badger him into requesting I not release data he claims is confidential. If I hadn’t, I could have released the data and he’d have been culpable right along with me. Similarly, nobody else involved in the study has tried to request I not release the data even though most (all?) are aware I have the material. How in the world do people justify complaining confidential data might be released if they refuse to make the slightest effort to ensure it doesn’t get released?

    I am the only person in this situation actually exhibiting concern for people’s privacy rights. How screwed up is that?

  24. It seems to me that Brandon could release the parts of the data that are interesting without needing to release information that could de-anonymize the evaluators.

    Perhaps Richard could provide Brandon with the R-scripts of things he’d like to examine?

  25. I’m thinking in terms of this comment “I also offered to write the code in R or Stata or Matlab for them to run so that I could see the test results without seeing the data.”

  26. Richard Tol, I’m assuming you want those for both pre-reconciliation and post-reconciliation. Is that right? A second question: there are lots of tests I could use for inter-rater agreement. Which would you like me to use?

    Carrick, the only part I’ve noticed people focus on has been rater IDs. Are there other parts you’re thinking would be interesting?

    As for code, the requests so far are very simple, needing only a line or two of code. The only real obstacle to any of them is getting the data in the right format. On that topic, I have a question. Does anyone know how to make a list of table results in R that’s easy to work with? I’ve been using the table() command to extract results for each rater.

    The problem is not all raters selected all values. That means the tables are of different lengths. So far I’ve manually inserted the 0s for the empty results, but I know there are better ways.
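For what it’s worth, the usual R fix here is to coerce the ratings to a factor with a fixed set of levels before tabulating, i.e. table(factor(x, levels = 0:7)), so levels a rater never used tabulate as zero. The same idea in Python, with hypothetical data:

```python
from collections import Counter

LEVELS = range(8)  # endorsement levels 0-7, as in the paper

def fixed_length_counts(ratings, levels=LEVELS):
    """Count ratings per level, inserting 0 for levels a rater never used."""
    counts = Counter(ratings)
    return [counts.get(lvl, 0) for lvl in levels]

# A rater who never used levels 0, 5, 6, or 7 still gets an 8-slot row
print(fixed_length_counts([2, 3, 3, 4, 4, 4, 1]))  # [0, 1, 1, 2, 3, 0, 0, 0]
```

Because every rater’s counts come back the same length, the tables can then be stacked or compared without manual padding.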

  27. I had heard good things about those before, so I decided to try plyr when you mentioned it. While working out its syntax, I must have made a mistake, because suddenly every spare ounce of CPU was sucked up by R. I couldn’t interrupt the program, not even to switch to a different application or bring up a console to kill it. It was only after R consumed over three gigs of RAM and the command crashed that I could do anything. Even then, R was holding so much memory I could barely do anything. I had to kill R, and that was without saving a workspace or anything. The only thing I was able to save were my scripts for making those images.

    I’m annoyed about that. Fortunately, I tried the data.table package next, and it works great. I should be able to produce the results much more easily with it. I just need to figure out some details about syntax and I should be back to where I was before.

  28. Cohen’s kappa is in library ‘irr’.

    Code I used for calculating kappa (or any statistic/function) on every nth block of data and capture output in a text file:

    library(irr)  # provides kappa2()

    # Start/end row indices for consecutive blocks of 30 abstracts.
    # (Note: seq(1,11944,by=30) and seq(30,11944,by=30) have different
    # lengths, so build matching vectors rather than cbind-ing them.)
    starts <- seq(1, 11944, by=30)
    ends <- pmin(starts + 29, 11944)
    blocks <- cbind(starts, ends)

    # kappa2() expects a two-column matrix of ratings (one column per rater)
    for (i in 1:nrow(blocks))
    capture.output(kappa2(cook_data[blocks[i,1]:blocks[i,2],]), file="output.txt", append=TRUE)

    This works the same way as ‘zoo’ rollapply does on single time-series.

  29. After spending a little time learning about some code options in R, I’ve found I can generate the data for Tol’s first and third requests with a line or two of code each. A request like Grant McDermott’s takes only a few more. R doesn’t make its I/O options that easy to learn, but once you learn them, it’s amazing how easy things are.

    That said, I’m going to hold off on releasing some of the results due to the amount of rater ID information exposed by them. I’ll happily discuss the results of the tests in general terms, though. Before I do, I want to highlight an important point which I’m not sure anyone has discussed before. Namely, the people doing this study did not rate random abstracts.

    The reason I say this is the web application used to submit ratings gave each rater five abstracts at a time, but they were not required to rate all five. They could rate anywhere from none to all. I noted this point while reading the Skeptical Science forum material, but the data proves it conclusively.

    It may seem like nitpicking to point out a rater skipped rating an abstract because the abstract confused him, but this is a hugely important point. It means the raters did not rate random abstracts. They rated a self-selected sample of the abstracts. They could use whatever criteria they wanted when not submitting a rating of an abstract.

    Taken to extremes, one could have gone through all the abstracts and rated only the subset they wanted to rate. A person could have gone through and rated only papers which started with the letter “A.” A person could have gone through and rated only those papers they felt rejected global warming.

    Even if we leave aside the fact this gave raters the opportunity for malicious abuse, this is a serious problem as it introduces an impossible-to-quantify source of rater bias.

  30. Brandon, yes this does seem to be another problem for the survey. There is a lot here to wrap my head around at one time, especially when I have my own program to tend to (grant writing time).

    I’m thinking you can probably solve John Cook’s dilemma for him, by you simply not releasing the data to anybody, except with a nondisclosure agreement with that person, that protects the anonymity of the evaluators.

  31. Carrick, I could, but why on Earth would I? I’m already bending over backwards for them. I never had any obligation to keep this material private. Not only am I free of all legal concerns, I’m free of all ethical concerns. No journalist or lawyer in my position would doubt for a moment they were free to release this stuff. That’s why you won’t see anyone do more than whine and moan. That’s why not a single person involved in this study contacted me. That’s why not a single person has come here and asked me not to release the material. That’s why I had to badger John Cook into even trying to get me not to release the material (albeit temporarily).

    They’re happy to repeatedly accuse me of criminal actions without the slightest evidence, but that’s all they’ll do. They have no case, and they won’t do anything other than whine.

    On the topic of it being hard to keep up, I noticed you seem to have misunderstood what I said in my newest post. Over at Anders’s blog, you said:

    Actually no, it was publicly available via the tcp interface provided by Cook, as Brandon has explained.

    That’s not correct. There was data available via that interface that was not available elsewhere: data relating to 521 papers excluded from Cook et al’s data files. It’s not related to the data I found. In fact, the data I found has the same 521 papers excluded from it.

  32. Anyway, back on the issue of datestamps. I think one thing everyone would be curious to know is that the largest number of ratings done in one day was 283. I think that’s a remarkable number. Also remarkable: the day before, the same rater did 238 ratings, and the day after, he did 244. That’s three of the four largest values in the data set.

    The first rating took place on February 19th, the last on June 1st. During that ~100 day period, ratings were done on only 76 days. Five or fewer ratings were done on nine of those days.

    In total, there were 500 rater-day values (ratings done by one rater in one day). Six exceed 200; 75 exceed 100 (and two more are exactly 100). Nearly 40% (199) come in at 50 or more.

    Related to my above comment describing how raters were presented five abstracts at a time, but could choose not to rate any abstract they were presented, it seems worth looking at how many rater-day values were multiples of five. The answer? 128. There may be non-remarkable reasons why some rating sets wouldn’t be done in groups of five, but it’s interesting to see things like 38 rater-day values lower than five.
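The multiple-of-five check is easy to reproduce. A Python sketch with invented rater-day totals (the real 500 values aren’t reproduced here):

```python
# Hypothetical daily totals of ratings per rater (rater-day values)
rater_day_totals = [283, 238, 244, 100, 35, 3, 12, 50, 5, 40, 17]

# Raters were served abstracts five at a time, so a daily total that is
# not a multiple of five implies at least one served abstract went unrated.
multiples_of_five = [n for n in rater_day_totals if n % 5 == 0]
under_five = [n for n in rater_day_totals if n < 5]

print(len(multiples_of_five), len(under_five))  # 5 1
```

Run against the actual rater-day totals, the same two counts give the 128 and 38 figures quoted above.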

    I’ll do the Cohen’s kappa/Krippendorff’s alpha tests next. I can publish their entire outputs. And of course, if you have any specific questions about the data I’ve discussed, feel free to ask. As long as it’s not revealing rater ID information, I’ll answer.

  33. @Brandon
    I had not realized there was self-selection.

    If the data reveal which abstract was offered to whom but not rated, then it should be possible to use a Heckman model to correct for selection bias.

  34. Brandon:
    So, there are 500 rater-days, 128 with #ratings that are a multiple of five, and 372 with demonstrable self-selection?

    You can use the rater-days with 5 or fewer ratings to estimate the prevalence of self-selection.

  35. That’s right, unless someone can think of a different reason why raters presented with abstracts in groups of five would wind up submitting a group of ratings that wasn’t a multiple of five. I can’t think of any other reason save when there weren’t enough abstracts left to give a full set. That’d only come up a couple of times, though. Anyway, a quick breakdown of the lower range:

    Fewer than five: 38
    Five: 13
    Six to nine: 34
    Ten: 15
    Eleven to fourteen: 26
    Fifteen: 12

    Anyway, onto the topic of kappa scores. There are 24 raters. There are pre- and post-reconciliation ratings, plus final ratings. That gives a huge number of comparisons we can make. It’s huge even if we only take the top 12 raters. Do you have preferred pairings you’d like calculated?

    Here are kappa scores for some of the raters’ post-reconciliation ratings and the final, official ratings (taken only from the 12 most active raters):


    Those tend to be worse than the scores calculated between individual raters. I haven’t seen a score between individual raters below .7 yet. The average seems to be somewhere around .85-.9.

    My impression so far is agreement between pairs of individuals after reconciliation is greater than the average agreement. Additionally, some raters’ (post-reconciliation) ratings are significantly more accurate predictors of the final ratings than those of others.

    I want to make some modifications to my code so I can test pre-reconciliation ratings as well. Also, I want to see if I can write a wrapper for the kappa function that will fix table dimensions for me. If I manage both, I’ll be able to run all possible comparisons with very little effort.

    The cool thing is the code is looking to be fairly flexible, meaning I should be able to replace the Cohen’s Kappa test with other tests with very little effort. I’m not great with R, but I’m getting better at it.

  36. Richard Tol, I assure you, I know how loops work. They’re of little use if you have to make manual tweaks while your code runs. That’s why I pointed out I want to make a wrapper for the Kappa function. Until I do that, I have to manually intervene whenever there are unequal table dimensions.

    Incidentally, the pseudocode you posted wouldn’t work. Because ':' binds more tightly than '+' in R, (i+1:24) evaluates as i + (1:24), generating values from 2 up to 48 across the loop. What you’d actually want is the range 1:24, sans i. To achieve that with R’s syntax, I’d use:

    for (j in c(1:24)[-i])

  37. I suppose a simpler version of the above would be: for (j in (i+1):24), with i only running to 23 so the sequence never runs backwards. My instinct is to use the other since it makes neater tables, but you can always mirror the table generated via this code if you want that. You’d save computational time.

    Of course, if you’re worried about saving computational time, you probably shouldn’t use loops in the first place.
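    The precedence point is easy to verify at an R prompt. A minimal check, assuming 24 raters as in the thread:

```r
i <- 1

# ':' binds more tightly than '+', so i+1:24 means i + (1:24), not (i+1):24.
tol_version  <- i + 1:24     # 2, 3, ..., 25 for i = 1; runs up to 48 at i = 24
triangle     <- (i + 1):24   # upper-triangle pairings; run i over 1:23 only
all_but_self <- c(1:24)[-i]  # every j except i, as in the comment above
```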

  38. The ‘reconciliation’ is a sham. Initial ratings carry a total 33% disagreement rate. The final disagreement rate is 0%. The ‘consensus’ is ~97% in both. Which means the reconciliations simply, and in effect arbitrarily, switch abstracts between categories with no net effect. E.g., rater A moves 10 of his abstracts from a previous ‘3’ to ‘4’ while rater B moves 10 from ‘4’ to ‘3’.

    ‘Reconciliation’ is arbitrary in that rationalizations (rather than unbiased application of set criteria) are used by raters given the knowledge that another volunteer disagreed with them. They are not independent ratings. Additionally, volunteers are actively trying to minimize discordance. Strictly speaking, kappa is not useful/applicable in such circumstances. It may yet serve as a measure of discordance.

    The initial ratings on the other hand reflect the process of naive application of classification criteria.

    If you are comparing the same raters’ pre- and post-reconciliation agreement scores, you will definitely get high kappa scores if they tended to give 3s and 4s.

    This graph shows the progression you found. It shows why there would be little difference between column 4 and 5.

    The graph also makes evident that though there is a high disagreement rate between the first and last columns, it makes no difference to the sum totals of the ratings as calculated by Cook. In fact, you could have 100% disagreement and still have a 97% consensus. This is the point I tried to illustrate with the bird example.

  39. I.e., rating criterion 1 is 50% or greater AGW;

    therefore category 2 (endorse) must be less than 50%, which equals 6 and 7. Way too open to interpretation.

    But we know the purpose of the paper (per the leaked Consensus Project forum at SkS) was to generate a 97% headline/soundbite and spread the PR.
    Nothing more. And it has succeeded.

  40. Shub Niggurath, I’m using the Kappa command from the vcd package. It returns a few coefficients, from which it is easy to grab any you want.

    Richard Tol, I did. The problem was different sets of raters would have different table dimensions. One comparison could result in a 7×7 table while another resulted in a 2×3 table. Kappa scores are calculated on the diagonal, so the tables need to be square with matching dimensions. Before, I was manually reshaping the tables to fix their dimensions.

    It wouldn’t have been hard to fix if not for R having some odd quirks. The solution I went with was to generate a matrix filled with zeros that could hold the table. I then inserted the table’s values into that matrix. There was a bit of a catch in that some ratings are zero, and R doesn’t let you have a zero index.

    There’s probably a better solution. Problems like this in R usually have elegant solutions. I just don’t care to spend more time trying to find one. This may not be ideal, but it’ll work for everything I need to do. I’ll post results in the next comment.
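    For what it’s worth, one tidier route (a sketch of an alternative, not the zero-filled-matrix code described above) is to coerce both rating vectors to factors sharing the full 0-7 level set. table() then always returns an 8×8 table, with absent levels and the zero rating included:

```r
# Force a fixed 8x8 cross-table regardless of which rating levels appear.
# Levels 0-7 match the rating scale; factor() sidesteps R's lack of a
# zero index because levels are labels, not positions.
pad_table <- function(a, b, levels = 0:7) {
  table(factor(a, levels = levels), factor(b, levels = levels))
}

rater1 <- c(3, 4, 4, 2, 7)
rater2 <- c(3, 4, 3, 2, 6)
dim(pad_table(rater1, rater2))  # 8 x 8, even though only a few levels occur
```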

    Barry Woods, sort of. Category 2 doesn’t actually rule out a contribution under 50%. That means it overlaps with the lower categories, but it isn’t exactly equal to them. A more direct point is they never defined their “consensus.” I highlighted this point in a recent post because Tom Curtis claimed their consensus was obviously the strong consensus (humans are causing most global warming).

    But I think this older post of mine does a better job of highlighting the problem.

  41. Alright then. Time to post an actual list of Kappa scores. First up all the scores calculated between individual raters’ post-reconciliation ratings and the final ratings:

    [1,] 0.9244706
    [2,] 0.7998332
    [3,] 0.9105172
    [4,] 0.7422780
    [5,] 0.9162364
    [6,] 0.5961084
    [7,] 0.8649784
    [8,] 0.8838480
    [9,] 0.8865097
    [10,] 0.8136698
    [11,] 0.8806458
    [12,] 0.8511571
    [13,] 0.4316598
    [14,] 0.9044259
    [15,] 0.2749446
    [16,] 1.0000000
    [17,] 0.8626374
    [18,] 0.6842105
    [19,] 0.5329670
    [20,] 0.1369863
    [21,] 0.1428571

    A few are missing because they couldn’t be calculated. The rest are in descending order of rater activity. That means you should probably focus on the top half. Later on, I can extract the margin of error for these values if people want.

    Next up are the scores calculated between individual raters’ post-reconciliation ratings. A lot of these couldn’t be calculated, and others were either 1 or 0 due to having minuscule sample sizes. I filtered them out to save space in this comment. You’ll note there are a fair number of duplicate values in these results. That isn’t from pairings being repeated. It’s from small sample sizes leading to the same distribution:

    [1] 0.8996785 0.6937574 0.8070022 0.9563253 0.5911602 0.9262735 0.9683215
    [8] 0.6945455 0.6363636 0.6937574 0.8070022 0.9563253 0.5911602 0.9262735
    [15] 0.9683215 0.6945455 0.6363636 0.8070022 0.9563253 0.5911602 0.9262735
    [22] 0.9683215 0.6945455 0.6363636 0.9563253 0.5911602 0.9262735 0.9683215
    [29] 0.6945455 0.6363636 0.5911602 0.9262735 0.9683215 0.6945455 0.6363636
    [36] 0.9262735 0.9683215 0.6945455 0.6363636 0.9683215 0.6945455 0.6363636
    [43] 0.6945455 0.6363636 0.6945455 0.6363636 0.6945455 0.6363636 0.6945455
    [50] 0.6363636 0.6363636 0.6363636 0.6363636 0.6363636

    These calculations should be more interesting when I do them with the pre-reconciliation ratings. I’ll try to do that later today. I don’t need to do much to alter my code to do it, but the sun will be up soon, and I’d like to get at least a little sleep.

  42. I know. 16.

    13, 15, 20, 21 proved completely useless to the cause. 😉

    The real test? Percent agreement and kappa between pairs of highly active raters, computed over the abstracts each pair rated in common.

    Kappa scores are susceptible to low-prevalence data points. I.e., if there are only a few ‘6’ or ‘7’ ratings, they can artificially lower kappa for the whole calculation set. Kappa ideally should be re-checked with the low-prevalence ratings (‘1’, ‘6’ and ‘7’) taken out.
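    Shub’s prevalence point can be demonstrated with a hand-rolled Cohen’s kappa (a sketch; the thread’s actual calculations used vcd’s Kappa). Two 2×2 tables with identical 90% raw agreement give very different kappas once the marginals are skewed toward one category:

```r
# Plain (unweighted) Cohen's kappa from a square contingency table.
cohen_kappa <- function(tab) {
  n   <- sum(tab)
  p_o <- sum(diag(tab)) / n                      # observed agreement
  p_e <- sum(rowSums(tab) * colSums(tab)) / n^2  # agreement expected by chance
  (p_o - p_e) / (1 - p_e)
}

balanced <- matrix(c(45, 5, 5, 45), nrow = 2)  # 90% agreement, even marginals
skewed   <- matrix(c(85, 5, 5, 5),  nrow = 2)  # 90% agreement, rare category

cohen_kappa(balanced)  # 0.80
cohen_kappa(skewed)    # ~0.44 -- same raw agreement, much lower kappa
```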

  43. Brandon:

    Carrick, I could, but why on Earth would I? I’m already bending over backwards for them.

    If you judged it to be the more ethical thing to do, then you would. It’s certainly the case that Cook repeatedly failed to protect his evaluators. (Whether they even wanted “protecting” has never been established.)

    I don’t have a clear idea about what the best approach is, but I was pointing out that it is an option here. Given that none of the evaluators have requested that their anonymity be protected, you can do as you want.

    As an aside, the emoticons on wordpress are really creepy.

  44. Carrick, I’ve suggested to John Cook I would need to know what assurances of confidentiality were given. He didn’t respond on the issue. I suspect there were no formal assurances of confidentiality for the raters themselves. I’m not sure about the authors they e-mailed. I think somebody actually posted a copy of such an e-mail once, but I couldn’t find it via Google.

    I don’t really care about the unfiltered self-ratings file I have. The only difference is it contains six entries which are identifiable. There’s really nothing to be learned there. I’d just like to resolve the individual raters issue because there’s a specific bit of information I’d like to release that hasn’t been discussed anywhere.

    As for the emoticons, you are so right. One creeps me out so much I’ve considered banning it.

  45. Apparently I was more exhausted than I realized last night. I had thought the second output in my last comment seemed wrong when I got it, but I went through the process and everything seemed fine. I just sat down to run the same test for pre-reconciliation ratings, and I immediately realized the results didn’t make sense. It only took me a minute to realize I had mis-nested one of my for loops. Because of that, I didn’t re-initialize variables properly, and it messed up the results. I think someone should have questioned it. The results are just way too similar. I guess that’s one of the downsides to not having your data/code available.

    Anyway, here is the corrected version:

    [1] 0.76258290 0.91835465 0.73711404 0.88796515 0.49949602 0.82531831
    [7] 0.95819209 0.86409257 0.78455833 0.80130351 0.81205849 0.21933086
    [13] 0.64757709 0.03076923 0.28571429 0.69257778 0.52356787 0.62859074
    [19] 0.30389492 0.58684325 0.64892005 0.53451883 0.49338228 0.70085778
    [25] 0.49154614 0.14691943 0.44711538 0.23076923 0.40000000 0.80106885
    [31] 0.85368104 0.59932518 0.77759309 0.90662387 0.86040027 0.76722117
    [37] 0.79332850 0.67230169 0.90731707 0.15294118 0.69402514 0.23777238
    [43] 0.66292384 0.68453189 0.65818627 0.71813342 0.53255238 0.76706827
    [49] 0.46666667 0.75308642 0.24615385 0.25000000 0.44705600 0.76456389
    [55] 0.89539422 0.86036671 0.74965706 0.74298142 0.53230337 0.44444444
    [61] 0.31428571 0.52127660 0.70381451 0.64473684 0.77091856 0.53131749
    [67] 0.59504685 0.27272727 0.01886792 0.66448802 0.77044381 0.77933188
    [73] 0.76482787 0.90364334 0.65384615 0.72727273 0.47826087 0.89822668
    [79] 0.87240664 0.77358491 0.90769231 0.33333333 0.77159703 0.82430279
    [85] 0.90243902 0.40000000 0.36842105 0.75772259 0.86516854 0.69696970
    [91] 0.62765957 0.28000000 0.20000000

    The first set of results were accurate, but I’ll re-post them for simplicity:

    [1] 0.9244706 0.7998332 0.9105172 0.7422780 0.9162364 0.5961084 0.8649784
    [8] 0.8838480 0.8865097 0.8136698 0.8806458 0.8511571 0.4316598 0.9044259
    [15] 0.2749446 0.8626374 0.6842105 0.5329670 0.1369863 0.1428571

    Here is the same test done with pre-reconciliation ratings:

    [1] 0.5595880 0.7159408 0.5624254 0.7106636 0.6846306 0.5961084 0.7709593
    [8] 0.4950931 0.5740323 0.4392028 0.6985237 0.6382718 0.1856066 0.3114465
    [15] 0.2749446 0.5290424 0.6842105 0.5329670 0.1369863 0.1428571

    I find it interesting the Kappa score for the first rater’s pre-reconciliation/final ratings comparison was .56. If you look at the scores above, you’ll see the Kappa score for his post-reconciliation/final ratings was .92. Similarly, the third rater went from .56 to .91. Those changes stand in stark contrast to the second rater’s Kappa scores (which went from .72 to .80) and the fourth rater’s (which went from .71 to .74).

    I should point out that somewhere around the fifteenth entry those two lists stop being aligned, because of where results got filtered out as impossible to calculate. I can put correct indexes in there later if necessary. I haven’t bothered yet because there’s so little data once you get past the first twelve or so raters. Anyway, here are inter-rater Kappa scores, pre-reconciliation:

    [1] 0.29656397 0.46742286 0.41455832 0.57655995 0.28752908 0.52254387
    [7] 0.65056118 0.40566356 0.32092580 0.51704059 0.43495809 0.31596091
    [13] 0.10034602 0.37602778 0.42169741 0.38702392 0.21732322 0.42922639
    [19] 0.31000528 0.20151829 0.13081470 0.43709732 0.32878481 0.04424779
    [25] 0.29411765 0.23076923 0.40000000 0.55903664 0.50863431 0.32168052
    [31] 0.44112610 0.36523680 0.40441421 0.15998242 0.40677966 0.20426829
    [37] 0.13278008 0.48148148 0.45993651 0.23945879 0.50574372 0.41887971
    [43] 0.30997805 0.22650321 0.35871965 0.41883768 0.18181818 0.25170068
    [49] 0.24615385 0.25000000 0.35180442 0.43950178 0.55318439 0.43001080
    [55] 0.27653453 0.60648148 0.64612115 0.44444444 0.20000000 0.33333333
    [61] 0.35981542 0.26550079 0.28361164 0.15356058 0.36009174 0.36896135
    [67] 0.04000000 0.16417910 0.01886792 0.25507901 0.37083595 0.33628319
    [73] 0.48592411 0.57032920 0.47826087 0.23659306 0.52325581 0.48148148
    [79] 0.74528302 0.25963439 0.46786632 0.57943925 0.30769231 0.58974359
    [85] 0.36842105 0.32098765 0.17968750 0.55477032 0.11111111 0.25000000
    [91] 0.40000000

    There’s no way to figure out which of those scores go with which raters since a lot got filtered out, and I didn’t add indices. I’ll try to do that next. In the meantime, I can at least tell you the first eleven of those scores were calculated between the most active rater and the next eleven most active raters.
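    Attaching rater indices before filtering is straightforward with combn. A sketch of the sort of bookkeeping meant here (hypothetical, not the code actually used):

```r
# Enumerate every unordered rater pairing up front, so that when scores
# are later dropped as incalculable, the survivors keep their labels.
pairs <- t(combn(24, 2))               # 276 rows, one per rater pairing
colnames(pairs) <- c("rater_i", "rater_j")

# e.g. the most active rater paired with the next eleven most active:
head(pairs, 11)
```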

    I don’t find these Kappa scores as interesting because they speak to error rates, and I’m currently more interested in the issue of bias, but it is remarkable how bad they are. Anybody running a study who found nearly negligible agreement between their two most active raters would immediately think something was wrong.

  46. From the picture, it looks to me like the vast majority of ratings were in the “no position” category. Am I right? If so, where does the fabled 97% come from?

  47. Peter Ward, you’re right that most (~8,000) of the ratings were neutral. Even of the ~4,000 papers they rated as endorsing AGW, ~3,000 were rated as only implicitly endorsing it. That is, they said things as strong as, “[Methane is] implicated in global warming.” Seriously. That’s an actual quote.

    The key is Cook et al filtered out the neutrally rated papers by only discussing papers which (they say) take a position. They then grouped their papers by whether or not they endorsed AGW. The endorsement could be as strong as, “Greenhouse gas emissions will kill us all,” or it could be as weak as, “Methane is a greenhouse gas.” Either way, they were lumped into the “Endorse AGW” category. That gave them a 97% “consensus” because almost nobody disputes the greenhouse effect is real.

    The trick is Cook et al then acted as though their “consensus” was regarding a strong statement of concern about global warming. It wasn’t. Their rating system was so poorly defined even the raters couldn’t give a clear definition of their “consensus.” In fact, they intentionally avoided a clear definition of the “consensus” even though they discussed the problem of defining it. You can see this here.

    It is purely dishonest. I was hesitant to call it that before, but I recently reviewed commentary surrounding my original criticisms of this paper, much of which was posted within days of the paper being published. One of the first things I noticed about this paper is the only category in it which supports the notion humans are causing most global warming was Category 1. Categories 5, 6 and 7 rejected that notion while Categories 2, 3 and 4 were neutral regarding it. This was important because there were only 64 abstracts rated as belonging in Category 1, but there were 78 abstracts rated in Categories 5-7. I received a lot of backlash for pointing this out, including multiple condemnations from Dana Nuccitelli, second author of the paper. One of the clearest examples can be found here:

    I have to say that Brandon’s effort to compare our Category 1 (explicit endorsement with quantification) with Categories 5–7 (all implicit + explicit minimizations and rejections) is really a gross distortion of reality. I think that’s the nicest way I can put it – I’ll refrain from saying what I really think of it in the interest of keeping the discussion here civil.

    At the time, I pointed out this argument was wrong, but I didn’t notice an important detail about it. In the Skeptical Science Forum, Nuccitelli had explicitly suggested the exact comparison I used:

    For ‘humans are causing most of the warming’, #1 qualifies as an endorsement, while #5 through 7 are rejections.

    For ‘humans are causing warming’, #1 through 3 are endorsements, while only #7 is a rejection.

    That is, one of the authors of the paper told the other authors they should compare Category 1 to Category 5-7 to determine the level of consensus on the statement, “Humans are causing most of the warming.” After the paper was published, I made that comparison, showing more abstracts were rated as rejecting that consensus than were rated as endorsing it. Nuccitelli’s response was to basically call me a liar.

    I repeat. The second author of this paper explicitly defined a strong consensus position. He proposed a test to examine the strength of it. I applied his test, showing his data set failed that test. He responded by claiming the results of the test he had proposed were “a gross distortion of reality.”
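    The arithmetic on both sides of this dispute is short enough to check directly. The counts below are the published abstract-rating totals as I understand them (3,896 endorse, 78 reject, 40 uncertain; 64 in Category 1), so treat them as assumptions to verify against the paper rather than re-derived figures:

```r
# Headline calculation: endorsements over all position-taking abstracts.
endorse <- 3896; reject <- 78; uncertain <- 40
endorse / (endorse + reject + uncertain)  # ~0.971, the advertised 97%

# The comparison Nuccitelli proposed in the forum: Category 1 (explicit,
# >50%) against Categories 5-7 (all minimizations/rejections).
cat1 <- 64; cat5to7 <- 78
cat1 / (cat1 + cat5to7)                   # ~0.45: rejections outnumber
                                          # endorsements of the strong claim
```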

  48. To be fair, the categories Dana Nuccitelli referred to in the above quote were slightly different from the categories used for the study. I don’t believe that changes anything I said. In case there is any doubt, though, Nuccitelli said this in a later comment:

    In short, in case I’m not being clear, my problem is that if you define AGW as “>50% observed warming over the past ~century is anthropogenic”, then either:

    you’re forced to make major assumptions in claiming that a large percentage of papers are endorsing that definition (for which you’ll be criticized, and rightly so), or

    you can break out those which explicitly endorse that definition and those which simply endorse “AGW”, in which case you’re not making any such assumptions (which IMO also makes it less subjective, though perhaps you can replace much of my ‘subjectivity’ concerns with ‘assumption’ concerns).

    Nuccitelli said it’d be right to criticize the authors of this study for assuming the “consensus” is “humans are causing most of the warming.” He says the only way they can avoid legitimate criticism should they define the “consensus” that way is if they separate out those papers as explicitly endorsing that “consensus” (i.e. Category 1 papers).

    I pointed out a problem with this paper. Nuccitelli insulted me for it even though he had pointed out the exact same problem more than a year earlier. Tom Curtis, a participant in the study, has explicitly argued the very point Nuccitelli previously said was wrong.

    I’m sure we could find other examples of similar hypocrisy.

  49. If you look at the explanations offered by the Cookists in various blog comments on how the rating scale works, they contradict each other. Sometimes their own earlier ones.

    By the way, we know that when Cook emailed authors asking them to rate their own papers, more than one author sometimes ended up rating the same paper, and they sent ratings that disagreed with each other. Did you know Cook ‘averaged’ their ratings? E.g., if a paper got a ‘1’ and a ‘6’, Cook would give it a ‘3’.

  50. @Shub Niggurath

    Did you know Cook ‘averaged’ their ratings? E.g., if a paper got a ‘1’ and a ‘6’, Cook would give it a ‘3’.

    A further interesting thing about that fact: when Cook et al defenders inevitably retreat to their last redoubt, the supposedly most robust author self-rated numbers (237 endorsing, 9 rejecting; I think Dana claims that, though I worked out different figures from the SI), e.g. when you see Dana claiming this gives about a 96% consensus, they seem to miss that this actually only means 237 author-rated papers against 9 support the consensus, not individual authors.

    Cook et al doesn’t tell us the number of authors per paper along with their individual ratings, AFAIK. Maybe our host here has that information? The number of authors per paper and their rating differences could potentially change that crude 96% quite a bit. I mean, what if each of the 9 ‘denier’ papers had 10 authors, all rating their paper against the consensus, while the rest of the consensus papers averaged 2 authors with more conflicts? 🙂

  51. Brandon:

    That gave them a 97% “consensus” because almost nobody disputes the greenhouse effect is real.

    As I pointed out on Ander’s blog, and just today on Eli’s blog, the actual consensus for AGW among scientists (restrict ourselves here to those with at least one peer reviewed paper relating to climate science), must be much higher than 97%. I speculated it was better than (10000-50)/10000.

    One thing that bothered me about the paper wasn’t the paper, but how it was used to conflate agreement on the weak consensus statement “humans are causing global warming” with the strong one “AGW is real and dangerous”. (As I’ve pointed out elsewhere you get to “dangerous” partly through physics analysis and partly through economic analysis).

    If what I am saying about the actual percentages is true, the paper is grossly inaccurate.

    It then becomes an interesting exercise to ask why nobody using this paper to further their talking points addresses, or even worries about, whether the 97% number is meaningful, or what it is actually a measurement of.

  52. Carrick, your point about the consensus being higher than they found is interesting, as many of the papers they labeled “Reject AGW” don’t actually dispute the greenhouse effect. They actually lowered their “consensus” value by adding in the “minimization” qualification.

    That is, a person could say humans have caused 45% of the observed warming, but that’s enough for global warming to cause serious problems so we need to take drastic action to combat global warming. Cook et al’s classification scheme would label that as “Reject AGW.” Heck, we could create hypothetical positions where people labeled “Reject AGW” call for action more strongly than people labeled “Endorse AGW.”

    While you’re right to say Cook et al did not measure a strong consensus position, they also didn’t measure a weak consensus position. It just seems they did because that position is so common. In reality, what they measured was some nebulous position which cannot possibly be defined.

  53. By the way, there’s an aspect to this study I find remarkable, but people don’t seem to have picked up on. John Cook and Jim Powell (and Dana Nuccitelli?) looked at the abstracts before designing the rating system. They did this because they were rating the abstracts for a different paper they wound up submitting to Science. Cook then used the knowledge he gained in the process to make sure he got the results he liked in his Cook et al paper.

    I’ll quote some of Cook’s comments from a month before the rating began to show this. First, a couple which show he had been examining the data:

    How do we distinguish between neutral and implicit? That is the key question that’s been occupying me for a while as I’ve been rating papers and building up some guidelines.

    It might help at this point to see papers that we’ve categorised as either Rejections or Possible Rejections… I’ve also included the Notes where Jim and I debate the various papers, whether they should be included as rejections or not. So the internal debates we’ve had should give some idea of the kind of issues we’ve grappled with. Looking at the types of Possible Rejections we considered might help us devise a more formal definition other than the pornography definition (I know it when I see it).

    The second of these was especially interesting, as Ari Jokimäki responded by pointing out an obvious problem with it:

    You want us to classify these papers but give results beforehand? That might mess up our objectivity, if we see that certain paper has been classified as rejection by you.

    There can be no question John Cook looked at the data before designing his rating system. That’s inappropriate. However, the truly disturbing part is what he said in another post:

    The reason we have the extra categories – implicit endorsement and possible rejections are because…

    Re possible rejections, the reason we did these was an organisation thing at first – we highlighted all “possible rejections” then had a closer look at each paper to see whether each was a definite rejection or not. Some possibles while they “smelt” like rejections, once you looked at the paper, was apparent that they didn’t reject AGW (one even explicitly endorsed AGW in the text so it pays to look closely). So it was a procedural thing, to ensure we didn’t miss any. It was also useful in the publishing of our final results to say “we found 23 rejections of AGW – we also identified 19 other possible rejections that we decided didn’t go so far as to reject AGW but even if you include them in the rejection list, our conclusion of a negligible denial impact and strengthening consensus is still robust”.

    This is actually a key result from this survey – even if you include all the disputed papers that *might* be rejections, the end result of a strengthening consensus still stands. So that’s why we’re happy to publish our ratings in an open, transparent fashion, challenging others to reproduce our work and confident that the results stand.

    They’re “happy to publish [their] ratings in an open, transparent fashion” because they already know the results will be what they want them to be.

    Also of note, people were encouraged to do many ratings for personal gain:

    The end goal of Phase 2 is publishing the results in a peer-reviewed paper. As far as co-authorship of the paper goes, I was thinking perhaps a practical approach would be that to be a co-author on the paper, you rate at least 2000 papers – seems a fair requirement to get your name on a peer-reviewed paper.

    Which is always a questionable practice. Finally, any time you see someone use this “97% consensus” as though it is tied to a strong position, remember Dana Nuccitelli said this before the study began:

    The problem is that technically the ‘consensus’ has to be the simple existence AGW, since the vast majority of papers aren’t specific about the magnitude of tha human contribution. If we assume every paper talking about AGW is saying the effect is >50% of the observed warming, we’re over-reaching.

    But then papers that say humans are contributing, but the contribution is small, are technically endorsements. So then you get deniers like Scafetta in the endorsement category, and that weakens the whole argument.

    It’s remarkable how prescient Nuccitelli was prior to them publishing this paper. You have to wonder how he forgot all the things he knew back then.

  54. Brandon:

    That is, a person could say humans have caused 45% of the observed warming, but that’s enough for global warming to cause serious problems so we need to take drastic action to combat global warming. Cook et al’s classification scheme would label that as “Reject AGW.” Heck, we could create hypothetical positions where people labeled “Reject AGW” call for action more strongly than people labeled “Endorse AGW.”

    What’s really interesting about that is even if you think that less of the observed warming is due to CO2, that doesn’t necessarily lead to a lower estimate of CO2 equilibrium climate sensitivity, nor a reduced risk.

    For example, suppose I thought natural variability is larger than predicted by the models (so more of the warming till now is explained by natural variability), but that part of the warming that would have been present if CO2 were the only anthropogenic forcing was being masked by other (negative) forcings like aerosols. I could stipulate 30% to date was anthropogenic, but the higher ECS number means more serious warming to come. Add to that greater natural variability, well that gets added on top of any trend: Think storm surge (warming trend) combined with tides (natural variability) as an example. Places with higher tides get the worst of the flooding. Same with climate change. Areas that experience greater natural variability (semi-arid areas) will similarly experience worse consequence from any secular shift in climate.

    This is one of the criticisms I’ve had of Mann’s attempts to disappear the MWP: he’s trying to extinguish natural variability, but larger natural variability actually makes things worse, not better.

    People like Eli are happy to smile and thumbs up Cook and Mann because it helps their immediate politically driven debate. But it also tells you, really, just how little concern they actually have about climate change. Otherwise the answers would matter more to them, than they do.

  55. This is commonly encountered in the evolution of classification systems. Grading scales and multi-layered complex systems originate purely on a conceptual, empirical basis, and the bad fit with the underlying data comes out later. Look at this (monstrosity), for example; yet I’ve used it, and I got pretty good at it. The problem, of course, was that my own score reproducibility would be good while there was little knowledge of inter-rater reproducibility. It goes without saying there is a ceiling in complex rating scales that is impossible to overcome, and the ceiling gets lower the more complex the classification/grading system gets. With such scales the purpose and the evolution curve are, however, different: the authors write up the scale, they are considered experts in their field, and they train an initial set of people to use it the way they used it. This then gets disseminated via conferences and teaching sessions over the years, with increasing numbers of people adopting it. Eventually, (clinical) data and reproducibility data start trickling in, and the rating scales get revised, modified or become obsolete over time. In several areas, the eventual conclusion has been (1) simpler systems are reproducible for the maximum number of users, and (2) complex, granular systems do not necessarily provide greater/useful information.

    Look at the Ishak scores: if a new observer is trained to use the system and then handed a handful of specimens, he or she would force them into existing categories regardless of whether the features are real. Give the same material with a different scale with different levels, and people will fit them into those too! Cervical cancer grading in Papanicolaou smears was performed on a three-tier system and specimens would get three grades. The middle grade was commonly abused as a dumping ground and reproducibility was poor. So they moved to a two-tier system. But they also created a handful of out-of-rating-scale entities, and these get used as dumping grounds/parking slots. This implies the material contains information that cannot be forced into simple grading systems but also cannot be captured by a more granular system reliably. Nevertheless, in both instances, information is present that is subject to a semi-quantitative scale, the thresholds of which are difficult to discern.

    What we have here, on the other hand, is the use of an empirically evolved rating scale that underwent no testing or validation, whose fitness for purpose is unknown, applied on a one-time-only basis to produce results the authors are not willing to subject to examination. It is actually good that Cook examined data beforehand and came up with the system. But (a) they decided to retain a ‘porno’ approach, and (b) the authors of the system participated in classifying papers themselves. Secondly, the underlying material, the abstracts, does not contain the information that is subject to the rating system. 90% of abstracts do not contain written information on the question.

    What would be the outcome of subjecting a mass of undifferentiated text to a classification system with some arbitrary number of levels/grades? The people applying it would fit the abstracts into those categories on a subjective basis, mainly because the categories were provided to them. The search strategy pulled up abstracts from the climate impacts and mitigation literature (~77% of abstracts). A third of such material can fairly be expected to say something *about* global warming, which, in the hands of volunteers looking for something to hang their hats on, becomes *in favour of* anthropogenic warming.

    Ultimately this is junk science, junk data and junk methods, just like Steven Mosher said. Only in environmentalism and climate communication do such studies pass muster.
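    The inter-rater reproducibility worry raised in this comment is conventionally quantified with a chance-corrected agreement statistic such as Cohen’s kappa, which none of us has seen computed for the TCP raters. A minimal sketch of the statistic, using invented ratings on the paper’s 1–7 endorsement scale (not the actual TCP data):

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two raters rating the same items."""
    assert len(ratings_a) == len(ratings_b)
    n = len(ratings_a)
    # Observed agreement: fraction of items where both raters chose the same level.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected chance agreement, from each rater's marginal level frequencies.
    freq_a = Counter(ratings_a)
    freq_b = Counter(ratings_b)
    p_e = sum(freq_a[level] * freq_b.get(level, 0) for level in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings on the 1-7 endorsement scale (illustration only):
rater1 = [4, 4, 3, 2, 4, 4, 3, 4, 1, 4]
rater2 = [4, 3, 3, 2, 4, 4, 4, 4, 2, 4]
print(round(cohens_kappa(rater1, rater2), 3))  # 0.483
```

    A kappa near 1 would mean the raters apply the scale the same way; a kappa near 0 means agreement no better than chance. Per-rater-pair kappas are exactly what the withheld rater-level data would allow.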

  56. That would be for pre-reconciliation ratings #1 and #2, i.e., the two different initial ratings each paper got.

    I guess we won’t see it for a while now.

  57. Ah, gotcha. I wouldn’t be so pessimistic though. I doubt the University of Queensland will decide to pursue this. My impression is the letter was an empty bluff. If I’m right, the delays in releasing this data should be about over.

  58. This has probably been mentioned in previous discussions of Cook’s paper, but the abstracts themselves are also a source of bias. Abstracts relate to published papers, and climate skeptics have a more difficult time getting published than warmists; Climategate alone provided multiple examples.

  59. I think there are other methodological issues in the Cook paper. Here are a few: publication bias (approval of skeptical papers is an uphill battle, whereas pro-AGW papers probably get a free ride in several journals); brown-nosing bias (CAGW needs to be mentioned to please funding agencies); off-hand bias (“hey, the official IPCC plot is AGW and who are we to say otherwise; we simply need an introduction”); irrelevancy bias (mentioning AGW in a paper is not scientific evidence for AGW); post-hoc statistical bias (papers not mentioning AGW were simply discarded); presumption bias (it is implicitly assumed that all authors of a paper share the same view on AGW).

  60. I have not had time to read all the posters’ comments (yes, I did read the original post). My understanding of Cook’s original article is that it said something along the lines of:

    of the thousands of articles read by the raters, 66% expressed no opinion about AGW, and of the ~34% that did express an opinion, 97% agreed with the conventional viewpoint.

    My first contention is that this whole line of reasoning smacks of “50,000,000 Elvis Fans Can’t Be Wrong” (the title of one of Elvis’s albums). How many people agree on something is independent of whether it is right or wrong.

    My second contention is that it may very well be that the only people writing these papers who are GOING to express an opinion on AGW are those who agree with it. Meaning that only people who want to promote the concept are going to express an opinion on it, because most of the people who want to do that research set out to confirm it to begin with.

    A better way of expressing it might be: of the ~12,000 climate-related papers examined, only about 33% expressed an opinion in favor of AGW.

    Another way of looking at it is, “Of the Red Sox fans we surveyed, 97% of them liked the Red Sox.”

    Another way: let us survey the authors of those 34% of articles that expressed an opinion on AGW and see what percentage agreed with AGW BEFORE they began their research.

    My overall opinion now is: “Come back to me when the global average temperature goes above 1998’s.” I doubt I’d hear from anyone in the future. By the way, saying “this decade is the hottest decade…” is only another way of saying that the global average temperature has not risen above 1998. Figures on either side of a peak are almost always going to be in a similar range to the peak. The important thing is: was the peak surpassed and the predicted pattern re-established?
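    The two framings in this comment differ only in the chosen denominator, which a couple of lines of arithmetic make explicit. Using the approximate shares quoted above (66% no position, ~34% taking a position, 97% of those endorsing; illustrative figures, not exact TCP numbers):

```python
# Approximate shares quoted in the comment above (not exact TCP figures).
no_position = 0.66                 # share of abstracts expressing no position on AGW
took_position = 1 - no_position    # ~34% expressed a position
endorse_among_positioned = 0.97    # 97% of THOSE endorse the consensus

# Headline framing: 97% of position-taking abstracts endorse AGW.
# Alternative framing: endorsing abstracts as a share of ALL abstracts rated.
endorse_overall = took_position * endorse_among_positioned
print(f"{endorse_overall:.0%} of all abstracts endorse AGW")  # 33% of all abstracts endorse AGW
```

    The 97% and the ~33% describe the same counts; the dispute is over which denominator honestly summarizes the literature.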

  61. Rather than wait for the University of Queensland to sue you or cower under their threat of suit, have you thought about bringing an action for declaratory judgment that the data and the letter are not protected by IP laws?

    The question of where you could bring such a suit (e.g., in Australia versus where you live) is potentially tricky, depending on the law in the country and/or state where you live, plus whatever activities (other than sending the lawyer’s nasty-gram) the University undertakes in the country and/or state where you live.
