A Re-Analysis of the “Consensus”

Hey guys. I’ve been mulling over an idea I had, and I wanted to get some public feedback. What would you think of a public re-analysis of the Cook et al data set?

A key criticism of the Cook et al paper is they didn’t define the “consensus” they were looking for. There’s a lot of confusion as to whether that “consensus” position is weak (e.g. the greenhouse effect is real) or strong (e.g. humans are the primary culprits). The reason for that is Cook et al tried to combine both definitions into one rating, meaning they had no real definition. You can see a discussion of that here.

I think it’d be interesting to examine the same data with sensible definitions. Instead of saying there’s a “97% consensus,” we could say “X% believe in global warming, Y% say humans are responsible for Z% of it.” That’d be far more informative. It’d also let us see if rating abstracts is even a plausibly useful approach for measuring a consensus.

My current thinking is to create a web site where people will be able to create accounts, log in and rate a particular subsample of the Cook et al data. I’m thinking 100 “Endorse AGW” abstracts to start with should be enough. After enough ratings have been submitted (or enough time has passed), I’ll break off the ratings, post results and start ratings on another set of abstracts.

The results would allow us to see tallies of how each abstract was rated (contrasted with the Cook et al ratings). I’m thinking I’d also allow raters to leave comments on abstracts to explain themselves, and these would be displayed as well. Finally, individual raters’ ratings could be viewed on a page to look for systematic differences in views.
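
To make that concrete, here is a minimal sketch of the kind of data model I have in mind, written as plain SQLite from Python. Every table and column name here is hypothetical, not a finished design:

```python
import sqlite3

# A rough sketch of the data model described above; names are placeholders.
conn = sqlite3.connect("reanalysis.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS raters (
    rater_id    INTEGER PRIMARY KEY,
    display_id  TEXT UNIQUE              -- public ID number, not a real name
);
CREATE TABLE IF NOT EXISTS abstracts (
    abstract_id INTEGER PRIMARY KEY,
    sample_set  INTEGER,                 -- which ~100-abstract subsample it belongs to
    text        TEXT,
    cook_rating INTEGER                  -- the original Cook et al rating, for comparison
);
CREATE TABLE IF NOT EXISTS ratings (
    rater_id       INTEGER REFERENCES raters(rater_id),
    abstract_id    INTEGER REFERENCES abstracts(abstract_id),
    endorsement    TEXT,                 -- e.g. 'endorse' / 'reject' / 'neutral'
    quantification TEXT,                 -- e.g. '0-50%' / '51-100%' / 'unspecified'
    comment        TEXT,                 -- optional explanation displayed with the results
    PRIMARY KEY (rater_id, abstract_id)
);
""")
conn.commit()
```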

What do you guys think? Would you be interested in something like this? Do you have things you’d like added or removed from it? Most importantly, do you think it’d be worth the effort? I’d be happy to create it, but it would take a fair amount of time and effort. It’d also take some money for hosting costs. I’d like to have an idea of whether it’d be worth it.

An added bonus to doing it would be I could move my blog to that site as well. Self-hosting WordPress takes more effort than using WordPress.com, but it allows for far more customization. I’d love that.

So, thoughts? Questions? Concerns?

By the way, don’t hesitate to tell me I’m a fool if you think I’m spending too much time on the Cook et al issue. I’ve been telling myself that for the last two weeks.

67 comments

  1. Levi Russell, rating abstracts doesn’t take any special skill or knowledge. All you have to do is read an abstract and ask yourself, “Does it endorse the greenhouse effect? Does it quantify how much global warming humans have caused?” Anyone could do it.

    In fact, laypeople might be better than knowledgeable people. Knowledgeable people may have biases due to their exposure to papers that laypeople wouldn’t have.

    Plus, you wouldn’t have to rate any abstract you were unsure about.

  2. Brandon, as a university researcher involved with student learning assessment, I recommend that you follow some of the standard practices.
    1. have clear definitions of what and how you are rating.
    2. develop rating rubrics.
    3. train raters, at least minimally.

    Of course, each of these should be tested and modified as needed before general release. It’s not an easy undertaking to do it right, but there’s no point in doing it wrong. You will provoke attacks so the method has to be as air-tight as possible.

    For reference, this website deals with academic assessment practices: http://www.learningoutcomeassessment.org/

  3. Brandon Shollenberger,

    My problem with the approach of rating abstracts is that it is not a rating based on reading the paper itself. The approach rests on a false and unstated premise: that one can, with any reasonable intellectual integrity and any objective scientific basis, assess a paper by reading only its abstract.

    No, I would not participate in re-analysis of Cook’s approach based on reading abstracts of the papers.

    I suggest you take an approach of rating based on the reading of the whole papers.

    John

  4. Brandon,
    This would be a major effort and probably a lot more work than you might think. I’d be very interested in the result though and would participate, time permitting. That said, I think doing it well and in detail might take a very long time. When you set out to do something right, it takes many times as long as doing what SkS did.

    You may want to break it up into smaller goals. For example, take a first cut on the number of papers that fall into the category of mentioning AGW as a given, but in a paper that is about something else. For example, a paper that says “given the world will warm two degrees in the next century, we examine the impact on the range of western ridgeback three toed sloths”. That more than likely specifically endorses AGW, by SkS definition, but is really just a paper that isn’t about determining if AGW exists at all. Another category might be papers that study changes in the biosphere and blame them on AGW without actually presenting any evidence that the local climate has in fact changed (I’ve seen enough of those to make me want to puke).

    Categories like that could be dealt with more easily and would reduce the number of papers needing more in-depth analysis of actual AGW quantification, while still providing insight into the collection as a whole. Plus you get to release results in chunks this way and drag this whole thing into the spotlight over and over again 😉

  5. Brandon, good idea.

    I’ve laid out some of the principles I learned doing content analysis (on student writing for placement), so it is vital that you follow a few procedures.

    I’d volunteer to rate abstracts.

    There are some other cool things you can do as well

  6. oh, and I think you’d need a whole category for “endorses but isn’t bad” or some such thing. For example, a paper that estimates climate sensitivity at less than 1.5C per doubling obviously endorses the idea that AGW exists, but not necessarily that CAGW exists. Where you draw the line on that is a whole debate unto itself.

  7. Gary, the first practice you describe is a given. It’s the primary reason I’ve been considering doing this. I haven’t given much thought to the second point because of how simple the definitions I’d use would be, but it’s probably wise.

    The third point I’m not so sure on. One of the benefits of doing rating in sets is you can discuss the previous ratings and figure out how well they were done. You can even have people “practice” on them. One thing I was considering is including some abstracts from previous samples in new samples as a consistency check. That gives you an idea of whether a rater is rating things in a manner consistent with your intent. That’s a lot easier than trying to train random volunteers.

    John Whitman, I think you missed the point of this. I’m not seeking to find the “right” consensus values. I’m seeking to examine the methodology used by Cook et al. If people want to say the results are bogus because you can’t judge the “consensus” by abstracts, that’s fine by me. I’ll agree with them and say they should take it up with Skeptical Science.

    Steven Mosher, there are a lot of cool things that can be done with an approach like this. The problem is “cool” often conflicts with sound research. For example, examining how results change from set to set as people learn more could be cool. That’s not good science though.

    Or at least, it isn’t if the goal is to get a “right” answer. Fortunately, I’m not. I’m not looking to publish results from this as showing a new consensus value. I’m just looking to test the data and methodology used by Cook et al. That frees me from a lot of burdens.

  8. davidmhoffer, I actually have a pretty good idea how much work this would take (at least on my part). I know how long it’d take me to set up the database and server for it. I also know the login/account functionality wouldn’t take me too long as long as I don’t get fancy with it.

    The main thing I’m uncertain about is how long it’d take to create the ratings interface. That one is a bit more iffy, especially depending on what sort of functionality I want for examining the data afterward.

    The other main source of uncertainty is I’m not a graphics person. I don’t know how much time I’d wind up spending trying to get things to look nice. That’s a non-essential aspect though. I think it’d be okay (but not good) even if I didn’t spend time on it.

    The real question I have to ask myself is, how much would this cost me? The cost of having the site itself isn’t bad (I’d pay ~$120 per year), but I don’t know how I’d measure the value of my time.

  9. Sounds good. I’d like to see Stephen Mosher’s ideas for how it should be done. I’m curious what to do with that abstract you posted about metallurgy. It is saying that humans are responsible for the majority of global warming.

  10. The result will be decided simply by which side gathers the largest army of supporters.

    Skeptics already suspect that’s how the original result was obtained.

  11. Generally a bad idea, in my opinion.

    First, even if it’s done well, nobody is going to pay attention. Academia won’t care, and certainly the media won’t care. If somebody does pay attention, it will be to say, “Oh, look at the silly deniers!”

    Second, the whole emphasis on a consensus is silly. A consensus in science happens naturally over time, and normally people don’t pay it much attention. I don’t remember ever seeing endless media coverage of 97% of scientists agreeing on plate tectonics, or evolution, or the big bang, or dark matter. It’s pure politics that a consensus is pushed down our throats so strenuously in the area of global warming. I don’t think skeptics should be playing that game. Skeptics should “rise above” this sort of thing … not wallow in the mud with the warming zealots.

    Just my opinion.

  12. Joe Public, I disagree. For one thing, bias amongst raters is not the reason Cook et al found a high consensus value. The reason is the rating system was inherently biased to give a result like 97%. A different set of raters could have gotten a lower value, but they couldn’t have gotten one too low if they followed the guidelines.

    Another issue is I intend to make raters’ ratings viewable. If bias across groups is an issue, we’ll be able to see evidence of that by examining the work of individual raters. That means we could filter out raters people feel are biased and see how it affects results. That’d give us a range of possible answers (a rough sketch of that filtering follows this comment).

    Finally, if there is significant disagreement on certain abstracts, that’d be easy to spot in the data. People could look at those disagreements and judge for themselves what the rating ought to have been.

    But really, since I’d make a simpler and clearer rating system than the one Cook et al used, there should be less disagreement in the results. Also, there would be more ratings per abstract, meaning spotting “errors” would be easier.
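
Here is a rough sketch of that filtering, assuming the ratings have already been pulled out of the database as simple (rater, abstract, rating) tuples; the data and names are made up:

```python
from collections import Counter

# Made-up ratings: (rater_id, abstract_id, endorsement) tuples.
ratings = [
    ("r1", "a1", "endorse"), ("r2", "a1", "endorse"), ("r3", "a1", "reject"),
    ("r1", "a2", "neutral"), ("r2", "a2", "endorse"), ("r3", "a2", "endorse"),
]

def tally(ratings, excluded_raters=()):
    """Per-abstract tallies, optionally dropping raters people consider biased."""
    per_abstract = {}
    for rater, abstract, endorsement in ratings:
        if rater in excluded_raters:
            continue
        per_abstract.setdefault(abstract, Counter())[endorsement] += 1
    return per_abstract

print(tally(ratings))                           # everyone included
print(tally(ratings, excluded_raters={"r3"}))   # one view of the "range of possible answers"
```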

  13. ZombieSymmetry, I agree the emphasis on a consensus is silly. That’s part of the reason to do this. The results could show how referring to a singular “consensus” is meaningless, showing people a more nuanced discussion is necessary.

    The idea is to use an argument ad absurdum to show the fallacious nature of a common argument. That’s the opposite of wallowing in the mud. That’s pointing to the people in the mud and explaining how their behavior makes them pigs.

    As for whether or not people would pay attention, I can’t say. I do think it’d provide a strong talking point for anyone rebutting the “97% consensus” argument though. Any time someone goes on about the “consensus,” these results could be brought up to show how meaningless their argument is.

  14. Thanks for the explanation to my earlier comment, Brandon.

    However, it’s still a distorted sample.

    “(Cook) emailed 8547 authors an invitation to rate their own papers and received 1200 responses (a 14% response rate).”

    It’ll be analysing the 14% who themselves chose to respond to Cook.

  15. Brandon Shollenberger
    May 20, 2014 at 3:14 pm

    John Whitman, I think you missed the point of this. I’m not seeking to find the “right” consensus values. I’m seeking to examine the methodology used by Cook et al. If people want to say the results are bogus because you can’t judge the “consensus” by abstracts, that’s fine by me. I’ll agree with them and say they should take it up with Skeptical Science.

    and

    Brandon Shollenberger
    May 20, 2014 at 3:14 pm

    Or at least, it isn’t if the goal is to get a “right” answer. Fortunately, I’m not. I’m not looking to publish results from this as showing a new consensus value. I’m just looking to test the data and methodology used by Cook et al. That frees me from a lot of burdens.

    – – – – – – – – –

    Brandon Shollenberger,

    I would be very interested in what you find if you (and a host of volunteers supporting you) pursue “test[ing] the data and methodology used by Cook et al”.

    This is a question of priorities.

    My thinking is that a whole set of invalidating premises used by the Cook et al (2013) ‘Consensus’ paper makes the paper not even irrelevant to any physical science of the climate. So testing the ‘Consensus’ paper’s data and methodology to me seems the lowest of many priorities. Public debate focused on convincing the UQ to provide the hidden data openly for all scientists to analyze would seem to me to be the preferred priority; like you and others are already hard at work doing. : )

    John

  16. Joe Public, I’m afraid I don’t understand what you’re saying. I wouldn’t be examining the abstracts whose papers were rated by their own authors. I’d be examining the abstracts rated by Cook et al. Even then, I’d probably only examine the ones Cook et al rated as taking a position. That sample would unquestionably be biased, but that’s alright. The bias would work in favor of Cook et al, making any results all the more damning.

    ZombieSymmetry, I’m still debating on whether or not it’d be worth creating a rating category for whether the paper seeks to provide evidence or not.

    John Whitman, I agree about the problems with Cook et al’s paper invalidating it. The problem is convincing people of that. I think there’s a sizable number of people who would tune out most criticisms of the paper but would be interested in the results of something like I describe.

    We can look to Skeptical Science for why that’s true. As John Cook has repeatedly said, if you don’t offer people an alternative, they’ll usually stick with something even if it’s bad. They’re counting on that. That’s why they respond to criticisms by telling people to do their own study. They know nobody wants to do their own study, and as long as nobody does, their study will be accepted.

    As for other priorities, once you start getting into formal and legal issues like ethics approvals and confidentiality agreements, things slow down. You could spend a week or two with absolutely nothing new to discuss on the matter. It’s not necessarily bad, but it does mean you can have time for side projects like this.

  17. Running this in public as opposed to doing it in a secret secret, well almost secret forum is a great model. As you know Anthony Watts did a similar operation with his Surface Stations project. It made a real impact on everyone’s awareness of the weaknesses in the USHCN. A thoughtful survey of the literature on climate science could be equally interesting and revealing.

    Responding to the “if you don’t like our results then run your own study” charges made by the Kidz et al is also very interesting. While replicating The Consensus without an Object Project would be silly, improving on it by categorizing the literature into more than “97% agree and the other 3% were written by deniers” would provide a potentially useful result rather than further (and intentionally) polarizing the discussion.

    What portion of TCP papers “took a position?”

  18. What portion of the TCP papers took a position? Another good question. The Cook team nonsense includes papers that ‘implicitly’ took a position. In reality, only about 10% actively took ‘a position’. The rest is just reading positions into text.

  19. DGH, I’m glad to hear a number of people are favorable to the idea. As for your question, I was thinking I’d filter out all the abstracts Cook et al rated as 4 (neutral). There could be value in examining them, but that category has little information density compared to the other categories. That makes it a lower priority (at least to me).

    Shub Niggurath, stating you agree with a position is enough to make you part of a consensus. I have no problem with that. Consensus is about the level of agreement, not the level of evidence. Keep that distinction clear, and things are fine.

  20. One addition that would make the ratings more solid: run the abstracts through a typical line numbering program, similar to legal briefs, so that a rater can cite the specific parts of the abstract that support their rating. The reason, I fear, is the potential for bogus raters; line numbers would let a reader quickly see what triggered a specific criterion in the rating (a rough sketch of this follows the comment). But unfortunately, it would add time and effort.

    Would happily read abstracts.
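
A rough sketch of the line-numbering idea, assuming plain Python and splitting on sentences (since abstracts are usually a single paragraph); the function name is made up:

```python
import re

def number_abstract(text):
    """Prefix each sentence of an abstract with a citable number, legal-brief style."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return "\n".join(f"[{i}] {s}" for i, s in enumerate(sentences, start=1))

print(number_abstract(
    "Global warming is a serious threat. We analyse Greenland melt rates. "
    "Our results do not agree with the models."
))
```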

  21. I see one other issue however. A lot of abstracts have the ‘pay homage’ quote, like “Global warming is a serious threat to plant …”, then absolutely 100% of the rest of the abstract details their analysis of some data set, and their conclusion is in direct conflict with the consensus or discusses nothing to do with GW or AGW. I have a nice paper on my desk, from about a year back, analyzing Greenland melt rates, and their conclusion was that their study does not agree with the models.

    So how does one establish a category for this?

  22. While I don’t expect that the Kochs, Heartland, GWPF, Exxon (or even Tetra Tech) will make a gift to your homegrown survey, there’s no reason not to get ahead of the Big Oil issue. WUWT got a lesson in that regard during the Gleick affair.

    To the extent that you receive significant gifts from friends while you are running the project you should have a disclosure policy in that regard. If you accept something over a certain value or a certain percent of budget then the info should be made public.

    My comment anticipates a level of success (as it were) that your project might enjoy. Good luck.

  23. Levi Russell casts some doubt on his ability to rate abstracts. Brandon answers well, and I agree that it is in fact people like Levi who might be best suited to such tasks. After all, as long as you can read and understand the abstract and then compare what you learn from it with the questions being asked, it’s not so difficult. In fact, being so far removed from the subject can help remove bias. If your dog’s not in the fight, you won’t be tempted to tip in his favour.

    Brandon. I’d be more than happy to offer my time for a project such as this. I am educated to a reasonable standard and attended two universities. My field has been more in the area of electronic engineering and computer server networking.
    I consider myself to be objective and willing to accept that what I believe may not always be correct.

    I’d caution against having too many variables in the ratings, muddying the waters, but also be careful not to make it an either/or situation, which may force raters to choose an option they are not comfortable with because it’s the lesser of two evils, if you will.
    Fully support the idea whatever the outcome. Science is meant to be replicated.

    Objectivity of the raters must be paramount. There are plenty of people who do not take the same views as the Skeptical science team who would be just as willing to skew results for what they consider a favourable outcome as there were in Cook’s study.

  24. I am inclined to agree with joe public. This will just degenerate into a tribal war. Who has the biggest tribe wins. The problem is the pro-warmist, de-industrialisation, anti-west, anti-humanity crowd believe the end justifies the means and will not rate honestly. No matter how good your methodology is they will ignore it. They already ignore reality.

  25. Timo Soren, adding line numbers is a good idea. I’m not sure how much work it’d be, but it’s something I’ll look into if I create this system. As for your question, my thought right now is a consensus is about measuring how popular a position is. It’s not about examining evidence for that position. As such, there’s no need to look at the issue you highlight.

    That said, I can always add additional things to look at. It’s not hard to do. The problem is just that you don’t want to overwhelm raters. One possible solution, since I’d plan to have abstracts rated in sets, is to not ask every question of every set of abstracts. A similar option is just running a second round of ratings for abstracts where the questions were different.

    DGH, I don’t think the source of funding would be an issue for this. I wouldn’t expect to receive much in the way of financial backing. I’d probably be willing to set up a preliminary version for testing purposes for ~$500. That wouldn’t be much of a bribe.

    If for some reason I did start taking in more significant amounts of money, I’d definitely want a disclosure policy. I just don’t foresee that happening. It’s not like I’d be turning this into a job or something.

  26. Brandon,

    Some sample rubrics that could be modified for this project:

    http://rubistar.4teachers.org/index.php?screen=ShowRubric&rubric_id=1060532&
    http://rubistar.4teachers.org/index.php?screen=ShowRubric&rubric_id=1993164&

    Training can be as simple as having a rater use your rubric on a couple of sample papers that you have evaluated previously. Compare the results and explain why you rated the sample as you did. I assume you will be registering raters in some way. Training can be part of the process.

    The Zooniverse https://www.zooniverse.org/ does a lot of crowd-sourcing with simple training.

  27. zootcadillac, I came up with a rating system when Cook et al first came out. It’s pretty simple. The core is just two categories: 1) Endorsement of global warming in general (basically believing in the greenhouse effect); 2) Amount of quantification.

    I’m currently debating on how to handle the question of implicit vs. explicit. I had thought to give it a category of its own, but I’m not sure if that’d work. What if something explicitly endorsed the greenhouse effect but only implicitly quantified human contribution? Also, I’m thinking about whether or not it’d be worth having another category for whether or not global warming is a threat.

    I think it can be done pretty simply. At a minimum, you just need three options for the two main categories. 1) would have “Endorse, Reject and Neutral.” 2) would have “0-50%, 51-100%, Unspecified.” Even if you add implicit/explicit to those options (or as separate categories), it won’t be bad.

    On the issue of bias, I definitely agree that’d be a potential problem. However, I also believe it’d be a potential benefit. Comparing how various raters’ ratings diverge from one another could provide useful information about how biases can affect the results. That might not be feasible when you only have two raters per abstract, but if you have ten or more? It’d be much more doable.

    Plus, I’d intend for raters’ ratings to be public. That’d mean people could look for themselves to see if they felt bias was an issue. Heck, it might even be worth creating a histogram or something similar showing their ratings.
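
To show how simple the scheme could be, here is a minimal sketch of the two categories and the kind of per-rater histogram mentioned above; the names and data are invented:

```python
from collections import Counter
from enum import Enum

class Endorsement(Enum):
    ENDORSE = "endorse"
    REJECT = "reject"
    NEUTRAL = "neutral"

class Quantification(Enum):
    UP_TO_HALF = "0-50%"
    MORE_THAN_HALF = "51-100%"
    UNSPECIFIED = "unspecified"

# One rater's ratings as (endorsement, quantification) pairs; the data is made up.
ratings = [
    (Endorsement.ENDORSE, Quantification.UNSPECIFIED),
    (Endorsement.ENDORSE, Quantification.MORE_THAN_HALF),
    (Endorsement.NEUTRAL, Quantification.UNSPECIFIED),
]

# Per-rater histograms: how often each option was chosen in each category.
print(Counter(e.value for e, _ in ratings))
print(Counter(q.value for _, q in ratings))
```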

  28. Gary, I’m not sure how you’d use training as part of the registration process. Would you not allow an account if the rater couldn’t pass a test? Actually…

    I have an idea. I don’t think I’d require you pass a test to sign up. I think I would have a set of test abstracts people could practice on. However, the “testing” would be done differently. What I’d do is seed the sample sets with “test” abstracts. People would rate them like any other abstracts, not knowing what they were.

    That would let us “grade” raters. We could then use those grades to filter results if we wanted to. Also, once the ratings of a subsample were finished, people could look and see how they did on the “test” abstracts of that subsample. It wouldn’t be immediate feedback, but it would provide a constant check of people’s reliability.

    Plus, if people disagreed with the ratings for those test abstracts, we could change the “right” answers and regrade raters with little effort (a rough sketch of this grading follows the comment).

    I think that would work. I’m not sure though. I just came up with it. Also, it’d be a lot of work.
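
Here is a back-of-the-envelope sketch of that grading, with made-up data; regrading after changing the “right” answers is just rerunning the function:

```python
# Known "right" answers for the seeded test abstracts (hypothetical IDs).
answer_key = {"t1": "endorse", "t2": "neutral", "t3": "reject"}

# Each rater's ratings, including the seeded abstracts they didn't know about.
rater_ratings = {
    "rater_007": {"t1": "endorse", "t2": "endorse", "t3": "reject", "a9": "endorse"},
    "rater_012": {"t1": "endorse", "t2": "neutral", "t3": "reject", "a9": "neutral"},
}

def grade(rater_ratings, answer_key):
    """Fraction of the seeded test abstracts each rater rated the same as the answer key."""
    grades = {}
    for rater, ratings in rater_ratings.items():
        scored = [a for a in answer_key if a in ratings]
        correct = sum(ratings[a] == answer_key[a] for a in scored)
        grades[rater] = correct / len(scored) if scored else None
    return grades

print(grade(rater_ratings, answer_key))   # e.g. {'rater_007': 0.666..., 'rater_012': 1.0}
```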

  29. Thank you for the reply Brandon. Transparency was the thing I omitted from my post. Indeed. All raters should agree to be named and have their views subject to public scrutiny.

    As for the rating system, that seems reasonable. I just have a bit of a beef with polls and questionnaires. Too often I feel I’m being railroaded into saying something I don’t agree with because the scope of the answers does not allow for honest responses. I don’t doubt though that your intentions are pure here and I’m becoming more excited for the project. I hope you feel able to undertake it. I’m sure that you will get plenty of the right help should you need it, and funding would be a minor issue that a few of us could sort out in an afternoon.

  30. zootcadillac, I’m not sure I’d require people be willing to be named. I was thinking I’d only make ID numbers visible. That would let people scrutinize everything while still allowing users to be confidential.

    On the issue of the rating system, I agree. The Cook et al system is a great example of what you’re referring to. Even some of the raters commented on that problem. The problem is they didn’t understand how to create a rating system. They felt it was necessary to create a symmetrical scale. That was a big no-no. I discussed this here:

    http://rankexploits.com/musings/2013/why-symmetry-is-bad/

    Basically, their scale had two dimensions (endorsement and quantification), but they tried to express it in only one. That could be done, but it couldn’t be done on a symmetrical scale because the two dimensions were not independent of one another.

    Anyway, I’m pretty sure my categories would work fine. If not, and people felt they were bad, I could always change them. That’s one benefit of doing ratings over specific subsamples. It makes it possible to test ideas.

    The more I think about this idea, the more I want to do it. I’m tempted to go buy the domain name and hosting plan I had my eye on and start working on it tomorrow…

  31. Brandon, I think it’s a terrific idea, if you are willing, or have help who are willing, to set up and coordinate such an effort.

    I think, based on something you said in a previous post on this subject, that a side project might be of value, as well. I believe it was you who demonstrated that there are papers which have little or nothing to do with climate, but which toss a reference to climate into the abstract, as a magic, “Get Peer Approval” device, and because of this, some were counted as valid members of the set of all papers “discussing” climate change, and part of the consensus study.

    Might it be of interest to review papers where mention of climate change in the abstract seems out of place, to determine the validity of the use of such papers in the study at all?

    Also, if it turned out that there were a significant number of these, what, spurious (for lack of a better term) citations of climate change where it was clearly no part of the subject of the paper, in spite of the abstract, might that fact not also be of scientific interest? An example of researchers exploiting a bias in academic peer research to gain approval? Evidence of the knowing use of such a cynical ploy, should the evidence support this little hypothetical of mine, might also be of interest, no?

    I realize this is a larger undertaking, and calls for judgement calls which will draw fire from the academics involved (I can only imagine the scorn they would heap on me for stepping “above” myself like this), but I think it would be interesting to find out, no?

  32. P@ Dolan, I’m glad to hear you like the idea. I am capable of doing this. The only thing involved (other than getting interest and things like that) I’m not good at is the artistic part. My pages would probably look fairly bleh.

    As for your proposed side project, that’s definitely something I’d like to do. My current thinking is if it wouldn’t bog things down, I’d add a box people could check to flag an abstract as questionable. After each set of ratings were done, we could look for ones which were flagged and examine them. It wouldn’t add much overhead, and it could give a steady stream of material to discuss.

  33. @Brandon

    it is a great idea and it would be fun taking part. You can count on me.

    But before starting the project I would direct my complete energy to “convince” Cook, the journal or UQ to retract this study. Including FOI, the media, the blogosphere, crowd-sourcing, politics and so on… a really organized campaign, making sure to name names in the end. My best wishes.

  34. Any attempt to replicate Cook would fall into the same trap: The number of abstracts is too large, the category boundaries too imprecise, the wording of abstracts too ambiguous.

    It would be useful to have a SMALL number of abstracts rated by a LARGE number of people (rather than the other way around).

    It would be useful to use the results to train a computer to rate the remaining, large number of abstracts.

  35. I would be willing to help out. There are issues though. Brandon’s statement about support for a weak or strong GW position is a relevant area to consider. But there are other dimensions as well. For instance, a study of the effects of climate change on “one-eyed warbling hyenas” should probably be ranked a tad lower than a paper on climate sensitivity. A separate grading (independent of the reviewer) needs to be on relevance to the core of the subject, which in turn needs a clear definition of what “climate science” is. Further, a paper that withstands the test of time should be ranked more highly than something recently published.
    Gary’s comment about clear definitions should also be followed. But a trial needs to be conducted as well.

  36. Zombie Symmetry makes me think a more useful project might be to review how many of the papers had little or nothing to do with climate but were Cooked up to allegedly be so.

  37. Brandon, you raise a disturbing point. I do not think the issue is whether or not there is a greenhouse effect. The question is whether or not there is a climate crisis. Perhaps you could clarify what it is you are seeking?

  38. Shub Niggurath, stating you agree with a position is enough to make you part of a consensus. I have no problem with that.

    Right. But what is meant by ‘stating a position’? This is where things fell apart. Actually, there is *nothing* called an implied position.

    When you say something like ‘implied position’, you are assuming
    (a) such things happen in abstracts, and,
    (b) the abstract text had the degree of freedom to ‘imply’ something, and it so happened that it ‘implied support’ to the IPCC orthodox position.

    In reality, both (a) and (b) are unfounded assumptions that arise solely as a result of the classifier’s need to find implication in the first place. A need created by a classification system that states such a category(ies) exists.

    Take the ‘explicit’ category. Of the papers rated as explicitly supporting IPCC orthodox statement (>50% of warming in 2nd half of 20th century due to man), what was the rate of agreement between two sets of observers? 12%. If this is how (pathetic) the system of abstract+rating scale performs with the best-defined of entities, how could it do any better with lesser groups?

  39. I’ve always been surprised that the 97% wasn’t 100%

    The greenhouse effect or back radiation or whatever you want to call it is real, like it or not. Observations indicate:
    # that the effect is largely lost in the noise of variation, and
    # that it more than likely invokes so far unexplained negative feedbacks that allow the climate to self-regulate.

    And to think we can occupy the planet at our level of consumption of goods and energy without affecting it seems a bit mad to me.

    So, if I published a paper it would, as far as I can see, find itself in the 97%. I guess I’ll have to give back my membership of the denier club. 😉

    My surprise is that they managed to find 3% of papers that contradicted the two posits.

    Unless you can show a methodology that removes rater bias, you are open to the same criticisms as Cook of publishing based on subjective judgements.

    Trying to quantify the Y% of scientists say humans are responsible for Z% is interesting. However, your sample of papers making this exact claim without some interpolation of numbers (possibly open to bias) would be pretty small I would think.

    To me the original premise was flawed and repeating it is nugatory.

  40. I fear that too many abstracts pay lip service to climate change (so as to get published), but neither the authors nor the results within the paper actually support same. One even encounters papers in fields quite distant from climate that nonetheless have words thrown into the abstract to implicate climate change in some way. I think any appraisal of consensus by means of analyzing abstracts may therefore exaggerate the result compared to reality.

  41. Brandon, the idea about training is merely to expose raters to the rubric, not certify ability. Over time raters become more skillful with familiarity. Training shortens the learning curve. Sorry if I didn’t make that clear earlier. Even with training there will be variation, but that, as you know, is quantifiable. Your idea of seeding and evaluating will work.

  42. Pointless

    Anytime anybody mentions Cook’s 97% paper, ask the person to Google HerrCook and SkStroopers,
    and mention how creepy it is that they used SkS logos instead of swastikas. And they have never explained why SkS had these.

  43. Do it. I would gladly participate. And yes, you are a fool for being that concerned with Cook et al. but we need fools like you.

  44. strike, thanks. I don’t agree about devoting all my energy to a campaign like you describe. A campaign like that is relatively slow, and there’s only so much you can do. That imposes limits to how much effort you can put into it at a time.

    Richard Tol, I’m not seeking to replicate Cook et al’s work. I’m seeking to re-analyze their work. Primarily, their categories were poorly defined, as was their “consensus.” I’d like to test what happens when you fix that. Once that’s fixed, the issue of how ambiguous abstracts are can be examined via measures of rater disagreement (a simple such measure is sketched after this comment). As for how many abstracts there are, my approach involves breaking the abstracts into small subsamples. Examining five sets of 100 abstracts is quite feasible, and it could provide useful results.

    manicbeancounter, hunter, Shrunn, there are many different things which can be examined. The best approach is to examine a few at a time. We can always ask additional questions as a separate thing. My thought is people should leave comments on abstracts (or at least flag them) if they feel the abstracts are problematic. People could then discuss them to see how they should be viewed.

    hunter, I’m not seeking an “answer.” My hope is to get people thinking about the issues by showing the current “answer” is grossly over-simplified. From there, work could be done to look for the right answer. You just can’t get people to look for an answer if they believe they already have it.

    Clovis Marcus, that’s actually a (semi) intended aspect of the Cook et al paper. John Cook discussed this in the forum when designing the rating system. It was pointed out the definitions they used might lead to labeling abstracts as rejecting AGW even though they actually endorsed some form of the consensus. He said that’s okay because those cases are so few the consensus will still be a high value. In other words, he knew his vague definitions would lower his results, but he knew the results would remain very high despite that.

    On the issue of bias, I can’t possibly remove bias from raters. What I can do is create a framework where bias from raters can be examined. We’d be able to remove ratings from raters we believe are biased and examine how the results are affected. As for the inadequacy of the sample size, I agree with you. I’m okay with that. If the sample turns out to be unacceptable for our purposes, that just shows Cook et al’s results can’t be relied upon.

    Repeating a bad process is bad for getting good answers, but it can be good for demonstrating other people’s use of the process was bad.
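
For that disagreement measure, something as simple as agreement with the modal rating would do for a first pass; here is a sketch with invented data (more formal work might use a statistic like Krippendorff’s alpha instead):

```python
from collections import Counter

# All raters' ratings for each abstract (made-up data).
by_abstract = {
    "a1": ["endorse", "endorse", "endorse", "reject"],
    "a2": ["neutral", "endorse", "reject", "neutral"],
}

def modal_agreement(ratings):
    """Fraction of raters who picked the most common rating; 1.0 means unanimity."""
    counts = Counter(ratings)
    return counts.most_common(1)[0][1] / len(ratings)

for abstract, ratings in by_abstract.items():
    # Low values flag ambiguous abstracts worth discussing.
    print(abstract, round(modal_agreement(ratings), 2))
```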

  45. Gary, that makes sense to me. I probably should have understood it the first time. Anyway, it’s similar to what I had in mind for allowing people to “practice” on some abstracts, so I’d probably do it in one form or another.

  46. Brandon,
    Thank you for your kind response and thank you for the large amount of work you undertook to pop the bubble of credibility that has kept Cook & gang aloft for far too long.
    I would like to understand a bit more the ideas around the greenhouse effect. I do not see the GHE as the test in any way for skepticism of the climate issue. Was this Cook’s test to decide if a paper was with the consensus or not?

  47. Hi Brandon,
    I think what you are doing is admirable, I would also be glad to help. I have a BS in chemical engineering and a BA in psych so I am familiar with the differences in formats.

    First, I think you should divide your approach into two different vectors:
    1. Replication
    2. Addressing the criticisms

    The first is relatively simple: use Cook’s technique and methodology to rate abstracts and see if you get the same results with a different peer group. Failure to replicate would be a confirmation of rater bias and be solid grounds for a rebuttal.

    Secondly, and more importantly, a paper identifying and correcting for the criticisms of Cook et al. would be beneficial.

    1. To start, the secondary approach should address rater fatigue. This is easy: limit the number of abstracts to a maximum of, say, 10 per day.

    2. The second major criticism was rater bias. This can be accounted for by offering a random set of abstracts which occasionally introduces repeats, to see the variance in subjective ratings.

    3. Elimination of irrelevant publications. In this series of posts you have listed a chemistry paper which was counted even though it dealt with what seemed to be an unrelated subject and simply mentioned the authors’ belief in warming. This is easy: allow the rater to mark papers as irrelevant, with every such marking requiring a rater’s comment as to why they feel the paper isn’t related to climate change as a whole. When a paper receives that mark with enough frequency it is removed.

    4. Ask reviewers to run a quick Google search for a scientific rebuttal (rebuttal publications outnumber retractions); as part of the review, ask for a link to any peer-reviewed rebuttal.

    5. Ask reviewers to list ECS/TCR if indicated, the publication year, and any peer-reviewed rebuttals.

    Analysis

    1. Weight the results by author. Some authors publish a lot, some publish very little, some publish minor variations on the same theme. In the Cook system it is the papers which list agreement, not the scientists, that were reported. Tie the ratings of each paper to its authors’ names; this would be another way of analyzing consensus (see the sketch after this comment).

    2. Weight papers by publication year. In the early 2000s we had just recently entered the current hiatus, so results were overwhelmingly tied to the preceding 20-year period of warming; now papers are a bit more cautious. Mapping the “consensus” over time is an intriguing endeavor.

    3. Weight papers by models vs. observational data sets. This could be a useful means of comparing and contrasting the data sets.

    4. Weight the results by unrefuted papers.

    There is a lot of data in 10k+ papers, and there were lots of issues in the Cook study. Addressing both will be a time-consuming PITA, but it may be the most effective way to refute a ridiculous argument.
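
A rough sketch of the author-weighting in point 1 of the analysis list, with invented data. The naive rule here counts an author as endorsing if any of their papers does, which a real analysis would need to refine:

```python
from collections import defaultdict

# (paper_id, authors, endorsement) tuples; all data is made up for illustration.
papers = [
    ("p1", ["Smith", "Jones"], "endorse"),
    ("p2", ["Smith"],          "endorse"),
    ("p3", ["Lee"],            "reject"),
]

# Per-paper share counts each paper once, however many times an author appears...
paper_share = sum(e == "endorse" for _, _, e in papers) / len(papers)

# ...while per-author share counts each author once, however often they publish.
author_positions = defaultdict(set)
for _, authors, endorsement in papers:
    for author in authors:
        author_positions[author].add(endorsement)
endorsing_authors = sum("endorse" in positions for positions in author_positions.values())
author_share = endorsing_authors / len(author_positions)

print(paper_share, author_share)
```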

  48. hunter, you’re welcome! Just about everyone accepts the greenhouse effect. Those who don’t get mocked by people on all sides of the debate. That’s how Cook et al could get their results.

    They didn’t actually measure that consensus though. The trick is they didn’t actually measure anything. Shortly after this paper came out, I pointed out that the “consensus” they found doesn’t reflect anything serious, and that more abstracts were rated as rejecting serious concerns than as endorsing them. Later on, I pointed out they didn’t actually measure anything, quoting John Cook as saying:

    Okay, so we’ve ruled out a definition of AGW being “any amount of human influence” or “more than 50% human influence”. We’re basically going with Ari’s p0rno approach (I probably should stop calling it that 🙂 which is AGW = “humans are causing global warming”. Eg – no specific quantification which is the only way we can do it considering the breadth of papers we’re surveying.

    That was their trick. They mixed together the statements, “Humans contribute to global warming” and “Humans cause the majority of global warming,” knowing most would support the former and few would comment on the latter. The ones discussing the former give you ~94%; the rest discussing the latter give you your ~3%. The remaining ~3% are those papers supporting both statements.

    If they had only examined the first “consensus” statement, they would have gotten unbelievable results of 99%+. If they had only examined the second “consensus” statement, they’d have gotten results of ~40%.

    Or at least, that’s my hypothesis. I’d like to test that idea with the project I’m discussing in this post.

  49. @ Brandon on May 20th
    “In fact, laypeople might be better than knowledgeable people. Knowledgeable people may have biases due to their exposure to papers that laypeople wouldn’t have.”

    Thanks Brandon,

    That is a great suggestion and I wholeheartedly agree! As a “layman” who has had to rely on the weather as a farm manager, but one with no “formal” education, I believe people like me would be able to give some really hands-on observations just from experience built up over decades.
    I learned hands-on (taught by my father), and as I grew older the people coming out of colleges or universities with all their PhDs and doctorates could not even drive a tractor (let alone start one) or calibrate a sprayer. (BTW, I have weather records dating back years and years.)
    So count me in!!

  50. Brandon,
    Thanks for clarifying that. Yes, it seems Cook was pulling a cutesy publicity stunt. In a way this is even more damning than if he had only fabricated the phonied up parts.
    His study was designed to hide information, silence critics and reduce the amount of informed thinking in the public square.

  51. Brandon,

    Thanks for your reply.

    Although I’m not sure of the value of what you are trying to do, I’m happy to assist with ratings.

    I do have a proper job but I can probably fit in a bit of time in my teabreaks 😉

    Clovis

  52. Brandon,

    I admire your courage.

    While I hold no opinion on the AGW issue, I love science and efforts like yours are desperately needed. Therefore, I would like to see if I can assist you.

    I’m a hobby Ruby on Rails developer. The app you’ve described doesn’t sound too complex. As far as design is concerned, I believe Twitter’s Bootstrap is pretty enough.

    Exams are over in two weeks. Feel free to contact me using my email. I will work pro bono, or for whatever you think is appropriate.

  53. Hi Brandon,

    Handshake bit: I’ve come over from WUWT, first time here. I’ve followed WUWT for about 3 years and occasionally made comments there, using a Yahoo address rather similar to the Gmail address I’ve given you (I’m gradually changing over).

    I’ve been involved in education and examining for many years, using and designing rubrics for evaluation.

    I think the suggestions so far could still invite criticisms of ‘subjective!’, so I suggest something more countable or identifiable, something that two or more people could agree on objectively (or be wrong about visibly and objectively). We would need a number of categories, which of course would need to be debated and decided, and clearly written statements (‘descriptors’) corresponding to ‘grades’. For example, a category of ‘Use of evidence’:

    Use of evidence (0 – 5)

    0: Makes statement without reference
    1: Refers to one report / paper which supports statement
    2: Refers to several (2 or more) papers which support statement
    3: Also refers to 1 or more papers which dispute statement
    4: Results reported in the paper support statement
    5: Makes statement unambiguously in relation to the references and new results

    How would this work? Well, many papers simply assume AGW, and many of them simply refer to one or another IPCC report, then move on to talk about the range of bumble bees or whatever. These would get 0 or 1 in relation to the statement ‘AGW is real’. If there are more relevant references to the literature, then 2; any proper treatment of a topic should include references to differing views / results, so a paper that does so would get 3. (No review paper, which does not report new results, would get more than 3.) 4 and 5 would cover how definitively the paper makes the statement in relation to the evidence. To get any particular grade, the paper would have to do the things mentioned in the descriptions of the lower grades. (The descriptors I’ve put are first thoughts, open to criticism and refinement.)

    We would need at least two people to evaluate each paper, plus a mediator to decide in the case that their evaluations do not agree.

    All of this would have to be discussed, debated, criticized and finally agreed and published before any evaluation of the papers was done. Rubrics could be trialled and evaluators could be trained using scientific papers on any unrelated topic first (for example, Biology Direct is open access, and also has open peer review).

    Happy to be involved.

  54. I hope you are planning a small scale test of your rating methodology before asking for help with thousands of abstracts. The value of such a survey will depend greatly on exactly what information you collect.

    Many papers study the consequences of consensus (usually IPCC) projections of climate change made only using climate models, implicitly endorsing those models but not providing any data supporting those projections. Those papers that mention only an upper limit for climate change (or simply define future change in catastrophic terms) in their abstract are inherently alarmist and probably belong in a separate category from implicit endorsements of a consensus range for a specified emission scenario. The few papers that mention the larger uncertainty in the consensus estimates for TCR or ECS – a more accurate estimate of the uncertainty in projections – are the only ones that recognize that the range of model output does not fully reflect the full uncertainty acknowledged by WG1. So I see three categories of papers implicitly supporting the “consensus”: the alarmists who mention only worst case scenarios, the naive who cite the GCM consensus range for a given emission scenario, and the realists who cite the wider range of possible futures based on our understanding of TCR and ECS. (For completeness, a category for those who de-emphasize risk by only mentioning lower limits for change would be appropriate.) I think a survey will show most implicit endorsements of the consensus are alarmist and naive.

    The most important papers are those that contain data useful for predicting future climate change or evaluating the reliability of AOGCMs (and EBMs and other methods for determining sensitivity to CO2). Since the consensus range is so wide, even papers that differ significantly from the IPCC consensus (Otto 2013 or Lewis 2013) don’t contradict that consensus. Rating the amount of support for the consensus in these papers is nearly impossible. Lindzen and Choi is one of the few that contradict the consensus. A few papers show that observations are inconsistent with model output in rainfall, temperature rise during the hiatus, the hot-spot in the upper tropical troposphere, etc. However, the percentage of papers showing a disagreement between models and observations is misleading – even 1 paper out of the 10,000 could invalidate the consensus, if that paper were strong enough. AR5 and recent blog postings (Climate Dialogue, Lewis at GWPF) have compiled all the useful published estimates of TCR and ECS, so there is no need for your survey to attempt to report on this critical subject.

  55. Clovis Marcus, you’d have to take your tea very seriously to not be able to squeeze in some ratings while on a teabreak!

    Nadeem J. Qureshi, thanks! I’m not imagining anything complicated, and that’s part of the reason I don’t want to use a framework for this. The biggest benefit of frameworks is simplifying things. When it’s already simple, that’s not as big a deal. I’d rather not give up the control and flexibility to use one.

    That said, future developments may make a framework more desirable. That’s especially true in regards to visuals and interfaces. I’ve never used Twitter’s Bootstrap, but from what I hear, it might be good for a project like this. That’s something I’ll definitely consider if the project gains enough interest. And of course, I’ll be happy to accept help from people on it.

    Peter Hannan, your scale might be good for examining the amount of evidence abstracts provide, but that’s not something I intend to look at right away. There’s plenty of work to be done just addressing the issues Cook et al covered. Expanding beyond that would be great, but it’s something to be done in stages.

    By the way, I don’t think we need to reach a final agreement on how to handle this idea before starting ratings. One benefit of breaking data into small samples is you can test ideas against individual samples to see how they work without “peeking” at the other data.

    I am thinking I should try to do a write up of the project’s initial scope though. If I expect anyone to contribute financial support for a project, they probably ought to know just what the money is for. The problem is figuring out where I want to draw the line for a preliminary system.

    (By the way, I have decided to try to do this. I should be setting up the database and server over the next few days.)

  56. Frank, the plan so far has been to do ratings for only a small number of abstracts at a time. I’ve been thinking of doing it in ~100 abstract samples. That should allow a healthy amount of testing of the methodology.

    I’m thinking I may also want to have a handful of people test it out before opening it to the public. I’m not sure how I’d pick them though.

  57. My problem with “supporting the consensus” is what does “the consensus” refer to and how is it being supported? A worst-case projection? A range for a given emissions scenario/RCP (presumably business as usual)? The wider range of possible futures associated with the IPCC’s 70% ci for ECS and TCR?

    Is the consensus being supported by citing projections in papers that use those projections but do not support them with new data/analysis? Is the consensus being confirmed by the data/analysis in a paper? What kind of paper would be inconsistent with a 70% ci of 1.5-4.5 degC for ECS? Part of the pdf for ECS for even Lindzen and Choi overlaps with this consensus pdf for ECS. What if some observational data (change in rainfall or the hiatus in temperature rise) is inconsistent with the multi-model mean?

    If I were to go to the trouble of challenging the 97% consensus, I would want to find out what is and ISN’T supported by 97% of the papers. The vast majority of these citations don’t acknowledge the possibility that ECS could be 1-2 degC. Do those papers support the IPCC consensus or DISTORT it?

  58. Frank, one benefit to this idea is it’d allow us to examine any questions we wanted. It’d be easy to change the questions/add new columns in the database.

    Speaking of which, does anyone have suggestions on how I should pick some testers for this? I won’t need any for a bit, but it’d be nice to be able to get feedback once I’ve got things (partially) running.
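
On adding questions later: one option (just a sketch, not a decision) is to store answers in “long” form, so a new question is a new value rather than a new column. The table and column names here are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Each row is one rater's answer to one question about one abstract, so adding a
# question later never requires changing the schema.
conn.execute("""
CREATE TABLE answers (
    rater_id    TEXT,
    abstract_id TEXT,
    question    TEXT,   -- e.g. 'endorsement', 'quantification', or anything added later
    answer      TEXT
)""")
conn.execute("INSERT INTO answers VALUES ('r1', 'a1', 'endorsement', 'endorse')")
conn.execute("INSERT INTO answers VALUES ('r1', 'a1', 'mentions_threat', 'no')")
print(conn.execute("SELECT * FROM answers").fetchall())
```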

  59. Ask those of us who have commented on this post. We have some interest in the project. Even (especially?) critics of it would be helpful. Contact us privately.

  60. “Endorse AGW” or “Don’t endorse AGW” is the wrong question to ask.

    “Endorse CAGW” or “Don’t endorse CAGW” is the relevant scientific question. Oh, and if you believe in the “C” part (catastrophic), then what is your plan to mitigate and what is the cost benefit analysis?

    I completely endorse AGW but I’m not the least bit worried about it at this point. Humans have added maybe half a degree of temp and 2″ of sea level. In exchange, we have amazing benefits of fossil fuels. Seems a fair trade to me but I like to keep an open mind as new evidence rolls in. The last 7 years or so has not been favorable to the “C” part.

    I’m an atmospheric scientist and proudly part of the so-called 97%. But I won’t be invited to ride on Al Gore’s jet anytime soon.

  61. I think an authoritative review of scientists’ views would make for an interesting paper. A series of questions, extremely carefully worded so as to be clear, covering the bulk of “climate science”. In addition to this data comprising their answers, it should have links to papers/studies/research in the field that provide the basis for the scientists’ answers/views.

    This way we can bind real science to this debate, and convince scientists who have heretofore avoided stating publicly what they know to be true, for fear of retribution in the workplace, to come forward with honesty on these topics. If enough scientists come out publicly together in their opposition to CAGW, this could be the impetus for restoring integrity to the climate sciences.

  62. This all sounds very interesting & bravo for doing it.

    Judith Curry has written some good articles on the relevance & meaningfulness of consensus & non-consensus on her blog. She also has some interesting pieces on Skepticism, ‘Deniers’ & Denialism.

    I think rather a lot hinges upon that 97% figure in much of the real world & if some sort of qualified quantifiable methodical analysis can be done to –

    . quantify some real figures of some meaningful categorisations of the fields of science relevant to climate science,
    . the researchers in those different sciences contributing to climate science,
    . their support for AGW/DAGW/CAGW in various ways & the extent of their support

    then, you’re on the way to somewhat blunting the relevance of the widely accepted & supposedly inarguable 97%
