Disagreement vs. Lack of Agreement

If you don’t agree with me, you’re stupid.

Not really. Being right doesn’t mean you have to agree with me. I could, theoretically, be wrong. You’d be a fool to disagree with me. You wouldn’t be a fool to be ambivalent, though. Not taking a position is different from taking the position opposite mine.

Why do I point out the obvious? It’s not because I like saying I’m right. I’m too humble for that. I just wanted to make it easy to understand how some critics of John Cook’s paper examining the “consensus” on global warming went way off-base. It first came up here:

The implication of such error, i.e. of inter-observer agreement and reliability, can be calculated. In the Cook group data, kappa is 0.08 (p <<< 0.05). The Cook rating method is essentially completely unreliable. The paper authors’ ratings matched Cook’s for only 38% of abstracts. A kappa score of 0.8 is considered ‘excellent’; score less than 0.2 indicates worthless output.

In reality, the author ratings are the weakest link: they invalidate the conclusions of the paper. It is evident the reviewers have not looked at the data themselves: they would have seen through the trickery employed.

The issue is simple. Cook and his associates rated summaries (abstracts) of many papers, attempting to determine how many endorsed a particular “consensus.” Authors of some of those papers did the same, only with the whole papers (instead of summaries). The results agreed only 38% of the time.
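
As background for the kappa figures in that quote: Cohen’s kappa compares the agreement two raters actually achieve with the agreement they’d be expected to reach by chance, given how often each rater uses each category. Here’s a minimal sketch of the calculation, using made-up labels rather than anyone’s actual ratings:

    from collections import Counter

    def cohens_kappa(ratings_1, ratings_2):
        # Observed agreement: how often the two sets of ratings match.
        n = len(ratings_1)
        p_observed = sum(a == b for a, b in zip(ratings_1, ratings_2)) / n
        # Chance agreement: how often they'd match if each rater assigned
        # categories independently, at their own observed frequencies.
        counts_1, counts_2 = Counter(ratings_1), Counter(ratings_2)
        p_chance = sum(counts_1[c] * counts_2.get(c, 0) for c in counts_1) / n ** 2
        return (p_observed - p_chance) / (1 - p_chance)

    # Made-up ratings, for illustration only.
    abstract_ratings = ["endorse", "endorse", "no position", "endorse", "reject", "no position"]
    paper_ratings    = ["endorse", "no position", "no position", "endorse", "endorse", "endorse"]
    print(round(cohens_kappa(abstract_ratings, paper_ratings), 2))  # roughly 0.1

A kappa near 1 means the two sets of ratings line up far better than chance would produce; a kappa near 0 means they line up about as well as chance would, which is the range the quote above is complaining about.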

That sounds bad. According to the quote above, it “invalidate[s] the conclusions of the paper.” Only, it doesn’t. That there’s a lack of agreement ~62% of the time tells us nothing about how much disagreement there is.

The quoted post uses this example:

A bird reserve hires a fresh enthusiast and puts him to do a census. The amateur knows there are 3 kinds of birds in the park. He accompanies an experienced watcher. The watcher counts 6 magpies, 4 ravens and 2 starlings. The new hire gets 6 magpies, 3 ravens and 3 starlings. Great job, right?

No, and here’s how. The new person was not good at identification. He mistook every bird for everything else. He got his total the same as the expert but by chance.

If one looks just at aggregates, one can be fooled into thinking the agreement between birders to be an impressive 92%. In truth, the match is abysmal: 25%. Interestingly this won’t come out unless the raw data is examined.
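
To see where those two numbers come from, here is one hypothetical set of per-bird calls consistent with the quoted totals. The category totals nearly coincide even though the individual calls barely do:

    from collections import Counter

    expert = ["magpie"] * 6 + ["raven"] * 4 + ["starling"] * 2
    # Hypothetical novice calls matching the quoted totals
    # (6 magpies, 3 ravens, 3 starlings) with only 3 correct identifications.
    novice = ["magpie", "magpie", "raven", "raven", "starling", "starling",
              "magpie", "magpie", "magpie", "magpie", "raven", "starling"]

    # "Aggregate" agreement: how much the category totals overlap.
    e, n = Counter(expert), Counter(novice)
    print(sum(min(e[k], n[k]) for k in e) / len(expert))              # ~0.92

    # Per-bird agreement: how often the two calls actually match.
    print(sum(a == b for a, b in zip(expert, novice)) / len(expert))  # 0.25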

But it’s easy to see why this example is misleading. The only times the two individuals failed to agree in this example are when one disagreed with the other. That’s not the case with Cook et al’s paper. Their paper specifically points out:

Among self-rated papers not expressing a position on AGW in the abstract, 53.8% were self-rated as endorsing the consensus.

Rating a paper as “not expressing a position” is akin to taking no position. That is, in 53.8% of the cases where Cook et al’s abstract ratings took no position, the papers’ authors rated their own papers as endorsing a particular “consensus.” That’s no surprise. A full paper has more information than a summary. Of course fewer full papers will take no position than their abstracts do.

Let’s go back to the bird watcher example. Suppose the fresh enthusiast in the example knew his limitations. Suppose he knew what a raven looked like, but not what a magpie or starling looked like. He saw 12 birds, rated four as ravens and said he didn’t know what the other eight were. His agreement with the skilled bird watcher would only be 33%. Would that “invalidate” anything?

Of course not. That they failed to agree ~62% of the time, I mean, ~67% of the time, doesn’t mean anyone was wrong. It doesn’t mean the fresh enthusiast was stupid. It doesn’t mean we can’t trust the fresh enthusiast’s answers. All it means is that when different things (e.g. the ornithological knowledge of two people) are measured, you get different answers.
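
Here is the modified example in the same style (again with hypothetical labels): the failure to agree is large, yet none of it is disagreement.

    expert = ["magpie"] * 6 + ["raven"] * 4 + ["starling"] * 2
    novice = ["don't know"] * 6 + ["raven"] * 4 + ["don't know"] * 2

    agree = sum(a == b for a, b in zip(expert, novice))
    no_position = novice.count("don't know")
    disagree = len(expert) - agree - no_position

    print(agree / len(expert))        # ~0.33 agreement
    print(no_position / len(expert))  # ~0.67 no position taken
    print(disagree / len(expert))     # 0.00 actual disagreement

Every bit of the non-agreement is of the “no position” kind. That’s the whole distinction between a lack of agreement and disagreement.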

So when Richard Tol publishes a paper criticizing the Cook et al paper, saying (in part):

No less than 63% of abstract ratings differ from the paper ratings, 25% differ by more than 1 point, and 5% by more than 2 points

Realize it doesn’t tell us anything useful. A failure to agree 63% of the time doesn’t automatically indicate a problem. If it did, Cook et al wouldn’t have highlighted the fact there were substantial differences in the results. That means the only problem Tol highlights regarding this is:

0.7% of ratings were rejections in one case and endorsement in the other (Fig. S21).

But disagreement ~1% of the time is hardly notable. It certainly doesn’t “invalidate the conclusions of the paper.” It doesn’t mean “trickery [was] employed.” The only “trickery” here is from some critics of Cook et al. They are the ones saying if two people don’t agree, one must be stupid.

They can dress it up with statistical tests. They can pontificate upon things like Cohen’s kappa or Krippendorff’s alpha values. It doesn’t matter. It’s all nonsense. It’s logically no better than if they said, “If you’re not with us, you’re against us.”

Wait for it…


21 comments

  1. In your example of disagreement in bird categorization, if the rater didn’t recognize a bird, that’s a problem with the observer lacking the education needed to appropriately categorize the birds. That’s a problem with the study and the lack of agreement in that case is actually informative.

    This isn’t that dissimilar to the problem here: if a rater didn’t know how to rate a particular paper, that could indicate a problem of the “lacking proper training” or “inappropriate education level” variety. As above, “fresh enthusiasts” should have no role in scientific measurement. If they are included, that already indicates a methodological error.

    I should mention that Cohen’s kappa is an agreed-upon standard metric for this type of data, though there are issues with the method:

    Cohen’s kappa coefficient is a statistical measure of inter-rater agreement or inter-annotator agreement for qualitative (categorical) items. It is generally thought to be a more robust measure than simple percent agreement calculation since κ takes into account the agreement occurring by chance. Some researchers have expressed concern over κ’s tendency to take the observed categories’ frequencies as givens, which can have the effect of underestimating agreement for a category that is also commonly used; for this reason, κ is considered an overly conservative measure of agreement.

    That said, high values of κ are generally more informative than low values about the quality of the data measurement process. If you get a value of κ above 0.8, that is going to be generally accepted as “good data”. A very low value of κ, on the other hand (0.08 in this data set, if I remember correctly), is not particularly diagnostic by itself: It could indicate an innocuous problem, or it could be more substantive in nature. I think, in looking at the problems with the methodology of this paper, it is predominantly the result of bad methodology rather than an avoidable problem.

    However, I can’t imagine very many papers where such a low level of agreement would allow for a publishable paper, without some effort made to characterize why the disagreement was present. Simply adding some waffling language, as Cook did, to suggest that his data may not be as terrible as they look at first sight, isn’t generally considered the appropriate level of care for peer-reviewed research.

    If he has a hypothesis as to why he is getting the level of disagreement he is getting, he needs to produce evidence that his hypothesis is correct. After all, data that we can’t establish the validity of are actually worse than useless. They cost people more time and money and other resources dealing with the mess that was made in producing these data, than if the study had never actually been written up.

    I believe the origin of the level of disagreement is one question that your own replication might yield insights into, though the question of training of raters and adjusting for the various levels of education of the raters is still an issue there.

    With regards to this:

    A full paper has more information than a summary. Of course fewer papers will take no position than their summaries.

    That’s not obviously true, and may even be false. People do put “throwaway” statements in the abstract to “sex up” their paper, where no real position was taken in the body of the paper.

    Looking at the level of agreement on abstracts and comparing it to the level of agreement on the papers themselves might tell you something about the forces that produce conformity of opinion in science, but I can’t imagine it would be a particularly useful way of analyzing the actual level of consensus indicated by the body of the paper itself.

  2. * it is predominantly the result of bad methodology rather than an [unavoidable] problem.

  3. Carrick:

    In your example of disagreement in bird categorization, if the rater didn’t recognize a bird, that’s a problem with the observer lacking the education needed to appropriately categorize the birds. That’s a problem with the study and the lack of agreement in that case is actually informative.

    It’s informative in the sense a low kappa score is necessary for a problem to exist, but not sufficient to indicate one does exist. I might be okay with people saying that. Nobody has. The argument in the post I linked to and in Richard Tol’s peer-reviewed paper is merely: there are low kappa scores, therefore Cook et al’s paper is junk.

    A low kappa score means two data sets do not measure the same thing with any skill. In Cook et al’s paper, we knew that in advance. In the bird watcher example, it’d be like having one person stand farther away than the other, and having them report that the person farther away couldn’t see the birds as well.

    In neither case does it tell us anything if somebody comes along and points out the kappa scores are low. Everyone would have stipulated to that in advance.

    If he has a hypothesis as to why he is getting the level of disagreement he is getting, he needs to produce evidence that his hypothesis is correct.

    If the disagreement were central to his results, I’d agree. I don’t agree in this case though. The self-ratings were offered as a form of checking results. They show a similar pattern to the primary results, and there’s a perfectly plausible explanation for what disagreements there are. That’s sufficient for what’s basically an optional test.

    That’s not obviously true, and may even be false. People do put “throwaway” statements in the abstract to “sex up” their paper, where no real position was taken in the body of the paper.

    While it’s possible an abstract will take a position not supported by any other portion of the paper, abstracts are part of the paper. When authors were asked to rate papers as a whole, that included the abstracts.

  4. Brandon:

    It’s informative in the sense a low kappa score is necessary for a problem to exist, but not sufficient to indicate one does exist.

    Put another way, the data fail to validate, but you don’t know why, because Cook made zero effort to find out. He has an obligation as the author to demonstrate that his measurements are valid. Instead he seems to argue that it doesn’t matter that the data can’t be validated.

    If the disagreement were central to his results, I’d agree.

    If you can’t make a statement regarding the validity of the rankings, what exactly is he measuring?

    While it’s possible an abstract will take a position not supported by any other portion of the paper, abstracts are part of the paper.

    I don’t agree with this argument:

    The abstract is meant to summarize the body of the paper. “Results” that appear in the abstract, but fail to be substantiated in the body should carry zero weight. They shouldn’t appear in principle, but oddly they sometimes get added at the insistence of reviewers (and sometimes even editors) in order to get the paper accepted for publication.

    When authors were asked to rate papers as a whole, that included the abstracts.

    Unless you interrogated the authors, you have no way of knowing how they made the rankings:

    If you asked me to rank a paper in terms of its position on AGW, I would do it as an author based on what I knew the paper to actually say.

  5. Brandon:
    That was not my point at all. Cook et al. present the paper ratings as a validation for abstract ratings. This is not supported by their data. First, they use a non-representative subsample to validate their sample. Second, their data rejects the null that papers and abstracts are rated the same. Their validation test thus fails on two scores.

  6. Carrick:

    Put another way, the data fail to validate, but you don’t know why, because Cook made zero effort to find out. He has an obligation as the author to demonstrate that his measurements are valid. Instead he seems to argue that it doesn’t matter that the data can’t be validated.

    No. That kappa scores “fail” in no way indicates the data is unacceptable or unsuitable for their purposes. That means it is not a validation test.

    If you can’t make a statement regarding the validity of the rankings, what exactly is he measuring?

    This is built upon a false premise. Low kappa scores in no way mean “you can’t make a statement regarding the validity” of the data.

    I don’t agree with this argument

    Abstracts are included in the published paper. What makes you feel they are not part of the paper?

    Unless you interrogated the authors, you have no way of knowing how they made the rankings

    I’m not sure what your point is. I never claimed to know how they made the rankings. What I said is twofold: 1) They were asked to rate papers as a whole; 2) Abstracts are part of the papers. I don’t see how either point can be disputed.

  7. Richard Tol, you claim this “was not [your] point at all,” yet there is no meaningful distinction between what I discussed and your claim:

    Second, their data rejects the null that papers and abstracts are rated the same. Their validation test thus fails

    As I pointed out in this post, there was absolutely no reason anyone should have expected the papers and abstracts to be rated the same. Cook et al directly indicated we shouldn’t expect such. The low level of agreement was inevitable, and it was highlighted by the authors of the paper.

    First, they use a non-representative subsample to validate their sample.

    Whether or not it is true is irrelevant to this discussion. This post is about one point, and one point only. That point is, several critics of Cook et al claim their results “fail” because of an entirely expected result the authors told us about.

  8. Which seems a worthy critique to uh… critique, given that it misses or otherwise distracts from real problems with said… results, to use the term loosely.

  9. Brandon:

    No. That kappa scores “fail” in no way indicates the data is unacceptable or unsuitable for their purposes. That means it is not a validation test.

    I think you are getting the valence of the test backwards, and not truly understanding the significance of the term of art “fails to validate”.

    Kappa tests whether the data are acceptable/suitable. Failure to validate means you don’t know whether they are valid or not. That makes the data useless.

    This is built upon a false premise. Low kappa scores in no way mean “you can’t make a statement regarding the validity” of the data.

    No, it isn’t a false premise: A low score exactly means that the data didn’t validate. It doesn’t mean they are invalid…but it does mean you don’t know whether they are valid.

    It exactly means “you can’t make a statement regarding the validity” of the data.

    This is exactly the scenario Wolfgang Pauli was thinking of when he described a result as “not even wrong”.

    Simply having a plausible theory for why the data could have a low kappa value means nothing, unless you can test the theory. Otherwise it’s just another version of smoke and mirrors.

    Abstracts are included in the published paper. What makes you feel they are not part of the paper?

    They are meant to be a summary of the contents of the paper, not a replacement for it.

    Any competent academic will tell you that reading an abstract is not a good proxy for reading a paper. Even reading the conclusions (which are long-winded compared to the abstract) can be misleading.

    1) They were asked to rate papers as a whole; 2) Abstracts are part of the papers. I don’t see how either point can be disputed.

    I don’t dispute either of these. My point was that if you asked the authors to rate the paper it would be based on the contents of the paper, rather than the abstract of the paper.

    In other words, if you are rating papers based on abstracts, you’re using a summarized account of the information contained in the paper that the authors themselves will likely ignore.

    How useful could that be?

  10. Max, that’s a point some people seem to miss. Bad arguments weaken your case. Even worse, they weaken the case of everyone on your “side.” The best thing people can do is focus on a few key points. Everything else should be discussed only as asides.

    The entire Cook et al PR campaign rests upon them misrepresenting their results, to the point they are now flat-out lying about what they found. People should focus more on that and less on exaggerating the results of trivial tests which tell us little, if anything.

  11. Brandon:

    As I pointed out in this post, there was absolutely no reason anyone should have expected the papers and abstracts to be rated the same. Cook et al directly indicated we shouldn’t expect such. The low level of agreement was inevitable, and it was highlighted by the authors of the paper.

    So put another way, we learned nothing about the validity of the data by comparing the two. But so what if Cook acknowledges this? That fixes nothing. It still remains a fact that you have no idea whether the data are valid or not. And it tells me something about the very poor quality of his supposed scholarship that he retained this comparison, admitted it was useless, but still trumpeted it as though it were a key result.

    The point of measurement in science is to obtain data that are valid, rather than just data that are not provably invalid.

    Anyway, demanding that somebody prove that the data are invalid is really a form of demanding that somebody “prove a negative.” The point of validation testing in science and engineering is to confirm that the data pass the validation testing, and so can be used. Not knowing whether the data are valid is exactly equivalent to saying “this result is not trustworthy as we have no way of demonstrating that it is correct.”

  12. Carrick:

    Kappa tests whether the data are acceptable/suitable. Failure to validate means you don’t know whether they are valid or not. That makes the data useless.

    No, it doesn’t. Kappa values are used to analyze the ability of two data sets to measure the same thing. It has no inherent meaning regarding the validity of the two data sets. We have to interpret the kappa values to figure out what they mean.

    If you have a thermometer and a timer, you don’t say low kappa scores prove their data is invalid. You never expected them to measure the same thing so you never expected to get a high kappa score. Similarly, you can have situations where high kappa scores indicate a problem.

    If anything, low kappa scores in the Cook et al data set are a good thing. One of the hypotheses they set out to test was there’d be a significant amount of non-agreement between abstract and paper ratings. High kappa scores would run contrary to their premise.

    This is particularly interesting because it is an actual validation test. Their hypothesis was both data sets would show a particular pattern (their “consensus”), would exhibit significant amounts of non-agreement (low kappa scores), and the non-agreement would take a particular form (polarization). Their data passed that test with flying colors, and you guys are claiming that proves their data is untrustworthy.

    I don’t dispute either of these. My point was that if you asked the authors to rate the paper it would be based on the contents of the paper, rather than the abstract of the paper.

    You acknowledge the abstract is part of the paper, yet you claim people rating the paper as a whole would (should?) disregard the abstract. I find that impossible to understand. If a person is asked to rate a paper as a whole, they are, by definition, asked not to ignore any part of the paper.

    So put another way, we learned nothing about the validity of the data by comparing the two. But so what if Cook acknowledges this? That fixes nothing. It still remains a fact that you have no idea whether the data are valid or not.

    Cook et al did not acknowledge that, and it is not remotely true. You’re acting as though one test is the only test possible. Not only is that wrong, it requires ignoring the fact Cook et al discussed a different test.

    Amusingly, cherry-picking a single test like you guys have done means you can’t call what you’re doing validation testing. By acting as though a single test’s results are all that matter, you define “fails to validate” as “invalidates,” the very confusion you suggested I was guilty of.

    Anyway, demanding that somebody prove that the data are invalid is really a form of demanding that somebody “prove a negative.”

    You guys could have tried to show the data doesn’t fit the validation test discussed by Cook et al. You could have also tried to show the validation test they discussed was inappropriate. Neither of those amounts to trying to “prove a negative.”

  13. It is a bit of an illusion to think that the authors went back to the paper, carefully read it, and faithfully reported its contents.

    More likely, the authors’ rating reflects a mix of their recollection of the paper, their current opinion, and perhaps a quick reread of the abstract.

  14. Richard Tol, I’d suggest focusing on the issues at hand rather than bringing up new topics. It’d be helpful if you could explain how data showing the exact patterns Cook et al predicted it’d show would mean their data was untrustworthy.

    As it stands, that just seems like when you said finding patterns in sorted data proves that data is untrustworthy – beyond silly.

  15. Brandon:

    No, it doesn’t. Kappa values are used to analyze the ability of two data sets to measure the same thing. It has no inherent meaning regarding the validity of the two data sets. We have to interpret the kappa values to figure out what they mean.

    Technically, Cohen’s kappa is designed to measure “level of agreement.” A high level of kappa validates that the two data sets are measuring the same thing.

    A low level of agreement doesn’t demonstrate that the two data sets aren’t measuring the same thing (see “proving the negative”), but it does mean you can no longer say “they are measuring the same thing”.

    If the claim is that the authors’ ratings represent the entire paper rather than the abstract, then a high level of agreement between the raters’ aggregate results and the authors’ results would indicate they are measuring the same thing.

    A low value of kappa means we can’t distinguish between “measuring the same thing” and “results obtained by chance”.

    Good science hinges on being able to make this distinction.

    If anything, low kappa scores in the Cook et al data set are a good thing. One of the hypotheses they set out to test was there’d be a significant amount of non-agreement between abstract and paper ratings. High kappa scores would run contrary to their premise.

    The experimental design was so poor that an adequate level of agreement could not in principle be obtained. And this is now a good thing?

    That’s not how good science is done: You need to be able to distinguish between whether your result was a meaningful measurement or whether it is random noise.

    If this has validated anything, it is only that the experimental design is very poor and that the data should never have been published.

    This is particularly interesting because it is an actual validation test.

    It’s not a validation test unless you can predict the level of kappa you would get just by chance and then show that the value of kappa that you measured

    You acknowledge the abstract is part of the paper, yet you claim people rating the paper as a whole would (should?) disregard the abstract. I find that impossible to understand. If a person is asked to rate a paper as a whole, they are, by definition, asked not to ignore any part of the paper.

    Based on this argument, we’d expect a high level of agreement in looking at abstracts versus the entire paper. So a low value of kappa would be a bad thing

    These are not self-consistent arguments.

    Anyway, it is well understood by anybody who is engaged in paper writing that abstracts are “politicized” accounts of the contents of the paper, and not a reliable measure of the contents of the paper.

    Cook himself is acknowledging this when he cautions against expecting a high level of agreement.

    Cook et al did not acknowledge that, and it is not remotely true.

    Cook acknowledges that he does not expect a good level of agreement; that is, we cannot verify that the two data sets are measuring the same thing. You’ve even acknowledged this, so it is completely true that “we learned nothing about the validity of the data by comparing the two” data sets using kappa.

    Low values of kappa can occur by chance. With this experiment high values of kappa, apparently by design, can never be achieved.

    So we can never distinguish random noise from measurement.

    This is not good science. This should not be publishable data.

    You guys could have tried to show the data doesn’t fit the validation test discussed by Cook et al. You could have also tried to show the validation test they discussed was inappropriate. Neither of those amounts to trying to “prove a negative.”

    Cook’s validation test was very weak. It was not a sufficient metric for testing whether the data sets are measuring the same thing. That has been discussed.

  16. Truncated sentence: “It’s not a validation test unless you can predict the level of kappa you would get just by chance and then show that the value of kappa that you measured [was significantly different than the value that would be obtained by chance]”
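
    As a rough illustration of what that kind of chance baseline looks like (using made-up ratings, not the Cook et al data), you can shuffle one rater’s labels many times and see what values of kappa turn up by chance alone. The sketch below leans on scikit-learn’s cohen_kappa_score as one common implementation:

        import random
        from sklearn.metrics import cohen_kappa_score

        # Made-up ratings, for illustration only.
        rater_a = ["endorse"] * 60 + ["no position"] * 30 + ["reject"] * 10
        rater_b = ["endorse"] * 55 + ["no position"] * 35 + ["reject"] * 10

        observed = cohen_kappa_score(rater_a, rater_b)

        # Chance baseline: shuffle one rater's labels, recompute kappa, repeat.
        shuffled, null_kappas = rater_b[:], []
        for _ in range(1000):
            random.shuffle(shuffled)
            null_kappas.append(cohen_kappa_score(rater_a, shuffled))

        # The observed kappa only means something if it clearly exceeds what
        # the shuffled, chance-only distribution produces.
        print(round(observed, 2), round(max(null_kappas), 2))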

  17. Carrick:

    Technically, Cohen’s kappa is designed to measure “level of agreement.” A high level of kappa validates that the two data sets are measuring the same thing.

    Nobody claimed “the two data sets are measuring the same thing.” Ever. You guys are the only ones talking about that idea. Cook et al agreed, from the start, the two data sets don’t measure the same thing.

    The experimental design was so poor that an adequate level of agreement could not in principle be obtained. And this is now a good thing?

    No. You’re just creating an imaginary problem by creating artificial standards.

    It’s not a validation test unless you can predict the level of kappa you would get just by chance and then show that the value of kappa that you measured

    You seem to act like kappa scores are automatically necessary for every validation test. They’re not. Kappa scores can be completely ignored while running half a dozen different validation tests.

    Based on this argument, we’d expect a high level of agreement in looking at abstracts versus the entire paper. So a low value of kappa would be a bad thing

    These are not self-consistent arguments.

    You offer no explanation for this claim, and I have no idea how you think it could be true. You ought to explain what contradiction there is if you’re going to accuse someone of making contradictory arguments. Otherwise, I could just wave my hands and say the same about you.

    Also, I’ll note you just completely ignored the point being discussed. Please don’t do that. Making bold claims then just randomly “forgetting” about them is obnoxious in discussions.

    Cook acknowledges that he does not expect a good level of agreement; that is, we cannot verify that the two data sets are measuring the same thing. You’ve even acknowledged this, so it is completely true that “we learned nothing about the validity of the data by comparing the two” data sets using kappa.

    I’m glad you’ve now acknowledged these kappa tests were completely uninformative. It’s a welcome change. I don’t know why you keep talking about the two data sets measuring the same thing though. Telling us we can’t verify a claim nobody made is… strange.

    Cook’s validation test was very weak. It was not a sufficient metric for testing whether the data sets are measuring the same thing. That has been discussed.

    Sure. People have repeatedly talked about how his test doesn’t show the opposite of what he hoped for. What you guys haven’t talked about is what he actually tested for.

    I’ll repeat a point since this entire discussion hinges upon you guys ignoring it: The paper and abstract ratings were not expected to measure the same thing.

  18. The original and best switcheroo between disagreement and lack of agreement is by Al Gore, who in his carbon-trading infomercial An Inconvenient Truth describes Oreskes’ search as finding that “not a single paper disagreed!”

    I’m sure an experienced tobacco salesman and Southern preacher like Gore would be pretty confident that millions of viewers would misremember this line as, “not a single paper didn’t agree!”

  19. Brandon Shollenberger, good question:

    “Wouldn’t Oreskes be the first?”

    Possibly—it depends on her exact wording, which I don’t remember and which almost nobody seems to bother to read. Gore’s sleight-of-hand was performed in front of orders of magnitude more people, which is why I care more about it. The trick consisted in *how he described* Oreskes’ inane results. They could have been described honestly, but Gore chose not to do so.

    PS Only my mum addresses me as Brad Keyes…. am I in trouble, Brandon? 😀

  20. Nah. I just refer to people (online) with the same style, whether they’re present or not. When I refer to someone for the first time, I call them by their full name. I then use their last name for the remainder of my comment (unless enough space has passed I feel repeating the first name improves clarity). That is, if they use a real name. If they use a handle instead of a name, I just always use that.

    I’m not sure if there’s a reason I do that or not. I know I also use people’s full handles in chat rooms and video games. It’s just what comes naturally to me. It might be some sense of formality/respect drilled into me at a young age (I still use “ma’am” and “sir” on a regular basis), or it might just be habit.
