I was looking over the Cook et al. consensus paper data I recently obtained, and I noticed a small oddity. I’m curious if anyone can offer an explanation.
Each abstract examined in this paper was rated by two people. Once those ratings were complete, raters were allowed to discuss their disagreements with one another through anonymous forms, updating their ratings if they so desired. After this step, raters still disagreed about what 1870 abstracts (16%) said about the consensus. These disagreements were resolved by a “TiebreakRater.”
The oddity is that 7% of the time, the TiebreakRater disagreed with both of the original raters. Can anyone explain that to me? I know 7% of 16% is tiny (~1%), but should we really trust the other tie-breaks?
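For anyone who wants the back-of-the-envelope arithmetic behind that ~1%, here is a minimal sketch. The 11,944 total abstract count is my own figure for Cook et al. (2013), not a number stated above:

```python
# Back-of-the-envelope arithmetic behind the "~1%" remark.
tiebroken_abstracts = 1870                                # abstracts still disputed after reconciliation (16%)
disagree_with_both = round(0.07 * tiebroken_abstracts)    # ~131 tie-breaks that matched neither original rating

total_abstracts = 11944                                   # assumed total abstracts rated in Cook et al. (2013)
print(disagree_with_both)                                 # 131
print(disagree_with_both / total_abstracts)               # ~0.011, i.e. roughly 1% of all abstracts
```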
I don’t know. What I do know is that it is interesting to look at these tie-breaks. This is their distribution:
Rating:   1    2    3    4    5    6    7
Count:   13  210  871  749   24    2    1
That’s pretty skewed. That’s not necessarily bad though. Abstracts in certain categories may be more difficult to rate. As a way to check, we can look at the distribution of the ratings which disagreed, with the counts divided by two and rounded so the totals are comparable to the 1870 tie-breaks (a sketch of this tallying follows the table):
Rating:   0    1    2    3    4    5    6    7
Count:    8   46  279  786  708   31    8    2
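Nothing here is code from the paper; it is just a minimal sketch of the kind of tallying involved. The file name “tiebreaks.csv” and the column names (“first_rating”, “second_rating”, “tiebreak_rating”) are hypothetical stand-ins for whatever the released data actually uses:

```python
from collections import Counter
import csv

tiebreak_counts = Counter()   # distribution of the 1870 tie-break ratings
original_counts = Counter()   # distribution of the 3740 original ratings that disagreed

# "tiebreaks.csv" and its column names are hypothetical stand-ins for the released data.
with open("tiebreaks.csv", newline="") as f:
    for row in csv.DictReader(f):
        tiebreak_counts[int(row["tiebreak_rating"])] += 1
        original_counts[int(row["first_rating"])] += 1
        original_counts[int(row["second_rating"])] += 1

# Halve (and round) the original-rating counts so their total matches the
# 1870 tie-breaks rather than 3740 individual ratings.
halved = {rating: round(count / 2) for rating, count in original_counts.items()}

for rating in sorted(set(tiebreak_counts) | set(halved)):
    print(rating, tiebreak_counts.get(rating, 0), halved.get(rating, 0))
```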
The distribution is a bit different. The tie-breaks seem more conservative, skewed slightly toward the endorsement side. That’s interesting, but for the sake of thoroughness, we shouldn’t just combine the two sets of ratings. We should look at each individually:
Rating:           0    1    2    3    4    5    6    7
First ratings:   13   69  336  812  586   39   12    3
Second ratings:   4   24  222  760  831   23    5    1
There are significant differences. The first round of ratings is dramatically different from the second. That shows the way raters rated abstracts changed over time. The tie-break ratings favored the second round of ratings because the tie-breaks were done after that round. We can see this bias by looking at agreement rates (a sketch of how these could be computed follows the list):
Tie Break Rating/First Rater – 31%
Tie Break Rating/Second Rater – 62%
Tie Break Rating/No Rater – 7%
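Those percentages are just counts of which original rating, if either, the tie-break rating matched, divided over the 1870 tie-broken abstracts. Here is a minimal sketch of that computation, again using the hypothetical file and column names from the earlier sketch:

```python
import csv

matched_first = matched_second = matched_neither = total = 0
# "tiebreaks.csv" and its column names are hypothetical stand-ins for the released data.
with open("tiebreaks.csv", newline="") as f:
    for row in csv.DictReader(f):
        first = int(row["first_rating"])
        second = int(row["second_rating"])
        tiebreak = int(row["tiebreak_rating"])
        total += 1
        if tiebreak == first:
            matched_first += 1
        elif tiebreak == second:
            matched_second += 1
        else:
            matched_neither += 1     # tie-break matched neither original rating

print(f"Tie Break Rating/First Rater  - {matched_first / total:.0%}")
print(f"Tie Break Rating/Second Rater - {matched_second / total:.0%}")
print(f"Tie Break Rating/No Rater     - {matched_neither / total:.0%}")
```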
Since we’re talking about biases, it’s worth pointing out that this may not just be a bias introduced by raters rating abstracts differently over time. Another issue might be who performed the tie-breaks. Suppose all of the tie-breaks were done by one person. Couldn’t that influence the results?
It’s an idea worth considering. Unfortunately, the University of Queensland has threatened to sue me if I examine the data to look into it. Fortunately, I know that threat is complete bunk. Even if you believe I can’t disclose the data I have, there is no question I can look at it.
When I do, I see that 681 of these 1870 disagreements were resolved by a single rater. That rater had performed only 711 ratings prior to the tie-break stage. Somehow a person who performed ~3% of the original ratings wound up performing ~36% of the tie-break ratings. I think that might have affected the results. Don’t you?
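The arithmetic behind those two percentages is simple; the 11,944 × 2 denominator is my assumption about the first-round ratings (each abstract rated by two people), not a number given above:

```python
# Arithmetic behind the ~3% and ~36% figures.
ratings_by_that_rater = 711
total_first_round_ratings = 11944 * 2        # assumed: 11,944 abstracts, each rated by two people

tiebreaks_by_that_rater = 681
total_tiebreaks = 1870

print(ratings_by_that_rater / total_first_round_ratings)   # ~0.030 -> ~3% of the original ratings
print(tiebreaks_by_that_rater / total_tiebreaks)           # ~0.364 -> ~36% of the tie-break ratings
```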
And don’t you wish we could investigate things like this without having to worry about being sued?