MR. CICCHETTI: First of all, I want to thank the National Academy for inviting me to present today. I also want to say that I really owe a debt of gratitude to the sources of my information.
I will be talking about not only peer review of grants but also of scientific documents in general, mainly articles submitted to professional journals in medicine, behavioral science and, to a more limited extent, the physical sciences.
I want to thank the journal editors for making their data available. For some of the data that I have on peer review, I also want to thank the National Academy and the National Science Foundation for making it available for me to use.
So, I will be talking today about the role of peer review and the reliability of peer review: the data for manuscript reviews in behavioral science, medicine and the physical sciences, the data for grant reviews, arguments for reliability and validity assessments, arguments against reliability and validity assessments -- there are those -- whether reliability and validity can be improved, what the data indicate, and developing criteria for training of reviewers.
I mean training of reviewers in a different sense than was talked about here. I mean really taking the criteria that we use to evaluate research, whether it be grants or other things, and actually defining those very clearly, and then training people to use those reliably.
I think one of the problems that we have in the review process is not that people don't know what the criteria are, but that, when we use them, we tend to carry our own frame of reference as a defining point. That, I think, is a problem.
The role of peer review is to evaluate the level of performance, obviously. It is used in non-scientific areas that we are familiar with -- films, art and music.
Also, it was noted by my accountant that they also have peer review. In the scientific areas, it is used for evaluating track records, for hiring, firing and promoting, and for evaluating the merits of scientific documents, journal submissions, grant submissions. It dates back 200 years, maybe more.
The major criteria for evaluation in paper, manuscript and grant submissions are of course the scientific contribution and overall merit that we have talked about, the research design, data analysis, originality and innovativeness, and the adequacy of the literature review.
I should say that, in fact, there is a series of studies, which I will get back to later, indicating that no matter what the design of a specific study was, and no matter how many non-responders you had, over and over and over again the data indicate that scientists believe that the importance of the contribution to the field and the research design are the two most important criteria for evaluating scientific documents. That is certainly consistent with what people have been saying over the last couple of days.
In fact, I have data, based on 500, 600 manuscripts, where we have looked at the two separate reviews and, for each one of them, we asked, for the first set of reviews, what is the correlation between how you rated the scientific contribution, the worth of the article to the field, and whether, and to what extent, you recommended acceptance or rejection.
The correlation is very high, .72, .73. Research design, which is the next highest, is .67, .68. The problem is, reliability levels are 1.6. Then come data analysis, originality and innovativeness, and literature review.
Now, criteria that are specific to grant reviews are the adequacy of the research environment within which the research is to take place, the appropriateness of the budget request itself, and the track record of the applicant, except, of course, in the case of applications that come in for the first time from a new investigator.
Criteria specific to manuscript or paper reviews are whether the article appeals to the journal readership -- there are some funny results there -- whether there is adequate journal space available, and whether the paper submission adequately fits conference themes or objectives.
Then, there are certain stylistic criteria: the writing style and organization, following the journal or grant writing formats, and the level of succinctness. We have talked about that during this conference as well.
On the failings of peer review, which I will get into in more detail: the peer review of scientific documents is known to be flawed, and there is evidence for low reliability and low validity.
The statistic employed is kappa, for the most part, or the intraclass R, which are mathematically equivalent, depending upon the agreement weights that we use.
The formula for kappa is very, very simple: the difference between observed agreement, PO, and chance agreement, PC, relative to the maximum difference that we can have between the two.
So, it is PO minus PC, over one minus PC. If there is 100 percent agreement, this reduces to one, or one minus PC over one minus PC.
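As an illustration of the formula just described, here is a minimal Python sketch of kappa for two reviewers. The function name `cohen_kappa` and the counts in the two example tables are hypothetical; they are not taken from any of the journals discussed.

```python
def cohen_kappa(table):
    """Chance-corrected agreement for a square table of counts.

    table[i][j] is the number of manuscripts that reviewer A placed in
    category i and reviewer B placed in category j.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    # PO: observed agreement, the share of manuscripts on the diagonal
    p_o = sum(table[i][i] for i in range(k)) / n
    # Marginal proportions for each reviewer
    row_share = [sum(table[i]) / n for i in range(k)]
    col_share = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    # PC: agreement expected by chance from the marginals alone
    p_c = sum(r * c for r, c in zip(row_share, col_share))
    return (p_o - p_c) / (1 - p_c)


# Hypothetical accept/reject tables for 100 manuscripts
perfect = [[30, 0], [0, 70]]    # every manuscript on the diagonal
mixed = [[15, 15], [15, 55]]    # 70 percent raw agreement

kappa_perfect = cohen_kappa(perfect)  # reduces to (1 - PC) / (1 - PC)
kappa_mixed = cohen_kappa(mixed)      # about 0.29
```

In the second table the raw agreement is 70 percent, but most of that is what the marginals would produce by chance, which is why the chance-corrected value is so much lower.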
These are the data published between 1970 and 1990 for the reliability of the overall finding of merit in behavioral science journals. These are all very prestigious journals.
These data were all supplied by the editors of the journal, and with no defensiveness about what the results may or may not show us, because they were very aware of how unreliable the process is, and what the problems are in the review process itself.
As you can see, the RI, or kappa, values range from a low of .19 for the Journal of Abnormal Psychology, based on 1,300 data points.
It goes up to .54 for the American Psychologist, but there is a problem here. That is not real. I published that in 1980. Sandra Starr was the editor, at that time, of those articles that could be roughly classified as having scientific quality to them. They were in a separate category.
I looked at the data and I said, you know, this doesn't make a lot of sense. I had never seen an RI or kappa that high, but it is only based on 87 manuscripts. When you look at all these other areas, most of them are based on a much larger N than that.
So, I thought what happened is that, just through a random kind of process, some articles that were, let's say, easier to rate turned up in the batch and that is what raised the level of reliability.
So, I said to Sandra Starr that I would like to do a replication study. My prediction would be that, in the next pass, the level of reliability would drop below .40 and, sure enough, it did.
When I tried to publish that -- by the way, the American Psychologist was very eager to publish my results, when I gave them a reason to do it -- in my talk with the late Chuck Keasler(?), who was editor then, I said, you know, I just don't believe this. I think it is a random kind of phenomenon. I want to do replication studies.
He said, no, no, no, no, Dom, you don't understand. We are editors who know what we are doing. We so carefully match our reviewers with our submitters. That is why we get such good agreement. It is like the other journals don't know what they are doing.
He later referred to that as the role of the wise editor. The rest of us are not too wise.
When I did the replication studies and tried to publish that in the American Psychologist, the reasoning was, well, what do you think this is, a random process? You know, it is not random. So, why can you expect to get reliability levels that are high?
Well, at any rate, you see that it is not very trustworthy. Now, the reliability of overall scientific merit for medical journals, published between 1965 and 1990: two so-called untitled medical journals -- for whatever reason the editors, the people doing the research, didn't want me to tell you the names of the journals -- The New England Journal of Medicine, a major subspecialty medical journal -- I am not supposed to tell you the name so I won't -- the British Medical Journal and Physiological Technology(?). These are statistically significant for reliability.
Now, if you go from the overall scientific merit to the agreement on the specific evaluative criteria that we, as scientists, carry with us when we make these decisions: the importance, which is somewhat embodied in the overall scientific merit but is also looked at separately as importance of the contribution to the field, the research design, data analysis, reputation.
Remember that I said that importance and research design, that is what people use as individuals to make a decision about funding, to make a decision about publication, to make a decision about whether or not an article should be accepted or presented at a meeting, and the correlations are pretty high.
When you look at actual reliability for the same independent reviews on the same manuscripts, you are getting dismal results. So, people are carrying different frames of reference with them, and that is why it is so important, in my judgement, to use criteria that we can agree on, and translate them so they can be used reliably, just as we do for any new assessment instrument that we develop.
You would hardly try to put it on the market without establishing its reliability and validity. I can't think of an assessment where we don't think that is important, but somehow, some of us who believe in peer review don't think that it is necessary there.
Now, in the behavioral sciences, there are also results in the Journal of Personality and Social Psychology. I was going to tell you a curiosity with respect to reader interest. Let's look at that.
.07, this is simply, does the content of the article interest the readership. You would think the editor, who has to read the abstracts, at least, to decide where to farm out the manuscripts, or the associate editor the same, would at least read the abstract and say, look, this doesn't fit the readership, and not even include that as a criterion.
It gets included in the criteria and, even worse, the reviewers can't agree on something that basic. It is unbelievable.
MR. HENLEY (Committee on Research in Education): Is there much variability on that?
MR. CICCHETTI: Oh, sure. The British Medical Journal: the importance, what they call scientific reliability, originality, suitability, again.
The American Association for the Study of Liver Disease -- I bet you didn't know about that, the good old AASLD -- in 1976 I did a study with Harold Kahn(?), who was a world expert in the study of liver disease. These RI values were based on three independent reviewers. I also did it separately by pairs, and the results are just as dismal. They can't agree on the importance of the article, and the design applications.
In 1996, my colleagues in pediatrics -- Ken McCarthy(?), Kemper, myself -- we published data that looked at the reliability of extended abstracts that were submitted to APA -- not the American Psychological Association, not the American Psychiatric Association, but the American Pediatric Association.
We varied them by reviewer types in these two areas of assessment. In 1990, the reviewers all came from the APA board of directors and, for 246 abstracts, there are 11 reviews. The rating kappa was .19.
In 1995, we divided the reviewers -- we were trying to improve the reliability of the process by doing this, making it more focused -- into those in general pediatrics, those working in emergency medicine and those in behavioral pediatrics. Again, no effect whatsoever.
Now, on reliability of peer review, I made a prediction in 1991 when I published my article on peer review in the behavioral sciences -- in Behavioral and Brain Sciences -- and what I said was -- we didn't have any data at the time -- that once we get data from either Physics or Chemistry and other prestigious journals, what we will find -- rocket scientist theory aside -- is that they will be no better at judging the overall merit of articles than we are in the behavioral sciences and ordinary medicine. They will be just as dismal. That turned out to be the case: .20, 392 peer reviewed, for the overall scientific merit.
Let's go to grant reviews, what is available here. The first article that I am aware of was published in 1977 by Ruler(?) et al, and it involved the reliability of grant reviews from the New York State affiliate of the American Heart Association, and the reviewers came from that association.
There were 101 grants and 17 fellowships, and these are the results for pairs of primary reviews. The criteria were literature review, prior work and methodology, with .1 for methodology; objectives, .29; research grant, .31; data analysis, .23; usefulness of results, .36. Investigator competency is .16, institutional prestige .29, investigator's age .31. I don't know what they mean by that. Was the investigator 43, 45 or 27?
MR. HENLEY: Can I ask how you judged the investigator competency?
MR. CICCHETTI: That has to do with whether the reviewers feel that the investigator is competent to do the research that is proposed, based on track record or based on the ideas that are expressed in the grant.
Okay, this is another set of data that I will be presenting to you. It is based on the article that was published in Science by Cole, Cole and Simon in 1981, based on 150 grants.
Cole, Cole and Simon did not present any specific data on the reliability of the reviews. They simply said that there was a lot of variability, in terms of a simple standard deviation type statistic, but there is no direct measure of reliability.
The data consisted of ratings by the various independent reviewers for each grant. None of it had been analyzed before. It was available for COSEPUP, the Committee on Science and Public Policy of the National Academy of Sciences, and for the National Science Foundation. These are independent reviews of the same 150 grants.
On average, there were about four reviewers for each grant submission, with similar numbers for the NSF and COSEPUP evaluations.
There was a special statistic that had to be used for this that takes into account the varying number of reviews that you get for each proposal.
This is what we found. Combined over the areas, based on 150 manuscripts, it was .32. For chemical dynamics it was as low as .16, not even statistically significant. Now, in the evaluation process of scientific documents, that is a very small number, it assumes variability. Solid state physics was .34, economics was .35.
Now, what the RI or kappa data tell us is that the reliability of assessments of manuscripts, papers and grant submissions was statistically significant, at less than .05, but these results were of little practical or clinical significance.
Think of this in another context. We develop a new assessment instrument, and the reliability is at the level that I showed you for manuscript and grant reviews. You can tear it up and throw it away.
On criteria for levels of clinical significance: with a large enough number of whatever items you are evaluating, you will get statistical significance with very, very low levels of chance-corrected agreement, as reflected in the kappa or the RI.
It becomes important to develop a set of criteria -- you know, as biostatisticians, we are interested in these criteria. A colleague of mine once said to another statistician, I am worried about coming up with criteria for a particular value, regardless. He said, what the hell are you a statistician for, if you don't want to develop criteria.
Anyway, less than .40, we agree in the field, we take that as poor. Between .40 and .59, we established as fair, between .60 and .74 good, and above .75, excellent.
So, with these clinical criteria, virtually every reliability assessment we just presented is at a poor level of chance-corrected agreement.
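The cut-offs just described can be written down as a small lookup. This is only a sketch: the function name `clinical_level` is hypothetical, and placing a value of exactly .75 in the top band is an assumption, since the talk only says "above". The values in the dictionary are the ones quoted earlier in the talk.

```python
def clinical_level(kappa):
    """Map a kappa or RI value to the bands described in the talk.

    Assumption: a value of exactly .75 is placed in the top band.
    """
    if kappa < 0.40:
        return "poor"
    if kappa < 0.60:
        return "fair"
    if kappa < 0.75:
        return "good"
    return "excellent"


# Reliability values quoted in the talk
quoted = {
    "Journal of Abnormal Psychology": 0.19,
    "American Psychologist (first study)": 0.54,
    "physical science journal": 0.20,
    "grants, combined": 0.32,
    "grants, chemical dynamics": 0.16,
    "grants, solid state physics": 0.34,
    "grants, economics": 0.35,
}

levels = {name: clinical_level(value) for name, value in quoted.items()}
```

Only the first American Psychologist figure, which the speaker himself treats as not real, lands outside the lowest band.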
Now, going beyond the low RI and low kappa, I started to think, okay, the overall agreement level is really terrible for the evaluation of articles that are submitted to major medical journals, but what if we decompose that overall agreement into agreement separately for acceptance or rejection of journal articles, and for approval or disapproval for grants, based on what the priority scores are.
It turns out that what I was doing was capitalizing on the various mathematical factors. The overall agreement is nothing more than a weighted average of agreement on positive and negative evaluations. So, we get an overall agreement level.
If you don't break it down into agreement on positive or negative cases, how do you know exactly what you are doing?
Also, since the same kappa value can have a wide range of levels of agreement, then you certainly want to know what is the separate agreement of positive and negative. That gives you some insight into what the data means.
Again, establishing criteria, I took the data on the manuscripts that I showed you, but where I had the most data, of course, was for the Journal of Abnormal Psychology, based on 1,319 paired reviews.
So, what I said was, let's combine accept subject to revision or any combination of that, in terms of the two paired reviews.
So, there were 203 such manuscripts, and 86 percent of those were accepted. If it was anything else -- resubmit or reject, two resubmits or whatever -- only five percent of them were accepted.
That includes two resubmits. So, I said, accept subject to revision, a high probability of acceptance, and resubmit and reject, a high probability of rejection, and now I want to re-analyze the data to see what happens.
So, there was some indication in the literature -- although we hadn't done this at the time in this much detail -- and it came from Franz Ingelfinger, who was then the editor of the New England Journal of Medicine.
As he looked at the separate recommendation categories -- accept as is, accept after revision, resubmit, reject -- he noticed there was more agreement on rejection than any of the other categories, but he hadn't done the statistical test of it.
So, I started to look at this, and used that idea. For the Journal of Abnormal Psychology, it was .14. If we looked at the agreement on acceptance, it was only 44 percent, or less than that. Rejection, it was 70 percent. The difference between the two, in terms of statistical tests, is enormous, and clinically, of course, there is quite a difference between 70 percent and 44 percent.
Very similar for medical specialties area, 50 percent versus 76 percent. These were based on pretty huge numbers, too.
For developmental review, it was 52 percent on accept and 74 percent on reject, and that is with smaller numbers. So, that actually comes out to be quite statistically significant.
The American Psychologist, when I pooled together the results from those two studies --
MR. FLODEN (Committee on Research in Education): Can you just clarify what those numbers mean? What is the 44 percent accept? What is that?
MR. CICCHETTI: This means that, of those articles that both reviewers felt should be accepted, there was only 44 percent agreement.
MR. FLODEN: Forty percent agreement on?
MR. CICCHETTI: On acceptance. They agreed on all of these. In other words, you have an accept pile from one set of reviewers and an accept pile from another set, and now you ask, to what extent do they agree on accept. That is 44 percent.
The weighted average of this 44 percent and 70 percent would give you the overall agreement. It is very similar for the American Psychologist, 56 versus 78, et cetera.
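The weighted-average statement can be checked numerically. The talk does not spell out the formula, so the sketch below assumes one common definition of agreement on a category: twice the number of manuscripts both reviewers put there, divided by the total number of times either reviewer used that category. Under that assumption the identity is exact. The function name and the counts are hypothetical.

```python
def specific_agreement(both_accept, split, both_reject):
    """Agreement on accept, agreement on reject, and their weights.

    split is the number of manuscripts where one reviewer said accept
    and the other said reject.
    """
    n = both_accept + split + both_reject
    accept_votes = 2 * both_accept + split   # accept votes, both reviewers
    reject_votes = 2 * both_reject + split   # reject votes, both reviewers
    p_accept = 2 * both_accept / accept_votes
    p_reject = 2 * both_reject / reject_votes
    overall = (both_accept + both_reject) / n
    # Each weight is that category's share of all 2n votes cast
    w_accept = accept_votes / (2 * n)
    w_reject = reject_votes / (2 * n)
    return p_accept, p_reject, overall, w_accept, w_reject


# Hypothetical counts for 100 manuscripts
p_acc, p_rej, overall, w_acc, w_rej = specific_agreement(20, 30, 50)
recombined = w_acc * p_acc + w_rej * p_rej   # equals overall, up to rounding
```

With these made-up counts, agreement on accept is about 57 percent and agreement on reject about 77 percent, while the overall agreement of 70 percent hides the gap between the two.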
So, what we are finding here is that there is more agreement on rejection than on acceptance for major journals in medicine as well as in behavioral sciences.
Now, what does that mean? One would say, well, gee, you know, maybe those are valid results. These are articles that we can agree should be rejected.
In the behavioral and medical sciences, the prestigious journals examined here have rejection rates of about 70 percent on average, an important point.
The results suggest that both editors and reviewers, being quite aware of this phenomenon -- and rejection rates are published all the time and they are very, very similar one year to the next -- they gear the reviews toward rejection rather than acceptance.
Practicing rejection much more frequently results in more agreement on rejection than on acceptance, but is this a valid determination?
The opposite occurs -- let me go back to 1991 -- for this Chemi(?), which is a chemistry journal. Remember, I predicted the overall level of agreement with these journals, and I also predicted that the agreement on acceptance would be much higher than the agreement on rejections.
Why did I predict that? Because in the physical sciences, in terms of the major journals, there are not tons of journals that a lot of people call on, so it is not the case that, if you get rejected by one, you can publish in one of these other journals later.
So, what happens to authors in these fields is, if their articles are rejected, this is a real shameful kind of thing. They are going to have to go to a second tier journal, or they have to give up the research.
So, this is what happened here. We did a rejection analysis. So, this is what I just said: very prestigious outlets, where rejection means publishing in lower tier journals. There are high acceptance rates, better than 70 percent. It is just the reverse of the phenomenon that you get in medicine and behavioral sciences. Rejection is a professionally damaging phenomenon. Reviewers get more practice in acceptance than rejection.
If you look at the criteria for this journal, the recommendation categories are accept as is, accept subject to minor revision, accept subject to major revision, or reject. So, when in doubt, accept it.
MS. SCHNEIDER (Committee on Research in Education): If you know the rejection rate is going to be 70 percent, what if you just randomly did -- if you did a study where you randomly assigned -- you know that 70 percent of the time you reject. You look at the agreement based on the fact that 70 percent --
MR. CICCHETTI: Well, by chance alone --
MS. SCHNEIDER: That is what I am saying, by chance alone, wouldn't you get a higher agreement then?
MR. CICCHETTI: On chance alone, you would get .7 times .7 on the reject, plus .3 times .3 on the accept. So, you get 58 percent agreement by chance alone, if you did that.
MS. SCHNEIDER: So, isn't this relevant, chance alone 58 percent? If the argument is that they are just doing it by chance alone rather than by any --
MR. CICCHETTI: It is a balance kind of thing. In the behavioral sciences and in medicine, they do better on rejections, but they do terribly on acceptance.
So, when we balance it all out and the reviews are averaged over all the journals, it is better than chance. The .74 in the physical sciences, at least in terms of this one journal we have now, it is the reverse of that.
MR. FLODEN: Did you say the statistics you use, RI, is looking at things better than chance? You are correcting for chance already?
MR. CICCHETTI: That is right, the same as the kappa, exactly.
MS. SCHNEIDER: So, you are correcting.
MR. CICCHETTI: Absolutely, but this was a phenomenon where I was looking beyond the overall agreement and saying, is there a balance. Is it just overall agreement is terrible, whether you are talking about acceptance or rejection, when you are talking about approval or disapproval, or does it really favor one or the other, and you can see that it favors rejection.
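The arithmetic in this exchange can be sketched as follows, assuming two reviewers who each reject independently at the same rate. The function names are hypothetical, and the 70 percent observed agreement in the last line is a made-up figure for illustration.

```python
def chance_agreement(reject_rate):
    """Agreement expected if both reviewers reject at random at this rate."""
    accept_rate = 1 - reject_rate
    # Both reject by chance, plus both accept by chance
    return reject_rate ** 2 + accept_rate ** 2


def chance_corrected(observed, reject_rate):
    """Kappa-style correction: (PO - PC) / (1 - PC)."""
    p_c = chance_agreement(reject_rate)
    return (observed - p_c) / (1 - p_c)


p_c = chance_agreement(0.70)           # .49 + .09, about .58
kappa = chance_corrected(0.70, 0.70)   # about 0.29 for a hypothetical PO of .70
```

This is why a journal with a 70 percent rejection rate can show raw agreement well above 50 percent and still have a low kappa or RI: the statistic only credits agreement beyond the 58 percent that chance would supply.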
Now, I tried to do something analogous with the grant review data that I had. Things have changed at NSF now, I understand, but at the time of this data, the proposals were rated anywhere between 10 and 50.
If the scores -- this is based on data over a number of years -- were between 40 and 50, the probability of funding at that time was 92 percent.
If the score was below that, there was only a 14 percent probability of approval. I therefore said, as my criterion, 40 to 50 is scored as a high probability of acceptance, and 39 or below is scored as a low rating. Then I could do the analysis that I did with accept or reject for the journals, just another way of getting into the overall level of agreement and decomposing it to see what happens.
Now, when I did that, where was the most agreement? It was on the low ratings, the grants that have a high probability of getting rejected.
So, even among the pile that we accept, the agreement level is very low. You can bet it is only where your agreement level is -- you know, it is only those manuscripts where there is agreement that are getting funded, but there is a whole set of other ones where, independently, each source -- whether it is COSEPUP or NSF -- said, these are worth funding, but there was no agreement.
MR. FLODEN: I am still having trouble understanding just what the numbers mean. So, the 60 percent -- so, say, out of 100 proposals that were funded -- is that what I am looking at, to see if it makes sense?
MR. CICCHETTI: Yes. No, that were recommended by either COSEPUP or NSF independently, should be funded. There was only 60 percent agreement.
MR. FLODEN: Sixty percent agreement between these two different agencies?
MR. CICCHETTI: No, two different agencies, but on those proposals that they had both said independently should be funded.
MR. FLODEN: So, of the 100 proposals that they both said should be funded -- am I getting this right?
MR. CICCHETTI: Right.
MR. FLODEN: The 100 proposals that both said should be funded, 60 percent were what?
MR. CICCHETTI: They agreed on 60 percent of them.
MR. WISE: The denominator is the union of the sets, that either one or the other said should be funded, and the numerator was the number that they both said could be funded?
MR. CICCHETTI: That is right, precisely.
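The ratio as Mr. Wise states it, with the union in the denominator and the overlap in the numerator, can be sketched with two sets of proposal identifiers. The function name and the identifiers are hypothetical, chosen only so that the result matches the 60 percent figure under discussion.

```python
def union_agreement(fund_a, fund_b):
    """Share of proposals recommended by either source that both recommended."""
    fund_a, fund_b = set(fund_a), set(fund_b)
    either = fund_a | fund_b   # denominator: recommended by one or the other
    both = fund_a & fund_b     # numerator: recommended by both
    if not either:
        raise ValueError("neither source recommended any proposal")
    return len(both) / len(either)


# Hypothetical proposal IDs recommended for funding by each source
source_a = range(0, 8)    # proposals 0 through 7
source_b = range(2, 10)   # proposals 2 through 9

agreement = union_agreement(source_a, source_b)   # 6 shared out of 10
```

Proposals that neither source recommended play no part in this ratio, which is what makes it a measure of agreement on funding specifically rather than of overall agreement.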
MR. FLETCHER: It is like having two psychiatrists look at the same patient, and see whether they agree that the patient has schizophrenia. There is always agreement between paired raters.
MR. CICCHETTI: Yes. We can argue about this, whether a phenomenon can be any more valid. I can think of a few situations where studies get designed inappropriately where the inverse could be true, but the fact of the matter is that the overall level of agreement is terrible, but there is more agreement on rejection in behavioral sciences and in medicine.
So, we might say at this point, well, gee, this suggests that the ratings have some ring of validity. That is where we got most of our agreement. Let's see what happens.
Eighty to 85 percent of articles rejected by Science, the New England Journal of Medicine and the Journal of Clinical Investigation -- these are all very, very prestigious journals -- and by prestigious journals in behavioral science are published elsewhere.
MS. SCHNEIDER: How do you know that?
MR. CICCHETTI: The study has been done. This is Relman 1978. Wilson did it for Clinical Investigation and the editor of Science at that time, Abelson, did it for Science.
MS. SCHNEIDER: They go back and track the authors?
MR. CICCHETTI: Yes, and they also looked to see where the article had been published. These are journal editors themselves.
They are often published in journals as prestigious as the rejecting journals. This is true of 70 percent of the New England Journal of Medicine rejections. Most manuscripts are revised not at all or only in minor ways. That is in the 1978 study of the New England Journal of Medicine.
So, what does that tell us? We certainly can't say that the process is valid, or else those articles that were rejected would not have been so easily accepted in journals of similar prestige.
MS. SCHNEIDER: The qualifier on this is that we know that, since those people that publish are very different from those people who don't publish at all, if you talk to people who publish, they will tell you that, if they are rejected in one place, then they will continue to try and find a place to publish their work.
I keep thinking that, at one level of this, that maybe the idea is that it just takes a lot more time in the system to get published, but the question is going to be why, then.
MR. CICCHETTI: Because the process is unreliable and invalid. It is very simple.
MR. SLOANE: Could I ask a question? With respect to the New England Journal of Medicine, often one of their criteria is timeliness of the article.
The point is that they have lots and lots and lots of rejections. Their rejection rate could be very, very high. It doesn't necessarily mean that the manuscript isn't good.
MR. CICCHETTI: Yes, but the rejection rate of these other journals is high, too.
MR. SLOANE: To the same degree?
MR. CICCHETTI: I don't know. I don't think that really answers the phenomenon at all. The rejections are high in all journals, on average 70 percent or more.
The fact that most of these articles are being published elsewhere in equally prestigious journals, that is troublesome to me.
Now, articles describing -- here is another comment. Articles describing truly innovative discoveries in medicine such as blood typing and radioimmunoassay were originally rejected by major medical journals -- Lancet and the Journal of Clinical Investigation, respectively.
Citation rates of rejected and accepted articles are often used to study the validity of decisions on accepted and initially rejected articles.
That perspective is known to be flawed. Citation counting alone does not equate to scientific merit, given the multiple reasons for author citation.
Articles are cited for all kinds of reasons, sometimes for historical reasons, sometimes as exemplars of bad research. So, to use that as a criterion will not work.
Recall that reviewers demonstrate the lowest levels of agreement on the grants receiving the highest priority scores, the ones most likely to be approved and funded.
This suggests strongly that the grants that are approved are not always the ones that are most deserving of funding.
As disapproval rates continue to increase -- this is old data, I mean, we are talking today about a situation where you have 100 applications and only five can be accepted.
I am thinking to myself, if two different funding agencies looked at those same grants, would you really want to put any money on the same five grants being selected? I don't think so.
As disapproval rates continue to increase, and fundable ranges continue to narrow, the probability of inaccurate, invalid and arbitrary funding decisions will continue to increase.
MS. PETERSON: On the basis of that criteria, if that were the case, then three of the five can be accepted for funding; right?
MR. CICCHETTI: No, they are going to look at all 100. You may want to do it the correct way. You want to know if your selection process is making a difference in the prediction.
The low reliability of peer review extends beyond specific types of documents. Since the end of all peer review is to produce the best of science, we should be concerned when it fails to accomplish this worthwhile objective. This seems to be the case.
Arguments against reliability. This is the wise editor that I already talked to you about, the idea that high reliability stifles creativity, encourages the acceptance of articles that do not challenge cherished beliefs. This is Armstrong 1997.
The idea is that the editor -- or the granting agency -- merely uses peer reviewers as sources of information, and that it is the editor that we trust to make the appropriate publication decisions.
John Bailar said this as a commentator in an article in 1991. He said, you can't tell me this is the point here, because I just use reviewers as a source of information and, in fact, we get three reviews for every article that is submitted to the Journal of the National Cancer Institute that I am the editor of, and it is not unusual for me to reject articles that get acceptance from reviewers.
So, I am thinking, that is curious, because I would like to see the whole data set. So, I invited him to give me the data set, because I don't believe him.
The basic flaw in that reasoning is that, as editors, associate editors and submitters, we all derive from the same research species.
Therefore, in my judgement, there is no rational reason to expect that we exercise more reliable or valid judgement as editors or as granting agencies than as reviewers.
It is also known, quite apart from what Bailar will tell us and what the late Chuck Keasler will tell us, that reviewers' judgements are highly correlated with editorial decisions. Let's see how that works.
Here is the data, 1,313 data sets. The joint review recommendations are on the left, the number of manuscripts in the middle columns, and then the editors' decisions to accept or reject.
We can look at this data and say that the editor makes his decisions independently of what is provided by the reviewer, and one can do it.
Accept as is, if both reviewer recommendations were accept as is, and that only happens in 17 manuscripts, a little over one percent, all 17 were accepted by the editor. That sounds fair.
If the decision was some combination of accept subject to revision, that happened to 86 manuscripts. Ninety percent of them got rejected. Only one percent of the manuscripts get a vote of accept as is by both reviewers and yet, if you get the accept revised, which occurs most of the time, still, one percent of the reviewers accept it.
If you get down to both of them are accept subject to revision, 19 percent are rejected. Accept resubmit, we haven't gotten to rejection yet, and 38 percent are rejected. Second revisions resubmit, 49 percent, almost half of them, are rejected, and there hasn't been a vote for rejection by the reviewer.
Those that are resubmitted, 69 percent are rejected. If one says accept as is and the other says reject, 79 percent of them are rejected.
Resubmit reject, if one says accept as is and the other says accept subject to revision and the other says reject, 82 percent are rejected.
Rejected, resubmit, any combination, 94.5 percent rejected. If they both say reject, 99.4 percent are going to get a rejection.
So, when we look at this between the lines and see what the editor is doing, the editor is trying to make sense out of a review process that is unreliable.
When you look at all of those that are what we might politely call a mixed review, rather than saying they are terrible and unreliable, what is happening is that the editorial principle is, when in doubt, go with the low.
I submit to you that this is the same sort of thing that probably happens in review of grants. My point is that, when you are talking about funding five out of 100, that when there is a discrepancy between scores, it is go with disapproval.
Somehow, the person who recommended approval starts to feel a little bit shaky, and then you have got consensus.
So, we take something that is unreliable and, by this so-called consensus, which is totally biased, you get 100 percent reliability.
If you really believe the process, you even think it is 100 percent valid. I don't.
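The "when in doubt, go with the low" principle described above can be caricatured in a few lines. This is only a sketch: the category labels follow the ones mentioned in the talk, but their ordering, the function name `go_with_the_low`, and the idea of treating the rule as deterministic are assumptions made for illustration. The percentages quoted above show the real pattern is a tendency, not an absolute rule.

```python
# Recommendation categories ordered from most to least favourable.
# The labels come from the talk; the ordering is an assumption.
SCALE = ["accept as is", "accept subject to revision", "resubmit", "reject"]


def go_with_the_low(rec_a: str, rec_b: str) -> str:
    """Return the less favourable of two reviewer recommendations."""
    # A higher index in SCALE means a less favourable recommendation.
    return max(rec_a, rec_b, key=SCALE.index)


# A split between the two extremes resolves to the lower recommendation.
print(go_with_the_low("accept as is", "reject"))
```

Under this caricature any disagreement is settled in favour of the harsher reviewer, which is one way an unreliable pair of reviews can be turned into an apparently unanimous decision.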
MR. FLODEN: What is striking about that is that I am reminded of the experiments where you are looking at obedience to authority. When the authorities are divided, obedience declines markedly.
What you have there is a very sharp fall off on editors following reviewers' recommendations where there is any split.
MR. CICCHETTI: The other thing that makes sense about that is that the editor is rejecting in a very -- he might not be aware of it -- calculated way, because he or she has to get that rejection rate up to 77 percent. So, arbitrary decisions have to be made, and the same thing happens with grant review, I submit.
MS. SCHNEIDER: Couldn't it also be that somebody is discovering that data is flawed? It could be that maybe some reviewers haven't seen the fatal flaw, but if one of them says that, yes, there is a flaw.
Having been an editor for six years, I think you can look at the data, and if you find something in the review and you take a look and, gee, that can't be fixed --
MR. CICCHETTI: Sure. I am asking, how often is that the reason.
MS. SCHNEIDER: I don't know, but you didn't ask them, did you?
MR. CICCHETTI: I have some other data -- you have to make an inference. What happens when there are three reviewers? Of the 112 manuscripts where there were three reviewers, all three reviewers said accept for five manuscripts, and 100 percent of them were accepted.
If two said accept and one said reject, 73 percent were accepted. If one said accept and the others said reject, it now goes down to 20 percent, and if all three say reject, zero percent are accepted.
So, the principle here seems to be, when you have three reviewers, when in doubt, go with the majority.
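The three-reviewer figures just quoted can be set beside what a strict majority vote would predict. The sketch below uses only the four acceptance percentages stated in the talk; the function name `majority_accepts` and the 50 percent cut-off used to decide which side an observed rate falls on are assumptions for illustration.

```python
# Percent of manuscripts accepted, keyed by the number of "accept" votes
# among three reviewers, as quoted in the talk.
observed_accept_pct = {3: 100, 2: 73, 1: 20, 0: 0}


def majority_accepts(accept_votes: int, n_reviewers: int = 3) -> bool:
    """True when more than half of the reviewers voted to accept."""
    return accept_votes > n_reviewers / 2


rows = []
for votes, pct in sorted(observed_accept_pct.items(), reverse=True):
    predicted = majority_accepts(votes)
    # The observed rate "agrees" with the majority rule when it falls on
    # the same side of 50 percent as the rule's prediction (an assumption).
    agrees = (pct > 50) == predicted
    rows.append((votes, pct, predicted, agrees))
    print(f"{votes} accept votes: {pct}% accepted, "
          f"majority rule predicts accept={predicted}, same side={agrees}")
```

In every row the observed rate falls on the majority's side, but the 73 and 20 percent figures show that a split vote still leaves a sizeable share of decisions going against the majority.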
MR. FLETCHER: I can't tell. Do you think that is the right thing to do? I mean, it seems that that --
MR. CICCHETTI: It is what you have got to do if you want to keep your rejection levels at the same level. In other words, if the rejection rates of these journals was 50 percent, then you can be sure that, with those discrepancies, a lot more of those articles would have been accepted.
MR. FLETCHER: In other words, Dom, there is a base rate phenomenon that is occurring here as well. So, these ratings are going to vary, depending on the base rate of acceptance and rejection.
MR. CICCHETTI: And the base rate comes right out of the editors' heads.
MR. FLETCHER: Actually, out of the publisher's head, how many pages do you have.
MR. CICCHETTI: That is true. Okay, here is what happens, getting back to your question of, when one is accept and the other is reject.
The British Medical Journal, it is even worse than go with the low when in doubt. It is go with the really low.
If both reviewer recommendations were accept, okay, accept it. If one said accept and the other said reject, that happened with 68 manuscripts, and only six of them were accepted.
Now, does one think that the 61 other articles had fatal flaws? I don't think so. I mean, that is a stretch. Maybe but --
MR. FLETCHER: If they only have so many pages, how could they do anything different?
MR. CICCHETTI: That is okay, but then don't justify it and say that the review process is not in trouble. It is.
MR. FLETCHER: This is really all about the wise editor, the idea that you have a wise editor that -- what you are saying is that the editor's behavior is governed by the reviewers' behavior.
MR. CICCHETTI: Yes, but in a way that keeps the rejection rate constant. As I say, if you look at the rejection rates over the years, they are so similar. In terms of the random process, sometimes you will get 80 percent of the articles should be rejected, other times maybe only 50 percent.
In structuring it so that you have to reject 70, 72 percent, you will end up with this sort of thing.
MS. SCHNEIDER: Having edited a journal right now, if you got into my head, I have never said to myself, I have to have a 76 percent rejection rate.
I also know that I feel very uncomfortable publishing something that someone has a rejection on, and you look at it and, just as Pen said, it has a fatal flaw.
MR. CICCHETTI: But how much of it is that phenomenon and how much of it is squeezing things so that you get the rejection rate. The journal rejection rate is a constant thing over the years.
MS. SCHNEIDER: The journal rejection rate is high. It is clearly high.
MR. CICCHETTI: From year to year.
MS. SCHNEIDER: I am willing to give you the data. We have a data file.
MR. CICCHETTI: I would like to see it improved.
Okay, Armstrong's argument is, journals, grants, are focused on publishing or funding innovative, groundbreaking research. We all agree with that.
What is the problem? It is well known -- one of the most elusive definitions, a priori, is creativity. Bob Sternberg is not here today but, if he were, he could talk about this.
I mean, he has devoted a considerable amount of time and effort to study this, and that is a unique criteria.
So, one of the criteria for defining creativity, those who talk about it simply know that what the reviewers tell us is true.
By innovative studies, I mean those with evidence -- this is what Armstrong said -- where existing beliefs are incorrect. More generally, innovation refers to all advances in scientific knowledge and concern with important innovations.
Now, what Reed Moore has to say is, it turns out that this is self serving, because he is talking about articles of his that he thought were ground breaking and didn't get published, and he had to call the editor and say, why did you reject my article.
So, conclusions deriving from a number of peer reviewed studies: the scientists agreed that the two most important criteria for evaluating overall scientific merit are societal importance, and reliability under constraints.
So, what are they? For the Journal of Abnormal Psychology -- I just picked that because I used it before -- 23 and 32. The British Medical Journal, pretty much the same thing. Abstracts, the American Association for the Study of Liver Disease, same thing.
So, while scientists agree about the critical value of criteria for evaluating scientific merit, they have very different views about what they mean.
So, the solution, create clear definitions of evaluating criteria and train reviewers to use them reliably. This has been accomplished, and the results are very encouraging.
Coopers and Zangwell(?) published in 1990. They trained reviewers in 15 specific criteria for manuscripts submitted to the Journal of Internal Medicine.
The inter-rater agreement, 86 percent, the kappa is 26.5, for the reviews of submitted manuscripts. Pretty impressive.
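Raw percent agreement and kappa measure different things: kappa discounts the agreement two reviewers would reach by chance given how often each one accepts or rejects. The sketch below computes both for a two-reviewer accept/reject table. The counts are entirely hypothetical and are not taken from any study mentioned here; the function name `cohens_kappa` is likewise an assumption, and the formula is the standard Cohen's kappa for two raters.

```python
def cohens_kappa(table):
    """Return (observed agreement, Cohen's kappa) for a square table.

    table[i][j] is the number of manuscripts that reviewer A placed in
    category i and reviewer B placed in category j.
    """
    n = sum(sum(row) for row in table)
    k = len(table)
    # Observed agreement: proportion of cases on the diagonal.
    p_observed = sum(table[i][i] for i in range(k)) / n
    row_totals = [sum(table[i]) for i in range(k)]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Chance agreement: expected diagonal proportion from the marginals.
    p_chance = sum(row_totals[i] * col_totals[i] for i in range(k)) / n**2
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return p_observed, kappa


# HYPOTHETICAL counts for 100 manuscripts (rows: reviewer A, columns:
# reviewer B; categories in the order accept, reject).
table = [
    [6, 7],
    [7, 80],
]

raw_agreement, kappa = cohens_kappa(table)
print(f"raw agreement = {raw_agreement:.2f}, kappa = {kappa:.2f}")
```

With these made-up counts the reviewers agree on 86 percent of manuscripts, yet kappa comes out far lower because most of that agreement is on rejection, which is the base rate point raised in the discussion above. Kappa can never exceed 1.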
Strayhorn and McDermott didn't really train in the same sense, but they still got results. They provided reviewers with specific material, sort of like the training; they gave them a package of that for the manuscripts submitted to the Journal of the American Academy of Child and Adolescent Psychiatry. Inter-rater agreement improved from .27 to .43.
Now, the best study of all, in my judgement, was done by this group from McMaster University in 1991. Charlie Goldsmith is a statistician. I am a great respecter of his work.
He studied the inter-examiner reliability in evaluating the scientific merit of 36 published review articles for prestigious behavioral and medical journals -- the American Journal of Internal Medicine, The American Journal of Psychiatry, the Psychological Bulletin.
There were three groups of reviewers, three in each group. Three were experts in research methodology, three were MDs with research training, three were research assistants.
The amount of training they got, like about an hour, they pilot tested the instrument that they used before that, and these were the evaluative criteria rated on a validated seven point scale.
They were answers to questions about the status report of reviews, to what extent were research methods reported, to what extent was a comprehensive search made. To what extent were inclusion criteria reported? To what extent was selection bias avoided, and so on and so forth.
These are not simple kinds of things for training, and yet, they did it. Validity criteria, to what extent were they reported. To what extent was validity assessed appropriately. To what extent were the methods for combining the studies reported. To what extent were the findings combined appropriately.
To what extent were the conclusions supported by good data. What was the overall scientific value.
These are the results. Research experts are at .77. Research MDs, .74, very good; research assistants, .62; overall, .71.
If we look at the specific criteria for all nine reviewers: research methods reported, .85; comprehensive search, .65; inclusion criteria, .85; selection bias, .55; validity criteria, .65; validity assessed, .70; methods for combining reports, .60; same for findings and conclusions supported by data.
So, a survey of the research literature suggests strongly that peer review of all scientific documents suffers from both low reliability and questionable validity.
Arguments against reliability and validity assessments appear seriously flawed, and I would say some seriously.
Summary and conclusions and implications. While scientists agree strongly that the contribution, blah, blah, blah, I said that -- they have not been able to agree on what needs to be done.
Therefore, simply making these criteria available to reviewers, which is something that people do, will not improve reliability of the assessment.
In fact, I knew that from years ago, because we did a study, with the Journal of Abnormal Psychology, where for half the manuscripts, more or less, the reviewers were given a set of very explicit criteria to use and rate. Another group of people evaluating the manuscripts were not given the criteria.
You might think, well, of course, the group that has the criteria, they are going to show more reliability than the other group.
The reverse happened and I am thinking to myself, how did this happen? I think if you don't train people to use the criteria, they will look at them, and they make a judgement, after looking at the paper. Then they get this gut feeling that they are going to accept it.
So, once they accept it, they check all the things on the checklist to make that occur. It is not a reliable process without training.
Recent research indicates, as I said, that reviewers can be trained to improve reliability to acceptable levels.
The argument by some that the problems can otherwise be resolved by having a wiser group of reviewers is seriously flawed, because no such criteria exist.
The idea that increasing the reliability and validity of peer review stifles creativity and original thought, in the sense that it restricts judgement, is flawed. That is the take home message.
Okay, let me say something here personally. When I began my scientific career, the reliability and validity of psychiatric diagnosis were so poor that some mental health researchers began to rationalize that these areas were not as important to investigate as previously believed.
However, the field of nosology made a giant leap forward with the introduction of the various editions of the DSM, which provided explicit criteria for making reliable and valid psychiatric diagnoses.
We can disagree with the DSM, but we can always change the criteria. So, we have got something to work with that makes sense.
The decision to fund or publish should be based upon scientifically validated standards rather than on self-appointed wise editors.
A critical part of the standards should include, in my judgement, to reach that point, criteria to distinguish statistical significance from clinical or practical significance, and/or identify threats to reliability and validity for publications and for doing research. Thank you very much.
|