MR. WHITEHURST: Thank you very much. I am pleased to be here to talk about peer review. One of the realities of my life is that I have to be a jack of all trades.
My first address this morning was about charter schools. I really wasn't quite sure what a charter school was until about a year ago. I gave an address a couple of weeks ago on math education, a subject that the audience would say I still don't know too much about.
I am often running around talking about things where I have to get up to speed very quickly.
Peer review is, forgive me my conceit, something that I actually think I know something about. So, it is a pleasure to be here and not feel terribly anxious that I am talking where, if I got a question, I wouldn't know what to say.
I was a journal editor for over 21 years, and I tried to do some math about it this morning. I think I have personally been involved in the review process for at least 2,500 manuscripts.
As is probably the case with many of you in the audience, I have seen the flip side of that, too, as the recipient of reviews from editors and research funding organizations.
Personal experience is sometimes a dangerous thing, but I do have a fair amount of it to bring to this topic.
I also stuck my toe in the water for a while and did some research on the reliability of the peer review process.
For those of you who haven't looked at that literature, I think you will get some information on it later on today.
I must tell you that peer review is a very unreliable tool. The typical reliability for inter-rater agreement on reviews for scientific journals or NSF is somewhere around .30.
If you are submitting an article for publication that involved observational methodology, you would expect to have reliability of about .8 or so before anybody would look at it.
Yet, we run our science agencies and our journals based on very low rates of inter-rater agreement, and that is a problem.
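[Editorial illustration, not part of the remarks.] The agreement figures quoted above can be read as a correlation between two reviewers' scores for the same set of submissions. The sketch below assumes a Pearson correlation, which the speaker does not specify, and uses invented scores and variable names purely for illustration.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance and variances are left unnormalized; the 1/n factors cancel.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical ratings of six manuscripts by two reviewers (1 = poor, 5 = excellent).
reviewer_a = [4, 2, 5, 3, 1, 4]
reviewer_b = [3, 4, 4, 1, 2, 5]

agreement = pearson(reviewer_a, reviewer_b)
print(f"inter-rater agreement: {agreement:.2f}")
```

A value near 1 means the two reviewers rank the submissions almost identically; a value near .30 means that knowing one reviewer's score tells you little about the other's.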
We have here a tool, peer review, that is certainly flawed in some respects. It is like democracy. It is a horrible way to get things done, but there doesn’t seem to be a better alternative for accomplishing the task.
If peer review is a tool, I think it is very important to decide, a tool for what? Just what is it we are trying to accomplish as we use the tool?
The tool is general purpose, but its functions, its goals, will differ depending on the agency or the context in which it is used.
With that as context, let me try to tell you what I think the function is, or the goal is, for the Institute of Education Sciences.
We have to start with the Institute’s mission, before deciding whether peer review is serving that mission well, as it is designed and used, and if not, whether it can be redesigned to serve the goal.
About a year ago, we finished a project at the Institute that involved a survey of a variety of constituencies for products of the Institute's funding, research activities.
We surveyed school superintendents; we surveyed state school officers; we surveyed people on the Hill -- in general, people who are in positions to utilize the results of education research.
We asked them a number of questions, including questions about the relevance of the education research that they had been exposed to.
The results we got were sobering. Let me give you a couple of quotes. I think they are, in general, representative of what we have heard from these decision makers. These are reports from local superintendents.
“Assessments, brain science, and demographic studies of education cannot give teachers concrete ideas about teaching. Research should draw direct implications for practice and have direct connections to practice activities.”
Another superintendent: “Most researchers do not intend to do research to inform practice. They have their own interests and their own practice.”
A third superintendent: “There may be less than one percent of the existing research that is really meaningful for me. Much of it is for researchers, for getting funding, for career advancement or for advocacy.”
Thirty of the 40 superintendents -- these were open-ended interviews -- touched on this theme, that the research that they are exposed to, generally funded by federal agencies, is not serving their needs. It is serving the needs of the research community. It is not serving the needs of the practice.
If you take that view; that is, if you take the view that the role of the Institute of Education Sciences is to sponsor and disseminate research that is useful to practitioners, necessary actions flow from the commitment to that type of research activity.
Many of you will be familiar with the Stokes 1997 book, Pasteur's Quadrant, that argued for use-inspired basic research, Pasteur being the model of that sort of research; basic research, but basic research that had, as its ultimate goal, solutions to pressing, real problems.
Stokes was not thrilled with Edison's quadrant, that is, the pursuit of research goals that are directly applied, that are practical and have problem solution as the end goal of the research process.
In fact, what the Institute is going to focus on is Edison's quadrant, that is, research that is of immediate relevance to the solution of education problems.
That is what we are about in large part -- that does not mean that we are not going to sponsor research that is occasionally in Pasteur's quadrant. It does mean that we are primarily focused on doing research that is relevant to education practice.
Given that decision, the question is: how is peer review a tool toward that end? I will come back to that question, but I want to get there by giving you a retrospective of peer review at the Office of Education Research and Improvement, as I found it when I showed up first as a consultant in April of 2001, and as the assistant secretary in July of 2001.
Here is how the system worked. With few exceptions, research competitions were field initiated. So, an announcement would go out periodically, sometimes twice a year, sometimes once a year, sometimes more times in a year, for anybody in the field to apply for funding on any topic that they were interested in pursuing.
What OERI would get, in that sort of call, was a very large number of applications, 400 or 500 applications on a very broad spectrum of topics.
There was also a great range of quality of the applications.
When you have a very large number of applications, it generates certain characteristics of the peer review process.
What did it look like? Well, if you have 400 applications you have to have large numbers of peer reviewers.
You can't ask a single peer reviewer to read 40 applications and make judgments. So, the typical method of operating was to have multiple panels of five or six members divvy up the total applicant pool into a number that each panel could deal with.
They would review the applications. They would meet as individual panels and they would score the applications. In the end, you have these 400 applications scored by, let's say, eight different panels.
A statistical procedure was used to equate scores across the panels. Every panel’s scores were centered on the mean of the total scores in the distribution, and sometimes the variance was adjusted as well. After the application of these statistical procedures, a tentative funding slate was produced with the applications rank ordered.
That would go to the assistant secretary, and the assistant secretary was expected to fund that slate down to the point at which he or she ran out of money.
What is the result? First, if you are a scientist, researcher or scholar applying for funding, what you get out of that process is a relatively low probability of having your work reviewed by anybody who understands anything about what you want to accomplish.
This process of large numbers of reviewers and large numbers of applications and farming out the applications to reviewers ends up being chaotic, and the chance of expertise is relatively low.
I believe you have on the agenda of this meeting a report, taking a look at the quality of review in OERI. One of the conclusions of that report is that there were some panels on which there was no member who had any expertise with respect to the topic.
Thus there is a low probability of expertise among the reviewers, and very little check on the process. The statistical adjustment makes sense in a certain mind set, but I can tell you that it often generates very curious findings.
For example, if you are on a panel, and there are competent people on that panel with you, and you looked at a bunch of applications and they all happened to be lousy -- it happens -- and you give them all low scores, corresponding to your standards, the statistical adjustment will pull those low scores up to the mean of the distribution, or something close to it. So, it destroys the judgment of the panelists that their applications were generally of low quality.
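[Editorial illustration, not part of the remarks.] The effect described above can be seen in a small worked example. This is a minimal sketch that assumes the adjustment simply shifts each panel's scores so that the panel mean equals the mean of all scores; the optional variance adjustment is left out, and the panel names, scores, and function name are hypothetical rather than OERI's actual procedure.

```python
from statistics import mean

def center_on_grand_mean(panel_scores):
    """Shift each panel's scores so its mean equals the mean of all scores."""
    all_scores = [s for scores in panel_scores.values() for s in scores]
    grand_mean = mean(all_scores)
    adjusted = {}
    for panel, scores in panel_scores.items():
        shift = grand_mean - mean(scores)  # positive for a low-scoring panel
        adjusted[panel] = [s + shift for s in scores]
    return adjusted

# Hypothetical raw scores: panel_b judged all of its applications to be weak.
panels = {
    "panel_a": [80, 70, 90],
    "panel_b": [30, 20, 40],
}

adjusted = center_on_grand_mean(panels)
for panel, scores in adjusted.items():
    print(panel, scores)
```

After centering, the two panels' adjusted scores are identical, so the second panel's judgment that its whole batch was weak no longer shows up in the rank ordering.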
The process is one, in general, that resulted in the funding of grants that were of uncertain quality. Further, because the process was field initiated, it was not one that aligned with any priorities for solving problems.
One of the consequences of the choice of use-inspired research or applied research as a goal, is that you will never accomplish that goal if you let the field decide what the priorities are, because there are thousands of priorities, and each of those priorities has a rationale or has a sense to it.
Use-inspired research has to be focused, given limited resources, both the financial resources and the scholarly resources in the field. There has to be a focus if you are going to be able to initiate and sustain enough research on the problem to actually generate solutions.
So, this combination of the field being allowed to obtain research funding for any topic, and a connected peer review process that was uncertain in terms of its outcome, generated research that in large part -- there are exceptions -- was not particularly good and was not particularly relevant for solving problems.
Even the research that was good and was relevant, was so isolated that no one else was doing work on that problem, and the chance of getting any re-funding was small. It was just a little blip.
What we have been doing at the Institute is deciding on what our priorities should be and translating those priorities into aligned funding processes. This is an interim process until our priorities can be approved by the National Board for Education Sciences, which has this responsibility under our statute.
Once that board is established -- and it could take a while, because board members are nominated by the President and confirmed by the Senate, and that is not a quick process -- but once that process is complete and we have a Board on board, then they are responsible for approving priorities that are proposed by the director of the Institute.
We can't be dead in the water while we are waiting for perhaps a year for the board to be founded. So, what we have been doing is funding research programs that we believe are likely to be consistent with the priorities that will be proposed, and we hope, approved by the Board. We have established peer review procedures as a tool to support the accomplishment of the goals of those funding programs. While people outside the Institute are very interested in the priority setting process and concerned about its outcome, the programs that we are pursuing, as reflected in our research competitions of last year and this year, are consistent with priorities that most reasonable people would agree are appropriate.
Some people would want to add some things to the list. It is less likely they would want to take things away from the list. I have heard complaints about lots of things, but I actually haven't heard complaints about the priorities as reflected in the funding competitions.
For those who aren't familiar with those, here are some things that we are doing this year. All of these are available on the web page.
We are supporting research in English language acquisition to evaluate different curricula, different methods for dealing with children for whom English is not a first language.
We have a research program on teacher quality, on effective mathematics and science education, on social character development in the schools, on useful curricula for prevention of behavior problems, on cognition and student learning, and the inter-agency educational research initiative, which is a collaborative project between NSF, the agencies and NICHD.
Listing for you those research competitions is implicitly the first point in my outline of how what we are doing now is different from what we were doing before.
So, rather than having open field-initiated competition, where people can do anything that they want to do, we said what we are interested in through our funding announcements.
Some of our funding announcements are quite explicit about what we are interested in. They narrow down the foci quite considerably.
In teacher quality, for example, we are focusing the research competition on issues that have to do with teacher quality as it plays out in reading in first grade and mathematics in sixth grade. We would like to do longitudinal studies; we would like to have shared measures across the researchers. Unless we focus on a particular age, grade and a particular content area, we can't do that.
Thus we are focused, and sometimes quite focused, in what we want, as opposed to allowing the field to decide for itself what it wants to do.
Our funding announcements vary in quality, as anything does, but I think they are, in general, clear about the background of what we are interested in and why.
Thus one of the ways to deal with some of the issues that are encountered in the context of peer review, is to screen.
A journal editor would screen by having the focus and purpose of a journal clearly understood, so you don't submit to this particular journal if you are doing work that is not of interest to the journal. The scope of the journal, the call of the journal is clear both in its written materials and history and in general understanding of the field at the time.
The corresponding function at the Institute is the funding announcements. Let's be clear what we are interested in, so that we can cut down on the applications that really are not relevant to what we are trying to do.
Once we have focused the competition, the process of generating peer review panels is simplified. Now we can go for people who have content and methodological expertise in the topic that is the focus of the competition.
If we are running a competition on teacher quality, we can have panelists who are knowledgeable about that topic.
Our first try at this last year was a learning process. One of the things that we found was that, even with focused competition, sometimes we got far too many applications to be dealt with efficiently by a panel of reasonable size.
Last year, our procedures were such that we grew the panel as necessary to deal with the volume of applications.
On some occasions last year, we had 35 people sitting around a table trying to deal with all the applications that were submitted. That didn't work.
Those 35 people would not read all the applications. The primary reviewers were showing up at the panel meetings and leading the discussions and the other people would follow along, because they hadn't done their homework, because we didn't expect them to do the homework because of the number of applications. We ended up with a process that looked a bit like an NIH review panel, but did not function like one at all.
This year, we have established a triage process. Our review panels, unless we are dealing with topical areas that generate a small number of applications, are always going to be about 20 people, and they are going to deal with about 30 applications. They will spend two days doing that.
The primary reviewers, out in the field, will review as many applications as come in. Basically, we lop off those below the top 30 based on the scores generated by the primary reviewers, who are reviewing the applications and rating them before the panels show up in Washington. We end up with the top rated 30 as the grist for the mill of the peer review panel.
That process can be modified two ways. A reviewer can argue for an application to be considered even though it didn't make the cut, because of a sense that the application is more deserving than was suggested by the final score.
The scientific review officer can also suggest that something be considered that didn't make the cut. Often, that would occur because of substantial variance between the scores generated by the primary reviewers. Somebody thought it was great, somebody thought it was lousy and, on average, it didn't make the cut.
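[Editorial illustration, not part of the remarks.] The triage steps described above can be sketched as follows. This is a hedged sketch under stated assumptions: higher scores are taken to be better, ranking uses the mean of the primary reviewers' scores, and the disagreement measure (the range of scores) and its threshold are invented for illustration. The function and parameter names are hypothetical, not the Institute's actual procedure.

```python
from statistics import mean

def triage(primary_scores, cutoff=30, rescued=(), spread_threshold=None):
    """Return (applications sent to the panel, applications flagged for a second look)."""
    # Rank applications by mean primary-reviewer score, best first.
    ranked = sorted(primary_scores,
                    key=lambda app: mean(primary_scores[app]),
                    reverse=True)
    selected = ranked[:cutoff]
    below_cut = ranked[cutoff:]

    # A reviewer may argue for an application that missed the cut.
    selected += [app for app in below_cut if app in rescued]

    # Large disagreement among primary reviewers is flagged so the
    # scientific review officer can suggest the application be considered.
    flagged = []
    if spread_threshold is not None:
        for app in below_cut:
            scores = primary_scores[app]
            if app not in rescued and max(scores) - min(scores) >= spread_threshold:
                flagged.append(app)
    return selected, flagged

# Hypothetical scores from two primary reviewers per application.
scores = {"A": [9, 8], "B": [7, 7], "C": [10, 2], "D": [5, 5], "E": [3, 4]}
selected, flagged = triage(scores, cutoff=2, rescued={"D"}, spread_threshold=5)
print(selected, flagged)
```

In this toy run the cutoff is 2 rather than 30. Application C misses the cut on average even though one reviewer rated it highest, which is the "somebody thought it was great, somebody thought it was lousy" case.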
Another problem that we had last year, and certainly is characteristic of the history of OERI, is the process of selecting the panel members.
That process historically has been one that has been driven by the staff of OERI. So, what would happen is that the people responsible for generating the funding program, for monitoring the program, for managing the program, would find reviewers to be on the panel.
There is something to be said for that. The program officers, who presumably understand the area, may know the people who are most likely to be qualified reviewers.
The negative aspect is the potential for bias or at least the appearance of bias. It is difficult for a program officer to select reviewers and also do his or her job by trying to develop the interest of researchers, encouraging them to apply, helping them with their applications, making stronger applications, all functions that program officers ought to be involved in. In essence the program officer is in a conflicted role as an advocate of and advisor on certain proposals, and at the same time being the manager of peer review of all proposals.
We are attracted to the NIH process in which there is a separation of program officers from scientific review. We think that independence is important to allow the program officers to be unfettered in terms of their development of applications and supportive capacity and technical assistance in the field.
What we have done this year is contract out management of our peer review. We established a contract with an organization called Analytical Sciences, Incorporated. Analytical Sciences, Incorporated (now Constella Group) is a private firm staffed largely by former NIH Office of Scientific Review personnel.
Their job is to generate and manage peer review panels.
They do this in ways that involve input from the Institute. We generate large lists of potential names. They have their own list, and they use referrals. They will call people on our list or their list and they will get referrals.
They establish a chair for the review panel, and the chair works closely with them in order to flesh out the review panel with other members.
So, what we have established with this project is an arm's length relationship, which is important to assuring the integrity of the review.
The challenge is to make sure that the scientific review officers get qualified people on the panels. I am involved in it because ultimately, I am responsible by statute for the outcome of the process.
I can't stand in front of Secretary Paige and say, well, a contractor was responsible, and they blew it. That won't work.
So, ultimately I have the ability to cross out names from the panel, if I think there is a panelist who is not qualified, and I suggest names that should be added.
In general, my function in this area has been to assure that the potential panel members are, in fact, qualified.
It turns out that, although I think I know a fair number of researchers, when I’m dealing with 10 competitions in 10 topic areas, I know relatively few of the potential reviewers.
So, I look at CVs and resumes, to make sure that people have published in peer reviewed journals and they have published on topics that are relevant to the review.
As the Institute and our contractor have advanced, now, for a couple of months on this process, we are beginning to understand each other, and I have to involve myself less frequently in the process, especially as they have selected panel chairs and have started to work with them to get panels in place. In future years, as the Institute develops greater experience with its panels and the selection of peer reviewers, and as the office of the Institute’s Deputy Director for Science is established and staffed, I anticipate minimal involvement of the director in selecting and staffing peer review panels.
Once the review process is completed, a slate comes to my office for each of the competitions. Both last year and certainly this year, the process was quite different from the way it worked in the past, in that I don't go down the slate to the point at which money is expended.
I go down the slate to the point at which the quality drops off. Ultimately I make the decision as to where to draw the line (as I am required to do by the Institute’s statute). I get a lot of input on how to do that, and often considerable argument about where I have drawn the line.
What we are trying to do is simple. We are trying to fund research that actually has the possibility of contributing to the solution of important educational problems, and we are interested in funding research that is not only relevant, but has the methodological and logical qualities that represent rigor.
The methodology has to be relevant to the question that is being posed. There has to be a counter-factual. That is, if there is an hypothesis, there must be some possibility, in the context of this research program, that the hypothesis could be shown to be wrong.
There has to be logical flow in the sense that there is a background in literature, there are applied problems, and the research fits into solutions to those problems.
Putting together a funding slate starts with the list that is rank-ordered in terms of scores provided by the peer reviewers. Within the list, there is information from the distribution of scores that indicates where the quality falls off from the point of view of the peer review panel. In addition to the summary scores from the dimensions on which we ask panelists to provide ratings, we also ask the panels to make judgments as to whether the research is highly fundable, possibly fundable, or not fundable. We also see distribution gaps there.
In addition to examining scores and ratings from the panels, I read every one of the applications that have been recommended for funding, as well as seven or eight below the suggested cut line. Ultimately I make a judgment.
Only once last year did I change the funding order recommended by the reviewers (because in my judgment a highly ranked application was not responsive to a critical requirement in the funding announcement). However, in almost every competition, I chose to fund fewer applications than could have been funded with the money that was tentatively budgeted for that competition.
That is not a happy decision to make because it creates the possibility that we will not send out the door all the money that we have been given by Congress to fund research.
If that happens, the question raised next year is likely to be, “Why do you need all that money? Maybe we should cut back a little bit.”
Unfortunately, the capacity of the education research community to do rigorous and relevant research is not as high as it is in other fields and not as high as we need it to be in some of the areas in which we are funding research.
We have the choice of funding everything we can with the money available and, therefore, funding some things that don't meet the standards that we have promulgated, or being more cautious about what we fund and funding only the strongest applications. That is what we are doing. When we don’t get a sufficient number of high quality applications in one area to exhaust the funds available, we shift those funds to another competition in which the applications are stronger, or we find other ways to invest the funds productively until we can run another competition in the area of focus. At some point in the not too distant future, we hope that our problem will be too many high quality applications, rather than too few.
The idea here, as we move forward, is to focus on priorities that relate to enhancing achievement of students, to have people generate applications that are relevant to those priorities, and to organize and administer peer review so that it allows us to select the applications that are relevant to the goals that we want to pursue.
For us to succeed, our funding processes must have associated with them sufficient clarity, transparency, and integrity to encourage strong researchers and scholars to apply for funding at the Institute of Education Sciences. We need for the cream of the crop of researchers to believe that the process at the Institute is at least as orderly and fair and predictable as it is with the other funding organizations around town and in the nation.
It is going to take us a while to get there, but I think we have made a reasonable start. I look forward to the advice of this panel that we can use as we move forward with this work in process.
I am glad to be here today, and I would be glad to take any questions or comments.