Workshop on Understanding and Promoting Knowledge Accumulation in Education:
Tools and Strategies for Education Research
Day 2 – June 30, 2003
Agenda Item: Ways of Taking Stock: Replication, Scaling up, Meta-Analysis,
Professional Consensus-Building - Dr. Harris Cooper
DR. HARRIS COOPER: Accumulation is possible but it’s not for the faint of heart. Thank you. I’m going to talk about using research synthesis and meta-analysis to inform educational policy and practice. We’ve talked a bit about meta-analysis. Being the 63rd speaker, I had to bring 64 slides with me so that there is one that you may not have heard anybody mention before, but that’s ok because I could probably spend an hour on each one of these slides. We’ve heard these things before as well, but I want to talk about a couple of reasons why the promise of evidence-based decision making in social policy in general, and education research in particular, remains unfulfilled, and we know what at least two of these are. One is that because advocacy groups know how to use research and policy makers know how to use research, whenever a topic is hot enough to have generated more than one study it’s likely going to be the case that folks will be able to find the study that supports whatever position they would like to be supported with research, and they’ll use it. And then also we’ve talked quite a bit about the notion that we spend a heck of a lot of time in the social sciences, probably more than they do in the biological or medical sciences, complaining about how bad each other’s work is, and obviously as policy makers see this kind of behavior it’s going to destroy the credibility of everyone’s work.
There are a couple of characteristics of what we do that contribute to this bad behavior on our part. The first one is that the world is a complex place; we’ve heard quite a bit about the importance of context over the last couple of days. The problem that we have, however, is that this complexity quite often leads to shooting the messenger, which is to say the complexity of the world is seen as being a characteristic of the research rather than of the world itself. So we’re asked for simple answers, and when research says there are no simple answers the researcher is blamed for that, not the fact that the world doesn’t lend itself to simple answers.
It’s also the case, and now this one is one we do have to take responsibility for, that quite often the designs of policy research leave plausible explanations for the impact other than the treatment, policy, or practice that we’re talking about, and that is true of the nature of research, whether we’re talking about qualitative research or random assignment experiments. It’s also the case that the outcomes of research, quantitative research in particular, are probabilistic in nature, and therefore the results of studies will vary simply as a function of sampling error alone. And quite often we mistake sampling error for inconsistency when in fact the most parsimonious explanation for the difference between outcomes of studies is often sampling error, or certainly in every case some part of the inconsistency in results is a function of sampling error alone.
When it comes to doing research synthesis or meta-analysis, research synthesis being the broader term and meta-analysis being the part that involves the use of statistics to combine results across studies, research synthesis can help us with each of these concerns in a particular way. First off, the influence of context can be examined across studies, so that within the context of research synthesis we can look at variation in outcomes as a function of treatment characteristics, participants, settings, times, and outcomes. It’s also the case that because studies will have different strengths and weaknesses we can use triangulation to potentially overcome the weaknesses of individual studies by looking at the pattern of results across an accumulated set of studies, so that if in fact we do get consistent results across sets of studies that do and do not share the same weaknesses we can come to more confident conclusions about what the evidence as a whole claims. And then finally, combining results across studies reduces sampling error, and you can also look at beautiful normal distributions of the effects found in studies and use them to actually see that the world, even the variation in the world, has pattern.
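As an illustration of the point about sampling error (this is not part of the talk itself), the sketch below pools standardized mean differences with a fixed-effect, inverse-variance weighted average, one common way of combining results statistically. The four studies are made-up numbers chosen only for the example, the function names are hypothetical, and the variance formula is the usual large-sample approximation for a standardized mean difference.

```python
from math import sqrt

def d_variance(d, n1, n2):
    """Large-sample approximation to the sampling variance of a
    standardized mean difference d from groups of size n1 and n2."""
    return (n1 + n2) / (n1 * n2) + d * d / (2.0 * (n1 + n2))

def pool_fixed_effect(studies):
    """Inverse-variance weighted mean of the d values and its standard error.

    `studies` is a list of (d, n1, n2) tuples.
    """
    weights = [1.0 / d_variance(d, n1, n2) for d, n1, n2 in studies]
    total_weight = sum(weights)
    pooled = sum(w * d for w, (d, _, _) in zip(weights, studies)) / total_weight
    pooled_se = sqrt(1.0 / total_weight)
    return pooled, pooled_se

# Hypothetical studies: (effect size d, treatment n, control n).
studies = [(0.10, 25, 25), (0.45, 30, 30), (0.30, 40, 40), (0.22, 20, 20)]

pooled_d, pooled_se = pool_fixed_effect(studies)
single_ses = [sqrt(d_variance(d, n1, n2)) for d, n1, n2 in studies]

print(f"pooled d = {pooled_d:.3f}, standard error = {pooled_se:.3f}")
print("single-study standard errors:", [round(se, 3) for se in single_ses])
```

The pooled standard error comes out smaller than that of any single study, which is the sense in which combining results reduces sampling error. A random-effects model would be the more cautious choice if the studies were thought to estimate different underlying effects.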
The purposes of doing policy related research synthesis are, obviously, first to combine the evidence, and second to use systematic and transparent rules to define, gather, summarize, integrate and present the research evidence. The output of a typical research synthesis or meta-analysis in policy related research involves first off an overall estimate of an intervention’s impact, but it also involves testing for variations across the mediating and moderating variables. How have these processes developed? I’m going to go through these pretty quickly, but this is essentially an intellectual history, and for anybody who would like to see this there are a couple of references in here; I can send the references.
But this is not new. Many of us come to it new or it has been believed to be new, but in fact you find at the beginning, almost 100 years ago, Karl Pearson, just after he introduced the correlation coefficient, introduced a method for combining correlation coefficients. In fact he used a rudimentary form of meta-analysis to discover that a particular inoculation against typhoid fever was less effective than the treatment which was being used at that time. Ronald Fisher actually introduced meta-analytic procedures, the combining of probabilities, in the same work in which he introduced the analysis of variance. If you’re interested in these histories you can find a couple of them: Ingram Olkin did one in 1990, and Chalmers et al. did one more recently related to the health professions.
Our interest in research synthesis blossomed in the 1970s when, really independently, three lines of researchers came to the conclusion that you just couldn’t do the cognitive algebra that was necessary to combine the results of studies when you were working with hundreds and sometimes thousands of estimates. So when Glass and Smith looked at the class size literature they came up with about 1500 correlation coefficients. When Schmidt and Hunter looked at the validity of tests of personnel selection and the impact of race on personnel selection test validity they came up with some 800. And when Bob Rosenthal looked at the literature on expectancy effects he came up with 385, and these guys and gals obviously were not going to sit there and do a traditional review. What Glass said about this was that a common method of integrating several studies with inconsistent findings is to carp on the design and analysis deficiencies of all but a few studies, those remaining frequently being one’s own work and that of one’s students, and then advancing the one or two acceptable studies as the truth of the matter. Admit it, you’ve thought about it if you haven’t done it.
Bob Rosenthal told me a story about a colleague of his who would have undergraduates go off to the library with index cards, and they would write out the abstracts of the articles that he had told them to get. They would come back with the abstracts and he’d look at them, decide from the abstract which studies in fact supported his point of view and which ones didn’t, create two piles, and then use the pile that supported it in his next publication.
Use of statistical procedures was precipitated by this increase, and I mentioned this already. In 1980 Bob Rosenthal and I demonstrated that the traditional methods of summarizing research in fact led to underestimates of the magnitude of effect, that we in fact know more than we think we do when we use traditional means. People who used quantitative synthesis procedures came away with greater confidence in their conclusions, greater confidence in the impact of the treatment, or individual difference actually is what it was, that they were studying, and called more often for replication that extended our knowledge rather than trying to determine whether or not a particular effect actually existed.
Lots of textbooks began to appear in the 1980s. If you’re interested in a fun to read popular history of meta-analysis, Morton Hunt wrote one in 1997 called How Science Takes Stock; it’s kind of a fun book. And simultaneously, at the time that the meta-analytic procedures were being developed, there was also an emerging notion that actually doing research synthesis was in and of itself a form of research. Today this probably seems pretty obvious to us, but if you look back to 1970 you wouldn’t have found this to be the case. In fact Ken Feldman wrote this first article partly because he was doing research synthesis as an assistant professor and he ran into some folks in the hallway, tenured professors, who told him if he kept it up he wouldn’t get tenure because it wasn’t research. So he wrote one of the first pieces that dealt with looking at this as a scientific process.
I want to talk a little bit about vote counting because Sunny mentioned vote counting yesterday. Vote counting is a very bad procedure. Essentially what people do with vote counts is they count up the positive, negative and null results, and then the pile with the most studies in it is the winner. This is a really bad procedure, not only because it has low power but because it can actually be demonstrated that with a moderate sized effect, the kind of effect that you would typically not be too surprised to find in education, the more studies that have been done, the more likely it is you will come to the wrong answer: that you will dismiss the effect when in fact a true effect occurs. That’s because by chance alone we really only expect two and a half percent negative significant results, two and a half percent positive significant results, and 95 percent null studies. If you’ve got a moderate sized effect, the more studies you add the more likely it is that those null effects are going to keep piling up. So generally speaking, in doing research synthesis quantitatively people use effect sizes, and this is the d index when we’re talking about treatments.
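The vote counting argument can be checked with a small calculation (an illustration, not something shown in the talk). The sketch below assumes each study is a two-group comparison tested two-tailed at the .05 level, uses a normal approximation for the test statistic, and declares that vote counting "finds" the effect only when the positive-significant pile is strictly the largest of the three. The effect size d = 0.3 and the 30 participants per group are arbitrary illustrative choices, and the function names are hypothetical.

```python
from math import comb, erf, sqrt

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def outcome_probabilities(d, n_per_group, z_crit=1.96):
    """Chance that a single study lands in the positive-significant,
    negative-significant, or null pile, given a true effect d."""
    expected_z = d * sqrt(n_per_group / 2.0)  # normal approximation
    p_pos = 1.0 - normal_cdf(z_crit - expected_z)
    p_neg = normal_cdf(-z_crit - expected_z)
    return p_pos, p_neg, 1.0 - p_pos - p_neg

def prob_vote_count_finds_effect(k, d, n_per_group):
    """Probability that, out of k independent studies, the positive pile
    is strictly larger than both the null pile and the negative pile."""
    p_pos, p_neg, p_null = outcome_probabilities(d, n_per_group)
    total = 0.0
    for pos in range(k + 1):
        for neg in range(k - pos + 1):
            null = k - pos - neg
            if pos > null and pos > neg:
                ways = comb(k, pos) * comb(k - pos, neg)  # multinomial count
                total += ways * p_pos**pos * p_neg**neg * p_null**null
    return total

if __name__ == "__main__":
    p_pos, p_neg, p_null = outcome_probabilities(0.3, 30)
    print(f"per study: positive {p_pos:.3f}, negative {p_neg:.4f}, null {p_null:.3f}")
    for k in (5, 20, 80):
        print(k, "studies:", prob_vote_count_finds_effect(k, 0.3, 30))
```

Under these assumptions a single study is much more likely to come out null than significantly positive even though the effect is real, so the chance that the positive pile wins shrinks as studies accumulate. If the per-study power were above one half, the trend would run the other way.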
I’m going to go through a whole bunch of this stuff, and hopefully you’ll ask me some good questions about stuff that I should have talked about. I want to talk for just a minute about the Cochrane Collaboration, which some of you may know about, which is a web-based collaboration of medical researchers who do high quality research synthesis on high quality medical research. You can find it on the web; I think this may be an old version of their web page, but this is essentially what their web page looks like. If you click on the Cochrane Library you come up with this. This is a subscription, by the way, so you do have to pay for this one. As a subscriber I would then enter my name and my password, and I would then come up with the library. If I wanted to know something about preschool I would put in preschool as the search term. It would then tell me that there were 21 completed reviews that use the term preschool, and there were five reviews within the collaboration that were under construction. It would then give me a list of what those were. It tells me which ones have been commented on, and it tells me which ones have recently been updated, so all of the documents are live documents; it’s not like publication. I then go to a particular one that I was interested in, daycare for preschool children, and I’d find background, search strategy, selection criteria, data collection and analysis, main results, and criteria for inclusion of studies. I’d find implications for practice and implications for research. And this database is available to hospital administrators, doctors, and patients.
The Campbell Collaboration is attempting to do the same thing within the social sciences in general; it has substantive review groups in crime and justice, social welfare and education. And the What Works Clearinghouse, which you all have probably heard about, looks a heck of a lot like the Campbell Collaboration and the Cochrane Collaboration; in fact we like to believe that they stole the idea. But as you know it’s about a $20 million operation being run by the Department of Education. I’m part of the steering committee for that and have been working on the methodological aspects of it, and if you have any questions about that stuff I’d be delighted to answer them when I get a chance.