|
DR. GUERON: Hello, I want to thank you all for the opportunity to be here today and I thought I would start, where is the person who is working this? So I’m going to nod at you when we’re changing things. So I want to thank you for the opportunity and start with a few words about myself so you understand the perspective I bring to this topic of implementing field trials. I’m not an education researcher, I’m an economist, and I run a non-profit research company, the mission of which is to learn what works to improve the well-being of low income people and to use that to affect policy and practice. For many years we’ve worked mainly on employment, training and welfare issues, but for the last ten years increasingly on questions about education. We do large scale field research and have been pioneers in using random assignment to evaluate social programs, and my belief in this approach comes from many years of painful experience with challenges to the credibility of alternative approaches.
It’s led me to a bottom line that with random assignment you can know something and can more confidently separate fact from advocacy, and with alternatives this is much less true. In our almost 30 years of work we’ve conducted probably 35 or 40 large random assignment field trials in 250 locations involving maybe 400,000 people, and we’ve never had a serious challenge to their credibility. We’ve also conducted a number of very large, expensive non-random assignment studies and I’ve had the pain of subsequently sitting, sometimes on National Academy panels and sometimes on others, where people argued about whether they believed the results. As a result my view is that randomized field trials involve up front challenges and clear costs, but there are also clear political and financial costs for evaluations that end in methodological disputes. But I want to acknowledge up front, as the previous speaker did, that RFT’s don’t answer all questions and are not always the right approach, not always feasible, and also that many studies and approaches don’t merit this kind of evaluation.
I’m going to start today with a few comments, summary comments, and then talk about what I see as eight key challenges to implementing these studies, and end with some comments on the relative cost of randomized field trials.
In terms of conducting such studies our experience suggests that implementing a successful RFT is an art, an art that involves a continuing interaction between creative and flexible research design skills and operational and political savvy; that was implicit in what we just heard as well. This art plays out in the four overlapping stages of a study. The first stage is start up, when you want to be clear about the theory, when you want to develop a design that is realistic and addresses the right questions, when you want to try to sell it in the field, and when you’re then in the process of recalibrating that design to what is actually sellable, and there can be a number of recalibrations. And during that period you’re involved in resisting starting random assignment too soon. These stages by the way are not so distinct but I’ve set them up as four.
The second stage is operational, when you’re policing the implementation of the design, the E&C, experimental and control treatments, when you’re collecting equivalent and high quality data on people in both groups, when you’re collecting adequate process and contextual data.
The third stage, next slide, is the analysis, when you’re estimating impacts and possibly benefits and costs and explaining the impact, and the fourth is the diffusion of that knowledge, interpreting and promoting the findings, getting them out so they can have an impact.
The key lesson on the next slide that I draw from these four stages is that stage one is the make it or break it time for an RFT, it’s where the art comes in, it’s also where RFT’s cost more than alternative designs, and in stages three and four I would argue they cost less. In education I would anticipate, my work has primarily not been in education, but I would anticipate that it will be very hard, especially in the new world with an emphasis on performance standards, performance driven testing, etc. But success in stage one requires the involvement of very senior people, people who know that art and can practice it. That can’t be done on the cheap in terms of resources or time, and this would suggest real thinking about doing a few of these well, particularly when you’re in a field like education where you have to prove this can be done: do a few things well rather than blow it by trying too much and having it fall apart in the implementation.
Resist trying to answer all questions in one study, I’ll come back to that. Initially focus on assuring internal validity rather than a blind quest for external validity, and I’ll say more about external validity later. And remember that knowledge accumulates from many studies including syntheses, and it’s really the theme of science that I think was in the predecessor volume to this committee.
I want to turn now to what I see as the eight major challenges in implementing such studies, starting with one. So the first challenge is to be sure that the evaluation is addressing the most important question, you’ll notice a consistency with what was said before. Random assignment is a powerful tool to determine whether a program or a reform has an impact, it’s not directly a technique for asking about feasibility, replicability or going to scale. But if impact is important the next question is a more refined one, what are you really after? Is the question whether a particular treatment beats the prevailing services available in that community? Or is the question whether that treatment per se, compared to no services, is of value? Or is it whether treatment A is more valuable than treatment B? Once it’s clear which of these questions you want to answer the next challenge is determining whether you can answer that question, and the answer may be no, particularly for question two.
This compared to what issue may sound very simple but I would say in our work it has been the most profound. The tendency when you’re implementing a random assignment study is to focus on what happens to the treatment group to make sure that the treatment is implemented, that it gets a fair test. But what we have found, and as the previous speaker mentioned, is that it’s just as important to keep your eye on the control group, because it’s really obviously the E-C difference that you’re measuring. The key is to assure that there is a real and meaningful difference between what happens --
-- [End of tape.] --
-- and that you care about that question. And this is in the field easier to achieve if what you’re testing is new, scarce, different from the background level of services, and this may push you toward randomly assigning schools rather than students. I would conclude on this particular challenge, challenge one, that this is not unique to randomized field trials but relevant to any evaluation using a comparison group, and it’s one of the reasons why many high quality studies find modest impacts.
The second challenge is meeting ethical and legal standards. In RFT’s it’s imperative to take this issue seriously. Inadequate attention can provoke the cancellation of a study and really poison the environment for future work. In education you absolutely don’t want that to happen. I made a list on this slide of some of the standards, some of the issues, ethical and legal; there are others, but the key ones to me seemed to be: don’t deny people access to services to which they are entitled, that may sound obvious, it has come up. Don’t reduce service levels. Be sure you’re addressing important and unanswered questions. Include adequate procedures to inform students and teachers. Assure data confidentiality. Use RFT’s only when there’s no less intrusive way to get the same quality information. And have a high probability of producing results that will be used. Suspicions about the ethics of researchers run deep, staff from our organization have been called many names, you bear the scars of studies that have gone before you and of scientific inquiry that has not paid attention to these issues, and this means that researchers really need to pay attention, to follow the highest standards, and to be open and forthcoming in the field.
A key to making the argument that random assignment is ethical is to be clear that you’re not denying people access to a service to which they’re entitled, and again, things that facilitate this are if the program is new, if there are not enough funds to serve all eligibles, and if you won’t as a result of the study decrease the number of people receiving services in the community. In this environment we have found it is possible to convince people that random assignment is an ethical way to allocate scarce program opportunities, such as access to vouchers or magnet schools in education, and there are a lot of examples in other fields. Most of these ethical issues that I list here are also not unique to randomized field trials, but it’s the lottery element that is unique and it seems to raise the stakes on all of the others.
I want to conclude on this slide with one thing I haven’t put up there, which is compensating controls, members of the control group, I haven’t really found that in our work, that was as much an ethical issue as a site recruitment issue. It helps program staff, teachers, school administrators feel better if they can offer something to the controls, but if you’re involved in running these studies and you care about challenge one, which is what question you’re going to answer, you really want to fight off providing a new and alternative treatment for the controls because it can undercut the whole study.
The third challenge is convincing people that there is no easier way to get the answers, and that the findings will be good enough to affect policy. One factor that has helped us enormously in implementing randomized field trials has been unambiguous support for the superiority of this approach from the research community, not those of us trying to sell the study, because we obviously can be viewed as having a conflict of interest, but people like yourselves in prestigious universities and on panels like this one. People don’t wake up in the morning wanting to be in a randomized field trial, the easier road is an alternative design. Strong statements of the unique value of this approach have proven really critical to stiffen the backbone of funders and to help us convince governors and people in the field that this is the way it has to be, so what you do here is not unrelated to that success down the line. I cannot tell you how we waved a past National Academy study in the field in trying to launch employment and training random assignment.
In the field where I’ve worked for the last 25 years, in welfare studies, this carrot of quality, however, was also backed up with a stick, which grew out of federal waiver authority that Congress had put in the Social Security Act that tied funding and flexibility to the acceptance of random assignment; it was both in the law and in the way that federal officials in Democratic and Republican Administrations interpreted that law. In education I would urge you to look at whether there’s any similar incentive structure you can put in place.
Sites resist RFT’s because of concern about burden, which I’ll discuss in a few minutes, and ethics, but also because of politics. It takes courage for political appointees or true believers, be they funders or school reformers, to favor independent studies and the reliable measure of net impact. Aside from the normal desire to control the story, which people have, the challenge comes from the fact that impacts are almost always smaller than outcomes. For example, say you’re involved in a school reform and you can say that 85 percent of the people graduate or pass a test, and then along comes the high quality random assignment study and shows that 75 percent of the people would have graduated or passed the test anyway, meaning that the reform actually produced a 10 percentage point increase. Now it’s much easier to sell success based on 85 percent than on 10 percentage points, and it’s particularly bedeviling if you’re selling the ten percentage points while somebody else is selling their program based on the 85 percent that they are claiming to have accomplished.
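The distinction between an outcome and an impact in this example comes down to one subtraction. A minimal sketch in Python, using only the 85 percent and 75 percent figures from the example; the function name `net_impact` is made up for illustration and is not from the talk:

```python
def net_impact(program_group_rate: float, control_group_rate: float) -> float:
    """Return the impact in percentage points.

    The outcome is what the program group achieves. The impact is the
    difference between that outcome and what the control group achieves
    without the program.
    """
    return program_group_rate - control_group_rate


# Figures from the example in the talk: 85 percent of the program group
# graduate, and 75 percent would have graduated anyway.
outcome = 85.0
counterfactual = 75.0
impact = net_impact(outcome, counterfactual)
print(f"Outcome: {outcome:.0f} percent; impact: {impact:.0f} percentage points")
```

A program reporting only the first number looks far stronger than one reporting the second, even when both describe the same results.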
Educators will want to participate in RFT’s if they believe it will help children or get them special funding or get them prestige or visibility or affect policy, and for the last two to be true, that it will get them prestige or it will affect policy, you have to believe that the findings are going to be used to make a difference. And the key to this is realism: ten percentage point gains are large, and we all need to combat the idea that there are quick fixes, that every dollar spent leads to ten dollars in savings and that’s what it takes to be a good social investment, or that high returns are common. Improvements are usually incremental and hard won, but unless you change the climate so that people accept that this is really what you are going to find from these studies, not if they find nothing obviously, but that these kinds of impacts are important, then people aren’t going to want to do this.
The fourth challenge is balancing research ambition against operational reality. Large scale RFT’s are rare opportunities and this makes it very tempting to sort of pile on everything, to get extremely ambitious and seek to answer many important questions with very complex random assignment designs. My advice is to be ambitious but temper that and resist the urge to answer all questions in one study. If not you really risk ending up with no good answers to any of them because you have too many treatments with small numbers of people in each of them, or designs that are too complex to be robust against the inevitable breakdowns that occur in the field, or the program is compromised to the point that it really isn’t strongly implemented and you don’t get an answer to your core question or sites just get discouraged from participating. In short you’ve got to make sure that the research demands are reasonable going back to that art that I mentioned at the beginning, and don’t let the perfect become the enemy of the good.
Key decisions that can affect the burden that you’re placing on sites or schools include things like the location and duration of random assignment, how do you fit random assignment into the normal recruitment process, we could talk forever about that issue. The message or services that you give to controls, the intrusiveness of data collection, the unit of random assignment, students, schools, teachers. The complexity of the design, two, three points, different stages, multiple random assignment, we’ve done those things in different programs but you’ve got to be a little cautious. The emphasis on external validity and the time a site has to get the program in place before you come along and start random assignment. We could spend a lot of time talking about each of these seven, but I want to say a few words about external validity, number six.
The extent to which a study should emphasize external validity depends on the questions you care about and I’d argue also the stage you are at in the science in that field. The first question you might care about is getting a credible answer to whether the program or the theory of action embedded in the program can make a difference, or does so for people in a variety of different conditions; that would push you toward a focus on internal validity. Or you might care about producing a bottom line number on whether the program makes a difference across all sites in which it is implemented, a national bottom line number. This suggests a focus on external validity and pushes toward having small samples in many sites.
But researchers also care about a third question as we heard from the prior speaker, they care about understanding why programs work so that they can use the information from the study to improve what they’re doing, and I’ll discuss that further in a minute, and to answer that question you need to collect detailed information on the treatment and the context for the experimentals and the controls. Having small random assignment samples in many locations is obviously preferable if you could discipline the random process, if you can get, if it’s a demonstration you can get strong enough programs in place in those many sites, and if you can collect the context and process data. However in this early stage in educational RFT’s when you have to prove that they can be done, that you can get cooperation, that you can answer important questions, and where there’s an interest in testing theories of action, I would urge you to give priority to internal validity. Yes, try to involve multiple and diverse sites, but avoid a blind quest for externally valid samples that risk producing research designs that have questionable internal validity and don’t help in the drive toward program improvement. It’s better to learn something with confidence versus push for external validity and not understand at the end what you’ve got.
In thinking about this it’s useful to focus, as the prior study did, I’m not sure I got the relationship between the panels right, but as the prior study did, on the accumulation of knowledge over time and to think about this as a science. Not even a good randomized study will answer all the questions for all time; think of it in terms of building a research strategy with many such studies and building toward an ability to synthesize across them over time.
The fifth challenge is implementing a truly random process and assuring that enough people actually get the test service. As I said, program staff or school staff dislike random assignment because of the perceived burden, because of the denial of services, of what they believe is a valuable service, they wouldn’t be doing it if they didn’t think it was valuable. All of our studies were of programs funded at levels where there wasn’t enough funds to serve all the people who might be interested, so access had to be rationed. But program staff, school staff, vastly prefer their type of current rationing, the first come first serve, or I know who’s going to succeed, or I know who’s motivated, to a random process where they have to actually confront and turn away people whom they think would benefit from the service. Yet random assignment is an all or nothing process, you can’t just be a little bit random, doesn’t work that way, so to get cooperation for this it’s critical that researchers first control the random assignment process, set it up and control it, and get buy in from the various people involved for the study. And that means as I said reducing the burden on staff, explaining the ethical issues, and making an adequate up front investment in informing and talking with the relevant stakeholders that are going to have to implement the study. It’s expensive to do that right but it is what is your ultimate insurance policy against the study falling apart around you down the line.
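One way to picture a lottery that researchers control, rather than program staff, is a seeded random draw over the list of applicants. The following is only a sketch under assumptions: the function name, the applicant IDs, the number of slots, and the seed are all hypothetical, and the talk does not describe the actual procedures used.

```python
import random


def assign_by_lottery(applicant_ids, program_slots, seed):
    """Randomly split applicants into a program group and a control group.

    Every applicant has the same chance of getting a slot; staff judgment
    plays no part. The seed is held by the research team so the draw can
    be audited and reproduced.
    """
    if program_slots > len(applicant_ids):
        raise ValueError("more slots than applicants; no rationing needed")
    rng = random.Random(seed)
    shuffled = list(applicant_ids)  # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    program_group = sorted(shuffled[:program_slots])
    control_group = sorted(shuffled[program_slots:])
    return program_group, control_group


applicants = [f"applicant-{i:03d}" for i in range(1, 21)]  # hypothetical IDs
program, control = assign_by_lottery(applicants, program_slots=8, seed=2003)
print(len(program), "offered the program;", len(control), "in the control group")
```

Because the draw depends only on the applicant list and the seed, rerunning it reproduces the same groups, which is one simple way to show that nobody was moved between groups after the fact.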
And that means you’ve got to talk very straightforwardly to people in the field about what random assignment involves and what they may hear as the challenges, so they can respond to the press, can respond to parents, so they have internalized what that’s about. What they in the field are likely to learn from this study, why the results are uniquely reliable and believed, going back to why it’s important that you all put some stakes in the ground around those kinds of questions. And how positive findings might convince local, state or federal officials to provide more resources for the program that they’re involved in. And if you do this well, as I said, you have a chance that you can avoid the kinds of changes in the field that can undercut the success of such studies, and one of the statistics at MDRC about which I am most proud is the number of communities that have come back and repeatedly been involved in random assignment studies, so that when I said no one wants to be in them, that isn’t actually true now. We’ve built a constituency for such studies because they actually believe it makes a difference and that it can bring visibility and resources to their community.
The sixth challenge is to follow enough people in enough locations for an adequate length of time to detect policy relevant impacts. In conducting a social experiment it’s important to assure from the start that the sample of students or schools is large enough to detect the kind of impact you are likely to produce, and that you follow people long enough so that you can really get to the outcomes, which may not be immediate, for the treatment that you’re testing. And it’s also useful if you are involved in a number of communities because we have found that replication inspires confidence. If you’re seeing the same kind of results emerge in Arkansas, Baltimore, Maine, San Diego, you know this isn’t just a fluke of the particular location. And again, this challenge is not particularly unique to RFT’s as are many of them if you think about the ones I’ve been mentioning so far.
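To give a rough sense of what "large enough" can mean, here is a sketch of the textbook normal-approximation formula for the sample size needed to detect a difference between two proportions. It assumes individuals, not schools, are randomly assigned, equal group sizes, a two-sided 5 percent test, and 80 percent power; those settings, the 75 percent control rate, and the function name are illustrative assumptions, not figures from the studies discussed. Designs that randomly assign whole schools generally need larger samples than this formula suggests.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_control, p_program, alpha=0.05, power=0.80):
    """Approximate number of people needed in each group to detect the
    difference between two proportions with a two-sided test.

    Uses the standard normal-approximation formula for comparing two
    independent proportions with equal group sizes.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_program * (1 - p_program)
    effect = p_program - p_control
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Smaller expected impacts require much larger samples.
for gain in (0.10, 0.05, 0.02):
    n = sample_size_per_group(0.75, 0.75 + gain)
    print(f"{gain * 100:.0f} point impact: about {n} people per group")
```

The required sample grows roughly with the inverse square of the expected impact, which is why modest but policy relevant impacts call for large studies.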
The seventh challenge is collecting reliable data on a range of outcomes, this was mentioned earlier, and linking treatments to outcomes, again a theme from before. An RFT begins with some theory about how the dimensions of context and treatment and the characteristics of people produce change, and researchers have ideas about this as do program administrators and key local and national stakeholders. It’s important at the beginning to bring those people together early to solicit their thoughts on the key questions and the key outcomes. If people own the questions and if they see the project as their study, answering something important that they care about, they’re more likely to stay the course and help you get the answers. One challenge here is to measure a broad enough group of outcomes because studies, programs have unintended consequences, we’ve seen that in much of our work, and you want to cast a wide enough net so that you don’t miss them.
But another and harder challenge is the one the previous speaker also mentioned, which is getting inside the black box of the treatment in trying to understand what factors determine success or failure. Here in welfare and employment research I actually think there’s been really impressive progress. For many years it wasn’t apparent, but I think in recent years there really has been an emerging vision of the potential of using randomized field trials to answer important questions about the relationship between inputs and outputs. In that process we’ve gone really through several stages. The first was a focus on descriptive work, using in depth data on programs to describe the qualities of those sites or schools or whatever and link that to those that were shown to be successful in the RFT. A second was using the strength of random assignment directly and assigning people to different conceptions of the treatment, different philosophies, different elements of the model. But the reality is there’s only so much of that you can do, you can’t have ten treatment groups in one site or I think you’d lose the study. Another approach has been various forms of synthesis across studies, many sites, lots of qualitative data, reaching judgments about what factors have been associated with the more successful and less successful sites.
And finally combining similar or identical data across multiple experiments, and I really see the outlines of a new frontier in the analysis summarized in an article by Howard Bloom, Carolyn Hill, and Jim Riccio in the current issue of the Journal of Policy Analysis and Management, which describes their effort to provide a statistical framework for discussing what determines success. And what they did in that study was combine three major social experiments, conducted in 59 locations, so it was really 59 little experiments involving 69,000 people, where these studies unfolded over more than ten years but where identical data had been collected on program management practices, the economic environment, client characteristics, dimensions of the program, and impacts and outcomes. And the analysis embedded in that paper I think really points to sort of a brave new world which we can hold out for the future, maybe attainable from multiple random assignment studies, but it also points out the value of collecting similar data across experiments.
The eighth and final challenge that I’ll discuss is assuring that people get the right treatment and enforcing this over time, again our prior speaker talked about this. Random assignment itself is just a gateway to placement in different treatment groups, but a process that starts out random can end up not random if you can’t police that over time for the single or multiple treatment groups and people in the control group. Keeping services received by different groups distinct means that for the duration of the study people in those research groups have to be handled appropriately and offered or denied clearly defined different services. And that’s easy if the service is very short term, you know a three week job club, that’s pretty easy, I mean it’s still hard but that’s pretty easy compared to some of the things you’re talking about in education. And it’s much more difficult in programs with multiple services that extend over many years and many sites, if controls and experimentals are in the same school, or if two or more treatments are provided by the same teachers in the same schools, though we’ve certainly done that in other fields, in the same, let’s say, welfare or job training office. In short the longer and more complex the program the more costly, burdensome and politically difficult will be the enforcement of such procedures. And this is one reason why I said earlier that in studying the impact of education reform using RFT’s you should anticipate that this will be difficult. But again, except for the third point listed here, these challenges are not unique to RFT’s.
Now in conclusion I want to turn to a question that Lisa Towne raised about costs, I’ve only communicated with her by email so I don’t know how you pronounce your last name but I’ll try that one. Are RFT’s more or less expensive than quasi experimental causal studies? Are we talking an order of magnitude difference here or what? Now I didn’t have the time or in some cases the data to do a scientific study of this issue, and I think it’s actually a very interesting issue and I think it should be done. But my impressions, and these are impressions, are that high quality large scale field studies with primary data collection are very expensive, and actually they’re getting more expensive. And this is true for RFT’s but it is also true for the alternatives. In welfare and employment and training studies, where there have been lots of both types and which I know best, major studies involve 10,000 to 40,000 people and they could easily have cost $10 to $25 million and extended from five to ten years, I’m talking about the big path breaking studies, and they could cost more than that.
But my impression is that if studies are strictly compared in terms of sample size and type of data collection, whether they were an RFT or a comparison group design, the RFT would cost more in stage one, that’s where it costs more, and take more time up front. But that’s somewhat offset by the savings later, it is so much easier to analyze an RFT, I mean it basically analyzes itself, it’s so simple it’s practically not even interesting enough for academics to want to do the analysis. It’s not so simple to ask the why does it work question. But even though it’s more expensive up front, and you may recoup some of that in the ease of explaining the results or the ease of doing the analysis, we’re not talking about order of magnitude differences. There have been $50 million non-RFT studies, too.
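The sense in which a randomized trial "analyzes itself" can be illustrated with the simplest estimator: the difference in mean outcomes between the two groups, with a standard error. This is a sketch with made-up data, not the method of any particular study; real analyses typically add covariates and account for the unit of assignment.

```python
import math
from statistics import mean, variance


def estimate_impact(program_outcomes, control_outcomes):
    """Return (impact, standard_error) for a simple two-group experiment.

    The impact estimate is the difference in mean outcomes; the standard
    error uses the unpooled sample variances of the two groups.
    """
    impact = mean(program_outcomes) - mean(control_outcomes)
    se = math.sqrt(
        variance(program_outcomes) / len(program_outcomes)
        + variance(control_outcomes) / len(control_outcomes)
    )
    return impact, se


# Hypothetical data: 1 = graduated, 0 = did not graduate.
program = [1] * 85 + [0] * 15
control = [1] * 75 + [0] * 25
impact, se = estimate_impact(program, control)
print(f"Estimated impact: {impact:.2f} (standard error {se:.3f})")
```

Because random assignment makes the two groups comparable at the start, no statistical adjustment for selection is needed to interpret this difference as the program's effect; explaining why the program worked is the harder analytic task.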
But I’d argue that if you look at a different measure, at cost effectiveness, the ratio of the impact on policy to the dollars spent, you’d reach a very different conclusion. Well implemented RFT’s are the studies that people repeatedly cite, believe, and also wrap themselves in. Whether it’s in child care, youth programs, welfare reform or training the few valid RFT’s have carried programs, have carried fields for 20 years. In contrast many of the large scale non-random assignment studies have been challenged or had trouble answering refined impact questions, and many have been consigned to the dustbin of studies that just didn’t, haven’t had a durable impact. Now did that save money?
I was looking recently at an article Tom Cook had written, published last year, where he makes a related but different point, and his argument is that even if individual random assignment studies cost more than others they may be less costly in the long run because fewer are needed to reach the same degree of confidence on causal conclusions. So if you’re going to look at cost I would urge us to look at the right bottom line in thinking about that question.
Thank you.
|