The National Academies: Advisers to the Nation on Science, Engineering, and Medicine
NATIONAL ACADEMY OF SCIENCES NATIONAL ACADEMY OF ENGINEERING INSTITUTE OF MEDICINE NATIONAL RESEARCH COUNCIL

DR. JUNKER: We have time now for a question and answer period, and Judy and Rich, if you’d come up to the front table here. As I said before we’ll give some preference to the committee members but if anyone else would like to make a comment or ask questions or begin discussion there’s a microphone for anyone in the audience. I invite you to identify yourself if for no other reason than I don’t know many of the faces here and I’m learning, too. So where should we begin? Bob?

DR. FLODEN: Bob Floden. Rich, you were sort of joking about using the word cause but sometimes you talked about establishing systematic effects and sometimes about establishing cause, and I wondered if, well, you also talked about the role of theory in establishing cause and that maybe the difference is between knowing that there is some systematic effect between two things and knowing what it is about those two things that produces the systematic effect, which is more of a causal question. Is that how you think about the difference between systematic effect and cause?

DR. SHAVELSON: Bob is a philosopher of science, so you can see the trap he’s trying to get me to walk into. Bob, I would, I’m always cautious about cause and understanding cause if there’s a systematic effect that seems to repeat itself. I would be willing to say that there’s some evidence of a causal effect. My particular persuasion is going to the next step and saying if I’ve got this systematic effect I want to understand why it’s working. As Lee’s quote said if we understand why it’s working we may be in a better shape to think about the nature of the policy we want to develop than if we simply know that treatment A in all its variability increases by an effect size of .1 or something, the difference if you multiply that against 55 million kids in the country then maybe that’s worth doing. So I probably am not answering your question but for me I’m satisfied in terms of establishing some reasonable causal relationship with the covariation size but personally I can’t stop there, I’m of the persuasion that I also think I need to go to the next step and try and understand mechanism if I can. But if I can is a big thing.
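
[Editor’s illustration, not part of the recorded remarks.] To give a rough sense of what an effect size of .1 means for one student, the sketch below converts a shift of 0.1 standard deviations into a percentile change. It assumes outcomes are normally distributed, which is an assumption made here for illustration and not something the speaker stated; the function name is hypothetical.

```python
from statistics import NormalDist


def percentile_after_shift(effect_size: float, start_percentile: float = 0.5) -> float:
    """Return the percentile (0 to 1) of the untreated distribution reached by
    a student who starts at start_percentile and gains effect_size standard
    deviations. Assumes normally distributed outcomes (illustrative only)."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    z_start = std_normal.inv_cdf(start_percentile)
    return std_normal.cdf(z_start + effect_size)


if __name__ == "__main__":
    # A median student under an effect size of 0.1.
    print(round(percentile_after_shift(0.1), 4))
```

Under that normality assumption a median student moves up by roughly four percentile points, which is small for any one child; the speaker’s point is that the same shift applied across a very large population may still matter for policy.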

DR. JUNKER: If I might follow-up on that, I’m Brian Junker, I was sort of struck, this is not a direct follow-up but I was sort of struck in both of your presentations by a kind of lack of discussion, well, lack of explicit discussion about unmeasured confounds. And I know that’s obviously very important and it’s implicit in much of what you said and it’s certainly implicit in the discussion of impact versus outcome. But I wonder if either of you could speak, with randomization and to some measure unmeasured confounds aren’t an issue, but I wonder if you could speak a little bit more to that as a way of establishing cause as opposed to simply association?

DR. GUERON: Well let me just make sure I’m understanding what you’re saying. Things like did a program make a difference, you’ve done your random assignment study and you’ve found something had an impact in one location more than in another location, and then you’re trying to figure out why. And this may be a result of the treatment and dimensions of it, the message, the services, all sorts of things, or it may be that the economy was growing faster in that location. I’m thinking about in employment and welfare related but you could think of the same sorts of things for education. So if you went, if you ignored that and just kind of went along some smart person would say to you but didn’t that county have a really, what about the employment rate or what about the increase in job growth in that place, and isn’t that the whole thing, that these programs work better if there’s something about the economic conditions. So that goes back to having a theory, so you would want to have thought about that and you’d want to, now this study I was talking about, by Bloom, Hill and Riccio, actually put those kind of things in the model and did find they were important, before that we wrote descriptively that we thought they were important. So I think you, one of the reasons that it is useful to think about what is the theory, why do you think that a program, a program but other things could make a difference is that it will lead you to some of those alternative explanations.

PARTICIPANT: This is a comment for Judith, a question. In your remarks you said that education was a place right now where you thought that studies should be looking more and be more concerned with issues of internal validity. And we know that there are a number of federal programs right now that deal with questions in scale up, taking these smaller studies and trying to bring them to scale in education. So I was curious to know why you felt that about, that we should be focusing more on these internal studies in what you saw as some of the kinds of problems in taking these projects to scale.

DR. GUERON: Well, ultimately I don’t find it a very interesting question to have one bottom line number. You know if the job training system on average across America produced X, what does that tell the job training system about what it should do next? Because you’re going to have a job training system, you’re going to have schools, too. So you want to learn a bottom line number is important, did Congress waste its money, that’s interesting, but, let me make a distinction between evaluations and demonstrations, I think this is really important here. When Congress asks what did the job training system do it’s asking an evaluation question, you want to go out, find out what the job training system produced. But even then the system if it is going to cooperate with your study needs to learn something it cares about. It doesn’t care about that bottom line, it cares about continuous improvement and it’s going to come back at you if all you do at the end is tell it, I give you a D, or you only produced a plus or a minus on average for 40,000 people, what do we do next? If you have random assignment going on in 40 sites randomly selected across the country, if you could even do that, and I have not been ever, it’s very hard to do that, it is very hard to come down and say, rarely does the funding agency have the clout and insist on it adequately to make people participate in such studies. But even if you could you would then want to collect enough process and implementation data on those sites so that you could at the end feed it into your study and have a model that would help you understand some of the why questions.

It’s that question in an evaluation that I think would push you toward it’s simply too hard and expensive to do that in that many places.

Switching to a demonstration, a demonstration is trying out something new, you’re simultaneously going to put a new project in place, a new reform, and find out whether it made a difference using a random assignment study. Putting it in place adequately takes a lot of work and resources, and if you spread that too thinly over many sites you risk having a lot of very weak treatments, and if you study for random assignment a lot of weak treatments you will come up with this not surprising conclusion that no treatment leads to no impact. That’s not helpful, so if you’re trying to demonstrate something I think you have to have the resources, which you’re unlikely to have if you have 50 locations across the country to get the treatment adequately in place so that it gets a fair test.

And in random assignment studies people focus on how there is, you’re choosing the best sites but you’re almost always doing the contrary in my view, you are doing random assignment too soon and programs have not had an adequate chance to really have a shakedown development phase before you start random assignment, that’s why I said in stage one fight off starting random assignment too soon, set yourself some thresholds for implementation before you start enrolling your study members, some level of performance.

So in a demonstration where you’re trying to test out a theory of change or some model you want to get the model in place, you want to be working, somebody working with those communities so that the program is implemented, you definitely want to be understanding something about the treatment and the context so that at the end you can speak to does it make a difference and why with that kind of data that you can bring to bear on the study. I don’t mean by this that I don’t, I also think you want to have adequate samples if you can in your different locations so that you can relate your impact and your implementing findings with enough confidence and that’s hard to do if you have many sites. So while I said many sites would be great if you can discipline random assignment, get the treatments in place and collect enough process and implementation data, I think that’s not feasible. And if you have a choice to make, you know you put these things on an equal status as if external validity without internal validity is worth anything. In my view that’s just not worth very much because what does it tell me if I have an externally valid study, if the thing fell apart in the field because you didn’t put enough work into disciplining random assignment and you couldn’t because you couldn’t do it in so many locations, or if at the end I truly only have one number, which doesn’t seem to me to do enough to advance the science.

It doesn’t mean, I guess my final comment would be in a way these are stages, you conduct your first study, you confirm it with other studies, you do it in multiple locations. You then start thinking about if it’s an evaluation maybe down the line external validity but I wouldn’t start with that. Is this clear? A lot of stuff.

PARTICIPANT: [Beginning of comment off microphone] when it seems to me in education right now we have so many studies and we have so many small studies and are we not at a place that we can really start to think about moving to scale on some of these interventions and I do --

DR. GUERON: I’m sorry, I didn’t mean small. We have done huge studies and they’ve not been focused on rigid external validity, they have involved, maybe I’m using the term very rigidly. 40,000 people in ten locations chosen judgmentally to represent the diversity of America. In a way that’s paying attention to external validity but that wasn’t driven by external validity. It tried to assure you had a variety of places so that at face value it looked like America and it didn’t look like you were choosing the best places or atypical communities. But the challenge, I was reading this quite literally as why didn’t you randomly select the communities in all of America, that’s what I’m responding to. So by all means make it large enough to make a difference, choose a variety of places, be able to say that this could be implemented in different conditions, it’s just rigidly being driven by some kind of blind commitment to external validity doesn’t seem to me the place to go.

DR. DICKERSIN: I’m Kay Dickersin from the Department of Community Health at Brown University. Thanks for your comments, they were really helpful. I should also say that my bias is I do randomized trials. It seems to me that the major challenges facing randomized trials are some resistance by researchers, resistance by the public, and resistance by educators because of a variety of things. I’m wondering in terms of implementation of randomized trials are there any structural changes or organizational things that could be implemented that would help to decrease the resistance. What kinds of things could professional societies do, what about journals? Would systematic reviews of the existing evidence be useful so that people see where we stand at the current time? Are there incentives that are reasonable that could be set up? And how can these barriers be addressed?

DR. GUERON: Well, I already said that I think the academic community is very important and cacophony from the academic community makes it very hard to sell studies because they are hard in the field, so telling America that we haven’t made much progress on education, telling them that to do it we really have to understand what works and doesn’t with a higher level of science and if we’re going to do that we have to do some of the things that other fields now routinely do, and they’re not so different. I mean we’re talking about implementing random assignment in all the welfare offices in San Diego. Why is that easy? That isn’t easy at all, but that field has accepted that this is the way you’re going to learn about change and straight talking in this field by people with authority and that can bring that clout is one thing because everyone wants improvement and you’re not going to get there. So really saying, saying that clearly and getting it out there seems to be important.

There’s almost no money out there, when I started doing this work the first study, it was the first ever national random assignment study of an employment and training program in 1974, we gave out $64 million or something, God knows what that is in current dollars, to programs, or maybe $50 million of it went to programs. That was a lot of money and first of all it was both a lot of money as an incentive, and it could make a program that was really different than the counterfactual. So some operating, I’ve been involved in studies where Congress put a quarter of a billion dollars on the table to try something big. One of the frustrations with random assignment studies is we often test very little things and then we find modest impacts and then we say programs don’t work. So building a constituency for this kind of work and having some resources that you could use to lure communities, I don’t mean that you’re bribing them to do unethical things but that you can help make the incentives not only against you.

Educating journalists is useful, these things can really blow up on you in the field and that’s a long process, educating Congressional staff is important, so when I said the waiver example in welfare, it was a law that you had to evaluate things in order to be able to waive the Social Security Act, so that was useful. But the real useful thing was how the federal staff interpreted that because all it said was you have to do research, they increasingly interpreted that, and OMB did as well, that if you were going to show that it made a difference and if you were going to show that it was cost neutral, it had to be random assignment. And they fought off governors coming to Washington, really heavy political pressure, I don’t know whether there are any things like that, you know Congress wrote random assignment into legislation in other fields, try it here.

PARTICIPANT: [Some of comment inaudible.] Clearly the outcome of an RFT is statistical, so when you conclude in your study for instance on women in college -- along the line of education, really all you can state is that has the biggest correlation -- in the study of the RFT.

DR. SHAVELSON: I’ll answer your question, I have to apologize, I have a class at 4:30 and it’s on the other coast, and this is the first day of classes at Stanford and so I’m about to rush out and try to find a taxicab to get to the airport. In answer to your question, I believe you’re talking about the College of Women’s Careers, that wasn’t a randomized trial, right, it was an ethnographic study and what we did was to then collect the usual demographic data on the women and do the kind of --

-- [End of tape.] --

DR. GUERON: -- we still do, we have a welfare office that was in a random assignment study and continued random assignment when there was no study at the end because it was the only way he could deal with fitting the staff capacity to the inflow. So randomization was not a bizarre thought but people that believe in a program really find it difficult to turn away controls, so you have to be involved in that or they will serve controls or they will find alternative treatment, so it’s all that up front work on selling, designing, that’s where the added money is. As I said I really, I don’t think there’s much added, you have to collect data on the controls but you’d have to collect data on your comparison group, it’s really the same question. So it’s that up front activity.

DR. FALKENBERG(?): This is Karen Falkenberg. I have a question regarding the eight challenges that you put up there and building on what Jack said, if you think about those challenges as they relate to randomized field trials would you from your experience say that there are a couple that may be more onerous than others or would you say that all of these challenges are fairly equal from your point of view?

DR. GUERON: Well, the addressing the important question to me is a scientific challenge for all of us because we’re never really in a no treatment control group environment and that is you certainly aren’t in education. And many people, even if you’ve told them up front that what I am answering is does treatment A do better than the services normally available in the community, they interpret it as the second question, they interpret it as is treatment A of any value, those are very different questions. It’s not a random assignment problem, it’s a problem in any comparison group study, we don’t have non-treated control groups. So I think that’s a real challenge because people keep wanting the answer to that question, can’t you answer that question for me and it’s very frustrating, so that seems to me like a big challenge.

The ethical and legal standards I don’t think are a challenge I just think you’ve got to do that but if you’re not stupid you can do that. Convincing people that random assignment experiments are essential, I think in education you’re not there, we are there in other fields. It doesn’t mean that it’s the only kind of research of value, that’s a different issue, but that if you care about the bottom line it’s the only thing OMB believes, it’s the only, it doesn’t mean that each time you don’t have to sell it again, I don’t want to sound complacent about that, it’s sort of like each generation of new reporters and new Congressional staff and new state legislative staff, they’ve never heard of it before, you have to go through it again, it’s very frustrating. But there is a constituency and you have shown it has been done. I think in education there is a much harder job because there are a lot of people who think you’re answering the wrong question, other questions are more important, and it’s really I think the challenge is helping people understand that the why questions and the ones that they care about are complements, they are not alternatives, that these things fit together. You want to do a randomized field trial and that work, but don’t act like with those other approaches you are estimating impacts, you’re answering a different question, so I don’t think the ambition and controlling ambition I think that’s, no.

I guess in enforcing research status over time struck me as much harder in education and that’s what the prior speaker said as well, I mean these are long interventions and you get, we’ve done a random assignment study of a high school reform program called career academies, well, not everybody got it in the beginning, people left over time, are you still answering the right question? You might be answering the right question, but you have to be clear that that’s the question you’re answering so people come back and you say well what about the group that stayed there for four years, they got the full treatment.

So those issues, as the programs are very long and as the outcomes, the short term outcomes are not so predictive of the long term outcomes, when we started doing these studies in education people said you have it easy in welfare and work programs, the measures that everyone, there’s unity about the measures, everyone agrees that you want to get people off the welfare rolls and you want to increase employment and you want to increase income. Do they agree on the measures in education? And do they agree on short term ones that are good predictors of long term ones? So it’s a harder job.

DR. JUNKER: Take some questions from the audience, this man in the green shirt remember to step to the microphone and identify yourself if you would.

DR. GARCIA: Gil Garcia with the Institute of Education Sciences. I applaud the two speakers for making a very strong in plain English case for randomized trials yet each speaker used a slightly different phrase but it’s the same point. Mr. Shavelson initially in his presentation said that randomized field trials are not always feasible and Dr. Gueron said that they’re not always the right approach. Can you give us a short crisp example of an intervention, a research study, or a question related in your case, related to welfare or job training, where you chose not to do a randomized trial study and that you would argue is as scientifically sound as a randomized field trial study?

DR. GUERON: Well, you added the second part of your question, which I didn’t say when I said it’s not always the right approach. I didn’t say you could get as scientifically sound data. But for example in the 1970s Congress passed a law that said in a certain number of communities across America every poor young person if they agree to stay in school and make progress would be guaranteed a part time job during the school year and a full time job during the summer, so it was a guaranteed jobs program, it was a saturation program. In that study we saturated communities with money to try this out and then looked at the effect it had on employment and other outcomes. Actually maybe it is an answer to your second question as well because during the operation of the program the impact was so large it doubled the employment rate of minority youth in those communities, whether it was double or 80 percent or 110 percent it was so large that you couldn’t miss it with your comparison community design. So the National Academy panel that reviewed that study didn’t have a problem with the in-program finding that indeed they showed that young people wanted to work and how you could deal with that. But they did have a problem with the post program impacts, which started showing the typical five or ten percentage point gains after the program ended and they said how do we know that that relates to this intervention versus changes in labor market behavior in Baltimore, Cleveland, blah, blah, whatever the comparison sites were.

So that is a case where I think we used the right approach, in retrospect I would have rather had some places that were saturation and some were random assignment because I think ultimately there was an issue of whether if you made that, I mean the saturation was essential because there was a question of could you in fact run a saturation job guarantee, I mean were there enough jobs, would they make work, I mean you couldn’t get at that if you did a random assignment study. But I wish I had divided the communities up and had some with saturation and some with random assignment.

DR. GOLD: Norman Gold for WestEd(?). It would seem to me from the discussion we had and the conditions that are put on the requirements of a random assignment study that in fact random assignment in education has very limited kinds of applications. And the reason why you’ve brought me to that conclusion is because in education the environment is so fluid and the controls are so important in terms of the difference between the experimental and control group, it’s putting scientifically rigid requirements on an environment that finds that difficult to deal with. In work that’s done in the field it’s often the case that the comparison group, if you’re using comparison types of studies, change and often become very similar or more similar to the group that you’re studying. And you can learn things from that, I mean if they become similar do they act in the similar way so you can look at dynamic things in a way that helps you to understand the phenomenon that you’re dealing with. But my real question is under what, and if you don’t use the experimental approach and have the rigid requirements to separate the experience of experimental and control group, if you don’t have that you’ve watered down and you’ve spent a lot of money and you’ve watered down the value of the work. So I think, it seems to me that’s been a general argument that I don’t think has really been clarified here.

DR. GUERON: Well let me just say that every statement you made applies to a good quasi experimental design as well, I don’t think this is unique to random assignment. If you care about measuring impacts you would have the same problem if you were following people in a comparison group as a randomly selected control group. I’m not so pessimistic about this, they lead me to say you got to have your eyes open but not that you can’t do this. But they also may lead you toward randomly assigning schools rather than randomly assigning students and that’s a very exciting area and thinking about the power of such studies, how many schools you need, what kind of samples you need, it doesn’t free you from the challenge of partial implementation, full implementation, but you have to think about that as maybe the right question, too. So I don’t walk away with this at all, thinking that this can’t be done, I just think, my advice is get some winners early because everyone, there’s a lot of people who’d like to show it can’t be done but we’ve done it, we’ve done random assignment in schools, other places have as well so we know it can be done. The questions are how much you can push it in what domains and what kinds of questions and that’s to be learned.
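
[Editor’s illustration, not part of the recorded remarks.] One reason the power of school-level random assignment depends more on the number of schools than on the number of students is the standard design effect for equal-sized clusters, 1 + (m - 1) * ICC, where m is students per school and ICC is the intraclass correlation. The sketch below applies that textbook formula; the function names and all numbers (40 schools, 100 students each, ICC of 0.15) are hypothetical and are not taken from any study mentioned in the discussion.

```python
def design_effect(students_per_school: int, icc: float) -> float:
    """Design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1.0 + (students_per_school - 1) * icc


def effective_sample_size(n_schools: int, students_per_school: int, icc: float) -> float:
    """Number of independently assigned students that would carry about the
    same information as the clustered sample."""
    total_students = n_schools * students_per_school
    return total_students / design_effect(students_per_school, icc)


if __name__ == "__main__":
    # Hypothetical design: 40 schools, 100 students each, ICC = 0.15.
    print(round(design_effect(100, 0.15), 2))
    print(round(effective_sample_size(40, 100, 0.15), 1))
    # Doubling schools doubles the effective sample; doubling students
    # per school barely changes it.
    print(round(effective_sample_size(80, 100, 0.15), 1))
    print(round(effective_sample_size(40, 200, 0.15), 1))
```

With these made-up inputs, 4,000 students in 40 schools behave roughly like 250 individually randomized students, which is why the question of how many schools you need tends to dominate the planning of such studies.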

DR. GRIFFIN: Jim Griffin, White House Office of Science and Technology Policy. I think germane even to the last question, and you’ve seen this certainly with welfare reform, we now have huge policy changes in the area of education, we now have standards, we have assessments, we have standards for teacher training, how do you think that, crystal ball gazing a little bit, is going to impact better or for worse on being able to do these types of trials in schools?

DR. GUERON: One problem is that these very big changes themselves can’t be assessed through random assignment and that’s a problem, they are the atmosphere now. Things that raise the stakes make people even more leery of anything else that is going to come along and place burdens on and distract their staff because they’re running full bore, in welfare offices they’re going crazy trying to do the regular business of what case workers have to do, don’t come along and tell me something else, so what do you have to do, you have to make it take less than a minute to do random assignment. And you have to really, really reduce the burden but you may want to do something about that environment and have some waiver possibility about that environment. But that depends on the questions you’re asking, so I mean there is an interaction between that environment and the questions.
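
[Editor’s illustration, not part of the recorded remarks.] Mechanically, the assignment step itself can be very quick; the burden the speaker describes lies mostly in intake procedures and staff cooperation. The sketch below shows simple (coin-flip) random assignment with a recorded seed so that the assignment can be audited later. The function name, the identifiers, and the 50/50 split are assumptions for illustration and do not describe the procedure used in any study mentioned here.

```python
import random
from typing import Dict, Iterable, Optional


def assign_groups(applicant_ids: Iterable[str],
                  treatment_share: float = 0.5,
                  seed: Optional[int] = None) -> Dict[str, str]:
    """Assign each applicant independently to 'treatment' or 'control'.

    Each applicant has probability treatment_share of being assigned to the
    treatment group. A fixed seed makes the result reproducible for audit.
    """
    if not 0.0 < treatment_share < 1.0:
        raise ValueError("treatment_share must be strictly between 0 and 1")
    rng = random.Random(seed)
    return {
        applicant: "treatment" if rng.random() < treatment_share else "control"
        for applicant in applicant_ids
    }


if __name__ == "__main__":
    # Hypothetical applicant identifiers.
    print(assign_groups(["A001", "A002", "A003", "A004"], seed=2003))
```

Independent coin flips do not guarantee equal group sizes; a real study might prefer blocked assignment within sites, which is a design choice outside this sketch.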

I think it will make it harder because you have very driven systems but I’ve been in driven systems, the job training system is a performance driven system, it was during what was called the Job Training Partnership Act, standards were very key, by the way we found no relationship between those standards and impact, states that achieve high standards didn’t achieve high impacts, very interesting to learn, something you want to learn in education. So it is why at that beginning stage you want to really be testing the water very smartly about feasibility, don’t give up too soon, you want to go out there as if you’re going to do it that way but be willing to have some retreats that still preserve a useful study. When we did the national evaluation of the job training system, which sought to be a study of great external validity and didn’t turn out to be what it had started out being because it had trouble selling some of the sites on being in the program, we had a lot of iterative back and forth on what you could realistically study and what the experimental design had to be like. And there were some changes but it ended up as the only study still out there of note and with a lot of impact on funding, what Congress did and what happened with programs. So I don’t think it makes it impossible but it makes it harder.

DR. JUNKER: I think we have time for one more. Dave.

DR. COLAR(?): David Colar. One of the things, I come out of a psychology background, one of the things that we always teach our students is you like subjects to be uninformed about the purpose of the condition they’re in and that seems impossible in these areas. How do you deal with the fact that these big studies are often, they are reported in the media, the expectations are well known to everybody, and the participants in those studies know what’s up and they can either try to sabotage it or they can try to do things because they know the purpose of the study itself rather than the treatment, and that could be the cause of an effect or a non-effect that you get.

DR. GUERON: In some of the studies we’ve done participants don’t know they’re in it, but I don’t think that would be the case in anything related to education. And it relates to that issue also about randomly assigning schools versus individuals, because if you’re randomly assigning schools what you’re testing is the reform that’s going on in that school, that’s a little bit different. These issues and questions about Hawthorne effects, they’re real important questions, I guess they just, I haven’t, our experience hasn’t led me to believe these, these have not been the ones I stay awake at night worrying about, they’re just not. It’s not, people have a lot of reasons for their behavior and your study, you may think it’s the most important thing in the world but it just isn’t, so God knows in employment and training and welfare the urgencies of life take over in terms of what people are doing and what they can accomplish and the fact that you’re collecting their records over time is barely on their mind ever. In school the fact that you are going to find out eventually whether they graduate or not, well, so, there’s a lot of pressure on them to graduate or not, is your study more important than all those other pressures? I tend to doubt it, can’t imagine it’s too important, they’re not that, the students themselves, why are they invested in that particularly. The school reformers may be very invested but they are invested anyway in trying to implement a program that they hope will be, that they know you’re studying that particular site, well that’s something to keep an eye on. If you’re conducting a study of the XYZ reform in ten schools, are those ten schools getting extra help in doing it well? Those are the kinds of things, you can ask that question, you can measure it, you can put that in your model. You can try to watch it as it goes along. So I think it’s a good caution but I think some of that can, just doesn’t emerge as important as one thinks it might.

DR. JUNKER: We’re running behind a bit so I’m going to stop the discussion now but I want to thank the committee and the audience for a penetrating set of questions, and especially thank the two panelists for excellent presentations.

[Applause.]
